Deep Learning in Population Genetics


Kevin Korfmann (1), Oscar E. Gaggiotti (2), and Matteo Fumagalli (3,*)

(1) Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany
(2) Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK
(3) Department of Biological and Behavioural Sciences, Queen Mary University of London, UK

*Corresponding author: E-mail: m.fumagalli@qmul.ac.uk.

Accepted: 16 January 2023

© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning, and in particular deep learning, algorithms are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data, will provide further opportunities for technological advancements in the field.
Key words: population genetics, machine learning, artificial neural networks, simulations, balancing selection.

Significance
Deep learning, a powerful class of supervised machine learning, is emerging as a promising inferential framework in evolutionary genomics. In this review, we introduce all deep learning algorithms currently used in population genetic studies, highlighting their strengths, limitations, and empirical applications. We provide perspectives on their interpretability and usage in face of data uncertainty, whilst suggesting new directions and guidelines for making the field accessible and inclusive.

From Model-Based to Data-Driven Discipline

Population genetics arose in the early 20th century as a conceptual framework aimed at unifying two opposing views of evolution (Provine 2020). As such, it developed a rich body of theory that became a vast treasure trove of probabilistic models to develop sophisticated statistical methods when molecular data became available. This body of theory has continued to grow in complexity in order to consider more realistic evolutionary and genetic scenarios as well as more efficient computational algorithms. Therefore, the field of population genetics has been dominated by model-based statistical approaches. One could even say that many population geneticists would agree to the proposition of slightly modifying George E.P. Box's aphorism so as to say that in our field, all models are wrong but many are useful.

The preeminence of model-based statistical inference may explain the fact that our field has lagged behind other life-science disciplines in the adoption of machine learning methods and, in particular, deep learning approaches.


Clearly, the black-box nature of deep learning is an important obstacle to applications in the domain of population genetics, whose main objective is to uncover the genetic and evolutionary mechanisms responsible for the diversity of life on our planet. Another deterrent is the apparent difference in foci between the fields of statistics and machine learning. Statistics is focused on inference through the creation and fitting of a probabilistic model while machine learning is focused on prediction using general-purpose algorithms that capture patterns present in complex and large data sets (Bzdok et al. 2018). However, population geneticists are interested in both inference and prediction, as clearly illustrated by the general interest in making inferences about demographic history of species on the one hand and detecting signatures of natural selection or assigning individuals to populations on the other. Nevertheless, most genetic clustering methods and so-called genome scans of selection are based on probabilistic models, in some cases mechanistic [e.g. Bayescan (Foll and Gaggiotti 2008) and STRUCTURE (Pritchard et al. 2000)] and in others phenomenological [e.g. LFMM (Frichot et al. 2013) and DAPC (Jombart et al. 2010)].

The focus on model-based statistical inference in population genetics has been challenged by the massive data sets generated by next-generation sequencing technologies (Levy and Myers 2016). This is particularly the case for maximum-likelihood and Bayesian methods, which are implemented using expensive computational methods such as Markov Chain Monte Carlo and Expectation-Maximization. In principle, the computational cost of calculating the likelihood function of very complex models can be overcome using Approximate Bayesian Computation (ABC), which relies on the use of summary statistics to capture the information present in raw population genetic data (Bertorelle et al. 2010). In ABC, the posterior distribution of the parameter(s) to be estimated is approximated without the calculation of a likelihood function. Instead, a model fit is obtained by the collection of simulated summary statistics matching the observed values (Beaumont et al. 2002). ABC has been widely and successfully used for population genetic inferences (Lopes and Beaumont 2010). However, capturing enough information requires large numbers of summary statistics, which leads to a “curse of dimensionality” because, as the number of summary statistics increases, the error in the approximation increases (Prangle 2015). This problem has led to an increasing interest in machine learning approaches (Schrider and Kern 2018). The underlying rationale here is that analysing genomic data with machine learning methods can uncover signatures of evolutionary and genetic processes in a model-agnostic way and in doing so teach us something new about nature (Schrider and Kern 2018). But a major motivation for the shift is the practical reality that population genetics has been transitioning from a theory-driven discipline into a data-driven field with vast amounts of genomes and metadata at hand in the past few years. For instance, in human population genetics, scientists have access to high-quality whole-genome sequencing data from more than 150,000 individuals from the UK Biobank (Halldorsson et al. 2022) and more than 3,000 individuals distributed world-wide (Byrska-Bishop et al. 2022), or to genome-wide data from hundreds of ancient samples (https://reich.hms.harvard.edu/datasets).

In this review, we will focus on a particular subset of supervised machine learning algorithms, namely deep neural networks. Although such methods can be considered as the epitome of a black box, we will argue that new advances in this field are providing the tools we need to uncover the mechanisms underlying the complex patterns present in population genomic data. Moreover, deep learning can be implemented to analyse raw genetic data as well as summary statistics. Additionally, it has been used to carry out statistical inference about the demographic history of populations as well as to carry out selection scans and assign individuals to geographic locations. Applications to demographic history inference embrace the model-based tradition of population genetics in that the training set (see Glossary) is usually generated through simulations of specific evolutionary scenarios. Applications to genome scan methods, on the other hand, rely on new techniques for evaluating the importance of features, in this case loci, in predicting an outcome such as a phenotype or an environmental factor that may exert a selective pressure.

We will first provide a definition of supervised machine learning and its applications in population genetics. We will then focus our attention on various deep learning algorithms currently used in the field, with a discussion on efforts to “open the black box” of said algorithms. We will finally discuss ongoing challenges of deep learning applications in population genetics, and highlight future research directions.

Machine Learning in Population Genetics

Machine learning, a subset of artificial intelligence, refers to a class of operations using data to perform inferential tasks without explicit mathematical models. To do so, machine learning algorithms identify informative patterns which can then be used to predict unknown outcomes. Typically, the performance of machine learning algorithms increases with the amount of available data. Machine learning comprises both supervised and unsupervised algorithms. Unsupervised machine learning aims at finding patterns and clusters within the data, and does not have a notion of prediction. On the other hand, supervised machine learning algorithms automatically tune their internal parameters to maximize the prediction accuracy and, as such, require a known data set (called training set) to learn the relationship between input and output.


Glossary
• Accuracy: proportion of correct predictions made by a model
• Activation function: operation that each neuron performs
• Attribute: name of a variable describing an observation
• Bias term: a term attached to neurons allowing the model to represent patterns that do not pass through the origin
• Backpropagation: gradient descent-based learning algorithm for calculating derivatives through the network starting from the last layer
• Confusion matrix: table that summarizes the prediction performance by providing false and true positive/negative rates
• Embedding: learned low-dimensional continuous vector representation of a concept (e.g. a word, sentence, genotype matrix or graph)
• Epoch: one complete pass of the learning algorithm through the entire training data set
• Feature: input variable used in making predictions
• Hyperparameters: higher-level properties of a model controlling the training process (e.g. learning rate, number of epochs) that need to be tuned, in principle before the ML model is trained
• Instance: a data point or sample in a data set (observation)
• Learning rate: magnitude at which an algorithm updates its parameters
• Loss: (also called cost) measurement of the distance between predictions and ground truth; the loss function is minimized during training
• Normalization: scaling technique used when input features have different ranges
• Regularization: an additional penalty added to the loss function for better generalization
• Testing set: portion of the data set that is not used for training, but rather to evaluate the performance of a neural network
• Training set: portion of the data set that is used to optimize the parameters of a neural network
• Tuning or hyperparameter optimization: process of finding the hyperparameter values that maximize the performance of the model
• Validation set: portion of the data set that is used for monitoring the training of a neural network

To train a supervised machine learning algorithm, the available data sets are typically divided into training, validation, and testing sets, with the latter two sets used to evaluate the performance during and after training. In supervised learning, a labeled data set (which explicitly relates any given input to a specific output) is given to the algorithm. The loss (the distance between the predicted and true value) is calculated, and at the next iteration the internal parameters are updated towards decreasing loss (and increasing accuracy). Training a supervised machine learning algorithm is a fine balance between prediction accuracy over the training set and generalization performance over the testing set.
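As a concrete sketch of such a split (assuming simulated inputs X and labels y held in NumPy arrays; all names and proportions here are illustrative, not prescriptive):

import numpy as np

rng = np.random.default_rng(seed=42)
n = 10_000
X = rng.normal(size=(n, 50))          # e.g. 50 summary statistics per simulation
y = rng.integers(0, 2, size=n)        # e.g. neutral (0) vs selection (1) labels

idx = rng.permutation(n)              # shuffle before splitting
train, val, test = np.split(idx, [int(0.8 * n), int(0.9 * n)])

X_train, y_train = X[train], y[train]   # used to optimize network weights
X_val, y_val = X[val], y[val]           # used to monitor training (e.g. early stopping)
X_test, y_test = X[test], y[test]       # held out for the final performance estimate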
Machine learning has a rich history in biological sciences and genomics (reviewed in Yue and Wang 2018; Zou et al. 2019; Greener et al. 2022). Additionally, supervised machine learning methods have been designed and deployed to perform population genetic tasks such as variant calling (Poplin et al. 2018) and the prediction, characterization, and localization of signatures of natural selection (Pavlidis et al. 2010; Lin et al. 2011; Ronen et al. 2013; Pybus et al. 2015; Schrider and Kern 2016; Sugden et al. 2018; Mughal and DeGiorgio 2019; Koropoulis et al. 2020). An important difference between the variant calling application (which only uses observed data) and those aimed at detecting selection is that the latter implement an innovation first introduced by Pavlidis et al. (2010) whereby the ML algorithms are trained using synthetic data sets generated via simulations. These applications, therefore, can be considered as being part of likelihood-free simulation-based approaches (Cranmer et al. 2020), which are commonly employed in population genetics. Currently, most population genetics applications of ML use this strategy but, as we describe below, some recent applications only use observed data to train the algorithms. These applications, however, require the combination of genotypic data with phenotypic, environmental or geographic coordinate data.

As already stated, in this review we will focus on deep learning, a class of machine learning algorithms based on artificial neural networks (ANNs) comprising nodes in multiple layers connecting features (input) and responses (output) (LeCun et al. 2015). Weights between nodes are optimized during the training to minimize the distances between predictions and ground truth. After training, an ANN can predict the response given any arbitrary new input data.


Unlike approaches that use a predefined set of summary statistics as input, deep learning algorithms can effectively learn which features are sufficient for the prediction (LeCun et al. 2015). This is an important aspect, as summary statistics are meaningful but human-constructed features. When dealing with different sources of raw data, the design of such features has been a major part of information engineering. A key finding of deep learning was that such features emerged within a well-trained deep network: they are effectively suggested and discovered by a network during training (Krizhevsky et al. 2012). This finding has been repeated in different domains: features can be automatically discovered, and new suggestions made, by the approaches of deep learning. Nodes in an ANN can be arranged in various numbers and layers, making this method as flexible and “deep” as needed.

Deep learning in population genetics is in its infancy, and most current applications rely on synthetic data sets for training. Nevertheless, deep learning represents a notable advance over commonly used simulation-based techniques for several reasons. First, deep learning algorithms have the capacity to handle any feature extracted from a data set as input and are less sensitive to poorly crafted summary statistics than ABC (Csilléry et al. 2010). Second, neural networks are universal approximators of any complex function, provided that they include a sufficiently large number of “neurons,” non-linear units (Hornik et al. 1989). Nevertheless, careful monitoring of networks' training and a posteriori diagnostic analyses are required to ensure that predictions are robust.

Whilst overviews of machine learning applications for population and molecular genetics are provided elsewhere (Schrider and Kern 2018; Fountain-Jones et al. 2021; Kumar et al. 2022), here we aim at providing an update on the latest advances in deep learning algorithms and how they have been exploited to address questions in population genetics. Additionally, we focus our attention on deep neural networks, in all their supervised forms, rather than including other commonly used algorithms such as support vector machines (Pavlidis et al. 2010), random forests (Schrider and Kern 2016; Vizzari et al. 2020), gradient forests (Laruson et al. 2022), and hierarchical boosting (Pybus et al. 2015). Finally, we restrict our review to applications in population genomics while acknowledging that similar algorithms herein described are used in other related disciplines like genomics (Yue and Wang 2018), phylogenetics (Suvorov et al. 2020; Azouri et al. 2021; Blischak et al. 2021), phylogeography (Fonseca et al. 2021; Perez et al. 2022), and epidemiology (Voznica et al. 2021).

Deep Learning Algorithms

We now introduce, describe and discuss four common families of architectures for deep learning algorithms used in population genetics: fully connected neural networks, convolutional neural networks, recurrent neural networks, and generative models. For each type of algorithm, we illustrate their main applications in the field and the novel findings generated by their deployments. Note that these general algorithms have a long history spanning many decades and numerous original contributions which we cannot properly credit in our review because of space. Thus, we refer readers interested in historical developments to previous publications (Schmidhuber 2014).

Fully Connected Neural Networks

Fully connected neural networks (FCNNs) are suitable for generic prediction problems when there are no special relations among the input data features. They can be viewed as a generalization of linear regression. In fact, standard regression is nested in the general neural network framework in the sense that a linear regression fits a hyperplane to the data, while a neural network fits a space of hyperplanes in a transformed space (Qin et al. 2022). This becomes clear by comparing the formulation for the simplest multivariate linear regression model with the equation representing the operations taking place in a single node of a hidden layer of an FCNN:

linear regression: y(x, w) = b + Σ_{i=1}^{I} w_i x_i
FCNN: s(x, w) = f(b + Σ_{i=1}^{I} w_i x_i),

where b is the bias (not to be confounded with statistical bias), w = {w_i} is a vector of weights, x = {x_i} is a vector of input features (explanatory variables), and f is a nonlinear activation function. In an FCNN with a single hidden layer, there will be a number J of hidden nodes, each carrying out a similar operation using a different vector of weights, all of which can be represented by a matrix W = {w_ij}, i = 1, 2, ..., I, j = 1, 2, ..., J. A very simple example of an FCNN with one hidden layer and only two nodes is presented in figure 1.


FIG. 1.—A simple FCNN consisting of a single hidden layer with only two nodes. f and g represent different activation functions used respectively in the hidden layer and the output layer, and h and o superscripts are used to identify parameters associated with these layers; all other parameters are defined in the text.

In the linear regression case a dependent variable is computed by calculating the dot-product of a set of input data points with a set of parameters. This output variable is then used in the context of a maximum-likelihood or least-squares approach to optimize the set of learnable parameters. FCNNs extend this idea by computing a matrix-product of the weight matrix with the input data points, which is then transformed with a non-linear activation function. The activation function is applied element-wise and the result is called an embedding. Instead of using the maximum-likelihood or least-squares approaches for optimization, FCNNs are optimized using the multivariate version of the gradient-descent algorithm, which iteratively adapts the parameters across the network layers [back-propagation algorithm (Linnainmaa 1976; LeCun et al. 1989)] based on a task-specific loss function and learning rate. A fundamental property of FCNNs is expressed by the Universal Approximation Theorem, which states that a neural network with a single hidden layer can approximate any continuous function to any desired precision. Precision can be increased by increasing the number of hidden neurons or the number of hidden layers. It is this property that enables the use of neural networks as a viable alternative to common model-based statistical methods.
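The single-hidden-layer FCNN of figure 1 can be sketched in a few lines of Keras (one of the Python packages discussed in the Software section below); the dimensions and toy data here are our own illustrative choices, not a recommended design:

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(1)
I, J = 50, 2                                   # I input features, J hidden nodes (as in fig. 1)
X = rng.normal(size=(1000, I)).astype("float32")
y = (X @ rng.normal(size=(I, 1))).astype("float32")   # a toy continuous response

model = tf.keras.Sequential([
    # each hidden node computes s(x, w) = f(b + sum_i w_i * x_i); here f is ReLU
    tf.keras.layers.Dense(J, activation="relu", input_shape=(I,)),
    # the output node applies its own activation g (identity for regression)
    tf.keras.layers.Dense(1, activation="linear"),
])
# weights and biases are fitted by gradient descent with back-propagation,
# minimizing a task-specific loss (mean squared error) at a given learning rate
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

Increasing J, or stacking further Dense layers, increases the capacity of the approximation, in line with the Universal Approximation Theorem described above.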
In an early application of deep learning methods to population genetics, FCNNs were used to simultaneously infer natural selection and population bottlenecks (Sheehan and Song 2016). This approach was inspired by ABC methods and therefore used summary statistics to extract the information present in the raw data, which was then fed to a fixed-size linear input layer of the network. To discriminate between demographic and natural selection effects, Sheehan and Song trained the FCNN using simulated data sets generated under various models assuming different bottleneck times and selection models (Sheehan and Song 2016). The software evoNet, which implemented said FCNN, was applied to almost 200 genomes of Drosophila melanogaster from Africa to jointly infer the demographic history and loci under selection. One interesting analysis in the study is the evaluation of the most informative summary statistics, either by permutation or perturbation. Notably, summary statistics derived from the site frequency spectrum, linkage disequilibrium (LD), number and location of single-nucleotide polymorphisms (SNPs), and identity-by-state tracts are among the most important features for the inference of population size changes and type of selection.

Another example of an FCNN application in population genetics that uses simulated data to train the algorithm is provided by the work of Burger and colleagues on the estimation of mutation rates (Burger et al. 2022). They show that a simple neural network is able to recapitulate estimators of mutation rate for intermediate recombination rates. As a novel methodological advance, their implementation features an adaptive reweighting of the loss function based on model-based estimators of the mutation rate. By doing so, with a sufficient and appropriate training set, only a single hidden layer is required to achieve the same performance as model-based estimators. The method was able to recover variation in mutation rates from synthetic human population genetic data under a realistic recombination map.

There are also recent population genetics applications of FCNNs that implement the standard approach of training algorithms using observed instead of simulated data. A good example is Locator, which assigns individual genotypes to their geographic origin (Battey et al. 2020). Interestingly, this method implements a regression approach that is capable of assigning correlated genetic samples to similar geographic space. Uncertainty in the estimates due to drift is taken into account by running predictions in windows across the genome. Simulations indicate that Locator has an accuracy comparable to that of other state-of-the-art competing algorithms but with shorter run-times. Its application to empirical population genetic data sets of Anopheles mosquitoes, Plasmodium falciparum, and human populations provides results that are in general concordant with current knowledge.

Another example that only uses observed data to train the FCNN is DeepGenomeScan (Qin et al. 2022). However, this method's objective departs from the prevalent use of neural networks, that is prediction and pattern recognition. Its aim is to develop a statistical framework to carry out genome scans or GWAS, much in the same way that PCA and redundancy analysis have been used to develop equivalent approaches (Luu et al. 2017; Capblancq et al. 2018). Specifically, DeepGenomeScan implements an FCNN that uses genotypes to predict individuals' traits (e.g. geographic coordinates or phenotype), and constructs a feature importance measure based on the weights of the trained network. Furthermore, P-values for variable importance are obtained through bootstrapping of the input. As opposed to other methods that can only detect linear associations, DeepGenomeScan is able to detect non-linear ones thanks to the non-linear approximation property of FCNNs. Its application to a genomic data set of human samples of European ancestry identified novel targets of natural selection which showed significant geographic variation.


Finally, we note that FCNNs have also been used in the context of ABC frameworks. Early studies used neural networks to construct the posterior distribution of parameters from the collection of accepted values (Blum and François 2010), as implemented in the abc package (Csilléry et al. 2012). More recently, Mondal and colleagues coupled an ABC framework, using the site frequency spectrum (SFS) as summary statistic, with a four-layer FCNN to infer the demographic history of human Eurasian populations (Mondal et al. 2019). Their implementation includes an ad hoc noise injection algorithm to partly take into account any bias associated with a simulated training set. A similar study by Villanea and Schraiber used the joint SFS between Europeans and Neanderthal genomes to fit a demographic model using a 3-layer FCNN (Villanea and Schraiber 2019). Both studies inferred multiple gene flow events between archaic and anatomically modern humans.

Summary statistics and genotype matrices are not the only way in which population genomic data can be described and used as input to deep learning algorithms. It is also possible to represent samples of sequences as images and, in the next section, we discuss an architecture that is being increasingly applied to such data.

Convolutional Neural Networks

Convolutional neural networks (CNNs) are specifically designed to analyse data that has a grid-like structure, such as images (LeCun et al. 2004; Krizhevsky et al. 2012). Whilst in theory FCNNs could be used to make predictions from images, the number of features (i.e. pixels) they contain would require networks with a very large number of parameters, which would render them very slow and computationally expensive. Similarly to FCNNs, CNNs are comprised of a set of learnable parameters (LeCun et al. 1989; LeCun and Bengio 1995). However, as opposed to FCNNs, in which hidden layers are all of the same type (layers of neurons carrying out similar operations), CNN architectures consist of consecutive sets of convolutional and pooling layers, followed by a fully connected set of layers (similar to an FCNN, fig. 2). The first convolutional layer takes the input image and carries out a convolution using a kernel (also known as filter; a matrix of learnable parameters) to generate a feature map that is then fed to the pooling layer. This layer uses a filter to reduce the size of the feature map and to help dissociate a particular feature from its position in the input image. This first set of operations will capture coarse-grained features; adding additional convolutional and pooling layers helps capture more fine-grained features (O'Shea and Nash 2015). The final step of the convolutional layers (flatten step) converts the feature map into a vector that is fed to the fully connected layers that will carry out the image classification step. The number of kernels, their dimensions, and initialization are all hyperparameters of the model.
signed to analyse data that has a grid-like structure, such sample size, diploS/HIC appears to be robust to model
as images (LeCun et al. 2004; Krizhevsky et al. 2012). misspecification as it retains accuracy when predictions
Whilst in theory FCNNs could be used to make predictions for a population growth demography were obtained from
from images, the number of features (i.e. pixels) they con­ CNNs trained on constant size population simulations. As
tain would require networks with a very large number of an application of diploS/HIC, the authors replicated pre­
parameters, which would render them very slow and com­ vious findings of selective sweep in the Anopheles gambiae
putationally expensive. Similarly to FCNNs, CNNs are com­ genome. A later extension of this method led to
prised of a set of learnable parameters (LeCun et al. partialS/HIC which uses CNNs on a larger feature vec­
1989; LeCun and Bengio 1995). However, as opposed to tor of summary statistics for a finer classification of selective
FCNNs, in which hidden layers are all of the same type events, including partial sweeps and linked selection (Xue
(layers of neurons carrying out similar operations), CNNs et al. 2020). Finally, an additional application of CNNs
architecture consists of consecutive sets of convolutional based on summary statistics to test against different modes
and pooling layers, followed by a fully connected set of of selective sweeps has been recently proposed (Caldas
layers (similar to an FCNN, fig. 2). The first convolutional et al. 2022). This study uses varying window sizes to accom­
layer takes the input image and carries out a convolution modate the calculation of summary statistics at different
using a kernel (also known as filter; a matrix of learnable genomic extents within the target loci. They also intro­
parameters) to generate a feature map that is then fed to duced a hybrid simulation strategy to pair the flexibility of
the pooling layer. This layer uses a filter to reduce the size forward-in-time simulations with the efficiency of coales­
of the feature map and to help dissociate a particular fea­ cent ones.
ture from its position in the input image. This first set of op­ An approach that fully exploits the potential of CNNs is
erations will capture coarse grained features; adding to replace summary statistics as input with full information
additional convolutional and pooling layers helps capture on sequence alignments, with convolutional layers auto­
more fine-grained features (O’Shea and Nash 2015). The fi­ matically extracting informative features. Input data can
nal step of the convolutional layers (flatten step) converts consist of either genotype or haplotype sequences. In the
the feature map into a vector that is fed to the fully con­ simplest form, input data are a binary matrix, with rows
nected layers that will carry out the image classification and columns corresponding to individuals and alleles at


In the simplest form, input data are a binary matrix, with rows and columns corresponding to individuals and alleles at each SNP, respectively. Under this representation, and in opposition to the structured nature of “classic” images, the ordering of individuals (i.e. random samples from a population) in an unstructured population is arbitrary and carries no information (Chan et al. 2018); i.e. genetic data are exchangeable. However, standard CNNs rely on spatial information and, therefore, the ordering of the data can affect their accuracy. To avoid this problem, individuals need to be sorted in a “biologically meaningful” way. For example, Flagel and collaborators sort chromosomes by genetic similarity (Flagel et al. 2018). Additionally, they represent the information on genomic positions of SNPs as a separate branch in the architecture. Interestingly, the inclusion of monomorphic sites in windows of fixed length seems to yield good accuracy for predicting natural selection, as shown in a separate study (Nguembang Fadja et al. 2021). Notably, several applications of the proposed method are illustrated, with CNNs achieving equal if not better performance than state-of-the-art methods to detect gene flow and selective sweeps, estimate recombination rates, and infer demographic parameters (Flagel et al. 2018). Therefore, these findings demonstrated the capability of CNNs to infer population genetic parameters, even in cases where a theoretical framework is not available.

FIG. 2.—A simple CNN illustration consisting of the input matrix (i.e. genotype matrix), a user-specified number of kernels (or filters) and the resulting feature maps, followed by an FCNN.

To address the exchangeability issue, Chan et al. (2018) proposed an exchangeable neural network. This architecture consists of convolutional layers with 1-dimensional kernels and a subsequent permutation-invariant function to allow the network to be insensitive to the order of individuals. Although they employed the mean operation as permutation-invariant function, other functions are possible, including a fully connected layer. Another important contribution of this study is the adoption of a “simulation-on-the-fly” approach: training data are continuously generated by simulations so that the network never sees the same data twice, thereby reducing overfitting. This is a valuable consideration since, when reliable simulators are available (as in the case of population genetics), we have access to theoretically infinite training data, constrained by computing time only. The implemented software defiNETti was applied to illustrate the accuracy of exchangeable neural networks to predict recombination hotspots in human data.
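In the spirit of this design, a minimal sketch of an exchangeable architecture (layer sizes and input dimensions are our own illustrative choices, not those of defiNETti): the same 1-D convolution is applied to every individual, and averaging across individuals makes the output invariant to their order:

import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(100, 200))               # individuals x SNPs
x = tf.keras.layers.Reshape((100, 200, 1))(inputs)
# the same 1-D convolution is applied to each individual separately
x = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"))(x)
x = tf.keras.layers.TimeDistributed(tf.keras.layers.GlobalMaxPooling1D())(x)
# averaging over the individual axis is the permutation-invariant step
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)    # e.g. hotspot vs background
model = tf.keras.Model(inputs, outputs)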
Further solutions to tackle the issue of exchangeable genetic data have been explored by Torada et al. (2019) in the software ImaGene. Specifically, the authors showed how ordering haplotypes and SNPs by frequency leads to accurate predictions of positive selection. Whilst sorting SNPs implies a loss of information on LD patterns, this approach makes training faster with minimal decay in accuracy, as the number of learnable parameters is drastically reduced when the final fully connected layer is not required. However, double-sorting makes the method less appropriate as a general-purpose methodology. Additionally, by training and testing ImaGene with simulations conditioned on different demographic models, the authors quantified the drop in accuracy when CNNs are affected by model misspecification during training. Finally, a multiclass classification approach was proposed as an alternative method to approximate the posterior distribution of the selection coefficient, a continuous parameter typically hard to estimate.
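The frequency-based double sorting described above can be sketched as follows (the function name and matrix dimensions are ours; this is not the ImaGene implementation):

import numpy as np

def sort_by_frequency(G):
    """Sort a (haplotypes x SNPs) 0/1 matrix by row and column allele counts."""
    G = G[np.argsort(G.sum(axis=1))[::-1], :]   # haplotypes carrying most derived alleles first
    G = G[:, np.argsort(G.sum(axis=0))[::-1]]   # highest-frequency SNPs first
    return G

G = (np.random.random((40, 60)) < 0.3).astype(int)   # toy binary genotype matrix
G_sorted = sort_by_frequency(G)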


In another landmark study, Sanchez et al. (2021) provide a comprehensive framework for building deep neural networks taking into account several nuances of the input data, such as the variable number of SNPs, their correlation, and the exchangeability of individuals. These challenges were tackled by proposing an architecture, called SPIDNA (Sequence Position Informed Deep Neural Architecture), which consists of stacks of multiple blocks of convolutional, pooling, and fully connected layers. In addition to deploying their method to reconstruct changes in effective population size of cattle breed populations, the authors compared the accuracy of several deep neural networks against ABC, including hybrid approaches. Notably, results suggest that integrating deep learning with ABC marginally improves performance, and possibly explainability. Further investigations from the same authors demonstrated a more prominent increase in performance using deep neural networks (Sanchez 2022). These studies depart from previous attempts to adapt existing architectures, and instead suggest building novel architectures tailored to the specifics of population genetic data.

In a later study, Gower et al. (2021) aimed to identify signatures of adaptive archaic introgression in the human genome without relying on statistics that capture the frequency of putatively introgressed haplotypes. The authors developed a deep learning method based on CNNs, genomatnn, to jointly infer archaic admixture and positive selection. genomatnn is trained from a matrix consisting of concatenated genotype alignments encompassing donor (archaic humans) and recipient (modern humans) populations. Matrix entries represent counts of minor alleles in an individual haplotype within a given genomic window. Thus, this approach is applicable to low-quality sequencing data, where genotype calling can be bypassed by the statistical estimation of allele frequencies (Kim et al. 2011). Additionally, the authors proposed a framework to visually inspect the input features that are most informative for the prediction by means of saliency maps (Simonyan et al. 2013). Intriguingly, the latter indicated that the network focuses most of its attention on Neanderthal and European haplotypes when exposed to data from an adaptive introgression, in line with the expected pairing of donor and recipient populations.
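A minimal sketch of such a gradient-based saliency map for any trained Keras classifier (the function name is ours): the gradient of the predicted class score with respect to the input highlights which entries of the genotype matrix drive the prediction.

import tensorflow as tf

def saliency_map(model, x, class_index=0):
    # `x` is one input matrix with a leading batch axis; the model is assumed
    # to output one score per class (batch, classes)
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x, training=False)[:, class_index]
    return tf.abs(tape.gradient(score, x))   # large values = influential inputs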
DeepSweep is another application of CNNs to detect selective sweeps from “haplotypic” images, as defined by the authors (Deelder et al. 2021). This method selects the longest common haplotype among neighboring SNPs, and sorts all remaining haplotypes based on their distance to it. This sorted alignment of haplotype differences is then fed into a series of convolutional layers. The aim of the original study was to detect signatures of positive selection in malaria parasites, namely Plasmodium falciparum and Plasmodium vivax. Interestingly, the algorithm was trained using real data from regions covering SNPs previously associated with drug resistance, and the validation was performed using a leave-one-out approach. Possibly as a result of both the data processing and training strategies, when deployed on whole-genome data, DeepSweep predicted selection targets to be known drug-resistance genes, largely overlapping with predictions using haplotype-based summary statistics. One advantage of this training strategy is that it enables an assessment of which data points are informative during training.

A comparison between the performance of FCNNs and CNNs to detect natural selection, specifically balancing selection, is presented by Isildak et al. (2021) in the software BaSe. Although both architectures exhibit high classification accuracy to distinguish between neutrality and selection, the CNN outperformed the FCNN in predicting the type of balancing selection, a task that proved too challenging when relying solely on summary statistics as input. The authors used forward-in-time simulations and conditioned the target variants to a predefined range of final allele frequencies. To counterbalance the increased computational time associated with this simulation scheme, a data augmentation step to artificially enlarge the training data was adopted.

In recent years, the generation of sequencing data from ancient or historical samples, as well as from capture-recapture and evolve-and-resequence experiments, has allowed for a direct observation of how genetic diversity and allele frequencies change under natural or controlled conditions over time. To detect positive selection with time-series data, Whitehouse and Schrider (2022) proposed to stack either allele frequency or haplotype data over sampling times to be fed as input to one-dimensional CNNs. Their method was implemented in the software Timesweeper, and evaluated under various sampling conditions. Results show overall good accuracy levels for predicting selection, localizing the target variant, and distinguishing between selection from de novo mutation and from standing variation. Interestingly, using haplotype instead of allele frequency data yields a lower performance, possibly due to the difficulty in properly sorting the input data in a biologically meaningful way. Timesweeper was deployed to time-series pooled-sequencing data from Drosophila simulans, and it was able to replicate previously detected sweep signatures with better resolution.

CNNs have quickly become the main deep learning algorithm in population genetic studies thanks to their ability to automatically extract important features from raw genotype data, and their flexibility in accommodating different models to be tested. As a result, novel applications of such algorithms in population genetics are frequently proposed and introduced (Smith et al. 2022). In machine learning, natural language processing (NLP) represents a branch of algorithms that aims at “understanding” words in a text, meaning that they can, for instance, perform speech recognition, text generation, or sentiment analysis (i.e. associating an output label to each word or sentence). As DNA sequences are easily representable as a series of letters or motifs, in the next section, we will introduce NLP applications that are emerging in population genetics.

Recurrent Neural Networks


Recurrent neural networks (RNNs) are algorithms derived from FCNNs but designed specifically for sequential data, as they introduce a mechanism that influences current predictions based on previous outcomes (Minsky 1967; Rumelhart and McClelland 1987; Elman 1990). In fact, RNNs are comprised of connected nodes that form a cycle, with the output of some nodes feeding back to other (or the same) nodes. Simple RNNs can therefore be considered as for-loops iterating along the sequential data, where at each position the current input and the previous output are combined to form the next output (or hidden state). Multiple RNN layers can be stacked on top of each other to increase the capacity of the network and extract more features from the data. One limitation of RNNs is their limited capacity to learn long-range dependencies. Architectures such as Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber 1997) and Gated Recurrent Units (GRUs; Cho et al. 2014) circumvent this problem: LSTMs add the concept of a cell state that is propagated along the sequence, whereas GRUs filter the passing of long-range information through a gating mechanism alone, whilst maintaining a performance similar to LSTMs.
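A minimal Keras sketch of stacked recurrent layers over positions along the genome; the input sizes (500 positions, each carrying a small feature vector) and the binary output are illustrative assumptions:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(500, 16)),          # positions x features per position
    # the LSTM carries a cell state along the sequence, allowing earlier
    # positions to influence predictions at later ones
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32),                        # stacked recurrent layers
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. sweep vs neutral
])
model.compile(optimizer="adam", loss="binary_crossentropy")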
Recurrent layers have been used by Adrion et al. (2020) to estimate recombination maps for D. melanogaster. The proposed software ReLERNN provides a comprehensive modular workflow on how to generalize the method for different model species of interest, including instructions for phased, unphased and pooled-sequencing data. However, caution should be taken when estimating recombination rates from genotype alignments using machine learning under certain conditions of low variability (Johnson and Wilke 2022). Hejase et al. (2021) proposed a method to detect natural selection by extracting features from estimated genealogical trees. They used counts of remaining lineages along a discrete log-transformation of the time dimension. The sequential nature of the trees along the sequence was used to set up an LSTM, which recognizes the lack of remaining lineages, that is zeros in the distant past or upper part of the feature matrix. This approach, implemented in the software SIA, gains the possibility to obtain an easily interpretable model at the cost of using an ancestral recombination graph (ARG)-inference method such as Relate (Speidel et al. 2019).

Inspired by the sequentially Markovian coalescent (SMC) methodology, Khomutov et al. (2021) proposed an RNN method to estimate times to the most recent common ancestor (TMRCA) from simulated data. Interestingly, this method achieved good results after coupling it with a CNN. Their approach is set up as a coalescent event classification strategy, thus creating a probability distribution of the TMRCA at any given sequence position. Finally, neural net compression algorithms have been developed (Wang et al. 2018; Silva et al. 2020) making use of recurrent layers to emphasize long-range inter-dependencies, together with convolutional layers. These approaches appear useful as the cost of sequencing dramatically decreases and becomes increasingly negligible compared with storage costs.

RNNs, in all their forms, have become increasingly popular in population genetics thanks to their ability to incorporate sequential data. Whilst training recurrent layers tends to be more challenging, coupling them with convolutional layers appears to be a suitable solution to overcome this issue whilst incorporating novel information. In the next section, we will explore how CNNs can be embedded in a more general family of machine learning algorithms called generative models.

Generative Models

Generative models aim at capturing, and therefore approximating, the probability distribution between data and labels. By their nature, generative models are able to “generate” novel data points according to the captured probability distribution. Fitting a Gaussian mixture model and sampling from the distribution can be interpreted as a generative process, although it is insufficient to capture complex phenomena in high-dimensional spaces. In fact, even if sampling procedures can yield impressive results, for instance for ARG inference (Mahmoudi et al. 2022), they often remain model-based, and are fundamentally limited by their run-time. For these reasons, deep generative models have become a subject of increased attention, especially for their capability of generating new samples even if the true underlying distribution is unknown. The following section focuses on three of the most popular non-model-based and high-parameter generative methods that have been explored in population genetics: autoencoders (Rumelhart and McClelland 1987), variational autoencoders (Kingma and Welling 2014), and generative adversarial networks (Goodfellow et al. 2014).

Autoencoders and Variational Autoencoders

Similar to Principal Component Analysis (PCA), autoencoders aim to solve a compression problem by step-wise reducing the input parameters into a smaller set of hidden parameters, analogues of the principal components. The number of hidden parameters, known as the latent space, is dependent on the network architecture. In a simple form, compression is achieved by an FCNN, called the encoder, with a decreasing number of learnable parameters in each layer. A second expanding network, called the decoder, rebuilds the original data from said latent space by minimizing a suitable loss function. An important part of autoencoders is the regularization step, usually introduced as part of the loss function, which is necessary for learning a meaningful latent space by avoiding memorization.
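A minimal sketch of such an encoder/decoder pair in Keras, compressing a flattened genotype vector into a two-dimensional latent space (sizes are illustrative; in practice explicit regularization, e.g. weight penalties, would be added to the loss):

import tensorflow as tf

n_snps, latent_dim = 1000, 2
encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_snps,)),
    tf.keras.layers.Dense(256, activation="relu"),       # step-wise reduction
    tf.keras.layers.Dense(latent_dim),                   # latent space
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(256, activation="relu"),       # step-wise expansion
    tf.keras.layers.Dense(n_snps, activation="sigmoid"), # reconstructed genotypes
])
autoencoder = tf.keras.Sequential([encoder, decoder])
# trained to minimize the reconstruction loss between input and output
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")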
Variational autoencoders (VAEs) differ from autoencoders as they introduce a generative operation by compressing the data into a latent space distribution, instead of a point representation.


Furthermore, the latent space directly offers the possibility to probe the network for any kind of structure in the input data, which the encoder has been forced to compress, by plotting the low-dimensional latent variables against each other. Thanks to the non-linearity of neural networks, VAEs outperform classic methods, that is PCA, for visual data representation (Battey et al. 2021).

VAEs have been implemented by Battey et al. (2021) in the software popvae. By applying it to genomic data sets, they recovered geographic similarities among human populations, and tested for robustness in the presence of genomic inversions in Anopheles mosquitoes. Additionally, low values of population genetic differentiation, as measured by FST (Holsinger and Weir 2009), are more likely to be detected by VAEs. Lastly, whilst the generative property of VAEs has difficulties in detecting more complex relations, like long-range LD signatures, it can produce data with similar SFS patterns.

Other authors proposed a different VAE, named HaploNet (Meisner and Albrechtsen 2022), to infer population structure and ancestry proportions. HaploNet was shown to be able to infer parameters from very large genomic data sets, such as the UK Biobank and the 1000 Genomes Project. Likewise, others have proposed a multi-headed autoencoder, called Neural ADMIXTURE (Mantes et al. 2021), which was evaluated on the Simons Genome Diversity Project and the Human Genome Diversity Project, achieving similar results. Finally, López-Cortés et al. (2020) combined an autoencoder with common clustering methods, such as hierarchical clustering and K-means. They sought to assign maize lines into subpopulations, and achieved marginally better results than by using a Bayesian clustering method.

Generative Adversarial Networks

Generative adversarial networks (GANs) provide a framework capable of estimating high-dimensional probability distributions by solving a min–max optimization problem between two opposing networks (Goodfellow et al. 2014). The aim of this architecture is thus to approximate the underlying data generation process (i.e. evolutionary process) of a study object of interest (i.e. genotype matrix). The model is then capable of sampling new instances of the study object.

The first part of the architecture, called the generator network, only has access to a random distribution as a prior for constructing the target object, whereas the second network, called the discriminator, has access to a real object (i.e. genotype matrix) and the generated object. The loss function of GANs illustrates the objectives of both networks:

L = E_x[log D(x)] + E_z[log(1 − D(G(z)))].

The first part, E_x[log D(x)], represents the expected value of real samples x being classified correctly by the discriminator (D(x)), and the second part, E_z[log(1 − D(G(z)))], stands for the expected value of generated data (G(z), z being the latent initialization) being classified as fake by the discriminator (1 − D(G(z))). Thus, the discriminator aims to maximize the loss function, whereas the generator tries to minimize it. The parameters of both networks are updated alternately. Optimization can be particularly challenging, as neither network should under-perform nor outperform the other network too quickly. For instance, when both networks are not training synchronously, many values of the random initialization distribution can collapse into few target estimations, leading to decreased diversity of generated samples, a phenomenon known as the “Helvetica scenario” or “mode collapse” (Arjovsky and Bottou 2017). The discriminator would become trapped in a rejection space, and eventually end in a local minimum (Che et al. 2016). Another issue concerns the fleeting convergence property during training: the generator network becomes too good at misleading the discriminator, in which case the discriminator can only guess the correct class, resulting in poor gradients for both networks over time.
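A minimal sketch of the alternating updates implied by this loss, assuming pre-built Keras models generator (noise in, synthetic genotype matrix out) and discriminator (matrix in, probability of being real out); the generator update shown uses the non-saturating variant common in practice rather than minimizing log(1 − D(G(z))) directly:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real, generator, discriminator, latent_dim=100):
    z = tf.random.normal((tf.shape(real)[0], latent_dim))     # latent initialization z
    with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
        fake = generator(z, training=True)                    # G(z)
        d_real = discriminator(real, training=True)           # D(x)
        d_fake = discriminator(fake, training=True)           # D(G(z))
        # discriminator: real samples labelled 1, generated samples labelled 0
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # generator: rewarded when generated samples are classified as real
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    # the two networks are updated alternately on their own parameters
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))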
In the first application of GANs in population genetics, Wang et al. (2021) integrated the coalescent simulator msprime (Baumdicker et al. 2021) with a parameter sampling algorithm (called simulated annealing) as the generator, with a CNN as discriminator. The objective was to infer optimal parameters of the simulations that generated realistic data sets. In this study, the authors sought to estimate demographic parameters and recombination rate by evaluating both real and simulated data using summary statistics in a likelihood-free approach, similarly to ABC. In fact, the authors compared their method, implemented in the software pg-gan, to an SFS-based ABC and achieved a similar performance. However, it is still unclear whether ABC or GANs yield a better performance in terms of the number and accuracy of parameters (here demographic changes), number of necessary simulations, and run-time for population genetic applications.

Beyond inferring parameters, the generative property of GANs has been explored in the form of other generative models such as restricted Boltzmann machines (RBMs; Smolensky 1986; Teh and Hinton 2000). Yelmen et al. (2021) used RBMs to recreate a population structure data set as genotype matrices extracted from the 1000 Genomes Project data set. The authors successfully demonstrated the ability of RBMs to reconstruct multi-modal distributions by reporting various distance measures (such as the Wasserstein distance) and by visual inspection via dimensionality reduction. However, this initial attempt is not capable of recovering rare variant patterns, but advanced architectures designed to deal with mode collapse may solve this issue (Ghosh et al. 2017).


Despite current limitations, GANs appear to be a promising deep learning framework to infer complex population genetic parameters in the face of an uncertain or unknown demographic model (Booker et al. 2022).

Available Resources

Simulators

The application of deep learning methods has been empowered by decades of research into mathematical models of evolution and the development of simulators built to recreate the hidden stochasticity of unseen evolutionary processes. In the context of deep learning, most of the applications in population genetics rely on training algorithms via synthetic data generated by such simulators. Broadly speaking, simulators can be categorized as forward-in-time and backward-in-time approaches. The latter category refers to coalescent simulators which, due to their rigorous underlying models, are extremely efficient as they only keep track of sampled genomes. Forward-in-time simulations tend to be more intuitive in their development, and are often used for complex selective processes which cannot be described by coalescent models. The following section is dedicated to naming a few popular simulation tools, which can be used to generate data sets to train neural networks.

SLiM (Messer 2013) provides a whole programming language, Eidos (Haller 2016), designed to build forward simulation code for a vast range of evolutionary processes. Therefore, it has been used to train deep learning algorithms aimed at inferring complex models. Interestingly, current developments on spatial simulators, such as slendr (Petr et al. 2022), leverage SLiM's capabilities to generate synthetic genetic data variable in time and space. Likewise, SLiM's extensions to simulate bacterial populations (Cury et al. 2022) allow studies of non-model organisms to generate synthetic data sets which could be used in a deep learning framework. Another forward-in-time simulator that has been used in deep learning is SFS_code (Hernandez and Uricchio 2015).

Among coalescent simulators, msprime (Baumdicker et al. 2021) is the preferred choice among practitioners due to its carefully designed code base, efficient tree-sequence data structure (Kelleher et al. 2018), fast run-time, available choice of coalescent models (Adrion et al. 2020), easy programmatic access, as well as active maintenance.
Downloaded from https://academic.oup.com/gbe/article/15/2/evad008/6997869 by guest on 27 January 2024


powered by decades of research into mathematical models and by the input data required (table 1). We further categor­
of evolution and development of simulators built to recre­ ize implementations based on their underlying type of neural
ate the hidden stochasticity of unseen evolutionary pro­ network. Whilst general-purpose software for simulation-
cesses. In the context of deep learning, most of the based inferences are available (Tejero-Cantero et al. 2020),
applications in population genetics rely on training algo­ here we focus only on implementations specific to popula­
rithms via synthetic data generated by such simulators. tion genetic analysis.
Broadly speaking, simulators can be categorized as From this collection, we note that recent implementa­
forward-in-time and backward-in-time approaches. The tions often rely on python packages such as keras and
latter category refers to coalescent simulators which, tensorflow which allow for easy building of layers,
due to their rigorous underlying models, are extremely effi­ efficient optimization of networks, and intuitive monitoring
cient as they only keep track of sampled genomes. of training performance. Implementations based on
Forward-in-time simulation tend to be more intuitive in pytorch (another popular python package) allow for
their development, and are often used for complex select­ more flexibility in constructing complex architectures and in­
ive processes which cannot be described by coalescent vestigating internal nodes. These python packages are sup­
models. The following section is dedicated to name a few ported by a strong and active community of developers and
popular simulation tools, which can be used to generate users, which ensures constant debugging and development.
data set to train neural networks. We also note that forward-in-time simulators are becom­
SLiM (Messer 2013), provides a whole programming ing increasingly popular for training deep neural networks
language Eidos (Haller 2016) designed to build forward despite their significant computational cost, although the
simulation code for a vast range of evolutionary processes. adoption of tree-sequence data and “simulation-on-the-fly”
Therefore, it has been used to train deep learning techniques can reduce such burden. Despite the plethora of
algorithms that aimed at inferring complex models. implementations, each one appears to be suitable to perform
Interestingly, current developments on spatial simula­ specific tasks. At the moment of writing, only DNADNA
tors, such as slendr (Petr et al. 2022), leverage (Sanchez et al. 2022) is the sole software providing a general
SLiM’s capabilities to generate synthetic genetic data framework to both generate simulations and build and train­
variable in time and space. Likewise SLiM’s extensions ing arbitrary networks.
to simulate bacterial populations (Cury et al. 2022) allow
for studies of non-model organisms to generate synthetic
A Novel Application: Detecting
data sets which could be used in a deep learning framework.
Another forward-in-time simulator that has been used in deep
Short-Term Balancing Selection from
learning is SFS_code (Hernandez and Uricchio 2015). Temporal Data
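As an illustration of this programmatic access, the following is a minimal sketch, assuming msprime >= 1.0, of how one might simulate fixed-size haplotype matrices to serve as training inputs; all parameter values are arbitrary placeholders rather than recommendations.

```python
# A minimal sketch: simulate haplotype matrices for deep learning training.
import msprime
import numpy as np

def simulate_matrix(n_hap=40, n_sites=64, seq_len=50_000, seed=1):
    ts = msprime.sim_ancestry(
        samples=n_hap, ploidy=1, population_size=10_000,
        sequence_length=seq_len, recombination_rate=1e-8, random_seed=seed)
    ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=seed)
    geno = ts.genotype_matrix().T        # rows = haplotypes, columns = sites
    out = np.zeros((n_hap, n_sites), dtype=np.int8)
    k = min(n_sites, geno.shape[1])
    out[:, :k] = geno[:, :k]             # pad or truncate to a fixed width
    return out

# a batch of (here neutral) training examples; labels come from the scenario
batch = np.stack([simulate_matrix(seed=s) for s in range(1, 9)])
```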
Software
Most of the studies mentioned herein provide their implementations, often as user-friendly software, of deep learning algorithms for population genetic analyses. In table 1, we summarize these implementations by the programming language used, the required (or preferred) simulator (if any), and the input data required. We further categorize implementations based on their underlying type of neural network. Whilst general-purpose software for simulation-based inferences is available (Tejero-Cantero et al. 2020), here we focus only on implementations specific to population genetic analysis.

From this collection, we note that recent implementations often rely on python packages such as keras and tensorflow, which allow for easy building of layers, efficient optimization of networks, and intuitive monitoring of training performance. Implementations based on pytorch (another popular python package) allow for more flexibility in constructing complex architectures and investigating internal nodes. These python packages are supported by a strong and active community of developers and users, which ensures constant debugging and development.

We also note that forward-in-time simulators are becoming increasingly popular for training deep neural networks despite their significant computational cost, although the adoption of tree-sequence data and "simulation-on-the-fly" techniques can reduce such burden. Despite the plethora of implementations, each one appears to be suitable for specific tasks. At the moment of writing, DNADNA (Sanchez et al. 2022) is the sole software providing a general framework to both generate simulations and build and train arbitrary networks.
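As a brief illustration of the layer-building style these packages enable, a small keras CNN for genotype matrices might be assembled as follows; the architecture shown is a generic sketch, not one of the published implementations in table 1.

```python
# A minimal, illustrative keras CNN over haplotypes x SNPs "images".
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40, 64, 1)),          # haplotypes x SNPs x channel
    tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),     # e.g. three evolutionary classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # training and monitoring then proceed via model.fit(...)
```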
A Novel Application: Detecting Short-Term Balancing Selection from Temporal Data
We now wish to illustrate the feasibility and accessibility of deep learning algorithms to perform population genetic predictive tasks which are typically unachievable using classic approaches. To this aim, by using some of the architectures and techniques described above, we seek to develop a novel algorithm to detect signals of recent balancing selection from temporal genomic data.


Table 1
List of Available Software and Implementations of Deep Learning Methods (not considering generative models) for Population Genetic Inferences

Reference | Language/Library | Simulator | Input
evoNet(a) (Sheehan and Song 2016) | Java | msms | Summary statistics
DeepGenomeScan(b) (Qin et al. 2022) | R/keras | Not trained by simulations | Genotype, phenotype, and sampling locations
Locator(c) (Battey et al. 2020) | python/keras | Not trained by simulations | Genotype data and sampling locations
ML_in_pop_gen(d) (Burger et al. 2022) | python/keras | msprime | SFS
ABC_DL(e) (Mondal et al. 2019) | Java/Encog and R/abc | fastSimcoal2 | SFS
diploS/HIC(f) (Kern and Schrider 2018) | python/keras and scikit-learn | discoal | Summary statistics
partialS/HIC(g) (Xue et al. 2020) | python/keras and scikit-learn | discoal | Summary statistics
drosophila-sweeps(h) (Caldas et al. 2022) | python/pytorch | SLiM/msprime | Summary statistics
defiNETti(i) (Chan et al. 2018) | python/tensorflow | msprime | Genotype data
pop_gen_cnn(j) (Flagel et al. 2018) | python/keras | ms, discoal | Genotype data
ImaGene(k) (Torada et al. 2019) | python/keras | msms | Haplotype data
dlpopsize(l) (Sanchez et al. 2021) | python/pytorch | msprime | Haplotype data
BaSe(m) (Isildak et al. 2021) | python/keras | SLiM | Haplotype data
genomatnn(n) (Gower et al. 2021) | python/tensorflow | SLiM | Genotype data
DeepSweep(o) (Deelder et al. 2021) | python/keras | SFS_code | Haplotype data
Timesweeper(p) (Whitehouse and Schrider 2022) | python/keras | SLiM | Haplotype or allele frequency time-series data
disperseNN(q) (Smith et al. 2022) | python/keras | SLiM or msprime | Genotype or tree sequence data and sampling locations
ReLERNN(r) (Adrion et al. 2020) | python/tensorflow | msprime | Genotype data
SIA(s) (Hejase et al. 2021) | python/keras | SLiM or discoal | Local trees
DNADNA(t) (Sanchez et al. 2022) | python/pytorch | msprime | Haplotype data

NOTE.—Software is available at the respective repositories: (a) https://sourceforge.net/projects/evonet, (b) https://xinghuq.github.io/DeepGenomeScan, (c) https://github.com/kr-colab/locator, (d) https://github.com/fbaumdicker/ML_in_pop_gen, (e) https://github.com/oscarlao/ABC_DL, (f) https://github.com/kr-colab/diploSHIC, (g) https://github.com/xanderxue/partialSHIC, (h) https://github.com/ianvcaldas/drosophila-sweeps, (i) https://github.com/popgenmethods/defiNETti, (j) https://github.com/flag0010/pop_gen_cnn, (k) https://github.com/mfumagalli/ImaGene, (l) https://gitlab.inria.fr/ml_genetics/public/dlpopsize, (m) https://github.com/ulasisik/balancing-selection, (n) https://github.com/grahamgower/genomatnn, (o) https://github.com/WDee/Deepsweep, (p) https://github.com/SchriderLab/timeSeriesSweeps, (q) https://github.com/kr-colab/disperseNN, (r) https://github.com/kr-colab/ReLERNN, (s) https://github.com/CshlSiepelLab/arg-selection, (t) https://mlgenetics.gitlab.io/dnadna

Balancing selection is a process that generates and maintains genetic diversity within populations (Charlesworth 2006), whose signals are typically detected by investigating patterns of genetic diversity, allele frequencies, and shared polymorphisms between species and populations (Key et al. 2014). Long-term balancing selection has been proven to be a major determinant of important phenotypes, including in humans (Soni et al. 2022). However, recent and fleeting balancing selection leaves cryptic genomic traces which are hard to detect and greatly confounded by neutral evolutionary processes (Sellis et al. 2011). Therefore, currently employed methods are either unsuitable or underpowered to detect short-term balancing selection (Fijarczyk and Babik 2015).

Information from temporal genetic variation, either from evolve-and-resequence or ancient DNA (aDNA) experiments, is particularly suitable to identify when and to what extent natural selection acted (Dehasque et al. 2020). Previous attempts to use deep learning to infer balancing selection from contemporary genomes (Isildak et al. 2021) and positive selection from temporal data (Whitehouse and Schrider 2022) suggest that training an algorithm that uses haplotype information from both contemporary and aDNA data has high potential to characterize signals of recent adaptation (and thus recent balancing selection).

To illustrate the ability of deep learning to detect signals of recent balancing selection, we simulated a scenario inspired by available data in human population genetics. We simulated 2,000 50 kbp loci under either neutrality or overdominance (i.e. heterozygote advantage, a form of balancing selection) at the center of the locus, conditioned on a demographic model of European populations (Jouganous et al. 2017). We performed forward-in-time simulations using SLiM (Haller and Messer 2019), similarly to a previous study (Isildak et al. 2021). We imposed selection on a de novo mutation starting 10k years ago, with selection coefficients of 0.25% and 0.5%. We sampled 40 present-day haplotypes, and 10 ancient haplotypes at each of four different time points (8k, 4k, 2k, and 1k years ago), mirroring a plausible human aDNA data collection.

We trained a deep neural network to distinguish between neutrality and selection. Using pytorch, we built a network comprising two branches. One branch receives present-day haplotypes and performs a series of convolutional and pooling layers with permutation-invariant functions. The other branch processes stacked ancient haplotypes at different sampling points, with both branches performing residual convolutions. The two branches are merged with a dense, fully connected layer that performs a ternary classification. We used 64 filters with a 3x3 kernel size and 1x1 padding size, after sorting haplotypes by frequency (Torada et al. 2019). We performed 10 separate training operations to obtain confidence intervals for accuracy values. We report results in the form of confusion matrices, a typical representation used to summarize predictive performance at testing.
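The following is a simplified pytorch sketch of such a branched architecture; it omits the permutation-invariant pooling and residual details of the full implementation (linked below) and uses illustrative dimensions.

```python
# A minimal sketch of a two-branch classifier, loosely following the text;
# exact layer choices are illustrative, not the released architecture.
import torch
import torch.nn as nn

class TwoBranchNet(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        conv = lambda c_in, c_out: nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU())
        # Branch 1: present-day haplotypes (1-channel haplotypes x SNPs image)
        self.modern = nn.Sequential(conv(1, 64), conv(64, 64),
                                    nn.AdaptiveAvgPool2d((4, 4)))
        # Branch 2: ancient haplotypes stacked by sampling time (4 time points)
        self.ancient = nn.Sequential(conv(4, 64), conv(64, 64),
                                     nn.AdaptiveAvgPool2d((4, 4)))
        self.head = nn.Linear(2 * 64 * 4 * 4, n_classes)

    def forward(self, x_modern, x_ancient):
        h1 = self.modern(x_modern).flatten(1)
        h2 = self.ancient(x_ancient).flatten(1)
        return self.head(torch.cat([h1, h2], dim=1))  # ternary logits

net = TwoBranchNet()
logits = net(torch.zeros(2, 1, 40, 64),   # 40 present-day haplotypes x 64 SNPs
             torch.zeros(2, 4, 10, 64))   # 4 time points x 10 ancient haplotypes
```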

FIG. 3.—Confusion matrices to classify neutrality (N), weak (D0.25), or moderate (D0.5) overdominance with a deep learning algorithm using only ancient, present-day, or both types of samples. True and predicted classes are on the x axis and y axis, respectively.

To further showcase the accessibility of deep learning, we have made the full implementation and scripts available at https://github.com/kevinkorfmann/temporal-balancing-selection.

Results show that, despite the small training set used, the network has high accuracy in inferring recent balancing selection under the tested scenario (fig. 3). Notably, we observe a significant decrease in accuracy for distinguishing between weak and moderate selection when silencing the time-series branch, suggesting an important contribution of ancient samples to the prediction. In this illustrative example, we do not attempt to take into account the uncertainty introduced by degraded and low-coverage aDNA data, or population structure across time points, among other confounding factors. Nevertheless, these results demonstrate that building and training novel deep learning algorithms is accessible and generates powerful predictions to address current questions in population genetics.

Interpretable Machine Learning
As already mentioned in the Introduction, population genetics, and evolution in general, aim at uncovering the mechanisms responsible for the diversity of life on our planet. Thus, the black-box nature of deep learning methods represents an important obstacle to their application in these research fields. However, very recent advances in "interpretable machine learning" algorithms (Linardatos et al. 2021) are providing the tools needed to overcome this hurdle.

But what exactly do we mean by interpretability? There is no general consensus on what the word "interpretability" means (Doshi-Velez and Kim 2017; Fan et al. 2020), and discussions of this concept in the artificial intelligence literature tend to be rather abstract and sometimes highly technical. In the context of machine learning, a common definition is "the ability to explain or present in understandable terms to a human" (Doshi-Velez and Kim 2017). This abstract definition has been translated into a myriad of different operational definitions based on a wide range of criteria. In fact, several taxonomies for the interpretability of neural networks have been proposed, and the number of published articles on interpretability has been increasing exponentially since 2000 (Fan et al. 2020). Therefore, here we will restrict ourselves to distinguishing between global and local interpretability and explaining the relevance of these two concepts for population genomics studies. Also, we note that we will not consider very recent efforts aimed at designing inherently interpretable deep neural networks (e.g. Chen et al. 2020) and instead focus on post-hoc interpretation methods, that is, algorithms that can be used to interpret an already trained network.

Global interpretability aims at explaining the overall behaviour of a model (Ancona et al. 2019), which in turn can inform us about the system being studied. In principle, this goal can be achieved by analysing the hyperparameters (which control the learning process and the values taken by the parameters; for example, learning rate, activation function, number of hidden layers, number of neurons per hidden layer) or the parameters (weights and biases) of a deep neural network. However, the information provided by hyperparameters tends to be limited to model complexity, for example, in terms of the number of nodes and hidden layers retained after tuning and fitting, or the type of activation function. On the other hand, the values taken by parameters (weights and biases) after fitting can provide more meaningful biological information; in particular, they help identify the features that contributed the most to the predictive power of the algorithm. For example, Sheehan and Song (2016) (see FCNN section above) used random permutation of each summary statistic (feature) and identified as most informative for the detection of population size changes those statistics that, when randomly permuted, lead to the sharpest decrease in accuracy. Another approach is based on feature importance (Olden and Jackson 2002), which was used by another study (Qin et al. 2022) to identify as outlier loci those that contributed the most to the power of an FCNN to predict an individual's phenotype or geographic origin.
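A minimal sketch of the permutation-based logic, applicable to any trained classifier exposing a predict function, is given below; the function names and accuracy metric are illustrative assumptions.

```python
# A minimal sketch of permutation feature importance for a trained model.
import numpy as np

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    base = np.mean(model.predict(X) == y)          # baseline accuracy
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):                    # one summary statistic at a time
        accs = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])                  # break the feature-label link
            accs.append(np.mean(model.predict(Xp) == y))
        drops[j] = base - np.mean(accs)            # large drop = informative feature
    return drops
```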


Feature importance is based on the idea that the magnitude of the connection weights between neurons connecting input and output nodes measures the extent to which each feature contributes to the network's predictive power. The architecture used for these two examples was an FCNN. A different approach is necessary in the case of CNNs. For example, in the case of a CNN that classifies images into different categories, a common approach is to use saliency maps, which measure the support that different groups of pixels in an image provide for a particular class (Mohamed et al. 2022). This is implemented by feeding the CNN an image of a particular class and using visualization techniques to generate heatmaps overlaid on the original image; the image elements that are being used by the CNN to identify the class are highlighted in red. A population genetics application of this approach is presented by Gower et al. (2021), who used a CNN algorithm to detect adaptive introgression.
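A minimal gradient-based saliency sketch in pytorch, in the spirit of Simonyan et al. (2013), could look as follows; the toy network and input stand in for a trained population genetic CNN.

```python
# A minimal sketch of a gradient-based saliency map for a CNN classifier.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.Flatten(), nn.LazyLinear(3))    # toy CNN, 3 classes

x = torch.rand(1, 1, 40, 64, requires_grad=True)       # haplotypes x SNPs
score = net(x)[0].max()                                # score of predicted class
score.backward()
saliency = x.grad.abs().squeeze()                      # support of each "pixel"
# high values mark genomic positions/haplotypes driving the classification
```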
Local interpretability aims at understanding the reasons for a specific decision concerning a particular instance. Note that the ability of a particular feature to predict an attribute (e.g. a phenotype) for a particular instance (data point) may depend on the values taken by the other features. This is particularly relevant in population genomics applications, as the effect that a particular locus variant has on the phenotype of an individual may depend on the variants found at other loci (i.e. the genetic background; Chandler et al. 2013). A very promising technique to address this important issue is the Shapley value approach (Strumbelj and Kononenko 2010). Shapley values were first introduced in cooperative game theory (Shapley 1953) to calculate the contribution of individual players to the outcome of a game. In the context of deep learning, each feature represents a player, different combinations of features (feature subsets) represent coalitions, and the set comprising all features represents the "grand coalition of players". The objective is to explain how the values of a feature for a particular instance contribute to the difference between the prediction of a machine learning algorithm with the feature included and the expected prediction when the feature value is ignored (Strumbelj and Kononenko 2010). Thus, the Shapley value of a feature can be interpreted as the average marginal contribution of the feature to all possible feature subsets that can be formed without it (cf. Ancona et al. 2019). An important advantage of the approach is that it is the only explanation method that takes into account all the potential dependencies and interactions between feature values (cf. Strumbelj and Kononenko 2010). In principle, this requires the evaluation of all $2^N$ feature subsets (coalitions), where $N$ is the number of features in the full set (the grand coalition). Obviously, this is only possible when the number of features is small to moderate (a few dozen). Thus, several algorithms have been proposed for approximating Shapley values, and a unified approach proposed by Lundberg and Lee (2017) has been implemented in both python (KernelShap and DeepShap) and R (shapr). However, these implementations are limited to deep neural networks with a moderate number of features. Nevertheless, very recent developments have led to new approaches, DASP (Ancona et al. 2019) and G-DeepShap (Chen et al. 2022), that may scale up to population genomics datasets. For the moment, there are no applications of Shapley values to population genomics studies; there is only an application in population genetics, but in the context of random forests (Kittlein et al. 2022).
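To make the combinatorial definition concrete, the following sketch computes exact Shapley values by enumerating all $2^N$ coalitions (feasible only for small $N$, as noted above); the value function v is a toy stand-in for a model prediction with absent features marginalized out.

```python
# A minimal sketch of exact Shapley values over all feature subsets.
from itertools import combinations
from math import factorial

def shapley(n_features, v):
    phi = [0.0] * n_features
    players = range(n_features)
    for i in players:
        others = [j for j in players if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # weight |S|! (n - |S| - 1)! / n! of each coalition S
                w = factorial(len(S)) * factorial(n_features - len(S) - 1) \
                    / factorial(n_features)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# toy value function with an interaction between features 0 and 1
v = lambda S: (0 in S) + 2 * (1 in S) + (0 in S and 1 in S)
print(shapley(3, v))  # feature 2 never contributes -> Shapley value 0.0
```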


Much work remains to be done in order to incorporate the latest advances in interpretable machine learning into population genomics. Interpretability can lead to important breakthroughs by uncovering complex genomic signatures left by the non-linear interactions among many genetic and evolutionary processes. Although population genetics theory has already provided a deep understanding of the genomic signatures left by complex demographic histories and selective processes, the "agnostic" nature of deep learning has the potential to uncover "hidden" genomic signatures that traditional model-based statistical methods are unable to detect. In doing so, it may generate new hypotheses for explaining observed genomic patterns that could then be tested.

Dealing with Uncertainty
Whilst, as described so far, deep learning has led to novel applications in population genetics, the intrinsic challenges associated with uncertain DNA sequencing data, simulated training data sets, and an incomplete statistical framework are limiting factors to fully exploiting the power of such techniques.

As previously described, the data given as input to deep learning algorithms in population genetics typically consist of alignments of genotypes, inferred haplotypes, or summary statistics. Genotype calling, phasing, and the calculation of summary statistics are associated with statistical uncertainty (Nielsen et al. 2011), especially when performed from low-coverage sequencing (i.e. from museum specimens, ancient samples, or generally non-model species) (Lou et al. 2021). Sequencing data uncertainty could be tackled by providing estimates of summary statistics from genotype likelihoods as input. Additional approaches based on filtering masks to take into account data errors and missingness have been proposed in the literature (Adrion et al. 2020). Finally, generating sequencing data-like simulations (Escalona et al. 2016; Cury et al. 2022) for training could be a valuable solution to accommodate all nuances of the experimental data, at the expense of increasing the computational resources needed. Other sequencing technologies may provide data of a different nature [e.g. sample allele frequencies from pooled-sequencing experiments (Anand et al. 2016)], and therefore appropriate considerations should be made in terms of the additional statistical uncertainty associated with such output. Approaches based on using trees or local ancestry tracts as input (Hamid et al. 2022) may be more prone to input data uncertainty.

One of the main concerns about current applications of deep learning in population genetics is the use of synthetic data for training neural networks. For instance, the detection of signals of natural selection typically requires knowledge of the underlying demographic model to generate a null distribution under neutrality (Nielsen 2005). If the baseline demographic model is ill defined, inference of natural selection is expected to be biased (Johri et al. 2022). Whilst this issue is shared with other popular inferential frameworks, such as ABC (Bertorelle et al. 2010), the use of simulations in this context appears to be more problematic given the "black-box" nature of neural networks. Solutions to address the uncertainty of simulations explored in the literature include testing a network trained on misspecified models (e.g. Flagel et al. 2018; Torada et al. 2019; Adrion et al. 2020), and deploying it on known cases of selection and neutrality (Isildak et al. 2021) to quantify false positive and false negative rates. Although post-inference diagnostic analyses are required to ensure the robustness of results, as per best practice in machine learning (Lones 2021; Whalen et al. 2022), the ever-increasing curated list of demographic models (Adrion et al. 2020) will facilitate the use of synthetic data for training networks. Likewise, these resources will facilitate the establishment of gold-standard data sets to benchmark newly proposed architectures. Finally, efforts towards the adoption of transfer learning and domain adaptation techniques should further reduce any bias associated with uncertain training data sets.

Most applications described herein aim at classifying data into discrete labels or providing point estimates of parameters of interest. Statistical uncertainty should be quantified by characterizing probability distributions of both the model uncertainty (the epistemic, or reducible, part) and the inherent stochastic uncertainty of the data-generating process (the aleatoric, or irreducible, uncertainty) (Hüllermeier and Waegeman 2021; Sanchez, Caramiaux, et al. 2022). Solutions to this problem include the prediction of the mean and standard deviation (Chan et al. 2018) or confidence intervals alongside point estimates, and the quantification of any errors associated with the training phase (Smith et al. 2022). Thus, we encourage practitioners, in upcoming publications, to consider modifying their models to account for uncertainty in a principled manner.
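As a sketch of the first of these solutions, a network can output a mean and a variance and be trained with a Gaussian negative log-likelihood; the example below is illustrative and not drawn from any of the cited implementations.

```python
# A minimal sketch of predicting a mean and standard deviation instead of
# a point estimate; shapes and data are illustrative placeholders.
import torch
import torch.nn as nn

class MeanStdHead(nn.Module):
    def __init__(self, n_features=100):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.mean = nn.Linear(64, 1)
        self.log_var = nn.Linear(64, 1)   # log-variance keeps sigma positive

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_var(h)

model = MeanStdHead()
x, y = torch.randn(8, 100), torch.randn(8, 1)       # e.g. features -> parameter
mu, log_var = model(x)
loss = nn.GaussianNLLLoss()(mu, y, log_var.exp())   # aleatoric uncertainty term
loss.backward()
```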
From Regular Convolutions to Graph Convolutions
Genotype matrices have been the starting point for doing any kind of population genetic analysis, either by calculating summary statistics (e.g. site frequency spectra), model-based probabilistic optimization algorithms (e.g. SMC), or Bayesian sampling techniques (e.g. ABC), and non-model-based function approximations (e.g. deep learning). Yet, recent trends emphasize a need to combine the power of deep learning approaches with a model-based constraint. A promising idea is to format the input data (genotype matrix) in order for model assumptions to be encoded directly in the data for subsequent training and inference. In the most general case, this model-based formatting can be considered as a representation of the ARG, for which few methods have been developed (Rasmussen et al. 2014; Kelleher et al. 2019; Speidel et al. 2019; Mahmoudi et al. 2022). Decoupling the ARG or genealogy construction from the inference of evolutionary parameters of interest would create the opportunity to increase collaborations with mathematical modelers, by incorporating more complex coalescent models or biological processes like introgression, structured populations, or species-specific life-history traits. Additionally, it may no longer be necessary to try to interpret the inner workings of a CNN trained on (sparse) genotype matrices (which likely rebuilds parts of the ARG through complex aggregation of genotype density patterns). Any type of model-based property could be questioned through modification of the ARG. An essential step has been developed by Korfmann et al., providing not only a new ARG-parameter inference method based on graph neural networks (GNNs) but also an SMC method applied to a particular coalescent model known for long-range LD interdependencies (Korfmann, Sellinger, et al. 2022). This approach offers the unique opportunity to test for mathematical model-based blind spots in an inherently Markovian constrained SMC method using GNNs.
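As a minimal illustration of the underlying operation, a single graph-convolution (message-passing) layer over a genealogy-like graph can be written in a few lines of pytorch; the adjacency structure and node features here are toy assumptions, not an ARG encoding from any published method.

```python
# A minimal sketch of one graph-convolution (message-passing) layer.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, adj):
        # average neighbour features (normalized adjacency), then transform
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))

n_nodes, d = 7, 4                      # e.g. nodes of a marginal tree in an ARG
adj = torch.eye(n_nodes)               # self-loops
adj[0, 1] = adj[1, 0] = 1.0            # toy parent-child edges
x = torch.randn(n_nodes, d)            # node features, e.g. ages or branch lengths
h = GraphConv(d, 8)(x, adj)            # per-node embeddings for downstream pooling
```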

Conclusions
This review illustrates the great diversity of deep learning architectures that have been used in population genetics applications. Currently, the prevailing type of application involves the training of algorithms with simulated data, but there is an increasing number of studies that use a more standard approach where training is carried out using observed data. Thus, we can identify two strands of methods: one that is closely associated with likelihood-free, simulation-based approaches that consider explicit evolutionary models, and another that conforms to a purely data-driven, model-free approach. In both cases, however, deep learning is used as an inferential tool (as opposed to a predictive or pattern recognition approach). However, as the popularity of deep learning increases among population geneticists, we expect that further deep learning algorithms, including the latest diffusion models (Ramesh et al. 2022), will be adapted to solve predictive tasks. Intriguingly, novel applications may go beyond classic inferential tasks and include other aims, such as efficient data compression or the generation of synthetic experimental data sets. Likewise, solutions for making neural networks a "transparent box," such as neural additive models (Novakovsky et al. 2022) and symbolic metamodeling (Alaa and van der Schaar 2019), will facilitate the adoption of deep learning among empiricists.

More research is needed in the domain of "interpretable" machine learning so as to gain an understanding of how deep learning algorithms make their decisions. This in turn would enable population geneticists to uncover novel genomic signatures associated with non-linear processes that current theory has not yet suggested, including non-linear interactions among many genetic, ecological, and evolutionary processes. Importantly, further developments in local interpretability (see above) can help us identify epistatic interactions and gain a better understanding of how genetic background influences the phenotypic effect of mutations.

One key aspect to making deep learning a popular framework in population genetics is to ensure reproducible analyses and to avoid repeating the training of highly parameterized networks from scratch. In this context, recent efforts to provide users with documented workflows (Whitehouse and Schrider 2022) and pre-trained networks (Hamid et al. 2022) will both reduce the carbon footprint (Grealey et al. 2022) and facilitate the application of deep learning to a wider range of data sets, allowing users to modify a network's parameters according to the specific requirements of the biological system under examination.

Finally, we urge the community to make the field as inclusive as possible. Whilst open-source software release is common practice among machine learning practitioners, access to appropriate computing resources is still a limiting factor for many researchers. Initiatives to provide GPUs (i.e. graphics processing units) and cloud computing credits to academics in need represent a valuable step towards making deep learning in population genetics accessible and inclusive to a wide range of scientists. Likewise, we encourage the establishment of training opportunities in machine learning for early-career population geneticists. Importantly, such events should happen either online or in hybrid format, with resources provided in multiple languages to ensure that text or verbal comprehension is not a barrier to learning. Consortia and local networks, properly funded by the wealthiest countries, appear to be a natural solution to fulfil this need. If all these conditions are met, deep learning will soon be established as part of the common toolkit among population geneticists globally.

Acknowledgments
We are grateful to all members of the EvoGenomics.AI consortium (www.evogenomics.ai) for helpful discussions. We also wish to thank Ulas Isildak for assistance in using SLiM. Two anonymous reviewers provided insightful comments that improved the manuscript.

Funding
K.K. is supported by a grant from the Deutsche Forschungsgemeinschaft (DFG) through the TUM International Graduate School of Science and Engineering (IGSSE), GSC 81, within the project GENOMIE QADOP. We acknowledge the support of the Imperial College London - TUM Partnership award.

Data Availability
No new data were generated in support of this research. An implementation of the neural network illustrated in this review is available at https://github.com/kevinkorfmann/temporal-balancing-selection.

Literature Cited
Adrion JR, et al. 2020. A community-maintained standard library of population genetic models. eLife 9:e54967. doi:10.7554/eLife.54967
Adrion JR, Galloway JG, Kern AD. 2020. Predicting the landscape of recombination using deep learning. Mol Biol Evol. 37(6):1790–1808. doi:10.1093/molbev/msaa038
Alaa AM, van der Schaar M. 2019. Demystifying black-box models with symbolic metamodels. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, editors. Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc. Available from: https://proceedings.neurips.cc/paper/2019/file/567b8f5f423af15818a068235807edc0-Paper.pdf.
Anand S, et al. 2016. Next generation sequencing of pooled samples: guideline for variants' filtering. Sci Rep. 6(1):33735. doi:10.1038/srep33735
Ancona M, Oztireli C, Gross M. 2019. Explaining deep neural networks with a polynomial time algorithm for Shapley values approximation. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of Machine Learning Research. 36th International Conference on Machine Learning (ICML). Vol. 97; 2019 Jun 9–15; Long Beach, CA.
Ancona M, Öztireli C, Gross MH. 2019. Explaining deep neural networks with a polynomial time algorithm for Shapley values approximation. CoRR. Available from: http://arxiv.org/abs/1903.10992.
Arjovsky M, Bottou L. 2017. Towards principled methods for training generative adversarial networks. Available from: https://arxiv.org/abs/1701.04862.
Azouri D, Abadi S, Mansour Y, Mayrose I, Pupko T. 2021. Harnessing machine learning to guide phylogenetic-tree search algorithms. Nat Commun. 12(1):1983. doi:10.1038/s41467-021-22073-8
globally. 22073-8


Battey CJ, Coffing GC, Kern AD. 2021. Visualizing population structure with variational autoencoders. G3 11(1):jkaa036. doi:10.1093/g3journal/jkaa036
Battey CJ, Ralph PL, Kern AD. 2020. Predicting geographic location from genetic variation with deep neural networks. eLife 9:e54507. doi:10.7554/eLife.54507
Baumdicker F, et al. 2021. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 220(3):iyab229. doi:10.1093/genetics/iyab229
Beaumont MA, Zhang W, Balding DJ. 2002. Approximate Bayesian computation in population genetics. Genetics 162(4):2025–2035. doi:10.1093/genetics/162.4.2025
Bertorelle G, Benazzo A, Mona S. 2010. ABC as a flexible framework to estimate demography over space and time: some cons, many pros. Mol Ecol. 19(13):2609–2625. doi:10.1111/j.1365-294X.2010.04690.x
Blischak PD, Barker MS, Gutenkunst RN. 2021. Chromosome-scale inference of hybrid speciation and admixture with convolutional neural networks. Mol Ecol Resour. 21(8):2676–2688. doi:10.1111/1755-0998.13355
Blum MGB, François O. 2010. Non-linear regression models for approximate Bayesian computation. Stat Comput. 20(1):63–73. doi:10.1007/s11222-009-9116-0
Booker WW, Ray DD, Schrider DR. 2022. This population doesn't exist: learning the distribution of evolutionary histories with generative adversarial networks. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/09/17/2022.09.17.508145.
Burger KE, Pfaffelhuber P, Baumdicker F. 2022. Neural networks for self-adjusting mutation rate estimation when the recombination rate is unknown. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/05/17/2021.09.02.457550.
Byrska-Bishop M, et al. 2022. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185(18):3426–3440.e19. doi:10.1016/j.cell.2022.08.004
Bzdok D, Altman N, Krzywinski M. 2018. Statistics versus machine learning. Nat Methods 15(4):233–234. doi:10.1038/nmeth.4642
Caldas IV, Clark AG, Messer PW. 2022. Inference of selective sweep parameters through supervised learning. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/07/20/2022.07.19.500702.
Capblancq T, Luu K, Blum MGB, Bazin E. 2018. Evaluation of redundancy analysis to identify signatures of local adaptation. Mol Ecol Resour. 18(6):1223–1233. doi:10.1111/1755-0998.12906
Chan J, et al. 2018. A likelihood-free inference framework for population genetic data using exchangeable neural networks. Adv Neural Inf Process Syst. 31:8594–8605.
Chandler CH, Chari S, Dworkin I. 2013. Does your gene need a background check? How genetic background impacts the analysis of mutations, genes, and evolution. Trends Genet. 29(6):358–366. doi:10.1016/j.tig.2013.01.009
Charlesworth D. 2006. Balancing selection and its effects on sequences in nearby genome regions. PLoS Genet. 2(4):1–6. doi:10.1371/journal.pgen.0020064
Che T, Li Y, Jacob AP, Bengio Y, Li W. 2016. Mode regularized generative adversarial networks. Available from: https://arxiv.org/abs/1612.02136.
Chen Z, Bei Y, Rudin C. 2020. Concept whitening for interpretable image recognition. Nat Mach Intell. 2(12):772–782. doi:10.1038/s42256-020-00265-z
Chen H, Lundberg SM, Lee S-I. 2022. Explaining a series of models by propagating Shapley values. Nat Commun. 13(1):4512. doi:10.1038/s41467-022-31384-3
Cho K, van Merrienboer B, Bahdanau D, Bengio Y. 2014. On the properties of neural machine translation: encoder-decoder approaches. CoRR. Available from: http://arxiv.org/abs/1409.1259.
Cranmer K, Brehmer J, Louppe G. 2020. The frontier of simulation-based inference. Proc Natl Acad Sci U S A. 117(48):30055–30062. doi:10.1073/pnas.1912789117
Csilléry K, Blum MG, Gaggiotti OE, François O. 2010. Approximate Bayesian computation (ABC) in practice. Trends Ecol Evol. 25(7):410–418. doi:10.1016/j.tree.2010.04.001
Csilléry K, François O, Blum MGB. 2012. abc: an R package for approximate Bayesian computation (ABC). Methods Ecol Evol. 3(3):475–479. doi:10.1111/j.2041-210X.2011.00179.x
Cury J, Haller BC, Achaz G, Jay F. 2022. Simulation of bacterial populations with SLiM. Peer Community J. 2:e7. doi:10.24072/pcjournal.72
Deelder W, et al. 2021. Using deep learning to identify recent positive selection in malaria parasite sequence data. Malar J. 20(1):270. doi:10.1186/s12936-021-03788-x
Dehasque M, et al. 2020. Inference of natural selection from ancient DNA. Evol Lett. 4(2):94–108. doi:10.1002/evl3.165
Doshi-Velez F, Kim B. 2017. Towards a rigorous science of interpretable machine learning. Available from: https://arxiv.org/abs/1702.08608.
Elman JL. 1990. Finding structure in time. Cogn Sci. 14(2):179–211. doi:10.1207/s15516709cog1402_1
Escalona M, Rocha S, Posada D. 2016. A comparison of tools for the simulation of genomic next-generation sequencing data. Nat Rev Genet. 17(8):459–469. doi:10.1038/nrg.2016.57
Ewing G, Hermisson J. 2010. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26(16):2064–2065. doi:10.1093/bioinformatics/btq322
Excoffier L, et al. 2021. fastsimcoal2: demographic inference under complex evolutionary scenarios. Bioinformatics 37(24):4882–4885. doi:10.1093/bioinformatics/btab468
Fan F, Xiong J, Wang G. 2020. On interpretability of artificial neural networks. CoRR. Available from: http://arxiv.org/abs/2001.02522.
Fijarczyk A, Babik W. 2015. Detecting balancing selection in genomes: limits and prospects. Mol Ecol. 24(14):3529–3545. doi:10.1111/mec.13226
Flagel L, Brandvain Y, Schrider DR. 2018. The unreasonable effectiveness of convolutional neural networks in population genetic inference. Mol Biol Evol. 36(2):220–238. doi:10.1093/molbev/msy224
Foll M, Gaggiotti O. 2008. A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180(2):977–993. doi:10.1534/genetics.108.092221
Fonseca EM, Colli GR, Werneck FP, Carstens BC. 2021. Phylogeographic model selection using convolutional neural networks. Mol Ecol Resour. 21(8):2661–2675. doi:10.1111/1755-0998.13427
Fountain-Jones NM, Smith ML, Austerlitz F. 2021. Machine learning in molecular ecology. Mol Ecol Resour. 21(8):2589–2597. doi:10.1111/1755-0998.13532
Frichot E, Schoville SD, Bouchard G, François O. 2013. Testing for associations between loci and environmental gradients using latent factor mixed models. Mol Biol Evol. 30(7):1687–1699. doi:10.1093/molbev/mst063
Ghosh A, Kulharia V, Namboodiri V, Torr PHS, Dokania PK. 2017. Multi-agent diverse generative adversarial networks. Available from: https://arxiv.org/abs/1704.02906.
Goodfellow IJ, et al. 2014. Generative adversarial networks. arXiv:1406.2661 [cs, stat]. Available from: http://arxiv.org/abs/1406.2661.


Goodfellow I, Bengio Y, Courville A. 2016. Deep learning. Cambridge: MIT Press.
Gower G, Picazo PI, Fumagalli M, Racimo F. 2021. Detecting adaptive introgression in human evolution using convolutional neural networks. eLife 10:e64669. doi:10.7554/eLife.64669
Grealey J, et al. 2022. The carbon footprint of bioinformatics. Mol Biol Evol. 39(3):msac034. doi:10.1093/molbev/msac034
Greener JG, Kandathil SM, Moffat L, Jones DT. 2022. A guide to machine learning for biologists. Nat Rev Mol Cell Biol. 23(1):40–55. doi:10.1038/s41580-021-00407-0
Halldorsson BV, et al. 2022. The sequences of 150,119 genomes in the UK Biobank. Nature 607(7920):732–740. doi:10.1038/s41586-022-04965-x
Haller BC. 2016. Eidos: a simple scripting language. Available from: http://benhaller.com/slim/Eidos_Manual.pdf.
Haller BC, Galloway J, Kelleher J, Messer PW, Ralph PL. 2019. Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol Ecol Resour. 19(2):552–566. doi:10.1111/1755-0998.12968
Haller BC, Messer PW. 2019. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Mol Biol Evol. 36(3):632–637. doi:10.1093/molbev/msy228
Hamid I, Korunes KL, Schrider DR, Goldberg A. 2022. Localizing post-admixture adaptive variants with object detection on ancestry-painted chromosomes. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/09/05/2022.09.04.506532.
Hejase HA, Mo Z, Campagna L, Siepel A. 2021. A deep-learning approach for inference of selective sweeps from the ancestral recombination graph. Mol Biol Evol. 39(1):msab332. doi:10.1093/molbev/msab332
Hernandez RD, Uricchio LH. 2015. SFS_code: more efficient and flexible forward simulations. Preprint. Available from: http://biorxiv.org/lookup/doi/10.1101/025064.
Hochreiter S, Schmidhuber J. 1997. Long short-term memory. Neural Comput. 9(8):1735–1780. doi:10.1162/neco.1997.9.8.1735
Holsinger KE, Weir BS. 2009. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet. 10(9):639–650. doi:10.1038/nrg2611
Hornik K, Stinchcombe M, White H. 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2(5):359–366. doi:10.1016/0893-6080(89)90020-8
Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18(2):337–338. doi:10.1093/bioinformatics/18.2.337
Hüllermeier E, Waegeman W. 2021. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach Learn. 110(3):457–506. doi:10.1007/s10994-021-05946-3
Isildak U, Stella A, Fumagalli M. 2021. Distinguishing between recent balancing selection and incomplete sweep using deep neural networks. Mol Ecol Resour. 21(8):2706–2718. doi:10.1111/1755-0998.13379
Johnson MM, Wilke CO. 2022. Recombination rate inference via deep learning is limited by sequence diversity. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/07/02/2022.07.01.498489.
Johri P, Eyre-Walker A, Gutenkunst RN, Lohmueller KE, Jensen JD. 2022. On the prospect of achieving accurate joint estimation of selection with population history. Genome Biol Evol. 14(7):evac088. doi:10.1093/gbe/evac088
Jombart T, Devillard S, Balloux F. 2010. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 11(1):94. doi:10.1186/1471-2156-11-94
Jouganous J, Long W, Ragsdale AP, Gravel S. 2017. Inferring the joint demographic history of multiple populations: beyond the diffusion approximation. Genetics 206(3):1549–1567. doi:10.1534/genetics.117.200493
Kelleher J, et al. 2019. Inferring whole-genome histories in large population datasets. Nat Genet. 51(9):1330–1338. doi:10.1038/s41588-019-0483-y
Kelleher J, Thornton KR, Ashander J, Ralph PL. 2018. Efficient pedigree recording for fast population genetics simulation. PLoS Comput Biol. 14(11):e1006581. doi:10.1371/journal.pcbi.1006581
Kern AD, Schrider DR. 2016. Discoal: flexible coalescent simulations with selection. Bioinformatics 32(24):3839–3841. doi:10.1093/bioinformatics/btw556
Kern AD, Schrider DR. 2018. diploS/HIC: an updated approach to classifying selective sweeps. G3 Genes—Genomes—Genetics 8(6):1959–1970. doi:10.1534/g3.118.200262
Key FM, Teixeira JC, de Filippo C, Andrés AM. 2014. Advantageous diversity maintained by balancing selection in humans. Curr Opin Genet Dev. 29:45–51. doi:10.1016/j.gde.2014.08.001
Khomutov E, Arzymatov K, Shchur V. 2021. Deep learning based methods for estimating distribution of coalescence rates from genome-wide data. J Phys Conf Ser. 1740(1):012031. doi:10.1088/1742-6596/1740/1/012031
Kim SY, et al. 2011. Estimation of allele frequency and association mapping using next-generation sequencing data. BMC Bioinform. 12(1):231. doi:10.1186/1471-2105-12-231
Kingma DP, Welling M. 2014. Auto-encoding variational Bayes. arXiv:1312.6114 [cs, stat]. Available from: http://arxiv.org/abs/1312.6114.
Kittlein MJ, Mora MS, Mapelli FJ, Austrich A, Gaggiotti OE. 2022. Deep learning and satellite imagery predict genetic diversity and differentiation. Methods Ecol Evol. 13(3):711–721. doi:10.1111/2041-210X.13775
Korfmann K, Abu Awad D, Tellier A. 2022. Weak seed banks influence the signature and detectability of selective sweeps. Preprint. Available from: http://biorxiv.org/lookup/doi/10.1101/2022.04.26.489499.
Korfmann K, Sellinger TPP, Freund F, Fumagalli M, Tellier A. 2022. Simultaneous inference of past demography and selection from the ancestral recombination graph under the beta coalescent. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/09/30/2022.09.28.508873.
Koropoulis A, Alachiotis N, Pavlidis P. 2020. Detecting positive selection in populations using genetic data. New York (NY): Springer US. p. 87–123.
Krizhevsky A, Sutskever I, Hinton GE. 2012. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in neural information processing systems. Vol. 25. Curran Associates, Inc. Available from: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
Kumar H, et al. 2022. Machine-learning prospects for detecting selection signatures using population genomics data. J Comput Biol. 29(9):943–960. doi:10.1089/cmb.2021.0447
Laruson AJ, Fitzpatrick MC, Keller SR, Haller BC, Lotterhos KE. 2022. Seeing the forest for the trees: assessing genetic offset predictions from gradient forest. Evol Appl. 15(3):403–416. doi:10.1111/eva.13354
LeCun Y, et al. 1989. Handwritten digit recognition with a back-propagation network. In: Touretzky D, editor. Advances in neural information processing systems. Vol. 2. Morgan-Kaufmann. Available from: https://proceedings.neurips.cc/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf.


LeCun Y, Bengio Y. 1995. Convolutional networks for images, speech, and time-series. Cambridge: MIT Press.
LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521(7553):436–444. doi:10.1038/nature14539
LeCun Y, Huang FJ, Bottou L. 2004. Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004). IEEE. Vol. 2. p. II–104. doi:10.1109/CVPR.2004.1315150
Levy SE, Myers RM. 2016. Advancements in next-generation sequencing. Annu Rev Genomics Hum Genet. 17(1):95–115. doi:10.1146/annurev-genom-083115-022413
Lin K, Li H, Schlötterer C, Futschik A. 2011. Distinguishing positive selection from neutral evolution: boosting the performance of summary statistics. Genetics 187(1):229–244. doi:10.1534/genetics.110.122614
Linardatos P, Papastefanopoulos V, Kotsiantis S. 2021. Explainable AI: a review of machine learning interpretability methods. Entropy 23(1):18. doi:10.3390/e23010018
Linnainmaa S. 1976. Taylor expansion of the accumulated rounding error. BIT 16(2):146–160. doi:10.1007/BF01931367
Lones MA. 2021. How to avoid machine learning pitfalls: a guide for academic researchers. Available from: https://arxiv.org/abs/2108.02497.
Lopes J, Beaumont M. 2010. ABC: a useful Bayesian tool for the analysis of population data. Infect Genet Evol. 10(6):825–832. doi:10.1016/j.meegid.2009.10.010
López-Cortés XA, Matamala F, Maldonado C, Mora-Poblete F, Scapim CA. 2020. A deep learning approach to population structure inference in inbred lines of maize. Front Genet. 11:543459. doi:10.3389/fgene.2020.543459
Lou RN, Jacobs A, Wilder AP, Therkildsen NO. 2021. A beginner's guide to low-coverage whole genome sequencing for population genomics. Mol Ecol. 30(23):5966–5993. doi:10.1111/mec.16077
Lundberg SM, Lee SI. 2017. A unified approach to interpreting model predictions. In: Guyon I, Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in neural information processing systems (NIPS 2017). Vol. 30. 31st Annual Conference on Neural Information Processing Systems (NIPS); 2017 Dec 4–9; Long Beach, CA.
Luu K, Bazin E, Blum MGB. 2017. pcadapt: an R package to perform genome scans for selection based on principal component analysis. Mol Ecol Resour. 17(1):67–77. doi:10.1111/1755-0998.12592
Mahmoudi A, Koskela J, Kelleher J, Chan Y-b, Balding D. 2022. Bayesian inference of ancestral recombination graphs. PLoS Comput Biol. 18(3):e1009960. doi:10.1371/journal.pcbi.1009960
Mantes AD, Montserrat DM, Bustamante CD, Giró-i Nieto X, Ioannidis AG. 2021. Neural ADMIXTURE: rapid population clustering with autoencoders. Preprint. Available from: http://biorxiv.org/lookup/doi/10.1101/2021.06.27.450081.
Meisner J, Albrechtsen A. 2022. Haplotype and population structure inference using neural networks in whole-genome sequencing data. Genome Res. 32(8):1542–1552. doi:10.1101/gr.276813.122
Messer PW. 2013. SLiM: simulating evolution with selection and linkage. Genetics 194(4):1037–1039. doi:10.1534/genetics.113.152181
Minsky ML. 1967. Computation: finite and infinite machines. Hoboken: Prentice-Hall. (Prentice-Hall series in automatic computation).
Mohamed E, Sirlantzis K, Howells G. 2022. A review of visualisation-as-explanation techniques for convolutional neural networks and their evaluation. Displays 73:102239. doi:10.1016/j.displa.2022.102239
Mondal M, Bertranpetit J, Lao O. 2019. Approximate Bayesian computation with deep learning supports a third archaic introgression in Asia and Oceania. Nat Commun. 10(1):246. doi:10.1038/s41467-018-08089-7
Mughal MR, DeGiorgio M. 2019. Localizing and classifying adaptive targets with trend filtered regression. Mol Biol Evol. 36(2):252–270. doi:10.1093/molbev/msy205
Nguembang Fadja A, Riguzzi F, Bertorelle G, Trucchi E. 2021. Identification of natural selection in genomic data with deep convolutional neural network. BioData Min. 14(1):51. doi:10.1186/s13040-021-00280-9
Nielsen R. 2005. Molecular signatures of natural selection. Annu Rev Genet. 39(1):197–218. doi:10.1146/annurev.genet.39.073003.112420
Nielsen R, Paul JS, Albrechtsen A, Song YS. 2011. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 12(6):443–451. doi:10.1038/nrg2986
Novakovsky G, Fornes O, Saraswat M, Mostafavi S, Wasserman WW. 2022. ExplaiNN: interpretable and transparent neural networks for genomics. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/05/25/2022.05.20.492818.
Olden J, Jackson D. 2002. Illuminating the black box: a randomization approach for understanding variable contributions in artificial neural networks. Ecol Modell. 154:135–150.
O'Shea K, Nash R. 2015. An introduction to convolutional neural networks. Available from: https://arxiv.org/abs/1511.08458.
Pavlidis P, Jensen JD, Stephan W. 2010. Searching for footprints of positive selection in whole-genome SNP data from nonequilibrium populations. Genetics 185(3):907–922. doi:10.1534/genetics.110.116459
Perez MF, et al. 2022. Coalescent-based species delimitation meets deep learning: insights from a highly fragmented cactus system. Mol Ecol Resour. 22(3):1016–1028. doi:10.1111/1755-0998.13534
Petr M, Haller BC, Ralph PL, Racimo F. 2022. slendr: a framework for spatio-temporal population genomic simulations on geographic landscapes. Available from: https://www.biorxiv.org/content/10.1101/2022.03.20.485041v1
Poplin R, et al. 2018. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 36(10):983–987. doi:10.1038/nbt.4235
Prangle D. 2015. Summary statistics in approximate Bayesian computation. Available from: https://arxiv.org/abs/1512.05633
Pritchard JK, Stephens M, Donnelly P. 2000. Inference of population structure using multilocus genotype data. Genetics 155(2):945–959. doi:10.1093/genetics/155.2.945
Provine WB. 2020. The origins of theoretical population genetics. Chicago: University of Chicago Press.
Pybus M, et al. 2015. Hierarchical boosting: a machine-learning framework to detect and classify hard selective sweeps in human populations. Bioinformatics 31(24):3946–3952. doi:10.1093/bioinformatics/btv493
Qin X, Chiang CWK, Gaggiotti OE. 2022. Deciphering signatures of natural selection via deep learning. Brief Bioinform. 23:bbac354. doi:10.1093/bib/bbac354
Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M. 2022. Hierarchical text-conditional image generation with CLIP latents. Available from: https://arxiv.org/abs/2204.06125.
Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. 2014. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 10(5):e1004342. doi:10.1371/journal.pgen.1004342
Ronen R, Udpa N, Halperin E, Bafna V. 2013. Learning natural selection from the site frequency spectrum. Genetics 195(1):181–193. doi:10.1534/genetics.113.152587


Rumelhart DE, McClelland JL. 1987. Learning internal representations Suvorov A, Hochuli J, Schrider DR. 2020. Accurate inference of
by error propagation: Cambridge: MIT Press. p. 318–362. tree topologies from multiple sequence alignments using
Sanchez T. 2022. Reconstructing our past deep learning for population deep learning. Syst Biol. 69(2):221–233. doi:10.1093/sysbio/
genetics [theses]. Université Paris-Saclay. Available from: https:// syz060
theses.hal.science/tel-03701132. Teh YW, Hinton GE. 2000. Rate-coded restricted Boltzmann ma­
Sanchez T, et al. 2022. dnadna: a deep learning framework for popu­ chines for face recognition. In: Leen T, Dietterich T, Tresp V, edi­
lation genetics inference. Bioinformatics 39:btac765. doi:10.1093/ tors. Advances in neural information processing systems. Vol.
bioinformatics/btac765 13. MIT Press. Available from: https://proceedings.neurips.cc/
Sanchez T, Caramiaux B, Thiel P, Mackay WE. 2022. Deep learning uncer­ paper/2000/file/c366c2c97d47b02b24c3ecade4c40a01-Paper.
tainty in machine teaching. In: IUI 2022 - 27th Annual Conference on pdf.
Intelligent User Interfaces. Finland: Helsinki/Virtual. Available from: Tejero-Cantero A, et al. 2020. SBI: a toolkit for simulation-based in­
https://hal.archives-ouvertes.fr/hal-03579448. ference. J Open Source Softw. 5(52):2505. doi:10.21105/joss.

Downloaded from https://academic.oup.com/gbe/article/15/2/evad008/6997869 by guest on 27 January 2024


Sanchez T, Cury J, Charpiat G, Jay F. 2021. Deep learning for population size history inference: design, comparison and combination with approximate Bayesian computation. Mol Ecol Resour. 21(8):2645–2660. doi:10.1111/1755-0998.13224
Schmidhuber J. 2014. Deep learning in neural networks: an overview. CoRR. Available from: http://arxiv.org/abs/1404.7828.
Schrider DR, Kern AD. 2016. S/HIC: robust identification of soft and hard sweeps using machine learning. PLoS Genet. 12(3):1–31. doi:10.1371/journal.pgen.1005928
Schrider DR, Kern AD. 2018. Supervised machine learning for population genetics: a new paradigm. Trends Genet. 34(4):301–312. doi:10.1016/j.tig.2017.12.005
Sellis D, Callahan BJ, Petrov DA, Messer PW. 2011. Heterozygote advantage as a natural consequence of adaptation in diploids. Proc Natl Acad Sci U S A. 108(51):20666–20671. doi:10.1073/pnas.1114573108
Shapley LS. 1953. A value for n-person games. In: Kuhn HW, Tucker AW, editors. Contributions to the theory of games II. Princeton (NJ): Princeton University Press. p. 307–317.
Sheehan S, Song YS. 2016. Deep learning for population genetic inference. PLoS Comput Biol. 12(3):1–28. doi:10.1371/journal.pcbi.1004845
Silva M, Pratas D, Pinho AJ. 2020. Efficient DNA sequence compression with neural networks. GigaScience 9(11):giaa119. doi:10.1093/gigascience/giaa119
Simonyan K, Vedaldi A, Zisserman A. 2013. Deep inside convolutional networks: visualising image classification models and saliency maps. Available from: https://arxiv.org/abs/1312.6034.
Smith CCR, Tittes S, Ralph PL, Kern AD. 2022. Dispersal inference from population genetic variation using a convolutional neural network. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/08/26/2022.08.25.505329.
Smolensky P. 1986. Information processing in dynamical systems: foundations of harmony theory. In: Parallel distributed processing: explorations in the microstructure of cognition. Cambridge (MA): MIT Press. p. 194–281.
Soni V, Vos M, Eyre-Walker A. 2022. A new test suggests hundreds of amino acid polymorphisms in humans are subject to balancing selection. PLoS Biol. 20(6):1–27. doi:10.1371/journal.pbio.3001645
Speidel L, Forest M, Shi S, Myers SR. 2019. A method for genome-wide genealogy estimation for thousands of samples. Nat Genet. 51(9):1321–1329. doi:10.1038/s41588-019-0484-x
Strumbelj E, Kononenko I. 2010. An efficient explanation of individual classifications using game theory. J Mach Learn Res. 11:1–18.
Sugden LA, et al. 2018. Localization of adaptive variants in human genomes using averaged one-dependence estimation. Nat Commun. 9(1):703. doi:10.1038/s41467-018-03100-7
Suvorov A, Hochuli J, Schrider DR. 2020. Accurate inference of tree topologies from multiple sequence alignments using deep learning. Syst Biol. 69(2):221–233. doi:10.1093/sysbio/syz060
Teh YW, Hinton GE. 2000. Rate-coded restricted Boltzmann machines for face recognition. In: Leen T, Dietterich T, Tresp V, editors. Advances in neural information processing systems. Vol. 13. MIT Press. Available from: https://proceedings.neurips.cc/paper/2000/file/c366c2c97d47b02b24c3ecade4c40a01-Paper.pdf.
Tejero-Cantero A, et al. 2020. SBI: a toolkit for simulation-based inference. J Open Source Softw. 5(52):2505. doi:10.21105/joss.02505
Thornton KR. 2014. A C++ template library for efficient forward-time population genetic simulation of large populations. Genetics 198(1):157–166. doi:10.1534/genetics.114.165019
Torada L, et al. 2019. ImaGene: a convolutional neural network to quantify natural selection from genomic data. BMC Bioinform. 20(9):337. doi:10.1186/s12859-019-2927-x
Villanea FA, Schraiber JG. 2019. Multiple episodes of interbreeding between Neanderthal and modern humans. Nat Ecol Evol. 3(1):39–44. doi:10.1038/s41559-018-0735-8
Vizzari MT, Benazzo A, Barbujani G, Ghirotto S. 2020. A revised model of anatomically modern human expansions out of Africa through a machine learning approximate Bayesian computation approach. Genes 11(12):1510. doi:10.3390/genes11121510
Voznica J, et al. 2021. Deep learning from phylogenies to uncover the transmission dynamics of epidemics. bioRxiv. Available from: https://www.biorxiv.org/content/early/2021/03/31/2021.03.11.435006.
Wang R, et al. 2018. DeepDNA: a hybrid convolutional and recurrent neural network for compressing human mitochondrial genomes. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE. p. 270–274. doi:10.1109/BIBM.2018.8621140
Wang Z, et al. 2021. Automatic inference of demographic parameters using generative adversarial networks. Mol Ecol Resour. 21(8):2689–2705. doi:10.1111/1755-0998.13386
Whalen S, Schreiber J, Noble WS, Pollard KS. 2022. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet. 23(3):169–181. doi:10.1038/s41576-021-00434-9
Whitehouse LS, Schrider DR. 2022. Timesweeper: accurately identifying selective sweeps using population genomic time series. bioRxiv. Available from: https://www.biorxiv.org/content/early/2022/07/07/2022.07.06.499052.
Xue AT, Schrider DR, Kern AD, Ag1000G Consortium. 2020. Discovery of ongoing selective sweeps within Anopheles mosquito populations using deep learning. Mol Biol Evol. 38:1168–1183. doi:10.1093/molbev/msaa259
Yelmen B, et al. 2021. Creating artificial human genomes using generative neural networks. PLoS Genet. 17(2):1–22. doi:10.1371/journal.pgen.1009303
Yue T, Wang H. 2018. Deep learning for genomics: a concise overview. Available from: https://arxiv.org/abs/1802.00810.
Zou J, et al. 2019. A primer on deep learning in genomics. Nat Genet. 51(1):12–18. doi:10.1038/s41588-018-0295-5

Associate editor: Andrea Betancourt
