1-s2.0-S2001037022004342-main

Computational and Structural Biotechnology Journal 20 (2022) 5490–5499
journal homepage: www.elsevier.com/locate/csbj
Genomic prediction through machine learning and neural networks

for traits with epistasis
Weverton Gomes da Costa a,⇑, Maurício de Oliveira Celeri b, Ivan de Paiva Barbosa c, Gabi Nunes Silva d,
Camila Ferreira Azevedo c, Aluizio Borem c, Moysés Nascimento b, Cosme Damião Cruz a
a
Department of General Biology, Bioinformatics Laboratory, Federal University of Viçosa, Viçosa, MG, Brazil
b
Department of Statistics, Laboratory of Computational Intelligence and Statistical Learning, Federal University of Viçosa – UFV, Viçosa, MG, Brazil
c
Department of Agronomy, Federal University of Viçosa, Viçosa, MG, Brazil
d
Department of Mathematics and Statistics, Federal University of Rondônia, Ji-Paraná Campus, RO, Brazil
a r t i c l e i n f o a b s t r a c t
Article history: Genomic wide selection (GWS) is one contributions of molecular genetics to breeding. Machine learning
Received 12 April 2022 (ML) and artificial neural networks (ANN) methods are non-parameterized and can develop more accu-
Received in revised form 20 September rate and parsimonious models for GWS analysis. Multivariate Adaptive Regression Splines (MARS) is con-
2022
sidered one of the most flexible ML methods, automatically modeling nonlinearities and interactions of
Accepted 20 September 2022
Available online 24 September 2022
the predictor variables. This study aimed to evaluate and compare methods based on ANN, ML, including
MARS, and G-BLUP through GWS. An F2 population formed by 1000 individuals and genotyped for 4010
SNP markers and twelve traits from a model considering epistatic effect, with QTL numbers ranging from
Keywords: 2
Genome wide selection
eight to 480 and heritability (h ) of 0.3, 0.5 or 0.8 were simulated. Variation in heritability and number of
Quantitative trait locus QTL impacts the performance of methods. About quantitative traits (40, 80, 120, 240, and 480 QTLs) was
Non-additive effects observed highest R2 to Radial Base Network (RBF) and G-BLUP, followed by Random Forest (RF), Bagging
2
Multivariate adaptive regression splines (BA), and Boosting (BO). RF and BA also showed better results for traits to h of 0.3 with R2 values 16.51%
Genome-enabled prediction and 16.30%, respectively, while MARS methods showed better results for oligogenic traits with R2 values
2 2
ranging from 39,12 % to 43,20 % in h of 0.5 and from 59.92% to 78,56% in h of 0.8. Non-additive MARS
methods also showed high R2 for traits with high heritability and 240 QTLs or more. ANN and ML meth-
ods are powerful tools to predict genetic values in traits with epistatic effect, for different degrees of her-
itability and QTL numbers.
Ó 2022 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Bio-
technology. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/
licenses/by-nc-nd/4.0/).
1. Introduction the time and resources allocated to the development of a new cul-
tivar [2,8,9].
Genomic wide selection (GWS), proposed by [1], has become Genome-based prediction is influenced by several factors, such
one of the main contributions of molecular genetics to breeding. as the predictive ability of the methods, the complexity of the
The GWS approach increased the accuracy in the prediction of trait’s genetic architecture due to non-additive effects (dominance
breeding values, making the selection of elite genotypes more effi- and epistasis), number of phenotypic observations and markers
cient and accurate [2]. Furthermore, GWS made it possible to accel- used [2]. Increasingly, researchers are turning to machine learning
erate the improvement process by half the time in relevant crops, and neural network techniques, which have built-in predictor
helping to sustain current food demands [3–5].This time reduction selection capabilities and are unparameterized to develop more
allowed breeders to maximize genetic gains per unit of time, in accurate and parsimonious models [10]. Furthermore, some of
addition to early selection [6,7]. All these benefits are due to the these methods allow identifying interactions between markers.
direct use of DNA information in the selection of individuals, asso- This property allows great flexibility to deal with different types
ciating marker information with phenotypic information, reducing of traits with gene control with additive, dominant and epistatic
effects [5,9,11,12].
Among the various methods based on machine learning, Multi-
variate Adaptive Regression Splines (MARS) is considered one of
⇑ Corresponding author.
the most flexible [13], it proves to be more parsimonious and per-
E-mail address: wevertonufv@gmail.com (W.G.d. Costa).
https://doi.org/10.1016/j.csbj.2022.09.029
2001-0370/Ó 2022 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
W.G. da Costa, M. de O. Celeri, I. de P. Barbosa et al. Computational and Structural Biotechnology Journal 20 (2022) 5490–5499
forms better than artificial neural networks for genomic prediction an approximate distance between them, within the first 8 linkage
in some studies [10,14]. MARS produces continuous models that groups (Supplementary Fig. 1).
can have multiple partitions, automatically models nonlinearities, The total phenotypic values expressed by a given individual for
and contemplates interactions of predictor variables using adap- traits C1 to C18 were simulated according with [21–23] consider-
tively selected spline functions [15–17]. ing the mean equal to 100 and coefficient of variation equal to 12 %,
In the genetic context, MARS can be able to adjust the genetic with a dominance level (di) equal to 0.5 and by a model with epi-
architecture of the trait and can also detect interactions, such as static effect according to the following equation:
epistasis, and can be used to define the type of trait control. Thus, X X
the inheritance mode of the markers and their interactions can also Y ij ¼ l þ jaji þ jaji ajiþ1 þ ei
be determined automatically, therefore, the number of parameters
where Y i is the phenotypic value for observation i; l is the general
in the modeling can be drastically reduced [14]. Although several
mean; aj is the effect of the favorable allele at locus j of individual i,
studies have proven the high power of MARS in the evaluation of
that is, it assumes the values u þ ai ,u þ di and u ai for the geno-
genomic data in the medical field [14,18–20], it is not known
typic values associated with classes AA, Aa and aa, respectively,
whether more advanced machine learning, such as MARS, offer
with u being the mean between the dominant homozygote (AA)
superior performance over traditional statistical methods for
and the recessive homozygote (aa). Classes were identified by cod-
genetic improvement. In this sense, the objectives of this study
ing 1, 0 or 1, respectively; aji ajiþ1 represents the interaction
were: (i) to evaluate the general accuracy and the variability of
between favorable alleles at different loci. The variance structure
the prediction performance of methods based on machine learning,
of the residues was given by [24] e N ð0; V e Þ, where
including MARS, and neural networks in genomic prediction ana-
2 2
lyzes for simulated traits for different numbers of genes in the V e ¼ 1 h V g =h , where V e is the residual variance, V g is the
presence of dominance and epistasis and with different degrees 2
genotypic variance, and h the heritability.
of heritability and (ii) to compare the results obtained with G-
BLUP in different scenarios.
2.3. Prediction of breeding values
2. Material and methods The genomic breeding values (GEBVs) were predicted using
methods based on statistical approaches, represented by G-BLUP,
2.1. Simulation of population genome on neural network approaches, represented by the Multilayer Per-
ceptron Network (MLP) and Radial Basis Function Network (RBF)
To simulate the data, an F2 population of a diploid species and on learning approaches from Multivariate machine Adaptive
(2n ¼ 2x ¼ 20) with an effective size of 1000 individuals was taken Regression Splines (MARS), Decision Tree (DT), Boosting (BO), Bag-
as reference. The genotypic constitution of each individual was ging (BA) and Random Forest (RF).
established considering the information in the genome and the Neural network approaches are often treated as machine learn-
random union of gametes from the parents, assuming a gametic ing [12,22,25]. However, each approach has its specificity and here
pool of 5000 reproductive units, per parent, in each fertilization. they will be considered as different approaches. As neural net-
The population was generated using divergent parental lines, i.e., works work like the human brain, they are composed of neurons
contrasting homozygous parents (P1 dominant and P2 recessive), organized in layers that capture all available information to gener-
with a genome established considering 10 linkage groups with a ate a decision-making criterion, they differ from machine learning
size of 200 cM each. To provide linkage disequilibrium between methods, which model the limitations of data separation with
markers, the percentage of recombination was equivalent to a dis- based on the learning decision rules on the input characteristics
tance between loci of 0.5 cM. The genome was generated with a of the model [26].
saturation level of 401 equidistantly spaced molecular markers in
each linkage group, resulting in a total of 4010 molecular markers 2.4. Data analysis
in the genome. Markers were codominant (SNPs - Single Nucleo-
tide Polymorphism), allowing the identification of heterozygous For all methods, the input data was a matrix of molecular mark-
individuals. ers, represented by the genotypic values encoded in 1, 0 and 1,
simulated for 4010 markers and 1000 individuals. The methods
2.2. Simulation and constitution of phenotypic values returned in the output a vector with the GEBV for each individual.
For comparison, the methods were grouped according to their
From the simulated genotypic data of the F2 population, 18 respective learning approach: G-BLUP – G-BLUP; MLP and RBF –
traits with numbers of controlling genes ranging from 8 to 480 NETWORK; DT, BA, BO and RF – TREES; and MARS 1, MARS 2 and
and heritability of 0.3, 0.5 or 0.8 were simulated (Table 1). The con- MARS 3 – MARS.
trolling genes (QTL - Quantitative Trait Locus) were distributed
equally among the first 8 linkage groups (Supplementary Fig. 1). 2.4.1. Multivariate Adaptive regression Splines (MARS)
Eight QTL controlled for C1, C7 and C13 traits, defined by the The algorithm proposed by [27] Multivariate Adaptive Regres-
central markers of the first eight linkage groups. For traits C2 to sion Splines (MARS), considers an expansion in piecewise linear
C6, C8 to C12, and C14 to C18 the QTL were distributed keeping functions, called basis functions (BFs), as follows:
Table 1
2
Number of controlling loci and heritability (h ) of the 12 simulated traits (C1 to C18).
h
2 Number of controlling loci
8 40 80 120 240 480
0,3 C1 C2 C3 C4 C5 C6
0,5 C7 C8 C9 C10 C11 C12
0,8 C13 C14 C15 C16 C17 C18
5491

x t; sex > t; t x; sex < t; y ¼ Xb þ Zua þ Zud þ Zuepi þ e
ðx tÞþ ¼ ; ð t xÞ þ ¼
0; ifotherwise: 0; ifotherwise: where y is the vector of phenotypic observations; b is the vector
Each function is a piecewise linear spline, with a node at the of fixed effects (in this study, the general mean) with incidence
value t. These two BFs are called a reflexive pair. MARS forms matrix X; ua , ud e uepi are vectors of genetic values of additive, dom-
reflexive pairs for each input (marker) X j , with nodes at each inant and epistatic effects, respectively; Z is the incidence matrix
observed value xij of that input. The model building strategy is like for these vectors; and e is the random error vector. The variance

a progressive linear regression, but instead of using the original structure was given by ua N 0; Ga r2ua ); ud N 0; Gd r2ud );
inputs, we used functions from the set
n o uepi N 0; Gepi r2uepi ) and e N 0; Ir2e , where Ga , Gd and Gepi are the
C ¼ Xj t þ; t Xj þ and/or its
t 2fx1j ; x2j ; ...; xNj g j¼1;2;...; p genomic relationship matrices for the additive, dominant and epi-
products. The MARS model, which is a linear combination of the static effects, respectively , and r2ua , r2ud and r2uepi are the additive,
BFs and/or their interactions, is given by [28]:
dominance and epistatic variances, respectively.
XM For the construction of the genomic relationship matrices (W
f ðX Þ ¼ b0 þ b hm ðX Þ
m¼1 m and S) used in the model, M_ij was considered to be the incidence
where b0 is the regression constant, bm with m = 1, 2, . . ., M, are of the number of alleles of brand j of individual i and pj the fre-
the regression coefficients, and hm ðX Þ is a function in C, or a pro- quency of the dominant allele A in brand j. In this way, the W
duct of two or more functions. and S matrices were given by [31]:
The estimation process of the parameters b0 and bm is based on 8
the minimization of the residual sum of squares. First, the forward < 2 2pj ; ifMij ¼ AA
>
phase is performed on the training data, initially starting to build W ij ¼ 1 2pj ; ifM ij ¼ Aa ; and
>
:
the model only with the constant function h0 ðX Þ ¼ 1, and all func- 0 2pj ; ifMij ¼ aa
tions in the C set are candidate functions. At each subsequent step, 8 2
the base pair that produces the maximum reduction in training >
< 2 1 pj ; ifM ij ¼ AA
>
error is added. Considering a model with basic M functions, the
Sij ¼ 2pj 1 pj ; ifMij ¼ Aa
next pair to be added to the model is [28]: >
>
: 2p2 ; ifM ¼ aa
j ij
b
b Mþ1 hl ðX Þ X j t þ þ b
b Mþ2 hl ðX Þ t X j þ ; hl 2 M
In this way, we obtain:
where b b
b Mþ1 and b Mþ2 are coefficients estimated by the least WW0 SS0
square method, together with all other M þ 1 coefficients in the Ga ¼ Pn ; Gd ¼ P 2 ; Gepi ¼ Ga #Ga
2p 1 p n
model. This process of adding BFs continues until the model j¼1 j j j¼1 2pj 1 pj
reaches a predetermined maximum number, often leading to a Where # is the Hadamard product operator.
purposefully oversized model [29]. The mixed model equations for the full model were given by
The backward phase improves the model by removing the least [24]:
significant terms until finding the best submodel. The model sub- 2 32
X0X X0Z X0Z X0Z 3 2 0 3
sets are compared using the generalized cross-validation (GCV) b
b Xy
6 0 7
method. The GCV is the root-mean-square residual error divided 6ZX Z 0 Z þ Ga 1 k1 X0Z X0Z 766 b
u
7 6 Z0 y 7
7 6 7
6 76 a
7¼6 7
by a penalty that depends on the complexity of the model [29]. 6 0 Z 0 Z þ Gd 1 k2 X 0 Z 7 b d 5 4 Z0 y 5
The GCV is calculated as [28]:
4ZX Z0 Z 54 u
Z0 X Z0 Z X 0 Z Z 0 Z þ Gepi 1 k3 bi
u Z0 y
PN h i2
1
N i¼1 yi bf k ðxi Þ r2 r2 r2
GCV ðkÞ ¼ h i2 were k1 ¼ r2e , k2 ¼ r2e and k3 ¼ r2 e and the variances were esti-
ua ud uepi
1 C ðNMÞ mated by the Restricted Maximum Likelihood Method (REML).
where M is the effective number of model parameters, C ðM Þ is a
2.4.3. Multilayer Perceptron neural Network (MLP)
cost function for each basis function included in the developed
The Levenberg-Marquardt backpropagation training algorithm
submodel, which by default is adopted by default value of 3 [27],
was used for the Multilayer Perceptron Neural Network (MLP). Pre-
N is the number of datasets used in cross-validation and b f ðxi Þ k liminary tests were performed with different architectures, being
denotes the predicted MARS values. represented by 1 layer and the number of neurons varying from
To identify the possible interaction between the QTLs, MARS 5 to 15, to choose the best topology to be used. The linear activa-
models with degrees equal to 1, 2, and 3 were used, with the model tion function (purelin) was used.
with degree 1 considered an additive model and the others non- The linear function for the nth neuron of the output layer of an
additive, which allow interactions between markers. For the stop- MLP was represented by:
ping criterion of the forward phase, the maximum number of Xq
terms in the adopted model was equal to 200, as the default of yri ¼ p x0 w0 þ f ðxi Þwj
j¼1 xj
the ‘‘earth” package of R. A preliminary analysis was carried out
for the second stopping criterion [30], in which incrementing a where: p is a linear activation function, x0 is the bias term of the
term in the model would change the coefficient of determination nth neuron, xi is the i-th input, wj is the synaptic weights to be
from less than 0.001 (default) to 0.05, choosing the best model that adjusted and f xj ðxi Þ is the value coming from the layer hidden for
presented the highest selective accuracy (R2 ) for the validation set. each input i, assigned to an activation function. The activation
function used in this work was the linear one (f xj ðxi Þ ¼ xi ).
2.4.2. Genomic BLUP (G-BLUP)
An epistatic model, including dominance and additive effects, 2.4.4. Neural Network Radial Base function (RBF)
for the REML/G-BLUP method was used according to the following The Radial Base Function Neural Network (RBF) uses a feedfor-
expression: ward architecture. This model also consists of an input layer, a hid-
5492
den layer, and an output layer. RBF training is hybrid (supervised 2.4.7. Random Forest
and unsupervised) and the input layer information goes through Random Forest (RF) [35] is similar to BA in that bootstrap sam-
a linear k-means cluster [11]. The hidden layer applies a non- ples are used to build multiple trees, the difference being that each
linear transformation of the input space to a high-dimensional hid- tree is established with a random subset of predictors. The number
den space with a Gaussian function. The output layer applies a of predictors used to find the best split at each node is a subset that
transformation to the hidden space, providing an output vector was chosen by m ¼ v3, with v being the total number of predictors.
for the network. The RBF optimization training included: the The number of trees for the RF was set at 500. For the RF, the trees
weights between the hidden layer and the output layer, the activa- grow to their maximum size without pruning, and the aggregation
tion function, the center of activation functions, the distribution of is done by averaging the trees [25].
the center of activation functions, and the number of hidden neurons
[11]. During the training process, only the weights between the hid-
2.4.8. Boosting
den layer and the output layer are modified [9]. To select the best
Boosting (BO) creates trees sequentially using information from
RBF architecture, according to the MLP, preliminary tests were car-
previous trees [32]. In this sense, BO is an approach repeatedly
ried out. The number of neurons ranged from 5 to 50 and radius size
trained on the same sample so that at each iteration, a measure
from 30 to 50. The mean square error was set to 0.05.
of prediction error is calculated for each marker, and in the next
The linear function for the nth neuron of the output layer of an
iteration, markers with higher errors receive greater weight in
RBF was represented by:
the model training. The prediction is performed by weighting the
Xq
results of the set of all regression trees [36]. The number of trees
yri ¼ g x0 w0 þ f xj ð x i Þwj
j¼1 sampled was 500, with a learning rate of 0.01 and a depth of 2.
where: g is a linear function, x0 is the bias term of the nth neu- The following model was used to adjust the BO [28]:
ron, xi is the i-th input, wj is the synaptic weights to be adjusted, XM
and f xj ðxi Þ is the value coming from the hidden layer for each input f ð xÞ ¼ b bðx;
m¼1 m
cm Þ
i, assigned to the Gaussian activation function, which is given by Where bm , m = 1, 2, . . ., M are the coefficients of base expansion
and bðx; cm Þ are simple functions of the multivariate argument x,
2
the equation:eð2uc Þ
r2 , where c is the center of the Gaussian function,
r is the variance of the Gaussian function and i is the value of the
2 with a set of parameters c ¼ c1 ; c2 ; ; cm .
individual’s input, which represents the activation potential of the
clustering phase. 2.5. Efficiency parameters
2.4.5. Decision tree To evaluate the efficiency of the techniques, the selective accu-
The decision tree structure was based on a regression tree, cre- racy was used, which is measured by the square of the correlation
ated from the search for the tree that would lead to the data parti- (R2) between the estimated values - GEBVs ( b y ) and the real values
tion until the formation of homogeneous groups was obtained. To (y), and the root means square error (RMSE), which expresses the
perform recursive binary division, first is the marker X j and the predictive accuracy. The selective and predictive accuracies were
cutoff point s so that the division of the predictor space into the given respectively by the following equations: R2 ¼ ðcorðy ^; yÞÞ2
regions xjxj < s e xjxj s leads to the greatest possible reduc- qffiffiffiffiffiffiffiffiffiffiffiffiffi
2
andRMSE ¼ Rðyy Þ
.
tion in RSS. That is, we consider all markers x1 ; ; X m and all pos- n
sible values of the cutoff s for each of the markers, and then choose
the marker and cutpoint so that the resulting tree has the smallest 2.6. Training and validation
RSS. The equation that reflects the binary division is [32]:
For the training and validation of the techniques used, cross-
R1 ð j; sÞ ¼ XjX j < s eR2 ð j; sÞ ¼ XjX j s ;
validation (k-fold) was performed with k = 5 partitions [37]. In
and then we look for the value of J and S that minimize the each of the five rounds, four of these subsets constituted the train-
equation: ing population (80 % 800 individuals), and the remaining subset
2 2 constituted the validation population (20–200 individuals). The
X X
yi b
y R1 þ yi b
y R2 techniques were compared based on the arithmetic mean of the
i:xi 2R1 ð j;sÞ i:xi 2R2 ð j;sÞ five performance estimates of the validation sets.
where: b y R1 is the average of the response variable of the training

2.7. Computational aspects
observations belonging to the region R1 ð j; sÞ, by R2 is the average of
the response variable of the training observations belonging to Population simulations were performed using the GENES soft-
the region R2 ð j; sÞ and yi is the true value of the traits of each ware [38]. The G-BLUP, DT, BA, RF, BO, and MARS methods were
individual. performed with the GENES software integrated with the R software
[39,40]. The MLP and RBF methods were performed by the GENES
2.4.6. Bagging software integrated with the MATLAB software [39,41].
The Bagging (BA) method creates several similar datasets by
resampling (bootstrapping) to obtain an average of several regres-
3. Results
sion trees that are performed without pruning for each dataset
[33,34]. Thus, a number B of models are obtained:
The phenotypic and genotypes values of the 18 simulated traits
b
f 1 ðxÞ; b
f 2 ðxÞ; ; b
f B ðxÞ. These generated models are used to obtain
P are shown in Fig. 1. The low variations can be explained by the 12%
an average model, given by: b f medio ðxÞ ¼ 1B Bb¼1 b
f b ðxÞ. The number coefficient assigned in the simulation and the mean was very close
of trees sampled for BA was set at 500 trees. to 100, as defined in the simulation.
5493
Fig. 1. Boxplot of the genetic and phenotypic values of the 18 simulated traits, considering a coefficient of variation equal to 12% and mean to 100. The specification of each
characteristic is represented in Table 1.
The selective accuracy (R2) of the prediction of breeding values with 240 QTL (Fig. 1). In addition, these methods presented values
for all methods was higher in scenarios with higher heritability of R2 greater than the overall mean of R2 in all scenarios (Fig. 1).
(Fig. 1). On the other hand, the variation in the number of QTL The BO method presented the best result when the scenarios were
showed that the methods have diversity among the results of greater heritability and 40 or more QTL. BO was also the method
obtained, indicating that the number of QTL of the traits directly that showed the greatest sensitivity to heritability and showed a
influences the prediction of GEBVs according to the method used substantial improvement in results in higher heritability scenarios.
and that the increase in the number of QTL is harmful to the MARS The predictive accuracy results (REQM), referring to the error in
approach and the DT method, while for the other methods the the prediction of the GEBVs of the individuals, were always smaller
increase in the number of QTL reflects an improvement in R2 . according to the increase in the number of QTL, that is, the greater
For both heritability scenarios, the methods based on machine the number of QTL, the lower the error in the prediction of the
learning, MARS and Trees, presented higher values of R2 for the GEBVs of the individuals, regardless of the method used (Fig. 2).
traits with the lowest number of QTL, when compared to the other In this case, the impact caused by the increase in the number of
methods (Fig. 1). The effect of the interaction between markers QTLs on the prediction error of GEBVs is greater than the change
was even more evident for higher heritability (80 %), resulting in in heritability and is inversely proportional. This result was possi-
higher R2 values for the non-additive MARS models (MARS 2 and ble due to the fixed number of markers, providing a greater propor-
3). tion of direct effects of markers on traits in relation to those poorly
From the scenarios with 40 QTL, an increase was observed for correlated with the phenotype and without direct effect.
For scenarios with 8 QTLs, the trees (Fig. 3) showed better
the values of R2 as the number of QTL increased, except for MARS
results. It was also observed that the higher the increase in the
and DT, reaching values close to those of the real genetic variation
number of QTLs, the lower the difference between the methods
when the trait presents 480 QTL. In these scenarios, the G-BLUP
for RMSE. In the largest QTLs scenarios, DT had higher RMSE values
and RBF methods, followed by RF and BA, presented the highest
in most scenarios.
values for R2 and always above the general average (red line) for
The RBF method presented very similar RMSE values when
the traits for both heritability scenarios (Fig. 1). For scenarios with
compared to those obtained through G-BLUP for all scenarios
80 % heritability and 40 or more QTL, the MLP and BO methods also
(Fig. 2). From 40 QTL, these methods presented the lowest values
deserve to be highlighted.
for RMSE. Similar values of these methods were obtained by the
Despite presenting lower values for R2 compared to other meth- non-additive MARS (MARS 2 and MARS 3) for the scenarios with
ods, the predictive power of MARS methods for traits with many 240 and 480 QTL and heritability of 80 %.
QTLs cannot be neglected, especially when there is a very high
number of QTLs, such as 240 and 480 genes, the non-additive
MARS methods (MARS 2 and 3) showed high R2 values (above 4. Discussion
60 %), and considering the standard error, values lower only than
G-BLUP (Fig. 1). It is worth mentioning that for these scenarios, The inclusion of a greater number of marker variables in a pre-
MARS had high predictive potential, explaining almost all the dictive model can be useful to obtain a better performance, but it
genetic variations of these traits. can lead to the addition of redundant information and make it dif-
Methods based on MARS and regression trees did not obtain a ficult to apply in practice [20]. Furthermore, in hybrid populations,
linear response as a function of increasing the number of QTL. On non-additive effects, i.e., dominance, and epistasis, are highly rele-
the other hand, both methods based on neural networks and G- vant and should also be considered [42]. In this sense, methods
BLUP showed a substantial improvement the higher the QTL num- that deal with high dimensionality and take into account possible
ber (Fig. 1). With the exception of DT, which presented lower R2 interactions between predictor variables have important traits.
values in almost all scenarios, the tree-based methods presented Many recent studies have been applied to GWS and have shown
R2 values close to the simulated heritability, mainly for scenarios that machine learning and neural network methods can perform
5494
Fig. 2. Average results of selective accuracy (R2 ) as a function of the number of genes and heritability for the families of the methods: Trees [Bagging (BA), Boosting (BO),
Decision Tree (DT); and Random Forest (RF)]; Network (Multilayer Perceptron Network (MLP) and Radial Base Function Network (RBF) MARS (MARS 1, 2 and 3); and G-BLUP.
The red dashed line refers to the overall mean value of the selective accuracy (R2 ) between all methods for comparison purposes. (For interpretation of the references to
colour in this figure legend, the reader is referred to the web version of this article.). (For interpretation of the references to colour in this figure legend, the reader is referred
to the web version of this article.)
better or similarly in predicting genotypic data phenotypes com- As MARS can simultaneously include multiple terms (additive
pared to statistical methods [9,12,22,43–46]. and epistatic effects) in a model [50] genetic interactions can be
However, there is still a gap to be filled on which methods may better evaluated. Apparently, this fact could lead to a better predic-
be preferable to perform the prediction when it comes to different tion for GWS, since, in this way, it would be possible to reduce the
degrees of heritability and QTL number, including when consider- residual variance of the model, by capturing information that
ing epistatic effects. The variability in the results for the different before was isolated only residual component, such as the effects
methods suggests that any method is prone to produce a differen- of the interaction between markers. However, as [14] explain,
tiated result under some type of data perturbations. The results some patterns of interaction tend to be less pronounced to be
obtained were able to demonstrate the strong effect of heritability detected in features with a high number of QTL. Thus, methods
and increase in QTL number on R2 and RMSE values. Several stud- based on recursive partitioning, such as MARS and Trees, benefit
ies have shown that there is a favorable effect of heritability on from situations in which the predictor variables can be partitioned
selective accuracy [9,22,36,47,48], as also obtained in this study. into well-defined regions [12], as is the case with features with
This is justified by the greater genetic variation in higher heritabil- lower QTL numbers (oligogenic). This is because traits controlled
ity’s and, consequently, less environmental effect, contributing to by few genes have well-defined phenotypic classes and suffer little
more accurate predictions of marker effects [22]. or no environmental influence [51].
The results showed that there was a reduction in the RMSE in These results show that MARS is an alternative to be used, espe-
scenarios with a greater number of QTL, and that, in the same cially when it is easier to identify groups of individuals based on
way, there was also a reduction in the RMSE in the scenarios with the population genome. Due to the identification of markers and/
greater heritability, however in a smaller proportion. Results sim- or interactions between markers of greater effect, MARS proved
ilar to those obtained in this study were observed by [36], in sce- to be more efficient when the multiplicative effects of the control-
narios with heritability ranging from 0.1 to 0.5 and 100 QTL, and ling genes (epistasis) may be more important, since, for traits with
[22] evaluating scenarios with QTL numbers ranging from 2 to 88 lower QTL numbers, the multiplicative effects control genes (epis-
and heritability from 0.3 to 0.8. The reduction in RMSE due to tasis) may be of greater magnitude in proportional terms, as the
the increase in the number of QTLs may have occurred due to individual effect of each gene is greater than in traits controlled
the lower influence of the multiplicative effect between the addi- by a greater number of QTL [22]. This is a direct result of its mod-
tive and dominant effects that characterize epistatic effects in eling philosophy, which tries to approximate a (possibly higher-
more complex traits [22,36,49]. order) function with a set of basic functions that are locally
5495
Fig. 3. Average results of predictive accuracy (RMSE) as a function of the number of genes and heritability for the families of the methods: Trees [Bagging (BA), Boosting (BO),
Regression Tree (DT); and Random Forest (RF)]; Network (Multilayer Perceptron Network (MLP) and Radial Base Function Network (RBF) MARS (MARS 1, 2 and 3) and G-
BLUP. The red dashed line refers to the overall mean value of predictive accuracy (RMSE)) between all methods for comparison purposes. (For interpretation of the references
to colour in this figure legend, the reader is referred to the web version of this article.). (For interpretation of the references to colour in this figure legend, the reader is
referred to the web version of this article.)
lower-order, so it has more power and flexibility to model relation- [27]. This explains why MARS did not perform well in genomic
ships that are almost additive or involve interactions in at most a regions where strong genetic interactions are present, such as for
few variables [27]. traits from 40 to 120 QTL. However, for scenarios where these
The frame is different when there is a large number of predictor effects are diluted, high QTL number (250 to 480), MARS has high
variables with high correlation and response trait has high genetic predictive potential and lower model variance. Thus, it is notable
(dominance and epistasis effect) and environmental noise. When that the excellent results obtained for the non-additive MARS show
the trait involves a greater number of QTL, there is a greater chance that this approach should be considered for GWS.
that a marker explains in genetic terms the same variation of The greater number of genotypic classes in scenarios with a
another marker, in addition to providing a greater action of the greater number of QTL reduces the representativeness of each
environmental effect, thus impairing the prediction efficiency. genotypic combination in the training set and overparameteriza-
The excess of markers associated with a reduced number of geno- tion of the model [22]. In this context, it was these restrictions that
typic observations can also lead to multicollinearity problems [12]. led these algorithms to not present such satisfactory results when
As [27] points out, MARS is not particularly robust against corre- the trait is polygenic, mainly DT. Low DT efficiency was also
lated inputs and relies heavily on data to infer the process model, observed by [12] to predict the genetic values for rust incidence
in these cases, MARS loses explanatory power. Thus, analyses in Coffea arabica and by [22] for simulated features with epistatic
should use an optimal set of informative SNPs, according to expec- effects with 16 or more QTL.
tations regarding the number of QTLs, to adopt the best analytical The approaches based on decision trees (BA, BO, and RF)
strategy, maximizing predictive accuracy estimates [12,22]. In showed excellent results regarding the accuracy of the GEBVs pre-
addition to MARS, another method susceptible to multicollinearity diction for traits with many QTL. A differential of the BA and RF
vulnerability is DT. If two predictors are highly collinear, MARS or approaches is the resampling of the original data in sub-samples
DT has to make an arbitrary knot or split selection that minimizes (bootstrap) to perform the prediction according to a number of
the residual sum of squares, this can profoundly affect all subse- determined trees. This resampling of data brings concrete benefits
quent selections and final predictions [52]. for prediction in these cases, allowing for the easy evaluation of
Also, more generally, recursive partitioning methods have diffi- poorly predicted samples and possible discrepancies [53]. BA ana-
culties when the dominant interactions involve a small fraction of lyzes its main effects on variance and can make forecasting more
the total number of variables, so one cannot discern whether the robust by decreasing the variance lead time and RF not only com-
approximation function approximates a simple one, such as linear bines a large number of decision trees to reduce forecast variance
or additive, or if it involves complex interactions between variables like BA but also decreases dependency between decision trees by
5496
projecting random features to obtain a much smaller prediction applied to Genomic Wide Association Studies (GWAS). MARS also
error [54]. As a result, these methods perform better to maximize has several ways of improvement that can improve the predictability
prediction in a target population, suggesting that bootstrapping of the traits and that must be tested, mainly the change in gamma,
can be performed by other methods to achieve better prediction where the model becomes more flexible to detect close variables
results. As it is a gradient-enhancing algorithm that has a learning for inclusion in the model for the forward phase.
rate, the BO method combines individually weak predictors to pro- In general, as the number of QTLs increases, the total genetic
duce a strong classifier [32], thus allowing a better prediction of variation is expected to be divided among the QTLs, which can
the genetic effects of individuals, as observed in this study. reduce the efficiency of methods to estimate small QTL effects
Neural network approaches, as used in this study, MLP and RBF, and lead to a loss of precision [36,67]. This is confirmed only for
apparently, are not affected by correlated inputs. The MLP and RBF traits that present stronger effects between interactions in the
methods are defended for being efficient in capturing nonlinear same linkage group, such as for traits with 40 QTL, since, as they
effects, in this case, provided by interallelic interactions [22]. Both have a smaller number of QTL in a single linkage group, the expres-
RBF and MLP were harmed by the excess of ineffective markers, sion of interactions between these QTL is stronger. On the other
showing lower performance compared to other methods, as also hand, the increase in efficiency for a greater number of QTL can
found by [22] and [55]. However, neural networks were efficient be attributed to the excess of markers with null effects, which
in predicting traits with many QTL, especially when the phenotypic can impair the accuracy of the methods [12,22].
value of the trait was mostly due to genetic value. Each technique has its specificity and must be evaluated in a
G-BLUP considers the interaction between marker pairs and wide set of data so that the decision on which method to base is
relies on the DL between SNPs and QTL, moreover, when QTL are correctly made [2]. It is rare that more than one technique is used
in strong LD and the use of an unweighted genomic relationship when performing GWS analyses, but these results align with the
matrix in G-BLUP can cause upward bias in the heritability esti- view that evaluating multiple methods is a useful strategy to
mate [56–59]. However, if only a few markers are important, the ensure that uncertainty in data is considered from multiple angles.
technique is hampered by this bias, as confirmed in this study.
On the other hand, the G-BLUP was highlighted in the performance
5. Conclusions
of the prediction of the GEBVs, presenting very similar results to
the Family Network methods for the traits with more QTL. Results
MARS ability to simplify complex relationships is quite perti-
similar to those found in this study were found in [22,60]. How-
nent to GWS, as most traits of interest in plant breeding are
ever, some markers are more informative for some traits than
affected by complex interactions of biological, environmental,
others, this increase in the amount of information using the geno-
and management conditions.
mic matrix G (genomic relationship matrix) can sometimes lead to
Non-additive MARS is better for predicting breeding values than
better and more accurate estimates and predictions [12]. These
additive MARS in the scenarios evaluated. The additive and non-
results corroborate other studies where the GBLUP precision
additive MARS methods showed superior results in the prediction
increased for characters with a high proportion of non-additive
of genetic values in characters with dominant and epistatic effects
variation and when with increasing heritability [61–63] and justi-
for scenarios with eight QTL in relation to G-BLUP methods, neural
fied due to G-BLUP principle that genomic predictions are based on
networks, and other machine learning methods.
the relatedness derived from all markers [60], so when more mark-
The use of different statistical methods, neural networks, and
ers have a genetic effect the prediction accuracy increases.
machine learning, such as MARS, to estimate genetic values
Although MARS performs the selection of SNPs, eliminating a
resulted in different consequences influenced by the complexity
large number of markers, the performance of this method showed
and particularity of the analyzed traits. Therefore, it is recom-
a greater difference for the RBF, MLP, and G-BLUP methods, which
mended that when evaluating the prediction of genetic values,
consider all markers, and BA, RF, and BO in the scenarios between
the use of multiple approaches is used, in order to choose the best
40 and 280 QTL. This can be explained by the fact that the F2 pop-
method to be used.
ulation has a high rate of linkage disequilibrium (LD), due to the
combination process. This LD can then cause false-positive signals
for some loci, which have no connection with the studied trait in CRediT authorship contribution statement
question [59]. So, the SNPs closest to a QTL are not sampled often
enough and the QTL signal may be captured by more distant SNPs, Weverton Gomes da Costa: Conceptualization, Methodology,
consequently, the signal from a QTL to MARS may be blurred com- Formal analysis, Data curation, Investigation, Software, Supervi-
pared to other methods. Alternatively, use of hybrid modeling sion, Validation, Visualization, Writing – original draft, Writing –
schemes (combination of two or more methodologies) including review & editing. Mauricio de Oliveira Celeri: Methodology, Inves-
MARS had been previously very effective with initial data cluster- tigation, Visualization, Writing – original draft, Writing – review &
ing using c-means or principal components [64–66], it could be editing. Ivan de Paiva Barbosa: Conceptualization, Methodology,
more important for diverse population for genomic predictions. Visualization, Writing – original draft. Gabi Nunes Silva: Visualiza-
For example, studies such as the one by [43] proposed hybrid tion, Writing – original draft. Camila Ferreira Azevedo: Visualiza-
smart modeling schemes for heart disease classification using com- tion, Writing – original draf. Aluizio Borem: Visualization, Writing
bined MARS-ANN. This would be a viable alternative to improve – original draft. Moysés Nascimento: Investigation , Visualization,
predictability and decrease the effect of multicollinearity using Writing – original draft, Project administration, Supervision, Writ-
MARS on genomic data, which is worthy of further investigation ing - Review & Editin. Cosme Damião Cruz: Investigation, Visual-
and deserves further research. ization, Writing – original draft, Project administration,
The main limitation of the additive MARS is that the model is con- Supervision.
strained to be additive. With many variables, important interactions
can be missed. On the other hand, as the model is additive, we can Declaration of Competing Interest
examine the effect of each marker on the prediction of GEBVs indi-
vidually. Furthermore, the model can be represented in a way that The authors declare that they have no known competing finan-
separately identifies additive contributions and those associated with cial interests or personal relationships that could have appeared
different multivariate interactions, being useful for future studies to influence the work reported in this paper.
5497
Acknowledgments [20] Tang J, Liu R, Zhang YL, Liu MZ, Hu YF, Shao MJ, et al. Application of Machine-
Learning Models to Predict Tacrolimus Stable Dose in Renal Transplant
Recipients. Sci Rep 2017;7. https://doi.org/10.1038/srep42192.
The authors are grateful for the financial support of the Coordi- [21] Cruz CD. Programa GENES – Análise Multivariada e Simulação. Viçosa, MG,
nation for the Improvement of Higher Education Personnel - CAPES Brazil: Editora UFV; 2006.
[22] Barbosa IP, Silva MJ, Costa WG, Sant’Anna IC, Nascimento M, Cruz CD.
financial code 001, and the National Council for Scientific and
Genome-enabled prediction through machine learning methods considering
Technological Development - CNPq. different levels of trait complexity. Crop Sci 2021;61:1890–902. https://doi.
org/10.1002/csc2.20488.
[23] Sant’Anna IC, Tomaz RS, Silva GN, Nascimento M, Bhering LL, Cruz CD.
Superiority of artificial neural networks for a genetic classification procedure.
Appendix A. Supplementary data
Genet Mol Res 2015;14:9898–906. https://doi.org/10.4238/2015.
August.19.24.
Supplementary data to this article can be found online at [24] Resende MDV, Silva FF, Azevedo CF. Estatística Matemática, Biométrica e
https://doi.org/10.1016/j.csbj.2022.09.029. Computacional. Viçosa, MG, Brazil: Editora UFV; 2014.
[25] Costa WG, Barbosa I, De P, Souza JE, Cruz CD, Nascimento M, Oliveira ACB.
Machine learning and statistics to qualify environments through multi-traits
in Coffea arabica. PLoS ONE 2021;16:1–21. https://doi.org/10.1371/journal.
References pone.0245298.
[26] Solano Meza JK, Orjuela Yepes D, Rodrigo-Ilarri J, Cassiraga E. Predictive
[1] Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using analysis of urban waste generation for the city of Bogotá, Colombia, through
genome-wide dense marker maps. Genetics 2001;157:1819–29. the implementation of decision trees-based machine learning, support vector
[2] Tong H, Nikoloski Z. Machine learning approaches for crop improvement: machines and artificial neural networks. Heliyon 2019;5:e02810.
Leveraging phenotypic and genotypic big data. J Plant Physiol 2021;257:. [27] Friedman JH. Multivariate Adaptative regression Splines. Ann Stat
https://doi.org/10.1016/j.jplph.2020.153354153354. 1991;19:1–141.
[3] Singh BD, Singh AK. Marker-assisted plant breeding: Principles and practices. [28] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: Data
2015. 10.1007/978-81-322-2316-0. mining, inference, and prediction. 2. ed. New York, NY, USA: Springer; 2009.
[4] Peixoto LA, Laviola BG, Alves AA, Rosado TB, Bhering LL. Breeding Jatropha 10.1007/978-1-4419-9863-7_941.
curcas by genomic selection: A pilot assessment of the accuracy of predictive [29] Zhang W, Goh ATC. Multivariate adaptive regression splines and neural
models. PLoS ONE 2017;12:1–16. https://doi.org/10.1371/journal. network models for prediction of pile drivability. Geosci Front 2016;7:45–52.
pone.0173368. https://doi.org/10.1016/j.gsf.2014.10.003.
[5] Li B, Zhang N, Wang YG, George AW, Reverter A, Li Y. Genomic prediction of [30] Milborrow S. Notes on the earth package; 2019:1–68.
breeding values using a subset of SNPs identified by three machine learning [31] Zhang H, Yin L, Wang M, Yuan X, Liu X. Factors affecting the accuracy of
methods. Front Genet 2018;9:1–20. https://doi.org/10.3389/ genomic selection for agricultural economic traits in maize, cattle, and pig
fgene.2018.00237. populations. Front Genet 2019;10:1–10. https://doi.org/10.3389/
[6] Yabe S, Hara T, Ueno M, Enoki H, Kimura T, Nishimura S, et al. Potential of fgene.2019.00189.
genomic selection in mass selection breeding of an allogamous crop: An [32] James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical
empirical study to increase yield of common buckwheat. Front Plant Sci Learning. Springer Texts Stat 2021:612. https://doi.org/10.1007/978-1-0716-
2018;9:1–12. https://doi.org/10.3389/fpls.2018.00276. 1418-1_1.
[7] Sousa TV, Caixeta ET, Alkimim ER, Oliveira ACB, Pereira AA, Sakiyama NS, et al. [33] Breiman L. Bagging Predictors. Mach Learn 1996;24:123–40. https://doi.org/
Early Selection Enabled by the Implementation of Genomic Selection in Coffea 10.1007/BF00058655.
arabica Breeding. Front Plant Sci 2019;9:1–12. https://doi.org/10.3389/ [34] Prasad AM, Iverson LR, Liaw A. Newer classification and regression tree
fpls.2018.01934. techniques: Bagging and random forests for ecological prediction. Ecosystems
[8] Alkimim ER, Caixeta ET, Sousa TV, Resende MDV, Silva FL, Sakiyama NS, et al. 2006;9:181–99. https://doi.org/10.1007/s10021-005-0054-1.
Selective efficiency of genome-wide selection in Coffea canephora breeding. [35] Boehmke B, Greenwell B. Random Forests. Hands-On Mach. Learn. with R, vol.
Tree Genet Genomes 2020;16. https://doi.org/10.1007/s11295-020-01433-3. 45, Chapman and Hall/CRC; 2019, p. 203–19. 10.1201/9780367816377-11.
[9] Sant’Anna IC, Nascimento M, Silva GN, Cruz CD, Azevedo CF, Gloria LS, et al. [36] Ghafouri-Kesbi F, Rahimi-Mianji G, Honarvar M, Nejati-Javaremi A. Predictive
Genome-enabled prediction of genetic values for using radial basis function ability of Random Forests, Boosting, Support Vector Machines and Genomic
neural networks. Funct Plant Breed J 2020;1:1–8. 10.35418/2526-4117/ Best Linear Unbiased Prediction in different scenarios of genomic evaluation.
v1n2a1. Anim Prod Sci 2017;57:229–36. https://doi.org/10.1071/AN15538.
[10] Liew BXW, Peolsson A, Rugamer D, Wibault J, Löfgren H, Dedering A, et al. [37] Bengio Y, Grandvalet Y. No Unbiased Estimator of the Variance of K-Fold Cross-
Clinical predictive modelling of post-surgical recovery in individuals with Validation. J Mach Learn Res 2004;5:1089–105. https://doi.org/10.1016/
cervical radiculopathy: a machine learning approach. Sci Rep 2020;10:1–10. S0006-291X(03)00224-9.
https://doi.org/10.1038/s41598-020-73740-7. [38] Cruz CD. GENES - Software para análise de dados em estatística experimental e
[11] Cruz CD, Nascimento M. Inteligência Computacional aplicada ao em genética quantitativa. Acta Sci - Agron 2013;35:271–6. https://doi.org/
melhoramento genético. Viçosa, MG: Editora UFV; 2018. 10.4025/actasciagron.v35i3.21251.
[12] Sousa IC, Nascimento M, Silva GN, Nascimento ACC, Cruz CD, Fonseca e Silva F, [39] Cruz CD. Genes software – extended and integrated with the R, Matlab and
et al. Genomic prediction of leaf rust resistance to Arabica coffee using Selegen. Acta Sci - Agron 2016;38:547–52. https://doi.org/10.4025/
machine learning algorithms. Sci Agric 2021;78:1–8. https://doi.org/10.1590/ actasciagron.v38i4.32629.
1678-992x-2020-0021. [40] R Core Team, Computing RF for S, Team RC. R: A Language and Environment for
[13] Cook NR, Zee RYL, Ridker PM. Tree and spline based association analysis of Statistical Computing 2020. https://www.r-project.org/. (accessed July 1,
gene-gene interaction models for ischemic stroke. Stat Med 2004;23:1439–53. 2020).
https://doi.org/10.1002/sim.1749. [41] MATLAB. Natick, Massachusetts: The MathWorks Inc.; 2019.
[14] Lin HY, Wang W, Liu YH, Soong SJ, York TP, Myers L, et al. Comparison of [42] Schnable PS, Springer NM. Progress toward understanding heterosis in crop
multivariate adaptive regression splines and logistic regression in detecting plants. Annu Rev Plant Biol 2013;64:71–88. https://doi.org/10.1146/annurev-
SNP-SNP interactions and their application in prostate cancer. J Hum Genet arplant-042110-103827.
2008;53:802–11. https://doi.org/10.1007/s10038-008-0313-z. [43] Shao YE, Hou CD, Chiu CC. Hybrid intelligent modeling schemes for heart
[15] Taylan P, Weber GW. CG-Lasso Estimator for Multivariate Adaptive Regression disease classification. Appl Soft Comput J 2014;14:47–52. https://doi.org/
Spline. In: Tas K, Baleanu D, Machado JAT, editors. Math. Methods Eng. Apl. 10.1016/j.asoc.2013.09.020.
Dyn. Complex Syst., Springer International Publishing AG; 2019, p. 121–36. [44] Silva GN, Tomaz RS, Sant’Anna IC, Nascimento M, Bhering LL, Cruz CD. Neural
10.1007/978-3-319-90972-1_9. networks for predicting breeding values and genetic gains. Sci Agric
[16] Altinok G, Karagoz P, Batmaz I. Learning to rank by using multivariate adaptive 2014;71:494–8. 10.1590/0103-9016-2014-0057.
regression splines and conic multivariate adaptive regression splines. Comput [45] Ma W, Qiu Z, Song J, Cheng Q, Ma C. DeepGS: Predicting phenotypes from
Intell 2020:1–38. https://doi.org/10.1111/coin.12413. genotypes using Deep Learning. BioRxiv 2017. https://doi.org/10.1101/
[17] Zheng G, Zhang W, Zhou H, Yang P. Multivariate adaptive regression splines 241414.
model for prediction of the liquefaction-induced settlement of shallow [46] Zingaretti LM, Gezan SA, Ferrão LFV, Osorio LF, Monfort A, Muñoz PR, et al.
foundations. Soil Dyn Earthq Eng 2020;132:. https://doi.org/10.1016/ Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid
j.soildyn.2020.106097106097. Outcrossing Species. Front Plant Sci 2020;11:1–14. https://doi.org/10.3389/
[18] York TP, Eaves LJ, van den Oord EJCG. Multivariate adaptive regression splines: fpls.2020.00025.
A powerful method for detecting disease-risk relationship differences among [47] Coutinho AE, Neder DG, Silva MC, Arcelino EC, Brito SG, Carvalho Filho JLS.
subgroups. Stat Med 2006;25:1355–67. https://doi.org/10.1002/sim.2292. Prediction of phenotypic and genotypic values by BLUP/GWS and neural
[19] Chang CD, Wang CC, Jiang BC. Using data mining techniques for multi-diseases networks. Rev Caatinga 2018;31:532–40. https://doi.org/10.1590/1983-
prediction modeling of hypertension and hyperlipidemia by common risk 21252018v31n301rc.
factors. Expert Syst Appl 2011;38:5507–13. https://doi.org/10.1016/j. [48] Moura EG, Pamplona AKA, Balestre M. Functional models in genome-wide
eswa.2010.10.086. selection. PLoS ONE 2019;14:e0222699.
5498
[49] Coster A, Bastiaansen JWM, Calus MPL, Van Arendonk JAM, Bovenhuis H. [60] Wang J, Zhou Z, Zhang Z, Li H, Liu D, Zhang Q, et al. Expanding the BLUP
Sensitivity of methods for estimating breeding values using genetic markers to alphabet for genomic prediction adaptable to the genetic architectures of
the number of QTL and distribution of QTL variance. Genet Sel Evol complex traits. Heredity (Edinb) 2018;121:648–62. https://doi.org/10.1038/
2010;42:1–11. https://doi.org/10.1186/1297-9686-42-9. s41437-018-0075-0.
[50] Everingham YL, Sexton J. An introduction to Multivariate Adaptive Regression [61] Dufflocq P, Pérez-Enciso M, Lhorente JP, Yáñez JM. Accuracy of genomic
Splines for the cane industry. 33rd Annu Conf Aust Soc Sugar Cane Technol predictions using different imputation error rates in aquaculture breeding
2011, ASSCT 2011 2011:255–68. programs: A simulation study. Aquaculture 2019;503:225–30. https://doi.org/
[51] Cruz CD. Princípios de genética quantitativa. 2. ed. Viçosa, MG: Editora UFV; 10.1016/j.aquaculture.2018.12.061.
2012. [62] Pocrnic I, Lourenco DAL, Masuda Y, Misztal I. Accuracy of genomic BLUP when
[52] De Veaux RD, Ungar LH. Multicollinearity: A tale of two nonparametric considering a genomic relationship matrix based on the number of the largest
regressions 1994:393–402. 10.1007/978-1-4612-2660-4_40. eigenvalues: A simulation study. Genet Sel Evol 2019;51:1–10. https://doi.org/
[53] Diaz-Uriarte R. GeneSrF and varSelRF: A web-based tool and R package for 10.1186/s12711-019-0516-0.
gene selection and classification using random forest. BMC Bioinf 2007;8:1–7. [63] Liu X, Wang H, Hu X, Li K, Liu Z, Wu Y, et al. Improving Genomic Selection With
https://doi.org/10.1186/1471-2105-8-328. Quantitative Trait Loci and Nonadditive Effects Revealed by Empirical
[54] Fuleky P. Macroeconomic Forecasting in the Era of Big Data. vol. 52. 2020. Evidence in Maize. Front Plant Sci 2019;10. 10.3389/fpls.2019.01129.
[55] Sant’Anna I de C, Gouvêa LRL, Martins MA, Scaloppi Junior EJ, de Freitas RS, [64] De Andrés J, Lorca P, De Cos Juez FJ, Sánchez-Lasheras F. Bankruptcy
Gonçalves P de S. Genetic diversity associated with natural rubber quality in forecasting: A hybrid approach using fuzzy c-means clustering and
elite genotypes of the rubber tree. Sci Rep 2021;11:1–10. 10.1038/s41598- multivariate adaptive regression splines (MARS). Expert Syst Appl
020-80110-w. 2011;38:1866–75. https://doi.org/10.1016/j.eswa.2010.07.117.
[56] Speed D, Hemani G, Johnson MR, Balding DJ. Improved heritability estimation [65] Deconinck E, Coomans D, Vander HY. Exploration of linear modelling
from genome-wide SNPs. Am J Hum Genet 2012;91:1011–21. https://doi.org/ techniques and their combination with multivariate adaptive regression
10.1016/j.ajhg.2012.10.010. splines to predict gastro-intestinal absorption of drugs. J Pharm Biomed Anal
[57] Legarra A. Comparing estimates of genetic variance across different 2007;43:119–30. https://doi.org/10.1016/j.jpba.2006.06.022.
relationship models. Theor Popul Biol 2016;107:26–30. https://doi.org/ [66] Nayana BM, Kumar KR, Chesneau C. Wheat Yield Prediction in India Using
10.1016/j.tpb.2015.08.005. Principal Component Analysis-Multivariate Adaptive Regression Splines (PCA-
[58] Fernando RL, Cheng H, Sun X, Garrick DJ. A comparison of identity-by-descent MARS). AgriEngineering 2022;4:461–74. https://doi.org/10.3390/
and identity-by-state matrices that are used for genetic evaluation and agriengineering4020030.
estimation of variance components. J Anim Breed Genet 2017;134:213–23. [67] Resende MDV, Resende MFR, Sansaloni CP, Petroli CD, Missiaggia AA, Aguiar
https://doi.org/10.1111/jbg.12275. AM, et al. Genomic selection for growth and wood quality in Eucalyptus:
[59] Mathew B, Léon J, Sillanpää MJ. A novel linkage-disequilibrium corrected Capturing the missing heritability and accelerating breeding for complex traits
genomic relationship matrix for SNP-heritability estimation and genomic in forest trees. New Phytol 2012;194:116–28. https://doi.org/10.1111/j.1469-
prediction. Heredity (Edinb) 2018;120:356–68. https://doi.org/10.1038/ 8137.2011.04038.x.
s41437-017-0023-4.
5499

1-s2.0-S2001037022004342-main

Uploaded by

Copyright:

Available Formats

1-s2.0-S2001037022004342-main

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1-s2.0-S2001037022004342-main

Uploaded by

Copyright:

Available Formats

Computational and Structural Biotechnology Journal 20 (2022) 5490–5499

journal homepage: www.elsevier.com/locate/csbj

Genomic prediction through machine learning and neural networks

where: b y R1 is the average of the response variable of the training

You might also like