Machine Learning For AI Breeding in Plants
https://doi.org/10.1093/gpbjnl/qzae051
Advance access publication: 2 July 2024
Perspective
What makes artificial intelligence (AI) smart is machine learning (ML), which ML pioneer Arthur Samuel defined in 1959 as “a field of study that gives computers the ability to learn without being explicitly programmed”. ML deduces data patterns without relying on prior assumptions as statistics does, greatly reducing the human effort required to comprehend the data. ML comprises a large family of algorithms, many of which support big data analytics [1]. With the rapid advances in multi-omics technologies, plant breeding has entered the “genome, germplasm, genes, genomic breeding, and gene editing (5G)” generation [2], in which biological knowledge and omics data are integrated to expedite trait improvement. ML holds great promise for 5G breeding, with many reports of ML applications for omics-driven gene discovery, genotype-to-phenotype (G2P) prediction, genomic selection (GS), and plant phenomics. However, there remains a gap between basic research and breeding practices in plants [3]. Given that multi-omics, genotypic, phenomic, and environmental datasets have become highly dimensional and heterogeneous, novel ML algorithms are needed. Here, we propose ways to overcome major challenges in the application of cutting-edge ML models to plant research, with the ultimate goal of making plant breeding smart and easy.

Population-scale multi-omics analysis for gene discovery
Discovery of agronomically useful genes is the premise for exploiting natural variations for marker-assisted selection (MAS) or creating artificial mutations via genome editing. Genome-wide association studies (GWAS) of common agronomic traits have reached a bottleneck, as their power to dissect complex, polygenic traits is quite limited. Multi-omics analysis focusing on a reference germplasm panel under different spatiotemporal conditions could greatly enhance the mapping resolution of causal genes and mutations when cellular biomolecules (e.g., RNA transcripts, proteins, metabolites) are treated as molecular traits (mTraits). Additionally, phenomics has become another main component in multi-omics, in which phenomic data are mostly generated by high-throughput imaging equipment using computer vision technologies. Since phenomic features may reflect certain physiological activities inside plant cells, this type of feature can be regarded as imaging traits (iTraits).

Coping with the “curse of dimensionality”
Population-scale multi-omics datasets tend to be highly dimensional, noisy, and heterogeneous. This issue is addressed using a type of unsupervised learning known as dimensionality reduction (DR) to prevent the “curse of dimensionality”. The Multi-Omics Data Association Studies (MODAS) toolbox applies multiple DR algorithms to genotypes and mTraits in plants [4]. To perform DR on genotypes, MODAS combines the Jaccard similarity coefficient, density-based spatial clustering of applications with noise (DBSCAN), and principal component analysis (PCA) algorithms to generate a “pseudo-genotype index” file. This highly simplified variation atlas uses tens of thousands of genomic blocks to represent millions of single-nucleotide polymorphisms (SNPs) in the genome, improving analytical efficiency for mapping mTraits.

The dimensionality of mTraits must also be reduced, as omics data are highly redundant due to technical issues and the characteristics of biological pathways. For example, a metabolite is produced by a cascade of enzymatic reactions involving many genes and pathways, and crosstalk between pathways is common. Therefore, given their highly correlated pattern, both final products and intermediate compounds could be repeatedly mapped to the same region. The non-negative matrix factorization (NMF) algorithm removes redundancy by decomposing the matrix of metabolites (n) × samples (m) into one meta-metabolite dimension and one meta-sample dimension. The weights of a meta-metabolite across samples represent the overall abundance of a set of clustered compounds, and the weights of meta-samples reflect subgroups of samples divided based on the haplotypes of the mapped region. The genomic blocks that contribute to the corresponding biosynthetic pathway are mapped via GWAS between the meta-metabolites and the pseudo-genotype index. SNPs within the block are then used to identify causal genes and mutations. This strategy greatly reduces computing time and saves resources while providing clean, easy-to-interpret results.

Automated feature engineering
Another common issue is that feature sets, such as SNPs, mTraits, or iTraits, are far larger than sample sets. This increases the risk of overfitting, as the model may learn incorrect features from the data. Thus, feature engineering, including feature selection or feature extraction, must be performed before training a model. Feature selection picks a small subset from the total features without changing the original feature values. This can be achieved by manual selection based on prior knowledge or automated selection by learning the importance of features when training a model. By contrast, feature extraction creates a small set of new features by summarizing the characteristics of the original features. NMF is a form of feature extraction, as meta-

and deletions (InDels), presence and absence variations (PAVs), and a variety of structural variations (SVs) causing direct functional change, is important for precision-designed breeding. This is especially true for improving qualitative traits determined by a single gene with major effect. However, causative variations involving coding SNPs or short InDels that alter protein functions only account for a very small fraction of trait-related variations. Mapping of regulatory variants attributable to SVs and PAVs is much more difficult, as it
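
As a minimal sketch of the NMF step described under “Coping with the ‘curse of dimensionality’”, the following pure-Python example factorizes a small, made-up metabolites × samples matrix using the standard multiplicative-update rules. It is an illustration under stated assumptions, not the MODAS implementation: the toy abundances, the rank of 2, and the helper names (`matmul`, `transpose`, `nmf`, `squared_error`) are all hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x m) into w (n x rank) and h (rank x m)
    with multiplicative updates that reduce the squared reconstruction error."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def squared_error(v, w, h):
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


# Hypothetical toy data: 6 metabolites x 5 samples. Metabolites 0-2 follow one
# abundance pattern (one "pathway"), metabolites 3-5 follow another.
pathway_a = [5.0, 4.0, 1.0, 0.5, 0.2]
pathway_b = [0.3, 0.5, 4.0, 5.0, 3.0]
toy = ([[s * x for x in pathway_a] for s in (1.0, 0.8, 1.2)]
       + [[s * x for x in pathway_b] for s in (1.0, 0.5, 1.5)])

w, h = nmf(toy, rank=2)
error = squared_error(toy, w, h)
```

In this sketch each row of `h` gives the weight of one meta-metabolite in every sample, and each row of `w` records how strongly an original metabolite loads on the two meta-metabolites. For real data, a tested library implementation would be preferable to this hand-rolled loop.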

need to identify genes and recipient materials that can be modified without affecting non-target traits.

Breeding is all about “timing” and “balancing”
Trait improvement is essentially a process of fine-tuning a gene regulation network. Crossing generates new patterns of gene regulation by recombining deleterious and beneficial alleles. This process offers the chance to select the optimal network where genes involved in a regulatory pathway meet

Ultrahigh-throughput, scalable platforms based on kompetitive allele-specific PCR (KASP), such as Nexar Array Tape systems, could then be utilized. These platforms can multiplex tens of thousands of samples per run, but the markers must be highly universal and effective. One can then take advantage of feature selection embedded in EL to select markers. EL is a family of ML algorithms, including random forest, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), categorical boosting (CatBoost),
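
To illustrate what feature selection embedded in a boosted tree ensemble might look like for marker selection, the sketch below trains a tiny gradient-boosting model made of depth-one trees (stumps) and ranks SNPs by the squared-error reduction they accumulate. Everything here is an assumption for illustration: the simulated genotypes (coded 0/1/2), the two causal SNPs, the hyperparameters, and the function names `best_stump` and `boosted_importance`. It does not reproduce any particular library or the authors' pipeline.

```python
import random


def best_stump(x, residual):
    """Return the single split with the largest squared-error reduction as
    (gain, feature, threshold, left_mean, right_mean), or None if no split exists."""
    n = len(x)
    total = sum(residual)
    base = total * total / n
    best = None
    for j in range(len(x[0])):
        for thr in (0.5, 1.5):  # the only useful cut points for 0/1/2 genotypes
            left = [r for row, r in zip(x, residual) if row[j] <= thr]
            if not left or len(left) == n:
                continue
            left_sum = sum(left)
            right_sum = total - left_sum
            n_right = n - len(left)
            gain = left_sum ** 2 / len(left) + right_sum ** 2 / n_right - base
            if best is None or gain > best[0]:
                best = (gain, j, thr, left_sum / len(left), right_sum / n_right)
    return best


def boosted_importance(x, y, n_rounds=50, lr=0.3):
    """Fit boosted stumps on squared error and sum the split gain per feature."""
    mean_y = sum(y) / len(y)
    pred = [mean_y] * len(y)
    importance = [0.0] * len(x[0])
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        split = best_stump(x, residual)
        if split is None:
            break
        gain, j, thr, left_val, right_val = split
        importance[j] += gain
        pred = [p + lr * (left_val if row[j] <= thr else right_val)
                for row, p in zip(x, pred)]
    return importance


# Simulated panel: 80 samples x 12 SNPs; only SNPs 3 and 7 affect the trait.
rng = random.Random(42)
genotypes = [[rng.choice((0, 1, 2)) for _ in range(12)] for _ in range(80)]
trait = [2.0 * g[3] - 1.5 * g[7] + rng.gauss(0.0, 0.2) for g in genotypes]

importance = boosted_importance(genotypes, trait)
top_markers = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)[:2]
```

The markers with the highest accumulated gain would be the candidates to carry forward to a low-density genotyping panel. With real data, the ranking should be checked across cross-validation folds before committing to a marker set.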

Data-driven genomic design breeding
Data acquired from industrial breeding programs can include genotypic, phenotypic, environmental, climate, and any type of field data. Unlike knowledge-driven design, data-driven design does not require knowledge of the specific genes and mechanisms underlying a trait. Instead, it uses statistical or ML models to infer correlations among data, as exemplified by GS [9]. However, genotyping cost is still the main factor

GS model is risky, as it may cause inestimable overfitting due to the extremely high complexity of feature sets. Therefore, the aforementioned feature engineering on mTraits or iTraits must be applied to reduce data dimensionality prior to model training. The resulting low-dimensional vectors are then treated as features and combined with SNP genotypes to train GS models. Additionally, generating multi-omics data is costly, and it is impossible to generate RNA sequencing (RNA-seq) or metabolome profiling for each individual sample in each
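
The idea of combining reduced mTrait vectors with SNP genotypes before training a GS model can be sketched as follows. This is a hedged illustration only: ridge regression fitted by gradient descent serves as a simple stand-in for a GS model, and the simulated data, feature counts, and names (`standardize`, `fit_ridge`, `snp_columns`, `mtrait_columns`) are hypothetical.

```python
import random


def standardize(columns):
    """Scale each feature column to zero mean and unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5 or 1.0
        out.append([(v - mean) / sd for v in col])
    return out


def fit_ridge(rows, y, lam=0.1, lr=0.05, n_iter=500):
    """Minimize mean squared error / 2 plus (lam / 2) * ||w||^2 by gradient descent."""
    n, p = len(rows), len(rows[0])
    w = [0.0] * p
    for _ in range(n_iter):
        err = [sum(wj * xj for wj, xj in zip(w, row)) - yi for row, yi in zip(rows, y)]
        grad = [sum(e * row[j] for e, row in zip(err, rows)) / n + lam * w[j]
                for j in range(p)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Simulated inputs: 6 SNPs (coded 0/1/2) and 2 low-dimensional mTrait vectors
# (for example, meta-metabolite weights) measured on the same 40 samples.
rng = random.Random(1)
n_samples = 40
snp_columns = [[rng.choice((0, 1, 2)) for _ in range(n_samples)] for _ in range(6)]
mtrait_columns = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(2)]
phenotype = [1.0 * snp_columns[0][i] + 0.8 * mtrait_columns[0][i] + rng.gauss(0.0, 0.3)
             for i in range(n_samples)]

# Put both feature types on a common scale, then concatenate them per sample.
features = standardize(snp_columns + mtrait_columns)
rows = [list(r) for r in zip(*features)]
mean_y = sum(phenotype) / n_samples
y_centered = [v - mean_y for v in phenotype]

weights = fit_ridge(rows, y_centered)
predicted = [sum(wj * xj for wj, xj in zip(weights, row)) for row in rows]
mse_fit = sum((p - t) ** 2 for p, t in zip(predicted, y_centered)) / n_samples
mse_baseline = sum(t ** 2 for t in y_centered) / n_samples
```

Standardizing before concatenation keeps the SNP and mTrait features on comparable scales so that the penalty treats them evenly. The errors computed here are in-sample, so a real evaluation would need held-out lines or cross-validation.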

infer correlations between yield performance and a few environmental factors. Statistical models have become unsuitable for modeling increasingly complicated genotypic, phenotypic, environmental, and climate datasets, prompting the need for ML methods. Another critical issue for modeling phenotypic plasticity is heterogeneous plasticity between inbred lines and hybrids, which strongly influences model precision and must be considered when using ML methods to predict environment-specific traits from inbred to hybrid lines.

human cancer prediction and classification. With the rapid generation of omics data from plant germplasms, perhaps this multi-modal learning algorithm could be used to address the problem of a limited sample size for model training.

Building an ecosystem for AI breeding in plants
A common consensus is that high-quality datasets and labels

from the predictive models, equipped with a user-friendly interface to offer services and report results to end users. Such an ML ecosystem will make plant breeding smarter and easier in this era of AI.

CRediT author statement
Qian Cheng: Writing – review & editing. Xiangfeng Wang: Conceptualization, Writing – original draft, Writing – review

[4] Liu S, Xu F, Xu Y, Wang Q, Yan J, Wang J, et al. MODAS: exploring maize germplasm with multi-omics data association studies. Sci Bull 2022;67:903–6.
[5] Yan J, Xu Y, Cheng Q, Jiang S, Wang Q, Xiao Y, et al. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol 2021;22:271.
[6] Weckwerth W, Ghatak A, Bellaire A, Chaturvedi P, Varshney RK. PANOMICS meets germplasm. Plant Biotechnol J 2020;18:1507–25.
[7] Xiao Y, Jiang S, Cheng Q, Wang X, Yan J, Zhang R, et al. The ge
© The Author(s) 2024. Published by Oxford University Press and Science Press on behalf of the Beijing Institute of Genomics, Chinese Academy of Sciences / China
National Center for Bioinformation and Genetics Society of China.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Genomics, Proteomics & Bioinformatics, 2024, 22, 1–6