Machine Learning For AI Breeding in Plants


Genomics, Proteomics & Bioinformatics, 2024, 22(4), qzae051

https://doi.org/10.1093/gpbjnl/qzae051
Advance access publication: 2 July 2024
Perspective

Machine Learning for AI Breeding in Plants


Qian Cheng, Xiangfeng Wang*
State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, China Agricultural University, Beijing 100094, China
*Corresponding author: xwang@cau.edu.cn (Wang X).
Handling Editor: Peng Cui

Downloaded from https://academic.oup.com/gpb/article/22/4/qzae051/7703285 by Sichuan Agriculture University user on 20 October 2024

Received: 11 May 2024; Revised: 21 June 2024; Accepted: 25 June 2024.

© The Author(s) 2024. Published by Oxford University Press and Science Press on behalf of the Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

What makes artificial intelligence (AI) smart is machine learning (ML), which is defined as “a field of study that gives computers the ability to learn without being explicitly programmed” by ML pioneer Arthur Samuel in 1959. ML deduces data patterns without relying on prior assumptions as statistics does, greatly reducing the human effort required to comprehend the data. ML comprises a large family of algorithms, many of which support big data analytics [1]. With the rapid advances in multi-omics technologies, plant breeding has entered the “genome, germplasm, genes, genomic breeding, and gene editing (5G)” generation [2], in which biological knowledge and omics data are integrated to expedite trait improvement. ML holds great promise for 5G breeding, with many reports of ML applications for omics-driven gene discovery, genotype-to-phenotype (G2P) prediction, genomic selection (GS), and plant phenomics. However, there remains a gap between basic research and breeding practices in plants [3]. Given that multi-omics, genotypic, phenomic, and environmental datasets have become highly dimensional and heterogeneous, novel ML algorithms are needed. Here, we propose ways to overcome major challenges in the application of cutting-edge ML models to plant research, with the ultimate goal of making plant breeding smart and easy.

Population-scale multi-omics analysis for gene discovery
Discovery of agronomically useful genes is the premise for exploiting natural variations for marker-assisted selection (MAS) or creating artificial mutations via genome editing. Genome-wide association studies (GWAS) of common agronomic traits have reached a bottleneck, as their power to dissect complex, polygenic traits is quite limited. Multi-omics analysis focusing on a reference germplasm panel under different spatiotemporal conditions could greatly enhance the mapping resolution of causal genes and mutations when cellular biomolecules (e.g., RNA transcripts, proteins, metabolites) are treated as molecular traits (mTraits). Additionally, phenomics has become another main component of multi-omics, in which phenomic data are mostly generated by high-throughput imaging equipment using computer vision technologies. Since phenomic features may reflect certain physiological activities inside plant cells, this type of feature can be regarded as imaging traits (iTraits).

Coping with the “curse of dimensionality”
Population-scale multi-omics datasets tend to be highly dimensional, noisy, and heterogeneous. This issue is addressed using a type of unsupervised learning known as dimensionality reduction (DR) to prevent the “curse of dimensionality”. The Multi-Omics Data Association Studies (MODAS) toolbox applies multiple DR algorithms to genotypes and mTraits in plants [4]. To perform DR on genotypes, MODAS combines the Jaccard similarity coefficient, density-based spatial clustering of applications with noise (DBSCAN), and principal component analysis (PCA) algorithms to generate a “pseudo-genotype index” file. This highly simplified variation atlas uses tens of thousands of genomic blocks to represent millions of single-nucleotide polymorphisms (SNPs) in the genome, improving analytical efficiency for mapping mTraits.

The dimensionality of mTraits must also be reduced, as omics data are highly redundant due to technical issues and the characteristics of biological pathways. For example, a metabolite is produced by a cascade of enzymatic reactions involving many genes and pathways, and crosstalk between pathways is common. Therefore, given their highly correlated pattern, both final products and intermediate compounds could be repeatedly mapped to the same region. The non-negative matrix factorization (NMF) algorithm removes redundancy by decomposing the matrix of metabolites (n) × samples (m) into one meta-metabolite dimension and one meta-sample dimension. The weights of a meta-metabolite across samples represent the overall abundance of a set of clustered compounds, and the weights of meta-samples reflect subgroups of samples divided based on the haplotypes of the mapped region. The genomic blocks that contribute to the corresponding biosynthetic pathway are mapped via GWAS between the meta-metabolites and the pseudo-genotype index. SNPs within the block are then used to identify causal genes and mutations. This strategy greatly reduces computing time and saves resources while providing clean, easy-to-interpret results.

Automated feature engineering
Another common issue is that feature sets, such as SNPs, mTraits, or iTraits, are far larger than sample sets. This increases the risk of overfitting, as the model may learn incorrect features from the data. Thus, feature engineering,
including feature selection or feature extraction, must be performed before training a model. Feature selection picks a small subset of the total features without changing the original feature values. This can be achieved by manual selection based on prior knowledge or automated selection by learning the importance of features when training a model. By contrast, feature extraction creates a small set of new features by summarizing the characteristics of the original features. NMF is a form of feature extraction, as meta-metabolites are new features derived from a much larger set of metabolites. Feature engineering can be embedded in many ML paradigms, such as deep learning (DL) and ensemble learning (EL). The DL convolutional neural network algorithm performs feature extraction when transferring information between network layers. Light Gradient Boosting Machine (LightGBM) performs feature selection by computing a score of information gain (IG) to select features of high importance.

Although automated hyperparameter tuning using grid searches is widely implemented in plants, automated feature engineering has largely been neglected. In a recent study, SNP features with high IG scores selected by LightGBM were consistent with the peak SNPs identified from GWAS, indicating the ability of the algorithm to recognize trait-associated variations [5]. This suggests that automated feature selection can also be used to discover agronomically important genes and to facilitate the design of panels of effective molecular markers associated with traits of interest for MAS. In addition to methods embedded in ML algorithms, many independent tools specifically designed for feature engineering could also be utilized in plants, such as the deep feature synthesis method in the Python “Featuretools” library.

Manifold learning for data visualization
Manifold learning uses non-linear DR algorithms to visualize datasets with ultrahigh dimensionality, which helps maintain the geometric properties of high-dimensional data, even when mapped to a low-dimensional space. This technique is especially useful for visualizing single-cell RNA sequencing (scRNA-seq) data. Multiple algorithms have been utilized to investigate the structures of heterogeneous cell populations based on scRNA-seq data, including t-distributed stochastic neighbor embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE). Another strategy utilizes deep neural networks (DNNs) to extract information from internal nodes at different network layers to simultaneously achieve batch correction, clustering, denoising, and data visualization under a unified model. DL using this strategy is no longer regarded as a “black box”, as the geometric properties may reflect the biological features extracted by the hidden layers of DNNs. Sparse Autoencoder for Unsupervised Clustering, Imputation, and Embedding (SAUCIE) performs DR and visualization of scRNA-seq data simultaneously. Other omics data types have also been generated at single-cell resolution. Aligning and integrating multiple levels of omics data for the same cell populations has become a new challenge.

Fine-mapping of causative variants
In essence, gene discovery aims to identify allelic genomic variations that are beneficial to a designated trait. Thus, fine-mapping of causative variants, including SNPs, insertions and deletions (InDels), presence and absence variations (PAVs), and a variety of structural variations (SVs) causing direct functional change, is important for precision-designed breeding. This is especially true for improving qualitative traits determined by a single gene with a major effect. However, causative variations involving coding SNPs or short InDels that alter protein functions account for only a very small fraction of trait-related variations. Mapping of regulatory variants attributable to SVs and PAVs is much more difficult, as it requires high-quality pan-genome sequences derived from de novo assembly of representative core germplasm lines. To achieve this goal, multiple steps assisted by different types of omics data are required. The process starts with rough mapping of a genomic interval, usually spanning megabases, by GWAS analysis of the target trait; second, integrative analysis of various datasets generated from transcriptome-wide association study (TWAS), metabolome-wide association study (MWAS), and other types of techniques that profile cis-regulatory elements, such as chromatin immunoprecipitation sequencing (ChIP-seq) or self-transcribing active regulatory region sequencing (STARR-seq), is performed to further narrow down the list of candidate genes or genomic regions; third, genotypes of SNPs in the candidate genes and regions are mapped to the pan-genome assembly to determine the haplotypic map (HapMap) associated with each of the SVs or PAVs; finally, statistical testing is performed to examine whether the PAV-associated HapMap is significantly consistent with phenotypic variations. However, it is worth noting that these so-called causative variants identified from multi-omics analysis are only candidate genes or variations. Whether they are directly involved in functional variations contributing to trait change still requires strict experimental validation before a functional marker can finally be utilized in molecular design breeding. Because fine-mapping of causative variants involves multiple forms of population-scale omics data, recently defined as panomics by Weckwerth et al., the development of ML methods for integrative analysis of panomics is highly anticipated [6].

Knowledge-driven molecular design breeding
Knowledge from plant research should ultimately facilitate applied plant breeding. With an explicit understanding of the biological mechanisms underlying a trait, the causal gene can be precisely utilized for trait improvement. Yet, translating biological knowledge into breeding remains challenging. For example, germplasm panels used for GWAS usually consist of wild relatives, landraces, obsolete cultivars, and modern cultivars to ensure genotypic and phenotypic diversity. However, most mutations mapped in germplasm are no longer present in modern cultivars, as deleterious alleles have been removed and beneficial alleles fixed by artificial selection. Hence, relatively few genes are utilized in modern breeding, and mutations in these genes conveying desirable traits usually vary from population to population. A foreground mutation only functions properly under a specific genetic background; thus, even if a mutation discovered from germplasm is potentially valuable, it may not be directly utilizable in modern breeding systems. Similarly, when creating artificial mutations, the new mutation must adapt to the existing gene regulatory networks. Therefore, the bottleneck is not the genome editing or transgenic techniques but rather the
need to identify genes and recipient materials that can be modified without affecting non-target traits.

Breeding is all about “timing” and “balancing”
Trait improvement is essentially a process of fine-tuning a gene regulatory network. Crossing generates new patterns of gene regulation by recombining deleterious and beneficial alleles. This process offers the chance to select the optimal network where genes involved in a regulatory pathway meet the breeding goal of trait improvement. Thus, even a small phenotypic change might involve a reshaped gene regulatory network influencing complex interactions among genes and pathways. It is also important to clarify the definition of deleterious and beneficial alleles. That is, no allele is absolutely deleterious or beneficial: alleles are defined based on their final effects on yield. However, deleterious and beneficial statuses are potentially interconvertible depending on the developmental stage and/or environment. For example, an allele that promotes vegetative growth is beneficial for biomass accumulation but may be deleterious to yield-related traits by negatively affecting reproductive development [7]. Thus, breeding cannot be simply understood as a way of removing deleterious alleles or pyramiding beneficial alleles; instead, the effects of two sets of counteracting alleles must be balanced.

How can our knowledge of genes and mechanisms be efficiently translated into applied breeding? ML is suited for this mission due to its capacity to integrate knowledge and data. To illustrate this, consider ML-facilitated molecular design to breed maize cultivars suitable for mechanical harvesting. This requires considering multiple traits for improvement, including plant compactness, kernel dehydration rate, flowering and maturity times, stalk stiffness and strength, and corn husk morphology. The greatest difficulty is dealing with the pleiotropic effects of genes: changing one trait may affect other traits. Target-oriented prioritization (TOP), a recently developed integrative multi-trait ML algorithm, mathematically learns the synergistic or competitive relationships among multiple traits to make a cohesive decision for selecting superior candidates [8]. As long as sufficient genotypic and phenotypic data are acquired, ML models can establish the correlations between genes and traits based on a knowledge graph. The target genes for a designated breeding population can be assembled as a panel for ML algorithms to learn the optimal pattern of allelic combinations. The model then aids the selection of materials with the desired haplotypes to simultaneously improve multiple traits.

Panel design with EL
Genotyping by targeted sequencing (GBTS), which captures SNP-containing regions for gene panel sequencing, is widely used for genetic diagnostics in precision medicine. A typical GBTS panel contains thousands to tens of thousands of SNPs covering dozens to hundreds of genes, allowing hundreds of samples to be multiplexed for genotyping. However, the cost per sample of genotyping is still relatively high for plant breeding because of the need to process tens of thousands of samples. Nonetheless, GBTS is a good method for accumulating training data for ML until the population is large enough to cover all possible allelic combinations of target genes. Once the most stable SNPs are identified, a new low-cost panel containing dozens of SNPs could be designed. Ultrahigh-throughput, scalable platforms based on kompetitive allele-specific PCR (KASP), such as Nexar Array Tape systems, could then be utilized. These platforms can multiplex tens of thousands of samples per run, but the markers must be highly universal and effective. One can then take advantage of feature selection embedded in EL to select markers. EL is a family of ML algorithms, including random forest, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), categorical boosting (CatBoost), and light gradient boosting machine (LightGBM), which assemble outcomes from multiple weak learners to enhance predictability. LightGBM generates leaf-wise trees and identifies the “best leaves”, which in this case are SNPs with high utility for classifying traits. This ability is represented by an IG score, which resembles the effect of the SNP inferred from GWAS [5]. Therefore, LightGBM is an ideal tool for compiling highly condensed panels of SNPs via automated feature selection while maintaining maximum predictability.

Pathway design via causal learning
While a marker panel covers SNPs associated with relevant traits identified from GWAS analysis, a pathway panel may contain variations associated with genes forming a regulatory network or located in a metabolic biosynthesis pathway identified from multi-omics analysis. Therefore, designing a pathway panel requires the inference of the “cause” and “effect” relationship between two genes, such as a transcription factor and a target gene. Compared to the marker panel, which usually covers thousands of SNP markers for improving regular agronomic traits, a pathway panel may contain far fewer markers associated with genes for improving specific characteristics of plants, such as stress tolerance or the content of certain metabolite compounds. The inferred causality can be used as a rule to design a trait panel by clustering functionally related genes. Mendelian randomization (MR) was recently used to infer the causal relationships between mutations, genes, biomolecules, and traits in plants based on summarized results from population-scale multi-omics analysis [4]. However, the assumption underlying MR is based on human population genetics. Whether this tool is applicable to all plant species requires validation, as domesticated plants result from artificial selection rather than natural selection. It is therefore necessary to seek novel methods independent of genetic assumptions. In fact, ML and causal inference are two independent fields with different methodological systems: ML predicts outcomes based on data correlations without explaining causality, whereas causal inference determines the roles of the “cause” and “effect” of variables. Data scientists are trying to combine these two systems. The new field of “causal learning” gives ML models the ability to explain underlying reasons, thereby making AI more closely resemble real-world decision-making. For example, causal representation learning was designed to discover high-level causal variables based on low-level observations. Causal tree learning, a modified version of the classification and regression tree (CART) model, estimates causal relationships during the process of tree splitting. These methods could be used to reconstruct biological networks from multi-omics data, in which the inferred causalities represent the directional edges among nodes.
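
To make the MR idea discussed in this section concrete, the following is a minimal sketch on simulated data; it is not the procedure used in MODAS [4]. It assumes a single SNP is a valid instrument (it influences the trait only through the molecular exposure) and uses the standard Wald ratio, cov(G, Y) / cov(G, X). The function names, effect sizes, and sample size are hypothetical choices made for illustration.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def simulate(n, snp_effect, causal_effect, seed=0):
    """Simulate SNP dosage G, exposure X (e.g., an mTrait), and trait Y.

    A hidden confounder U influences both X and Y, so regressing Y on X
    directly is biased. G is drawn independently of U.
    """
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        # Two independent allele draws give a dosage of 0, 1, or 2.
        dosage = int(rng.random() < 0.5) + int(rng.random() < 0.5)
        u = rng.gauss(0.0, 1.0)  # unobserved confounder
        exposure = snp_effect * dosage + u + rng.gauss(0.0, 1.0)
        trait = causal_effect * exposure + u + rng.gauss(0.0, 1.0)
        g.append(dosage)
        x.append(exposure)
        y.append(trait)
    return g, x, y


def wald_ratio(g, x, y):
    """MR estimate of the X -> Y effect using G as the instrument."""
    return covariance(g, y) / covariance(g, x)


def naive_slope(x, y):
    """Ordinary regression slope of Y on X, ignoring confounding."""
    return covariance(x, y) / covariance(x, x)


TRUE_EFFECT = 0.5  # hypothetical causal effect of X on Y
g, x, y = simulate(n=20000, snp_effect=1.0, causal_effect=TRUE_EFFECT)
mr_estimate = wald_ratio(g, x, y)
ols_estimate = naive_slope(x, y)
print(f"simulated effect: {TRUE_EFFECT:.2f}")
print(f"MR (Wald ratio):  {mr_estimate:.2f}")
print(f"naive regression: {ols_estimate:.2f}")
```

In this simulation the naive regression slope is inflated by the confounder, whereas the Wald ratio stays close to the simulated effect. The result depends entirely on the instrument assumption built into `simulate`, which is exactly the assumption that requires validation in plant populations.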

Data-driven genomic design breeding
Data acquired from industrial breeding programs can include genotypic, phenotypic, environmental, climate, and any type of field data. Unlike knowledge-driven design, data-driven design does not require knowledge of the specific genes and mechanisms underlying a trait. Instead, it uses statistical or ML models to infer correlations among data, as exemplified by GS [9]. However, genotyping cost is still the main factor hindering the wide application of GS in the plant breeding industry. A promising substitute for GBTS is low-coverage genome-wide sequencing (lcGWS) or ultra-low-coverage genome-wide sequencing (ulcGWS), which randomly sequences genomic DNA at an expected coverage of 1.5× or 0.5×, respectively. The genotyping cost of lcGWS is much lower than that of GBTS, since it skips the step of capturing targeted DNA fragments. Nevertheless, because DNA fragments are randomly sequenced by lcGWS, SNPs may not be consistently covered across all the genotyped samples. One possible solution is to first construct a reference HapMap composed of all elite inbred lines, which usually includes 50 to 100 lines frequently used as founder lines to generate doubled haploid (DH) lines in a breeding project. However, the reference HapMap has to be constructed with high-coverage genome-wide sequencing (hcGWS; i.e., 30×), so that it can be used to perform imputation on genotypic data of DH lines that are generated by the founder lines included in the HapMap. In this way, a relatively consistent SNP panel can be inferred to perform GS prediction. Note that, because SNP genotypes inferred by imputation may include a fraction of errors that is difficult to estimate, the DH lines should preferably be descendants or close relatives of the founder lines included in the HapMap, and strict SNP filtration must be done before imputation is performed in order to minimize the fraction of wrong genotype information.

With the help of decision-making models, inputs from human experience are largely minimized in a breeding pipeline. The main purpose is to reduce costs, and precision is not the top priority. Thus, the balance between cost and precision must be considered in actual breeding practices. As the cost of genotyping and phenotyping accounts for the main proportion of total expense in a breeding project, a GS project usually uses 20%–25% of the entire population to obtain both genotypic and phenotypic data to construct the training dataset. Under this ratio of training and testing samples, yield prediction accuracy may reach 0.5 to 0.6 as measured by the Pearson correlation coefficient, while the total cost may be reduced by approximately 30% to 40%. For example, a pilot maize breeding project used ~9000 hybrids to train a GS model and predicted the trait performance of ~34,000 untested hybrids, providing an in-depth understanding of the genetic mechanisms of heterosis and cross combinations for subsequent breeding cycles [6]. Another common issue in GS is population stratification when multiple panels of distantly related germplasm are involved in crossing. The proper partitioning of the training and prediction samples must be carefully considered to prevent serious overfitting.

A growing number of studies have illustrated the feasibility of integrating multi-omics data to further improve prediction precision based on DL or DNN to facilitate GS or genomic prediction (GP), such as the tools DeepGS and DNNGP [10,11]. However, direct use of multi-omics data in training a GS model is risky, as it may cause overfitting that is difficult to estimate due to the extremely high complexity of feature sets. Therefore, the aforementioned feature engineering on mTraits or iTraits must be utilized to reduce data dimensionality prior to model training. The resulting reduced-dimension vectors are then treated as features and combined with SNP genotypes to train GS models. Additionally, generation of multi-omics data is costly, and it is impossible to generate RNA sequencing (RNA-seq) or metabolome profiling data for each individual sample in each breeding cycle. We should only utilize the biological information derived from a set of multi-omics data, which is essentially the innate correlation of different omics datasets. Therefore, transfer learning with an interpretable DL framework is a promising way to transfer the network layers derived from multi-omics data and integrate them with SNP genotype features. In this way, the issues of sequencing cost and data complexity can both be addressed.

A commercial breeding pipeline can be partitioned into multiple stages, and each stage may generate data for building decision-making models. In theory, any problem solved by statistical models can also be solved by ML. However, thus far, only GS has been implemented using ML methods, and most other studies have been based on statistics. GS is widely employed for maize breeding due to the use of single-cross breeding in the modern maize industry: in this situation, genotyping parental inbred lines makes it possible to infer the F1 genotypes, greatly reducing genotyping costs. However, attention should be paid to the utility of GS for the breeding goals. GS is suitable for interrogating the general combining ability or heterotic performance between two parental pools using genome-wide genetic background, since heterosis is determined by genomic kinship rather than a few markers. Thus, the ultimate goal of GS is to accelerate the progress of genetic gain using in silico prediction to reduce field costs. Nevertheless, if the goal is to fine-tune a specific trait, such as stress tolerance, GS is unsuitable, and the ideal solution is molecular design breeding using a small set of trait-associated markers (also called genetic foreground) after the causal genes have been mapped.

Because GS may not solve all problems encountered in breeding, complementary models have been developed. For example, genome optimization via virtual simulation (GOVS) utilizes least-squares means to infer genomic fragments with beneficial effects on grain yield and simulates an assembly of all beneficial fragments as an optimized genome [12]. The simulated genome facilitates the selection of superior lines based on the number of beneficial fragments rather than the predicted phenotype. GOVS also helps identify lines with complementary sets of beneficial fragments. These complementary lines can be crossed, and doubled haploid technology can be used to precisely pyramid beneficial fragments.

Modeling the phenotypic plasticity of plants in response to the environment is another important way to facilitate decision-making during breeding. Phenotypic plasticity results from genotype–environment interactions (G×E) [13]. The G×E model helps identify the optimal ecological range for achieving the highest yield productivity and estimates yield stability across different ecological zones. If more complicated climate factors are considered, the model also helps estimate the influence of climate change on yield performance and grain quality and identifies the optimal genotypes adapted to climate change. However, most methods for modeling G×E are based on linear regression algorithms to
infer correlations between yield performance and a few envi­ human cancer prediction and classification. With the rapid gen­
ronmental factors. Statistical models have become unsuitable eration of omics data from plant germplasms, perhaps this
for modeling increasingly complicated genotypic, phenotypic, multi-modal learning algorithm could be used to address the
environmental, and climate datasets, prompting the need for problem of a limited sample size for model training.
ML methods. Another critical issue for modeling phenotypic
plasticity is heterogeneous plasticity between inbred lines and
hybrids, which strongly influences model precision and must Building an ecosystem for AI breeding
be considered when using ML methods to predict in plants
environment-specific traits from inbred to hybrid lines. A common consensus is that high-quality datasets and labels

Downloaded from https://academic.oup.com/gpb/article/22/4/qzae051/7703285 by Sichuan Agriculture University user on 20 October 2024


Although in theory, all problems solved by statistics can be are more important than ML models themselves. This rule also
all solved by ML, ML is not always the best choice. If the prob­ applies to breeding. A recent study evaluating 12 GS models by
lem is a “white box”, statistics should be used, especially when predicting 18 traits in six plant species showed that no single
the number of explicitly labeled samples is insufficient to cover method performed best across all traits and species [15].
all patterns that can be learned by an ML model. If the training Hyperparameter tuning is essential for achieving the best perfor­
dataset is smaller than the testing dataset, an ML model will mance using ML. This study revealed the complications of ap­
usually have lower prediction precision than a statistical model. plying ML to plant breeding, perhaps due to the complex
The scarcity of labeled samples is a common issue in breeding, composition of genetic materials and the influence of the envi­
not only because phenotyping is costly and labor intensive, but ronment on phenotypes. Thus, precision is not the only goal
also because certain traits are difficult to explicitly define and when applying ML to breeding: the robustness, extendibility,
accurately measure, such as biotic and abiotic stress-related and efficiency of a model must also be considered. An ML eco­
traits. Semi-supervised learning is a promising method for cop­ system specifically designed for AI breeding in plants is highly
ing with this issue, including positive-unlabeled learning, gener­ anticipated by the seed industry. This ecosystem must contain
ative adversarial network, contrastive learning, and transfer three major components: data, model, and application plat­
learning, but requires caution in its application [14]. If the data forms (Figure 1). The data platform should consist of unified
distribution is not uniform, inestimable overfitting may occur, pipelines for automated collection, processing, analysis, and
as the bias will be amplified by predicted labels. Another option storage of genotypic and phenotypic data, facilitated by cloud-
is multi-modal learning, which integrates complementary infor­ based computing. The model platform will include GS, G2P,
mation in multiple modalities to discover a latent representation G×E, and other decision-making models developed using ML
of the data. Joint DR (jDR) was effectively used to integrate and statistical methods, with automated modules used for
multi-source transcriptome, copy number variation (CNV), model selection, feature engineering, and hyperparameter tun­
microRNA, and methylome data from the same sample for ing. The application platform will consist of tools implemented

Figure 1 The infrastructure of an ecosystem for AI breeding in plants


The proposed ecosystem is composed of four major components. The first component is a data center that contains all types of multi-omics data
generated from a representative germplasm bank. The second component is a library of cutting-edge ML algorithms that can be used to either support
basic omics research in plants or offer solutions for building decision-making models in the seed industry. The third component is a knowledge base that
contains trait-related genes and causal variations derived from multi-omics data association analysis. The last component is an application platform that
contains a spectrum of bioinformatics tools and statistics-/ML-based decision-making models for AI breeding. AI, artificial intelligence; ML, machine
learning; SNP, single nucleotide polymorphism; InDel, insertion and deletion; PAV, presence and absence variation; GS, genomic selection; G2P,
genotype-to-phenotype; MAS, marker-assisted selection; G×E, genotype–environment interaction.
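
The caution about semi-supervised learning raised in the text (bias being amplified by predicted labels) can be illustrated with a small sketch of self-training, also called pseudo-labeling. This is one common semi-supervised scheme chosen here for illustration; the article does not single it out. The nearest-centroid classifier, the `margin` parameter, the function names, and the toy two-feature samples are all assumptions made for this example and do not come from the article or from any cited tool.

```python
def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_train(labeled, unlabeled, margin, max_rounds=10):
    """Hypothetical self-training loop with a nearest-centroid classifier.

    labeled:   list of (features, label) pairs, label in {0, 1};
               both classes must be present.
    unlabeled: list of feature vectors.
    margin:    minimum gap between the squared distances to the two
               class centroids required to accept a pseudo-label.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        # Refit the class centroids on true labels plus accepted pseudo-labels.
        c0 = centroid([x for x, y in labeled if y == 0])
        c1 = centroid([x for x, y in labeled if y == 1])
        accepted, undecided = [], []
        for x in pool:
            d0, d1 = sq_dist(x, c0), sq_dist(x, c1)
            if abs(d0 - d1) >= margin:
                accepted.append((x, 0 if d0 < d1 else 1))
            else:
                undecided.append(x)
        if not accepted:
            break  # nothing confident enough; stop instead of guessing
        labeled.extend(accepted)
        pool = undecided
    return labeled, pool


# Toy data: one labeled sample per class and three unlabeled samples.
seed = [([0.0, 0.0], 0), ([4.0, 4.0], 1)]
pool = [[0.5, 0.2], [3.8, 4.1], [2.0, 2.0]]

strict_labeled, strict_left = self_train(seed, pool, margin=2.0)
loose_labeled, loose_left = self_train(seed, pool, margin=1.0)
```

In this toy case the sample `[2.0, 2.0]` starts exactly equidistant from both labeled samples. With the stricter margin it remains unlabeled, whereas with the looser margin it is assigned to class 0 in the second round, only because the pseudo-labels accepted in the first round shifted the centroids. This is a small-scale picture of how predicted labels can reinforce an initial bias, and why a conservative acceptance threshold and an independent validation set are usually advisable.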

from the predictive models, equipped with a user-friendly interface to offer services and report results to end users. Such an ML ecosystem will make plant breeding smarter and easier in this era of AI.

CRediT author statement
Qian Cheng: Writing – review & editing. Xiangfeng Wang: Conceptualization, Writing – original draft, Writing – review & editing, Supervision. Both authors have read and approved the final manuscript.

Competing interests
Both authors have declared no competing interests.

Acknowledgments
This work was supported by the Biological Breeding-Major Projects (Grant No. 2023ZD04076), the Pinduoduo–China Agricultural University Research Fund (Grant No. PC2023B01012), and the Yangling Seed Innovation Center Key Research Project (Grant No. ylzy-ym-05), China.

ORCID
0000-0001-6873-8923 (Qian Cheng)
0000-0002-6406-5597 (Xiangfeng Wang)

References
[1] Ma C, Zhang HH, Wang X. Machine learning for Big Data analytics in plants. Trends Plant Sci 2014;19:798–808.
[2] Varshney RK, Sinha P, Singh VK, Kumar A, Zhang Q, Bennetzen JL. 5Gs for crop genetic improvement. Curr Opin Plant Biol 2020;56:190–6.
[3] Yan J, Wang X. Machine learning bridges omics sciences and plant breeding. Trends Plant Sci 2023;28:199–210.
[4] Liu S, Xu F, Xu Y, Wang Q, Yan J, Wang J, et al. MODAS: exploring maize germplasm with multi-omics data association studies. Sci Bull 2022;67:903–6.
[5] Yan J, Xu Y, Cheng Q, Jiang S, Wang Q, Xiao Y, et al. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol 2021;22:271.
[6] Weckwerth W, Ghatak A, Bellaire A, Chaturvedi P, Varshney RK. PANOMICS meets germplasm. Plant Biotechnol J 2020;18:1507–25.
[7] Xiao Y, Jiang S, Cheng Q, Wang X, Yan J, Zhang R, et al. The genetic mechanism of heterosis utilization in maize improvement. Genome Biol 2021;22:148.
[8] Yang W, Guo T, Luo J, Zhang R, Zhao J, Warburton ML, et al. Target-oriented prioritization: targeted selection strategy by integrating organismal and molecular traits through predictive analytics in breeding. Genome Biol 2022;23:80.
[9] Wang Q, Jiang S, Li T, Qiu Z, Yan J, Fu R, et al. G2P provides an integrative environment for multi-model genomic selection analysis to improve genotype-to-phenotype prediction. Front Plant Sci 2023;14:1207139.
[10] Ma W, Qiu Z, Song J, Li J, Cheng Q, Zhai J, et al. A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 2018;248:1307–18.
[11] Wang K, Abid MA, Rasheed A, Crossa J, Hearne S, Li H. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol Plant 2023;16:279–93.
[12] Cheng Q, Jiang S, Xu F, Wang Q, Xiao Y, Zhang R, et al. Genome optimization via virtual simulation to accelerate maize hybrid breeding. Brief Bioinform 2022;23:bbab447.
[13] Fu R, Wang X. Modeling the influence of phenotypic plasticity on maize hybrid performance. Plant Commun 2023;4:100548.
[14] Yan J, Wang X. Unsupervised and semi-supervised learning: the next frontier in machine learning for plant systems biology. Plant J 2022;111:1527–38.
[15] Azodi CB, Bolger E, McCarren A, Roantree M, de Los Campos G, Shiu SH. Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3 (Bethesda) 2019;9:3691–702.

© The Author(s) 2024. Published by Oxford University Press and Science Press on behalf of the Beijing Institute of Genomics, Chinese Academy of Sciences / China
National Center for Bioinformation and Genetics Society of China.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
