BioSequence2Vec: Efficient Embedding Generation for Biological Sequences

Ali, Sarwan; Sardar, Usama; Patterson, Murray; Khan, Imdad Ullah

doi:10.1007/978-3-031-33377-4_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13936)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Representation learning is an important step in the machine learning pipeline. Given the current volume of biological sequencing data, learning an explicit representation is prohibitive because of the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven, efficient, and useful alternative for several machine learning (ML) tasks such as sequence classification. Kernel methods face three challenges: (i) the computation time, (ii) the memory usage (storing an \(n\times n\) matrix), and (iii) the restriction of kernel matrices to kernel-based ML methods (they are difficult to generalize to non-kernel classifiers). While (i) can be addressed with approximate methods, challenge (ii) remains for typical kernel methods. Similarly, although non-kernel-based ML methods can be applied to kernel matrices by extracting principal components (kernel PCA), doing so may lose information and is computationally expensive. In this paper, we propose a general-purpose representation learning approach that retains the qualities of kernel methods while avoiding their computation, memory, and generalizability challenges. The approach computes a low-dimensional embedding of each sequence using random projections of its k-mer frequency vector, significantly reducing both the computation needed for the dot product and the memory needed to store the resulting representation. The embeddings produced by our fast, alignment-free method can be used as input to any distance-based (e.g., k-nearest neighbors) or non-distance-based (e.g., decision tree) ML method for classification and clustering tasks. Using different forms of biological sequences as input, we perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, and outperform several state-of-the-art embedding and kernel methods in predictive performance.
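The abstract does not specify the projection used in the chapter, so the following is only a minimal sketch of the general idea: count k-mers, then project the count vector to a few dimensions with a random matrix. It assumes a DNA alphabet and a generic random sign (Rademacher) projection. The function names `kmer_counts`, `sign_matrix`, `embed`, and `dot`, and the values of `k` and `d`, are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import itertools
import math
import random


def kmer_counts(seq, k, alphabet="ACGT"):
    """Count every k-mer of seq; the result has len(alphabet)**k entries."""
    index = {"".join(p): i
             for i, p in enumerate(itertools.product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows containing symbols outside the alphabet
            counts[j] += 1
    return counts


def sign_matrix(d, m, seed=0):
    """d x m matrix of random +1/-1 entries (assumed projection, not the paper's)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(m)] for _ in range(d)]


def embed(counts, signs):
    """Project an m-dimensional count vector down to d = len(signs) dimensions."""
    scale = 1.0 / math.sqrt(len(signs))
    return [scale * sum(s * c for s, c in zip(row, counts)) for row in signs]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Illustrative usage: two short sequences, 3-mers, 16-dimensional embeddings.
k, d = 3, 16
signs = sign_matrix(d, 4 ** k)
x = kmer_counts("ACGTACGTGACCA", k)
y = kmer_counts("ACGTTCGTGACGA", k)
print("exact dot product:    ", dot(x, y))
print("embedding dot product:", dot(embed(x, signs), embed(y, signs)))
```

With the \(1/\sqrt{d}\) scaling, the dot product of two embeddings equals the dot product of the original count vectors in expectation over the random signs, so each sequence can be stored as a d-dimensional vector rather than as a row of an \(n\times n\) kernel matrix. The resulting vectors are ordinary feature vectors and can be passed to any classifier.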

M. Patterson and I. U. Khan — Joint Last Authors.

Notes

1. https://www.gisaid.org/

Author information

Authors and Affiliations

Georgia State University, Atlanta, GA, USA
Sarwan Ali & Murray Patterson
Lahore University of Management Sciences, Lahore, Pakistan
Usama Sardar & Imdad Ullah Khan

Corresponding authors

Correspondence to Murray Patterson or Imdad Ullah Khan.

Editor information

Editors and Affiliations

Kyoto University, Kyoto, Japan
Hisashi Kashima
IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Tsuyoshi Ide
National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng

About this paper

Cite this paper

Ali, S., Sardar, U., Patterson, M., Khan, I.U. (2023). BioSequence2Vec: Efficient Embedding Generation for Biological Sequences. In: Kashima, H., Ide, T., Peng, WC. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2023. Lecture Notes in Computer Science, vol 13936. Springer, Cham. https://doi.org/10.1007/978-3-031-33377-4_14

DOI: https://doi.org/10.1007/978-3-031-33377-4_14
Published: 28 May 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33376-7
Online ISBN: 978-3-031-33377-4
eBook Packages: Computer Science, Computer Science (R0)
