TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

Cavlak, Meryem Banu; Singh, Gagandeep; Alser, Mohammed; Firtina, Can; Lindegger, Joël; Sadrosadati, Mohammad; Ghiasi, Nika Mansouri; Alkan, Can; Mutlu, Onur

doi:10.3389/fgene.2024.1429306

arXiv:2212.04953 (q-bio)

[Submitted on 9 Dec 2022 (v1), last revised 23 Oct 2024 (this version, v3)]

Abstract: Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall (1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) recall in keeping on-target reads, (2) maintains high accuracy in downstream analysis, and (3) achieves better runtime performance, throughput, recall, precision, and generality compared to prior works. TargetCall is available at this https URL.
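
The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `light_basecall` and `accurate_basecall` are hypothetical callables standing in for LightCall and the full basecaller, and the k-mer overlap test is a toy stand-in for Similarity Check, whose actual matching method is not described on this page. The parameters `k` and `min_shared_fraction` are likewise illustrative, and the sketch ignores reverse-complement matches for brevity.

```python
from typing import Callable, Iterable, List, Set, TypeVar

Signal = TypeVar("Signal")  # raw nanopore signal; representation is left open


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_on_target(noisy_read: str, reference_kmers: Set[str],
                 k: int, min_shared_fraction: float) -> bool:
    """Toy similarity check (assumption): a read counts as on-target if
    enough of its k-mers also occur in the target reference."""
    read_kmers = kmers(noisy_read, k)
    if not read_kmers:
        return False  # read shorter than k: nothing to compare
    shared = len(read_kmers & reference_kmers)
    return shared / len(read_kmers) >= min_shared_fraction


def filter_then_basecall(signals: Iterable[Signal],
                         light_basecall: Callable[[Signal], str],
                         accurate_basecall: Callable[[Signal], str],
                         reference: str,
                         k: int = 4,
                         min_shared_fraction: float = 0.5) -> List[str]:
    """Run the cheap basecaller on every signal, and the expensive one
    only on signals whose noisy read looks on-target."""
    reference_kmers = kmers(reference, k)
    reads = []
    for signal in signals:
        noisy_read = light_basecall(signal)  # fast, low accuracy
        if is_on_target(noisy_read, reference_kmers, k, min_shared_fraction):
            reads.append(accurate_basecall(signal))  # slow, high accuracy
    return reads
```

The point of the structure is that the expensive basecaller is never invoked for signals rejected by the filter, so the saved work grows with the fraction of off-target reads.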

Subjects:	Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2212.04953 [q-bio.GN]
	(or arXiv:2212.04953v3 [q-bio.GN] for this version)
	https://doi.org/10.48550/arXiv.2212.04953
Related DOI:	https://doi.org/10.3389/fgene.2024.1429306

Submission history

From: Can Firtina
[v1] Fri, 9 Dec 2022 16:03:34 UTC (637 KB)
[v2] Thu, 14 Sep 2023 15:42:54 UTC (528 KB)
[v3] Wed, 23 Oct 2024 13:36:37 UTC (552 KB)
