Pierre Peterlongo


Papers by Pierre Peterlongo

Back to sequences: find the origin of k-mers

A vast majority of bioinformatics tools dedicated to the treatment of raw sequencing data heavily use the concept of k-mers. This enables us to reduce the data redundancy (and thus the memory pressure), to discard sequencing errors, and to work with objects of fixed size that can be manipulated and easily compared to each other. A drawback is that the link between each k-mer and the original set of sequences it belongs to is generally lost. Given the volume of data considered in this context, recovering this association is costly. In this work, we present “back_to_sequences”, a simple tool designed to index a set of k-mers of interest and to stream a set of sequences, extracting those containing at least one of the indexed k-mers. In addition, the number of occurrences of k-mers in the sequences is provided. Our results show that back_to_sequences streams ≈200 short reads per millisecond, making it possible to search k-mers in hundreds of millions of reads in a matter of a few minutes. Availability...
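
The index-then-stream idea described in this abstract can be illustrated with a minimal Python sketch. It is not the back_to_sequences implementation: the function names (`kmers`, `index_kmers`, `filter_sequences`) are hypothetical, sequences are plain strings rather than FASTA/FASTQ records, and reverse complements are ignored.

```python
from typing import Dict, Iterable, Iterator, Tuple


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every substring of length k of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def index_kmers(kmers_of_interest: Iterable[str]) -> Dict[str, int]:
    """Hypothetical index: each k-mer of interest mapped to an occurrence counter."""
    return {kmer: 0 for kmer in kmers_of_interest}


def filter_sequences(
    sequences: Iterable[str], index: Dict[str, int], k: int
) -> Iterator[Tuple[str, int]]:
    """Stream sequences and yield those containing at least one indexed k-mer.

    Each yielded pair is (sequence, number of indexed k-mer occurrences in it).
    The per-k-mer counters in `index` are updated as a side effect.
    """
    for seq in sequences:
        hits = 0
        for kmer in kmers(seq, k):
            if kmer in index:
                index[kmer] += 1
                hits += 1
        if hits > 0:
            yield seq, hits


if __name__ == "__main__":
    index = index_kmers(["ACG", "TTT"])
    reads = ["AACGT", "GGGGG", "TTTTACG"]
    for read, hits in filter_sequences(reads, index, k=3):
        print(read, hits)
    print(index)
```

Because the sequences are consumed one at a time, only the index has to fit in memory. As a back-of-envelope check of the reported throughput: at about 200 reads per millisecond, 100 million reads take roughly 500 seconds, a little over 8 minutes.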

The Backpack Quotient Filter: a dynamic and space-efficient data structure for querying k-mers with abundance

bioRxiv (Cold Spring Harbor Laboratory), Feb 18, 2024

Genomic data sequencing has become indispensable for elucidating the complexities of biological systems. As databases storing genomic information, such as the European Nucleotide Archive, continue to grow exponentially, efficient solutions for data manipulation are imperative. One fundamental operation that remains challenging is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets. This paper introduces a novel data structure indexing k-mers (substrings of length k), the Backpack Quotient Filter (BQF), which serves as an alternative to the Counting Quotient Filter (CQF). The BQF offers enhanced space efficiency compared to the CQF while retaining key properties, including abundance information and dynamicity, with a negligible false positive rate, below 10^-5 %. The approach involves a redefinition of how abundance information is handled within the structure, along with an independent strategy for space efficiency. We show that the BQF uses 4x less space than the CQF on some of the most complex data to index: sea-water metagenomics sequences. Furthermore, we show that space efficiency increases as the amount of data to be indexed increases, which is in line with the original objective of scaling to ever-larger datasets.
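
For readers unfamiliar with quotient-filter-style structures, the toy Python sketch below shows the general quotienting idea only: a hash of the k-mer is split into a quotient (which selects a slot) and a remainder (which is stored), with a count kept next to the remainder. It is not the BQF or CQF layout, which the abstract does not detail; the class name `ToyQuotientCounter`, the bit widths, and the dictionary-based storage are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Tuple


class ToyQuotientCounter:
    """Toy model of quotienting with counts (NOT the actual BQF/CQF layout)."""

    def __init__(self, q_bits: int = 8, r_bits: int = 8) -> None:
        self.q_bits = q_bits
        self.r_bits = r_bits
        # slot address (quotient) -> {remainder: count}; a real filter would
        # use a compact array of slots with metadata bits instead of dicts.
        self.slots: Dict[int, Dict[int, int]] = {}

    def _fingerprint(self, kmer: str) -> Tuple[int, int]:
        """Hash the k-mer and split the fingerprint into (quotient, remainder)."""
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big") & ((1 << (self.q_bits + self.r_bits)) - 1)
        return h >> self.r_bits, h & ((1 << self.r_bits) - 1)

    def add(self, kmer: str, count: int = 1) -> None:
        """Insert a k-mer or increase its stored abundance (dynamic updates)."""
        quotient, remainder = self._fingerprint(kmer)
        slot = self.slots.setdefault(quotient, {})
        slot[remainder] = slot.get(remainder, 0) + count

    def abundance(self, kmer: str) -> int:
        """Return the stored count, or 0 if the fingerprint is absent.

        Two distinct k-mers sharing a fingerprint are indistinguishable,
        which is the source of false positives in this kind of structure.
        """
        quotient, remainder = self._fingerprint(kmer)
        return self.slots.get(quotient, {}).get(remainder, 0)


if __name__ == "__main__":
    f = ToyQuotientCounter()
    for _ in range(3):
        f.add("ACGTACGT")
    print(f.abundance("ACGTACGT"))
```

Only the remainder needs to be stored, since the quotient is implied by the slot position; this is the generic space-saving idea behind quotient filters, while the specific gains reported for the BQF come from design choices not reproduced here.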
