Papers by Pierre Peterlongo
A vast majority of bioinformatics tools dedicated to the treatment of raw sequencing data heavily use the concept of k-mers. This reduces data redundancy (and thus memory pressure), discards sequencing errors, and provides fixed-size objects that can be manipulated and easily compared to each other. A drawback is that the link between each k-mer and the original set of sequences it belongs to is generally lost. Given the volume of data considered in this context, recovering this association is costly. In this work, we present “back_to_sequences”, a simple tool designed to index a set of k-mers of interest and to stream a set of sequences, extracting those containing at least one of the indexed k-mers. In addition, the number of occurrences of k-mers in the sequences is provided. Our results show that back_to_sequences streams ≈200 short reads per millisecond, making it possible to search k-mers in hundreds of millions of reads in a matter of a few minutes. Availability...
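The index-then-stream idea described in this abstract can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `kmers` and `filter_sequences` are hypothetical, the index is a plain in-memory set, and reverse complements are ignored. It is not the back_to_sequences implementation.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_sequences(sequences, kmers_of_interest, k):
    """Stream sequences and yield (sequence, counts) for each sequence that
    contains at least one indexed k-mer. counts maps each indexed k-mer found
    in the sequence to its number of occurrences in that sequence."""
    index = set(kmers_of_interest)  # hypothetical stand-in for a real k-mer index
    for seq in sequences:
        counts = Counter(km for km in kmers(seq, k) if km in index)
        if counts:
            yield seq, counts


reads = ["ACGTACGT", "TTTTTTTT", "GGACGTGG"]
hits = list(filter_sequences(reads, ["ACGT"], k=4))
for seq, counts in hits:
    print(seq, dict(counts))
```

Because `filter_sequences` is a generator, the reads are never all held in memory at once, which mirrors the streaming behaviour the abstract describes.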
bioRxiv (Cold Spring Harbor Laboratory), Feb 18, 2024
Genomic data sequencing has become indispensable for elucidating the complexities of biological systems. As databases storing genomic information, such as the European Nucleotide Archive, continue to grow exponentially, efficient solutions for data manipulation are imperative. One fundamental operation that remains challenging is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets. This paper introduces a novel data structure indexing k-mers (substrings of length k), the Backpack Quotient Filter (BQF), which serves as an alternative to the Counting Quotient Filter (CQF). The BQF offers enhanced space efficiency compared to the CQF while retaining key properties, including abundance information and dynamicity, with a negligible false positive rate, below 10⁻⁵%. The approach involves a redefinition of how abundance information is handled within the structure, along with an independent strategy for space efficiency. We show that the BQF uses 4x less space than the CQF on some of the most complex data to index: sea-water metagenomics sequences. Furthermore, we show that space efficiency increases as the amount of data to be indexed increases, which is in line with the original objective of scaling to ever-larger datasets.
bioRxiv (Cold Spring Harbor Laboratory), Jun 4, 2023
Despite their wealth of biological information, public sequencing databases are largely underutilized. One cannot efficiently search for a sequence of interest in these immense resources. Sophisticated computational methods such as approximate membership query data structures allow searching for fixed-length words (k-mers) in large datasets. Yet they face scalability challenges when applied to thousands of complex sequencing experiments. In this context we propose kmindex, a new approach that uses inverted indexes based on Bloom filters. Thanks to its algorithmic choices and its fine-tuned implementation, kmindex makes it possible to index thousands of highly complex metagenomes into an index that answers sequence queries in a tenth of a second. Index construction is one order of magnitude faster than previous approaches, and query time is two orders of magnitude faster. Based on Bloom filters, kmindex achieves negligible false positive rates, below 0.01% on average. Its average false positive rate is four orders of magnitude lower than existing approaches, for similar index sizes. It has been successfully used to index 1,393 complex marine seawater metagenome samples of raw sequences from the Tara Oceans project, demonstrating its effectiveness on large and complex datasets. This level of scaling was previously unattainable. Building on the kmindex results, we provide a public web server named "Ocean Read Atlas" (ORA) at https://ocean-read-atlas.mio.osupytheas.fr/ that can answer queries against the entire Tara Oceans dataset in real time. kmindex is open-source software
bioRxiv (Cold Spring Harbor Laboratory), Jun 29, 2022
Motivations: Approximate membership query data structures (AMQ) such as Cuckoo filters or Bloom filters are widely used for representing and indexing large sets of elements. AMQ can be generalized to additionally count indexed elements; they are then called "counting AMQ". This is for instance the case of the "counting Bloom filters". However, counting AMQs suffer from false positive and overestimated calls. Results: In this work we propose a novel computation method, called fimpera, consisting of a simple strategy for reducing the false-positive rate of any AMQ indexing all k-mers (words of length k) from a set of sequences, along with their abundance information. This method decreases the false-positive rate of a counting Bloom filter by an order of magnitude while reducing the number of overestimated calls, as well as lowering the average difference between the overestimated calls and the ground truth. In addition, it slightly decreases the query run time. fimpera does not require any modification of the original counting Bloom filter, it does not generate false-negative calls, and it causes no memory overhead. The only drawback is that fimpera yields a new kind of false positives and overestimated calls. However, their number is negligible. fimpera requires a single parameter, and its results are only slightly affected when this parameter is set within recommended values. As a side note, for the algorithmic needs of the method, we also propose a novel generic algorithm for finding minimal values of a sliding window over a vector of x integers in O(x) time with zero memory allocation.
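To make the sliding-window minimum problem mentioned at the end of this abstract concrete, here is a sketch of the classical monotonic-deque solution, which also runs in O(x) time. It is not the algorithm proposed in the paper: the abstract does not describe that algorithm, and unlike it, this version allocates a deque and an output list. The function name `sliding_window_min` is an assumption made for illustration.

```python
from collections import deque


def sliding_window_min(values, w):
    """Return the minimum of every window of w consecutive values.

    Classical monotonic-deque technique: each index is pushed and popped
    at most once, so the total running time is O(len(values)).
    """
    if w <= 0:
        raise ValueError("window size must be positive")
    out = []
    candidates = deque()  # indices whose values strictly increase front to back
    for i, v in enumerate(values):
        # Drop candidates that can never be a minimum again.
        while candidates and values[candidates[-1]] >= v:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it has left the window.
        if candidates[0] <= i - w:
            candidates.popleft()
        # The first full window ends at index w - 1.
        if i >= w - 1:
            out.append(values[candidates[0]])
    return out


print(sliding_window_min([4, 2, 5, 1, 3, 6], 3))  # [2, 1, 1, 1]
```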
Bioinformatics, May 1, 2023
Motivation: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scale datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a significant false positive rate. Results: We propose a novel algorithm, called fimpera, that enables the improvement of any cAMQ performance. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduce the query time.
HAL (Le Centre pour la Communication Scientifique Directe), Jul 22, 2019
Allele-specific expression (ASE) is now a widely studied mechanism at cell, tissue and organism levels. However, population-level ASE and its evolutionary impacts have not yet been investigated. Here, we hypothesized a potential link between ASE and natural selection on the cosmopolitan copepod Oithona similis. We combined metagenomic and metatranscriptomic data from seven wild populations of the marine copepod O. similis sampled during the Tara Oceans expedition. We detected 587 single nucleotide variants (SNVs) under ASE and found a significant number (152) of SNVs under ASE in at least one population and under selection across all the populations. This constitutes a first evidence that selection and ASE target more common
HAL (Le Centre pour la Communication Scientifique Directe), Jul 3, 2018
... Gustavo AT Sacomoto, Janice Kielbassa, Pavlos Antoniou, Rayan Chikhi, Raluca Uricaru, Marie-France Sagot, Pierre Peterlongo*, Vincent Lacroix* ... The short sequencing reads then need to be reassembled in order to get back to the initial RNA molecules. ...
Molecular Ecology Resources, Nov 11, 2013
Assessing the genetic variability of the tick Ixodes ricinus, an important vector of pathogens in Europe, is an essential step for setting up antitick control methods. Here, we report the first identification of a set of SNPs isolated from the genome of I. ricinus, by applying a reduction in genomic complexity, pyrosequencing and new bioinformatics tools. Almost 1.4 million reads (average length: 528 nt) were generated with a full Roche 454 GS FLX run on two reduced representation libraries of I. ricinus. A newly developed bioinformatics tool (DiscoSnp), which isolates SNPs without requiring any reference genome, was used to obtain 321 088 putative SNPs. Stringent selection criteria were applied in a bioinformatics pipeline to select 1768 SNPs for the development of specific primers. Among 384 randomly selected SNPs tested by Fluidigm genotyping technology on 464 individual ticks, 368 SNP loci (96%) exhibited the presence of the two expected alleles. Hardy-Weinberg equilibrium tests conducted on six natural populations of ticks have shown that from 26 to 46 of the 384 loci exhibited significant heterozygote deficiency.
HAL (Le Centre pour la Communication Scientifique Directe), Jul 2, 2019
arXiv (Cornell University), Aug 31, 2022
In this work, we consider the problem of pattern matching under the dynamic time warping (DTW) distance, motivated by potential applications in the analysis of biological data produced by third-generation sequencing. To measure the DTW distance between two strings, one must "warp" them, that is, double some letters in the strings to obtain two equal-length strings, and then sum the distances between the letters in the corresponding positions. When the distances between letters are integers, we show that for a pattern P with m runs and a text T with n runs: 1. There is an O(m + n)-time algorithm that computes all locations where the DTW distance from P to T is at most 1; 2. There is an O(kmn)-time algorithm that computes all locations where the DTW distance from P to T is at most k. As a corollary of the second result, we also derive an approximation algorithm for general metrics on the alphabet.
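The definition of the DTW distance given in this abstract can be illustrated with the textbook dynamic program, which takes time proportional to the product of the two string lengths. This is not one of the run-length-based algorithms announced in the abstract, and it compares two whole strings rather than reporting locations in a text. The function name `dtw_distance` and the default letter distance (absolute difference of character codes) are assumptions chosen for the example.

```python
def dtw_distance(p, t, dist=lambda a, b: abs(ord(a) - ord(b))):
    """DTW distance between two non-empty strings p and t.

    Textbook dynamic program: D[i][j] is the cost of the best warping of
    p[:i] and t[:j], with
    D[i][j] = dist(p[i-1], t[j-1]) + min(D[i-1][j], D[i][j-1], D[i-1][j-1]).
    Only two rows of the table are kept in memory.
    """
    if not p or not t:
        raise ValueError("both strings must be non-empty")
    inf = float("inf")
    prev = [inf] * (len(t) + 1)
    prev[0] = 0  # empty prefix against empty prefix costs nothing
    for a in p:
        cur = [inf] * (len(t) + 1)
        for j, b in enumerate(t, start=1):
            cur[j] = dist(a, b) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(t)]


print(dtw_distance("aab", "ab"))   # 0: doubling the 'a' of "ab" gives "aab"
print(dtw_distance("abc", "abd"))  # 1: only 'c' versus 'd' differ
```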
Motivations: Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large data sets or consider reads as mere sequences of k-mers, without taking into account their full-length read information. Results: We propose a new method to correct short reads using de Bruijn graphs, and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance and then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. Availability and Implementation: The implementation is open source and available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package.
bioRxiv (Cold Spring Harbor Laboratory), Feb 17, 2021
When indexing large collections of sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI, ...) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are 1/ an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting hashes instead of k-mers; 2/ a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. In addition, our experimental results highlight that the usual yet crude filtering of low-abundant k-mers is inappropriate for complex data such as metagenomes.
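The "one Bloom filter per sample" layout that this abstract starts from can be sketched as follows. This is a toy illustration, not kmtricks: it inserts every k-mer without any abundance filtering or joint counting, and the class `BloomFilter`, the function `index_samples`, and the default sizes are all hypothetical choices made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (no deletions, no counting)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with the hash number.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def index_samples(samples, k, n_bits=1 << 16, n_hashes=3):
    """Build one Bloom filter per sample from the k-mers of its reads."""
    filters = {}
    for name, reads in samples.items():
        bf = BloomFilter(n_bits, n_hashes)
        for read in reads:
            for i in range(len(read) - k + 1):
                bf.add(read[i:i + k])
        filters[name] = bf
    return filters


filters = index_samples({"s1": ["ACGTAC"], "s2": ["TTTTGG"]}, k=4)
print([name for name, bf in filters.items() if "ACGT" in bf])
```

Querying a k-mer then amounts to testing it against each sample's filter, with the usual caveat that a positive answer may be a false positive.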
eLife, Aug 3, 2022
Biogeographical studies have traditionally focused on readily visible organisms, but recent technological advances are enabling analyses of the large-scale distribution of microscopic organisms, whose biogeographical patterns have long been debated. Here we assessed the global structure of plankton geography and its relation to the biological, chemical, and physical context of the ocean (the 'seascape') by analyzing metagenomes of plankton communities sampled across oceans during the Tara Oceans expedition, in light of environmental data and ocean current transport. Using a consistent approach across organismal sizes that provides unprecedented resolution to measure changes in genomic composition between communities, we report a pan-ocean, size-dependent plankton biogeography overlying regional heterogeneity. We found robust evidence for a basin-scale impact of transport by ocean currents on plankton biogeography, and on a characteristic timescale of community dynamics going beyond simple seasonality or life history transitions of plankton. Editor's evaluation: Richter and colleagues present an impressive analysis of metagenomic, OTU and imaging data collected from >100 ocean locations worldwide, with the purpose of elucidating the role of large-scale currents on global-scale marine plankton biogeography. The topic is exciting and timely.