Papers by Gonzalo Navarro
2022 Data Compression Conference (DCC)
String Processing and Information Retrieval, 2017
The Block Tree is a recently proposed data structure that reaches compression close to Lempel-Ziv while supporting efficient direct access to text substrings. In this paper we show how a self-index can be built on top of a Block Tree so that it provides efficient pattern searches while using space proportional to that of the original data structure. More precisely, if a Lempel-Ziv parse cuts a text of length n into z nonoverlapping phrases, then our index uses O(z lg(n/z)) words and finds the occ occurrences of a pattern of length m in time O(m^2 lg n + occ lg^ε n) for any constant ε > 0.
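To make the parameter z concrete, the following is a minimal sketch of a greedy Lempel-Ziv parse, not the construction used in the paper. It assumes one common variant: each phrase is either a single character that has not appeared before, or the longest prefix of the remaining text that also starts at an earlier position (the copy may overlap the phrase itself). The names `lz_parse` and `lz_decode` are illustrative, and the quadratic-time search is chosen for clarity only.

```python
from typing import List, Optional, Tuple

# A phrase is (source, length, literal):
#   literal phrase: (None, 1, ch)
#   copy phrase:    (source_position, length, None)
Phrase = Tuple[Optional[int], int, Optional[str]]


def lz_parse(text: str) -> List[Phrase]:
    """Greedy left-to-right parse; len(result) plays the role of z."""
    phrases: List[Phrase] = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position j < i as a copy source.
        for j in range(i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            phrases.append((None, 1, text[i]))  # first occurrence of this character
            i += 1
        else:
            phrases.append((best_src, best_len, None))
            i += best_len
    return phrases


def lz_decode(phrases: List[Phrase]) -> str:
    """Rebuild the text; copying one character at a time handles overlapping sources."""
    out: List[str] = []
    for src, length, ch in phrases:
        if src is None:
            out.append(ch)
        else:
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)


if __name__ == "__main__":
    parse = lz_parse("abababab")
    print(len(parse), parse)
```

On a highly repetitive input such as "abababab" the parse has only three phrases (two literals and one long overlapping copy), which illustrates why z can be far smaller than n.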
To my parents and siblings. To my nieces and nephews. Acknowledgements. I consider that it is fair to thank my PhD supervisors: Nieves and Gonzalo. Without any kind of doubt, the research I have been doing during the last years would not have been the same without their assistance. Their knowledge and experience, their unconditional dedication, their tireless support (and patience), their advice... were always useful and showed me the way to follow. Thank you for your professionalism and particularly thank you for your friendship. On the other hand, I would not have reached this point without the support of my family. They always gave me the strength to carry on fighting for those things I longed for, and they were always close to me both in the good and in the bad moments. They all form a very important part of my life. First of all, this thesis is a gift that I want to dedicate to the most combative and strong people I have ever met: thank you mum and dad. My parents (Felicidad and Manolo) taught
String Processing and Information Retrieval, 2019
Suffix trees are a fundamental data structure in stringology, but their space usage, though linear, is an important problem for their applications. We design and implement a new compressed suffix tree targeted to highly repetitive texts, such as large genomic collections of the same species. Our suffix tree builds on Block Trees, a recent Lempel-Ziv-bounded data structure that captures the repetitiveness of its input. We use Block Trees to compress the topology of the suffix tree, and augment the Block Tree nodes with data that speeds up suffix tree navigation. Our compressed suffix tree is slightly larger than previous repetition-aware suffix trees based on grammars, but outperforms them in time, often by orders of magnitude. The component that represents the tree topology achieves a speed comparable to that of general-purpose compressed trees, while using 2.3-10 times less space, and might be of interest in other scenarios.
The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family of techniques that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows–Wheeler transform (BWT) and on the Compact Directed Acyclic Word Graph (CDAWG). The most space-efficient indexes, on the other hand, are based on the Lempel–Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this obse...
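The notion of a string attractor can be checked directly from its definition. The sketch below is a brute-force verifier only, assuming 0-based positions and the standard definition in which every distinct substring must have at least one occurrence that spans a position of the set. The name `is_attractor` is illustrative, and the cubic running time makes it suitable for tiny examples only; it says nothing about how the indexes in the paper are built.

```python
from typing import Iterable


def is_attractor(text: str, positions: Iterable[int]) -> bool:
    """Return True if every distinct substring of text has an occurrence
    that covers at least one of the given (0-based) positions."""
    pos = set(positions)
    n = len(text)
    for length in range(1, n + 1):
        seen = set()      # all distinct substrings of this length
        covered = set()   # those with an occurrence spanning an attractor position
        for start in range(n - length + 1):
            sub = text[start:start + length]
            seen.add(sub)
            if any(start <= p < start + length for p in pos):
                covered.add(sub)
        if seen != covered:
            return False
    return True


if __name__ == "__main__":
    print(is_attractor("abab", {0, 1}))  # both characters and all longer substrings are covered
    print(is_attractor("abab", {0}))     # the substring "b" never spans position 0
```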
This report documents the program and the outcomes of Dagstuhl Seminar 16431 “Computation over Compressed Structured Data”. Seminar: October 23–28, 2016, http://www.dagstuhl.de/16431. 1998 ACM Subject Classification: Coding and Information Theory, Data Structures.
ACM Computing Surveys, 2022
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outpaces Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic develop...
Journal of the ACM, 2020
Indexing highly repetitive texts (such as genomic databases, software repositories and versioned text collections) has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and ...
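As an illustration of the measure r, the sketch below builds the BWT naively from sorted rotations and counts its maximal runs of equal characters. It is a toy under stated assumptions: the sentinel `$` is assumed to be absent from the text and smaller than every text character, the function names are illustrative, and sorting all rotations is far too slow and space-hungry for real collections.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (quadratic space, toy use only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive characters (the measure r)."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

For "banana" the transform is "annb$aa", which has 5 runs. On highly repetitive inputs r tends to grow much more slowly than the text length, which is what makes O(r)-space indexes attractive.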
Fundamenta Informaticae, 2011
Theoretical Computer Science, 2019
String Processing and Information Retrieval, 2008
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive sets of highly repetitive sequence collections in a space-efficient manner so that retrieval of the content as well as queries on the content of the sequences can be provided time-efficiently. We show that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. We engineer some new structures that use run-length encoding and give empirical evidence that these structures are superior to the current structures.
Lecture Notes in Computer Science, 2009
String Processing and Information Retrieval, 2012
We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right hands of the rules), a basic grammar-based representation of T takes N lg n bits of space. Our representation requires 2N lg n + N lg u + ε n lg n + o(N lg n) bits of space, for any 0 < ε ≤ 1. It can find the positions of the occ occurrences of a pattern of length m in T in O((m^2/ε) lg(lg u / lg n) + occ lg n) time, and extract any substring of length ℓ of T in time O(ℓ + h lg(N/h)), where h is the height of the grammar tree.
Theoretical Computer Science, 2013
We address the problem of indexing a collection D = {T_1, T_2, ..., T_D} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queries in optimal O(p + k) time for an ample set of notions of relevance. However, using linear space is not sufficiently good for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times P appears in a document T_i), an index occupying |CSA| + o(n) bits answers the query in time O(t_search(p) + k lg^2 k lg^ε n), where CSA is a compressed suffix array indexing D, t_search is its time to find the suffix array interval of P, and ε > 0 is any constant. (2) With the same measure of relevance, an index occupying |CSA| + n lg D + o(n lg σ + n lg D) bits answers the query in time O(t_search(p) + k lg^* k), where lg^* k is the iterated logarithm of k. (3) When the relevance depends only on the documents, an index occupying |CSA| + O(n lg lg n) bits answers the query in O(t_search(p) + k t_SA) time, where t_SA is the time the CSA needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest.
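For reference, the query itself (top-k documents by term frequency) can be stated as a brute-force baseline. The sketch below scans every document at query time, so it has none of the space or time guarantees discussed in the abstract; it only pins down what a correct answer looks like. The function names, the choice to count overlapping occurrences, the exclusion of documents with zero occurrences, and the tie-breaking by document index are all assumptions made for illustration.

```python
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)  # step by one to allow overlaps
    return count


def top_k_by_frequency(docs: List[str], pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return up to k (document_index, term_frequency) pairs, most frequent first.
    Documents where the pattern does not occur are omitted."""
    scored = [(i, count_occurrences(doc, pattern)) for i, doc in enumerate(docs)]
    scored = [(i, tf) for i, tf in scored if tf > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by smaller index
    return scored[:k]


if __name__ == "__main__":
    collection = ["abab", "bbb", "ab", "ababab"]
    print(top_k_by_frequency(collection, "ab", 2))
```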
Lecture Notes in Computer Science, 2012
Lecture Notes in Computer Science, 2009
Lecture Notes in Computer Science, 2011
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1-2 million characters of the text per second, and finds patterns at a rate of 10-50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
Information Systems, 2015
Lecture Notes in Computer Science, 2014
Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. Several recent applications need to represent highly repetitive sequences, and classical statistical compression proves ineffective. We introduce, instead, grammar-based representations for repetitive sequences, which use up to 6% of the space needed by statistically compressed representations, and support direct access and rank/select operations within tens of microseconds. We demonstrate the impact of our structures in text indexing applications.
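The three operations named above can be pinned down with an uncompressed reference implementation. This is a sketch of their semantics only, under assumed conventions (0-based positions, rank over a half-open prefix, 1-based occurrence index for select, and -1 when the occurrence does not exist); the function names are illustrative, and the linear scans are exactly what compressed representations are designed to avoid.

```python
def access(seq: str, i: int) -> str:
    """Symbol at position i (0-based)."""
    return seq[i]


def rank(seq: str, symbol: str, i: int) -> int:
    """Number of occurrences of symbol in the prefix seq[0:i]."""
    return sum(1 for ch in seq[:i] if ch == symbol)


def select(seq: str, symbol: str, j: int) -> int:
    """Position of the j-th occurrence of symbol (j >= 1), or -1 if there is none."""
    seen = 0
    for position, ch in enumerate(seq):
        if ch == symbol:
            seen += 1
            if seen == j:
                return position
    return -1


if __name__ == "__main__":
    s = "abracadabra"
    print(access(s, 4), rank(s, "a", 5), select(s, "a", 3))
```

With these conventions, rank and select are inverses in the sense that rank(seq, c, select(seq, c, j) + 1) equals j whenever the j-th occurrence exists.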
String Processing and Information Retrieval, 2012
We design a new compressed suffix tree specifically tailored to highly repetitive text collections. This is particularly useful for sequence analysis on large collections of genomes of closely related species. We build on an existing compressed suffix tree that applies statistical compression, and modify it so that it works on the grammar-compressed version of the longest common prefix array, whose differential version inherits much of the repetitiveness of the text.
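The longest common prefix (LCP) array and a differential version of it can be computed naively as follows. This is a sketch under assumptions: the differential array is taken to be the sequence of differences between consecutive LCP values (with an implicit 0 before the first entry), which may differ in detail from the paper's definition, and the function names and quadratic-time construction are illustrative only. No grammar compression is performed here.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Starting positions of the suffixes of text in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: List[int]) -> List[int]:
    """lcp[i] = length of the common prefix of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    if not sa:
        return []

    def common_prefix(a: int, b: int) -> int:
        length = 0
        while (a + length < len(text) and b + length < len(text)
               and text[a + length] == text[b + length]):
            length += 1
        return length

    return [0] + [common_prefix(sa[i - 1], sa[i]) for i in range(1, len(sa))]


def differential(values: List[int]) -> List[int]:
    """Differences between consecutive values, treating the value before the first as 0."""
    return [current - previous for previous, current in zip([0] + values[:-1], values)]


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    print(sa, lcp, differential(lcp))
```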