Bundled Suffix Trees

dmi.units.it

Luca Bortolussi (1), Francesco Fabris (2), Alberto Policriti (1)
(1) Department of Mathematics and Computer Science, University of Udine
(2) Department of Mathematics and Computer Science, University of Trieste
Udine, 22 March 2005

Outline
1. Motivation
   Suffix Trees
2. Bundled Suffix Trees
   Non-Transitive Relations
   Definition
   Size and Construction
3. Applications
4. Summary

Introduction
- Since the discovery of DNA, biology has given rise to many challenging string problems.
- Important challenge: find repeated patterns in DNA that are biologically significant.
- Feature: patterns are repeated with errors. (Approximate pattern discovery is difficult.)
- Other feature (more difficult): formalization of "biologically significant".

Suffix Trees
[Figure: suffix tree for the string bcabbabc]
- A suffix tree is a data structure which exploits the internal structure of a string.
- Efficient for:
  - the Exact String Matching Problem
  - the Longest Exact Common Substring Problem
  - identifying exactly repeated patterns
- Suffix trees are linear in size (w.r.t. the text length), and can be built in linear time.

E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976.
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-260, 1995.

- Suffix trees are not a natural tool for approximate string matching problems (positive Hamming or edit distance).

G.M. Landau, U. Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239-249, 1986.
D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997.

- The Longest Common Approximate Substring Problem and the extraction of approximately repeated patterns cannot be solved as in the exact case.

Extending Suffix Trees
THE PROJECT
Exploring the possibility of using different tree-based structures to tackle approximate string matching problems.
SO FAR
We developed Bundled Suffix Trees, an extension of suffix trees such that:
- they incorporate information about "errors";
- like suffix trees, they can be used for the Longest Common Approximate Substring Problem and for extracting approximately repeated patterns.

How?
- Coding the concept of "error" in some suitable way (i.e. with a relation among letters).
- Adding to the suffix tree some information about the relation.
- Controlling the combinatorial explosion of the structure.
- Designing good algorithms to build and manage it.

Non-Transitive Relation
- Character matching is a relation among letters (in fact, it is the equality relation).
- Approximate matching can also be modeled as a non-transitive relation among letters (bigger than equality!): two strings "match" if all their corresponding letters are in relation.

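A minimal sketch of what "matching under a relation" could look like in code. The relation used here (a related to b, b related to c, but a not related to c) is only an illustrative assumption, as are the function names; the relation is taken to be reflexive and symmetric.

```python
from itertools import product

# Hypothetical example relation over {a, b, c}: a~b and b~c, but NOT a~c.
PAIRS = {("a", "b"), ("b", "c")}

def related(x, y, pairs=PAIRS):
    """Two letters are in relation if they are equal or listed as a pair (either order)."""
    return x == y or (x, y) in pairs or (y, x) in pairs

def strings_match(u, v, pairs=PAIRS):
    """Equal-length strings match if the letters at every position are in relation."""
    return len(u) == len(v) and all(related(x, y, pairs) for x, y in zip(u, v))

def is_transitive(alphabet, pairs=PAIRS):
    """Check x~y and y~z implies x~z over all triples of the alphabet."""
    return all(
        related(x, z, pairs)
        for x, y, z in product(alphabet, repeat=3)
        if related(x, y, pairs) and related(y, z, pairs)
    )
```

With an empty set of pairs the relation collapses to equality, which is transitive; adding the two pairs above makes it strictly larger than equality and non-transitive.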
Non-Transitive Relation: An Example
Modelling a relation based on Hamming distance:
- Start from a basic alphabet (e.g. binary: A = {0, 1}).
- Construct an alphabet composed of macrocharacters (e.g. pairs of bits: {00, 01, 10, 11}).
- Impose that two macrocharacters x, y are in relation if and only if $d_H(x, y) \le 1$ (the relation is non-transitive).

The Relation Graph
00 ↔ 01
↕    ↕
10 ↔ 11

Bundled Suffix Trees
THE IDEA
- We start with a suffix tree for a text S and a (non-transitive) relation among letters.
- We compare suffixes S[i] and S[j]. If a prefix S[i ... i+k] of S[i] is in relation with a prefix S[j ... j+k] of S[j], we put a marker after S[j ... j+k] in the tree: a red node labeled with i.

Bundled Suffix Tree: An Example
Text: bcabbabc; relation: a ↔ b ↔ c
- We start from the suffix tree for the string bcabbabc. The alphabet is {a, b, c}, and the relation is a ↔ b ↔ c.
- Let's compare suffix 3 (abbabc) and suffix 1 (bcabbabc). According to our relation, the maximal prefix of suffix 3 that is in relation with a prefix of suffix 1 is abbab.
- Therefore, after bcabb, we put in the tree a red node with label 3. Due to symmetry, there is also a red node with label 1 after abbab.
- If we repeat this process for every pair of suffixes, we build a Bundled Suffix Tree!
- Note that this data structure lies midway between a suffix tree and a suffix trie.

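The pairwise process described in the example can be sketched directly on strings, without building the tree. This is a simplified illustration, not the authors' implementation: it records each red node as a pair (path label, suffix number), and the choices to skip the pair i = j and to ignore empty matches are assumptions. All function names are hypothetical.

```python
def related(x, y, pairs):
    """Reflexive, symmetric relation given by a set of letter pairs."""
    return x == y or (x, y) in pairs or (y, x) in pairs

def longest_related_prefix(s, i, j, pairs):
    """Length of the longest prefix of s[i:] in relation with a prefix of s[j:]."""
    k = 0
    while i + k < len(s) and j + k < len(s) and related(s[i + k], s[j + k], pairs):
        k += 1
    return k

def red_nodes(s, pairs):
    """Set of (path, label): a red node labeled with suffix number `label`
    (1-based, as in the slides) placed after `path` in the tree."""
    nodes = set()
    for i in range(len(s)):
        for j in range(len(s)):
            if i == j:
                continue  # assumption: a suffix is not compared with itself
            k = longest_related_prefix(s, i, j, pairs)
            if k > 0:
                nodes.add((s[j:j + k], i + 1))
    return nodes

nodes = red_nodes("bcabbabc", {("a", "b"), ("b", "c")})
# The deepest red node gives a longest common approximate substring.
deepest = max(nodes, key=lambda node: len(node[0]))
```

On the running example this reproduces the two red nodes discussed above: label 3 after bcabb and label 1 after abbab. The double loop is quadratic in the number of suffixes, so it is only meant to make the definition concrete.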
Bundled Suffix Tree: An Example (continued)
- This tree can be used to solve the Longest Common Approximate Substring Problem with respect to a given relation: we just have to find the lowest red node!
- Similarly, we can also extract information about approximately repeated patterns.

How Big?
The number of red nodes inserted depends on:
- the relation: in the worst case, the number of red nodes is quadratic in the length of the text S;
- the structure of the text: on average, the number of red nodes is bounded by $m^{1+\delta}$, with $\delta = \log_{1/p_+} C$, where m is the length of the text, $p_+$ is the frequency of the most common letter in S, and C depends on the relation.
$1 + \delta$ is slightly greater than one!

How Fast?
Naive Algorithm
- The naive algorithm for building a BuST simply tries to "match" every suffix of the text along every branch of the suffix tree, until a "mismatch" is found.
- It can be quadratic in the worst case. However, an analysis based on the average shape of a suffix tree shows that its average complexity is bounded by $m^{1+\delta'}$ ($\delta'$ just slightly greater than $\delta$).

W. Szpankowski. A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors. SIAM J. Comput., 22(6):1176-1198, 1993.
P. Jacquet, B. McVey, W. Szpankowski. Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of Depth. Journal of the Iranian Statistical Society, 3:139-148, 2004.

Faster
Efficient Algorithm
- We found a "McCreight-like" algorithm that is linear in the size of the output.
Intuitions:
- It processes the suffixes backwards.
- It is based on the concept of inverse suffix links.
- It identifies the red nodes for suffix i by processing the red nodes for suffix i + 1.

Tests
- We have implemented the naive algorithm for the construction of BuST.
- We have tested it with relations induced by Hamming distance, defined over DNA macrocharacters.
- With macrocharacters of size 4, such that two of them are in relation iff their Hamming distance is ≤ 1, the algorithm is quite fast, and can process texts of 100 Kb in a few seconds.
- The number of red nodes grows with an exponent smaller than the predicted one.

Hunting TFBS
- We are using BuST to identify candidate transcription factor binding sites (TFBS) in DNA sequences.
- The algorithm first constructs the BuST for the set of sequences under analysis, and then extracts and combines the information contained in it.
- The relation used is defined by a Hamming distance criterion.
- The algorithm is quite fast: for instance, we are able to solve the benchmark proposed by Pevzner et al. in a few seconds.
- The general concept of non-transitive relation seems very fruitful: it can be used to encode Hamming distance, but also to tackle edit distance or to encode other biologically driven relations.

P.A. Pevzner and S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc Int Conf Intell Syst Mol Biol. 2000;8:269-78.
G. Pavesi, G. Mauri and G. Pesole. In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics, 5(3):1-20, 2004.

Conclusions
- We have introduced Bundled Suffix Trees, a new data structure extending suffix trees.
- It can be used to extract approximate information from a string, and it is manipulated similarly to suffix trees.
- The structure is based on a very general concept of non-transitive relation among (macro)characters.
- Its size is slightly more than linear on average, and there is a fast (McCreight-like) algorithm to build it.
- It can be used to discover approximate patterns in a text. For instance, it can be used to identify candidates for TFBS.

Additional slides: Suffix Trees, Dimension of BuST, Efficient Algorithm, The Benchmark

Naive Construction of Suffix Trees
- We start from the string bcabbabc.
- We put down the whole string in a branch.
- We try to match the suffix down the tree.
[Figure: step-by-step construction, ending with the suffix tree for bcabbabc]

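The naive construction can be sketched as follows. To keep the code short, this builds an uncompressed suffix trie (one character per edge) rather than a compressed suffix tree, which is a deliberate simplification; the terminator character "$" and all function names are assumptions made for the illustration.

```python
def build_suffix_trie(text, end="$"):
    """Insert every suffix of text + end into a nested-dict trie.
    Each suffix is matched down the trie; new branches are created at the first mismatch."""
    s = text + end
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def occurs(trie, pattern):
    """Exact string matching: the pattern occurs iff it spells a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

def count_leaves(trie):
    """Each leaf corresponds to one suffix of text + end."""
    if not trie:
        return 1
    return sum(count_leaves(child) for child in trie.values())

trie = build_suffix_trie("bcabbabc")
```

This takes quadratic time and space in the worst case; the linear-time constructions cited earlier (McCreight, Ukkonen) avoid that by compressing unary paths and using suffix links.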
Ukkonen Algorithm
- It is an on-line, linear-time algorithm.
- It constructs the tree incrementally for S[1 ... i], i = 1, ..., m.
- At every step it extends the frontier: the set of points in the tree that are possible branching points (the position of the longest suffix of S[1 ... i] that has another occurrence in S[1 ... i]).
- This frontier can be manipulated efficiently thanks to suffix links (pointers from nodes with label xα to nodes with label α).

Quadratic BuST
Let's consider the text
$\underbrace{a \cdots a}_{m}\,\underbrace{b \cdots b}_{m}\,\underbrace{c \cdots c}_{2m}$
over {a, b, c, d}, with the relation
a ↔ b
↕   ↕
d ↔ c
[Figure: BuST for this text, with a group of nodes enclosed in a red box]
The number of nodes surrounded by the red box is quadratic in m!

The exponent δ
[Figure: plot of the exponent δ]

Test
[Figure: test results] Number of macrocharacters of length 4 over DNA alphabet. Test strings are generated according to a uniform probability distribution.

Inverse Suffix Links
- A crucial role in the fast construction of suffix trees is played by suffix links.
- Suffix links are pointers from nodes with path label xα to nodes with path label α.
- Whenever there is a node with path label xα, there is also a node with path label α.
- Inverse suffix links (ISLs) are pointers from nodes with path label α to positions in the tree labeled xα, for each x in the alphabet such that xα is a substring of S.
- They can point to the middle of an arc.
- If an ISL leads from α to xα, it is labeled with x.

The Algorithm
- Inverse suffix links can be used to identify the red nodes for suffix S[i] from the red nodes for suffix S[i + 1].
- Suppose we know the location of a red node for suffix S[i + 1], and that it is just under a "black" node with path label α.
- From this node, we can cross all inverse suffix links such that S(i) is in relation with the character labeling the ISL.
- With a skip-and-count trick, we can identify the positions of the red nodes for S[i].

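A string-level illustration of the two definitions used by the algorithm: which inverse suffix links leave a path label α, and which of them may be crossed when moving from suffix i + 1 to suffix i. This sketch scans the text instead of following pointers in the tree, so it shows the definitions only, not the efficient algorithm or the skip-and-count step. The function names and the example relation a ↔ b ↔ c are assumptions.

```python
PAIRS = {("a", "b"), ("b", "c")}  # example relation a~b~c, assumed reflexive and symmetric

def related(x, y, pairs=PAIRS):
    return x == y or (x, y) in pairs or (y, x) in pairs

def inverse_suffix_link_labels(text, alpha):
    """Characters x such that x + alpha is a substring of text,
    i.e. the labels of the inverse suffix links leaving alpha."""
    labels = set()
    start = text.find(alpha)
    while start != -1:
        if start > 0:
            labels.add(text[start - 1])
        start = text.find(alpha, start + 1)
    return labels

def links_to_cross(text, alpha, i, pairs=PAIRS):
    """ISL labels at alpha that are in relation with S(i), using 1-based i as in the slides."""
    return {x for x in inverse_suffix_link_labels(text, alpha)
            if related(text[i - 1], x, pairs)}
```

For the text bcabbabc, the path label a has inverse suffix links labeled b and c (both ba and ca occur). When processing suffix 3, whose first letter is a, only the link labeled b can be crossed, since a and c are not in relation.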
The Benchmark
- There is a set of 20 strings of length 1000, generated according to a uniform distribution over the DNA alphabet.
- There is a pattern p of length 16, such that 20 of its occurrences are implanted in the strings, with 4 mutations occurring in random positions.
- The problem is to identify p (the signal), given the strings.

P.A. Pevzner and S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc Int Conf Intell Syst Mol Biol. 2000;8:269-78.

A solution with BuST
- We used macrocharacters of length 4 (two of them are in relation if their Hamming distance is ≤ 1).
- We built the generalized BuST for the strings (converted into macrocharacters in every possible way).
- For every substring of length 16 of the 20 strings, we looked at the set of substrings in relation with it, and we combined this information to find p.
- It's a naive use of BuST, but it works!
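A hedged sketch of how an instance of this benchmark and the macrocharacter relation might be set up. It is not the authors' code. Assumptions: exactly one mutated occurrence is implanted per string, each mutation changes the letter at a distinct position, and "every possible way" of converting to macrocharacters means the four starting offsets 0 to 3. All names are hypothetical.

```python
import random

DNA = "ACGT"

def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(u, v))

def mutate(pattern, n_mut, rng):
    """Change the letter at n_mut distinct random positions."""
    chars = list(pattern)
    for pos in rng.sample(range(len(chars)), n_mut):
        chars[pos] = rng.choice([c for c in DNA if c != chars[pos]])
    return "".join(chars)

def make_benchmark(n_strings=20, length=1000, pat_len=16, n_mut=4, seed=0):
    """Uniform random DNA strings, each with one mutated copy of the pattern implanted."""
    rng = random.Random(seed)
    pattern = "".join(rng.choice(DNA) for _ in range(pat_len))
    strings, positions = [], []
    for _ in range(n_strings):
        background = [rng.choice(DNA) for _ in range(length)]
        pos = rng.randrange(length - pat_len + 1)
        background[pos:pos + pat_len] = mutate(pattern, n_mut, rng)
        strings.append("".join(background))
        positions.append(pos)
    return pattern, strings, positions

def macrochars(s, size=4, offset=0):
    """Split s into consecutive macrocharacters of `size` letters, starting at `offset`."""
    return [s[k:k + size] for k in range(offset, len(s) - size + 1, size)]

def macro_related(x, y):
    """Macrocharacters are in relation iff their Hamming distance is at most 1."""
    return hamming(x, y) <= 1

pattern, strings, positions = make_benchmark()
# One macrocharacter sequence per starting offset for the first string.
framings = [macrochars(strings[0], 4, off) for off in range(4)]
```

Note that an implanted occurrence is not guaranteed to be in relation with p macrocharacter by macrocharacter, because several of the 4 mutations may fall in the same block of 4 letters; this is one reason the information from many substrings has to be combined.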