Motivation
Bundled Suffix Trees
Applications
Summary
BUNDLED SUFFIX TREES
Luca Bortolussi1
Francesco Fabris2
Alberto Policriti1
1 Department
of Mathematics and Computer Science
University of Udine
2 Department
of Mathematics and Computer Science
University of Trieste
Udine, 22 Marzo 2005
L. Bortolussi, F. Fabris, A. Policriti
BUNDLED SUFFIX TREES
Outline
1 Motivation
  Suffix Trees
2 Bundled Suffix Trees
  Non-Transitive Relations
  Definition
  Size and Construction
3 Applications
4 Summary
Suffix Trees
Introduction
Since the discovery of DNA, biology has given rise to many
hard string problems.
An important challenge: finding repeated patterns in DNA that
are biologically significant.
One feature: patterns are repeated with errors
(approximate pattern discovery is difficult).
Another, more difficult feature: formalizing “biologically
significant”.
Suffix Trees
bcabbabc
A Suffix Tree is a data structure
which exploits the internal
structure of a string.
Efficient for:
Exact String Matching Problem
Longest Exact Common Substring Problem
Identifying Exactly Repeated Patterns
They are linear in size (w.r.t. text
length), and can be built in
linear time.
E. McCreight. A space-economical suffix tree
construction algorithm, Journal of the ACM,
23(2), 262-272, 1976.
E. Ukkonen. On-line construction of suffix-trees.
Algorithmica, 14:249-260, 1995.
Suffix Trees are not naturally suited
to approximate string
matching problems
(positive Hamming or edit
distance).
Landau G.M., Vishkin U., Efficient String
Matching with k Mismatches, Theoretical
Computer Science, 43, 239-249, 1986.
Gusfield D., Algorithms on strings, trees and
sequences, Cambridge University Press, 1997.
The Longest Common
Approximate Substring
Problem or the extraction of
approximate repeated patterns
can’t be solved as in the exact
case.
Extending Suffix Trees
THE PROJECT
Exploring the possibility of using different tree-based structures
to tackle approximate string matching problems.
SO FAR
We developed Bundled Suffix Trees, an extension of Suffix
Trees such that:
they incorporate information about “errors”;
they can be used for the Longest Common Approximate
Substring Problem and for extracting approximate repeated
patterns like Suffix Trees.
How?
Instructions
Coding the concept of “error” in some suitable way
(i.e. with a relation among letters)
Adding in the Suffix Tree some information about the
“relation”
Controlling the combinatorial explosion of the structure
Designing good algorithms to build and manage it
Non-Transitive Relation
Character matching is a relation among letters
(in fact, it is the equality relation)
Approximate matching can also be modeled as a non-transitive
relation among letters (bigger than equality!):
two strings “match” if all their letters are in relation.
Non-Transitive Relation: An Example
Modelling a relation based on Hamming Distance
Start from a basic alphabet (e.g. binary: A = {0, 1})
Construct an alphabet composed of macrocharacters
(e.g. A = {00, 01, 10, 11})
Impose that two letters x, y ∈ A are in relation if and only if
dH (x, y ) ≤ 1 (relation is non–transitive)
The Relation Graph
00 ↔ 01
↕    ↕
10 ↔ 11
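The Hamming-based relation above can be sketched in a few lines. This is only an illustration: the names `hamming`, `related`, `macro_alphabet`, `edges` and `strings_match` are hypothetical and do not come from the slides.

```python
from itertools import product

def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))

def related(x, y, max_dist=1):
    """Two macrocharacters are in relation iff their Hamming distance is <= max_dist."""
    return hamming(x, y) <= max_dist

# Macrocharacters of length 2 over the binary alphabet {0, 1}.
macro_alphabet = ["".join(bits) for bits in product("01", repeat=2)]

# Edges of the relation graph (reflexive pairs omitted).
edges = sorted((x, y) for x in macro_alphabet for y in macro_alphabet
               if x < y and related(x, y))

def strings_match(u, v):
    """Two sequences of macrocharacters match if all aligned letters are in relation."""
    return len(u) == len(v) and all(related(x, y) for x, y in zip(u, v))

# Non-transitivity: 00 ~ 01 and 01 ~ 11, but 00 and 11 are not in relation.
print(edges)
print(related("00", "01"), related("01", "11"), related("00", "11"))
```

The four edges printed are exactly those of the relation graph; the diagonal pairs (00, 11) and (01, 10) are missing, which is what makes the relation non-transitive.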
Bundled Suffix Trees
THE IDEA
We start with a suffix tree for a text S and a (non-transitive)
relation among letters.
We compare suffixes S[i] and S[j].
If a prefix S[i . . . i + k ] of S[i] is in relation
with a prefix S[j . . . j + k ] of S[j],
we put a marker after S[j . . . j + k ] in the tree:
a red node labeled with i.
Bundled Suffix Tree: An Example
bcabbabc;
a↔b↔c
We start from the suffix
tree for the string
bcabbabc.
The alphabet is {a, b, c},
and the relation is
a ↔ b ↔ c.
Let’s compare suffix 3
(abbabc) and suffix 1
(bcabbabc).
According to our relation,
the maximal prefix of suffix
3 that is in relation with
a prefix of suffix 1 is
abbab.
Therefore, after bcabb, we
put in the tree a red node
with label 3.
Due to symmetry, there is
also a red node with label
1 after abbab.
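The comparison of suffix 3 with suffix 1 can be checked with a small sketch. It assumes the relation a ↔ b ↔ c is encoded as "equal or alphabetically adjacent", and the function names are hypothetical; no tree is built here, only the pairwise comparison that decides where a red node goes.

```python
def related(x, y):
    # Assumed encoding of a <-> b <-> c: letters are in relation when they
    # are equal or adjacent in the alphabet (so a and c are NOT in relation).
    return abs(ord(x) - ord(y)) <= 1

def related_prefix_length(text, i, j, rel=related):
    """Length of the longest prefix of suffix i in relation with a prefix
    of suffix j (i and j are 1-based, as on the slides)."""
    a, b = text[i - 1:], text[j - 1:]
    k = 0
    while k < len(a) and k < len(b) and rel(a[k], b[k]):
        k += 1
    return k

text = "bcabbabc"
k = related_prefix_length(text, 3, 1)
prefix_of_3 = text[2:2 + k]   # the matched prefix of suffix 3
prefix_of_1 = text[0:k]       # the path in the tree that receives red node 3
print(k, prefix_of_3, prefix_of_1)
```

The red node labeled 3 goes after `prefix_of_1`, and by symmetry the red node labeled 1 goes after `prefix_of_3`.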
If we repeat this process for every
pair of suffixes, we build
a Bundled Suffix Tree!
Note that this data structure
lies between a suffix
tree and a suffix trie.
This tree can be used to
solve the Longest
Common Approximate
Substring Problem with
respect to a given relation.
We just have to find the
lowest (deepest) red node!
Similarly, we can also
extract information about
approximate repeated
patterns.
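As a reference point, the quantity that the deepest red node encodes can be computed by brute force. The sketch below is a quadratic stand-in that does not use the tree at all; the relation encoding and the name `longest_related_pair` are assumptions made for illustration.

```python
def related(x, y):
    # Assumed encoding of a <-> b <-> c: equal or alphabetically adjacent.
    return abs(ord(x) - ord(y)) <= 1

def longest_related_pair(text, rel=related):
    """Return (length, i, j): the longest pair of substrings starting at
    1-based positions i < j whose aligned letters are all in relation."""
    n = len(text)
    best = (0, None, None)
    for i in range(n):
        for j in range(i + 1, n):
            k = 0
            # Extend the match while the later suffix has letters left.
            while j + k < n and rel(text[i + k], text[j + k]):
                k += 1
            if k > best[0]:
                best = (k, i + 1, j + 1)
    return best

print(longest_related_pair("bcabbabc"))
```

On the running example this returns length 5 for suffixes 1 and 3, which is the pair bcabb / abbab discussed earlier. With the equality relation the same function finds the longest exact repeat.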
How Big?
The number of red nodes
inserted depends on:
the relation;
the structure of the text.
In the worst case, the number
of red nodes is quadratic in the
length of the text S.
On average, the number of red nodes is bounded by
$m^{1+\delta}$, with $\delta = \log_{1/p_+} C$
($m$ is the length of the text, $p_+$ is the frequency of the most
common letter in S, and $C$ depends on the relation).
$1 + \delta$ is slightly greater than one!
How Fast?
Naive Algorithm
The naive algorithm for building a BuST simply tries to
“match” every suffix of the text along every branch of the
suffix tree, until a “mismatch” is found.
It can be quadratic in the worst case.
Anyway, an analysis based on the average shape of a
suffix tree shows that its average complexity is bounded
by $m^{1+\delta'}$ ($\delta'$ just slightly greater than $\delta$).
W. Szpankowski. A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors. SIAM J. Comput.
22(6): 1176-1198 (1993)
P. Jacquet, B. McVey, W. Szpankowski. Compact Suffix Trees Resemble PATRICIA Tries: Limiting
Distribution of Depth, Journal of the Iranian Statistical Society, 3, 139-148, 2004.
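The naive construction described above can be sketched as follows. To keep the example short it walks an uncompressed suffix trie (nested dictionaries) rather than a real suffix tree, and it assumes that a red node for suffix i is placed at the end of every maximal path in relation with a prefix of suffix i, excluding the suffix's own path. These choices, and all names, are illustrative assumptions rather than the authors' implementation.

```python
def related(x, y):
    # Assumed encoding of a <-> b <-> c: equal or alphabetically adjacent.
    return abs(ord(x) - ord(y)) <= 1

def build_suffix_trie(text):
    """Uncompressed suffix trie as nested dicts (quadratic size, demo only)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def red_nodes(text, rel=related):
    """Set of (path, i): a red node labeled i (1-based) sits after `path`."""
    root = build_suffix_trie(text)
    found = set()
    for i in range(len(text)):
        suffix = text[i:]
        stack = [(root, "")]
        while stack:
            node, path = stack.pop()
            depth = len(path)
            extended = False
            if depth < len(suffix):
                # Follow every branch whose letter is in relation with the
                # next letter of the suffix.
                for ch, child in node.items():
                    if rel(ch, suffix[depth]):
                        stack.append((child, path + ch))
                        extended = True
            # A "mismatch" (or the end of the suffix) closes a maximal path.
            if not extended and path and path != suffix:
                found.add((path, i + 1))
    return found

nodes = red_nodes("bcabbabc")
print(len(nodes), ("bcabb", 3) in nodes, ("abbab", 1) in nodes)
```

With the equality relation every suffix only follows its own path, so no red nodes are produced and the structure reduces to the plain suffix trie.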
Faster
Efficient Algorithm
We found a “McCreight-like” algorithm that is linear in the size
of the output.
Intuitions
It processes the suffixes backwards.
It is based on the concept of inverse suffix links.
It identifies the red nodes for suffix i by processing the red
nodes for suffix i + 1.
Tests
We have implemented the naive algorithm for the
construction of BuST.
We have tested it with relations induced by Hamming
distance, defined over DNA macrocharacters.
With macrocharacters of size 4, such that two of them are
in relation iff their Hamming distance is ≤ 1,
the algorithm is quite fast, and can process texts of 100 Kb
in a few seconds.
The number of red nodes grows with an exponent smaller
than the predicted one.
Hunting TFBS
We are using BuST to identify TFBS (transcription factor binding
site) candidates in DNA sequences.
The algorithm first constructs the BuST for the set of
sequences under analysis, and then extracts and combines the
information contained in it. The relation used is defined by a
Hamming distance criterion.
The algorithm is quite fast: for instance, we are able to solve
the benchmark proposed by Pevzner et al. in a few seconds.
P.A. Pevzner and S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc Int
Conf Intell Syst Mol Biol. 2000;8:269-78.
The general concept of non-transitive relation seems very
fruitful:
it can be used to encode Hamming distance, but also to tackle
edit distance or to encode other biologically-driven relations.
G. Pavesi, G. Mauri and G. Pesole. In silico representation and discovery of transcription factor binding
sites, Briefings in Bioinformatics. 5(3):1–20, 2004.
Conclusions
We have introduced Bundled Suffix Trees, a new data
structure extending suffix trees.
It can be used to extract approximate information from a
string, and it is manipulated similarly to suffix trees.
The structure is based on a very general concept of
non-transitive relation among (macro)characters.
Its size is slightly more than linear on average, and there’s
a fast (McCreight-like) algorithm to build it.
It can be used to discover approximate patterns in a text.
For instance, it can be used to identify candidates for
TFBS.
Suffix Trees
Dimension of BuST
Efficient Algorithm
The Benchmark
Naive Construction of Suffix Trees
We start from the string
bcabbabc.
We put down the whole string
in a branch.
We then try to match each remaining
suffix down the tree.
Here’s the suffix tree!
Ukkonen Algorithm
It is an on-line linear algorithm.
It constructs the tree incrementally for S[1 . . . i],
i = 1, . . . , m.
At every step it extends the frontier: the set of points in the
tree that are possible branching points.
(the position of the longest suffix of S[1 . . . i] such that it has
another occurrence in S[1 . . . i])
This frontier can be manipulated efficiently thanks to suffix
links.
(pointers from nodes with label xα to nodes with label α)
Quadratic BuST
Let’s consider the text
$\underbrace{a \cdots a}_{m}\,\underbrace{b \cdots b}_{m}\,\underbrace{c \cdots c}_{2m}$
over {a, b, c, d}, with
a ↔ b
↕   ↕
d ↔ c
The number of nodes
surrounded by the red box
is quadratic in m!
The exponent δ
Test
Number of
macrocharacters of length 4
over DNA
alphabet. Test
strings are
generated
according to a
uniform p.d.
Inverse Suffix Links
A crucial role in the fast
construction of suffix trees is
played by suffix links.
Suffix links are pointers from
nodes with path label xα to
nodes with path label α.
Whenever there is a node with
path label xα, there’s also a node
with path label α.
Inverse suffix links are pointers
from nodes with path label α to
positions in the tree labeled xα,
for each x in the alphabet such
that xα is a substring of S.
They can point in the middle of
an arc.
If an ISL leads from α to xα, it is
labeled with x.
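To make the definition concrete, the labels of the inverse suffix links leaving each substring α can be enumerated by brute force. This sketch works on the set of substrings of S instead of an actual suffix tree, so it only shows which links exist, not how they are stored or built; the function names are hypothetical.

```python
def substrings(text):
    """All non-empty substrings of text."""
    return {text[i:j] for i in range(len(text))
            for j in range(i + 1, len(text) + 1)}

def inverse_suffix_links(text):
    """Map each substring alpha (including the empty one, i.e. the root)
    to the sorted letters x such that x + alpha is a substring of text."""
    subs = substrings(text)
    alphabet = sorted(set(text))
    return {alpha: [x for x in alphabet if x + alpha in subs]
            for alpha in subs | {""}}

links = inverse_suffix_links("bcabbabc")
print(links["ab"])        # letters x such that x + "ab" occurs in the text
print(links["bcabbabc"])  # the whole text cannot be extended to the left
```

For the running example, the position labeled ab has two inverse suffix links, labeled b and c, leading to bab and cab.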
The Algorithm
Inverse suffix links can be used
to identify the red nodes for suffix
S[i] from the red nodes for suffix
S[i + 1].
Suppose we know the location of
a red node for suffix S[i + 1], and
that it is just under a “black” node
with path label α.
From this node, we can cross all
inverse suffix links such that S(i)
is in relation with the character
labeling the ISL.
With a skip and count trick, we
can identify the positions of red
nodes for S[i].
The Benchmark
There is a set of 20 strings of length 1000, generated
according to a uniform distribution over the DNA alphabet.
There is a pattern p of length 16, such that 20 of its
occurrences are implanted in the strings, with 4 mutations
occurring in random positions.
The problem is to identify p (the signal), given the strings.
P.A. Pevzner and S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc Int
Conf Intell Syst Mol Biol. 2000;8:269-78.
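A generator for instances shaped like this benchmark might look as follows. It is a sketch under stated assumptions: one occurrence is implanted per string, each occurrence receives exactly 4 substitutions at distinct positions, and the function name, parameters and seed are hypothetical.

```python
import random

def make_benchmark(num_seqs=20, seq_len=1000, pat_len=16, mutations=4, seed=0):
    """Return (pattern, sequences, positions) for a planted-signal instance."""
    rng = random.Random(seed)
    dna = "ACGT"
    pattern = "".join(rng.choice(dna) for _ in range(pat_len))
    sequences, positions = [], []
    for _ in range(num_seqs):
        # Uniform random background string.
        background = [rng.choice(dna) for _ in range(seq_len)]
        # Mutated copy of the pattern: substitutions at distinct positions,
        # always to a different letter (assumption of this sketch).
        variant = list(pattern)
        for p in rng.sample(range(pat_len), mutations):
            variant[p] = rng.choice([c for c in dna if c != variant[p]])
        start = rng.randrange(seq_len - pat_len + 1)
        background[start:start + pat_len] = variant
        sequences.append("".join(background))
        positions.append(start)
    return pattern, sequences, positions

pattern, sequences, positions = make_benchmark()
print(pattern, positions[:3])
```

Fixing the seed keeps an instance reproducible, which helps when comparing different relations or macrocharacter sizes on the same input.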
A solution with BuST
We used macrocharacters of length 4 (two of them are in
relation if their Hamming distance is ≤ 1).
We built the generalized BuST for the strings (converted into
macrocharacters in every possible way).
For every substring of length 16 of the 20 strings, we
looked at the set of substrings in relation with it, and we
combined this information to find p.
It’s a naive use of BuST, but it works!
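One plausible reading of "converted into macrocharacters in every possible way" is that each string is cut into blocks of length 4 starting at each of the 4 possible offsets. The sketch below follows that reading, which is an assumption, and uses hypothetical names; incomplete trailing blocks are simply dropped.

```python
def macro_frames(seq, k=4):
    """Split seq into macrocharacters of length k, once per starting offset."""
    return [[seq[p:p + k] for p in range(offset, len(seq) - k + 1, k)]
            for offset in range(k)]

def macros_related(x, y, max_dist=1):
    """Macrocharacters are in relation iff their Hamming distance is <= max_dist."""
    return len(x) == len(y) and sum(a != b for a, b in zip(x, y)) <= max_dist

frames = macro_frames("ACGTACGTAC")
for offset, frame in enumerate(frames):
    print(offset, frame)
print(macros_related("ACGT", "ACGA"), macros_related("ACGT", "AGGA"))
```

Under this relation a 16-letter window spans 4 macrocharacters, so two windows can be in relation only if each aligned block of 4 letters differs in at most one position.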