Suffix Trees
02-713
Slides by Carl Kingsford
Preprocessing Strings
• Typical setting: A long, known, and fixed text string (like a genome) and
many unknown, changing query strings.
• Allowed to preprocess the text string once in anticipation of the
future unknown queries.
• Goal: preprocess string data into data structures that make many
questions (like searching) easy to answer.
Suffix Tries
- Suffix trees are hugely important for searching large sequences like
genomes. E.g., they are the basis for a tool called “MUMmer”.
Suffix Tries
s = abaaba$
SufTrie(s) = suffix trie representing string s.
[Figure: suffix trie for s = abaaba$]
Edges of the suffix trie are labeled with letters from the alphabet ∑ (say {A,C,G,T}).
Every path from the root to a solid node represents a suffix of s.
Every suffix of s is represented by some path from the root to a solid node.
Why are all the solid nodes leaves?
How many leaves will there be?
Processing Strings Using Suffix Tries
Main idea:
every substring of s is a prefix of some suffix of s.
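The main idea can be checked directly on a small string. The following is a minimal sketch, not part of the original slides; the helper names `suffixes` and `is_substring_via_suffixes` are made up for illustration.

```python
def suffixes(s):
    """Return all non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]

def is_substring_via_suffixes(q, s):
    """True if q is a prefix of at least one suffix of s."""
    return any(suf.startswith(q) for suf in suffixes(s))

s = "abaaba$"
print(suffixes(s))
# The suffix-based test agrees with Python's built-in substring test.
for q in ["baa", "aba", "bb", "a$"]:
    print(q, is_substring_via_suffixes(q, s), q in s)
```

A suffix trie stores exactly these suffixes with shared prefixes merged, which is why following a query from the root answers the substring question.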
Searching Suffix Tries
s = abaaba$
[Figure: suffix trie for s = abaaba$]
Is “baa” a substring of s?
Follow the path given by the query string.
After we’ve built the suffix trie, queries can be answered in time O(|query|), regardless of the text size.
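The search above can be sketched with a suffix trie stored as nested dictionaries. This is a simple quadratic-time construction that inserts every suffix separately, used here only for illustration; it is not the suffix-link construction discussed later, and the names `build_naive_suffix_trie` and `contains` are assumptions of this sketch.

```python
def build_naive_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie (O(n^2) time and space)."""
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root

def contains(trie, q):
    """Follow q from the root; q is a substring iff we never fall off the trie."""
    node = trie
    for c in q:
        if c not in node:
            return False
        node = node[c]
    return True

trie = build_naive_suffix_trie("abaaba$")
print(contains(trie, "baa"))  # True
print(contains(trie, "bab"))  # False
```

Each query touches one trie node per query character, so the running time depends on |query| and not on the length of the text.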
Applications of Suffix Tries (I)
Check whether q is a substring of T:
Follow the path for q starting from the root.
If you exhaust the query string, then q is in T.
Count # of occurrences of q in T:
Follow the path for q starting from the root.
The number of leaves under the node you end up in is the
number of occurrences of q.
Find the longest repeat in T:
Find the deepest node that has at least 2 leaves under it.
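The counting and longest-repeat recipes can be sketched on a nested-dict suffix trie. This sketch assumes T ends with a unique terminator such as $, so that every suffix ends at a leaf; the function names are hypothetical and the naive trie construction is for illustration only.

```python
def build_naive_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie."""
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root

def count_leaves(node):
    """Number of leaves at or below node (a node with no children is a leaf)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())

def count_occurrences(trie, q):
    """Follow q from the root, then count the leaves under the final node."""
    node = trie
    for c in q:
        if c not in node:
            return 0
        node = node[c]
    return count_leaves(node)

def longest_repeat(trie):
    """String spelled by the deepest node with at least 2 leaves under it."""
    best = ""
    stack = [(trie, "")]
    while stack:
        node, path = stack.pop()
        if count_leaves(node) >= 2:
            if len(path) > len(best):
                best = path
            for c, child in node.items():
                stack.append((child, path + c))
    return best

trie = build_naive_suffix_trie("abaaba$")
print(count_occurrences(trie, "a"))    # 4
print(count_occurrences(trie, "aba"))  # 2
print(longest_repeat(trie))            # aba
```

Recomputing leaf counts at every node keeps the sketch short; a single bottom-up pass that caches the counts would avoid the repeated work.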
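Suffix Links
[Figure: suffix trie for abaaba$ with suffix links drawn between nodes]
• A suffix link connects the node representing xα to the node representing α (x is a single character, α a string).
• Most important suffix links are the ones connecting suffixes of the full string (shown in the figure).
• But every node has a suffix link.
• Why?
• How do we know a node representing α exists for every node representing xα?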
Suffix Tries
s = abaaba$
[Figure: suffix trie for abaaba$ with one node and the target of its suffix link marked]
A node representing xα lies on the path for some suffix xs' of s, so α is a prefix of s', the suffix that is 1 character shorter.
The node’s suffix link should link to the node representing that prefix α of s'.
Since the suffix trie contains all suffixes, it contains a path representing s', and therefore contains a node representing every prefix of s'.
Applications of Suffix Tries (II)
Find the longest common substring of T and q:
Walk down the tree following q.
If you hit a dead end, save the current depth, and follow the suffix link from the current node.
When you exhaust q, return the longest substring found.
T = abaaba$
q = bbaa
[Figure: suffix trie for abaaba$ with suffix links]
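The walk described above can be sketched as follows. To keep the example self-contained, the trie is built naively and the suffix links are filled in afterwards by a breadth-first pass (the link of the child of u via c is the child via c of u's link); this is an assumption of the sketch, not the construction from the slides. The names `Node`, `build_trie_with_links`, and `longest_common_substring` are hypothetical.

```python
from collections import deque

class Node:
    """Suffix trie node with child edges and a suffix link."""
    def __init__(self):
        self.children = {}
        self.link = None

def build_trie_with_links(t):
    root = Node()
    for i in range(len(t)):          # insert every suffix of t
        node = root
        for c in t[i:]:
            if c not in node.children:
                node.children[c] = Node()
            node = node.children[c]
    root.link = root
    queue = deque([root])
    while queue:                     # breadth-first: parents get links first
        u = queue.popleft()
        for c, v in u.children.items():
            v.link = root if u is root else u.link.children[c]
            queue.append(v)
    return root

def longest_common_substring(t, q):
    root = build_trie_with_links(t)
    node, depth = root, 0
    best_len, best_end = 0, 0
    for i, c in enumerate(q):
        # Dead end: drop the first character of the current match.
        while c not in node.children and node is not root:
            node, depth = node.link, depth - 1
        if c in node.children:
            node, depth = node.children[c], depth + 1
        if depth > best_len:         # remember the deepest point reached
            best_len, best_end = depth, i + 1
    return q[best_end - best_len:best_end]

print(longest_common_substring("abaaba$", "bbaa"))  # baa
```

Instead of saving the depth only at dead ends, the sketch records the best depth after every character, which yields the same answer.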
Constructing Suffix Tries
Suppose we want to build suffix trie for string:
s = abbacabaa
We will walk down the string from left to right, building suffix tries for s[0], s[0..1], s[0..2], ..., s[0..n-1].
Example at i=4 (the prefix abbac of abbacabaa). Need to add nodes for the suffixes:
abbac
bbac
bac
ac
c
Purple (each suffix above without its final c: abba, bba, ba, a, and the empty string) are suffixes that will exist in SufTrie(s[0..i-1]). Why?
How can we find these suffixes quickly?
[Figure: SufTrie(abba) and SufTrie(abbac) side by side]
Where is the new deepest node? (aka longest suffix)
How do we add the suffix links for the new nodes?
To build SufTrie(s[0..i]) from SufTrie(s[0..i-1]):
Python Code to Build a Suffix Trie
class SuffixNode:
    def __init__(self, suffix_link = None):
        self.children = {}
        if suffix_link is not None:
            self.suffix_link = suffix_link
        else:
            self.suffix_link = self

    def add_link(self, c, v):
        """link this node to node v via string c"""
        self.children[c] = v

def build_suffix_trie(s):
    """Construct a suffix trie."""
    assert len(s) > 0

    # explicitly build the two-node suffix tree
    Root = SuffixNode() # the root node
    Longest = SuffixNode(suffix_link = Root)
    Root.add_link(s[0], Longest)

    # for every character left in the string
    for c in s[1:]:
        Current = Longest; Previous = None
        while c not in Current.children:

            # create new node r1 with transition Current -c->r1
            r1 = SuffixNode()
            Current.add_link(c, r1)

            # if we came from some previous node, make that
            # node's suffix link point here
            if Previous is not None:
                Previous.suffix_link = r1

            # walk down the suffix links
            Previous = r1
            Current = Current.suffix_link

        # make the last suffix link
        # (Previous links to Root only if it is the child just added
        # at Root; if we stopped at Root because the c edge already
        # existed, Previous must link to that existing child instead)
        if Current is Root and Root.children[c] is Previous:
            Previous.suffix_link = Root
        else:
            Previous.suffix_link = Current.children[c]

        # move to the newly added child of the longest path
        # (which is the new longest path)
        Longest = Longest.children[c]
    return Root
[Figure: one construction step showing current, Prev, and longest, with a new s[i] edge added at each node visited along the suffix links; one node is labeled u]
[Figure: suffix tries built step by step for a, ab, aba, abaa, abaab, abaaba, abaaba$]
Note (step aba to abaa): there's already a path for suffix "a", so we don't change it (we just add a suffix link to it).
How many nodes can a suffix trie have?
[Figure: suffix trie for s = aaabbb]
• s = a^n b^n will have
  • 1 root node
  • n nodes in a path of “b”s starting at the root
  • n paths of n+1 nodes each (an “a” node followed by n “b” nodes)
• Total = n(n+1)+n+1 = O(n^2) nodes.
• This is not very efficient.
• How could you make it smaller?
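The node count can be confirmed empirically. A small sketch, assuming a naive trie built by inserting each suffix (no $ terminator, as on the slide); `count_trie_nodes` is a hypothetical helper name.

```python
def count_trie_nodes(s):
    """Build the suffix trie of s naively and return its number of nodes."""
    root = {}
    count = 1  # the root
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            if c not in node:
                node[c] = {}
                count += 1
            node = node[c]
    return count

for n in range(1, 6):
    s = "a" * n + "b" * n
    # Compare the measured count with the formula n(n+1) + n + 1.
    print(n, count_trie_nodes(s), n * (n + 1) + n + 1)
```

The two columns agree, so the trie for a string of length 2n has (n+1)^2 nodes, which is quadratic in the length of the string.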
So... we have to “trie” again...
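A More Compact Representation
s = abaaba$ (positions 1234567)
[Figure: the suffix tree for s drawn twice, once with substring edge labels and once with each label replaced by a start:end pair of positions in s]
Edge label   start:end
a            6:6
ba           5:6
aba$         4:7
$            7:7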
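One way to obtain this representation is to build the suffix trie and then merge every non-branching path into a single edge labeled with a pair of positions. The sketch below does this naively in quadratic time; it assumes s ends with a unique terminator such as $, and the function names are hypothetical. It records the first occurrence of each label, so the pairs can differ from the ones in the figure (for example 1:1 instead of 6:6 for "a") while denoting the same substrings.

```python
def build_suffix_tree(s):
    """Return a nested dict mapping (start, end) -> subtree.

    (start, end) are 1-indexed, inclusive positions in s.
    """
    # Trie nodes are [children, pos]; pos is the 0-indexed position in s
    # of the character on the edge entering the node.
    root = [{}, None]
    for i in range(len(s)):
        node = root
        for j in range(i, len(s)):
            c = s[j]
            if c not in node[0]:
                node[0][c] = [{}, j]
            node = node[0][c]

    def compress(node):
        tree = {}
        for child in node[0].values():
            start = child[1]
            length = 1
            # Swallow every non-branching node below this edge.
            while len(child[0]) == 1:
                child = next(iter(child[0].values()))
                length += 1
            tree[(start + 1, start + length)] = compress(child)
        return tree

    return compress(root)

def show(tree, s, indent=0):
    """Print each edge as start:end together with the substring it denotes."""
    for (a, b), sub in sorted(tree.items()):
        print(" " * indent + f"{a}:{b} {s[a - 1:b]}")
        show(sub, s, indent + 2)

s = "abaaba$"
show(build_suffix_tree(s), s)
```

Every edge now costs two integers regardless of its label length, and the tree has at most one leaf per suffix with every internal node branching, which is where the linear space bound comes from.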
Constructing Suffix Trees - Ukkonen’s Algorithm
Storing more than one string with Generalized Suffix Trees
Constructing Generalized Suffix Trees
Goal. Represent a set of strings P = {s1, s2, s3, ..., sm}.
Example. aat, tag, gat
Simple solution:
(1) Build the suffix tree for the string aat#1tag#2gat#3.
(2) For every leaf node, remove any text after the first # symbol.
[Figure: suffix tree for aat#1tag#2gat#3 before and after step (2); for example, the leaf label ag#2gat#3 becomes ag#2 and at#1tag#2gat#3 becomes at#1]
Applications of Generalized Suffix Trees
Longest common substring of S and T:
Build generalized suffix tree for {S, T}
Find the deepest node that has descendants from both
strings (containing both #1 and #2)
Determine the strings in a database {S1, S2, S3, ..., Sm} that contain
query string q:
Build generalized suffix tree for {S1, S2, S3, ..., Sm}
Follow the path for q in the suffix tree.
Suppose you end at node u: traverse the tree below u, and
output i if you find a string containing #i.
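The database query can be sketched on an uncompressed generalized suffix trie. As a simplification of this sketch, each string's suffixes are inserted separately and closed with a terminator edge ("#", i), rather than concatenating the strings and trimming the leaves; the resulting paths are the same. The function names are hypothetical.

```python
def build_generalized_suffix_trie(strings):
    """Insert all suffixes of every string, each ending in a terminator #i."""
    root = {}
    for idx, s in enumerate(strings, start=1):
        for i in range(len(s)):
            node = root
            for c in s[i:]:
                node = node.setdefault(c, {})
            # A tuple key cannot collide with an ordinary character.
            node.setdefault(("#", idx), {})
    return root

def strings_containing(trie, q):
    """Follow q to node u, then collect every #i found below u."""
    node = trie
    for c in q:
        if c not in node:
            return set()
        node = node[c]
    found = set()
    stack = [node]
    while stack:
        u = stack.pop()
        for label, child in u.items():
            if isinstance(label, tuple):
                found.add(label[1])
            else:
                stack.append(child)
    return found

trie = build_generalized_suffix_trie(["aat", "tag", "gat"])
print(strings_containing(trie, "at"))  # {1, 3}
print(strings_containing(trie, "ta"))  # {2}
```

The traversal below u visits the whole subtree; storing the set of string indices at each node during construction would trade memory for faster queries.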
Longest Common Extension
Longest common extension: We are given strings S and T. In the future, many pairs (i,j) will be
provided as queries, and we want to quickly find:
the longest substring of S starting at i that matches a substring of T starting at j.
[Figure: S and T with the matching region LCE(i,j) marked, starting at position i in S and position j in T]
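The definition of LCE can be pinned down with a direct character-by-character comparison. This sketch takes time proportional to the length of the match per query, so it only illustrates what is being computed and is not a fast preprocessed method; positions are 0-indexed here, and the name `lce` is an assumption.

```python
def lce(S, T, i, j):
    """Length of the longest common prefix of S[i:] and T[j:] (0-indexed)."""
    k = 0
    while i + k < len(S) and j + k < len(T) and S[i + k] == T[j + k]:
        k += 1
    return k

S = "abaaba"
T = "baab"
print(lce(S, T, 1, 0))  # 4, the common extension is "baab"
print(lce(S, T, 0, 0))  # 0, "a" and "b" differ immediately
```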
Using LCE to Find Palindromes
Maximal even palindrome at position i: the longest string to the left and right so that the left
half is equal to the reverse of the right half.
[Figure: S with center i and Sr with position n-i; the two matching stretches end at mismatching characters x ≠ y]
Construct Sr, the reverse of S. O(|S|)
Preprocess S and Sr so that LCE queries can be solved in constant time (previous slide). O(|S|)
For every position i: O(|S|) positions
Compute LCE(i, n-i). O(1) each
LCE(i, n-i) is half the length of the longest even palindrome centered at i.
Total time = O(|S|)
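The palindrome procedure can be sketched with a naive LCE that compares characters directly, so this version runs in quadratic time in the worst case rather than O(|S|). The sketch uses 0-indexed positions and treats center i as the gap just before S[i]; that convention and the function names are assumptions of the example.

```python
def lce(S, T, i, j):
    """Length of the longest common prefix of S[i:] and T[j:] (0-indexed)."""
    k = 0
    while i + k < len(S) and j + k < len(T) and S[i + k] == T[j + k]:
        k += 1
    return k

def maximal_even_palindromes(S):
    """Map each center i (gap before S[i]) to its maximal even palindrome."""
    n = len(S)
    Sr = S[::-1]
    result = {}
    for i in range(1, n):
        # S[i:] reads rightwards from the center; Sr[n-i:] reads leftwards.
        k = lce(S, Sr, i, n - i)
        result[i] = S[i - k:i + k]
    return result

print(maximal_even_palindromes("xabbay")[3])  # abba
print(max(maximal_even_palindromes("abaaba").values(), key=len))  # abaaba
```

Here k plays the role of LCE(i, n-i), and the palindrome itself has length 2k.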
Recap
• Suffix trees use O(n) space, which is asymptotically optimal, but require a
more subtle algorithm to construct.