
Suffixtrees

Suffix tries are a data structure used to store the suffixes of a string to enable efficient searching and other string operations on the stored text. They work by building a trie where each path from the root represents a suffix. This allows queries like searching for a substring or counting occurrences to be done in time proportional to the length of the query regardless of the size of the stored text. Common applications include searching for substrings/suffixes, counting occurrences, and finding the longest repeated substring. Suffix tries enable these operations to be performed very rapidly for large texts.

Uploaded by

abhayanilark
Copyright
© © All Rights Reserved

Suffix Trees

02-713

Slides by Carl Kingsford

Preprocessing Strings

• Typical setting: a long, known, and fixed text string (like a genome) and many unknown, changing query strings.

• We are allowed to preprocess the text string once in anticipation of the future unknown queries.

• Approach: preprocess the string data into data structures that make many questions (like searching) easy to answer.

Suffix Tries

• A trie, pronounced “try”, is a tree that exploits some structure in the keys

  - e.g. if the keys are strings, a binary search tree would compare the entire strings, but a trie would look at their individual characters

• Suffix tries are a space-inefficient data structure to store a string that allows many kinds of queries to be answered quickly.

• Suffix trees are hugely important for searching large sequences like genomes, e.g. they are the basis for a tool called “MUMmer”.

Suffix Tries
s = abaaba$

[Figure: the suffix trie of s = abaaba$]

SufTrie(s) = suffix trie representing string s.

Edges of the suffix trie are labeled with letters from the alphabet Σ (say {A,C,G,T}).

Every path from the root to a solid node represents a suffix of s.

Every suffix of s is represented by some path from the root to a solid node.

Why are all the solid nodes leaves?

How many leaves will there be?

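The definition above can be made concrete with a minimal sketch, assuming the trie is stored as nested Python dicts (one dict per node, keyed by edge letter). The names build_naive_suffix_trie and count_leaves are illustrative and do not come from the slides, and inserting the suffixes one by one takes quadratic time, so this is not the construction discussed later.

```python
def build_naive_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dicts."""
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            # reuse the child for c if it exists, otherwise create it
            node = node.setdefault(c, {})
    return root


def count_leaves(node):
    """A leaf is a node with no outgoing edges."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_naive_suffix_trie("abaaba$")
print(count_leaves(trie))  # one leaf per suffix of abaaba$
```

Because $ appears only at the end of s, no suffix is a prefix of another, so each of the 7 suffixes ends at its own leaf. Without the terminator (s = abaaba) the same code finds only 3 leaves, since the shorter suffixes end at internal nodes.
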
Processing Strings Using Suffix Tries

Given a suffix trie for a text T, and a query string q, how can we:

• determine whether q is a substring of T?

• check whether q is a suffix of T?

• count how many times q appears in T?

• find the longest repeat in T?

• find the longest common substring of T and q?

Main idea: every substring of T is a prefix of some suffix of T.

Searching Suffix Tries

s = abaaba$

[Figure: the suffix trie of s = abaaba$]

Is “baa” a substring of s?

Follow the path given by the query string.

After we’ve built the suffix trie, queries can be answered in time O(|query|), regardless of the text size.

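Following the path for a query can be sketched as below, again assuming a suffix trie stored as nested dicts and built naively. The function names are illustrative. The loop in is_substring does one dictionary lookup per query character, which is where the O(|query|) bound comes from.

```python
def build_naive_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dicts."""
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root


def is_substring(trie, q):
    """Follow the path for q from the root; fail at the first missing edge."""
    node = trie
    for c in q:
        if c not in node:
            return False
        node = node[c]
    return True


trie = build_naive_suffix_trie("abaaba$")
print(is_substring(trie, "baa"))
print(is_substring(trie, "bb"))
```
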
Applications of Suffix Tries (I)

Check whether q is a substring of T:
Follow the path for q starting from the root.
If you exhaust the query string, then q is in T.

Check whether q is a suffix of T:
Follow the path for q$ starting from the root (that is, follow q and then the edge labeled $).
If you end at a leaf, then q is a suffix of T.

Count # of occurrences of q in T:
Follow the path for q starting from the root.
The number of leaves under the node you end up in is the number of occurrences of q.

Find the longest repeat in T:
Find the deepest node that has at least 2 leaves under it.

Find the lexicographically (alphabetically) first suffix:
Start at the root, and at each node follow the edge labeled with the lexicographically (alphabetically) smallest letter until you reach a leaf.

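The four remaining queries can be sketched on the same nested-dict trie. All function names here are illustrative, and the sketch recounts leaves on every call instead of caching them at the nodes, so it shows the logic rather than the running times.

```python
def build_naive_suffix_trie(s):
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root


def walk(trie, q):
    """Return the node reached by following q from the root, or None."""
    node = trie
    for c in q:
        node = node.get(c)
        if node is None:
            return None
    return node


def count_leaves(node):
    return 1 if not node else sum(count_leaves(ch) for ch in node.values())


def is_suffix(trie, q):
    """q is a suffix if the node for q has an outgoing $ edge."""
    node = walk(trie, q)
    return node is not None and "$" in node


def count_occurrences(trie, q):
    node = walk(trie, q)
    return 0 if node is None else count_leaves(node)


def longest_repeat(node, prefix=""):
    """String spelled by the deepest node with at least 2 leaves below it."""
    best = prefix
    for c, child in node.items():
        if count_leaves(child) >= 2:
            candidate = longest_repeat(child, prefix + c)
            if len(candidate) > len(best):
                best = candidate
    return best


def first_suffix(trie):
    """Always take the smallest edge letter ('$' sorts before letters in ASCII)."""
    node, out = trie, []
    while node:
        c = min(node)
        out.append(c)
        node = node[c]
    return "".join(out)


trie = build_naive_suffix_trie("abaaba$")
print(is_suffix(trie, "aba"), count_occurrences(trie, "a"), longest_repeat(trie))
```

For s = abaaba$ the longest repeat is aba, which occurs at positions 0 and 3. Since $ is the smallest character, the lexicographically first suffix is just $.
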
Suffix Links

s = abaaba$

[Figure: the suffix trie of s = abaaba$ with the suffix links between the suffixes of the full string drawn in]

• Suffix links connect a node representing “xα” to a node representing “α”.

• The most important suffix links are the ones connecting suffixes of the full string (shown in the figure).

• But every node has a suffix link.

• Why? How do we know a node representing α exists for every node representing xα?

Suffix Tries

s = abaaba$

A node represents a prefix xα of some suffix of s. Write that suffix as xt, so that t is the suffix one character shorter.

The node’s suffix link should link to α, the prefix of the suffix t that is 1 character shorter than xα.

Since the suffix trie contains all suffixes, it contains a path representing t, and therefore contains a node representing every prefix of t, including α.

Applications of Suffix Tries (II)

Find the longest common substring of T and q:

T = abaaba$
q = bbaa

[Figure: the suffix trie of T = abaaba$]

Walk down the tree following q.

If you hit a dead end, save the current depth, and follow the suffix link from the current node.

When you exhaust q, return the longest substring found.

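The walk with suffix links can be sketched without pointers, under one loudly stated assumption: each trie node is represented by the substring it spells, so the node set is simply the set of all substrings of T, and following a suffix link means dropping the first character. The names substring_nodes and longest_common_substring are illustrative. This stand-in uses quadratic space, like the trie itself.

```python
def substring_nodes(t):
    """One entry per trie node: every distinct substring of t (the root is "")."""
    return {t[i:j] for i in range(len(t)) for j in range(i, len(t) + 1)}


def longest_common_substring(t, q):
    nodes = substring_nodes(t)
    current = ""  # string spelled by the current node
    best = ""
    for c in q:
        # dead end: follow suffix links until c can be read (or we are at the root)
        while current and current + c not in nodes:
            current = current[1:]
        if current + c in nodes:
            current += c
        # recording the depth at every step covers the "save at dead end" rule
        if len(current) > len(best):
            best = current
    return best


print(longest_common_substring("abaaba$", "bbaa"))
```

At every step, current is the longest suffix of the part of q read so far that occurs in T, which is why the best value seen over the whole walk is the longest common substring.
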
Constructing Suffix Tries

Suppose we want to build suffix trie for string:

s = abbacabaa

We will walk down the string from left to right, building suffix tries for s[0], s[0..1], s[0..2], ..., s[0..n-1], where n is the length of s.

To build the suffix trie for s[0..i], we will use the suffix trie for s[0..i-1] built in the previous step.

To convert SufTrie(s[0..i-1]) → SufTrie(s[0..i]), add character s[i] to all the suffixes.

Example: s = abbacabaa, i = 4. Need to add nodes for the suffixes:

abbac
bbac
bac
ac
c

Purple (on the original slide) marks suffixes that will exist in SufTrie(s[0..i-1]). Why?

How can we find these suffixes quickly?

[Figure: SufTrie(abba) and SufTrie(abbac) side by side]

Where is the new deepest node (aka the longest suffix)?

How do we add the suffix links for the new nodes?

To build SufTrie(s[0..i]) from SufTrie(s[0..i-1]):

CurrentSuffix = longest (aka deepest) suffix

Repeat:
    Add child labeled s[i] to CurrentSuffix.
    Follow suffix link to set CurrentSuffix to next shortest suffix.
until you have added the child at the root, or the current node already has an edge labeled s[i] leaving it.

(Because if you already have a node for suffix αs[i], then you have a node for every smaller suffix.)

Add suffix links connecting nodes you just added in the order in which you added them.

In practice, you add these links as you go along, rather than at the end.

Python Code to Build a Suffix Trie
class SuffixNode:
    def __init__(self, suffix_link = None):
        self.children = {}
        if suffix_link is not None:
            self.suffix_link = suffix_link
        else:
            # only the root is created without a link; it links to itself
            self.suffix_link = self

    def add_link(self, c, v):
        """link this node to node v via string c"""
        self.children[c] = v

def build_suffix_trie(s):
    """Construct a suffix trie."""
    assert len(s) > 0

    # explicitly build the two-node suffix tree
    Root = SuffixNode()    # the root node
    Longest = SuffixNode(suffix_link = Root)
    Root.add_link(s[0], Longest)

    # for every character left in the string
    for c in s[1:]:
        Current = Longest; Previous = None
        while c not in Current.children:

            # create new node r1 with transition Current -c->r1
            r1 = SuffixNode()
            Current.add_link(c, r1)

            # if we came from some previous node, make that
            # node's suffix link point here
            if Previous is not None:
                Previous.suffix_link = r1

            # walk down the suffix links
            Previous = r1
            Current = Current.suffix_link

        # make the last suffix link. Current already has a child for c.
        # If that child is Previous itself, Previous was just added under
        # the root and links to the root. Otherwise it links to that child
        # (testing only "Current is Root" would mislink a node such as "ba"
        # when the root already had an edge for c).
        target = Current.children[c]
        if target is Previous:
            Previous.suffix_link = Root
        else:
            Previous.suffix_link = target

        # move to the newly added child of the longest path
        # (which is the new longest path)
        Longest = Longest.children[c]
    return Root
[Figure: adding s[i] along the boundary path; labels: longest, current, u, Prev, and the new edges labeled s[i]]

[Figure: the suffix tries built step by step for a, ab, aba, abaa, abaab, abaaba, and abaaba$]

Note: there's already a path for suffix "a", so we don't change it (we just add a suffix link to it)

How many nodes can a suffix trie have?

s = aaabbb

[Figure: the suffix trie of s = aaabbb]

• s = a^n b^n will have

  • 1 root node

  • n nodes in a path of “b”s

  • n paths of n+1 nodes each (one “a” node followed by n “b” nodes)

Total = n(n+1)+n+1 = O(n^2) nodes.

• This is not very efficient.

• How could you make it smaller?

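The node count can be checked empirically with a small sketch that builds the trie naively and counts nodes as they are created. The function name count_trie_nodes is illustrative, and no $ terminator is appended, matching the count on the slide.

```python
def count_trie_nodes(s):
    """Number of nodes (including the root) in the suffix trie of s."""
    root = {}
    count = 1  # the root
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            if c not in node:
                node[c] = {}
                count += 1
            node = node[c]
    return count


for n in range(1, 6):
    s = "a" * n + "b" * n
    # measured node count next to the closed form n(n+1) + n + 1
    print(n, count_trie_nodes(s), n * (n + 1) + n + 1)
```
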
So... we have to “trie” again...

Space-Efficient Suffix Trees

A More Compact Representation

s = abaaba$
    1234567

• Compress paths where there are no choices.

• Represent the sequence along the path using a range [i,j] that refers to the input string s.

Edge labels of the compressed tree, with the range i:j for each:

root
    $        7:7
    a        6:6
        $        7:7
        aba$     4:7
        ba       5:6
            $        7:7
            aba$     4:7
    ba       5:6
        $        7:7
        aba$     4:7

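Path compression with index ranges can be sketched as follows. This is a naive quadratic construction written for illustration, not Ukkonen's algorithm, and the names common_run, build_compressed, show and count_edges are assumptions of this sketch. It assumes s ends in a unique terminator such as $. Edges are stored as 0-based half-open pairs (i, j) and printed in the 1-based inclusive i:j style used on the slide.

```python
def common_run(s, positions):
    """Number of characters on which all suffixes s[p:] (p in positions) agree."""
    length = 0
    while (all(p + length < len(s) for p in positions)
           and len({s[p + length] for p in positions}) == 1):
        length += 1
    return length


def build_compressed(s, starts=None, depth=0):
    """Return {(i, j): subtree}; each edge is labeled by the range s[i:j]."""
    if starts is None:
        starts = list(range(len(s)))
    # group the suffixes below this node by their next character
    groups = {}
    for p in starts:
        if p + depth < len(s):
            groups.setdefault(s[p + depth], []).append(p)
    tree = {}
    for group in groups.values():
        # extend the edge for as long as there is no choice
        length = common_run(s, [p + depth for p in group])
        i = group[-1] + depth  # any occurrence works; this one uses the last
        tree[(i, i + length)] = build_compressed(s, group, depth + length)
    return tree


def show(s, tree, indent=0):
    for (i, j), subtree in sorted(tree.items(), key=lambda e: s[e[0][0]:e[0][1]]):
        print(" " * indent + f"{s[i:j]}  {i + 1}:{j}")
        show(s, subtree, indent + 2)


def count_edges(tree):
    return sum(1 + count_edges(sub) for sub in tree.values())


s = "abaaba$"
tree = build_compressed(s)
show(s, tree)
```

Each edge costs two integers regardless of how long its label is, which is the point of the range representation.
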
Space usage:

• In the compressed representation:

  - # leaves = O(n) [one leaf for each position in the string]
  - Every internal node is at least a binary split.
  - Each edge uses O(1) space.

• Therefore, the number of internal nodes is at most the number of leaves.

• And the number of edges is at most twice the number of leaves, and space per edge is O(1).

• Hence, linear space.

Constructing Suffix Trees - Ukkonen’s Algorithm

• The same idea as with the suffix trie algorithm.

• Main difference: not every trie node is explicitly represented in the tree.

• Solution: represent trie nodes as pairs (u, α), where u is a real node in the tree and α is some string leaving it.

[Figure: example for s = abab with nodes u and v, edge labels abab and bab, and suffix_link[v] = (u, ab)]

• Some additional tricks to get to O(n) time. (We’ll talk about these later.)

Storing more than one string with
Generalized Suffix Trees

Constructing Generalized Suffix Trees

Goal. Represent a set of strings P = {s1, s2, s3, ..., sm}.
Example. aat, tag, gat
Simple solution:

(1) build suffix tree for string aat#1tag#2gat#3

(2) For every leaf node, remove any text after the first # symbol.

[Figure: the suffix tree of aat#1tag#2gat#3 before and after trimming the leaf labels; for example, the leaf edge ag#2gat#3 becomes ag#2 and at#1tag#2gat#3 becomes at#1]

Applications of Generalized Suffix Trees

Longest common substring of S and T:
Build generalized suffix tree for {S, T}.
Find the deepest node that has descendants from both strings (containing both #1 and #2).

Determine the strings in a database {S1, S2, S3, ..., Sm} that contain query string q:
Build generalized suffix tree for {S1, S2, S3, ..., Sm}.
Follow the path for q in the suffix tree.
Suppose you end at node u: traverse the tree below u, and output i if you find a string containing #i.

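Both applications can be sketched on a naive, uncompressed generalized suffix trie. As a simplification that is an assumption of this sketch and not the slides' method, each node stores the set of string numbers whose suffixes pass through it, instead of marking leaves with #i and traversing the subtree at query time. All names are illustrative.

```python
def new_node():
    return {"children": {}, "ids": set()}


def build_generalized_trie(strings):
    """Insert every suffix of every string; tag each node with the string numbers."""
    root = new_node()
    for k, s in enumerate(strings, start=1):
        for i in range(len(s)):
            node = root
            for c in s[i:]:
                if c not in node["children"]:
                    node["children"][c] = new_node()
                node = node["children"][c]
                node["ids"].add(k)
    return root


def strings_containing(root, q):
    """Numbers of the strings that contain the non-empty query q."""
    node = root
    for c in q:
        node = node["children"].get(c)
        if node is None:
            return set()
    return set(node["ids"])


def longest_common_substring(node, wanted, prefix=""):
    """String spelled by the deepest node reached by all strings in wanted."""
    best = prefix
    for c, child in node["children"].items():
        if wanted <= child["ids"]:
            candidate = longest_common_substring(child, wanted, prefix + c)
            if len(candidate) > len(best):
                best = candidate
    return best


db = build_generalized_trie(["aat", "tag", "gat"])
print(strings_containing(db, "at"))

pair = build_generalized_trie(["abaaba", "bbaa"])
print(longest_common_substring(pair, {1, 2}))
```
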
Longest Common Extension

Longest common extension: We are given strings S and T. In the future, many pairs (i,j) will be provided as queries, and we want to quickly find:

the longest substring of S starting at i that matches a substring of T starting at j.

[Figure: S and T with the matching stretch LCE(i,j) marked from position i in S and position j in T; in the tree, LCA(i,j) sits above the leaves for i and j]

Build generalized suffix tree for S and T.   O(|S| + |T|)

Preprocess tree so that lowest common ancestors (LCA) can be found in constant time.   O(|S| + |T|)

Create an array mapping suffix numbers to leaf nodes.   O(|S| + |T|)

Given query (i,j):
    Find the leaf nodes for i and j.   O(1)
    Return string of LCA for i and j.   O(1)

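For checking an implementation of the scheme above, it helps to have a reference definition of LCE. The sketch below is deliberately not the constant-time method: it swaps in direct character comparison, which costs time proportional to the answer per query. The name lce and the 0-based indexing are assumptions of this sketch.

```python
def lce(s, t, i, j):
    """Length of the longest common extension of s[i:] and t[j:] (0-based)."""
    k = 0
    while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
        k += 1
    return k


print(lce("abaaba", "bbaa", 1, 1))  # compares "baaba" with "baa"
```

In the suffix-tree version, the same value is the string depth of the lowest common ancestor of the two leaves.
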
Using LCE to Find Palindromes

Maximal even palindrome at position i: the longest string to the left and right so that the left half is equal to the reverse of the right half.

[Figure: S with a palindrome centered at position i, flanked by characters x and y with x ≠ y; Sr with the same palindrome at position n-i]

Goal: find all maximal palindromes in S.

Construct Sr, the reverse of S.   O(|S|)

Preprocess S and Sr so that LCE queries can be solved in constant time (previous slide).   O(|S|)

LCE(i, n-i) is the length of each half of the longest even palindrome centered at i, so that palindrome has length 2·LCE(i, n-i).

For every position i:   O(|S|)
    Compute LCE(i, n-i)   O(1)

Total time = O(|S|)

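The reduction to LCE queries can be sketched as below. The LCE here is the naive direct comparison, standing in for the constant-time version, so the total time of this sketch is not O(|S|). The indexing is an assumption of the sketch: positions are 0-based and "centered at i" means the gap just before s[i].

```python
def lce(s, t, i, j):
    """Naive longest common extension of s[i:] and t[j:] (0-based)."""
    k = 0
    while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
        k += 1
    return k


def maximal_even_palindromes(s):
    """Return the non-empty maximal even palindromes of s, one per center."""
    n = len(s)
    sr = s[::-1]  # sr[n - i:] reads s[i - 1], s[i - 2], ... (the left side, reversed)
    result = []
    for i in range(1, n):
        k = lce(s, sr, i, n - i)  # half-length of the palindrome centered at i
        if k > 0:
            result.append(s[i - k:i + k])
    return result


print(maximal_even_palindromes("aabb"))
```
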
Recap

• Suffix tries are a natural way to store a string: search, count occurrences, and many other queries are answerable easily.

• But they are not space efficient: O(n^2) space.

• Suffix trees are space optimal: O(n), but require a little more subtle algorithm to construct.

• Suffix trees can be constructed in O(n) time using Ukkonen’s algorithm.

• Similar ideas can be used to store sets of strings.
