
Lecture 6

This document provides an introduction to suffix trees and their role in text retrieval, focusing on string matching and indexing. It covers basic definitions, data structures like tries and Patricia trees, and details the construction of suffix trees from suffix tries. The document also emphasizes the efficiency of suffix trees for querying text data.


Introduction Basic Definitions Dictionaries Suffix tree Example Overview

An introduction to suffix trees and indexing

Tomáš Flouri Solon P. Pissis

Heidelberg Institute for Theoretical Studies

December 3, 2012

1 Introduction
Introduction
2 Basic Definitions
Graph theory
Alphabet and strings
3 Dictionaries
Trie
Patricia tree
4 Suffix tree
Suffix trie
Suffix tree
Ukkonen’s algorithm
5 Example
6 Overview
Introduction

Two main problem areas in text retrieval

1 String matching
2 Indexing and querying

Exact and approximate cases!


Exact string matching

Many efficient algorithms exist

Knuth-Morris-Pratt algorithm
Boyer-Moore, Boyer-Moore-Horspool, Turbo-Boyer-Moore, etc.
Aho-Corasick
...
Indexing - 1

Problem
Given a text T, we need to construct an efficient data structure D that serves as an index of T, so that we can query T efficiently.

What do we expect from an efficient indexing data structure?


Indexing - 2

Given a query pattern P, we want to find all occurrences of P in the preprocessed text T using the indexing data structure D.

The data structure D is efficient if

It can be built in time linear in the size of T (O(|T|))
It occupies space linear in the size of T (O(|T|))
It can answer a query whether P exists in T in time linear in the size of P (O(|P|))
It can report all occurrences of P in T in time O(|P| + occ), where occ is the number of occurrences
Indexing - 2

Some efficient indexing data structures include

Suffix automata (DAWG) and variations such as CDAWG


Suffix trees
Position heaps
Suffix arrays

In this lecture we will concentrate only on suffix trees


Graph theory

Graph, Cycle, Path

Graph
A graph is a pair G = (V, E) of sets such that E ⊆ V × V.

Path
A path of length n in a graph G = (V, E) is a sequence v_0, v_1, ..., v_n ∈ V such that (v_0, v_1), (v_1, v_2), ..., (v_{n−1}, v_n) ∈ E.

Cycle
A path v_0, v_1, ..., v_n, v_0, where n ≥ 2, is called a cycle.

[Figure: example graph on the vertices 1 to 6]
Rooted tree, subtree, tree height, node height

Tree
A rooted tree is a connected acyclic graph T = (V, E) with a special vertex v ∈ V called the root. Nodes with degree 1 are called leaves.
Alphabet and strings

Definition (Alphabet)
An alphabet Σ is a finite non-empty set whose elements are called
letters.

Definition (String)
A string on an alphabet Σ is a finite, possibly empty, sequence of
elements of Σ.
The zero-letter sequence is called the empty string, and is denoted
by ε.
The set of all possible strings on the alphabet Σ is denoted by Σ*.

Definition (Length of string)
The length of a string x is defined as the length of the sequence associated with x, and is denoted by |x|.
We denote by x[i], for all 1 ≤ i ≤ |x|, the letter at index i of x.

We also call index i, for all 1 ≤ i ≤ |x|, a position in x when x ≠ ε. It follows that the ith letter of x is the letter at position i in x, and that
x = x[1 . . |x|]

Definition (Factor of string)
A string x is a factor (substring) of a string y if there exist two strings u and v such that y = uxv.

We denote the factor (substring) of x starting at position i and ending at position j as x[i . . j].
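
The 1-based, inclusive notation above differs from the 0-based, half-open slices of most programming languages. The following Python sketch shows one way to map between the two; the helper names letter, factor, and is_factor are hypothetical and chosen only for this illustration.

```python
def letter(x, i):
    """Letter x[i] in the 1-based notation of the slides."""
    if not 1 <= i <= len(x):
        raise IndexError("position out of range")
    return x[i - 1]


def factor(x, i, j):
    """Factor x[i . . j], 1-based and inclusive on both ends."""
    return x[i - 1:j]


def is_factor(x, y):
    """True if y = u + x + v for some (possibly empty) strings u and v."""
    return any(y[k:k + len(x)] == x for k in range(len(y) - len(x) + 1))
```

For example, factor("banana", 2, 4) yields "ana", and the empty string is a factor of every string.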
Trie

Retrieval
Construct a dictionary for the set of words
{amy, andy, ann, rob, roger, ben, betty}
[Figure: trie for the seven words, with nodes labeled A to T and one letter per edge; every word is terminated by the end marker $.]
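
A minimal sketch of such a dictionary in Python, assuming a trie represented as nested dicts (one key per outgoing edge) and "$" as the end-of-word marker; the function names are hypothetical.

```python
END = "$"  # end-of-word marker (assumed not to occur inside any word)


def trie_insert(root, word):
    """Insert word letter by letter, creating missing nodes on the way."""
    node = root
    for ch in word + END:
        node = node.setdefault(ch, {})


def trie_contains(root, word):
    """Follow the path spelled by word + END; fail on the first missing edge."""
    node = root
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True


def build_trie(words):
    root = {}
    for w in words:
        trie_insert(root, w)
    return root


words = ["amy", "andy", "ann", "rob", "roger", "ben", "betty"]
trie = build_trie(words)
```

The end marker is what lets a lookup tell a stored word such as ann apart from a mere prefix such as an: both paths exist, but only the former continues with $. A lookup takes time proportional to the length of the word, independent of the number of stored words.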
Patricia tree

1 Construct a trie
2 Remove nodes with out-degree 1 and concatenate the labels
of the corresponding edges to one edge
[Figure: the resulting Patricia tree; chains of single-child nodes are merged, giving edge labels such as ro, be, my, dy, tty, and ger.]
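
The two construction steps can be sketched in Python as follows, assuming a nested-dict trie with "$" as end marker (so, unlike in the figure, the marker becomes part of the last edge label). The names build_trie and compress are hypothetical.

```python
def build_trie(words, end="$"):
    """Step 1: an ordinary trie as nested dicts, one letter per edge."""
    root = {}
    for w in words:
        node = root
        for ch in w + end:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Step 2: merge every chain of out-degree-1 nodes into a single edge."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:  # child has exactly one outgoing edge
            (nxt, grandchild), = child.items()
            label += nxt        # concatenate the edge labels
            child = grandchild
        out[label] = compress(child)
    return out


words = ["amy", "andy", "ann", "rob", "roger", "ben", "betty"]
patricia = compress(build_trie(words))
```

In this sketch the root is never removed, even if it has a single child; only nodes below it are merged away.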
Suffix trie

Given some text, e.g. t = banana, construct the suffix trie.

1 Generate the set Suff(t)
2 Construct a trie from Suff(t)
The resulting data structure is called a suffix trie.

Example
Given t = banana$, the set Suff(t) is

Suff(t) = {banana$, anana$, nana$, ana$, na$, a$}
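
A small Python sketch of both steps, again with nested dicts as an assumed trie representation and hypothetical function names. Unlike the set listed above, suffixes() also returns the one-letter suffix "$".

```python
def suffixes(t):
    """All non-empty suffixes of t, longest first."""
    return [t[i:] for i in range(len(t))]


def suffix_trie(t):
    """Insert every suffix of t into a trie, one letter per edge."""
    root = {}
    for s in suffixes(t):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Number of nodes in the (sub)trie rooted at node, including node itself."""
    return 1 + sum(count_nodes(child) for child in node.values())
```

Each node other than the root corresponds to one distinct non-empty factor of t, so the suffix trie can grow quadratically with |t|; this is the motivation for compacting it into a suffix tree.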


Suffix trie - Example

Given the text t = banana$, construct the suffix trie.


[Figure: suffix trie of banana$, one letter per edge; the leaves are numbered 1 to 6 by the starting position of their suffix.]
Suffix tree

Definition
A suffix tree is the Patricia tree of the suffix trie.

Construction
1 Construct a suffix trie of text x
2 Eliminate all nodes with out-degree 1 and concatenate the labels of the corresponding edges into one edge.
Suffix tree - Example

[Figure: suffix tree of banana$ after compaction, with edge labels a, na, na$, banana$, and $.]
Size of suffix tree

Theorem
A suffix tree of a text of length n consists of at most 2n − 1 nodes (or 2n if the empty suffix $ is taken into account).

Proof (by induction)
The node count is largest when every internal node has exactly two children, so it suffices to show that such a binary tree with N leaves has exactly N − 1 internal nodes.

Base case: A single leaf has no internal node, and for 2 leaves we have 1 internal node.
Inductive step: Assume that any binary tree with m < N leaves consists of exactly m − 1 internal nodes. We must prove that a binary tree with N leaves has exactly N − 1 internal nodes.
A binary tree with N leaves is made up of:
A root node.
A left binary tree with k leaves.
A right binary tree with N − k leaves.

According to the induction assumption
The left binary tree with k leaves consists of k − 1 internal nodes.
The right binary tree with N − k leaves consists of N − k − 1 internal nodes.
Therefore, the total number of internal nodes in a binary tree with N leaves is
(k − 1) + (N − k − 1) + 1 = N − 1
and thus the total number of nodes is N + (N − 1) = 2N − 1.
Suffix tree construction algorithms


Weiner’s algorithm (1973)
  Introduced as position tree
  Construction in linear time (for constant-size alphabets)
  Characterized as “algorithm of the year”
McCreight’s algorithm (1976)
  Improved space requirements over Weiner’s method
  Construction in linear time (for constant-size alphabets)
Ukkonen’s algorithm (1995)
  Same time and space requirements as McCreight’s
  Easier to understand
  On-line
Farach’s algorithm (1997)
  Linear-time construction algorithm for any type of alphabet
  Hard to implement
  The basis for new algorithms, e.g. linear-time construction of position heaps and suffix arrays
Ukkonen’s algorithm

Implicit suffix tree

Definition
An implicit suffix tree for string x is a tree obtained from the suffix
tree of x by
1 Removing $ from all edge labels
2 Removing any edge that has no label
3 Removing any node with only one child

[Figure: the suffix tree of banana$ (left) is turned into the implicit suffix tree of banana (right) by the three steps above; the resulting tree has three leaves, 1, 2, and 3, on edges labeled banana, anana, and nana.]
The implicit suffix tree of a string is what results from applying Ukkonen’s algorithm to the string without an added end marker $.

All suffixes are included, but not necessarily as labels of complete paths leading to leaves.

By appending a unique character at the end of the string (in our case the $), the implicit suffix tree is essentially the same as the (true) suffix tree (only without $).
String paths of implicit suffix trees

Given a string y[1 . . n], the implicit suffix tree I_i contains each suffix y[1 . . i], y[2 . . i], . . . , y[i] of the prefix y[1 . . i] as the label of some path (possibly ending in the middle of an edge).

That is, a string path is

a string that can be matched along the edges, starting from the root, or equivalently
a prefix of some node label
Ukkonen’s algorithm

1 Start with T = I_1.
2 Consecutively update T to I_2, I_3, . . . , I_{n+1} in n phases, where I_i represents the implicit suffix tree of prefix y[1 . . i].
Phase i + 1 updates T from I_i (with all suffixes of y[1 . . i]) to I_{i+1} (with all suffixes of y[1 . . i + 1]).
Each phase i + 1 consists of extensions j = 1, 2, . . . , i + 1 (one for each suffix of y[1 . . i + 1]).
Extension j ensures that suffix y[j . . i + 1] is in I_{i+1}.
Suffix extension rules

Rule 1  Path y[j . . i] ends at a leaf.
        Append y[i + 1] to the end of the edge label.
Rule 2  Path y[j . . i] does not end at a leaf, and the following character is not y[i + 1].
        Connect the end of the path to a new leaf j by an edge labeled y[i + 1].
        If the path ended in the middle of an edge, split that edge and insert a new node as the parent of leaf j.
Rule 3  Path y[j . . i + 1] is already in the tree, i.e. path y[j . . i] continues with y[i + 1].
        No update.
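
To see which rule fires in which extension, the following Python sketch runs all phases and all extensions naively. As a simplifying assumption it works on an uncompressed trie (nested dicts) rather than on the implicit suffix tree, so "ends at a leaf" means "ends at a node without children" and Rule 2 never needs to split an edge. The function name naive_phases is hypothetical, and indices are 0-based in the code.

```python
def naive_phases(y):
    """Return the trie of all suffixes of y and, per phase, the rule used by each extension."""
    root = {}
    log = []
    for i in range(len(y)):           # this phase appends the letter y[i]
        rules = []
        for j in range(i + 1):        # extension for the suffix starting at j
            node = root
            for ch in y[j:i]:         # locate the end of the existing path y[j..i-1]
                node = node[ch]
            if y[i] in node:
                rules.append(3)       # Rule 3: path already continues with y[i]
            elif not node and j < i:
                node[y[i]] = {}
                rules.append(1)       # Rule 1: path ended at a leaf, extend it
            else:
                node[y[i]] = {}
                rules.append(2)       # Rule 2: branch off towards a new leaf
        log.append(rules)
    return root, log


trie, log = naive_phases("abcabx")
# log == [[2], [1, 2], [1, 1, 2], [1, 1, 1, 3], [1, 1, 1, 3, 3], [1, 1, 1, 2, 2, 2]]
```

Two patterns are visible in the log: each phase starts with a run of Rule 1 applications that never gets shorter, and once Rule 3 applies, every later extension of the same phase applies it as well.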
Complexity

Complexity
The algorithmic approach presented so far runs in O(n^3).

Proof
Consider a single phase i + 1.
Each extension rule can be applied in O(1) ⇒ applying all i + 1 extensions takes time Θ(i).
Locating the ends of string paths y[1 . . i], . . . , y[i] by traversing the edge labels takes time Σ_{k=1}^{i} k = Θ(i^2).
⇒ Therefore, the total time for all phases i = 1, 2, . . . , n is
Σ_{i=1}^{n} i^2 = Θ(n^3)

This is even worse than the naive algorithm, which runs in O(n^2).
We will see how this approach, with the use of some simple tricks, can achieve linear run-time.
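
Written out, the two sums in the proof have the following closed forms (standard identities, not stated on the slide):

```latex
\begin{align*}
\text{phase } i+1:\quad & \sum_{k=1}^{i} k = \frac{i(i+1)}{2} = \Theta(i^2) \\
\text{all phases}:\quad & \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6} = \Theta(n^3)
\end{align*}
```

For n = 10, for instance, the second sum evaluates to 385.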
Suffix links

The extensions of phase i + 1 need to locate the ends of all i + 1 suffixes of y[1 . . i] (including the empty suffix), and apply Rules 1-3.

How to do this efficiently?

For each internal node v of I_i labeled xα, where x ∈ Σ and α ∈ Σ*, define s(v) to be the node labeled by α.
(Do these nodes actually exist?)

Then a pointer from v to s(v) is called the suffix link of v.

Note: If node v is labeled by a single character then α = ε and s(v) is the root node.
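
The question whether s(v) exists can be checked by brute force on a finished tree. The Python sketch below relies on an assumption that is not spelled out on the slide: in the suffix tree of a text whose last letter is unique, the internal nodes (including the root) are labeled exactly by the factors that are followed by at least two different letters. The function name branching_factors is hypothetical, and the running time is quadratic or worse, so this is for small examples only.

```python
def branching_factors(t):
    """Labels of the internal nodes: factors of t followed by >= 2 distinct letters."""
    follow = {}
    n = len(t)
    for a in range(n):
        for b in range(a, n):
            # the factor t[a:b] (possibly empty) is followed by the letter t[b]
            follow.setdefault(t[a:b], set()).add(t[b])
    return {f for f, letters in follow.items() if len(letters) >= 2}


nodes = branching_factors("xabxac")
# nodes == {"", "a", "xa"}: the root plus the internal nodes a and xa

# For every internal node labeled x + alpha, the node labeled alpha exists too.
links_exist = all(label[1:] in nodes for label in nodes if label)
```

The check succeeds for a simple reason: whenever xα is followed by two different letters, so is α.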
Example of suffix links

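Suffix tree for the string xabxac

[Figure: the root has edges labeled bxac (leaf 3), xa, c (leaf 6), and a; node xa has edges c (leaf 4) and bxac (leaf 1); node a has edges c (leaf 5) and bxac (leaf 2); the suffix link of node xa leads to node a.]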
Why do we need suffix links?

Extension j (of phase i + 1) finds the end of the path y[j . . i] in the tree (and extends it with character y[i + 1]).

Extension j + 1 similarly finds the end of the path y[j + 1 . . i].

Assume that v is an internal node whose string path y[j]α is (essentially) a prefix of y[j . . i]. Then we can avoid traversing path α when locating the end of path y[j + 1 . . i], by starting from node s(v).

Do suffix links always exist?


Suffix links existence

Observation
If an internal node v is created during extension j (of phase i + 1), then extension j + 1 will find the node s(v).

Let v be labeled xα.
Node v can only be created by extension Rule 2.
That is, v is inserted at the end of path y[j . . i], which continued with some character c ≠ y[i + 1].
⇒ Therefore, paths xαc and αc have been entered before phase i + 1.
⇒ In extension j + 1, node s(v) is either found or created at the end of path α = y[j + 1 . . i].
Speeding up path traversals

Consider extensions of phase i + 1.

Extension 1 extends path y[1 . . i] with character y[i + 1].

Extension 1 is easy as path y[1 . . i] always ends at leaf 1, and is thus extended by Rule 1.

We can perform extension 1 in constant time, if we maintain a pointer to the edge at the end of y[1 . . i].

What about subsequent extensions j + 1 (for j = 1, 2, . . . , i)?


Locating subsequent paths

Extension j has located the end of path y[j . . i].

Starting from there, walk up at most one node, either

1 to the root, or
2 to a node with a suffix link (from here on, v denotes this node and s(v) the target of its link)

In case of (1), traverse path y[j + 1 . . i] explicitly downwards from the root.
In case of (2), let xα be the label of v
⇒ y[j . . i] = xαβ for some β ∈ Σ*

Then follow the suffix link of v, and continue by matching β downwards from node s(v) (whose string path is α).

Having found the end of path αβ = y[j + 1 . . i], apply the extension rules to ensure that it extends with y[i + 1].

Finally, if a new internal node w was created in extension j, set its suffix link to point to the end node of path y[j + 1 . . i].
Locating subsequent paths - Illustration

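In case of (2), let xα be the label of v
⇒ y[j . . i] = xαβ for some β ∈ Σ* (in this case β = abcd)

[Figure: node v and its suffix link to s(v), whose string path is α; the edge labels shown below the two nodes are a, bc, abcd, and d.]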
Speeding up explicit traversals

Skip/Count trick
In phase i + 1, each path y [j . . i], which is followed in extension j,
is known to exist in the tree
⇒ The path can be followed by choosing the correct edges, instead
of examining every character
Let y [k] be the next character to be matched on path y [j . . i]
Now an edge labeled by y [p . . q] can be traversed simply by
checking that y [p] = y [k], and skipping the next q − p characters
of y [j . . i]
⇒ The time to traverse a path is proportional to the number of
nodes on the path (instead of its string length)
Lemma
For any node v with a suffix link to s(v), it holds that

depth(s(v)) ≥ depth(v) − 1

Sketch of proof
Every ancestor of v other than the root has a suffix link, and these links lead to distinct ancestors of s(v). Hence s(v) has at least depth(v) − 1 ancestors.
Linear bound for any single phase

Theorem
Using suffix links and the skip/count trick, a single phase i takes
time O(n)

Proof
There are i + 1 ≤ n + 1 extensions in phase i + 1
In any extension, other work except tree-traversal (that is,
extension rules) takes O(1) time only
How to bound the work for traversing the tree?
To find the end of the next path, an extension first moves at most
one level up. Then a suffix link may be followed, which is followed
by a down-traversal to match the rest of the path
The up-walk in any extension decreases the current node depth by at most one (since it moves up at most one node) and each suffix link traversal decreases the node depth by at most another one (previous Lemma).
⇒ Thus the current node depth is decremented at most 2n times during the entire phase.
On the other hand, the current node depth cannot exceed n ⇒ it is incremented (by following downward edges) at most 3n times ⇒ the total run-time of a phase is thus O(n).

Improvement
Since there are n phases, the total run-time is O(n^2).
Final improvements (1)

Some extensions can be found unnecessary to compute explicitly.

Observation 1 - Rule 3 terminates current phase

If path y[j . . i + 1] is already in the tree, so are paths y[j + 1 . . i + 1], . . . , y[i + 1]

⇒ Phase i + 1 can be finished at the first extension j that applies Rule 3
Final improvements (2)

Observation 2 - Once a leaf, always a leaf

A node created as a leaf remains a leaf thereafter because no extension rule adds children to a leaf.
If extension j created a leaf (numbered j), extension j of any later phase i + 1 applies Rule 1 (appending the next character y[i + 1] to the label of the edge ending at leaf j).

Explicit applications of Rule 1 can be eliminated as follows:

Use a “compressed” edge representation (i.e. indices p and q instead of substring y[p . . q]), and represent the end position of each terminal edge by a global value e, for the “current end position” (phase).
Eliminating extensions

Denote by j_i the last non-void extension of phase i (that is, application of Rule 1 or 2)

Obs 1 ⇒ extensions 1, . . . , j_i of phase i are non-void ⇒ leaves 1, . . . , j_i have been created at the end of phase i

Obs 2 ⇒ extensions 1, . . . , j_i of any subsequent phase all apply Rule 1

⇒ j_{i+1} ≥ j_i

⇒ Execute only extensions j_i + 1, j_i + 2, . . . explicitly in phase i + 1


Single phase algorithm

Algorithm for phase i + 1 with unnecessary extensions eliminated

1 Set e = i + 1 (implements extensions 1, . . . , j_i implicitly)
2 Compute extensions j_i + 1, . . . , j* until j* > i + 1 or Rule 3 was applied in extension j*
3 Set j_{i+1} = j* − 1 (for the next phase)

All these tricks together can be shown to lead to linear run-time


Complexity of the tuned implementation (1)

Theorem
Ukkonen’s algorithm builds the suffix tree for y [1 . . n] in time
O(n), when implemented using the mentioned tricks.

Proof

The extensions computed explicitly in any two phases i and i + 1 are disjoint except for extension j*, which may be computed anew in phase i + 1.

The second computation of extension j* can be done in O(1) by remembering the end of the path entered in the previous computation.
Complexity of the tuned implementation (2)

Let j = 1, . . . , n + 1 denote the index of the current extension.

Over all phases 2, . . . , n + 1 index j never decreases, but it can remain the same at the start of phases 3, . . . , n + 1 ⇒ at most 2n extensions are computed explicitly.

Similarly to the previous proof (skip/count), the current node depth can be decremented at most 4n times, and thus the total length of all downward traversals is bounded by 5n.
Obtaining the true suffix tree

Finally, the implicit suffix tree I_{n+1} can be converted to the true suffix tree of y[1 . . n]$ in the following way:

All occurrences of the “current end position” marker e on edge labels can be replaced by n + 1 (with a simple tree traversal, in time O(n)).
Ukkonen’s algorithm

Reads a string y of size n from left to right.

The algorithm is on-line, i.e. at step 1 ≤ i ≤ n it constructs an implicit suffix tree of prefix y[1 . . i], which can then be easily converted to the (true) suffix tree by appending a unique symbol $ that has not appeared before.

Runs in O(n) time for constant-size alphabets or O(n log n) for general alphabets.

Requires O(n) space.
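
For readers who want to experiment, here is a compact Python sketch of the tuned algorithm. It follows a common "active point" formulation (active node, active edge, active length, plus a counter of pending suffixes) rather than the phase/extension bookkeeping of the slides, and it uses 0-based, half-open index pairs. All names (Node, build_suffix_tree, occurrences) are hypothetical, and the text is assumed to end with a unique letter such as $.

```python
class Node:
    """The edge into this node is labeled text[start:end]; end=None means 'current end' e."""
    def __init__(self, start, end, link=None):
        self.start, self.end, self.link = start, end, link
        self.children = {}


def build_suffix_tree(text):
    root = Node(0, 0)
    root.link = root
    node, edge, length = root, 0, 0      # active point
    remainder = 0                        # suffixes still to be inserted explicitly
    for i, ch in enumerate(text):        # one phase per letter
        last_new = None                  # internal node waiting for its suffix link
        remainder += 1
        while remainder > 0:
            if length == 0:
                edge = i
            child = node.children.get(text[edge])
            if child is None:            # Rule 2 at a node: new leaf
                node.children[text[edge]] = Node(i, None, root)
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                span = (i + 1 if child.end is None else child.end) - child.start
                if length >= span:       # skip/count: jump over the whole edge
                    edge += span
                    length -= span
                    node = child
                    continue
                if text[child.start + length] == ch:   # Rule 3: end the phase
                    if last_new is not None and node is not root:
                        last_new.link = node
                    length += 1
                    break
                # Rule 2 inside an edge: split it and hang a new leaf below
                split = Node(child.start, child.start + length, root)
                node.children[text[edge]] = split
                split.children[ch] = Node(i, None, root)
                child.start += length
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link         # follow the suffix link
    return root


def occurrences(root, text, pattern):
    """0-based start positions of pattern in text, found by walking the tree."""
    n, node, depth, k = len(text), root, 0, 0
    while k < len(pattern):
        child = node.children.get(pattern[k])
        if child is None:
            return []
        label = text[child.start:n if child.end is None else child.end]
        m = min(len(label), len(pattern) - k)
        if label[:m] != pattern[k:k + m]:
            return []
        k, depth, node = k + m, depth + len(label), child
    found, stack = [], [(node, depth)]
    while stack:                         # every leaf below is one occurrence
        v, d = stack.pop()
        if not v.children:
            found.append(n - d)
        for c in v.children.values():
            stack.append((c, d + (n if c.end is None else c.end) - c.start))
    return sorted(found)
```

Leaves keep end=None, which plays the role of the global value e: Rule 1 is never executed explicitly. A query such as occurrences(build_suffix_tree("banana$"), "banana$", "ana") walks |P| letters down the tree and then collects the leaves below, matching the O(|P| + occ) requirement stated in the introduction (apart from the final sort).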


Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 1
y = a b c a b x a b c $
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10 Explicit -
y = a b c a b x a b c $ Rule 2

(1, e)

1
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 2
y = a b c a b x a b c $

(1, e)

1
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e)

1
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit
y = a b c a b x a b c $

(1, e)

(2, e)

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 3
y = a b c a b x a b c $

(1, e)

(2, e)

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e)

(2, e)

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e)

(2, e)

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 4
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10 Explicit -
y = a b c a b x a b c $ Rule 3

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 5
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Implicit
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10 Explicit -
y = a b c a b x a b c $ Rule 3

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 6
y = a b c a b x a b c $

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10 Skip all
y = a b c a b x a b c $ implicit

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10 Explicit -
y = a b c a b x a b c $ Rule 2

(1, e) (3, e)

(2, e)
3

1 2
Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2)

(3, e)

(3, e)
(2, e)
3

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2)

(3, e)

(6, e)
(3, e)
(2, e)
4 3

1 2
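
The Rule 2 step in this frame splits the edge (1, e) into (1, 2) followed by (3, e) and hangs the new leaf 4 on (6, e). A minimal sketch of such a split, assuming a simple node layout; `Node` and `split_and_add_leaf` are hypothetical names, and `None` stands for the open end e:

```python
class Node:
    def __init__(self, start=0, end=None, leaf=None):
        self.start = start     # 1-based start of the incoming edge label
        self.end = end         # inclusive end; None stands for the open end 'e'
        self.leaf = leaf       # suffix number for leaves, None for internal nodes
        self.children = {}     # first character of an edge label -> child node


def split_and_add_leaf(y, parent, child, keep, new_start, new_leaf):
    """Rule 2 inside an edge: keep the first `keep` characters of the edge
    parent -> child, then hang a new leaf whose label starts at new_start."""
    middle = Node(child.start, child.start + keep - 1)
    parent.children[y[child.start - 1]] = middle    # parent now points at the new node
    child.start += keep                             # old child keeps the rest of the label
    middle.children[y[child.start - 1]] = child
    middle.children[y[new_start - 1]] = Node(new_start, None, new_leaf)
    return middle


y = "abcabxabc$"
root = Node()
leaf1 = Node(1, None, leaf=1)
root.children["a"] = leaf1

# Phase 6, extension 4: "ab" is followed by c in the tree but by x in the text.
middle = split_and_add_leaf(y, root, leaf1, keep=2, new_start=6, new_leaf=4)
edges = {ch: (n.start, n.end, n.leaf) for ch, n in middle.children.items()}

print((middle.start, middle.end))  # (1, 2)
print(edges)                       # {'c': (3, None, 1), 'x': (6, None, 4)}
```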

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2)

(3, e)

(6, e)
(3, e)
(2, e)
4 3

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2)

(2, 2)
(3, e)

(6, e)
(3, e)
4 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2)

(2, 2)
(3, e)

(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Create suffix link
y = a b c a b x a b c $

(1, 2)

(2, 2)
(3, e)

(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Create suffix link
y = a b c a b x a b c $

(1, 2)

(2, 2)
(3, e)

(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10

y = a b c a b x a b c $

(1, 2)

(2, 2)
(3, e)

(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 7
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Skip all implicit
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 3
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 8
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Skip all implicit
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 3
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 9
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Skip all implicit
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 3
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Phase 10
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Skip all implicit
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 3
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(6, e) (6, e)
(3, e)
4 5 3

(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(3, 3) (6, e) (6, e)

4 5 3
(4, e)
(3, e)

1 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(3, 3) (6, e) (6, e)

4 5 3
(4, e) (10, e)
(3, e)

1 7 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Follow suffix link
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(3, 3) (6, e) (6, e)

4 5 3
(4, e) (10, e)
(3, e)

1 7 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(3, 3) (6, e) (6, e)

4 5 3
(4, e) (10, e)
(3, e)

1 7 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(3, 3) (6, e)(3, 3) (6, e)

4 5 3
(4, e) (10, e)
(3, e)
1 7 2

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(3, 3) (6, e)(3, 3) (6, e)

4 5 3
(4, e) (10, e) (10, e)
(3, e)
1 7 2 8

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Follow suffix link
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(3, 3) (6, e)(3, 3) (6, e)

4 5 3
(4, e) (10, e) (10, e)
(3, e)
1 7 2 8

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2)
(3, e)
6
(3, 3) (6, e)(3, 3) (6, e)

4 5 3
(4, e) (10, e) (10, e)
(3, e)
1 7 2 8

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2) (3, 3)

6
(3, 3) (6, e)(3, 3) (6, e) (4, e)

4 5 3
(4, e) (10, e) (10, e)
(3, e)
1 7 2 8

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2) (3, 3)

6
(3, 3) (6, e)(3, 3) (6, e) (4, e)
(10, e)
4 5 9 3
(4, e) (10, e) (10, e)
(3, e)
1 7 2 8

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Create suffix link
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2) (3, 3)

6
(3, 3) (6, e)(3, 3) (6, e) (4, e)
(10, e)
4 5 9 3
(4, e) (10, e) (10, e)
(3, e)
1 7 2 8

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Create suffix link
y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2) (3, 3)

6
(3, 3) (6, e)(3, 3) (6, e) (4, e)
(10, e)
4 5 9 3
(4, e) (10, e) (10, e)
(3, e)
1 7 2 8

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10

y = a b c a b x a b c $

(1, 2) (6, e)

(2, 2) (3, 3)

6
(3, 3) (6, e)(3, 3) (6, e) (4, e)
(10, e)
4 5 9 3
(4, e) (10, e) (10, e)
(3, e)
1 7 2 8

Suffix tree - Example


1 2 3 4 5 6 7 8 9 10
Explicit - Rule 2
y = a b c a b x a b c $

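Final suffix tree (edge labels as index pairs, leaf numbers in brackets):

root
  (1, 2)
    (3, 3)
      (4, e) [1]
      (10, e) [7]
    (6, e) [4]
  (2, 2)
    (3, 3)
      (4, e) [2]
      (10, e) [8]
    (6, e) [5]
  (3, 3)
    (4, e) [3]
    (10, e) [9]
  (6, e) [6]
  (10, e) [10]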
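
To read the index-pair labels of the finished tree as substrings, each pair can be decoded against y. A small sketch, assuming 1-based inclusive positions and using `None` for the open end e; `edge_label` is a hypothetical helper name:

```python
def edge_label(y, start, end=None):
    """Substring y[start..end], 1-based and inclusive; end=None stands for 'e'."""
    stop = len(y) if end is None else end
    return y[start - 1:stop]


y = "abcabxabc$"
# The five edges leaving the root of the final tree.
root_edges = [(1, 2), (2, 2), (3, 3), (6, None), (10, None)]
decoded = [edge_label(y, start, end) for start, end in root_edges]

print(decoded)  # ['ab', 'b', 'c', 'xabc$', '$']
```

Storing two integers per edge instead of the substring itself keeps the space per edge constant.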

Application - finding all occurrences of a query

1 2 3 4 5 6 7 8 9 10

y = a b c a b x a b c $

Suffix tree of y (edge labels as substrings, leaf numbers in brackets):

root
  ab
    c
      abxabc$ [1]
      $ [7]
    xabc$ [4]
  b
    c
      abxabc$ [2]
      $ [8]
    xabc$ [5]
  c
    abxabc$ [3]
    $ [9]
  xabc$ [6]
  $ [10]

Query the string a



Application - finding all occurrences of a query

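Find the node that the path spelling a leads to (a ends inside the edge ab, so this is the node ab)

1 2 3 4 5 6 7 8 9 10
y = a b c a b x a b c $

Suffix tree of y (edge labels as substrings, leaf numbers in brackets):

root
  ab
    c
      abxabc$ [1]
      $ [7]
    xabc$ [4]
  b
    c
      abxabc$ [2]
      $ [8]
    xabc$ [5]
  c
    abxabc$ [3]
    $ [9]
  xabc$ [6]
  $ [10]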

Query the string a



Application - finding all occurrences of a query

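Get the leaves below that node

1 2 3 4 5 6 7 8 9 10
y = a b c a b x a b c $

Suffix tree of y (edge labels as substrings, leaf numbers in brackets):

root
  ab
    c
      abxabc$ [1]
      $ [7]
    xabc$ [4]
  b
    c
      abxabc$ [2]
      $ [8]
    xabc$ [5]
  c
    abxabc$ [3]
    $ [9]
  xabc$ [6]
  $ [10]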

Query the string a



Application - finding all occurrences of a query

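Leaves indicate the starting positions of a (1, 4 and 7)

↓     ↓     ↓
1 2 3 4 5 6 7 8 9 10
y = a b c a b x a b c $

Suffix tree of y (edge labels as substrings, leaf numbers in brackets):

root
  ab
    c
      abxabc$ [1]
      $ [7]
    xabc$ [4]
  b
    c
      abxabc$ [2]
      $ [8]
    xabc$ [5]
  c
    abxabc$ [3]
    $ [9]
  xabc$ [6]
  $ [10]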

Query the string a
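
The query procedure of these frames can be sketched in Python. To stay short, the tree here is built by naive quadratic insertion of every suffix, which is not Ukkonen's algorithm, and edges store substrings rather than index pairs. It assumes y ends with a unique terminator; all names (`Node`, `build_suffix_tree`, `leaves_below`, `find_occurrences`) are hypothetical.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}   # first character -> (edge label, child node)
        self.leaf = leaf     # 1-based suffix start for leaves, None otherwise


def build_suffix_tree(y):
    """Naive O(n^2) construction; assumes y ends with a unique terminator."""
    root = Node()
    for i in range(len(y)):
        node, rest = root, y[i:]
        while True:
            if rest[0] not in node.children:
                node.children[rest[0]] = (rest, Node(leaf=i + 1))
                break
            label, child = node.children[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at offset k.
                middle = Node()
                middle.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], middle)
                child = middle
            node, rest = child, rest[k:]
    return root


def leaves_below(node):
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, child in node.children.values():
        found.extend(leaves_below(child))
    return found


def find_occurrences(root, pattern):
    """Follow the path spelling pattern, then report the leaves below it."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return []
        label, child = node.children[rest[0]]
        m = min(len(label), len(rest))
        if label[:m] != rest[:m]:
            return []
        node, rest = child, rest[m:]   # pattern may end inside this edge
    return sorted(leaves_below(node))


tree = build_suffix_tree("abcabxabc$")
print(find_occurrences(tree, "a"))    # [1, 4, 7]
print(find_occurrences(tree, "abc"))  # [1, 7]
```

Descending costs time proportional to the query length, and collecting the leaves costs time proportional to the number of occurrences, since every internal node has at least two children.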



Overview

We had a quick look at indexing.


Preprocessing a given text
Efficient querying afterwards

We’ve seen what suffix trees are and some of their properties.
Suffix trees are Patricia (compacted) suffix tries for a string x[1..n]
At most 2n − 1 nodes
Exactly n leaves (x ends with a unique terminator $)
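
A short counting argument for the node bound, as a sketch. It assumes n ≥ 2 and that x ends with a unique terminator, so every suffix ends at its own leaf. The symbols \ell (number of leaves) and m (number of internal nodes, root included) are introduced here and do not appear on the slides.

```latex
\begin{align*}
\ell &= n
  && \text{one leaf per suffix of } x[1..n] \\
\ell + m - 1 &\ge 2m
  && \text{edges} = \text{nodes} - 1 \text{, and each internal node has} \ge 2 \text{ children} \\
m &\le \ell - 1 = n - 1 \\
\ell + m &\le n + (n - 1) = 2n - 1
\end{align*}
```

For the example y = abcabxabc$ with n = 10, the tree has 10 leaves and 6 internal nodes (root, ab, abc, b, bc, c), so 16 nodes in total, within the bound of 19.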

We’ve seen Ukkonen’s algorithm.


Fairly simple to understand
Linear time construction for constant-size alphabets

Reminder - Next week

Next week’s lecture will take place at

SR 148, Building 50.34
