Lecture 04
In this part, we will describe algorithms that solve the following problem.
Problem 2.1: Given text T[0..n) and pattern P[0..m), report the first
position in T where P occurs, or n if P does not occur in T.
The algorithms can be easily modified to solve the following problems too.
• Existence: Is P a factor of T?
The naive, brute force algorithm compares P against T[0..m), then against
T[1..1 + m), then against T[2..2 + m) etc. until an occurrence is found or
the end of the text is reached. The text factor T[j..j + m) that is currently
being compared against the pattern is called the text window.
The worst case time complexity is O(mn). This happens, for example, when
P = a^(m−1)b = aaa..ab and T = a^n = aaaaaa..aa.
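The brute force search can be sketched in Python as follows (the function name and language are illustrative, not part of the notes):

```python
def brute_force(T, P):
    """Return the first position in T where P occurs, or len(T) if none."""
    n, m = len(T), len(P)
    for j in range(n - m + 1):
        # compare P against the text window T[j..j+m)
        if T[j:j + m] == P:
            return j
    return n
```

On the worst-case input above, every window requires m comparisons before the mismatch at its last character, giving the O(mn) behaviour.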
(Knuth–)Morris–Pratt
The brute force algorithm forgets everything it has learned about the text when it shifts the window; MP and KMP remember the matched prefix and use it to avoid repeating comparisons.
Example 2.3:
[Figure: the pattern ainainen aligned against the text ainaisesti-ainainen under brute force, Morris–Pratt, and Knuth–Morris–Pratt. Each alignment is annotated with its number of character comparisons (e.g., 6 comparisons for the first alignment). MP and KMP try fewer alignments and perform fewer comparisons than brute force.]
MP and KMP algorithms never go backwards in the text. When they
encounter a mismatch, they find another pattern position to compare
against the same text position. If the mismatch occurs at pattern position i,
then fail[i] is the next pattern position to compare.
The only difference between MP and KMP is how they compute the failure
function fail.
Algorithm 2.4: Knuth–Morris–Pratt / Morris–Pratt
Input: text T = T[0 . . . n), pattern P = P[0 . . . m)
Output: position of the first occurrence of P in T
(1) compute fail[0..m]
(2) i ← 0; j ← 0
(3) while i < m and j < n do
(4) if i = −1 or P[i] = T[j] then i ← i + 1; j ← j + 1
(5) else i ← fail[i]
(6) if i = m then return j − m else return n
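Algorithm 2.4 transcribed into Python (as a sketch, the failure function is computed naively here by testing borders directly; the efficient O(m) computation is very similar to the search loop itself):

```python
def naive_fail(P):
    """Naive O(m^2) failure function: fail[i] = length of the longest
    proper border of P[0..i), with fail[0] = -1."""
    m = len(P)
    fail = [-1] * (m + 1)
    for i in range(1, m + 1):
        # k = 0 (the empty border) always qualifies, so max() is safe
        fail[i] = max(k for k in range(i) if P[:k] == P[i - k:i])
    return fail

def mp_search(T, P):
    """Algorithm 2.4: first occurrence of P in T, or len(T) if none."""
    n, m = len(T), len(P)
    fail = naive_fail(P)
    i = j = 0
    while i < m and j < n:
        if i == -1 or P[i] == T[j]:
            i += 1; j += 1
        else:
            i = fail[i]          # shift the pattern, keep the text position
    return j - m if i == m else n
```

Note that j never decreases: on a mismatch only the pattern position i is reset.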
• When the algorithm finds a mismatch between P[i] and T[j], we know
that P[0..i) = T[j − i..j).
• Now we want to find a new i′ < i such that P[0..i′) = T[j − i′..j).
Specifically, we want the largest such i′.
• This means that P[0..i′) = T[j − i′..j) = P[i − i′..i). In other words,
P[0..i′) is the longest proper border of P[0..i).
Example 2.5: Let P = ainainen.

i  P[0..i)    border  fail[i]
0  ε          –       -1
1  a          ε       0
2  ai         ε       0
3  ain        ε       0
4  aina       a       1
5  ainai      ai      2
6  ainain     ain     3
7  ainaine    ε       0
8  ainainen   ε       0

[Figure: the corresponding automaton with states -1 to 8; the edge from state i−1 to state i is labeled by the pattern character P[i−1] (a i n a i n e n), and state -1 has a Σ-transition to state 0.]
An efficient algorithm for computing the failure function is very similar to
the search algorithm itself!
• When the algorithm reads fail[i] on line 4, fail[i] has already been
computed.
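A Python sketch of the O(m) failure-function computation, mirroring the search loop, together with the KMP strengthening (function names are illustrative):

```python
def compute_fail(P):
    """MP failure function: fail[i] = length of the longest proper
    border of P[0..i), with fail[0] = -1. Runs in O(m) time."""
    m = len(P)
    fail = [-1] * (m + 1)
    i, k = 0, -1                   # invariant: k = fail[i]
    while i < m:
        if k == -1 or P[i] == P[k]:
            i += 1; k += 1
            fail[i] = k            # border of P[0..i) extends that of P[0..i-1)
        else:
            k = fail[k]            # fall back to the next shorter border
    return fail

def compute_fail_kmp(P):
    """KMP variant: if P[fail[i]] = P[i], a mismatch at i would repeat
    immediately at fail[i], so skip directly to the next candidate."""
    m = len(P)
    fail = compute_fail(P)
    kfail = fail[:]
    for i in range(1, m):          # kfail[fail[i]] is final, since fail[i] < i
        if P[fail[i]] == P[i]:
            kfail[i] = kfail[fail[i]]
    return kfail
```

For P = ainainen this reproduces the fail values of Example 2.5.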
Theorem 2.7: Algorithms MP and KMP preprocess a pattern in time O(m)
and then search the text in time O(n) in the general alphabet model.
Proof. We show that the text search requires O(n) time. Exactly the same
argument shows that pattern preprocessing needs O(m) time.
Each round of the while-loop executes either the then-branch (line 4) or the
else-branch (line 5):
then Here j increases by one, and j never decreases, so this branch is
executed at most n times.
else Here i decreases since fail[i] < i. Since i only increases in the
then-branch, this branch cannot be taken more often than the
then-branch.
Thus the loop runs for O(n) rounds in total.
Shift-And (Shift-Or)
When the MP algorithm is at position j in the text T , it computes the
longest prefix of the pattern P [0..m) that is a suffix of T [0..j]. The
Shift-And algorithm computes all prefixes of P that are suffixes of T [0..j].
[Figure: the nondeterministic pattern-matching automaton for the pattern assi, with initial state -1 and states 0–3 reached by the transitions a, s, s, i.]
After processing T [j], D.i = 1 if and only if there is a path from the initial
state (state -1) to state i with the string T [0..j].
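The bit-parallel computation can be sketched in Python, using arbitrary-precision integers as bitvectors (bit i of D plays the role of D.i; names are illustrative):

```python
def shift_and(T, P):
    """Shift-And: return the first occurrence of P in T, or len(T) if none."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    # match bitvectors: bit i of B[c] is set iff P[i] = c
    B = {}
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)
    D = 0
    for j in range(n):
        # shift in a 1 for the initial state, keep bits where T[j] matches
        D = ((D << 1) | 1) & B.get(T[j], 0)
        if D & (1 << (m - 1)):     # state m-1 reached: occurrence ends at j
            return j - m + 1
    return n
```

After processing T[j], bit i of D is set exactly when P[0..i] is a suffix of T[0..j], matching the invariant stated above.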
In the integer alphabet model when m ≤ w:
If m > w, we can store the bitvectors in ⌈m/w⌉ machine words and perform
each bitvector operation in O(⌈m/w⌉) time.
If no pattern prefix longer than w matches a current text suffix, then only
the least significant machine word contains 1’s. There is no need to update
the other words; they will stay 0.
The Karp–Rabin hash function (Definition 1.43) was originally developed for
solving the exact string matching problem. The idea is to compute the hash
values or fingerprints H(P) and H(T[j..j + m)) for all j ∈ [0..n − m].
The text factor fingerprints are computed in a sliding window fashion. The
fingerprint for T[j + 1..j + 1 + m) = αT[j + m] is computed from the
fingerprint for T[j..j + m) = T[j]α in constant time using Lemma 1.44:
H(T[j + 1..j + 1 + m)) = ((H(T[j]α) − H(T[j]) · r^(m−1)) · r + H(T[j + m])) mod q
                       = ((H(T[j..j + m)) − T[j] · r^(m−1)) · r + T[j + m]) mod q .
A hash function that supports this kind of sliding window computation is
known as a rolling hash function.
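A small Python check of the sliding-window update (the values of r and q are arbitrary illustrative choices, not ones prescribed by the notes):

```python
def H(S, r, q):
    """Polynomial fingerprint: H(S) = (S[0]*r^(k-1) + ... + S[k-1]) mod q."""
    h = 0
    for c in S:
        h = (h * r + ord(c)) % q
    return h

T, m, r, q = "ainaisesti-ainainen", 8, 256, 1000003
s = pow(r, m - 1, q)                 # s = r^(m-1) mod q, precomputed once
h = H(T[0:m], r, q)                  # fingerprint of the first window
for j in range(len(T) - m):
    # rolling update: drop T[j], shift by r, append T[j + m]
    h = ((h - ord(T[j]) * s) * r + ord(T[j + m])) % q
    assert h == H(T[j + 1:j + 1 + m], r, q)
```

Each update is O(1), so all n − m + 1 window fingerprints take O(n) time after O(m) preprocessing.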
Algorithm 2.10: Karp–Rabin
Input: text T = T[0 . . . n), pattern P = P[0 . . . m)
Output: position of the first occurrence of P in T
(1) Choose q and r; s ← r^(m−1) mod q
(2) hp ← 0; ht ← 0
(3) for i ← 0 to m − 1 do hp ← (hp · r + P[i]) mod q // hp = H(P)
(4) for j ← 0 to m − 1 do ht ← (ht · r + T[j]) mod q
(5) for j ← 0 to n − m − 1 do
(6) if hp = ht then if P = T[j . . . j + m) then return j
(7) ht ← ((ht − T[j] · s) · r + T[j + m]) mod q
(8) if hp = ht then if P = T[j . . . j + m) then return j
(9) return n
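Algorithm 2.10 transcribed into Python (the concrete r and q are illustrative defaults; a careful implementation picks q as a random large prime to bound the collision probability):

```python
def karp_rabin(T, P, r=256, q=1000003):
    """Return the first occurrence of P in T, or len(T) if none."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return 0 if m == 0 else n
    s = pow(r, m - 1, q)                  # s = r^(m-1) mod q
    hp = ht = 0
    for i in range(m):
        hp = (hp * r + ord(P[i])) % q     # hp = H(P)
        ht = (ht * r + ord(T[i])) % q     # ht = H(T[0..m))
    for j in range(n - m):
        if hp == ht and T[j:j + m] == P:  # verify to rule out collisions
            return j
        ht = ((ht - ord(T[j]) * s) * r + ord(T[j + m])) % q
    if hp == ht and T[n - m:] == P:       # last window, j = n - m
        return n - m
    return n
```

The explicit comparison on a fingerprint match is what makes the algorithm correct despite possible hash collisions.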
The algorithms we have seen so far access every character of the text. If we
start the comparison between the pattern and the current text position from
the end, we can often skip some text characters completely.
There are many algorithms that start from the end. The simplest are the
Horspool-type algorithms.
The Horspool algorithm checks first the last character of the text window,
i.e., the character aligned with the last pattern character. If that doesn't
match, it shifts the pattern forward so that the checked text character
becomes aligned with the last occurrence of the same character in
P[0..m − 1), or completely past the checked position if the character does
not occur there.
More precisely, suppose we are currently comparing P against T[j..j + m).
Start by comparing P[m − 1] to T[k], where k = j + m − 1.
The length of the shift is determined by the shift table that is precomputed
for the pattern. shift[c] is defined for all c ∈ Σ:
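The definition of shift[c] is cut off at this point in the notes; the sketch below assumes the standard Horspool rule (shift[c] = m − 1 − i for the last occurrence i of c in P[0..m − 1), and shift[c] = m when c does not occur there):

```python
def horspool(T, P):
    """Return the first occurrence of P in T, or len(T) if none.
    Uses the standard Horspool shift rule (an assumption; the notes'
    own definition of shift[c] is not visible in this extract)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    shift = {}
    for i in range(m - 1):          # the last pattern position is excluded
        shift[P[i]] = m - 1 - i
    j = 0
    while j + m <= n:
        k = j + m - 1               # text position aligned with P[m-1]
        if T[k] == P[m - 1] and T[j:k] == P[:m - 1]:
            return j
        j += shift.get(T[k], m)     # characters not in P[0..m-1) shift by m
    return n
```

Because the shift depends only on T[k], text characters between the checked positions can be skipped entirely, which is the source of the algorithm's sublinear average-case behaviour.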
In the integer alphabet model: