ADS UNIT5
• Text: The text (or string) is the sequence where the pattern is searched for.
• Match: A match occurs if the pattern is found within the text. The goal of pattern matching
is to find all instances where this occurs or to determine whether the pattern exists in the text.
Knuth-Morris-Pratt(KMP)
Optimizes the naive approach by avoiding redundant comparisons. It preprocesses the pattern to
determine, for each prefix of the pattern, the longest proper prefix which is also a suffix, allowing
the search to skip some comparisons after a mismatch. It is suitable for applications where the same
pattern is searched repeatedly in multiple texts.
Boyer-Moore
Works by comparing the pattern to the text from right to left at each alignment. It uses two
heuristics, the bad character rule and the good suffix rule, to skip sections of the text, offering
potentially sub-linear time complexity. It is highly efficient for large texts and is considered one
of the fastest single-pattern matching algorithms.
Rabin-Karp
Uses hashing to find occurrences of one or more patterns. It hashes the pattern and each substring of
the text of the same length and then compares these hashes. If the hashes match, it compares the
characters directly to confirm the match. It is useful in plagiarism detection or searching for
multiple patterns simultaneously.
Finite Automata
Constructs a state machine based on the pattern. The text is then processed character by character,
transitioning between states of the automaton. It is effective when the same pattern is matched
against many texts, as the automaton needs to be constructed only once.
• A simple example of applying brute force is a linear search for an element in an array. Every
element of the array is compared with the data being searched for. This is called a brute force
approach because it is the most direct and simple way one could think of searching the given data
in the array.
• At each position of the text, compare the characters in the pattern with the characters in the
text.
• If a mismatch is found, move the pattern window one position to the right in the text.
• If a match is found (all characters in the pattern match the corresponding characters in the
text), record the starting position of the match, then move the pattern window one position to the
right.
• Repeat the comparison and the shift until the pattern window reaches the end of the text.
Pseudocode:
function bruteForcePatternMatch(T, P):
    n = length(T)
    m = length(P)
    for i from 0 to n - m:
        j = 0
        while j < m and T[i + j] == P[j]:
            j = j + 1
        if j == m:
            return i // Pattern found at position i
    return -1 // Pattern not found
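The pseudocode above can be sketched in C++ roughly as follows. This is a minimal illustration, not a tuned implementation; it keeps the function name from the pseudocode and assumes the text and pattern are held in `std::string`.

```cpp
#include <string>

// Returns the index of the first occurrence of pattern in text, or -1 if none.
int bruteForcePatternMatch(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    // Try every window start i for which the pattern still fits in the text.
    for (int i = 0; i + m <= n; ++i) {
        int j = 0;
        // Extend the match one character at a time.
        while (j < m && text[i + j] == pattern[j]) {
            ++j;
        }
        if (j == m) {
            return i;  // all m characters matched at window start i
        }
    }
    return -1;  // pattern not found
}
```

A call such as `bruteForcePatternMatch("abcabcabc", "cab")` scans windows 0 and 1 before the window at index 2 matches.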
The Boyer-Moore algorithm works by pre-processing the pattern and then comparing it with the text
from right to left, starting with the rightmost character of the pattern, while the pattern itself
moves from left to right across the text. It is based on the principle that when a mismatch is found,
the mismatched text character tells us how far the pattern can be shifted, so many alignments never
need to be examined. In practice, this backwards comparison needs far fewer character comparisons
than naive string search methods.
1. Preprocessing
The Boyer-Moore algorithm uses two heuristics for preprocessing the pattern:
a. Bad Character Heuristic
b. Good Suffix Heuristic
• Create an array (often called the "Bad Character Table") with one entry for each character. If a
character is not in the pattern, its value is set to the pattern length m.
• For each character in the pattern, set its entry to pattern length - index - 1, where index is the
position of its last (rightmost) occurrence in the pattern.
$$BCT(ch) = \begin{cases} m & \text{if } ch \text{ is not in } P \\ m - j - 1 & \text{otherwise, where } j \text{ is the index of the last occurrence of } ch \text{ in } P \end{cases}$$
Example:
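As a hedged illustration of the table definition above, the following sketch builds the Bad Character Table in C++. The function name `buildBadCharTable` and the fixed 256-entry table (one entry per single-byte character) are assumptions made for this example.

```cpp
#include <array>
#include <string>

// Builds the bad character table for a pattern of length m:
// table[c] = m if c does not occur in the pattern,
// otherwise m - j - 1, where j is the index of the last occurrence of c.
std::array<int, 256> buildBadCharTable(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::array<int, 256> table;
    table.fill(m);  // default: character not in the pattern
    for (int j = 0; j < m; ++j) {
        // Later occurrences overwrite earlier ones, so the last one wins.
        table[static_cast<unsigned char>(pattern[j])] = m - j - 1;
    }
    return table;
}
```

For the pattern "abacab" (m = 6), the last occurrences are 'a' at index 4, 'b' at index 5 and 'c' at index 3, so the entries are 1, 0 and 2 respectively; every other character gets 6.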
2. Searching
Once the preprocessing is done, the actual search begins:
• Align the pattern with the beginning of the text.
• Compare the pattern with the text from right to left, starting at the last character of the pattern.
• If a mismatch occurs and the mismatched text character is in the pattern: shift the pattern such
that the mismatched character in the text aligns with the rightmost occurrence of it in the pattern.
• If the character is not in the pattern: shift the entire pattern past the character.
3. Repeat Comparison
After shifting the pattern according to the above rules, repeat the comparison process:
• Continue comparing the pattern with the text from right to left.
4. Termination
The algorithm terminates when the pattern has been shifted past the end of the text, indicating no
more matches are possible.
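1. n: length of T, i: index of T
2. m: length of P, j: index of P
3. Initialize i = m-1
4. while i < n
5.     j := m-1
6.     while j >= 0 and T[i] = P[j]
           (a) decrement i
           (b) decrement j
7.     if j < 0 then return true //all characters matched; the pattern starts at T[i+1]
8.     i := i + max(bctable[T[i]], m-j)
9. return false //at the end, if pattern matching fails, return false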
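The search steps above can be sketched in C++ as follows. This is a simplified illustration under stated assumptions: it uses only the bad character table (the good suffix heuristic is left out), the function name `boyerMooreBadChar` is hypothetical, the table has 256 entries for single-byte characters, and the function returns the index of the first match rather than true/false.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Boyer-Moore search using only the bad character table.
// Returns the index of the first match, or -1 if there is none.
int boyerMooreBadChar(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return 0;

    // bct[c] = m if c is not in the pattern, else m - 1 - (last index of c).
    std::array<int, 256> bct;
    bct.fill(m);
    for (int j = 0; j < m; ++j) {
        bct[static_cast<unsigned char>(pattern[j])] = m - j - 1;
    }

    int i = m - 1;  // text index aligned with the last pattern character
    while (i < n) {
        int j = m - 1;
        // Compare right to left while characters agree.
        while (j >= 0 && text[i] == pattern[j]) {
            --i;
            --j;
        }
        if (j < 0) {
            return i + 1;  // whole pattern matched
        }
        // Shift by the table value, but always move the window forward (m - j).
        i += std::max(bct[static_cast<unsigned char>(text[i])], m - j);
    }
    return -1;
}
```

Taking the maximum with `m - j` guarantees that the pattern window always advances by at least one position, even when the table value alone would move it backwards.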
The KMP algorithm compares the pattern to the text from left to right, but shifts the pattern P more
intelligently than the brute-force algorithm. When a mismatch occurs, what is the most we can
shift the pattern while avoiding redundant comparisons? The answer is given by the longest prefix of
P[0..j] that is also a suffix of P[1..j], where P[0..j] is the part of the pattern already matched.
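Here's a step-by-step explanation of how the KMP algorithm works.
First, build the lps[] (longest prefix suffix) array for the pattern:
• Start by initializing the first element of lps[] to 0, as a single character can't have any proper
prefix or suffix.
• Maintain two pointers, len and i, where len is the length of the last longest prefix suffix.
Initially, len := 0 and i := 1.
• If pattern[len] equals pattern[i], set lps[i] = len + 1 and increment both i and len.
• If they don't match and len is not 0, update len to lps[len - 1].
• If they don't match and len is 0, set lps[i] = 0 and increment i.
Then search the text, with i indexing the text and j indexing the pattern, both starting at 0:
• If text[i] equals pattern[j], increment both i and j.
• If j equals the pattern length, a match is found. Optionally report the match, then set j to
lps[j - 1].
• If they don't match and j is not 0, set j to lps[j - 1]. Do not increment i here.
• If they don't match and j is 0, increment i.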
4. Repeat Comparison
1. Continue comparing the pattern with the text from left to right.
2. Apply the shifting rules whenever a mismatch is encountered.
3. Continue this process until the end of the text is reached or all occurrences of the pattern are
found.
5. Termination
The algorithm terminates when the pattern has been shifted past the end of the text, indicating no
more matches are possible.
lps[] = computeLPSArray(P)
i := 0, j := 0
while i < n
    if (P[j] = T[i]) then
        if (j = m-1) then
            return true
        increment i and j
    else if (j > 0) then
        j = lps[j-1]
    else
        increment i
return false
Function computeLPSArray(P: pattern)
    m: length of P
    lps[0] := 0
    len := 0, i := 1
    while (i < m)
        if (P[len] = P[i]) then
            lps[i] = len+1
            increment len and i
        else if (len > 0) then
            len = lps[len-1]
        else
            lps[i] = 0
            increment i
    return lps
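The two routines above can be sketched in C++ as follows. This is an illustrative version under a few assumptions: `computeLPSArray` keeps the name from the pseudocode, while `kmpSearch` is a hypothetical name, and instead of returning true at the first match it collects the starting index of every occurrence, using the "set j to lps[j - 1]" step described earlier.

```cpp
#include <string>
#include <vector>

// lps[i] = length of the longest proper prefix of pattern[0..i]
// that is also a suffix of pattern[0..i].
std::vector<int> computeLPSArray(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> lps(m, 0);
    int len = 0;  // length of the current longest prefix-suffix
    int i = 1;
    while (i < m) {
        if (pattern[i] == pattern[len]) {
            ++len;
            lps[i] = len;
            ++i;
        } else if (len > 0) {
            len = lps[len - 1];  // fall back to a shorter prefix, keep i
        } else {
            lps[i] = 0;
            ++i;
        }
    }
    return lps;
}

// Returns the starting indices of all occurrences of pattern in text.
std::vector<int> kmpSearch(const std::string& text, const std::string& pattern) {
    std::vector<int> matches;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > n) return matches;

    const std::vector<int> lps = computeLPSArray(pattern);
    int i = 0;  // index into text
    int j = 0;  // index into pattern
    while (i < n) {
        if (text[i] == pattern[j]) {
            ++i;
            ++j;
            if (j == m) {
                matches.push_back(i - m);  // full match ends at i - 1
                j = lps[j - 1];            // continue, allowing overlaps
            }
        } else if (j > 0) {
            j = lps[j - 1];  // shift the pattern; do not advance i
        } else {
            ++i;
        }
    }
    return matches;
}
```

Because the text index i never moves backwards, the search phase makes a number of comparisons proportional to n, after preprocessing proportional to m.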
Naïve String:
A naïve string data structure typically refers to basic or simple implementations for
managing and manipulating strings, often used in the context of string matching or pattern
searching. These are relatively inefficient compared to more advanced algorithms or data
structures, but they are conceptually simple and easy to understand.
In the context of string matching, the naïve approach for searching a pattern in a text is the Naïve
String Matching Algorithm. Here's an explanation of the basic idea:
• The algorithm slides the pattern P over the text T from left to right, checking at
each position if the substring of T starting at that position matches P.
• For each shift of the pattern, we compare the pattern with the substring of text
character by character.
• If a match is found, we return the position of the match.
• Pseudocode:
for i from 0 to n-m:
    if T[i:i+m] == P:
        return i # Found match at index i
return -1 # No match found
• Time Complexity:
• Worst-case time complexity is O(n * m), where n is the length of the text and m is
the length of the pattern.
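• In the worst case, all m characters of P are compared at each of the n - m + 1 positions in T,
for example when T consists only of the character a and P is a run of a's followed by a single b.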
• Space Complexity:
• The algorithm needs only O(1) extra space beyond the text and the pattern.
• Linked list representation:
• A string can also be represented as a linked list of characters, where each node
contains one character of the string.
• This is less common but can be useful in certain situations where characters need
to be frequently inserted or deleted at different positions.
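A linked-list string can be tried out with the standard `std::list<char>` container. The sketch below is only an illustration; the helper name `insertIntoLinkedString` is hypothetical, and it converts to and from `std::string` purely to keep the example short.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Stores the string as a linked list of characters, inserts ch before
// position pos (clamped to the end), and returns the result as a std::string.
std::string insertIntoLinkedString(const std::string& s, std::size_t pos, char ch) {
    std::list<char> chars(s.begin(), s.end());  // one node per character
    const std::size_t clamped = std::min<std::size_t>(pos, chars.size());
    auto it = chars.begin();
    std::advance(it, static_cast<std::ptrdiff_t>(clamped));  // walk to the node
    chars.insert(it, ch);  // no other characters are moved
    return std::string(chars.begin(), chars.end());
}
```

Reaching the position still requires walking the list node by node, but the insertion itself does not shift any of the following characters, unlike insertion into an array-based string.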
Horspool's algorithm utilizes a shift table to determine how many characters can be safely
skipped when a mismatch occurs. At each alignment, the pattern is compared with the text. If they
do not match, the text character aligned with the last character of the pattern is looked up in the
shift table, and the pattern is shifted forward by that amount.
The search procedure then reports the index of the first occurrence of the pattern in the text:
```
function search(needle, haystack)
    T := preprocess(needle)
    skip := 0
    while length(haystack) - skip >= length(needle)
        if same(haystack[skip:], needle, length(needle))
            return skip
        skip := skip + T[haystack[skip + length(needle) - 1]]
    return -1
```
The shift table is created during the initialization of the algorithm. At each alignment, the pattern
is compared with the text from right to left, while the pattern itself progresses from left to right
through the text.
The length of the shift is determined by the shift table, `shift[c]`, which is defined for all
characters $$c$$ in the alphabet Σ:
* If $$c$$ does not occur among the first $$m - 1$$ characters of the pattern $$P$$, then `shift[c] = m`.
* Otherwise, `shift[c] = m - 1 - i`, where $$P[i] = c$$ is the last occurrence of $$c$$ among the first $$m - 1$$ characters of $$P$$.
The space complexity for the shift table is determined by the size of the alphabet, not the length
of the pattern.
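The search procedure and shift table can be sketched together in C++ as follows. This is a minimal illustration, not a reference implementation: the name `horspoolSearch` is hypothetical, and the 256-entry table assumes an alphabet of single-byte characters.

```cpp
#include <array>
#include <string>

// Horspool search: returns the index of the first occurrence, or -1.
int horspoolSearch(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return 0;

    // shift[c] = m unless c occurs among the first m - 1 pattern characters,
    // in which case it is m - 1 - (index of the last such occurrence).
    std::array<int, 256> shift;
    shift.fill(m);
    for (int i = 0; i + 1 < m; ++i) {
        shift[static_cast<unsigned char>(pattern[i])] = m - 1 - i;
    }

    int skip = 0;  // current start of the window in the text
    while (n - skip >= m) {
        // Compare the window with the pattern from right to left.
        int j = m - 1;
        while (j >= 0 && text[skip + j] == pattern[j]) {
            --j;
        }
        if (j < 0) {
            return skip;  // whole pattern matched
        }
        // Shift by the table entry of the text character under the last pattern position.
        skip += shift[static_cast<unsigned char>(text[skip + m - 1])];
    }
    return -1;
}
```

Leaving the last pattern character out of the table keeps every shift at least 1, so the window always moves forward.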
Rabin-Karp:
The Rabin-Karp algorithm is a pattern-matching algorithm that uses hashing to compare patterns and
text. Here, the term hashing refers to the process of mapping a larger input value to a smaller output
value, called the hash value. Comparing hash values first avoids most unnecessary character
comparisons. With a good hash function, the Rabin-Karp algorithm has an average time complexity of
O(n + m), where n is the length of the text and m is the length of the pattern. In the worst case,
when many hash values collide, it degrades to O(n * m).
Now, compare the hash value of the pattern with the hash value of the current substring of the text.
If they match, check whether the characters also match. If they do, we have found a match; otherwise,
move on to the next substring.
If the hash values do not match, move the window forward by one character.
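The step from one window to the next can be written as a recurrence. This is a sketch under assumed notation: H_i stands for the hash of the window T[i..i+m-1] (a symbol introduced here for illustration), p is the base and q is the modulus of the hash, and "mod" denotes the non-negative remainder.

```latex
H_0 = \left( \sum_{k=0}^{m-1} T[k] \, p^{\,m-1-k} \right) \bmod q
\qquad
H_{i+1} = \Big( \big( H_i - T[i] \, p^{\,m-1} \big) \, p + T[i+m] \Big) \bmod q
```

For instance, treating digits as character values with T = "3141", m = 2, p = 10 and q = 13: H_0 = 31 mod 13 = 5, and H_1 = ((5 - 3 * 10) * 10 + 4) mod 13 = (-246) mod 13 = 1, which agrees with hashing the window "14" directly, since 14 mod 13 = 1.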
Example
The following example practically demonstrates the working of the Rabin-Karp algorithm.
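#include <iostream>
#include <string>
using namespace std;
// Print every index at which pattern occurs in text, using a rolling hash.
// The base p and prime modulus q are illustrative choices; input is assumed to be ASCII.
void rabinKarp(const string& text, const string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > n) return;
    const long long p = 256;
    const long long q = 1000000007;
    // p_m = p^(m-1) mod q, the weight of the leading character of a window
    long long p_m = 1;
    for (int i = 0; i < m - 1; i++) p_m = (p_m * p) % q;
    // Compute the hash value for the pattern and for the first window in the text (of length m)
    long long patternHash = 0;
    long long textHash = 0;
    for (int i = 0; i < m; i++) {
        patternHash = (patternHash * p + pattern[i]) % q;
        textHash = (textHash * p + text[i]) % q;
    }
    for (int i = 0; i <= n - m; i++) {
        // If the hashes match, compare the characters to rule out a collision
        if (patternHash == textHash && text.compare(i, m, pattern) == 0) {
            cout << "Pattern found at index " << i << endl;
        }
        // Calculate the hash for the next window in the text, if we're not at the end
        if (i < n - m) {
            textHash = (textHash - text[i] * p_m) % q;
            textHash = (textHash * p + text[i + m]) % q;
            if (textHash < 0) {
                textHash += q;
            }
        }
    }
}
int main() {
    string text = "abcabcabc";
    string pattern = "abc";
    // Call the Rabin-Karp function to search for the pattern in the text
    rabinKarp(text, pattern);
    return 0;
}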