ADS UNIT5

Pattern matching is the process of identifying specific sequences of characters within a larger text or data structure. Various algorithms exist for pattern matching, including Brute Force, Knuth-Morris-Pratt (KMP), Boyer-Moore, Rabin-Karp, and Aho-Corasick, each with different efficiencies and applications. This document provides detailed explanations of these algorithms, their methodologies, and pseudocode for implementation.

UNIT-5

What is Pattern Matching?


It is the process of identifying specific sequences of characters or elements within a larger structure such as text, data, or images. Think of it as finding a specific word in a sentence, or a sequence of symbols or values within a larger sequence.

Basic Concepts of Pattern Matching


• Pattern: A pattern is a sequence of characters, symbols, or other data that forms a search
criterion. In text processing, a pattern could be a string of characters.

• Text: The text (or string) is the sequence where the pattern is searched for.

• Match: A match occurs if the pattern is found within the text. The goal of pattern matching
is to find all instances where this occurs or to determine whether the pattern exists in the text.

Common Algorithms of Pattern Matching


Brute Force Pattern Matching Algorithm
Checks for the pattern at every possible position in the text. For each position, it compares the pattern with the corresponding substring of the text. It is effective for small texts or patterns, but inefficient for large texts.

Knuth-Morris-Pratt (KMP)
Optimizes the naive approach by avoiding redundant comparisons. It pre-processes the pattern to determine the longest prefix that is also a suffix, allowing the search to skip some comparisons. It is suitable for applications where the same pattern is searched repeatedly in multiple texts.

Boyer-Moore
Works by comparing the pattern to the text from right to left. It uses two heuristics, the bad character rule and the good suffix rule, to skip sections of the text, offering potentially sub-linear time complexity. It is highly efficient for large texts and is considered one of the fastest single-pattern matching algorithms.

Rabin-Karp
Uses hashing to find pattern occurrences. It hashes the pattern and the text's substrings of the same length, then compares these hashes. If the hashes match, it checks for a direct character-by-character match. It is useful in plagiarism detection or when searching for multiple patterns simultaneously.

Finite Automata
Constructs a state machine based on the pattern. The text is then processed character by character, transitioning between states of the automaton. It is effective when the same pattern is matched against many texts, as the automaton needs to be constructed only once.

Aho-Corasick Pattern Matching Algorithm


A more complex algorithm used for finding all occurrences of any of a finite number
of patterns within the text. It constructs a trie of patterns and then a state machine
from the trie. Ideal for matching a large number of patterns simultaneously, like in
virus scanning or "grep" utilities.

Brute force Pattern Matching


A brute force algorithm is a straightforward approach to solving a problem. It also refers to a programming style that does not use any shortcuts to improve performance.
• It is based on trial and error, where the programmer merely relies on the computer's fast processing power to solve a problem, rather than applying advanced algorithms and techniques developed with human intelligence.

• It might increase both space and time complexity.

• A simple example of applying brute force is linearly searching for an element in an array. When each and every element of the array is compared with the data being searched for, this can be termed a brute force approach, as it is the most direct and simple way of searching for the given data in the array.
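As a minimal sketch of that idea, a brute-force linear search might look like this in C++ (the function name linearSearch is our own):

```cpp
#include <vector>

// Brute-force linear search: compare every element of the array with
// the key until a match is found.
// Returns the index of the first match, or -1 if the key is absent.
int linearSearch(const std::vector<int>& a, int key) {
    for (int i = 0; i < (int)a.size(); i++) {
        if (a[i] == key) {
            return i;
        }
    }
    return -1;
}
```

Every element may be inspected before the search stops, so it takes O(n) time in the worst case.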

Brute Force Pattern Matching Algorithm


1. Start at the beginning of the text and slide the pattern window over it.

2. At each position of the text, compare the characters of the pattern with the corresponding characters of the text.

3. If a mismatch is found, move the pattern window one position to the right in the text.

4. If a match is found (all characters of the pattern match the corresponding characters of the text), record the starting position of the match, then move the pattern window one position to the right.

5. Repeat steps 2 to 4 until the pattern window reaches the end of the text.
Pseudo code:

function bruteForcePatternMatch(T, P):
    n = length(T)
    m = length(P)

    for i from 0 to n - m:
        j = 0
        while j < m and T[i + j] == P[j]:
            j = j + 1
        if j == m:
            return i    // pattern found at position i
    return -1           // pattern not found
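The pseudocode above translates almost line by line into C++ (a sketch, not an optimized implementation):

```cpp
#include <string>

// Brute-force pattern matching: try every alignment of P in T and
// compare character by character.
// Returns the index of the first match, or -1 if P does not occur in T.
int bruteForcePatternMatch(const std::string& T, const std::string& P) {
    int n = T.length();
    int m = P.length();
    for (int i = 0; i <= n - m; i++) {
        int j = 0;
        while (j < m && T[i + j] == P[j]) {
            j = j + 1;
        }
        if (j == m) {
            return i;   // pattern found at position i
        }
    }
    return -1;          // pattern not found
}
```

The worst case is O(n * m) comparisons, e.g. a text of all 'A's searched for a pattern like "AAB".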

Boyer–Moore Pattern Matching


The Boyer–Moore Pattern Matching algorithm is one of the most efficient string-searching algorithms and is a standard benchmark for practical pattern matching. It was developed by Robert Stephen Boyer and J Strother Moore in the year 1977.

The Boyer-Moore algorithm works by pre-processing the pattern and then scanning the text from
right to left, starting with the rightmost characters. It is based on the principle that if a mismatch
is found, there is no need to match the remaining characters. This backwards approach significantly
reduces the algorithm's time complexity compared to naive string search methods.

Here's a step-by-step explanation of how the Boyer-Moore algorithm works:

1. Preprocessing
Boyer-Moore algorithm uses two heuristics for preprocessing the pattern:
a. Bad Character Heuristic
b. Good Suffix Heuristic

Bad Character Heuristic (Building Bad Character Table)


Determines how far the pattern can be shifted when a character mismatch occurs.

• Create an array (often called the "Bad Character Table") to store the rightmost occurrence of each character in the pattern. If a character is not in the pattern, its value is set to the pattern length.

• For each character in the pattern, update the table with its last occurrence, using the value pattern length - index - 1. Formally, for a pattern P of length m:

BCT(ch) = m,          if ch does not occur in P
BCT(ch) = m - j - 1,  otherwise, where j is the index of the last occurrence of ch in P
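As a concrete illustration of this rule, the table can be built in a few lines of C++ (a sketch; the function name buildBCTable and the 128-entry ASCII table are our own choices):

```cpp
#include <string>
#include <vector>

// Build the bad character table for pattern P of length m:
//   BCT(ch) = m          if ch does not occur in P
//   BCT(ch) = m - j - 1  where j is the last occurrence of ch in P
std::vector<int> buildBCTable(const std::string& P) {
    int m = P.length();
    std::vector<int> bct(128, m);   // default: character not in pattern
    for (int j = 0; j < m; j++) {
        bct[(unsigned char)P[j]] = m - j - 1;
    }
    return bct;
}
```

For example, with P = "CAB" (m = 3) this yields BCT('C') = 2, BCT('A') = 1, BCT('B') = 0, and 3 for every other character.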
2. Searching
Once the preprocessing is done, the actual search begins:
• Align the pattern with the beginning of the text.

• Compare the pattern with the text from right to left.


• If all characters of the pattern match, a valid occurrence is found.

3. Shifting the Pattern:


When a mismatch occurs:

• If the mismatched character occurs in the pattern: shift the pattern so that the mismatched character in the text aligns with its rightmost occurrence in the pattern.

• If the mismatched character does not occur in the pattern: shift the entire pattern past that character.


4. Repeat Comparison
After shifting the pattern according to the above rules, repeat the comparison process:
• Continue comparing the pattern with the text from right to left.

• Apply the shifting rules whenever a mismatch is encountered.


• Continue this process until the end of the text is reached or all occurrences of the pattern
are found.

5. Termination
The algorithm terminates when either:

• The pattern has been shifted past the end of the text, indicating no more matches are possible.

• All occurrences of the pattern have been found.

Pseudo code for Boyer–Moore Pattern Matching algorithm

Boolean BoyerMoore(T: text, P: pattern)
    n: length of T, i: index of T
    m: length of P, j: index of P
    bctable[ch] is an array, for ch = 0 to 127 (ASCII values)

    1. Initialize all elements of array bctable to m.
    2. Set bctable[P[j]] = m - j - 1, for j = 0 to m-1.
    3. Initialize i = m-1.
    4. Repeat steps 5 to 8, while (i < n):
    5.     j := m-1
    6.     Repeat steps (a) and (b), while (j ≥ 0 and P[j] = T[i]):
               (a) decrement i
               (b) decrement j
    7.     if (j = -1) return true    // pattern matching successful
    8.     i := i + max(bctable[T[i]], m - j)
    9. return false    // at the end, if pattern matching fails, return false
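The pseudocode above can be sketched in C++ as follows (bad character heuristic only, as in the pseudocode; the good suffix rule is omitted):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Boyer–Moore search using only the bad character heuristic.
// Returns true if P occurs somewhere in T.
bool boyerMoore(const std::string& T, const std::string& P) {
    int n = T.length(), m = P.length();
    if (m == 0 || m > n) return false;

    // Steps 1-2: build the bad character table.
    std::vector<int> bctable(128, m);
    for (int j = 0; j < m; j++) {
        bctable[(unsigned char)P[j]] = m - j - 1;
    }

    // Steps 3-8: scan the text, comparing right to left.
    int i = m - 1;
    while (i < n) {
        int j = m - 1;
        while (j >= 0 && P[j] == T[i]) {
            i--;
            j--;
        }
        if (j == -1) return true;   // pattern matching successful
        // Shift by the bad character value, but always move at least
        // one position past the current alignment.
        i += std::max(bctable[(unsigned char)T[i]], m - j);
    }
    return false;                   // pattern not found
}
```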

KMP Pattern Matching

The Knuth-Morris-Pratt (KMP) pattern matching algorithm is an efficient string searching method developed by Donald Knuth, James H. Morris, and Vaughan Pratt (published in 1977). It is used to find the occurrences of a "pattern" within a "text" without checking every single character in the text, which is a significant improvement over the brute-force approach.

The KMP algorithm compares the pattern to the text from left to right, but shifts the pattern P more intelligently than the brute-force algorithm. When a mismatch occurs, the question is: what is the largest shift of the pattern that avoids redundant comparisons? The answer is determined by the largest prefix of P[0..j] that is also a suffix of P[1..j].
Here's a step-by-step explanation of how the KMP algorithm works:

1. Preprocessing(Building the LPS Array)


The core idea is to preprocess the pattern to construct an LPS (Longest Prefix Suffix) array. This array stores, for each sub-pattern of the pattern, the length of the longest proper prefix that is also a suffix. This preprocessing helps in determining the next position in the pattern to be compared, thus avoiding redundant comparisons.

1. Start by initializing lps[0] to 0, as a single character cannot have a proper prefix that is also a suffix.

2. Maintain two pointers, len and i, where len is the length of the previous longest prefix suffix. Initially, len := 0 and i := 1.

3. Repeat steps 4 to 6 while (i < m):

4. If pattern[len] equals pattern[i], set lps[i] = len + 1 and increment both i and len.

5. If they don't match and len is not 0, update len to lps[len - 1] (without incrementing i).

6. If they don't match and len is 0, set lps[i] = 0 and increment i.


2. Searching
Once the preprocessing is done, the actual search begins:
• Align the pattern with the beginning of the text.

• Compare the pattern with the text from left to right.

• If all characters of the pattern match, a valid occurrence is found.

3. Shifting the Pattern:


• Compare pattern[j] with text[i].

• If they match, increment both i and j.

• If j equals the pattern length, a match is found. Optionally report the match, then set j to
lps[j - 1].

• If they don't match and j is not 0, set j to lps[j - 1]. Do not increment i here.

• If they don't match and j is 0, increment i.

4. Repeat Comparison
• Continue comparing the pattern with the text from left to right.
• Apply the shifting rules whenever a mismatch is encountered.
• Continue this process until the end of the text is reached or all occurrences of the pattern are found.

5. Termination
The algorithm terminates when either:

• The pattern has been shifted past the end of the text, indicating no more matches are possible.

• All occurrences of the pattern have been found.

Pseudo code for KMP Pattern Matching algorithm

Function KMP(T: text, P: pattern)
    n: length of T, i: index of T
    m: length of P, j: index of P
    lps[j] is an array, for j = 0 to m-1

    lps[] = computeLPSArray(P)
    i := 0, j := 0
    while (i < n)
        if (P[j] = T[i]) then
            if (j = m-1) then
                return true
            increment i and j
        else if (j > 0) then
            j = lps[j-1]
        else
            increment i
    return false

Function computeLPSArray(P: pattern)
    m: length of P
    len := 0, i := 1
    while (i < m)
        if (P[len] = P[i]) then
            lps[i] = len + 1
            increment len and i
        else if (len > 0) then
            len = lps[len-1]
        else
            lps[i] = 0
            increment i
Naïve String:
A naïve string data structure typically refers to a basic or simple implementation for managing and manipulating strings, often used in the context of string matching or pattern searching. These implementations are relatively inefficient compared to more advanced algorithms or data structures, but they are conceptually simple and easy to understand.
In the context of string matching, the naïve approach for searching a pattern in a text is the Naïve
String Matching Algorithm. Here's an explanation of the basic idea:

1. Naïve String Matching Algorithm:


This is the simplest way to search for a pattern P in a text T.

• Input:
  • Text string T of length n.
  • Pattern string P of length m.

• Procedure:
  • The algorithm slides the pattern P over the text T from left to right, checking at each position whether the substring of T starting at that position matches P.
  • For each shift of the pattern, we compare the pattern with the substring of the text character by character.
  • If a match is found, we return the position of the match.

• Pseudocode:

for i from 0 to n - m:
    if T[i : i+m] == P:
        return i    # found match at index i
return -1           # no match found

• Time Complexity:
  • The worst-case time complexity is O(n * m), where n is the length of the text and m is the length of the pattern.
  • In the worst case, every character of T has to be compared with every character of P.

• Space Complexity:
  • O(1): this algorithm needs only a constant amount of extra space.

2. Naïve String Data Structures:


In general, naïve string data structures are characterized by straightforward storage and access
methods, without the sophisticated indexing, hashing, or advanced preprocessing techniques
found in more efficient data structures.
• Array-based String:
  • A string can be represented as an array of characters.
  • It supports simple operations like accessing individual characters, concatenation, and length calculation, but it does not support advanced operations like substring search efficiently.

• Linked List-based String:
  • A string can also be represented as a linked list of characters, where each node contains one character of the string.
  • This is less common but can be useful in situations where characters need to be frequently inserted or deleted at different positions.
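A minimal C++ sketch of such a linked-list representation (the type and function names here are our own illustration):

```cpp
#include <string>

// A naïve linked-list string: one node per character.
struct CharNode {
    char ch;
    CharNode* next;
};

// Build a linked list from an ordinary string.
CharNode* fromString(const std::string& s) {
    CharNode* head = nullptr;
    CharNode** tail = &head;
    for (char c : s) {
        *tail = new CharNode{c, nullptr};
        tail = &(*tail)->next;
    }
    return head;
}

// Even length calculation requires walking the whole list: O(n).
int listLength(const CharNode* head) {
    int n = 0;
    for (const CharNode* p = head; p != nullptr; p = p->next) n++;
    return n;
}
```

Insertion or deletion in the middle needs only pointer updates, but accessing the i-th character costs O(i), which is why this layout is rarely used for strings in practice.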

3. Limitations of Naïve Approaches:


• Inefficient Searching: As mentioned, the naïve string matching algorithm has a poor
time complexity of O(n * m).
• Lack of Advanced Features: Simple data structures do not allow fast substring searches
or operations like pattern matching and text indexing.
• Memory Overhead: Depending on the implementation (array or linked list), there might
be memory overhead compared to more compact or optimized structures like tries or
suffix trees.

4. More Advanced Approaches:


As the problems involving string manipulation grow more complex, more efficient algorithms
and data structures are developed, such as:
• Knuth-Morris-Pratt (KMP) Algorithm
• Rabin-Karp Algorithm
• Boyer-Moore Algorithm
• Suffix Trees and Arrays
• Trie Data Structures
While the naïve approach is useful for understanding the basic principles, advanced algorithms
significantly reduce time complexity and improve performance for large-scale string
manipulation.
Horspool:
The Boyer–Moore–Horspool algorithm, also known as Horspool's algorithm, is an efficient method for locating substrings within a string. Nigel Horspool published it in 1980 as SBM, and it is considered a simplification of the Boyer–Moore string-search algorithm. The algorithm trades space for time, achieving an average-case complexity of O(n) on random text, though it has a worst-case complexity of O(nm), where m is the length of the pattern and n is the length of the search string.

Horspool's algorithm utilizes a shift table to determine how many characters can be safely
skipped when a mismatch occurs. The algorithm checks the text character aligned with the last
character of the pattern. If it doesn't match, the pattern is shifted forward until a match is found.

Here's how the preprocessing phase works in pseudocode:


```
function preprocess(pattern)
    T := new table of 256 integers
    for i from 0 to 256 exclusive
        T[i] := length(pattern)
    for i from 0 to length(pattern) - 1 exclusive
        T[pattern[i]] := length(pattern) - 1 - i
    return T
```

The search procedure then reports the index of the first occurrence of the pattern in the text:
```
function search(needle, haystack)
    T := preprocess(needle)
    skip := 0
    while length(haystack) - skip >= length(needle)
        if same(haystack[skip:], needle, length(needle))
            return skip
        skip := skip + T[haystack[skip + length(needle) - 1]]
    return -1
```

The shift table is created during the initialization of the algorithm. The pattern is compared with the text from right to left, while the alignment advances left to right through the text.

The length of the shift is determined by the shift table, `shift[c]`, which is defined for all characters c in the alphabet Σ:
* If c does not occur in the pattern P, then `shift[c] = m`.
* Otherwise, `shift[c] = m - 1 - i`, where P[i] = c is the last occurrence of c among the first m - 1 characters of P.

The space complexity of the shift table is determined by the size of the alphabet, not the length of the pattern.
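Putting the preprocessing and search phases together, a C++ sketch of Horspool's algorithm might look like this (a direct translation of the pseudocode above; an explicit right-to-left comparison replaces the `same` helper):

```cpp
#include <string>
#include <vector>

// Horspool search: return the index of the first occurrence of
// needle in haystack, or -1 if there is none.
int horspool(const std::string& haystack, const std::string& needle) {
    int n = haystack.length(), m = needle.length();
    if (m == 0 || m > n) return -1;

    // Preprocessing: shift table over a 256-character alphabet.
    // The last pattern character keeps the default shift m.
    std::vector<int> shift(256, m);
    for (int i = 0; i < m - 1; i++) {
        shift[(unsigned char)needle[i]] = m - 1 - i;
    }

    int skip = 0;
    while (n - skip >= m) {
        // Compare right to left at the current alignment.
        int j = m - 1;
        while (j >= 0 && haystack[skip + j] == needle[j]) j--;
        if (j < 0) return skip;   // full match at this alignment
        // Shift by the table entry for the text character aligned
        // with the last pattern position.
        skip += shift[(unsigned char)haystack[skip + m - 1]];
    }
    return -1;
}
```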

Rabin Karp:
The Rabin-Karp algorithm is a pattern-matching algorithm that uses hashing to compare the pattern and the text. Here, hashing refers to the process of mapping a larger input value to a smaller output value, called the hash value. This helps avoid unnecessary comparisons, which optimizes the complexity of the algorithm. As a result, the Rabin-Karp algorithm has an average-case time complexity of O(n + m), where n is the length of the text and m is the length of the pattern; in the worst case (many hash collisions), it degrades to O(nm).

How does Rabin Karp Algorithm work?


The Rabin-Karp algorithm checks for the given pattern in a text by moving a window over the text one position at a time. Instead of comparing all characters at every position, it first computes a hash value of the pattern and compares it with the hash values of all substrings of the text that have the same length as the pattern.

If the hash values match, there is a possibility that the pattern and the substring are equal, and we verify this by comparing them character by character. If the hash values do not match, we can skip the substring and move on to the next one. In the next section, we will see how to calculate hash values.

Calculating hash value in Rabin Karp Algorithm


The steps to calculate hash values are as follows −

Step 1: Assign modulus and a base value


Suppose we have a text Txt = "DAACABCDBA" and a pattern Ptrn = "CAB". We will first assign numerical values (ranks) to the characters of the text based on their position: the leftmost character has rank 1 and the rightmost has rank 10. Also, use base b = 10 (the number of characters in the text) and modulus 11 for our hash function. Note that the modulus should be a prime number, as this gives a better distribution of hash values and helps reduce collisions.

Step 2: Calculate hash value of Pattern


The equation to calculate the hash value of the pattern is as follows −

hash(Ptrn) = Σ (r * b^(l-i-1)) mod 11

where r: rank of the character
l: length of the pattern
i: index of the character within the pattern

Therefore, the hash value of Ptrn is −

h(Ptrn) = ((4 * 10^2) + (5 * 10^1) + (6 * 10^0)) mod 11
        = 456 mod 11
        = 5

Step 3: Calculate hash value of first Text window


Start calculating the hash values of the text by sliding a window of the pattern's length over it. We start with the first substring as shown below −

h(DAA) = ((1 * 10^2) + (2 * 10^1) + (3 * 10^0)) mod 11
       = 123 mod 11
       = 2

Now, compare the hash values of the pattern and the substring. If they match, check whether the characters match as well; if they do, we have found our match. Otherwise, move to the next window.
In the above example, the hash values (5 and 2) did not match, so we move to the next character.

Step 4: Updating the hash value


Now, we need to remove the leading character from the window and bring in the next character. The hash value is updated in constant time with a rolling hash: subtract the contribution of the outgoing character, multiply by the base, and add the incoming character −

h(new) = ((h(old) - r_out * b^(l-1)) * b + r_in) mod 11

This repeats until a matching hash (followed by a verified character-by-character match) is found, or the text is exhausted.

Example
The following example demonstrates the working of the Rabin-Karp algorithm.

#include <iostream>
#include <string>
using namespace std;

// Function to implement the Rabin-Karp algorithm
void rabinKarp(const string &text, const string &pattern) {
    int n = text.length();     // length of the text
    int m = pattern.length();  // length of the pattern
    int p = 101;               // a prime number for hashing
    int q = 1000000007;        // a large prime number for the modulus

    // Compute the hash value for the pattern
    long long patternHash = 0;
    for (int i = 0; i < m; i++) {
        patternHash = (patternHash * p + pattern[i]) % q;
    }

    // Compute the hash value for the first window in the text (of length m)
    long long textHash = 0;
    for (int i = 0; i < m; i++) {
        textHash = (textHash * p + text[i]) % q;
    }

    // The value of p^(m-1) % q, used for removing the leading character
    long long p_m = 1;
    for (int i = 1; i < m; i++) {
        p_m = (p_m * p) % q;
    }

    // Sliding window: check each substring in the text
    for (int i = 0; i <= n - m; i++) {
        // If the hash values match, check the actual substring
        if (patternHash == textHash) {
            if (text.substr(i, m) == pattern) {
                cout << "Pattern found at index " << i << endl;
            }
        }

        // Calculate the hash for the next window in the text, if we're not at the end
        if (i < n - m) {
            textHash = (textHash - text[i] * p_m) % q;
            textHash = (textHash * p + text[i + m]) % q;
            if (textHash < 0) {
                textHash += q;
            }
        }
    }
}

int main() {
    string text = "abcabcabc";
    string pattern = "abc";

    // Call the Rabin-Karp function to search for the pattern in the text
    rabinKarp(text, pattern);

    return 0;
}
