Design & Analysis of algorithm- 6

The document provides an overview of various algorithms for pattern matching, including the Naïve Matching, KMP (Knuth Morris Pratt), and Rabin-Karp algorithms. It details the working principles, time complexities, and preprocessing techniques used in these algorithms, particularly focusing on the efficiency improvements offered by KMP through the use of the lps array. Additionally, it includes examples and illustrations to clarify the implementation and functioning of these algorithms.
Parul University

Session: July-Dec. 2024

05/28/2025 Department of Computer Science Slide No. 1


Design & Analysis of Algorithm


Unit-7
● Naïve Matching
● Rabin-Karp Algorithm
● KMP Algorithm
● Pattern Matching using finite automata


Naive algorithm for Pattern Searching
● Naive Algorithm:
● i) It is the simplest method and uses a brute-force approach.
● ii) It is a straightforward way of solving the problem.
● iii) It compares the first character of the pattern with the current character of the text.
If they match, the pointers in both strings are advanced. If they do not match, the pattern
pointer is reset to the start of the pattern and the text pointer moves to the next
starting position. This process is repeated until the end of the text.
● iv) It does not require any pre-processing. It directly starts comparing both strings
character by character.
● v) Worst-case time complexity = O(m * (n-m+1)), where n is the length of the text and
m is the length of the pattern.


Algorithm for Naive string matching:
● Algorithm NAIVE_STRING_MATCHING(T, P)
● for i ← 0 to n-m do
●     if P[1...m] == T[i+1...i+m] then
●         print "Match Found"
●     end
● end
Given a text of length N, txt[0..N-1], and a pattern of length M, pat[0..M-1], write a
function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You
may assume that N > M.
Example:
Input: txt[] = “THIS IS A TEST TEXT”, pat[] = “TEST”
Output: Pattern found at index 10
Input: txt[] = “AABAACAADAABAABA”, pat[] = “AABA”
Output: Pattern found at index 0, Pattern found at index 9, Pattern found at index 12
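The naive search can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `naive_search` and the choice to return a list of indices (rather than print them) are assumptions made here.

```python
def naive_search(pat: str, txt: str) -> list[int]:
    """Return every index at which pat occurs in txt (brute force)."""
    n, m = len(txt), len(pat)
    matches = []
    # Try every window start i = 0 .. n-m.
    for i in range(n - m + 1):
        j = 0
        # Compare the window with the pattern character by character.
        while j < m and txt[i + j] == pat[j]:
            j += 1
        if j == m:  # all m characters matched
            matches.append(i)
    return matches


print(naive_search("TEST", "THIS IS A TEST TEXT"))  # [10]
print(naive_search("AABA", "AABAACAADAABAABA"))     # [0, 9, 12]
```

The two calls reproduce the inputs and outputs of the example on this slide.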
KMP (Knuth Morris Pratt) Pattern Searching
● The Naive pattern-searching algorithm performs poorly when many matching characters
are followed by a mismatching character.
● Examples:
● 1) txt[] = “AAAAAAAAAAAAAAAAAB”, pat[] = “AAAAB”
● 2) txt[] = “ABABABCABABABCABABABC”, pat[] = “ABABAC” (not a worst case, but a bad
case for Naive)
● The KMP matching algorithm uses the degenerating property of the pattern (the same
sub-patterns appearing more than once in the pattern) and improves the
worst-case complexity to O(n+m).
● The basic idea behind KMP’s algorithm is: whenever we detect a mismatch (after some
matches), we already know some of the characters in the text of the next window. We
take advantage of this information to avoid re-matching characters that we know will
match anyway.


● Matching Overview
● txt = “AAAAABAAABA”
● pat = “AAAA”
● We compare the first window of txt with pat:
● txt = “AAAAABAAABA”
● pat = “AAAA” [Initial position]
● We find a match. This is the same as Naive String Matching.
● In the next step, we compare the next window of txt with pat:
● txt = “AAAAABAAABA”
● pat = “AAAA” [Pattern shifted one position]
● This is where KMP optimizes over Naive. In this second window, we only compare the
fourth A of the pattern with the fourth character of the current window of text to decide
whether the current window matches or not. Since we know the first three characters will
match anyway, we skip matching them.
Need of Preprocessing?
● An important question arises from the above explanation: how do we know how many
characters can be skipped? To know this, we pre-process the pattern and prepare an
integer array lps[] that tells us the count of characters to be skipped.
Pre-processing Overview:
● The KMP algorithm preprocesses pat[] and constructs an auxiliary array lps[] of size m
(the same as the size of the pattern), which is used to skip characters while matching.
● The name lps stands for the longest proper prefix which is also a suffix. A proper prefix
is a prefix that is not the whole string. For example, the prefixes of “ABC” are “”, “A”, “AB”
and “ABC”. The proper prefixes are “”, “A” and “AB”. The suffixes of the string are “”, “C”,
“BC”, and “ABC”.
● We search for lps in sub-patterns. More precisely, we focus on substrings of the pattern
that are both a prefix and a suffix.
● For each sub-pattern pat[0..i] where i = 0 to m-1, lps[i] stores the length of the longest
proper prefix of pat[0..i] which is also a suffix of pat[0..i].
● lps[i] = length of the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i].
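To make the definition concrete, the sketch below computes lps[] directly from the definition by trying every candidate length. It is deliberately slow and only meant for checking small examples by hand; the function name `lps_by_definition` is an assumption made here and is not the efficient construction KMP actually uses.

```python
def lps_by_definition(pat: str) -> list[int]:
    """lps[i] = length of the longest proper prefix of pat[0..i] that is also its suffix."""
    lps = []
    for i in range(len(pat)):
        sub = pat[: i + 1]
        best = 0
        # A proper prefix is shorter than sub, so k stops at len(sub) - 1.
        for k in range(1, len(sub)):
            if sub[:k] == sub[-k:]:
                best = k
        lps.append(best)
    return lps


print(lps_by_definition("AAAA"))   # [0, 1, 2, 3]
print(lps_by_definition("ABCDE"))  # [0, 0, 0, 0, 0]
```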


● Note: lps[i] could also be defined as the longest prefix which is also a proper suffix.
The word “proper” is needed in one of the two places to make sure that the whole
substring is not considered.
● Examples of lps[] construction:
● For the pattern “AAAA”, lps[] is [0, 1, 2, 3]
● For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0]
● For the pattern “AABAACAABAA”, lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
● For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
● For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3]
● Preprocessing Algorithm:
● In the preprocessing part, we calculate the values in lps[]. To do that, we keep track of
the length of the longest prefix suffix value for the previous index (we use the len
variable for this purpose).
● We initialize lps[0] and len as 0.
● If pat[len] and pat[i] match, we increment len by 1 and assign the incremented value
to lps[i].
● If pat[i] and pat[len] do not match and len is not 0, we update len to lps[len-1].
● If pat[i] and pat[len] do not match and len is 0, we set lps[i] = 0 and move to the next i.
● See computeLPSArray() in the below code for details
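The preprocessing rules can be sketched in Python as below. This is an illustrative version under stated assumptions, not the slides' original code: the name `compute_lps_array` mirrors the `computeLPSArray()` mentioned on the slide, and the variable `length` stands in for the slide's `len` to avoid shadowing Python's built-in.

```python
def compute_lps_array(pat: str) -> list[int]:
    """Build lps[] for pat in O(m) time using the len/i rules from the slide."""
    m = len(pat)
    lps = [0] * m   # lps[0] is always 0
    length = 0      # `len` in the slides: longest prefix-suffix for the previous index
    i = 1
    while i < m:
        if pat[i] == pat[length]:
            # Match: extend the current prefix-suffix by one character.
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            # Mismatch with len > 0: fall back to a shorter prefix-suffix, keep i.
            length = lps[length - 1]
        else:
            # Mismatch with len == 0: no prefix-suffix ends at i.
            lps[i] = 0
            i += 1
    return lps


print(compute_lps_array("AABAACAABAA"))  # [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
print(compute_lps_array("AAACAAAAAC"))   # [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
```

Each index i is advanced at most m times and `length` can only fall back as often as it was incremented, which is why the construction is linear in the pattern length.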


Illustration of preprocessing (or construction of lps[]):
● pat[] = “AAACAAAA”
=> len = 0, i = 0:
● lps[0] is always 0, we move to i = 1
=> len = 0, i = 1:
● Since pat[len] and pat[i] match, do len++,
● store it in lps[i] and do i++.
● Set len = 1, lps[1] = 1, i = 2
=> len = 1, i = 2:
● Since pat[len] and pat[i] match, do len++,
● store it in lps[i] and do i++.
● Set len = 2, lps[2] = 2, i = 3
=> len = 2, i = 3:
● Since pat[len] and pat[i] do not match, and len > 0,
● Set len = lps[len-1] = lps[1] = 1
=> len = 1, i = 3:
● Since pat[len] and pat[i] do not match and len > 0,
● len = lps[len-1] = lps[0] = 0
=> len = 0, i = 3:
● Since pat[len] and pat[i] do not match and len = 0,
● Set lps[3] = 0 and i = 4
=> len = 0, i = 4:
● Since pat[len] and pat[i] match, do len++,
● Store it in lps[i] and do i++.
● Set len = 1, lps[4] = 1, i = 5
=> len = 1, i = 5:
● Since pat[len] and pat[i] match, do len++,
● Store it in lps[i] and do i++.
● Set len = 2, lps[5] = 2, i = 6
=> len = 2, i = 6:
● Since pat[len] and pat[i] match, do len++,
● Store it in lps[i] and do i++.
● len = 3, lps[6] = 3, i = 7
=> len = 3, i = 7:
● Since pat[len] and pat[i] do not match and len > 0,
● Set len = lps[len-1] = lps[2] = 2
=> len = 2, i = 7:
● Since pat[len] and pat[i] match, do len++,
● Store it in lps[i] and do i++.
● len = 3, lps[7] = 3, i = 8
● We stop here as we have constructed the whole lps[].
● For example, in a string:
○ abab

○ The proper prefixes of the above string are: a, ab, and aba. A proper prefix excludes the
last character of the string.

○ The proper suffixes of the above string are: b, ab, and bab. A proper suffix excludes the
first character of the string.

○ Consider an example to make an LPP table.

○ String = ababc

○ And the LPP table is shown below:

○ Index value: 1 2 3 4 5
○ Character:   a b a b c
○ LPP value:   0 0 1 2 0


Implementation of KMP algorithm:
Unlike the Naive algorithm, where we slide the pattern by one and compare all characters at
each shift, we use a value from lps[] to decide the next characters to be matched. The idea is
to not match a character that we know will match anyway.
How to use lps[] to decide the next positions (or to know the number of characters to be
skipped)?
● We start the comparison of pat[j] with j = 0 with characters of the current window of text.
● We keep matching characters txt[i] and pat[j] and keep incrementing i and j while pat[j]
and txt[i] keep matching.
● When we see a mismatch
○ We know that characters pat[0..j-1] match with txt[i-j…i-1] (note that j starts at 0 and
is incremented only when there is a match).

○ We also know (from the above definition) that lps[j-1] is the length of the longest proper
prefix of pat[0…j-1] that is also a suffix of pat[0…j-1].

○ From these two points, the first lps[j-1] characters of the pattern already match the text
just before txt[i]. So if j > 0 we set j = lps[j-1] and keep i unchanged; if j = 0 we simply
increment i.
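The matching phase described above can be sketched as follows. This is a minimal, self-contained illustration rather than the slides' own implementation: the function name `kmp_search`, the list return value, and the compact inline construction of lps[] are all assumptions made here.

```python
def kmp_search(pat: str, txt: str) -> list[int]:
    """Return all indices where pat occurs in txt, using the lps[] array."""
    m, n = len(pat), len(txt)
    if m == 0:
        return []

    # Preprocessing: lps[i] = longest proper prefix of pat[0..i] that is also a suffix.
    lps = [0] * m
    length = 0
    for i in range(1, m):
        while length > 0 and pat[i] != pat[length]:
            length = lps[length - 1]
        if pat[i] == pat[length]:
            length += 1
        lps[i] = length

    # Matching: i walks the text and never moves backwards; j walks the pattern.
    matches = []
    i = j = 0
    while i < n:
        if txt[i] == pat[j]:
            i += 1
            j += 1
            if j == m:                 # full match ending at i-1
                matches.append(i - j)
                j = lps[j - 1]         # keep the longest border and continue
        elif j > 0:
            j = lps[j - 1]             # mismatch after some matches: skip via lps
        else:
            i += 1                     # mismatch at pat[0]: move to next text char
    return matches


print(kmp_search("AAAA", "AAAAABAAABA"))       # [0, 1]
print(kmp_search("AABA", "AABAACAADAABAABA"))  # [0, 9, 12]
```

Because i never decreases, the matching loop does O(n) work on top of the O(m) preprocessing.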


● LPP values
● For the first “a”, there is no non-empty proper prefix, so its value is 0.
● For “ab”, no proper prefix is also a suffix (“a” is not equal to “b”), so its value is 0.
● For “aba”, the prefix “a” is also a suffix, so its value will be 1.
● For “abab”, the prefix “ab” is also a suffix, so its value will be 2.
● The character “c” comes for the first time in the string, so no prefix can also be a suffix
ending at “c”, and its value will be 0.
● This is how you can make an LPP table: for each position, record the length of the
longest proper prefix that is also a suffix of the string up to that position.


Working of the Knuth Morris Pratt Algorithm
Example: Given a string 'T' and pattern 'P' as follows:

Let us execute the KMP Algorithm to find whether 'P' occurs in 'T.'
For 'P', the prefix function π was computed previously and is as follows:


● Initially: n = size of T = 15
● m = size of P = 7

Example: 2


● Time Complexity: O(N+M) where N is the length of the text
and M is the length of the pattern to be found.
● Auxiliary Space: O(M)


Rabin-Karp Algorithm
The Rabin-Karp algorithm searches for a pattern in a text using a hash function. Unlike the
Naive string matching algorithm, it does not compare every character of every window in
the initial phase. Instead, it uses hash values to filter out the windows that cannot match
and performs character comparison only on the remaining windows.
A hash function is a tool to map a larger input value to a smaller output value. This output
value is called the hash value.
How Rabin-Karp Algorithm Works?
A sequence of characters (a window of the text) is taken and checked for the possibility of
containing the required string. If the possibility is found, character matching is performed.


Example
1. Let the text be:

And the string to be searched in the above text be: CDD


2. Let us assign a numerical value (v)/weight to the characters we will be using in the problem. Here,
we have taken the first ten letters of the alphabet only (i.e. A to J), with A = 1, B = 2, ..., J = 10.

3. Let n be the length of the text and m be the length of the pattern. Here, n = 10 and m = 3.
Let d be the number of characters in the input set. Here, we have taken the input set {A, B, C, ..., J}. So, d =
10. You can assume any suitable value for d.
4. Let us calculate the hash value of the pattern, where $v_i$ is the value of its i-th character.

hash value for pattern (p) = $\left(\sum_{i=1}^{m} v_i \cdot d^{m-i}\right) \bmod 13$
= $(3 \cdot 10^2 + 4 \cdot 10^1 + 4 \cdot 10^0) \bmod 13$
= $344 \bmod 13$
= $6$

In the calculation above, the modulus is a prime number (here, 13) chosen so that all the
calculations can be performed with single-precision arithmetic.
5. Calculate the hash value for the text window of size m.
For the first window ABC,
hash value for text (t) = $\left(\sum_{i=1}^{m} v_i \cdot d^{m-i}\right) \bmod 13$
= $(1 \cdot 10^2 + 2 \cdot 10^1 + 3 \cdot 10^0) \bmod 13$
= $123 \bmod 13$
= $6$
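The two hash calculations above can be checked with a short sketch. The function name `poly_hash` and the mapping A = 1, ..., J = 10 via `ord` are assumptions made for illustration; d = 10 and the modulus 13 are the values used on the slide.

```python
def poly_hash(s: str, d: int = 10, q: int = 13) -> int:
    """Hash s as a base-d number reduced mod q, with A=1, B=2, ..., J=10."""
    h = 0
    for ch in s:
        v = ord(ch) - ord("A") + 1  # character weight (assumed mapping)
        h = (d * h + v) % q         # Horner's rule keeps intermediate values small
    return h


print(poly_hash("CDD"))  # 6  (344 mod 13)
print(poly_hash("ABC"))  # 6  (123 mod 13)
```

Reducing mod q after every step gives the same result as reducing once at the end, but avoids large intermediate numbers.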


6. Compare the hash value of the pattern with the hash value of the text window. If they match,
character matching is performed.
7. In the above example, the hash value of the first window (i.e. t) matches p, so we go for character
matching between ABC and CDD. Since they do not match, we go for the next window.
We calculate the hash value of the next window (BCC) by removing the first term and adding the next term
as shown below.

$t = \left((1 \cdot 10^2 + 2 \cdot 10^1 + 3 \cdot 10^0 - 1 \cdot 10^2) \cdot 10 + 3 \cdot 10^0\right) \bmod 13$

$= 233 \bmod 13$

$= 12$
In order to optimize this process, we make use of the previous hash value in the following way.
$t = \left(d \cdot (t - v[\text{character to be removed}] \cdot h) + v[\text{character to be added}]\right) \bmod 13$
$= (10 \cdot (6 - 1 \cdot 9) + 3) \bmod 13$
$= 12$
where $h = d^{m-1} = 10^{3-1} = 100$, and $100 \bmod 13 = 9$ is the value used in the calculation above.


For BCC, t = 12 (≠ 6). Therefore, go for the next window.
After a few more windows, we will get the match for the window CDD in the text.


Algorithm
● n = t.length
● m = p.length
● h = d^(m-1) mod q
● p = 0
● t0 = 0
● for i = 1 to m
●     p = (d p + p[i]) mod q
●     t0 = (d t0 + t[i]) mod q
● for s = 0 to n - m
●     if p = ts
●         if p[1.....m] = t[s + 1..... s + m]
●             print "pattern found at position" s
●     if s < n - m
●         ts+1 = (d (ts - t[s + 1] h) + t[s + m + 1]) mod q
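The pseudocode above can be turned into a runnable sketch as follows. Several things here are assumptions for illustration only: the function name `rabin_karp`, 0-based indexing, the A = 1, ..., J = 10 weights, and in particular the sample text "ABCCDDAEFG", which is a hypothetical 10-character text chosen to be consistent with the windows ABC and BCC in the worked example.

```python
def rabin_karp(pat: str, txt: str, d: int = 10, q: int = 13) -> list[int]:
    """Return all shifts s at which pat occurs in txt (0-based)."""
    n, m = len(txt), len(pat)
    if m == 0 or m > n:
        return []

    def val(ch: str) -> int:
        return ord(ch) - ord("A") + 1  # assumed weights: A=1 ... J=10

    h = pow(d, m - 1, q)               # d^(m-1) mod q
    p = t = 0
    for i in range(m):                 # hashes of the pattern and the first window
        p = (d * p + val(pat[i])) % q
        t = (d * t + val(txt[i])) % q

    matches = []
    for s in range(n - m + 1):
        # Compare characters only when the hashes agree (rules out spurious hits).
        if p == t and txt[s:s + m] == pat:
            matches.append(s)
        if s < n - m:                  # roll the hash to the next window
            t = (d * (t - val(txt[s]) * h) + val(txt[s + m])) % q
    return matches


print(rabin_karp("CDD", "ABCCDDAEFG"))  # [3]
```

In this run the first window ABC has the same hash as CDD (both 6) but fails the character comparison, so it is not reported.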
Limitations of Rabin-Karp Algorithm
Spurious Hit
When the hash value of the pattern matches the hash value of a window of the text but
the window is not the actual pattern, it is called a spurious hit.
Spurious hits increase the running time of the algorithm, because each one triggers a
character-by-character comparison that fails. To keep spurious hits rare, the hash is
computed modulo a suitably chosen prime number.

Rabin-Karp Algorithm Complexity


The average-case and best-case complexity of the Rabin-Karp algorithm is O(m + n) and the
worst-case complexity is O(mn).

The worst case occurs when spurious hits occur for all the windows.
