String Matching
Sushma Prajapati
Assistant Professor
CO Dept
CKPCET,surat
Email:[email protected]
Outline
● Introduction
● The naive string matching algorithm
● The Rabin-Karp algorithm
● String Matching with finite automata
● The Knuth-Morris-Pratt algorithm.
Introduction
● A string matching algorithm is also called a "string searching algorithm".
● As with most algorithms, the main considerations for string searching are speed
and efficiency.
● The problem is to find all occurrences of a pattern P[1..m] within a text T[1..n].
● P occurs with shift s (that is, beginning at position s+1 in T) if:
○ P[1]=T[s+1], P[2]=T[s+2],…,P[m]=T[s+m].
● If so, s is called a valid shift; otherwise, it is an invalid shift.
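The definition of a valid shift can be checked directly. The following is a minimal sketch, assuming Python strings with 0-based indexing, so that T[s+1..s+m] in the slide notation corresponds to text[s:s+m]; the function name is_valid_shift is hypothetical.

```python
def is_valid_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if pattern occurs in text with shift s."""
    m = len(pattern)
    # Shifts outside 0..n-m cannot be valid because the window would not fit.
    if s < 0 or s > len(text) - m:
        return False
    # T[s+1..s+m] (1-based) is text[s:s+m] (0-based, end exclusive).
    return text[s:s + m] == pattern


# Collect every valid shift of a small example pattern.
text, pattern = "abcabaabcabac", "abaa"
valid = [s for s in range(len(text)) if is_valid_shift(text, pattern, s)]
print(valid)
```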
The naive string matching algorithm
Input: P and T, the pattern and text strings; m, the length of P; n, the length of T. The pattern
is assumed to be nonempty.
Output: The return value is the index in T where a copy of P begins, or -1 if no match for
P is found.
The naive string matching algorithm: Introduction
● Naïve pattern searching is the simplest of the pattern searching algorithms.
● It compares the pattern against the main string at every possible position. This algorithm
is suitable for small texts. It needs no preprocessing phase: a substring is found in a
single left-to-right pass over the possible shifts. It also needs no extra space to perform
the operation.
● The naive approach tests every possible placement of pattern P[1..m] relative
to text T[1..n]. We try shifts s = 0, 1, …, n-m in succession and, for each shift s,
compare T[s+1..s+m] to P[1..m]. It returns all the valid shifts found.
The naive string matching : Algorithm
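The procedure described in the introduction (try every shift s = 0, 1, …, n-m and compare the window with the pattern) can be sketched as follows. This is an illustrative sketch, not the slide's original listing: it assumes 0-based Python strings, the name naive_string_matcher is an assumption, and it returns all valid shifts rather than only the first index.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each placement of the pattern: s = 0, 1, ..., n - m.
    for s in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or a full match.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:
            shifts.append(s)
    return shifts


print(naive_string_matcher("xabxyabxyabxz", "abxyabxz"))
```

To obtain the single-index behaviour described in the Input/Output slide, a caller could take the first element of the result, or -1 when the list is empty.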
The naive string matching : Time Analysis
● The best case occurs when the first character of the pattern is not present in the text at all.
○ Example: T[] = "BBACCAADDEE"; P[] = "HBB";
○ The number of comparisons in the best case is O(n).
● The worst case occurs in the following scenarios.
○ When all characters of the text and pattern are the same.
■ T[] = "DDDDDDDDDDDD" ; P[]="DDDDD"
○ When only the last character of the pattern is different.
■ T[] = "VVVVVVVVVVVVK" ; P[]="VVVK"
○ The number of comparisons in the worst case is O(m*(n-m+1)).
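The best-case and worst-case counts can be checked empirically. The sketch below instruments the naive matcher to count character comparisons; count_comparisons is a hypothetical helper written for this illustration, and it stops each window at the first mismatch.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: move on to the next shift
    return comparisons


# Best case from the slide: every shift fails on its first comparison.
best = count_comparisons("BBACCAADDEE", "HBB")

# Worst cases from the slide: every shift needs all m comparisons.
worst_same = count_comparisons("DDDDDDDDDDDD", "DDDDD")
worst_last = count_comparisons("VVVVVVVVVVVVK", "VVVK")
print(best, worst_same, worst_last)
```

In the best-case example each of the n-m+1 shifts costs one comparison, while in both worst-case examples each shift costs m comparisons, which gives m*(n-m+1) in total.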
Problem with naive string matching algorithm
● The naive string matcher is inefficient because information gained about the text
for one value of s is entirely ignored in considering other values of s.
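● Example: T = xabxyabxyabxz, P = abxyabxz
    T:  xabxyabxyabxz
        abxyabxz         (s = 0: mismatch at the first character)
        X
         abxyabxz        (s = 1: seven characters match, then a mismatch)
         vvvvvvvX
          abxyabxz       (s = 2: the comparison starts over)
● Whenever a character mismatch occurs after several characters have matched, the
comparison restarts in T at the character that follows the start of the previous attempt.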
Better Approach for string matching
● Do some preprocessing based on either the pattern or the text.
● Some string matching algorithms based on this idea are:
○ The Rabin-Karp Algorithm
○ String Matching with finite automata
○ The Knuth-Morris-Pratt algorithm.
Rabin – Karp Algorithm
● The Rabin-Karp string searching algorithm calculates a hash value for the pattern
and for each m-character substring of the text to be compared.
● If the hash values are unequal, the algorithm moves on and calculates the hash value
for the next m-character substring.
● If the hash values are equal, the algorithm does a brute-force comparison
between the pattern and the m-character substring.
● In this way, there is only one hash comparison per text substring, and the brute-force
comparison is needed only when the hash values match.
Notation used in algorithm
● Let Σ = {0,1,2, . . .,9}.
● We can view a string of k consecutive characters as representing a length-k decimal
number.
● Let p denote the decimal number (hash code) for P[1..m].
● Let t_s denote the decimal value (hash code) of the length-m substring T[s+1..s+m]
of T[1..n], for s = 0, 1, . . ., n-m.
● t_s = p if and only if T[s+1..s+m] = P[1..m],
○ that is, if and only if s is a valid shift.
● By Horner's rule, p = P[m] + 10(P[m-1] + 10(P[m-2] + . . . + 10(P[2] + 10 P[1]) . . . )).
We can compute p in O(m) time.
● Similarly, we can compute t_0 from T[1..m] in O(m) time.
Notation used in algorithm(Contd…)
● t_{s+1} can be computed from t_s in constant time:
t_{s+1} = 10(t_s - 10^(m-1) T[s+1]) + T[s+m+1]
● Example: T = 314152
t_s = 31415, s = 0, m = 5, T[s+1] = 3 and T[s+m+1] = 2,
so t_{s+1} = 10(31415 - 10000*3) + 2 = 14152.
● Thus p and t_0, t_1, . . ., t_{n-m} can all be computed in O(n+m) time.
● And all occurrences of the pattern P[1..m] in the text T[1..n] can be found in time
O(n+m).
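The constant-time update can be seen in a short sketch. It assumes the text is a string of decimal digits, as in the notation above, and that m does not exceed the text length; the function name decimal_window_values is hypothetical. No modulus is applied yet, so the values are the exact decimal numbers.

```python
def decimal_window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for a string of decimal digits."""
    digits = [int(c) for c in text]
    n = len(digits)
    # Compute t_0 from the first m digits with Horner's rule, in O(m) time.
    t = 0
    for i in range(m):
        t = 10 * t + digits[i]
    values = [t]
    high = 10 ** (m - 1)  # weight of the leading digit of a window
    # Each later value drops the leading digit and appends the next one.
    for s in range(n - m):
        t = 10 * (t - high * digits[s]) + digits[s + m]
        values.append(t)
    return values


print(decimal_window_values("314152", 5))  # [31415, 14152]
```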
Notation used in algorithm(Contd…)
● However, p and t_s may be too large to work with conveniently.
Is there a simple solution?
● Perform all calculations modulo a selected value q.
● For a d-ary alphabet, select q to be a large prime such that dq fits into one computer
word.
● The recurrence then becomes t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) mod q,
where h = d^(m-1) (mod q) is the value of the digit "1" in the high-order position of an
m-digit text window.
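Putting the pieces together, the modular rolling hash and the brute-force check on equal hash values can be sketched as below. This is an illustrative sketch rather than the slide's own listing: it assumes a text and pattern made of decimal digits (d = 10), uses the small prime q = 13 purely as an example value, and the name rabin_karp_matcher is an assumption.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 13) -> list[int]:
    """Return all valid shifts of a digit pattern in a digit text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # value of digit "1" in the high-order position, mod q
    p = t = 0
    # Hash the pattern and the first text window with Horner's rule, mod q.
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes may still be a false match, so verify character by character.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Roll the window: drop text[s], append text[s + m].
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415"))
```

Because values are reduced modulo q, two different windows can share a hash value, which is why the explicit comparison is kept. Python's % operator returns a nonnegative result here, so no extra correction is needed after the subtraction.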
Rabin – Karp : Algorithm
RABIN-KARP-MATCHER(T,P,d,q)