String Matching

Sushma Prajapati
Assistant Professor
CO Dept
CKPCET, Surat
Email:[email protected]
Outline
● Introduction
● The naive string matching algorithm
● The Rabin-Karp algorithm
● String Matching with finite automata
● The Knuth-Morris-Pratt algorithm.
Introduction
● A string matching algorithm is also called a "string searching algorithm".
● As with most algorithms, the main considerations for string searching are speed
and efficiency.
● The problem is to find all occurrences of a pattern P[1..m] within a text T[1..n].
● P occurs with shift s (beginning at position s+1 of T) if:
○ P[1]=T[s+1], P[2]=T[s+2], …, P[m]=T[s+m].
● If so, s is called a valid shift; otherwise, s is an invalid shift.
The naive string matching algorithm
Input: P and T, the pattern and text strings; m, the length of P; n, the length of T. The
pattern is assumed to be nonempty.

Output: The return value is the index in T where a copy of P begins, or -1 if no match for
P is found.
The naive string matching algorithm: Introduction
● Naïve pattern searching is the simplest of the pattern searching algorithms.
● It compares the pattern against the text at every possible position. This algorithm
is useful for small texts. It needs no preprocessing phase, it finds the occurrences
in a single pass over the possible shifts, and it needs no extra space to perform the
operation.
● The naive approach tests all possible placements of the pattern P[1..m] relative
to the text T[1..n]. It tries each shift s = 0, 1, …, n-m in turn and, for each shift s,
compares T[s+1..s+m] to P[1..m]. It reports all the valid shifts found.
The naive string matching: Algorithm
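The procedure described above can be sketched in Python as follows. This is a minimal illustration, not the original pseudocode: the function name naive_string_matcher is an assumption, and shifts are 0-based, so shift s here means the pattern starts at text[s].

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    valid_shifts = []
    for s in range(n - m + 1):              # try s = 0, 1, ..., n - m
        j = 0
        # compare the pattern with the window text[s .. s+m-1], left to right
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:                          # all m characters matched
            valid_shifts.append(s)
    return valid_shifts


print(naive_string_matcher("xabxyabxyabxz", "abxyabxz"))  # [5]
```

To get the "first index or -1" behaviour stated in the Output description, return s at the first full match and -1 after the loop.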
The naive string matching: Time Analysis
● The best case occurs when the first character of the pattern does not appear in the text at all.
○ Example: T[] = "BBACCAADDEE"; P[] = "HBB";
○ Every shift is rejected after one comparison, so the number of comparisons in the best case is n-m+1, which is O(n).
● The worst case occurs in the following scenarios.
○ When all characters of the text and pattern are the same.
■ T[] = "DDDDDDDDDDDD" ; P[]="DDDDD"
○ When only the last character of the pattern is different.
■ T[] = "VVVVVVVVVVVVK" ; P[]="VVVK"
○ The number of comparisons in the worst case is O(m*(n-m+1)).
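To make these counts concrete, the following sketch counts character comparisons for the three example inputs above. The helper count_comparisons is a hypothetical name introduced only for this illustration.

```python
def count_comparisons(text, pattern):
    """Count the character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                        # mismatch: move on to the next shift
    return comparisons


examples = [
    ("BBACCAADDEE", "HBB"),                  # best case
    ("DDDDDDDDDDDD", "DDDDD"),               # worst case: all characters equal
    ("VVVVVVVVVVVVK", "VVVK"),               # worst case: only the last differs
]
for text, pattern in examples:
    n, m = len(text), len(pattern)
    print(pattern, count_comparisons(text, pattern),
          "shifts:", n - m + 1, "bound m*(n-m+1):", m * (n - m + 1))
```

In the best case the count equals the number of shifts, n-m+1; in both worst-case examples it reaches the bound m*(n-m+1).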
Problem with naive string matching algorithm
● The naive string matcher is inefficient because information gained about the text
for one value of s is entirely ignored in considering other values of s.
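● Example: T = xabxyabxyabxz, P = abxyabxz

xabxyabxyabxz
abxyabxz          shift 0: mismatch (X) at the first character
 abxyabxz         shift 1: seven characters match (v), then a mismatch (X)
  abxyabxz        shift 2: comparison starts again from the first pattern character

● Whenever a mismatch occurs after several characters have matched, the naive
algorithm goes back in T to the character that follows the start of the previous
attempt and begins comparing again.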
Better Approach for string matching
● Do some preprocessing on either the pattern or the text.
● Some string matching algorithms based on this idea are:
○ The Rabin-Karp Algorithm
○ String Matching with finite automata
○ The Knuth-Morris-Pratt algorithm.
Rabin – Karp Algorithm
● The Rabin-Karp string searching algorithm calculates a hash value for the pattern
and for each m-character substring of the text to be compared.
● If the hash values are unequal, the algorithm moves on and calculates the hash value
for the next m-character substring.
● If the hash values are equal, the algorithm does a brute-force comparison
between the pattern and the m-character substring.
● In this way, there is only one hash comparison per text substring, and the brute-force
comparison is needed only when the hash values match.
Notation used in algorithm
● Let Σ = {0, 1, 2, ..., 9}.
● We can view a string of k consecutive characters as representing a length-k decimal
number.
● Let p denote the decimal value (hash code) of P[1..m].
● Let t_s denote the decimal value (hash code) of the length-m substring T[s+1..s+m]
of T[1..n], for s = 0, 1, ..., n-m.
● t_s = p if and only if
○ T[s+1..s+m] = P[1..m], that is, s is a valid shift.
● By Horner's rule, p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10(P[2] + 10 P[1]) ...)),
so we can compute p in O(m) time.
● Similarly, we can compute t_0 from T[1..m] in O(m) time.
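The nested expression for p can be evaluated left to right with one multiplication and one addition per digit. A minimal sketch, assuming the pattern is given as a string of decimal digits; decimal_value is a hypothetical helper name.

```python
def decimal_value(digits):
    """Evaluate a digit string as a decimal number using Horner's rule."""
    value = 0
    for ch in digits:                 # P[1], P[2], ..., P[m] in order
        value = 10 * value + int(ch)  # shift left one decimal place, add the digit
    return value


p = decimal_value("31415")
print(p)  # 31415
```

The same function applied to the first m characters of the text gives t_0.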
Notation used in algorithm (Contd…)
● t_{s+1} can be computed from t_s in constant time:

t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]

● Example: T = 314152
t_s = 31415, s = 0, m = 5, T[s+1] = 3 and T[s+m+1] = 2

t_{s+1} = 10(31415 - 10000*3) + 2 = 14152

● Thus p and t_0, t_1, ..., t_{n-m} can all be computed in O(n+m) time.
● Hence all occurrences of the pattern P[1..m] in the text T[1..n] can be found in
O(n+m) time, assuming each arithmetic operation on these values takes constant time.
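The constant-time update can be checked with a short sketch. It assumes a text of decimal digits with n >= m >= 1 and uses 0-based indexing, so T[s+1] in the slides is text[s] here; window_values is a hypothetical helper name.

```python
def window_values(text, m):
    """Return t_0, t_1, ..., t_{n-m} for a string of decimal digits."""
    n = len(text)
    high = 10 ** (m - 1)              # value of a "1" in the high-order position
    t = int(text[:m])                 # t_0, the value of the first window
    values = [t]
    for s in range(n - m):
        # drop the leading digit text[s], shift left, append the digit text[s + m]
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```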
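Notation used in algorithm (Contd…)
● However, p and t_s may be too large to work with conveniently.
Is there a simple solution?
● Carry out all calculations modulo a selected value q.
● For a d-ary alphabet, select q to be a large prime such that d·q fits into one computer
word.

● The recurrence equation can be rewritten as

t_{s+1} = (d(t_s - T[s+1] h) + T[s+m+1]) mod q

where h = d^{m-1} (mod q) is the value of the digit "1" in the high-order position of an
m-digit text window.
● Working modulo q, equal hash values no longer guarantee a match. A shift where the
hash values agree but the strings differ is a spurious hit and must be ruled out by
comparing the characters.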
Rabin – Karp : Algorithm
RABIN-KARP-MATCHER(T,P,d,q)

Input: text T, pattern P, radix d, and the prime q.
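A minimal Python sketch of a matcher with this signature, following the description above rather than reproducing the original pseudocode. Shifts are 0-based. The defaults d = 256 and q = 1_000_003, and the use of ord() as the numeric value of a character, are assumptions made for illustration.

```python
def rabin_karp_matcher(text, pattern, d=256, q=1_000_003):
    """Return all valid 0-based shifts of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                  # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                    # hash of the pattern and of the first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # equal hashes may be a spurious hit, so verify character by character
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                     # roll the hash to the next window
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("AAAAAAA", "AAA"))  # [0, 1, 2, 3, 4]
```

Python's % operator returns a non-negative result for a positive modulus, so no extra correction is needed when the subtraction goes negative; in C or Java the value would have to be adjusted by adding q.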

Rabin – Karp Algorithm : Example
Rabin - Karp : Time Analysis
● The average-case and best-case running time is O(n+m).
● The worst-case running time is O(nm).
● The worst case of the Rabin-Karp algorithm occurs when all characters of the pattern
and text are the same, because the hash values of all the substrings of txt[] then match
the hash value of pat[] and every shift needs a full comparison. For example,
pat[] = "AAA" and txt[] = "AAAAAAA".
Rabin - Karp : Problem to Solve
Working modulo q=11, how many spurious hits does the Rabin-Karp matcher
encounter in the text T=3141592653589793 when looking for the pattern P=26?
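One way to check a hand-worked answer is to separate hash hits into valid and spurious ones. The sketch below assumes a text of decimal digits with radix d = 10; count_hits is a hypothetical helper name, and shifts are 0-based.

```python
def count_hits(text, pattern, d=10, q=11):
    """Return (valid_shifts, spurious_shifts) for digit strings, working modulo q."""
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):
        if p == t:                        # hash hit: real match or spurious hit?
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


valid, spurious = count_hits("3141592653589793", "26")
print("valid shifts:", valid, "spurious hits:", len(spurious))
```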
String Matching with Finite Automata
● In this algorithm we preprocess the pattern and build a 2D array that represents a
finite automaton (FA).
● Constructing the FA is the tricky part of this algorithm.
● Once the FA is built, searching is simple: we start from the first state of the
automaton and the first character of the text.
● At every step, we read the next character of the text, look up the next state in the
FA, and move to that state.
● Whenever we reach the final state, an occurrence of the pattern has been found in the text.
● The matching takes O(n) time since each text character is examined exactly once.
Finite Automata
● A finite automaton (FA) is a simple idealized machine used to recognize patterns
within input taken from some character set (or alphabet). The job of an FA is to
accept or reject an input depending on whether the pattern defined by the FA
occurs in the input.
● A finite automaton is a 5-tuple (Q, Σ, δ, q0, F), where:
○ Q: the finite set of states
○ Σ: the finite input alphabet
○ δ: the transition function, δ: Q × Σ → Q
○ q0: the start state
○ F: the set of final (accepting) states
Finite Automata : Algorithm
● Once we have constructed a finite automaton for the pattern, searching a text
t1, t2, ..., tn for the pattern is straightforward.
● Search time is O(n). Each character in the text is examined just once, in
sequential order.
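A minimal Python sketch of both phases, under stated assumptions: state q means "the last q characters read equal the first q pattern characters", the table is a list of dictionaries rather than a 2D array, and the construction is a deliberately simple one, not an optimized one. The function names are hypothetical, and shifts are 0-based.

```python
def compute_transition_function(pattern, alphabet):
    """delta[q][a] = length of the longest pattern prefix that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]       # states 0..m, state m is accepting
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)
            while k > 0 and not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[q][a] = k
    return delta


def finite_automaton_matcher(text, pattern, alphabet=None):
    """Return all valid 0-based shifts, reading each text character exactly once."""
    if alphabet is None:                     # assumption: alphabet = characters seen
        alphabet = set(text) | set(pattern)
    delta = compute_transition_function(pattern, alphabet)
    m = len(pattern)
    q = 0                                    # start state q0
    shifts = []
    for i, ch in enumerate(text):
        q = delta[q][ch]
        if q == m:                           # accepting state reached
            shifts.append(i - m + 1)
    return shifts


print(finite_automaton_matcher("abababacaba", "ababaca"))  # [2]
```

The search loop does one table lookup per text character, which is where the O(n) search time comes from. The cost moves into building the table, which this simple construction does slowly.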
KMP Algorithm
● Knuth, Morris and Pratt proposed a linear-time algorithm for the string matching
problem.
● This approach is similar to the finite automaton approach.
● The key observation: when a mismatch occurs after several characters have matched,
the matched part of the text is identical to a prefix of the pattern. We can
therefore match the pattern against itself, precomputing a prefix function that
tells us how far we can shift ahead.
● This means we can dispense with computing the transition function δ
altogether.
Components of KMP Algorithm
● The Prefix Function (Π):
○ The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against
shifts of itself. This information can be used to avoid useless shifts of the pattern P. In other
words, it avoids backtracking in the text T.
● The KMP Matcher:
○ With the text T, the pattern P and the prefix function Π as inputs, it finds the occurrences of P in T
and reports the shifts of P at which the occurrences are found.
Computing the prefix function
COMPUTE-PREFIX-FUNCTION(P)
    m ← length[P]                    // P is the pattern to be matched
    Π[1] ← 0
    k ← 0
    for q ← 2 to m
        do while k > 0 and P[k + 1] ≠ P[q]
               do k ← Π[k]
           if P[k + 1] = P[q]
               then k ← k + 1
           Π[q] ← k
    return Π
Example of prefix function
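A Python sketch of COMPUTE-PREFIX-FUNCTION with a worked example. It is a 0-based translation, so pi[q] here corresponds to Π[q+1] in the pseudocode; the example patterns are chosen for illustration.

```python
def compute_prefix_function(pattern):
    """pi[q] = length of the longest proper prefix of pattern[:q+1] that is also its suffix."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of the currently matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                   # fall back to the next shorter candidate
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))   # [0, 0, 1, 2, 3, 0, 1]
print(compute_prefix_function("abxyabxz"))  # [0, 0, 0, 0, 1, 2, 3, 0]
```

For "abxyabxz", the value 3 at the seventh character says that after matching "abxyabx" the prefix "abx" is already matched again, so the matcher can continue from there instead of restarting.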
KMP Matcher
The KMP Matcher, with the pattern P, the text T and the prefix function Π as input, finds
the matches of P in T. The following pseudocode computes the matching component of the
KMP algorithm:

KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π ← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0 // number of characters matched
5. for i ← 1 to n // scan T from left to right
6.     do while q > 0 and P [q + 1] ≠ T [i]
7.            do q ← Π [q] // next character does not match
8.        if P [q + 1] = T [i]
9.            then q ← q + 1 // next character matches
10.       if q = m // is all of P matched?
11.           then print "Pattern occurs with shift" i - m
12.                q ← Π [q] // look for the next match
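A Python sketch of the matcher, translated to 0-based indexing (so the reported shift is i - m + 1 rather than i - m). The prefix function is repeated so that the block runs on its own; returning a list instead of printing is a choice made for this illustration.

```python
def compute_prefix_function(pattern):
    """pi[q] = length of the longest proper prefix of pattern[:q+1] that is also its suffix."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text, pattern):
    """Return all valid 0-based shifts of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    q = 0                                   # number of characters matched
    shifts = []
    for i in range(n):                      # scan the text from left to right
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                   # next character does not match
        if pattern[q] == text[i]:
            q += 1                          # next character matches
        if q == m:                          # all of the pattern matched
            shifts.append(i - m + 1)
            q = pi[q - 1]                   # look for the next match
    return shifts


print(kmp_matcher("xabxyabxyabxz", "abxyabxz"))  # [5]
```

The index i never moves backwards, which is the "no backtracking in the text" property described earlier.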
KMP Runtime Analysis
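The for loop beginning in step 5 runs n times, where n is the length of the text T.
Steps 1, 2 and 4 take constant time, and step 3 (COMPUTE-PREFIX-FUNCTION) takes O(m)
time. Inside the loop, q increases by at most 1 per iteration (step 9) and every execution
of step 7 decreases it, so step 7 runs at most n times in total. Thus the running time of
the matching phase is O(n), and O(m+n) including the computation of Π.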
