Unit 3

The document discusses various string pattern matching algorithms, focusing on the brute force method and the Knuth-Morris-Pratt (KMP) algorithm, which improves efficiency by avoiding unnecessary comparisons. It also introduces tries, a tree-based data structure for efficient pattern and prefix matching, along with their advantages and limitations. Additionally, it covers compressed tries, which optimize standard tries by reducing redundant nodes and improving search efficiency.

Uploaded by

lixaxaj663

Pattern Matching Algorithms

In the classic pattern matching problem on strings, we are given a text string
T of length n and a pattern string P of length m, and want to find whether P is
a substring of T.

The notion of a “match” is that there is a substring of T starting at some index i that matches P, character by character, so that T[i] = P[0], T[i + 1] = P[1], ..., T[i + m − 1] = P[m − 1].
That is, P = T[i..i + m − 1].

Thus, the output from a pattern matching algorithm could either be some
indication that the pattern P does not exist in T or an integer indicating the
starting index in T of a substring matching P.
Brute Force
• Brute force is a problem-solving technique that systematically explores all possible solutions or combinations to find the correct one, without using any shortcuts or optimizations.
• For pattern matching, this means testing all the possible placements of P relative to T.
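The placement test described above can be sketched in Python as follows. This is a minimal illustration, not code from the original slides: the function name `brute_force_find` and the convention of returning -1 when there is no match are assumptions, and indices are 0-based.

```python
def brute_force_find(text: str, pattern: str) -> int:
    """Return the lowest 0-based index i with text[i:i+m] == pattern, or -1."""
    n, m = len(text), len(pattern)
    # Outer loop: try every placement of the pattern relative to the text.
    for i in range(n - m + 1):
        j = 0
        # Inner loop: compare character by character at this placement.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched
            return i
    return -1


print(brute_force_find("bacbabababacaab", "ababaca"))  # 6
print(brute_force_find("abc", "d"))                    # -1
```

The outer loop tries at most n − m + 1 placements and the inner loop makes at most m comparisons per placement.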
Performance
The outer for loop is executed at most n − m + 1 times, and the inner loop is executed at most m times.
Thus, the running time of the brute-force method is O((n − m + 1)m), which is simplified as O(nm).
Note that when m = n/2, this algorithm has quadratic running time O(n^2).

Limitations of Brute Force


1.Time Complexity:
1. Brute force methods are often inefficient for large input sizes: O(nm) for pattern matching, and exponential or factorial time for some combinatorial problems.
2.Resource Intensive:
1. Consumes significant computational resources, including processing power and memory.
3.Scalability Issues:
1. Not practical for problems with large input spaces due to the combinatorial explosion of
possibilities.
4.Impractical for Real-Time Applications:
1. Cannot be used in situations requiring fast or real-time solutions due to its slow
processing.
The Knuth-Morris-Pratt Algorithm
Knuth, Morris and Pratt proposed a linear-time algorithm for the string matching problem.
A matching time of O(n) is achieved by avoiding comparisons with elements of ‘S’ that have previously been involved in a comparison with some element of the pattern ‘p’ to be matched, i.e., backtracking on the string ‘S’ never occurs.

“No backtracking”
Components of KMP algorithm
• The prefix function / failure function, Π
The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. For each position q, Π[q] is the length of the longest proper prefix of p[1..q] that is also a suffix of p[1..q]. This information can be used to avoid useless shifts of the pattern ‘p’. In other words, it enables avoiding backtracking on the string ‘S’.

• The KMP Matcher
With string ‘S’, pattern ‘p’ and prefix function ‘Π’ as inputs, it finds the occurrence of ‘p’ in ‘S’ and returns the number of shifts of ‘p’ after which the occurrence is found.
The prefix function, Π
The KMP Matcher
Illustration: given a String ‘S’ and pattern ‘p’
as follows:

S b a c b a b a b a b a c a a b

p a b a b a c a
Let us execute the KMP algorithm to
find whether ‘p’ occurs in ‘S’.
For ‘p’ the prefix function, Π was computed previously and is as
follows:
q 1 2 3 4 5 6 7

p a b a b a c a

Π 0 0 1 2 3 0 1
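The prefix function values can be checked with a short Python sketch. The function name `compute_prefix_function` is chosen here for illustration, and the code uses 0-based indexing, so `pi[q]` in the code corresponds to Π[q + 1] in the slides.

```python
def compute_prefix_function(p: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    m = len(p)
    pi = [0] * m
    k = 0  # length of the prefix currently matched
    for q in range(1, m):
        # On a mismatch, fall back to the next shorter prefix that is also a suffix.
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

For q = 6 the value is 0 because no proper prefix of "ababac" is also a suffix of it.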
Initially: n = size of S = 15;
m = size of p = 7

Step 1: i = 1, q = 0
Comparing p[1] with S[1]
S b a c b a b a b a b a c a a b

p a b a b a c a
p[1] does not match with S[1]. ‘p’ will be shifted one position to the right.
Step 2: i = 2, q = 0
comparing p[1] with S[2]

S b a c b a b a b a b a c a a b

p a b a b a c a
p[1] matches S[2]. Since there is a match, p is not shifted.
Step 3: i = 3, q = 1
Comparing p[2] with S[3]
p[2] does not match with S[3]
S b a c b a b a b a b a c a a b

p a b a b a c a
Backtracking on p: after the mismatch q = Π[1] = 0, so p[1] is compared with S[3], which does not match either.
Step 4: i = 4, q = 0
Comparing p[1] with S[4]
p[1] does not match with S[4]
S b a c b a b a b a b a c a a b

p a b a b a c a
Step 5: i = 5, q = 0
Comparing p[1] with S[5]
p[1] matches with S[5]
S b a c b a b a b a b a c a a b

p a b a b a c a
Step 6: i = 6, q = 1
Comparing p[2] with S[6]
p[2] matches with S[6]
S b a c b a b a b a b a c a a b

p a b a b a c a

Step 7: i = 7, q = 2
Comparing p[3] with S[7]
p[3] matches with S[7]

S b a c b a b a b a b a c a a b

p a b a b a c a
Step 8: i = 8, q = 3
Comparing p[4] with S[8]
p[4] matches with S[8]

S b a c b a b a b a b a c a a b

p a b a b a c a
Step 9: i = 9, q = 4
Comparing p[5] with S[9]
p[5] matches with S[9]
S b a c b a b a b a b a c a a b

p a b a b a c a

Step 10: i = 10, q = 5
Comparing p[6] with S[10]
p[6] doesn’t match with S[10]

S b a c b a b a b a b a c a a b

p a b a b a c a
Backtracking on p, comparing p[4] with S[10] because after the mismatch q = Π[5] = 3. p[4] matches with S[10], so q becomes 4.

Step 11: i = 11, q = 4
Comparing p[5] with S[11]
p[5] matches with S[11]
S b a c b a b a b a b a c a a b

p a b a b a c a
Step 12: i = 12, q = 5
Comparing p[6] with S[12]
p[6] matches with S[12]
S b a c b a b a b a b a c a a b

p a b a b a c a

Step 13: i = 13, q = 6
Comparing p[7] with S[13]
p[7] matches with S[13]

S b a c b a b a b a b a c a a b

p a b a b a c a

Pattern ‘p’ has been found to occur completely in string ‘S’. The total number of shifts that took place before the match was found is i – m = 13 – 7 = 6.
Running-time analysis

• Compute-Prefix-Function (Π)
1 m ← length[p] // ‘p’ is the pattern to be matched
2 Π[1] ← 0
3 k ← 0
4 for q ← 2 to m
5     do while k > 0 and p[k+1] != p[q]
6         do k ← Π[k]
7     if p[k+1] = p[q]
8         then k ← k + 1
9     Π[q] ← k
10 return Π

In the above pseudocode for computing the prefix function, the for loop from step 4 to step 9 runs m − 1 times. Step 1 to step 3 take constant time. The while loop in steps 5 and 6 only decreases k, and k increases at most once per iteration of the for loop, so the while loop runs O(m) times in total. Hence the running time of the compute prefix function is Θ(m).

• KMP Matcher
1 n ← length[S]
2 m ← length[p]
3 Π ← Compute-Prefix-Function(p)
4 q ← 0
5 for i ← 1 to n
6     do while q > 0 and p[q+1] != S[i]
7         do q ← Π[q]
8     if p[q+1] = S[i]
9         then q ← q + 1
10     if q = m
11         then print “Pattern occurs with shift” i – m
12             q ← Π[q]

The for loop beginning in step 5 runs ‘n’ times, i.e., as long as the length of the string ‘S’. Steps 1, 2 and 4 take constant time, and by the same argument as above the while loop in steps 6 and 7 runs O(n) times in total, so the running time is dominated by this for loop. Thus the running time of the matching function is Θ(n), excluding the Θ(m) spent in step 3 on computing Π.
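The matcher can be sketched in Python as below. This is an illustrative translation under stated assumptions, not the slides' own code: the name `kmp_search` is hypothetical, indices are 0-based, and the function collects all shifts in a list instead of printing them.

```python
def kmp_search(s: str, p: str) -> list[int]:
    """Return every 0-based shift at which pattern p occurs in string s."""
    n, m = len(s), len(p)
    if m == 0:
        return []
    # Prefix (failure) function of the pattern, 0-based.
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    # Scan the string once; i never moves backwards.
    shifts = []
    q = 0  # number of pattern characters currently matched
    for i in range(n):
        while q > 0 and p[q] != s[i]:
            q = pi[q - 1]  # backtrack on the pattern only
        if p[q] == s[i]:
            q += 1
        if q == m:  # full match ending at position i
            shifts.append(i - m + 1)
            q = pi[q - 1]  # keep looking for further occurrences
    return shifts


print(kmp_search("bacbabababacaab", "ababaca"))  # [6]
print(kmp_search("aaaa", "aa"))                  # [0, 1, 2]
```

The first call reproduces the shift of 6 found in the step-by-step illustration.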
TRIES
• A trie is a tree-based data structure designed for storing strings to enable efficient pattern matching and prefix matching operations. The term "trie" originates from "retrieval" and is pronounced as "try."
• Key Features:
1.Tree Structure:
1. Each node represents a character.
2. A path from the root to a node corresponds to a prefix of the strings stored in the
trie.
2.Storage:
1. Strings share common prefixes, saving space.
3.Applications:
1. Information retrieval systems.
2. DNA sequence searches.
3. Autocomplete in search engines.
4. Spell-checking.
TRIES
• Use Cases:
1.Pattern Matching: Find all occurrences of a pattern in a collection of strings.
2.Prefix Matching: Retrieve all strings starting with a given prefix.
• Advantages:
1.Efficient Search: Time complexity for searching is proportional to the length of the
string O(m), where m is the string length.
2.Space Optimization: Common prefixes are stored only once.
3.Preprocessing Benefit: Preprocesses text for repeated queries, making subsequent
searches faster.

• Limitations:
1.Initial Overhead: High cost of preprocessing the text.
2.Space Usage: Can be memory-intensive for large alphabets or sparse data.
3.Dynamic Updates: Adding or removing strings can complicate the structure.
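The search and prefix-matching use cases above can be illustrated with a small Python sketch. The class and method names (`Trie`, `insert`, `contains`, `with_prefix`) are assumptions made for this example. It also marks the end of a word with an `is_word` flag, which is one common design choice and lets one stored word be a prefix of another.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, s: str):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._walk(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix: str) -> list[str]:
        """Return all stored strings that start with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        results, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    trie.insert(w)
print(trie.contains("bell"))    # True
print(trie.contains("bel"))     # False
print(trie.with_prefix("st"))   # ['stock', 'stop']
```

Each lookup follows one child link per character, so its cost depends on the query length m rather than on the number of stored strings.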
STANDARD TRIES
A standard trie is an ordered tree structure used to represent a set S of s strings over
an alphabet Σ, with no string in S being a prefix of another. Each string in S is uniquely
associated with a path from the root to an external node.
• Key Properties:
1.Node Labels: Each node, except the root, is labeled with a character from the alphabet
Σ.
2.Canonical Ordering: Children of an internal node are ordered based on a canonical
ordering of Σ (e.g., lexicographical).
3.External Nodes:
1. There are s external nodes, each representing a string in S.
2. The path from the root to an external node concatenates to form the string it
represents.
4.Prefix Representation:
1. Paths from the root to any internal node represent prefixes of strings in S
2. Common prefixes among strings are stored compactly.
STANDARD TRIES

Structural Characteristics:
1.Children Count:
•Each internal node can have between 1 and d children, where d is the size of the alphabet Σ.
2.Multi-Way Tree:
•A standard trie is a multi-way tree; for binary alphabets, it behaves as a binary tree.
3.Efficient Storage:
•Common prefixes are shared across paths, reducing redundancy.

Advantages:
1.Efficient Prefix Matching:
•Quickly retrieves strings sharing a common prefix.
2.Compact Representation:
•Stores shared prefixes only once.
3.Flexibility:
•Supports multi-way branching based on alphabet size.
Properties of a Standard Trie T:
1.Children Count:
•Each internal node has at most d children, where d is the size of the alphabet.
2.External Nodes:
•There are s external nodes, one for each string in the collection S.
3.Height of the Trie:
•The height is equal to the length of the longest string in S
4.Node Count:
•The total number of nodes is O(n), where n is the total length of all strings in S
Worst-Case Node Count:
•Occurs when no two strings share a common nonempty prefix.
•In that case, all internal nodes except the root have only one child.

Applications of Tries: Dictionary Implementation


•Search: Trace the path in T for a string X.
•If the path exists and ends at an external node, X is in the dictionary.
•If the path ends at an internal node or cannot be fully traced, X is not in the dictionary.
•Time Complexity:
•Search time for a string of size m: O(dm), where d is the size of the alphabet.
•With optimizations such as hash tables or search tables at each node, the time per node can be reduced to expected O(1) or O(log d), respectively.
Word Matching Using Tries:
•Determines if a pattern matches a word exactly.
•Query time: O(dm) where m is the pattern length and d is the alphabet size.
•For constant-size alphabets, the query time becomes O(m)
•Efficient for prefix matching queries.
•Limitation: Cannot efficiently handle arbitrary substring matches.

Construction of a Standard Trie:


1.Incrementally insert strings from S
2.For a string X:
1. Trace its path in T.
2. Create new nodes for unmatched characters of X.
3.Time Complexity:
1. Insert one string: O(dm), where m is its length.
2. Build the entire trie: O(dn), where n is the total length of all strings in S.
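The incremental construction can be sketched with nested Python dictionaries, where each dictionary is a node and each key is the character labelling a child. The helper names `build_trie`, `count_nodes` and `is_member` are hypothetical, and the sketch assumes, as in the definition of a standard trie, that no string is a prefix of another.

```python
def build_trie(strings):
    """Insert the strings one at a time; each dict is a node keyed by character."""
    root = {}
    for x in strings:
        node = root
        for ch in x:
            # Trace the existing path; create a node for each unmatched character.
            node = node.setdefault(ch, {})
    return root


def count_nodes(node) -> int:
    return 1 + sum(count_nodes(child) for child in node.values())


def is_member(root, x: str) -> bool:
    """x is in the set if its path exists and ends at an external node."""
    node = root
    for ch in x:
        if ch not in node:
            return False
        node = node[ch]
    return not node  # an external node has no children


words = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
trie = build_trie(words)
print(count_nodes(trie))                 # 22
print(sum(len(w) for w in words) + 1)    # 32
print(is_member(trie, "bull"), is_member(trie, "bul"))  # True False
```

Here the total length of the strings is n = 31, so a trie with no shared prefixes would need n + 1 = 32 nodes including the root, while sharing prefixes brings the count down to 22.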
COMPRESSED TRIES
A compressed trie is an optimized version of a standard trie that minimizes the
number of nodes by compressing chains of redundant internal nodes. Here are the key
points about compressed tries:
Definition
•A compressed trie retains the structure of a standard trie but ensures that each internal
node has at least two children.
•It achieves this by merging consecutive internal nodes that each have only one child
into a single edge.

Characteristics
•Redundant Nodes: An internal node is considered redundant if it has only one child and is not the root. For example, if several consecutive nodes each have exactly one child, they can be merged into a single node labeled with the concatenation of their characters.
•Redundant Chains: A chain of edges is defined as redundant if all of its interior nodes are redundant and it begins and ends with non-redundant nodes.
COMPRESSED TRIES
Transformation Process
•In a compressed trie, every chain of redundant edges (v0, v1, …, vk) (where k ≥ 2) is replaced by a single edge from v0 to vk. The label of the new edge is the concatenation of the labels of the nodes that were compressed.
•The resulting nodes in the compressed trie are labeled with strings instead of individual
characters.

Advantages
•Node Count: The number of nodes in a compressed trie is proportional to the number
of distinct strings stored rather than the total length of all strings. This significantly
reduces the space complexity compared to standard tries.
•Efficiency: The compressed structure enables faster searches since there are fewer
nodes to traverse.
Properties of Compressed Tries
1.Children Count:
1. Every internal node has at least two children and at most d children, where d is
the size of the alphabet.
2.External Nodes:
1. The trie contains s external nodes, each representing a distinct string in the
collection.
3.Node Count:
1. The number of nodes in the compressed trie is O(s), meaning it scales linearly
with the number of strings rather than their total length.
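The compression step can be sketched on a nested-dictionary trie. This is an illustrative sketch only: the helper names `build_trie`, `compress` and `count_nodes` are assumptions, each dictionary key is treated as the label of the child it leads to, and no stored string is assumed to be a prefix of another.

```python
def build_trie(strings):
    """Standard trie as nested dicts: one single-character key per child."""
    root = {}
    for x in strings:
        node = root
        for ch in x:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Return a compressed copy in which chains of one-child nodes are merged."""
    out = {}
    for ch, child in node.items():
        label = ch
        # A non-root node with exactly one child is redundant: absorb it.
        while len(child) == 1:
            nxt, grandchild = next(iter(child.items()))
            label += nxt
            child = grandchild
        out[label] = compress(child)
    return out


def count_nodes(node) -> int:
    return 1 + sum(count_nodes(child) for child in node.values())


words = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
standard = build_trie(words)
compressed = compress(standard)
print(count_nodes(standard), count_nodes(compressed))  # 22 14
print(sorted(compressed["s"]))                         # ['ell', 'to']
```

With s = 8 strings the compressed trie has 14 nodes, which is below 2s, in line with the O(s) node count stated above.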
SUFFIX TRIES
A suffix trie (also known as a suffix tree or position tree) is a specialized trie data
structure that stores all the suffixes of a given string X. This allows for efficient pattern
matching and substring searching.
Key Characteristics
1.Suffix Representation:
1. Each node in a suffix trie can be labeled with a pair (i,j), indicating the substring
X[i..j].
2. To avoid ambiguity (where one suffix is a prefix of another), a special character
(often denoted as "$") is appended to the end of the string X.
2.Space Efficiency:
1. Storing all suffixes of a string X of length n explicitly would take O(n^2) space, as the total length of the suffixes is 1 + 2 + … + n = n(n + 1)/2.
2. However, using a compact representation, a suffix trie can be constructed to use only O(n) space.
Construction
•Constructing a suffix trie with the naive approach of inserting one suffix at a time takes O(dn^2) time, where d is the size of the alphabet.
•A more efficient, specialized algorithm can construct it in O(n) time for a constant-size alphabet, although this algorithm is more complex and is not detailed here.
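The naive construction can be sketched as follows. Note the assumptions: this builds an uncompressed suffix trie as nested dictionaries, so it uses O(n^2) space rather than the compact O(n) representation, the names `build_suffix_trie` and `occurs` are hypothetical, and the input is assumed not to contain the terminator "$".

```python
def build_suffix_trie(x: str) -> dict:
    """Insert every suffix of x + '$' into a trie of nested dicts (naive method)."""
    x += "$"  # terminator so that no suffix is a prefix of another
    root = {}
    for i in range(len(x)):
        node = root
        for ch in x[i:]:
            node = node.setdefault(ch, {})
    return root


def occurs(root: dict, pattern: str) -> bool:
    """A pattern is a substring of x iff it is a prefix of some suffix of x."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("minimize")
print(occurs(trie, "nimi"))  # True
print(occurs(trie, "mize"))  # True
print(occurs(trie, "zen"))   # False
```

Once the trie is built, each substring query follows at most one child link per pattern character.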
SEARCH ENGINES
The World Wide Web is a vast collection of text documents (web pages), and search
engines are designed to help users retrieve relevant information from this extensive
database. The information is gathered through a program known as a web crawler and
stored in a dictionary database.

Inverted Index
An inverted index is a data structure commonly used in search engines and databases
to facilitate fast full-text searches. It maps words (or terms) to their locations in a
collection of documents, enabling efficient retrieval of documents that contain specific
words.

The core component of a search engine is an inverted index (or inverted file), which is
structured as key-value pairs (w,L):
•Key (w): A word (index term).
•Value (L): A collection of pages containing the word w.

Key Concepts
1.Index Terms: These are words that form the keys in the inverted index. They should
encompass a comprehensive set of vocabulary entries and proper nouns to improve
search effectiveness.
2.Occurrence Lists: For each index term, the occurrence list should cover as many as possible of the web pages in which the word appears.
SEARCH ENGINES
Data Structure Implementation
An inverted index can be efficiently implemented using:
1.Array: Stores occurrence lists for terms (unordered).
2.Compressed Trie: Used for storing index terms. Each external node in the trie points
to the index of the corresponding occurrence list.
The separation of occurrence lists from the trie helps keep the trie small enough to fit in
internal memory, while the larger occurrence lists can be stored on disk.

Querying
•Single Keyword Search: To retrieve pages containing a keyword, the search engine
finds the keyword in the trie and returns the associated occurrence list.
•Multiple Keyword Search: For queries with multiple keywords, the engine retrieves
the occurrence lists for each keyword and computes their intersection, typically using
sorted sequences or dictionaries for efficient merging.
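The querying scheme can be sketched in Python as below. This is a simplified illustration under stated assumptions: a plain Python dictionary stands in for the compressed trie of index terms, pages are identified by integer ids, words are split on whitespace, and the function names `build_inverted_index`, `intersect` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(pages: dict[int, str]) -> dict[str, list[int]]:
    """Map each word w to its occurrence list L of page ids, kept sorted."""
    index = defaultdict(list)
    for page_id in sorted(pages):
        for word in set(pages[page_id].lower().split()):
            index[word].append(page_id)
    return dict(index)


def intersect(a: list[int], b: list[int]) -> list[int]:
    """Merge-style intersection of two sorted occurrence lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index: dict[str, list[int]], *keywords: str) -> list[int]:
    """Return the ids of pages that contain every keyword."""
    if not keywords:
        return []
    result = index.get(keywords[0].lower(), [])
    for word in keywords[1:]:
        result = intersect(result, index.get(word.lower(), []))
    return result


pages = {
    1: "tries store strings",
    2: "suffix tries index strings",
    3: "search engines rank pages",
}
index = build_inverted_index(pages)
print(search(index, "tries"))             # [1, 2]
print(search(index, "suffix", "strings")) # [2]
print(search(index, "rank", "tries"))     # []
```

Keeping each occurrence list sorted lets the intersection run in time linear in the lengths of the two lists.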

Ranking Pages
Beyond simply returning a list of pages that match the query, search engines also rank
these pages based on relevance. Developing efficient and accurate ranking algorithms is
a significant area of research and development for search engines and e-commerce
companies.
