DAA_unit_5
• Algebraic Computation
• Fast Fourier Transform
• String Matching
• Theory of NP-Completeness
• Approximation Algorithms
• Randomized Algorithms
String Matching Introduction
A string matching algorithm is also called a "string searching algorithm." It is a vital class of string algorithms, defined as the problem of finding a place where one or several strings (patterns) occur within a larger string.
Given a text array T [1.....n] of n characters and a pattern array P [1.....m] of m characters, the problem is to find an integer s, called a valid shift, where 0 ≤ s ≤ n-m and T [s+1.....s+m] = P [1.....m]. In other words, we want to find every place where P occurs in T, i.e., where P is a substring of T. The items of P and T are characters drawn from some finite alphabet such as {0, 1} or {A, B.....Z, a, b.....z}.
Given a string T [1.....n], a substring T [i.....j], for some 1 ≤ i ≤ j ≤ n, is the string formed by the characters in T from index i to index j, inclusive. By this definition, a string is a substring of itself (take i = 1 and j = n).
A proper substring of a string T [1.....n] is a substring T [i.....j] with either i > 1 or j < n; that is, any substring other than T itself.
Using these definitions, we can say that given any string T [1.....n], the substrings are
1. T [i.....j] = T [i] T [i+1] T [i+2].....T [j] for some 1 ≤ i ≤ j ≤ n.
And the proper substrings are
1. T [i.....j] = T [i] T [i+1] T [i+2].....T [j] for some 1 ≤ i ≤ j ≤ n with i > 1 or j < n.
Note: If i > j, then T [i.....j] is the empty string or null, which has length zero.
Algorithms used for String Matching:
Different methods are used to find a pattern in a string:
1. The Naive String Matching Algorithm
2. The Rabin-Karp-Algorithm
3. Finite Automata
4. The Knuth-Morris-Pratt Algorithm
5. The Boyer-Moore Algorithm
The Naive String Matching Algorithm
The naïve approach tests all possible placements of the pattern P [1.....m] relative to the text T [1.....n]. We try the shifts s = 0, 1.....n-m successively and, for each shift s, compare T [s+1.....s+m] to P [1.....m].
The naïve algorithm finds all valid shifts using a loop that checks the condition P [1.....m] = T [s+1.....s+m] for each of the n - m + 1 possible values of s.
NAIVE-STRING-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. for s ← 0 to n -m
4. do if P [1.....m] = T [s + 1....s + m]
5. then print "Pattern occurs with shift" s
Analysis: The for loop from line 3 to line 5 executes n - m + 1 times (we need at least m characters left at the end), and in each iteration we do up to m comparisons. So the total complexity is O ((n - m + 1) m).
Example:
1. Suppose T = 1011101110
2. P = 111
3. Find all the valid shifts
Solution:
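The alignment-by-alignment table of the original solution is not reproduced here, but a minimal Python sketch of the naive matcher (using the 0-based shifts defined above; the function name is ours) finds the valid shifts directly:

def naive_string_matcher(text, pattern):
    # Try every shift s = 0..n-m and report those where the pattern matches
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # n - m + 1 candidate shifts
        if text[s:s + m] == pattern:    # up to m character comparisons
            shifts.append(s)
    return shifts

print(naive_string_matcher("1011101110", "111"))  # prints [2, 6]

For T = 1011101110 and P = 111, the valid shifts are s = 2 and s = 6.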
The Rabin-Karp-Algorithm
The Rabin-Karp string matching algorithm calculates a hash value for the pattern, as well as for each m-character substring of the text to be compared. If the hash values are unequal, the algorithm computes the hash value for the next m-character substring. If the hash values are equal, the algorithm compares the pattern and the m-character substring character by character. In this way, there is only one hash comparison per text substring, and character matching is only required when the hash values match.
RABIN-KARP-MATCHER (T, P, d, q)
1. n ← length [T]
2. m ← length [P]
3. h ← d^(m-1) mod q
4. p ← 0
5. t_0 ← 0
6. for i ← 1 to m
7. do p ← (dp + P [i]) mod q
8. t_0 ← (dt_0 + T [i]) mod q
9. for s ← 0 to n-m
10. do if p = t_s
11. then if P [1.....m] = T [s+1.....s+m]
12. then print "Pattern occurs with shift" s
13. if s < n-m
14. then t_(s+1) ← (d (t_s - T [s+1] h) + T [s+m+1]) mod q
Example: Working modulo q = 11, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 31415926535?
1. T = 31415926535
2. P = 26
3. Here the working modulus is q = 11
4. And P mod q = 26 mod 11 = 4
5. Now find every window whose hash equals P mod q and check which of them actually match P
Solution:
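The original worked solution was given as a figure. As a sketch, the rolling-hash computation can be reproduced in Python with radix d = 10 (decimal digits); the function name is ours:

def rabin_karp_spurious_hits(text, pattern, d=10, q=11):
    # Return (genuine match shifts, number of spurious hits)
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)                  # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                    # preprocess pattern and first window
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    matches, spurious = [], 0
    for s in range(n - m + 1):
        if t == p:
            if text[s:s + m] == pattern:
                matches.append(s)
            else:
                spurious += 1             # hash matched, characters did not
        if s < n - m:                     # roll the hash to the next window
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return matches, spurious

print(rabin_karp_spurious_hits("31415926535", "26"))  # ([6], 3)

The windows 15, 59 and 92 also hash to 4 mod 11, so there are three spurious hits before the genuine match 26 at shift 6.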
Complexity:
The running time of RABIN-KARP-MATCHER in the worst case scenario O ((n-m+1) m but it has a good
average case running time. If the expected number of strong shifts is small O (1) and prime q is chosen to be
quite large, then the Rabin-Karp algorithm can be expected to run in time O (n+m) plus the time to require
to process spurious hits.
String Matching with Finite Automata
The string-matching automaton is a very useful tool in string matching algorithms. It examines every character in the text exactly once and reports all the valid shifts in O (n) time. The goal of string matching is to find the location of a specific text pattern within a larger body of text (a sentence, a paragraph, a book, etc.).
Finite Automata:
A finite automaton M is a 5-tuple (Q, q0, A, ∑, δ), where
o Q is a finite set of states,
o q0 ∈ Q is the start state,
o A ⊆ Q is a distinguished set of accepting states,
o ∑ is a finite input alphabet,
o δ is a function from Q x ∑ into Q called the transition function of M.
The finite automaton starts in state q0 and reads the characters of its input string one at a time. If the
automaton is in state q and reads input character a, it moves from state q to state δ (q, a). Whenever its current
state q is a member of A, the machine M has accepted the string read so far. An input that is not allowed
is rejected.
A finite automaton M induces a function ∅, called the final-state function, from ∑* to Q such that ∅(w) is the state M ends up in after scanning the string w. Thus, M accepts a string w if and only if ∅(w) ∈ A.
The function ∅ is defined recursively as
∅(ε) = q0
∅(wa) = δ(∅(w), a) for w ∈ ∑*, a ∈ ∑
FINITE-AUTOMATON-MATCHER (T, δ, m)
1. n ← length [T]
2. q ← 0
3. for i ← 1 to n
4. do q ← δ (q, T [i])
5. if q = m
6. then s ← i - m
7. print "Pattern occurs with shift" s
The primary loop structure of FINITE- AUTOMATON-MATCHER implies that its running time on a text
string of length n is O (n).
Computing the Transition Function: The following procedure computes the transition function δ from a given pattern P [1.....m].
COMPUTE-TRANSITION-FUNCTION (P, ∑)
1. m ← length [P]
2. for q ← 0 to m
3. do for each character a ∈ ∑
4. do k ← min (m + 1, q + 2)
5. repeat k ← k - 1
6. until P [1.....k] is a suffix of P [1.....q] a
7. δ (q, a) ← k
8. return δ
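As a sketch, both procedures translate to 0-based Python as follows (the names and the sample pattern are ours; delta is stored as a list of dictionaries):

def compute_transition_function(pattern, alphabet):
    # Build the automaton's transition table: delta[q][a] is the next state
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)
            # find the largest k with P[0:k] a suffix of P[0:q] + a
            while k > 0 and not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[q][a] = k
    return delta

def finite_automaton_matcher(text, delta, m):
    # Scan the text once, reporting every shift where the pattern occurs
    q = 0
    for i, c in enumerate(text, start=1):
        q = delta[q].get(c, 0)
        if q == m:
            print("Pattern occurs with shift", i - m)

pattern = "aba"
delta = compute_transition_function(pattern, {"a", "b"})
finite_automaton_matcher("abababa", delta, len(pattern))  # shifts 0, 2, 4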
Example: Suppose a finite automaton which accepts strings containing an even number of a's, where ∑ = {a, b, c}.
Solution:
q0 is the initial (and accepting) state: reading an 'a' toggles the automaton between q0 and q1, while 'b' and 'c' leave the current state unchanged.
The Knuth-Morris-Pratt (KMP)Algorithm
Knuth, Morris, and Pratt introduced a linear-time algorithm for the string matching problem. A matching time of O (n) is achieved by avoiding comparisons with elements of the text 'S' that have previously been involved in a comparison with some element of the pattern 'p'; i.e., backtracking on the string 'S' never occurs.
Components of KMP Algorithm:
1. The Prefix Function (Π): The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern 'p'. In other words, it enables avoiding backtracking on the string 'S'.
2. The KMP Matcher: With string 'S', pattern 'p' and prefix function 'Π' as inputs, it finds the occurrences of 'p' in 'S' and returns the number of shifts of 'p' after which each occurrence is found.
The Prefix Function (Π)
The following pseudocode computes the prefix function Π:
COMPUTE- PREFIX- FUNCTION (P)
1. m ←length [P] //'p' pattern to be matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5. do while k > 0 and P [k + 1] ≠ P [q]
6. do k ← Π [k]
7. If P [k + 1] = P [q]
8. then k← k + 1
9. Π [q] ← k
10. Return Π
Running Time Analysis:
In the above pseudocode for calculating the prefix function, the for loop from step 4 to step 9 runs m - 1 times (q = 2 to m). Steps 1 to 3 take constant time. Hence the running time of computing the prefix function is O (m).
Example: Compute Π for the pattern 'p' below:
Solution:
Initially: m = length [p] = 7
Π [1] = 0
k=0
After iterating 6 times (q = 2 through 7), the prefix function computation is complete:
Let us execute the KMP algorithm to find whether 'P' occurs in 'T'.
For 'p', the prefix function Π was computed previously and is as follows:
Solution:
Initially: n = size of T = 15
m = size of P = 7
Pattern 'P' has been found to occur in string 'T'. The total number of shifts that took place for the match to be found is i - m = 13 - 7 = 6 shifts.
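The pattern and text tables of this example were given as figures. As a sketch, a 0-based Python translation of both procedures, shown here on the classic pattern "ababaca" (m = 7) and an assumed sample text, looks like this:

def compute_prefix_function(pattern):
    # pi[q] = length of the longest proper prefix of the pattern that is
    # also a suffix of pattern[:q + 1]
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                 # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(text, pattern):
    # Report every shift at which the pattern occurs, in O(n + m) time
    pi = compute_prefix_function(pattern)
    q = 0                                 # number of characters matched
    for i, c in enumerate(text):
        while q > 0 and pattern[q] != c:
            q = pi[q - 1]                 # never backtrack on the text
        if pattern[q] == c:
            q += 1
        if q == len(pattern):             # a full match ends at position i
            print("Pattern occurs with shift", i - len(pattern) + 1)
            q = pi[q - 1]

kmp_matcher("bacbababacabab", "ababaca")  # Pattern occurs with shift 4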
The Boyer-Moore Algorithm
Robert Boyer and J Strother Moore established it in 1977. The B-M string search algorithm is a particularly efficient algorithm and has served as a standard benchmark for string search algorithms ever since.
The B-M algorithm takes a 'backward' approach: the pattern string (P) is aligned with the start of the text string (T), and then the characters of the pattern are compared from right to left, beginning with the rightmost character. If the mismatching text character does not occur in the pattern at all, no match can be found at this alignment, so the pattern can be shifted entirely past the mismatching character.
For deciding the possible shifts, the B-M algorithm uses two preprocessing strategies simultaneously. Whenever a mismatch occurs, the algorithm computes a shift using both approaches and selects the larger one, thus making use of the most effective strategy for each case.
The two strategies are called heuristics of B - M as they are used to reduce the search. They are:
1. Bad Character Heuristics
2. Good Suffix Heuristics
1. Bad Character Heuristics
This heuristic has two implications:
o Suppose there is a character in the text which does not occur in the pattern at all. When a mismatch happens at this character (called the bad character), the whole pattern can be shifted past it, and matching begins from the substring next to this 'bad character'.
o On the other hand, the bad character may be present in the pattern; in this case, align the last occurrence of the bad character in the pattern with the bad character in the text.
Thus, in either case, the shift may be greater than one.
Example 1: Let text T = "nyoo nyoo" and pattern P = "noyo".
This means that we need some extra information to produce a shift on encountering a bad character. This information is the last position of every character in the pattern and also the set of characters used in the pattern (often called the alphabet ∑ of the pattern).
COMPUTE-LAST-OCCURRENCE-FUNCTION (P, m, ∑ )
1. for each character a ∈ ∑
2. do λ [a] ← 0
3. for j ← 1 to m
4. do λ [P [j]] ← j
5. Return λ
2. Good Suffix Heuristics:
A good suffix is a suffix that has matched successfully. After a mismatch for which the bad character heuristic would give a useless (negative) shift, we check whether the part of the pattern matched so far (the good suffix) occurs again inside the pattern; if it does, we have an onward jump that realigns that other occurrence of the suffix with the text.
Example:
COMPUTE-GOOD-SUFFIX-FUNCTION (P, m)
1. Π ← COMPUTE-PREFIX-FUNCTION (P)
2. P' ← reverse (P)
3. Π' ← COMPUTE-PREFIX-FUNCTION (P')
4. for j ← 0 to m
5. do ɣ [j] ← m - Π [m]
6. for l ← 1 to m
7. do j ← m - Π' [l]
8. if ɣ [j] > l - Π' [l]
9. then ɣ [j] ← l - Π' [l]
10. return ɣ
BOYER-MOORE-MATCHER (T, P, ∑)
1. n ←length [T]
2. m ←length [P]
3. λ← COMPUTE-LAST-OCCURRENCE-FUNCTION (P, m, ∑ )
4. ɣ← COMPUTE-GOOD-SUFFIX-FUNCTION (P, m)
5. s ←0
6. While s ≤ n - m
7. do j ← m
8. While j > 0 and P [j] = T [s + j]
9. do j ←j-1
10. If j = 0
11. then print "Pattern occurs at shift" s
12. s ← s + ɣ[0]
13. else s ← s + max (ɣ [j], j - λ[T[s+j]])
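As a sketch, here is a simplified 0-based Python version that uses only the bad-character heuristic (a common teaching simplification; the full matcher above also consults the good-suffix table ɣ and shifts by ɣ[0] after a match instead of by 1). The names and the sample strings are ours:

def compute_last_occurrence(pattern):
    # lam[a] = 1-based index of the last occurrence of a in the pattern
    lam = {}
    for j, a in enumerate(pattern, start=1):
        lam[a] = j
    return lam

def boyer_moore_bad_character(text, pattern):
    n, m = len(text), len(pattern)
    lam = compute_last_occurrence(pattern)
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1                        # compare right to left
        if j == 0:
            print("Pattern occurs at shift", s)
            s += 1                        # full algorithm: s += gamma[0]
        else:
            # align the last occurrence of the bad character, or skip past it
            s += max(1, j - lam.get(text[s + j - 1], 0))

boyer_moore_bad_character("here is a simple example", "example")
# Pattern occurs at shift 17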
Complexity Comparison of String Matching Algorithms:
Algorithm              Preprocessing Time     Matching Time
Naive                  0                      O ((n - m + 1) m)
Rabin-Karp             Θ (m)                  O ((n - m + 1) m)
Finite Automaton       O (m |∑|)              Θ (n)
Knuth-Morris-Pratt     Θ (m)                  Θ (n)
Boyer-Moore            O (m + |∑|)            O ((n - m + 1) m)
Approximation Algorithms
An algorithm that returns near-optimal solutions is called an approximation algorithm; its approximation ratio ρ(n) bounds the cost of the returned solution relative to the cost of an optimal solution. Intuitively, the approximation ratio measures how bad the approximate solution is compared with the optimal solution. A large (small) approximation ratio means the solution is much worse than (more or less the same as) an optimal solution.
Observe that ρ(n) is always ≥ 1; if the ratio does not depend on n, we may simply write ρ. Therefore, a 1-approximation algorithm gives an optimal solution. Some problems have polynomial-time approximation algorithms with small constant approximation ratios, while others have best-known polynomial-time approximation algorithms whose approximation ratios grow with n.
Vertex Cover
A vertex cover of a graph G is a set of vertices such that each edge in G is incident to at least one of these vertices.
The decision version of the vertex-cover problem was proven NP-complete. Now we want to solve the optimization version of the problem, i.e., we want to find a minimum-size vertex cover of a given graph. We call such a vertex cover an optimal vertex cover C*.
The idea is to take an edge (u, v) at a time, put both vertices into C, and remove all the edges incident to u or v. We carry on until all edges have been removed. C is then a vertex cover. But how good is C?
For the example graph of the original figure, the algorithm returns VC = {b, c, d, e, f, g}.
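As a sketch, the procedure can be written in Python. It is the classic 2-approximation: each chosen edge must have at least one endpoint in any optimal cover, so |C| ≤ 2 |C*|. The edge list below is a hypothetical example, not the graph from the original figure:

def approx_vertex_cover(edges):
    # Repeatedly pick a remaining edge (u, v), add both endpoints to the
    # cover, and discard every edge now covered by u or v
    cover = set()
    remaining = list(edges)
    while remaining:
        u, v = remaining[0]
        cover.update((u, v))
        remaining = [(x, y) for (x, y) in remaining
                     if x not in (u, v) and y not in (u, v)]
    return cover

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"),
         ("d", "f"), ("e", "f"), ("d", "g")]
print(approx_vertex_cover(edges))  # e.g. {'a', 'b', 'c', 'd', 'e', 'f'}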
Randomized Algorithms | Set 0 (Mathematical Background)
Conditional Probability: The conditional probability P(A | B) indicates the probability of event 'A' happening given that event 'B' has happened.
P(A|B) = P(A ∩ B) / P(B)
We can easily understand the above formula using a Venn diagram: since B has already happened, the sample space reduces to B, so the probability of A happening becomes P(A ∩ B) divided by P(B). For example, for a fair die with A = "a 2 is rolled" and B = "an even number is rolled", P(A|B) = (1/6) / (1/2) = 1/3.
Bayes' formula provides the relationship between P(A|B) and P(B|A). It is derived from the conditional probability formula above.
Consider the formulas for the conditional probabilities P(A|B) and P(B|A):
P(A|B) = P(A ∩ B) / P(B)
P(B|A) = P(B ∩ A) / P(A)
Since P(B ∩ A) = P(A ∩ B), we can replace P(A ∩ B) in the first formula with P(B|A) P(A). After replacing, we get Bayes' formula:
P(A|B) = P(B|A) P(A) / P(B)
Random Variables:
A random variable is actually a function that maps the outcome of a random event (like a coin toss) to a real value.
Example:
Coin tossing game:
A player pays 50 bucks if the result of the coin toss is "Head".
The player gets 50 bucks if the result is "Tail".
A random variable Profit for the player can be defined as below:
Profit = -50 if Head
         +50 if Tail
Generally, gambling games are not fair for players: the organizer takes a share of the profit for all arrangements. So the expected profit is negative for a player in gambling and positive for the organizer. That is how organizers make money.
Expected Value of a Random Variable:
The expected value of a random variable R can be defined as follows:
E[R] = r1*p1 + r2*p2 + ... + rk*pk
where ri is the value of R with probability pi.
The expected value is basically the sum of the products of the following two terms, over all possible events:
a) the probability of an event,
b) the value of R at that event.
Example 1:
In the above example of a coin toss,
Expected value of profit = 50 * (1/2) + (-50) * (1/2) = 0
Example 2:
Expected value of six faced dice throw is
= 1*(1/6) + 2*(1/6) + .... + 6*(1/6)
= 3.5
Linearity of Expectation:
Let R1 and R2 be two discrete random variables on some probability space, then
E[R1 + R2] = E[R1] + E[R2]
For example, the expected value of the sum of 3 dice throws is 3 * 7/2 = 10.5.
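A quick simulation (ours, for illustration) confirms this:

import random

# Empirically check E[sum of 3 dice] = 3 * 3.5 = 10.5
trials = 100_000
total = sum(random.randint(1, 6) + random.randint(1, 6) + random.randint(1, 6)
            for _ in range(trials))
print(total / trials)  # close to 10.5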
Typically, randomized Quick Sort is implemented by randomly picking a pivot (no loop), or by shuffling the array elements. The expected time complexity of this algorithm is O(n log n) even for worst-case (adversarial) inputs, but the analysis is more complex; the MIT lecture cited by the original article makes the same point.
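As a sketch, the pivot-picking version can be written as follows (the out-of-place partition here is a simplification of the usual in-place one):

import random

def randomized_quicksort(arr):
    # Quicksort with a uniformly random pivot: expected O(n log n) time
    if len(arr) <= 1:
        return arr
    pivot = random.choice(arr)            # random pivot defeats adversarial input
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    return randomized_quicksort(less) + equal + randomized_quicksort(greater)

print(randomized_quicksort([3, 6, 1, 8, 2, 9, 4]))  # [1, 2, 3, 4, 6, 8, 9]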
Example:

import random
import time

def find_solution(n):
    # seed the random number generator with the current time
    random.seed(time.time())
    # randomly select a number between 1 and n and return it as the solution
    return random.randint(1, n)

def main():
    n = 10  # the range of possible solutions is 1 to n
    print("Solution:", find_solution(n))

if __name__ == '__main__':
    main()
Output (varies from run to run):
Solution: 10
import random

def random_permutation(array):
    # Shuffle the array in place using the random number generator
    random.shuffle(array)

array = [1, 2, 3, 4, 5]
# Generate a random permutation of the array and print it
random_permutation(array)
print(array)
Output (one possible permutation):
[5, 1, 4, 2, 3]
Example 2:

import random

def find_median(numbers):
    n = len(numbers)
    if n == 0:
        return None
    if n == 1:
        return numbers[0]
    # One possible completion of the truncated original: sort a copy and,
    # for even n, randomly return the lower or the upper median
    s = sorted(numbers)
    if n % 2 == 1:
        return s[n // 2]
    return s[n // 2 - 1 + random.randint(0, 1)]

# Example usage
print(find_median([1, 2, 3, 4, 5]))     # Output: 3
print(find_median([1, 2, 3, 4, 5, 6]))  # Output: 3 or 4 (randomly chosen)
print(find_median([]))                  # Output: None
print(find_median([7]))                 # Output: 7
Output (the second line is randomly 3 or 4):
3
4
None
7
Randomized Algorithms | Set 3 (1/2 Approximate Median)
Here, a Monte Carlo algorithm is discussed.
Problem Statement: Given an unsorted array A[] of n numbers and ε > 0, compute an element whose rank (position in sorted A[]) is in the range [(1 - ε)n/2, (1 + ε)n/2].
For the 1/2-approximate median algorithm, ε is 1/2, so the rank should be in the range [n/4, 3n/4].
We can find the k'th smallest element in O(n) expected time and O(n) worst-case time. What if we want it in less than O(n) time, with a low probability of error allowed?
The following steps describe an algorithm that runs in O((log n) × (log log n)) time and produces an incorrect result with probability less than or equal to 2/n².
1. Randomly choose k elements from the array, where k = c log n (c is some constant).
2. Insert them into a set.
3. Sort the elements of the set.
4. Return the median of the set, i.e., the (k/2)th element of the set (a Python sketch follows below).
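A minimal Python sketch of these steps, with assumed names and c = 10:

import random
import math

def approx_median(A, c=10):
    # Monte Carlo 1/2-approximate median: sample k = c log n elements,
    # sort the sample, and return its median
    n = len(A)
    if n == 0:
        return None
    k = min(n, max(1, int(c * math.log2(n))))
    sample = sorted(random.sample(A, k))   # steps 1-3: choose, collect, sort
    return sample[k // 2]                  # step 4: the (k/2)th element

A = list(range(1, 1001))
print(approx_median(A))  # with high probability, the rank lies in [n/4, 3n/4]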
Time Complexity:
We use a set provided by the STL in C++. In STL Set, insertion for each element takes O(log k). So
For k insertions, time taken is O (k log k).
Now replacing k with c log n
=>O(c log n (log (clog n))) =>O (log n (log log n))
How is the probability of error less than 2/n²?
The algorithm makes an error if the set S has at least k/2 elements from the left quarter or the right quarter.
It is quite easy to visualize this statement: the median we report is the (k/2)th element of the set, and if we take at least k/2 elements from the left quarter (or the right quarter), the reported median lies in the left quarter (or the right quarter).
An array can be divided into 4 quarters, each of size n/4, so P(selecting an element of the left quarter) is 1/4. So what is the probability that at least k/2 of the k sampled elements are from the left quarter? This probability problem is the same as the following: given a coin which gives HEADS with probability 1/4 and TAILS with probability 3/4, and which is tossed k times, what is the probability of getting at least k/2 HEADS? Using a Chernoff bound, this probability is at most (1/2)^(k/5).
Explanation:
If we put k = c log n for c = 10, we get
P ≤ (1/2)^(2 log n)
P ≤ (1/2)^(log n²)
P ≤ n^(-2)
P(selecting at least k/2 elements from the left quarter) ≤ 1/n²
P(selecting at least k/2 elements from the left or the right quarter) ≤ 2/n²
Therefore, the algorithm produces an incorrect result with probability less than or equal to 2/n².