
ECS 224 Homework 2

Tal Levy October 18, 2011

Problem 1.
The use of the suffix tree in the algorithm discussed was to traverse the suffixes of the codewords in such a way that one can take note of a suffix that is a prefix of a codeword, or of a codeword that is a prefix of a suffix. A Generalized Suffix Tree (GST) was used in the discussion.

Claim: a generalized suffix array (GSA) and an LCP array are sufficient to construct the L1 and L2 lists, which the algorithm then processes to extract the type 1 and type 2 edges in the Unique Decipherability algorithm.

Given codewords C = {c1, c2, ..., ck}, we construct a generalized suffix array by simply constructing a suffix array, but placing $ delimiters between codewords to distinguish which suffix belongs to which codeword. Using the suffix array, we apply the LCP-building algorithm to construct the LCP array. Now we can mimic the effect of a DFS traversal through a suffix tree by traversing the LCP array and the GSA in a linear fashion, because the elements of the GSA are the leaves sorted in the order of a lexicographic DFS traversal through the GST. There are two traversals to do: one for the L1 rules and one for the L2 rules.

L1: The original algorithm looks for nodes with edges labeled $, signifying the end of a suffix; this is equivalent to finding a consecutive pair of suffixes in the GSA with LCP value greater than 0. The first suffix in this computation is the leaf (i, j) the algorithm discusses, and you simply push it onto a stack for processing. When you reach a whole codeword, you look at what is on the stack and add the codeword to the L1 list for that suffix element, just as the algorithm describes. You can do this without worrying about location, because the suffix array is already sorted as a specific DFS of the suffix tree. When you reach an LCP value of 0, you know you are starting a new branch and can pop the previous pointers from the stack, because they are no longer valid prefixes of the suffixes to come.

L2: Just as for L1 list building, you now stack suffixes that represent whole codewords, of the form (k, 1), where k is a codeword. When you reach a suffix in the array (by linear traversal) of the form (i, j), add k to the L2 list of (i, j). This indicates that the codeword k is a prefix of the suffix spelled out by the path to the leaf with the given label.

Note: you can tell where you are in the tree by observing the LCP values. A value of 0 means a new, separate branch; a value greater than 0 means the same branch, where the first suffix contributing to the value sits at a smaller depth in the tree (it has an edge labeled $ hanging off it). Now the lists of the algorithm are constructed, and you can complete the algorithm as specified by processing the L1 and L2 lists. The suffix tree and its node pointers were used only to construct the two lists; we have done the same using the suffix array and its properties.
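As a concrete illustration, here is a minimal Python sketch of the scan described above (not the original code: the function name, 0-based positions within codewords, and the naive GSA construction by direct sorting are assumptions made for readability; the linear-time bounds discussed above would instead use linear-time GSA/LCP construction and the precomputed LCP array).

def build_L1_L2(codewords):
    # Entries: (suffix text, codeword id, start position), 0-based positions;
    # position 0 marks a whole codeword, positions > 0 mark proper suffixes.
    entries = []
    for k, c in enumerate(codewords):
        for j in range(len(c)):
            entries.append((c[j:], k, j))
    entries.sort()                       # generalized suffix array (naive sort for clarity)

    def lcp(a, b):                       # LCP of adjacent entries, computed directly here
        l = 0
        while l < len(a) and l < len(b) and a[l] == b[l]:
            l += 1
        return l

    L1 = {}      # L1[(i, j)]: codewords that proper suffix j of codeword i is a prefix of
    L2 = {}      # L2[(i, j)]: codewords that are a prefix of proper suffix j of codeword i
    stack = []   # entries known to be prefixes of every suffix still to come on this branch
    prev = None
    for s, k, j in entries:
        if prev is not None:
            l = lcp(prev, s)
            # pop entries that are no longer prefixes of the suffixes to come
            while stack and len(stack[-1][0]) > l:
                stack.pop()
        for ps, pk, pj in stack:         # every stacked entry is a prefix of the current one
            if pj == 0 and j > 0:        # whole codeword pk is a prefix of suffix (k, j)
                L2.setdefault((k, j), []).append(pk)
            elif pj > 0 and j == 0:      # proper suffix (pk, pj) is a prefix of codeword k
                L1.setdefault((pk, pj), []).append(k)
        stack.append((s, k, j))
        prev = s
    return L1, L2

For example, build_L1_L2(["ab", "b", "ba"]) records that the suffix "b" of "ab" is a prefix of the codewords "b" and "ba", and that the suffix "a" of "ba" is a prefix of "ab".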

Problem 2.
Construct a suffix tree T for the message M. For each of the n codewords c in C, do a pattern match for c in T. Each codeword search through T can be done in O(|c| + k) time, where k is the number of occurrences of the codeword c in M. Running through all of the codewords therefore takes O(m + n|M|) time, where m is the total length of all the codewords being searched, because the number of occurrences of each codeword is bounded by the length of M. The pattern match itself is simply a walk down the suffix tree until the codeword is completely matched; if a mismatch occurs, then no occurrence of the pattern exists in M. Now that all occurrences and their locations in M are accounted for, one can apply the greedy algorithm for the interval scheduling problem. We essentially have one interval (the message M) and sub-intervals containing the matched codewords found in M. The reason we have this complexity is that if a codeword i is a substring of codeword j, a single greedy pass over M is complicated by the existence of multiple possible parses. The interval scheduling problem can be solved in O(|M| log |M|) time; the log factor appears because the greedy algorithm must sort the codeword matches by their index in M, where matches near the front of M come before matches with larger starting indices, and there are at most |M| codeword matches to fit uniquely in the message M. We can speed this up by using bucket sort to order the matches by index, reducing this step to O(|M|) and leaving a total time of O(m + (n + 1)|M|) = O(m + n|M|) to find the unique sequence of codewords in M.
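A small sketch of the bucket-sort speed-up mentioned above (the helper name and the assumption that matches have already been collected as (start index, codeword id) pairs from the suffix-tree searches are mine):

def order_matches_by_start(matches, message_len):
    # Counting/bucket sort of codeword occurrences by starting index in M:
    # O(|M| + number of matches) instead of a comparison sort's O(|M| log |M|).
    buckets = [[] for _ in range(message_len)]
    for start, codeword_id in matches:
        buckets[start].append((start, codeword_id))
    # Concatenate the buckets in order of increasing start index.
    return [m for bucket in buckets for m in bucket]

The greedy pass over M then consumes this ordered list from left to right.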

Problem 3.
3(a)
Just as the KS algorithm looks at the first 3 characters of each suffix and splits the suffixes into those with index 1, 2 mod 3 and those with index 0 mod 3, one can do an analogous split with quadruples: pad the string S to a length that is a multiple of four and split the suffixes into those with index 1, 2, 3 mod 4 and those with index 0 mod 4. We can do this by the following steps (similar to KS):

1. Recursively sort the (3/4)n suffixes suf_i with i mod 4 ≠ 0.
This is done by the same method described in the KS algorithm, just with quadruples instead of triples.

2. Sort the (1/4)n suffixes suf_i with i mod 4 = 0 using the result of step (1).
This can be done just as in KS, by performing a radix sort on the tuples (s[i], rank(suf_{i+1})), where rank is the rank of the suffix obtained in step 1.

3. Merge the two sorted arrays.
Just as with KS, we now have all suffixes with i mod 4 ≠ 0 compared, so we can merge in linear time by comparing the first character of suf_i (i mod 4 = 1 or 2) with that of suf_j (j mod 4 = 0), i.e. s[i] against s[j]. If they are unequal, their ordering is clear; otherwise we move to the suffixes starting at the next positions, suf_{i+1} and suf_{j+1}, whose relative order is already known from step 1 (both indices are nonzero mod 4). If instead we compare suf_i with i mod 4 = 3 against suf_j with j mod 4 = 0, we have to compare the first two characters and then rely on the comparison made in step 1 (between suf_{i+2} and suf_{j+2}). With this, just as in the KS algorithm, we can merge in linear time, where each comparison takes constant time; a sketch of this comparison appears below.
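A minimal Python sketch of the constant-time comparison used in step 3 (the function name, the rank array from step 1, and end-of-string padding are assumptions):

def merge_compare(i, j, s, rank):
    # Compare a sample suffix suf_i (i mod 4 != 0) with a non-sample suffix
    # suf_j (j mod 4 == 0). `rank` holds the step-1 ranks of all sample
    # suffixes; positions past the end of s are assumed padded so that
    # s[i + 1], s[i + 2] and the ranks below are always defined.
    # Returns True iff suffix i precedes suffix j.
    if i % 4 in (1, 2):
        # i + 1 and j + 1 are both sample suffixes: one character, then ranks.
        return (s[i], rank[i + 1]) < (s[j], rank[j + 1])
    else:  # i % 4 == 3
        # i + 2 and j + 2 are both sample suffixes: two characters, then ranks.
        return (s[i], s[i + 1], rank[i + 2]) < (s[j], s[j + 1], rank[j + 2])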

With the steps combined, we see that the running time of this new algorithm, like that of the KS algorithm, satisfies the recurrence T(n) = T(3n/4) + O(n), since each step other than the recursive call takes linear time.
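As a quick worked check, unrolling this recurrence gives T(n) <= cn + (3/4)cn + (3/4)^2 cn + ... = cn / (1 - 3/4) = 4cn = O(n), whereas the KS recurrence T(n) = T(2n/3) + O(n) sums to cn / (1 - 2/3) = 3cn. Both are linear, but the mod-4 split leaves a larger constant.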

3(b)
Given the analysis done in part (a), the KS algorithm is superior in running time because of the deterioration of the recursion in step 1: the mod-4 sample shrinks the subproblem to 3/4 of the input rather than 2/3, so the recursion decays more slowly. The splitting that the KS algorithm does is superior to the other variants of their algorithm with different splitting constants.

Problem 4.
4(a)
The original algorithm computes the h(w) values by counting how many times the lca() computation lands at each node w. We can replace this step by using the LCP array of the suffix tree/array. For each list Li, look up the LCP value for each pair of consecutive elements in the list. These lists are ordered by how they are collected, so the relevant LCP value can be read from the array. With this value you can descend the tree to the node whose string depth equals the LCP of the two elements, and increment h(w) by one, where w is the node reached this way, because the LCP value between two suffixes is essentially an equivalent representation of their LCA.
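A minimal sketch of the fact being used (the helper name and the convention lcp[t] = LCP(SA[t-1], SA[t]) are assumptions): the string depth of lca(leaf_i, leaf_j) equals the minimum LCP value over the range between the two suffixes' ranks in the suffix array, so each lca() call can be answered from the LCP array (in O(1) after range-minimum preprocessing; a plain scan is shown here).

def lca_string_depth(rank_i, rank_j, lcp):
    # String depth (path-label length) of the lowest common ancestor of the
    # leaves for two distinct suffixes, given their ranks in the suffix array
    # and the LCP array with lcp[t] = LCP(SA[t - 1], SA[t]).
    lo, hi = min(rank_i, rank_j), max(rank_i, rank_j)
    return min(lcp[lo + 1 : hi + 1])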

4(b)

Problem 5.
To solve the longest common substring problem, you can simply construct a generalized suffix tree (GST) for the pair of strings S and S'. The GST is constructed so that identical suffixes from both strings (those with the same path label) end at the same leaf. A solution can be found by traversing the tree in DFS order while keeping track of the string depth of each node: whenever an internal node has leaf descendants carrying suffixes from both strings, mark the node and record its depth; its path label is a prefix of suffixes of both strings, i.e. a substring shared by S and S'. The largest string depth among these marked nodes is the length of the longest common substring, and each marked node of this depth spells out a longest common substring of S and S'. The running time of this algorithm is linear in the traversal of the GST, O(|S| + |S'|).

Because all the children of such a node (one whose path label is a longest common substring) are leaves labeled by suffixes from both strings, some consecutive pair of suffixes in a generalized suffix array (GSA), one from each string, will share that substring as a common prefix. The length of this common prefix can be read from the LCP array of the GSA: when two consecutive suffixes come from different strings, the LCP value is exactly the length of their longest common prefix, which is a common substring of S and S'. To find all of the longest common substrings, one can simply do a linear scan through the LCP array looking for the largest value attained between two consecutive suffixes coming from different strings. The starting positions of these suffixes reveal where the common substring occurs in each string, and the LCP value gives its length. Just like the tree-traversal variant of this problem, this algorithm runs in O(|S| + |S'|) time.
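A minimal Python sketch of the array-based variant just described (the function name, the separator characters, and the naive suffix-array construction by direct sorting are simplifications for readability; the linear bound above assumes linear-time SA/LCP construction):

def longest_common_substring(s, t):
    text = s + "\x01" + t + "\x02"      # assumed: separators not occurring in s or t
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # generalized suffix array (naive)

    def owner(i):                        # which input string a suffix starts in
        return 0 if i < len(s) else 1

    best_len, best_pos = 0, 0
    for r in range(1, n):
        i, j = sa[r - 1], sa[r]
        if owner(i) == owner(j):
            continue                     # only pairs with one suffix from each string matter
        # LCP of the two consecutive suffixes (computed directly in this sketch)
        l = 0
        while i + l < n and j + l < n and text[i + l] == text[j + l]:
            l += 1
        if l > best_len:
            best_len, best_pos = l, min(i, j)
    return text[best_pos:best_pos + best_len]

For example, longest_common_substring("xabcy", "zabcw") returns "abc".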

Problem 6.
Each suffix in the array is lexicographically smaller than the next, so you can construct the tree by adding the suffixes one by one in the order given by the suffix array. Initially, add the first (lexicographically smallest) suffix, which for a $-terminated string is the suffix consisting of $ alone. For each subsequent suffix, walk back up along the rightmost path of the tree using the LCP value between it and the previously inserted suffix: the LCP value tells you the string depth at which the new suffix must branch off, possibly splitting an edge at that depth, and the new leaf is attached at that point. Each such move is simple arithmetic on string depths, and each node is walked over only a constant number of times before it leaves the rightmost path for good, because the suffixes are appended in sorted order. So the construction of the tree, with the LCP array guiding the positioning, can be done in linear time. A sketch of this construction appears below.
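A minimal Python sketch of this construction (the node representation, 0-based indexing, and the convention lcp[r] = LCP(SA[r-1], SA[r]) are assumptions made for readability):

class Node:
    def __init__(self, depth):
        self.depth = depth        # string depth: length of the path label
        self.children = []        # child nodes, left to right
        self.suffix = None        # starting index of the suffix, for leaves

def suffix_tree_from_sa(s, sa, lcp):
    # s is assumed to end in a unique terminal symbol such as '$'.
    # sa[r] is the start of the r-th smallest suffix; lcp[r] = LCP(s[sa[r-1]:], s[sa[r]:]).
    n = len(s)
    root = Node(0)
    leaf = Node(n - sa[0]); leaf.suffix = sa[0]
    root.children.append(leaf)
    stack = [root, leaf]          # the rightmost path, ordered by increasing depth
    for r in range(1, n):
        d = lcp[r]
        last = None
        # Climb the rightmost path until its deepest node has depth <= d.
        while stack[-1].depth > d:
            last = stack.pop()
        if stack[-1].depth < d:
            # Split: insert an internal node of depth d between stack[-1] and last.
            mid = Node(d)
            stack[-1].children.pop()          # detach last (always the rightmost child) ...
            stack[-1].children.append(mid)    # ... and hang the new node in its place
            mid.children.append(last)
            stack.append(mid)
        # Attach the new leaf for suffix sa[r] and extend the rightmost path.
        leaf = Node(n - sa[r]); leaf.suffix = sa[r]
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root

Edge labels stay implicit: the edge into a node of depth d2 from a parent of depth d1 spells s[x + d1 : x + d2] for the starting index x of any leaf below it.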
