
Linear Suffix Array Construction by Almost Pure Induced-Sorting

Ge Nong
Computer Science Department, Sun Yat-Sen University, Guangzhou 510275, P.R.C.

Sen Zhang
Dept. of Math., Comp. Sci. & Stat., SUNY College at Oneonta, NY 07104, U.S.A.

Wai Hong Chan
Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong

Abstract

We present a linear time and space suffix array (SA) construction algorithm called the SA-IS algorithm. The SA-IS algorithm is novel because of the LMS-substrings used for the problem reduction and the pure induced-sorting (specially coined for this algorithm) used to propagate the order of suffixes as well as that of LMS-substrings, which makes the algorithm rely almost purely on induced sorting at both of its crucial steps. The pure induced-sorting gives the algorithm an elegant design and, in turn, a surprisingly compact implementation of fewer than 100 lines of C code. The experimental results demonstrate that the newly proposed algorithm yields noticeably better time and space efficiencies than all the currently published linear time algorithms for SA construction.

1 Introduction

A suffix of a string is a substring ending at the last character of the string. A suffix array for a given string is an array of indexes pointing to all the suffixes of the string, sorted in ascending lexicographical order. Suffix arrays were first introduced by Manber and Myers in SODA'90 [5] as a space-efficient alternative to suffix trees. Since then, suffix arrays have been used as fundamental data structures in a broad spectrum of applications, e.g., data indexing, compressing, retrieving, storing and processing. Very recently, research on more time- and space-efficient suffix array construction algorithms (SACAs) has become an increasingly hot pursuit, due to the growing need to construct suffix arrays for large-scale applications, e.g., web searching and biological genome databases, where the size of the datasets is often measured in billions of characters [1, 2, 6, 7].

The fastest linear SACA among all the latest results obtained so far is the KA algorithm from Ko and Aluru [4]. The framework of the KA algorithm can be recapitulated as the following 3-step recursive process.

1) First, the input string is reduced into a smaller string capturing only S-substrings, using more compact code words for them, so that the original problem is divided into a reduced part and an unreduced part. This step is commonly known as substring naming for all the algorithms sharing the same recursive framework.
2) Then, the suffix array of the reduced problem is recursively computed. To be specific, if the suffix array of the reduced problem is not immediately obtainable, it triggers a further recursive call to solve the reduced problem. Otherwise, the recursion is terminated.
3) Finally, as the recursive calls return level by level, the order of the suffixes of the reduced problem is propagated to the suffix array of the unreduced problem, all the way back to the original problem.

The elegance of the KA algorithm lies in step 3, where a new concept called induced sorting was proposed to efficiently sort suffixes in unreduced problems at all levels of the recursive calls. The intuition behind induced sorting is that the significant redundancies buried in the suffixes, which nest one in another, can be used to propagate the order of certain representative suffixes to their left-hand longer neighbours. However, compared with its close competitor, the KS algorithm [3], which shares a similar recursive framework, the KA algorithm suffers from a relatively complicated step 1, where the selected varying-length S-substrings need to be sorted and renamed by their order indexes. This turns out to be the bottleneck of the KA algorithm (or of any algorithm relying on induced sorting at step 3). For this task, Ko and Aluru proposed to use S-distance lists, where each list contains all the characters with the same S-distance, to facilitate the sorting. Maintaining the S-distance lists, however, demands not only extra space but also an involved subroutine to process them, which is well evidenced by Ko's sample implementation of the KA algorithm with over 1000 lines of C, as opposed to only about 100 lines for the KS algorithm.

In this paper, we discuss a novel linear SACA called the SA-IS algorithm, which differs from the KA algorithm in two aspects. First, SA-IS samples the leftmost S-type (LMS) substrings (formally defined in Section 2) instead of S-substrings (due to the space limit, we refer readers to the seminal paper [4] for a more detailed discussion of S-substrings). Since LMS-substrings are likely longer blocks than S-substrings, our algorithm can reduce the problem faster than the KA algorithm. Second, SA-IS uses induced sorting not only in step 3 to induce the ordering of the unreduced problem, but also in step 1 to induce the ordering of the reduced problem. The new algorithm is presented and analyzed in Sections 2-3, followed by a running example in Section 4 and an experimental study for performance evaluation in Section 5. Finally, Section 6 gives the closing remarks.

Nong was partially supported by the National Science Foundation of P.R.C. (Grant No. 60873056). Zhang was partially supported by the SUNY College Oneonta W.B. Ford Grant. Chan was partially supported by the Faculty Research Grant (FRG/07-08/II-30), HKBU.

2 Our Solution

SA-IS(S, SA)
      S is the input string;
      SA is the output suffix array of S;
      t: array [0..n - 1] of boolean;
      S1: array [0..n1 - 1] of integer;
      P1: array [0..n1 - 1] of integer;
      B: array [0..|Σ(S)| - 1] of integer;
 1    Scan S once to classify all the characters as L- or S-type into t;
 2    Scan t once to find all the LMS-substrings in S into P1;
 3    Induced sort all the LMS-substrings using P1 and B;
 4    Name each LMS-substring in S by its bucket index to get a new shortened string S1;
 5    if each character in S1 is unique
 6       then
 7          Directly compute SA1 from S1;
 8       else
 9          SA-IS(S1, SA1);   (where the recursive call happens)
10    Induce SA from SA1;
11    return

Figure 1: The algorithm framework for induced sorting suffix array in linear time/space, where Σ(S) denotes the alphabet of S.
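The framework of Fig. 1 can be sketched in Python as follows. This is an illustrative, unoptimized sketch and not the authors' C implementation: the names `sais`, `induce`, `bucket_bounds`, `same_lms_substring` and `suffix_array` are hypothetical, the input is assumed to be a list of integers in [0, k) ending with a unique smallest sentinel, and Python lists stand in for the arrays t, P1, B and SA.

```python
def sais(s, k):
    """Suffix array of s, a list of integers in [0, k) whose last element
    is a unique, smallest sentinel (the role of '$' in the paper)."""
    n = len(s)
    if n == 1:
        return [0]
    # Line 1 of Fig. 1: t[i] is True for S-type, False for L-type.
    t = [True] * n
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])

    def is_lms(i):
        return i > 0 and t[i] and not t[i - 1]

    counts = [0] * k
    for c in s:
        counts[c] += 1

    def bucket_bounds(ends):
        # Head (first slot) or end (last slot) of every character bucket.
        out, total = [], 0
        for c in counts:
            out.append(total + c - 1 if ends else total)
            total += c
        return out

    def induce(ordered_lms):
        sa = [-1] * n
        end = bucket_bounds(True)
        for i in reversed(ordered_lms):      # step 1: seed the bucket ends
            sa[end[s[i]]] = i
            end[s[i]] -= 1
        head = bucket_bounds(False)
        for i in range(n):                   # step 2: induce L-type entries
            j = sa[i] - 1
            if j >= 0 and not t[j]:
                sa[head[s[j]]] = j
                head[s[j]] += 1
        end = bucket_bounds(True)
        for i in range(n - 1, -1, -1):       # step 3: induce S-type entries
            j = sa[i] - 1
            if j >= 0 and t[j]:
                sa[end[s[j]]] = j
                end[s[j]] -= 1
        return sa

    def same_lms_substring(a, b):
        if n - 1 in (a, b):                  # the sentinel is unique
            return False
        i = 0
        while True:
            if s[a + i] != s[b + i] or t[a + i] != t[b + i]:
                return False
            if i > 0 and is_lms(a + i):      # reached the closing LMS character
                return True
            i += 1

    # Lines 2-4: sort the LMS-substrings by induced sorting, then name them.
    lms = [i for i in range(1, n) if is_lms(i)]
    sa = induce(lms)
    names, name, prev = {}, -1, -1
    for pos in sa:
        if is_lms(pos):
            if prev < 0 or not same_lms_substring(prev, pos):
                name += 1
            names[pos], prev = name, pos
    s1 = [names[i] for i in lms]

    # Lines 5-9: solve the reduced problem directly or recursively.
    if name + 1 == len(s1):
        sa1 = [0] * len(s1)
        for i, c in enumerate(s1):
            sa1[c] = i
    else:
        sa1 = sais(s1, name + 1)

    # Line 10: induce SA from SA1.
    return induce([lms[i] for i in sa1])


def suffix_array(text):
    """Convenience wrapper for a str that already ends with '$'."""
    return sais([ord(ch) for ch in text], 256)
```

For example, `suffix_array("banana$")` can be compared against a naive `sorted(range(7), key=lambda i: "banana$"[i:])`. The wrapper assumes every character code is below 256 and that `$` is smaller than all other characters, which holds for lowercase ASCII text.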

2.1 Algorithm Framework

Our linear suffix array sorting algorithm SA-IS is outlined in Fig. 1. Lines 1-4 first produce the reduced problem, which is then solved recursively by Lines 5-9; finally, from the solution of the reduced problem, Line 10 induces the final solution for the original problem. The time and space bottleneck of this algorithm resides in reducing the problem in Lines 1-4.

2.2 Reducing the Problem

Let S be a string of n characters, represented by an array indexed by [0..n - 1]. For simplicity of presentation, S is assumed to be terminated by a sentinel $, which is the unique lexicographically smallest character in S. Let suf(S, i) be the suffix in S starting at S[i] and running to the sentinel. A suffix suf(S, i) is said to be S- or L-type if suf(S, i) < suf(S, i + 1) or suf(S, i) > suf(S, i + 1), respectively. The last suffix suf(S, n - 1), consisting of only the single character $ (the sentinel), is defined as S-type. Correspondingly, we classify a character S[i] as S- or L-type if suf(S, i) is S- or L-type, respectively. From the above definitions, we observe the following properties: (i) S[i] is S-type if (i.1) S[i] < S[i + 1] or (i.2) S[i] = S[i + 1] and suf(S, i + 1) is S-type; and (ii) S[i] is L-type if (ii.1) S[i] > S[i + 1] or (ii.2) S[i] = S[i + 1] and suf(S, i + 1) is L-type. These properties suggest that by scanning S once from right to left, we can determine the type of each character in O(1) time and fill out the type array t in O(n) time. Further, we introduce two new concepts, called LMS character and LMS-substring, as follows.

Definition 2.1. (LMS character) A character S[i], i ∈ [1, n - 1], is called leftmost S-type (LMS) if S[i] is S-type and S[i - 1] is L-type.

Definition 2.2. (LMS-substring) An LMS-substring is (i) a substring S[i..j] with both S[i] and S[j] being LMS characters, and there is no other LMS character in the substring, for i ≠ j; or (ii) the sentinel itself.
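The right-to-left classification scan and Definition 2.1 can be sketched in Python as follows. The function names `classify_types` and `lms_positions` are illustrative and not from the paper; the sketch assumes the input string already ends with the sentinel `$` and that `$` compares smaller than every other character.

```python
def classify_types(s):
    """Return a list of booleans: True where s[i] is S-type, False where L-type."""
    n = len(s)
    t = [False] * n
    t[n - 1] = True                      # the sentinel is defined as S-type
    for i in range(n - 2, -1, -1):       # one right-to-left scan, O(1) per character
        if s[i] < s[i + 1]:
            t[i] = True                  # property (i.1)
        elif s[i] == s[i + 1]:
            t[i] = t[i + 1]              # properties (i.2) and (ii.2)
        # otherwise s[i] > s[i + 1]: L-type by (ii.1), already False
    return t


def lms_positions(t):
    """Indexes i >= 1 where t[i] is S-type and t[i - 1] is L-type (Definition 2.1)."""
    return [i for i in range(1, len(t)) if t[i] and not t[i - 1]]


s = "mmiissiissiippii$"
t = classify_types(s)
print("".join("S" if is_s else "L" for is_s in t))   # LLSSLLSSLLSSLLLLS
print(lms_positions(t))                              # [2, 6, 10, 16]
```

The two printed lines agree with the type array and the LMS positions of the running example in Section 4.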
Intuitively, if we treat the LMS-substrings as basic blocks of the string, and if we can efficiently sort all the LMS-substrings, then we can use the order index of each LMS-substring as its name and replace all the LMS-substrings in S by their names. As a result, S can be represented by a shorter string, denoted by S1, so the problem size is reduced and the problem can be solved in a divide-and-conquer manner. Now, we define the order for any two LMS-substrings.

Definition 2.3. (Substring Order) To determine the order of any two LMS-substrings, we compare their corresponding characters from left to right: for each pair of characters, we compare their lexicographical values first, and next their types if the two characters are of the same lexicographical value, where the S-type is of higher priority than the L-type.

From this order definition for LMS-substrings, we see that two LMS-substrings can have the same order index, i.e. the same name, iff they are equal in terms of lengths, characters and types. When S[i] = S[j], we assign a higher priority to S-type because suf(S, i) > suf(S, j) if suf(S, i) is S-type and suf(S, j) is L-type. To sort all the LMS-substrings, we don't need extra physical space; instead, we simply maintain a pointer array P1, which contains the pointers for all the LMS-substrings in S. This can be done by scanning S (or t) once from right to left in O(n) time.

Definition 2.4. (Sample Pointer Array) P1 is an array containing the pointers for all the LMS-substrings in S, preserving their original positional order.
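Definitions 2.3 and 2.4 can be illustrated with a deliberately naive Python sketch. It builds P1, compares LMS-substrings as sequences of (character, type) pairs with S-type ranked above L-type, and names each substring by its rank. It uses an ordinary comparison sort, so it is not linear time and is not the induced-sorting method of the paper; the function name `lms_names` is hypothetical.

```python
def lms_names(s):
    """Naively build P1 and S1 for a sentinel-terminated string s."""
    n = len(s)
    t = [True] * n                                   # True = S-type, False = L-type
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])
    p1 = [i for i in range(1, n) if t[i] and not t[i - 1]]   # Definition 2.4

    def key(k):
        # The LMS-substring starting at p1[k] runs to the next LMS character;
        # the last pointer refers to the sentinel alone.
        start = p1[k]
        end = p1[k + 1] if k + 1 < len(p1) else start
        # Definition 2.3: compare characters first, then types (True > False,
        # so S-type ranks above L-type).
        return tuple((s[j], t[j]) for j in range(start, end + 1))

    keys = [key(k) for k in range(len(p1))]
    rank = {sub: r for r, sub in enumerate(sorted(set(keys)))}
    return p1, [rank[sub] for sub in keys]


p1, s1 = lms_names("mmiissiissiippii$")
print(p1)   # [2, 6, 10, 16]
print(s1)   # [2, 2, 1, 0]
```

The two identical blocks `iissi` starting at positions 2 and 6 receive the same name, which matches the shortened string S1 reported for the running example in Section 4.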

Suppose we have all the LMS-substrings sorted into buckets in their lexicographical order, where all the LMS-substrings in a bucket are identical. Then we name each item of P1 by the index of its bucket to produce a new string S1. Here, we say two equal-size substrings S[i..j] and S[i'..j'] are identical iff S[i + k] = S[i' + k] and t[i + k] = t[i' + k], for k ∈ [0, j - i]. We have the following observations on S1.

Lemma 2.1. (1/2 Reduction Ratio) The length of S1 is at most half that of S, i.e. n1 ≤ ⌊n/2⌋.
Proof. The first character in S must not be LMS, and no two consecutive characters in S are both LMS.

Lemma 2.2. (Sentinel) The last character of S1 is the unique smallest character in S1.
Proof. From the definition of LMS character, S[n - 1] must be an LMS character, and the LMS-substring starting at S[n - 1] must be the unique smallest among all those sampled by P1.

Lemma 2.3. (Coverage) For S1[i] = S1[j], there must be P1[i + 1] - P1[i] = P1[j + 1] - P1[j].
Proof. Given S1[i] = S1[j], from the definition of S1, there must be (1) S[P1[i]..P1[i + 1]] = S[P1[j]..P1[j + 1]] and (2) t[P1[i]..P1[i + 1]] = t[P1[j]..P1[j + 1]]. Hence, the two LMS-substrings in S starting at S[P1[i]] and S[P1[j]] must have the same length.

Lemma 2.4. (Order Preservation) The relative order of any two suffixes suf(S1, i) and suf(S1, j) in S1 is the same as that of suf(S, P1[i]) and suf(S, P1[j]) in S.
Proof. The proof considers the following two cases.
Case 1: S1[i] ≠ S1[j]. There must be a pair of characters in the two substrings of either different lexicographical values or different types. Given the former, the statement is obviously correct. For the latter, because we assume the S-type is of higher priority (see Definition 2.3), the statement is also correct.
Case 2: S1[i] = S1[j]. In this case, the order of suf(S1, i) and suf(S1, j) is determined by the order of suf(S1, i + 1) and suf(S1, j + 1).
The same argument can be recursively applied on S1[i + 1] = S1[j + 1], ..., S1[i + k - 1] = S1[j + k - 1] until S1[i + k] ≠ S1[j + k]. Because S1[i..i + k - 1] = S1[j..j + k - 1], from Lemma 2.3, P1[i + k] - P1[i] = P1[j + k] - P1[j], i.e. the lengths of the substrings S[P1[i]..P1[i + k]] and S[P1[j]..P1[j + k]] are the same. Thus, sorting S1[i..i + k] and S1[j..j + k] is equivalent to sorting S[P1[i]..P1[i + k]] and S[P1[j]..P1[j + k]]. Hence, the statement is correct.

This lemma suggests that to sort all the LMS suffixes in S, we can sort S1 instead. As S1 is at most half the length of S, the computation on S1 can be done in one half the complexity for S. Let SA and SA1 be the suffix arrays for S and S1, respectively. Assuming SA1 has been solved, we proceed to show how to induce SA from SA1 in linear time/space.

2.3 Inducing SA from SA1

As defined, SA maintains the indexes of all the suffixes of S according to their lexicographical order. Trivially, we can see that in SA, the pointers for all the suffixes starting with the same character must span consecutively. Let's call a subarray in SA for all the suffixes with the same first character a bucket. Further, there must be no tie between any two suffixes sharing the identical first character but of different types.

Therefore, in the same bucket, all the suffixes of the same type are clustered together, and the S-type suffixes are behind, i.e. to the right of, the L-type suffixes. Hence, each bucket can be further split into two sub-buckets, for the L- and S-type suffixes respectively. Further, when we say to put an item SA1[i] to its bucket in SA, it means that we put P1[SA1[i]] to the bucket in SA for the suffix suf(S, P1[SA1[i]]) in S. With these notations, we describe our algorithm for inducing SA from SA1 in linear time/space as below:

1. Find the end of each S-type bucket; put all the items of SA1 into their corresponding S-type buckets in SA, with their relative orders unchanged from those in SA1;
2. Find the head of each L-type bucket in SA; scan SA from head to end, and for each item SA[i], if S[SA[i] - 1] is L-type, put SA[i] - 1 to the current head of the L-type bucket for S[SA[i] - 1] and forward the current head one item to the right;
3. Find the end of each S-type bucket in SA; scan SA from end to head, and for each item SA[i], if S[SA[i] - 1] is S-type, put SA[i] - 1 to the current end of the S-type bucket for S[SA[i] - 1] and forward the current end one item to the left.

Obviously, each of the above steps can be done in linear time. We now consider the correctness of this inducing algorithm by investigating each of the three steps in reversed order. First, the correctness of step 3, which is about how to sort all the suffixes from the sorted L-type suffixes by induction, is endorsed by Lemma 3 established in [4].

Lemma 2.5. [4] Given all the L-type (or S-type) suffixes of S sorted, all the suffixes of S can be sorted in O(n) time.

In our context, Lemma 2.5 can be translated into the following statement.

Lemma 2.6. Given all the L-type suffixes of S sorted, all the suffixes of S can be sorted by step 3 in O(n) time.

From the above lemma, we have the below result to support the correctness of step 2.

Lemma 2.7.
Given all the LMS suffixes of S sorted, all the L-type suffixes of S can be sorted by step 2 in O(n) time.
Proof. From Lemma 2.5, if all the S-type suffixes have been sorted, we can sort all the (S- and L-type) suffixes by traversing SA once from left to right in O(n) time through induction. Notice that not every S-type suffix is useful for induced sorting the L-type suffixes; instead, an S-type suffix is useful only when it is also an LMS suffix. In other words, the correct order of all the LMS suffixes suffices to induce the order of all the L-type suffixes in O(n) time/space.

The first step is to put all the sorted LMS suffixes into their S-type buckets in SA, from ends to heads. Hence, from Lemma 2.7, step 2 will sort all the L-type suffixes correctly; and from Lemma 2.6, step 3 will sort all the suffixes from the sorted L-type suffixes.

2.4 Induced Sorting Substrings

In the above discussion, we have assumed that sorting the substrings is not an issue. However, it turns out that this is the step where the SA-IS algorithm further improves on the KA algorithm and others. As we have mentioned earlier, sorting the variable-size S- or L-substrings constitutes the bottleneck of the KA algorithm, and solving it demands the usage of an S-distance list structure and, correspondingly, an involved algorithm to process the data structure. However, surprisingly, we discovered that this seemingly difficult problem can be easily solved by using induced sorting too, i.e. the induced-sorting idea originally used in the KA algorithm for inducing the order of suffixes SA from SA1 can be extended to induce the order of substrings. Specifically, we only need to make a single change to the algorithm in Section 2.3 in order to efficiently sort all the variable-length LMS-substrings. This single change is to revise step 1 to:

1. Find the end of each S-type bucket; put all the LMS suffixes in S into their S-type buckets in SA according to their first characters.

To facilitate the following discussion, let's define an LMS-prefix pre(S, i) to be (1) a single LMS character; or (2) the prefix S[i..k] in suf(S, i) where S[k] is the first LMS character after S[i]. Similarly, we define an LMS-prefix pre(S, i) to be S- or L-type if suf(S, i) is S- or L-type, respectively. From this definition, any L-type LMS-prefix has at least two characters. We establish the following result for sorting all the non-size-one LMS-prefixes.

Theorem 2.1. The above modified induced sorting algorithm will correctly sort all the non-size-one LMS-prefixes and the sentinel.

Proof. Initially, in step 1, all the size-one S-type LMS-prefixes are put into their buckets in SA. Now, all the LMS-prefixes in SA are sorted in their order. We next prove, by induction, that step 2 will sort all the L-type LMS-prefixes. When we append the first L-type LMS-prefix to its bucket, it must be sorted correctly with all the existing S-type LMS-prefixes. We assume this step has correctly sorted k L-type LMS-prefixes, where k ≥ 1, and suppose to the contrary that when we append the (k + 1)th L-type LMS-prefix pre(S, i) to the current head of its bucket, there is already another greater L-type LMS-prefix pre(S, j) in front of (i.e. on the left-hand side of) pre(S, i).
In this case, we must have S[i] = S[j], pre(S, j + 1) > pre(S, i + 1), and pre(S, j + 1) in front of pre(S, i + 1) in SA. This implies that when we scanned SA from left to right, before appending pre(S, i) to its bucket, we must have seen the LMS-prefixes not sorted correctly. This contradicts our assumption. As a result, all the L-type and the size-one S-type LMS-prefixes are sorted in their correct order by this step.

Now we prove that step 3 will sort all the non-size-one LMS-prefixes; the argument is symmetric to that for step 2. When we append the first S-type LMS-prefix to its bucket, it must be sorted correctly with all the existing L-type LMS-prefixes. Notice that in step 1, all the size-one LMS-prefixes were put into the ends of their buckets. Hence, in this step, when we append a non-size-one S-type LMS-prefix to the current end of its bucket, it will overwrite the size-one LMS-prefix already there, if there is any. Assume this step has correctly sorted k S-type LMS-prefixes, for k ≥ 1, and suppose to the contrary that when we append the (k + 1)th S-type LMS-prefix pre(S, i) to the current end of its bucket, there is already another smaller S-type LMS-prefix pre(S, j) behind (i.e. on the right-hand side of) pre(S, i). In this case, we must have S[i] = S[j], pre(S, j + 1) < pre(S, i + 1), and pre(S, j + 1) behind pre(S, i + 1) in SA. This implies that when we scanned SA from right to left, before appending pre(S, i) to its bucket, we must have seen the non-size-one LMS-prefixes not sorted correctly. This contradicts our assumption. As a result, all the non-size-one LMS-prefixes and the sentinel (which is unique and was sorted into its bucket in step 1) are sorted in their correct order by this step.

From this theorem, we can immediately derive the following two results. (1) Any LMS-substring is also a non-size-one LMS-prefix or the sentinel; given all the non-size-one LMS-prefixes and the sentinel ordered, all the LMS-substrings are ordered too. (2) Any S-substring is a prefix of a non-size-one LMS-prefix or the sentinel; given all the LMS-prefixes and the sentinel ordered, all the S-substrings are ordered too. Hence, our algorithm for induced sorting all the non-size-one LMS-prefixes and the sentinel can be used for sorting the LMS-substrings in our SA-IS algorithm in Fig. 1, as well as for sorting the S- or L-substrings in the KA algorithm.

It is clear that the induced sorting algorithm is heavily used in both the renaming step and the propagating step, while only one round of bucket sorting is needed to usher in the rest of the induced sorting. That is the reason why we say the SA-IS algorithm almost purely relies on induced sorting.

3 Complexity Analysis

Theorem 3.1. (Time/Space Complexities) Given that S is of a constant or integer alphabet, the time and space complexities for the algorithm SA-IS in Fig. 1 to compute SA(S) are O(n) and O(n⌈log n⌉) bits, respectively.

Proof. Because the problem is reduced by at least 1/2 at each recursion, the time complexity is governed by the equation below, where the reduced problem is of size at most ⌊n/2⌋. The first O(n) in the equation accounts for reducing the problem and inducing the final solution from the reduced problem.

    T(n) = T(⌊n/2⌋) + O(n) = O(n)

The space complexity is dominated by the space needed to store the suffix array for the reduced problem at each iteration. Because the size of the suffix array at the first iteration is upper bounded by n⌈log n⌉ bits, and decreases by at least a half for each iteration thereafter, the space complexity is obviously O(n⌈log n⌉) bits.

To investigate the accurate space requirement, we show in Fig. 2 a space allocation scheme, where the worst-case space consumption at each level is proportional to the total length of the bars at this level, and the bars for different levels are arranged vertically. In this figure, we have not shown the spaces for the input string S and the type array t; the former is fixed for a given S and the latter varies from level to level. Let Si and ti denote the string and the type array at level i, respectively. If we keep ti throughout the lifetime of Si, i.e., ti is freed only when we return to the upper level i - 1, we need at most 2n bits for all the type arrays in the worst case. However, we can also free ti when we are going to level i + 1, and restore ti from Si when we return from level i + 1. In this way, we need at most n bits, reused for all the type arrays. Because the space consumed by the type arrays is negligible when compared with SA, it is omitted in the figure.
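The time recurrence in the proof of Theorem 3.1 can be unrolled explicitly. The derivation below is a sketch under the assumption that the work done at one level on an input of size m is at most cm for some constant c; the constant c is introduced here for illustration only and does not appear in the paper.

```latex
\begin{align*}
T(n) &\le T(\lfloor n/2 \rfloor) + cn \\
     &\le cn + \frac{cn}{2} + \frac{cn}{4} + \cdots + T(1) \\
     &\le cn \sum_{i \ge 0} 2^{-i} + O(1) \;=\; 2cn + O(1) \;=\; O(n).
\end{align*}
```

The same geometric series bounds the total size of the suffix arrays held across all levels when each level's array is at most half the size of the one above it.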

[Figure 2 diagram: horizontal bars for the arrays held at each recursion level. Level 0: SA0, B0. Level 1: S1, SA1, B1. Level 2: S1, S2, SA2, B2.]

Figure 2: The worst-case space requirement at each recursion level.

The space at each level consists of two components: SAi for the suffix array of Si, and Bi for the bucket array at level i. In the worst case, each array requires a space as large as Si (when the alphabet of Si is integer). For S with an integer alphabet, the peak space is at the top level. Meanwhile, for a constant alphabet of S, B0 and B1 are O(1) and O(n), respectively, so the maximum space is required by the 2nd level when n increases. Hence, the space requirement is as follows, where the type arrays occupy n bits in both cases.

Corollary 3.1. The worst-case working space requirements for SA-IS in Fig. 1 to compute the suffix array of S are: (1) 0.5n log n + n + O(1) bits, if the alphabet of S is constant; and (2) n log n + n + O(1) bits, if the alphabet of S is integer.

For the algorithm's space requirement in practice, we have the below probabilistic result.

Theorem 3.2. Given that the probabilities for each character to be S- or L-type are i.i.d. as 1/2, the mean size of a non-sentinel LMS-substring is 4, i.e., the reduction ratio is at most 1/3.

Proof. Let's consider a non-sentinel LMS-substring S[i..j], where i < j. By the definition, S[i] and S[j] are LMS characters, and there must be at least one L-type character S[k] in between them. Given the i.i.d. probability of 1/2 for each character to be S- or L-type, the mean number of L-type characters in between S[k] and S[j] is governed by a geometric distribution with a mean of 1. Hence, the mean size of S[i..j] is 4. Because all the LMS-substrings are located consecutively, the end of an LMS-substring is also the head of the succeeding LMS-substring. This implies that the mean size of a non-sentinel LMS-substring excluding its end is 3, resulting in a reduction ratio not greater than 1/3.

This theorem and Fig. 2 imply that if the probabilities for a character in S to be S- or L-type are equal and the alphabet of S is constant, the maximum space for our algorithm is attained at level 1, where |S1| ≤ n/3. Hence, the maximum working space is determined by the type arrays, which can be n + O(1) bits in the worst case. As we will see from the experiment, this theorem approximates the results on realistic data well.

4 A Running Example

We provide below a running example of the induced sorting algorithm on naming all the LMS-substrings of a sample string S = mmiissiissiippii$, where $ is the sentinel. First, we scan S from right to left to produce the type array t at line 2, and all the LMS suffixes in S are marked by "*" under t. Then, we continue to run the algorithm step by step:

Step 1: The LMS suffixes are 2, 6, 10 and 16. There are 5 buckets, namely $, i, m, p and s, for all the suffixes, marked by their first characters as shown in lines 5 and 6. We initialize SA by setting all its items to -1 and then scan S from left to right to put all the LMS suffixes into their buckets according to their first characters. Note that the sorting in this step is not required to be stable, so the suffixes belonging to a bucket can be put into the bucket in any order. In this example, we record the end of each bucket, and the LMS suffixes are put into the bucket from end to head. Hence, in the bucket for i, we put the suffixes first 2, next 6 and then 10. Now, SA contains all the size-one LMS-prefixes sorted.

Step 2: We first find and mark the head of each bucket with "^", and then start to scan SA from left to right, where the current item of SA being visited is marked by "@". When we are visiting SA[0] = 16 in line 9, we check the type array t to know that S[15] = i is L-type. Hence, suffix 15 is appended to the bucket for i, and the current head of the bucket is forwarded one step to the right. In line 13, the scanning reaches SA[2] = 14. As S[13] = p is L-type, we put suffix 13 into the bucket for p, and forward the bucket's head one step. Continuing to scan SA in this way, we get all the L-type LMS-prefixes sorted in SA as shown in line 28, where a "^" between two buckets means that the left bucket is full.

Step 3: In this step, we induced-sort all the non-size-one LMS-prefixes from the sorted L-type prefixes. We first mark the end of each bucket and then scan SA from right to left. At SA[16] = 4, we see that S[3] = i is S-type, and then put suffix 3 into the bucket for i and forward the bucket's current end one step to the left. When we visit the next item, i.e. SA[15] = 8, we see that S[7] = i is S-type, and then we put suffix 7 into the bucket for i and forward the bucket's current end one step to the left. Continuing to scan SA in this way, all the LMS-prefixes are sorted in their order as shown in line 43. Given that all the non-size-one LMS-prefixes and the sentinel are sorted in SA, we scan SA once from left to right to calculate the name for each LMS-substring. As a result, we get the shortened string S1 shown in line 45, where the names for the suffixes 2, 6, 10 and 16 are 2, 2, 1, 0, respectively.
00 Index:  00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
01 S:       m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
02 t:       L  L  S  S  L  L  S  S  L  L  S  S  L  L  L  L  S
03 LMS:           *           *           *                 *
04 Step 1:
05 Bucket:  $    i                         m       p       s
06 SA:     {16} {-1 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
07 Step 2:
08 SA:     {16} {-1 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
09          @^   ^                         ^       ^       ^
10         {16} {15 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
11          ^    @  ^                      ^       ^       ^
12         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
13          ^       @  ^                   ^       ^       ^
14         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {-1 -1 -1 -1}
15          ^          ^        @          ^          ^    ^
16         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {09 -1 -1 -1}
17          ^          ^           @       ^          ^       ^
18         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {09 05 -1 -1}
19          ^          ^              @    ^          ^          ^
20         {16} {15 14 -1 -1 -1 10 06 02} {01 -1} {13 -1} {09 05 -1 -1}
21          ^          ^                   @  ^       ^          ^
22         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 -1} {09 05 -1 -1}
23          ^          ^                         ^ @  ^          ^
24         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 -1 -1}
25          ^          ^                         ^       ^ @     ^
26         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 -1}
27          ^          ^                         ^       ^    @     ^
28         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 04}
29          ^          ^                         ^       ^             ^
30 Step 3:
31 SA:     {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 04}
32          ^                         ^       ^       ^             @^
33         {16} {15 14 -1 -1 -1 10 06 03} {01 00} {13 12} {09 05 08 04}
34          ^                      ^          ^       ^          @  ^
35         {16} {15 14 -1 -1 -1 10 07 03} {01 00} {13 12} {09 05 08 04}
36          ^                   ^             ^       @^            ^
37         {16} {15 14 -1 -1 -1 11 07 03} {01 00} {13 12} {09 05 08 04}
38          ^                ^        @       ^       ^             ^
39         {16} {15 14 -1 -1 02 11 07 03} {01 00} {13 12} {09 05 08 04}
40          ^             ^        @          ^       ^             ^
41         {16} {15 14 -1 06 02 11 07 03} {01 00} {13 12} {09 05 08 04}
42          ^          ^        @             ^       ^             ^
43         {16} {15 14 10 06 02 11 07 03} {01 00} {13 12} {09 05 08 04}
44          ^       ^                         ^       ^             ^
45 S1:     2 2 1 0
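The three steps traced above can be replayed with a short Python sketch that records SA after each step. It is an illustration rather than the authors' code: the function name `replay_induced_sort` is hypothetical, bucket boundaries are recomputed with `str.count` for brevity (which is not linear time), and step 1 places the LMS suffixes in left-to-right order from each bucket's end, as in the example.

```python
def replay_induced_sort(s):
    """Return SA after steps 1, 2 and 3 for a sentinel-terminated string s."""
    n = len(s)
    t = [True] * n                                   # True = S-type, False = L-type
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])

    def bounds(use_end):
        # First (head) or last (end) slot of each character's bucket.
        out, total = {}, 0
        for c in sorted(set(s)):
            size = s.count(c)
            out[c] = total + size - 1 if use_end else total
            total += size
        return out

    sa = [-1] * n
    end = bounds(True)
    for i in range(1, n):                            # step 1: place LMS suffixes
        if t[i] and not t[i - 1]:
            sa[end[s[i]]] = i
            end[s[i]] -= 1
    step1 = sa[:]

    head = bounds(False)
    for i in range(n):                               # step 2: left-to-right scan
        j = sa[i] - 1
        if j >= 0 and not t[j]:
            sa[head[s[j]]] = j
            head[s[j]] += 1
    step2 = sa[:]

    end = bounds(True)
    for i in range(n - 1, -1, -1):                   # step 3: right-to-left scan
        j = sa[i] - 1
        if j >= 0 and t[j]:
            sa[end[s[j]]] = j
            end[s[j]] -= 1
    return step1, step2, sa


for snapshot in replay_induced_sort("mmiissiissiippii$"):
    print(snapshot)
```

The three printed lists correspond to lines 6, 28 and 43 of the trace, with bucket braces omitted.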

5 Experiments

All the datasets were downloaded from the Manzini-Ferragina [6] corpora, the ad hoc benchmark repository for suffix array construction and compression algorithms. The experiments were performed on Linux (Sabayon Linux distribution) with an AMD Athlon(tm) 64x2 Dual Core Processor 4200+ at 2.20GHz and 2.00GB of RAM. All the programs were implemented in C/C++ and compiled by g++ with the -O3 option.

The performance measurements are the costs of time and space, the recursion depth and the mean reduction ratio. The time for each algorithm is the mean of 3 runs, and the space is the heap peak measured by using the memusage command to fire the running of each program. The total time (in seconds) and space (in million bytes, MBytes) for each algorithm are the sums of the times and spaces used to run the algorithm on all the input data, respectively. The mean time (in seconds per MByte) and space (in bytes per character) for each algorithm are the total time and space divided by the total number of characters in all input data.

Table 1 shows the results of the experiments. For convenience of comparison, we normalize all the results by those of IS. To understand the differences in time and space between the two algorithms, we also record and compare their recursion depths (the number of iterations) and mean reduction ratios (the mean of the reduction ratios over all iterations). Intuitively, the smaller the reduction ratio, the fewer the recursions, and the faster and better the algorithm. Our IS algorithm matches or outperforms KA on every measurement for every dataset.

Table 1: Experimental Results: Time, Space, Recursion Depth and Ratio
Data      |Σ|   Characters  Description              Time (Seconds)  Space (MBytes)    Depth       Ratio
                                                         IS      KA       IS       KA    IS    KA    IS    KA
world      94    2 473 400  CIA world fact book         1.3     1.9    12.70    21.24     6     6   .32   .42
bible      63    4 047 392  King James Bible            2.7    3.51    20.86    34.45     6     6   .34   .45
chr22       4   34 553 758  Human chromosome 22        24.7   33.41   178.09   289.97     6     8   .31   .43
E.coli      4    4 638 690  Escherichia coli genome     2.8    3.98    24.29    40.01     7     8   .32   .42
sprot      66  109 617 186  SwissProt V34 database     94.6  132.89   554.58   930.06     7     9   .31   .45
etext     146  105 277 340  Project Gutenberg           101  149.67   542.17   907.34    11    12   .33   .45
howto     197   39 422 105  Linux Howto files          30.4   42.85   203.16   331.54     9    13   .32   .45
Total          300 029 871                            257.5  368.21  1535.85  2554.61    52    62  2.25  3.07
Mean                                                   0.86    1.23     5.12     8.51  7.43  8.86   .32   .44
Norm.                                                     1    1.43        1     1.66     1  1.19     1  1.38
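As a worked check of how the Mean and Norm. rows follow from the Total row, the short Python sketch below recomputes them for time, space and depth. The helper name `mean_and_norm` is hypothetical, and the sketch assumes, as stated above, that one MByte is one million bytes and that means are taken per million characters (time, space) or per dataset (depth).

```python
TOTAL_CHARS = 300_029_871   # "Characters" entry of the Total row
NUM_DATASETS = 7            # world, bible, chr22, E.coli, sprot, etext, howto


def mean_and_norm(total_is, total_ka, denom):
    """Return (mean IS, mean KA, KA normalized by IS), rounded to 2 decimals."""
    mean_is = total_is / denom
    mean_ka = total_ka / denom
    return round(mean_is, 2), round(mean_ka, 2), round(mean_ka / mean_is, 2)


# Seconds per million characters.
time_row = mean_and_norm(257.5, 368.21, TOTAL_CHARS / 1e6)
# Million bytes per million characters, i.e. bytes per character.
space_row = mean_and_norm(1535.85, 2554.61, TOTAL_CHARS / 1e6)
# Recursion depth averaged over the datasets.
depth_row = mean_and_norm(52, 62, NUM_DATASETS)

print(time_row)   # (0.86, 1.23, 1.43)
print(space_row)  # (5.12, 8.51, 1.66)
print(depth_row)  # (7.43, 8.86, 1.19)
```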

In this experiment, the implementation of IS keeps the type array t_i throughout the lifetime of S_i at level i, which may use up to 2n bits in the worst case, i.e. 0.25 bytes per character. As a result, the mean space is 5.12 bytes per character, as shown in Table 1. This is close to the minimum space for suffix array construction (at least 5n bytes [6]), which leaves a negligible margin for further improvement.

6 Closing Remarks

For more details of the SA-IS algorithm, our technical report "Two Efficient Algorithms for Linear Suffix Array Construction" is available at http://www.cs.sysu.edu.cn/nong/. It is also worth noting that our SA-IS algorithm has been adopted in two other projects: 1) the Burrows-Wheeler Alignment Tool (BWA) at http://maq.sourceforge.net/bwa-man.shtml, where IS is the default algorithm for constructing the BWT index; and 2) In-place Update of Suffix Array while Recoding Words (http://www.irisa.fr/symbiose/mgalle/suffix_array_update) by the Symbiose Project Team at INRIA/Irisa. Finally, we thank Mori, who as an independent party re-implemented our SA-IS algorithm and performed more extensive experiments. His experimental results, published at http://yuta.256.googlepages.com/sais, confirm that the SA-IS algorithm is currently the most time and space efficient algorithm among all the published linear algorithms for SA construction.

References
[1] Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter. Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In Proceedings DCC 2008 Data Compression Conference, pages 252-261, Snowbird, UT, USA, March 2008.
[2] R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of STOC'00, pages 397-406, 2000.
[3] J. Kärkkäinen, P. Sanders, and S. Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918-936, November 2006.
[4] P. Ko and S. Aluru. Space efficient linear time construction of suffix arrays. In Proceedings 14th CPM, LNCS 2676, Springer-Verlag, pages 200-210, 2003.
[5] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the first ACM-SIAM SODA, pages 319-327, 1990.
[6] G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40(1):33-50, September 2004.
[7] S. Zhang and G. Nong. Fast and space efficient linear suffix array construction. In Proceedings DCC 2008 Data Compression Conference, page 553, Snowbird, UT, USA, March 2008.
