Common Sub Strings
Common Sub Strings
strings
One of the most important questions asked about a set of strings is what substrings
are common to a large number of the distinct strings. This is in contrast to the
important problem of finding substrings that occur repeatedly in a single string.
In biological strings (DNA, RNA or protein) the problem of finding substrings
common to a large number of distinct strings arises in many different contexts. In
fact, the task of finding (inexactly matching) substrings that appear frequently in
a set of strings is almost an industry in some subareas of molecular biology. We
will say much more about this when we discuss database searching in Chapter ??
and multiple string comparison in Chapter ??. Most directly, the problem of finding
common substrings arises because mutations that occur in DNA after two species
diverge will more rapidly change those parts of the DNA or protein that are less
functionally important. The parts of the DNA or protein that are critical for the
correct functioning of the molecule will be more highly conserved, because mutations
that occur in those regions will more likely be lethal. So finding DNA or protein
substrings that occur commonly in a wide range of species helps point to regions or
characters that may be critical for the function or structure of the biological string.
Less directly, the problem of finding (exactly matching) common substrings in
a set of distinct strings arises as a subproblem of many heuristics developed in the
biological literature to align a set of strings. That problem, called multiple alignment,
will be discussed in some detail in Section ??.
The biological applications motivate the following exact matching problem: Given
a set of strings, find substrings “common” to a large number of those strings. The
word “common” here means “occurring with equality”. A more difficult problem is
to find “similar” substrings in many given strings, where “similar” allows a small
number of differences. Problems of this type will be discussed in Part III.
1
--------------------------------------
2 4 sand
3 3 and
4 3 and
5 2 an
Surprisingly, the problem can be solved in linear, O(n), time [1]. It really is
amazing that so much information about the contents and substructure of the strings
can be extracted in time proportional to the time needed just to read in the strings.
The linear time algorithm will be fully discussed in Chapter ?? after the constant
time lowest common ancestor method has been discussed.
To prepare for the O(n) result, we show here how to solve the problem in O(Kn)
time. That time bound is also non-trivial, but is achieved by a generalization of the
longest common substring method for two strings. First, build a generalized suffix tree
T for the K strings. Each leaf of the tree represents a suffix from one of the K strings,
and is marked with one of K unique string identifiers, 1 to K, to indicate which string
the suffix is from. Each of the K strings is given a distinct termination symbol, so
that identical suffixes appearing in more than one string end at distinct leaves in the
generalized suffix tree. Hence, each leaf in T has only one string identifier.
Definition For every internal node v of T , define C(v) to be the number of
distinct string identifiers that appear at the leaves in the subtree of v.
Once the C(v) numbers are known, and the string-depth of every node is known,
the desired l(k) values can be easily accumulated with a linear time traversal of the
tree. That traversal builds a vector V where, for each value of k from 2 to K, V (k)
holds the string-depth (and location if desired) of the deepest (string-depth) node v
encountered with C(v) = k. (When encountering a node v with C(v) = k compare
the string-depth of v to the current value of V (k) and if v’s depth is greater than
V (k), change V (k) to the depth of v.) Essentially, V (k) reports the length of the
longest string that occurs exactly k times. Therefore V (k) ≤ l(k). To find l(k) simply
scan V from largest to smallest index writing into each position the maximum V (k)
value seen. That is, if V (k) is empty or V (k) < V (k + 1) then set V (k) to V (k + 1).
The resulting vector holds the desired l(k) values.
2
below any node. For each internal node v, a K-length bit-vector is created which
has a 1 in bit i if there is a leaf with identifier i in the subtree of v. Then C(v) is
just the number of 1-bits in that vector. The vector for v is obtained by OR’ing the
vectors of the children of v. For l children, this takes lK time. Therefore over the
entire tree, since there are O(n) edges, the time needed to build the entire table is
O(Kn). We will return to this problem in Section 2 where an O(n) time solution will
be presented.
We show below that all the correction factors for all internal nodes can be com-
puted in O(n) total time. That then gives an O(n) time solution to the k-common
substring problem.
1
In the introduction of an earlier unpublished manuscript [3], Pratt claims a linear time solution
to the problem but the claim doesn’t specify whether the problem is for a fixed k or for all values
of k. The section where the details were to be presented is not available and was apparently never
finished [2].
3
2.1 The method
The algorithm first does a depth-first traversal of T , numbering the leaves in the
order that they are encountered. That numbering has the familiar property that for
any internal node v, the numbers given to the leaves in the subtree rooted at v are
consecutive, i.e., they form a consecutive interval.
For purposes of the exposition, let us focus on the single identifier i and show
how to compute ni (v) − 1 for each internal node v. Let Li be the list of leaves with
identifier i, in increasing order of their dfs numbers. For example, in Figure 1, the
leaves with identifier i are shown boxed and the corresponding Li is 1, 3, 6, 8, 10. By
the properties of depth-first numbering, for the subtree rooted at any internal node
v, all the ni (v) leaves with identifier i occur in a consecutive interval of list Li . Call
that interval Li (v). If x and y are any two leaves in Li (v), then the lca of x and y
is a node in the subtree of v. So if we compute the lca for each consecutive pair of
leaves in Li (v), then all of the ni (v) − 1 computed lca’s will be found in the subtree
of v. Further, if x and y are not both in the subtree of v, then the lca of x and y will
not be a node in v’s subtree. This leads to the following lemma and method.
Lemma 2.2 If we compute the lca for each consecutive pair of leaves in Li , then for
any node v, exactly ni (v) − 1 of the computed lca’s will lie in the subtree of v.
Lemma 2.2 is illustrated in Figure 1.
lca(3,6)
lca(1,3) lca(8,10)
lca(6,8)
3 4
2 8 9 10
1
5 6 7
Figure 1: The boxed leaves have identifier i. The circled internal nodes are the lowest
common ancestors of the four adjacent pairs of leaves from list Li .
Given the lemma, we can compute ni (v) − 1 for each node v by the following
method. Compute the lca of each consecutive pair of leaves in Li , and accumulate for
4
each node w a count of the number of times that w is the computed lca. Let h(w)
P
denote that count for node w. Then for any node v, ni (v) − 1 is exactly [h(w) : w
is in the subtree of v]. A standard O(n)-time bottom-up traversal of T can therefore
be used to find ni (v) − 1 for each node v.
P
To find U(v), we don’t want ni (v) − 1 but rather i [ni (v) − 1]. But the algorithm
must not do a separate bottom-up traversal for each identifier, since then the time
bound would then be O(Kn). Instead, the algorithm should defer the bottom-up
traversal until each list Li has been processed, and it should let h(w) count the total
number of times that w is the computed lca over all of the lists. Only then is a single
P P
bottom-up traversal of T done. At that point, U(v) = i:ni>0 [ni (v)−1] = [h(w) : w
is in the subtree of v].
We can now summarize the entire O(n) method for solving the k-common sub-
string problem.
Begin
1. Build a generalized suffix tree T for the K strings.
2. Number the leaves of T as they are encountered in a depth-first traversal of T .
3. For each string identifier i, extract the ordered list Li of leaves with identifier
i. (The minor implementation detail needed to do this in O(n) total time is left to
the reader.)
4. For each node w in T set h(w) to zero.
5. For each identifier i, compute the lca of each consecutive pair of leaves in Li ,
and increment h(w) by one each time that w is the computed lca.
6. With a bottom-up traversal of T , compute, for each node v, S(v) and U(v) =
P P
i:ni >0 [ni (v) − 1] = [h(w) : w is in the subtree of v].
7. Set C(v) = S(v) − U(v) for each node v.
8. Accumulate the table of l(k) values as detailed in Section 1.
End.
Theorem 2.1 Let S be a set of K strings of total length n, and let l(k) denote the
length of the longest substring that appears in at least k distinct strings of S. A table
of all l(k) values, for k from 2 to K, can be built in O(n) time.
5
That so much information about the substrings of S can be obtained in time
proportional to the time needed just to read the strings is very impressive. It would
be a good challenge to try to obtain this result without the use of suffix trees (or a
similar data structure).
References
[1] L. Hui. Color set size problem with applications to string matching. Proc. 3rd
Symp. on Combinatorial Pattern Matching. Springer LNCS 644, pages 227–240,
1992.