Common Sub Strings

The document discusses finding common substrings that occur in multiple DNA, RNA, or protein strings. It presents an algorithm that can find the longest common substrings in O(n) linear time, which is surprising given how much information about the string structures it provides. The algorithm works by building a generalized suffix tree of the strings and computing values for each node to represent the number of distinct strings its subtree contains.

Uploaded by

Aditya Singh

1 APL6: Common substrings of more than two strings
One of the most important questions asked about a set of strings is what substrings
are common to a large number of the distinct strings. This is in contrast to the
important problem of finding substrings that occur repeatedly in a single string.
In biological strings (DNA, RNA or protein) the problem of finding substrings
common to a large number of distinct strings arises in many different contexts. In
fact, the task of finding (inexactly matching) substrings that appear frequently in
a set of strings is almost an industry in some subareas of molecular biology. We
will say much more about this when we discuss database searching in Chapter ??
and multiple string comparison in Chapter ??. Most directly, the problem of finding
common substrings arises because mutations that occur in DNA after two species
diverge will more rapidly change those parts of the DNA or protein that are less
functionally important. The parts of the DNA or protein that are critical for the
correct functioning of the molecule will be more highly conserved, because mutations
that occur in those regions will more likely be lethal. So finding DNA or protein
substrings that occur commonly in a wide range of species helps point to regions or
characters that may be critical for the function or structure of the biological string.
Less directly, the problem of finding (exactly matching) common substrings in
a set of distinct strings arises as a subproblem of many heuristics developed in the
biological literature to align a set of strings. That problem, called multiple alignment,
will be discussed in some detail in Section ??.
The biological applications motivate the following exact matching problem: Given
a set of strings, find substrings “common” to a large number of those strings. The
word “common” here means “occurring with equality”. A more difficult problem is
to find “similar” substrings in many given strings, where “similar” allows a small
number of differences. Problems of this type will be discussed in Part III.

Formal problem statement and first method

Suppose we have K strings whose lengths sum to n.
Definition For each k between 2 and K, we define l(k) to be the length of the
longest substring common to at least k of the strings.
We want to compute a table of K − 1 entries where entry k gives l(k) and also
points to one of the common substrings of that length. For example, consider the set of
strings {sandollar, sandlot, handler, grand, pantry}. Then the l(k) values (each with
one example substring) are:

k    l(k)   one substring
--------------------------------------
2    4      sand
3    3      and
4    3      and
5    2      an
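
The table above can be checked with a brute-force sketch. This is not the suffix tree method discussed in this section: it simply enumerates every substring, so it is only suitable for tiny inputs. The function name longest_common_table is an assumption made for illustration.

```python
from collections import defaultdict

def longest_common_table(strings):
    """Brute force: map k -> (l(k), one witness substring) for k = 2..K."""
    owners = defaultdict(set)      # substring -> indices of the strings containing it
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners[s[i:j]].add(idx)
    table = {}
    for k in range(2, len(strings) + 1):
        best = ""
        for sub, ids in owners.items():
            # keep the longest substring found in at least k distinct strings
            if len(ids) >= k and len(sub) > len(best):
                best = sub
        table[k] = (len(best), best)
    return table

words = ["sandollar", "sandlot", "handler", "grand", "pantry"]
table = longest_common_table(words)
for k, (length, witness) in table.items():
    print(k, length, witness)
```

The witness reported for a given k is just one of possibly several substrings of that length; for k = 2, for instance, both "sand" and "andl" qualify.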

Surprisingly, the problem can be solved in linear, O(n), time [1]. It really is
amazing that so much information about the contents and substructure of the strings
can be extracted in time proportional to the time needed just to read in the strings.
The linear time algorithm will be fully discussed in Chapter ?? after the constant
time lowest common ancestor method has been discussed.
To prepare for the O(n) result, we show here how to solve the problem in O(Kn)
time. That time bound is also non-trivial, but is achieved by a generalization of the
longest common substring method for two strings. First, build a generalized suffix tree
T for the K strings. Each leaf of the tree represents a suffix from one of the K strings,
and is marked with one of K unique string identifiers, 1 to K, to indicate which string
the suffix is from. Each of the K strings is given a distinct termination symbol, so
that identical suffixes appearing in more than one string end at distinct leaves in the
generalized suffix tree. Hence, each leaf in T has only one string identifier.
Definition For every internal node v of T, define C(v) to be the number of
distinct string identifiers that appear at the leaves in the subtree of v.
Once the C(v) numbers are known, and the string-depth of every node is known,
the desired l(k) values can be easily accumulated with a linear time traversal of the
tree. That traversal builds a vector V where, for each value of k from 2 to K, V(k)
holds the string-depth (and location if desired) of the deepest (string-depth) node v
encountered with C(v) = k. (When encountering a node v with C(v) = k, compare
the string-depth of v to the current value of V(k), and if v's depth is greater than
V(k), change V(k) to the depth of v.) Essentially, V(k) reports the length of the
longest string that occurs in exactly k of the K strings. Therefore V(k) ≤ l(k). To find
l(k), simply scan V from largest to smallest index, writing into each position the
maximum V(k) value seen. That is, if V(k) is empty or V(k) < V(k + 1), then set V(k)
to V(k + 1). The resulting vector holds the desired l(k) values.
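
The accumulation of V and the final right-to-left scan can be sketched as follows. The sketch assumes the (C(v), string-depth) pairs of the internal nodes are already available as a list; the function name lk_table and that input format are illustrative assumptions, and an "empty" V(k) is represented by 0.

```python
def lk_table(nodes, K):
    """nodes: iterable of (C(v), string_depth(v)) pairs for internal nodes.
    Returns a dict mapping k -> l(k) for k = 2..K."""
    V = [0] * (K + 1)                  # V[k] == 0 plays the role of "empty"
    for c, depth in nodes:
        if c >= 2 and depth > V[c]:    # deepest node seen so far with C(v) == c
            V[c] = depth
    for k in range(K - 1, 1, -1):      # scan from largest to smallest index
        if V[k] < V[k + 1]:
            V[k] = V[k + 1]
    return {k: V[k] for k in range(2, K + 1)}

# A few internal nodes of the tree for {sandollar, sandlot, handler, grand, pantry}:
# "sand" (C = 2, depth 4), "and" (C = 4, depth 3), "an" (C = 5, depth 2), "a" (C = 5, depth 1).
print(lk_table([(2, 4), (4, 3), (5, 2), (5, 1)], K=5))
```

Before the scan, V(3) is empty because no node has C(v) = 3; the scan fills it with V(4) = 3, which is the reason the second pass is needed.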

1.1 Computing the C(v) numbers

In linear time, it is easy to compute for each internal node v the number of leaves in
v’s subtree. But that number may be larger than C(v) since two leaves in the subtree
may have the same identifier. That repetition of identifiers is what makes it hard
to compute C(v) in O(n) time. So, instead of counting the number of leaves below
v, the algorithm uses O(Kn) time to explicitly compute which identifiers are found
below any node. For each internal node v, a K-length bit-vector is created which
has a 1 in bit i if there is a leaf with identifier i in the subtree of v. Then C(v) is
just the number of 1-bits in that vector. The vector for v is obtained by OR’ing the
vectors of the children of v. For l children, this takes lK time. Therefore over the
entire tree, since there are O(n) edges, the time needed to build the entire table is
O(Kn). We will return to this problem in Section 2 where an O(n) time solution will
be presented.
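
A minimal sketch of the bit-vector computation follows. It assumes the tree is given as a dictionary of child lists and that identifiers are numbered from 0; Python integers stand in for the K-length bit-vectors. The names count_identifiers, children and leaf_id are illustrative, and the recursion is only appropriate for small trees.

```python
def count_identifiers(children, leaf_id, root):
    """children: dict node -> list of child nodes (leaves have no entry).
    leaf_id: dict leaf -> string identifier in 0..K-1.
    Returns a dict mapping every node v to C(v)."""
    mask = {}

    def visit(v):
        kids = children.get(v, [])
        if not kids:
            mask[v] = 1 << leaf_id[v]      # a leaf carries exactly one identifier
            return
        m = 0
        for c in kids:
            visit(c)
            m |= mask[c]                   # OR the children's vectors together
        mask[v] = m

    visit(root)
    return {v: bin(m).count("1") for v, m in mask.items()}

# Hypothetical tree: node "a" has three leaves, two of them with the same identifier.
children = {"root": ["a", "b"], "a": ["x1", "x2", "x3"], "b": ["y1", "y2"]}
leaf_id = {"x1": 0, "x2": 0, "x3": 1, "y1": 1, "y2": 2}
C = count_identifiers(children, leaf_id, "root")
print(C["a"], C["b"], C["root"])
```

Node "a" has three leaves below it but only two distinct identifiers, which is exactly the gap between the leaf count and C(v) described above.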

2 A linear time solution to the multiple common substring problem
Above, a generalized suffix tree T was constructed for the K strings of total length
n, and the table of all the l(k) values was obtained by operations on T. That method
had a running time of O(Kn). In this section we reduce the time to O(n). The
solution was obtained by Lucas Hui [1] (see footnote 1).
Recall that for any node v in T, C(v) is the number of distinct leaf string identifiers
in the subtree of v, and that a table of all the l(k) values can be computed in O(n)
time once all the C(v) values are known. Let S(v) denote the total number
of leaves in the subtree of v; as noted in Section 1.1, S(v) can easily be computed in
O(n) time for all nodes.
Certainly S(v) ≥ C(v) for any node v, and it will be strictly greater when there
are two or more leaves of the same string identifier in v's subtree. Our approach to
finding C(v) is to compute both S(v) and a correction factor U(v) which counts how
many "duplicate" suffixes from the same string occur in v's subtree. Then C(v) is
simply S(v) − U(v).
Definition: $n_i(v)$ is the number of leaves with identifier i in the subtree rooted
at node v. Let $n_i$ be the total number of leaves with identifier i.
With that definition, we immediately have the following
Lemma 2.1 $U(v) = \sum_{i : n_i(v) > 0} (n_i(v) - 1)$, and $C(v) = S(v) - \sum_{i : n_i(v) > 0} (n_i(v) - 1)$.
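
As a small worked instance of Lemma 2.1 (the identifiers and counts are made up purely for illustration), suppose the subtree of v has six leaves carrying identifiers 1, 1, 1, 2, 4, 4:

```latex
\begin{align*}
S(v) &= 6, \qquad n_1(v) = 3,\; n_2(v) = 1,\; n_4(v) = 2,\\
U(v) &= (3 - 1) + (1 - 1) + (2 - 1) = 3,\\
C(v) &= S(v) - U(v) = 6 - 3 = 3 .
\end{align*}
```

The three distinct identifiers 1, 2 and 4 are indeed what C(v) should count.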

We show below that all the correction factors for all internal nodes can be
computed in O(n) total time. That then gives an O(n) time solution to the k-common
substring problem.
Footnote 1: In the introduction of an earlier unpublished manuscript [3], Pratt claims
a linear time solution to the problem, but the claim doesn't specify whether the problem
is for a fixed k or for all values of k. The section where the details were to be presented
is not available and was apparently never finished [2].

2.1 The method
The algorithm first does a depth-first traversal of T, numbering the leaves in the
order that they are encountered. That numbering has the familiar property that for
any internal node v, the numbers given to the leaves in the subtree rooted at v
form a consecutive interval.
For purposes of the exposition, let us focus on the single identifier i and show
how to compute $n_i(v) - 1$ for each internal node v. Let $L_i$ be the list of leaves with
identifier i, in increasing order of their dfs numbers. For example, in Figure 1, the
leaves with identifier i are shown boxed and the corresponding $L_i$ is 1, 3, 6, 8, 10. By
the properties of depth-first numbering, for the subtree rooted at any internal node
v, all the $n_i(v)$ leaves with identifier i occur in a consecutive interval of list $L_i$. Call
that interval $L_i(v)$. If x and y are any two leaves in $L_i(v)$, then the lca of x and y
is a node in the subtree of v. So if we compute the lca for each consecutive pair of
leaves in $L_i(v)$, then all of the $n_i(v) - 1$ computed lca's will be found in the subtree
of v. Further, if x and y are not both in the subtree of v, then the lca of x and y will
not be a node in v's subtree. This leads to the following lemma and method.
Lemma 2.2 If we compute the lca for each consecutive pair of leaves in $L_i$, then for
any node v with $n_i(v) > 0$, exactly $n_i(v) - 1$ of the computed lca's will lie in the
subtree of v.
Lemma 2.2 is illustrated in Figure 1.

Figure 1: A tree whose leaves are numbered 1 to 10 in depth-first order. The boxed
leaves (1, 3, 6, 8, 10) have identifier i. The circled internal nodes, lca(1,3), lca(3,6),
lca(6,8) and lca(8,10), are the lowest common ancestors of the four adjacent pairs of
leaves from list $L_i$.
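
Lemma 2.2 can be checked by hand on a small tree. The sketch below uses a hypothetical seven-leaf tree (not the tree of Figure 1), given by parent pointers, and a naive lca that walks up to the root rather than the constant-time lca method assumed by the algorithm.

```python
# Hypothetical tree: internal nodes are strings, leaves are their dfs numbers 1..7.
parent = {"a": "r", "b": "r", "c": "a", "d": "b",
          1: "a", 2: "c", 3: "c", 4: "b", 5: "d", 6: "d", 7: "b"}

def ancestors(x):
    """Path from x up to the root, inclusive of x itself."""
    path = [x]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def lca(x, y):
    """Naive lca: the first ancestor of y that is also an ancestor of x."""
    seen = set(ancestors(x))
    return next(w for w in ancestors(y) if w in seen)

def in_subtree(w, v):
    return v in ancestors(w)

L_i = [1, 3, 5, 6, 7]                      # leaves with identifier i, in dfs order
lcas = [lca(x, y) for x, y in zip(L_i, L_i[1:])]

result = {}
for v in ["r", "a", "b", "c", "d"]:
    n_iv = sum(in_subtree(leaf, v) for leaf in L_i)    # n_i(v)
    hits = sum(in_subtree(w, v) for w in lcas)         # computed lca's inside v's subtree
    result[v] = (n_iv, hits)
    print(v, n_iv, hits)
```

For every internal node of this tree the second number is one less than the first, as the lemma states.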

Given the lemma, we can compute $n_i(v) - 1$ for each node v by the following
method. Compute the lca of each consecutive pair of leaves in $L_i$, and accumulate for
each node w a count of the number of times that w is the computed lca. Let h(w)
denote that count for node w. Then for any node v with $n_i(v) > 0$, $n_i(v) - 1$ is
exactly $\sum [h(w) : w \text{ is in the subtree of } v]$. A standard O(n)-time bottom-up
traversal of T can therefore be used to find $n_i(v) - 1$ for each such node v.
To find U(v), we don't want $n_i(v) - 1$ but rather $\sum_{i : n_i(v) > 0} [n_i(v) - 1]$. But the
algorithm must not do a separate bottom-up traversal for each identifier, since the time
bound would then be O(Kn). Instead, the algorithm should defer the bottom-up
traversal until each list $L_i$ has been processed, and it should let h(w) count the total
number of times that w is the computed lca over all of the lists. Only then is a single
bottom-up traversal of T done. At that point,
$U(v) = \sum_{i : n_i(v) > 0} [n_i(v) - 1] = \sum [h(w) : w \text{ is in the subtree of } v]$.
We can now summarize the entire O(n) method for solving the k-common
substring problem.
Begin
1. Build a generalized suffix tree T for the K strings.
2. Number the leaves of T as they are encountered in a depth-first traversal of T.
3. For each string identifier i, extract the ordered list $L_i$ of leaves with identifier
i. (The minor implementation detail needed to do this in O(n) total time is left to
the reader.)
4. For each node w in T set h(w) to zero.
5. For each identifier i, compute the lca of each consecutive pair of leaves in $L_i$,
and increment h(w) by one each time that w is the computed lca.
6. With a bottom-up traversal of T, compute, for each node v, S(v) and
$U(v) = \sum_{i : n_i(v) > 0} [n_i(v) - 1] = \sum [h(w) : w \text{ is in the subtree of } v]$.
7. Set C(v) = S(v) − U(v) for each node v.
8. Accumulate the table of l(k) values as detailed in Section 1.
End.
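
The eight steps can be sketched end to end as follows. Two substitutions are assumptions made to keep the sketch short and self-contained, and both forfeit the O(n) bound: step 1 builds a naive uncompressed trie of all suffixes instead of a linear-time generalized suffix tree, and step 5 uses a naive lca that walks up parent pointers instead of a constant-time lca structure. The function name k_common_table is illustrative.

```python
def k_common_table(strings):
    """Return a dict mapping k -> l(k) for k = 2..K, following steps 1 to 8."""
    K = len(strings)
    # Step 1 (assumption: naive suffix trie; a real suffix tree would be compressed).
    children, parent, depth, level = [{}], [-1], [0], [0]
    leaf_id = {}                                # leaf node -> string identifier
    for ident, s in enumerate(strings):
        for start in range(len(s)):
            v = 0
            # a per-string terminator tuple keeps leaves of different strings apart
            for sym in list(s[start:]) + [("$", ident)]:
                if sym not in children[v]:
                    children.append({})
                    parent.append(v)
                    level.append(level[v] + 1)
                    depth.append(depth[v] + (0 if isinstance(sym, tuple) else 1))
                    children[v][sym] = len(children) - 1
                v = children[v][sym]
            leaf_id[v] = ident
    N = len(children)
    # Step 2: depth-first (preorder) traversal; leaves appear in dfs order.
    order, stack = [], [0]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v].values())
    # Step 3: bucket the leaves by identifier while scanning in dfs order.
    L = [[] for _ in range(K)]
    for v in order:
        if v in leaf_id:
            L[leaf_id[v]].append(v)

    def lca(x, y):                              # naive lca (assumption, not O(1))
        while x != y:
            if level[x] < level[y]:
                x, y = y, x
            x = parent[x]
        return x

    # Steps 4 and 5: h(w) counts how often w is the lca of a consecutive pair.
    h = [0] * N
    for leaves in L:
        for x, y in zip(leaves, leaves[1:]):
            h[lca(x, y)] += 1
    # Step 6: one bottom-up pass computes S(v) and U(v) for all nodes.
    S, U = [0] * N, h[:]
    for v in reversed(order):                   # children are handled before parents
        if v in leaf_id:
            S[v] = 1
        if v != 0:
            S[parent[v]] += S[v]
            U[parent[v]] += U[v]
    # Steps 7 and 8: C(v) = S(v) - U(v), then accumulate and scan the vector V.
    V = [0] * (K + 1)
    for v in range(N):
        c = S[v] - U[v]
        if c >= 2:
            V[c] = max(V[c], depth[v])
    for k in range(K - 1, 1, -1):
        V[k] = max(V[k], V[k + 1])
    return {k: V[k] for k in range(2, K + 1)}

print(k_common_table(["sandollar", "sandlot", "handler", "grand", "pantry"]))
```

Step 3 shows one way to obtain every list $L_i$ in a single pass: appending leaves to per-identifier buckets in dfs order leaves each bucket already sorted.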

2.2 Time analysis

The size of the suffix tree is O(n) and preprocessing of the tree for lca computations
is done in O(n) time. There are then $\sum_{i=1}^{K} (n_i - 1) < n$ lca computations done,
each of which takes constant time, so all the lca computations take O(n) time in
total. Hence only O(n) time is needed to compute all C(v) values. Once these are
known, only O(n) additional time is needed to build the output table. That part of
the algorithm is the same as in the previously discussed O(Kn) time algorithm of
Section 1. Therefore

Theorem 2.1 Let S be a set of K strings of total length n, and let l(k) denote the
length of the longest substring that appears in at least k distinct strings of S. A table
of all l(k) values, for k from 2 to K, can be built in O(n) time.

That so much information about the substrings of S can be obtained in time
proportional to the time needed just to read the strings is very impressive. It would
be a good challenge to try to obtain this result without the use of suffix trees (or a
similar data structure).

References
[1] L. Hui. Color set size problem with applications to string matching. Proc. 3rd
Symp. on Combinatorial Pattern Matching. Springer LNCS 644, pages 227–240,
1992.

[2] V. Pratt. Personal communication.

[3] V. Pratt. Applications of the Weiner repetition finder. Unpublished manuscript,
1973.
