Fast Pattern Matching in Strings
Abstract. An algorithm is presented which finds all occurrences of one given string within
another, in running time proportional to the sum of the lengths of the strings. The constant of
proportionality is low enough to make this algorithm of practical use, and the procedure can also be
extended to deal with some more general pattern-matching problems. A theoretical application of the
algorithm shows that the set of concatenations of even palindromes, i.e., the language $\{\alpha\alpha^R\}^*$, can be
recognized in linear time. Other algorithms which run even faster on the average are also considered.
Key words. pattern, string, text-editing, pattern-matching, trie memory, searching, period of a
string, palindrome, optimum algorithm, Fibonacci string, regular expression
* Received by the editors August 29, 1974, and in revised form April 7, 1976.
† Computer Science Department, Stanford University, Stanford, California 94305. The work of
this author was supported in part by the National Science Foundation under Grant GJ 36473X and by
the Office of Naval Research under Contract NR 044-402.
‡ Xerox Palo Alto Research Center, Palo Alto, California 94304. The work of this author was
supported in part by the National Science Foundation under Grant GP 7635 at the University of
California, Berkeley.
§ Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge,
Massachusetts 02139. The work of this author was supported in part by the National Science Foundation
under Grant GP-6945 at University of California, Berkeley, and under Grant GJ-992 at Stanford
University.
DONALD E. KNUTH, JAMES H. MORRIS, JR. AND VAUGHAN R. PRATT
    a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
    ↑
The arrow here indicates the current text character; since it points to b, which
doesn’t match that a, we shift the pattern one space right and move to the next
input character:
      a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
      ↑
Now we have a match, so the pattern stays put while the next several characters
are scanned. Soon we come to another mismatch:
      a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
            ↑
At this point we have matched the first three pattern characters but not the fourth,
so we know that the last four characters of the input have been a b c x where x ≠ a;
we don’t have to remember the previously scanned characters, since our position in
the pattern yields enough information to recreate them. In this case, no matter what x
is (as long as it’s not a), we deduce that the pattern can immediately be shifted four
more places to the right; one, two, or three shifts couldn’t possibly lead to a match.
Soon we get to another partial match, this time with a failure on the eighth
pattern character:
              a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
                            ↑
Now we know that the last eight characters were a b c a b c a x, where x ≠ c. The
pattern should therefore be shifted three places to the right:
                    a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
                            ↑
We try to match the new pattern character, but this fails too, so we shift the pattern
four (not three or five) more places. That produces a match, and we continue
scanning until reaching another mismatch on the eighth pattern character:
                            a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
                                          ↑
Again we shift the pattern three places to the right; this time a match is produced,
and we eventually discover the full pattern:
                                  a b c a b c a c a b
    b a b c b a b c a b c a a b c a b c a b c a c a b c
                                                    ↑
The play-by-play description for this example indicates that the pattern-matching
process will run efficiently if we have an auxiliary table that tells us
exactly how far to slide the pattern, when we detect a mismatch at its jth character
pattern[j]. Let next[j] be the character position in the pattern which should be
checked next after such a mismatch, so that we are sliding the pattern j − next[j]
places relative to the text. The following table lists the appropriate values:
         j = 1  2  3  4  5  6  7  8  9  10
pattern[j] = a  b  c  a  b  c  a  c  a  b
   next[j] = 0  1  1  0  1  1  0  5  0  1
(Note that next[j] = 0 means that we are to slide the pattern all the way past the
current text character.) We shall discuss how to precompute this table later;
fortunately, the calculations are quite simple, and we will see that they require
only O(m) steps.
At each step of the scanning process, we move either the text pointer or the
pattern, and each of these can move at most n times; so at most 2n steps need to be
performed, after the next table has been set up. Of course the pattern itself doesn’t
really move; we can do the necessary operations simply by maintaining the pointer
variable j.
    j := k := 1;
    while j ≤ m and k ≤ n do
      begin
        while j > 0 and text[k] ≠ pattern[j]
          do j := next[j];
        k := k + 1; j := j + 1;
      end;
If j > m at the conclusion of the program, the leftmost match has been found in
positions k − m through k − 1; but if j ≤ m, the text has been exhausted. (The and
operation here is the "conditional and" which does not evaluate the relation
text[k] ≠ pattern[j] unless j > 0.) The program has a curious feature, namely that
the inner loop operation "j := next[j]" is performed no more often than the outer
loop operation "k := k + 1"; in fact, the inner loop is usually performed somewhat
less often, since the pattern generally moves right less frequently than the text
pointer does.
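The scanning loop can be tried out directly. The following is a minimal Python sketch, not the authors' code: the function name kmp_find and the padding trick used to keep the paper's 1-based indexing are assumptions made for illustration, and the next table is simply copied from the table above.

```python
def kmp_find(pattern, text, next_table):
    """Return the 1-based start of the leftmost match, or None if there is none.

    next_table[j] for j = 1..m follows the 1-based convention of the text;
    index 0 is unused padding.
    """
    m, n = len(pattern), len(text)
    p = " " + pattern   # pad so that p[1] is the first pattern character
    t = " " + text
    j = k = 1
    while j <= m and k <= n:
        # Python's `and` short-circuits, like the "conditional and".
        while j > 0 and t[k] != p[j]:
            j = next_table[j]
        k += 1
        j += 1
    # On success the match occupies positions k - m through k - 1.
    return k - m if j > m else None


# next table for the example pattern a b c a b c a c a b (index 0 unused).
NEXT = [None, 0, 1, 1, 0, 1, 1, 0, 5, 0, 1]
print(kmp_find("abcabcacab", "babcbabcabcaabcabcabcacabc", NEXT))  # 16
```

The result 16 agrees with the final picture of the worked example, where the pattern lies under text positions 16 through 25.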
To prove rigorously that the above program is correct, we may use the
following invariant relation: "Let p = k − j (i.e., the position in the text just
preceding the first character of the pattern, in our assumed alignment). Then we
have text[p + i] = pattern[i] for 1 ≤ i < j (i.e., we have matched the first j − 1
characters of the pattern, if j > 0); but for 0 ≤ t < p we have text[t + i] ≠ pattern[i]
for some i, where 1 ≤ i ≤ m (i.e., there is no possible match of the entire pattern to
the left of p)."
The program will of course be correct only if we can compute the next table so
that the above relation remains invariant when we perform the operation
j := next[j]. Let us look at that computation now. When the program sets
j := next[j], we know that j > 0, and that the last j characters of the input up to and
including text[k] were
    pattern[1] ... pattern[j − 1] x
where x ≠ pattern[j]. What we want is to find the least amount of shift for which
these characters can possibly match the shifted pattern; in other words, we want
next[j] to be the largest i less than j such that the last i characters of the input were
    pattern[1] ... pattern[i − 1] x
and pattern[i] ≠ pattern[j]. (If no such i exists, we let next[j] = 0.) With this
definition of next[j] it is easy to verify that text[t + 1] ... text[k] ≠
pattern[1] ... pattern[k − t] for k − j ≤ t < k − next[j]; hence the stated relation is
indeed invariant, and our program is correct.
Now we must face up to the problem we have been postponing, the task of
calculating next[j] in the first place. This problem would be easier if we didn’t
require pattern[i] ≠ pattern[j] in the definition of next[j], so we shall consider the
easier problem first. Let f[j] be the largest i less than j such that pattern[1] ...
pattern[i − 1] = pattern[j − i + 1] ... pattern[j − 1]; since this condition holds vacuously
for i = 1, we always have f[j] ≥ 1 when j > 1. By convention we let f[1] = 0.
The pattern used in the example of §1 has the following f table:
         j = 1  2  3  4  5  6  7  8  9  10
pattern[j] = a  b  c  a  b  c  a  c  a  b
      f[j] = 0  1  1  1  2  3  4  5  1  2
If pattern[j] = pattern[f[j]] then f[j + 1] = f[j] + 1; but if not, we can use
essentially the same pattern-matching algorithm as above to compute f[j + 1],
with text = pattern! (Note the similarity of the f[j] problem to the invariant
condition of the matching algorithm. Our program calculates the largest j less
than or equal to k such that pattern[1] ... pattern[j − 1] = text[k − j + 1] ...
text[k − 1], so we can transfer the previous technology to the present problem.)
The following program will compute f[j + 1], assuming that f[j] and next[1], ...,
next[j − 1] have already been calculated:
    t := f[j];
    while t > 0 and pattern[j] ≠ pattern[t]
      do t := next[t];
    f[j + 1] := t + 1;
The correctness of this program is demonstrated as before; we can imagine two
copies of the pattern, one sliding to the right with respect to the other. For
example, suppose we have established that f[8] = 5 in the above case; let us
consider the computation of f[9]. The appropriate picture is
          a b c a b c a c a b
    a b c a b c a c a b
                  ↑
Since pattern[8] ≠ b, we shift the upper copy right, knowing that the most recently
scanned characters of the lower copy were a b c a x for x ≠ b. The next table tells
us to shift right four places, obtaining
                  a b c a b c a c a b
    a b c a b c a c a b
                  ↑
This program takes O(m) units of time, for the same reason as the matching
program takes O(n): the operation t := next[t] in the innermost loop always shifts
the upper copy of the pattern to the right, so it is performed a total of m times at
most. (A slightly different way to prove that the running time is bounded by a
constant times m is to observe that the variable t starts at 0 and it is increased,
m − 1 times, by 1; furthermore its value remains nonnegative. Therefore the
operation t := next[t], which always decreases t, can be performed at most m − 1
times.)
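The precomputation can be sketched in Python as follows. The loop for f[j + 1] is a direct transcription of the program above. The rule used to fill in next[j + 1] does not appear in the text as given here; it is an assumption consistent with the definition of next (take f[j + 1] when the characters differ, otherwise inherit next[f[j + 1]]). The function name failure_tables is hypothetical.

```python
def failure_tables(pattern):
    """Return (f, nxt) as 1-based lists; index 0 is unused padding."""
    m = len(pattern)
    p = " " + pattern                 # p[1] is the first pattern character
    f = [0] * (m + 1)                 # f[1] = 0 by convention
    nxt = [0] * (m + 1)               # next[1] = 0
    for j in range(1, m):
        # Transcription of the program: uses f[j] and next[1..j-1] only.
        t = f[j]
        while t > 0 and p[j] != p[t]:
            t = nxt[t]
        f[j + 1] = t + 1
        # Assumed rule (not shown in this text): next[j+1] is f[j+1] if the
        # characters differ, else whatever next already says for f[j+1].
        if p[j + 1] == p[t + 1]:
            nxt[j + 1] = nxt[t + 1]
        else:
            nxt[j + 1] = t + 1
    return f, nxt


f, nxt = failure_tables("abcabcacab")
print(f[1:])    # [0, 1, 1, 1, 2, 3, 4, 5, 1, 2]
print(nxt[1:])  # [0, 1, 1, 0, 1, 1, 0, 5, 0, 1]
```

Both printed rows agree with the f and next tables given for the example pattern.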
To summarize what we have said so far: Strings of text can be scanned
efficiently by making use of two ideas. We can precompute "shifts", specifying
how to move the given pattern when a mismatch occurs at its jth character; and
this precomputation of "shifts" can be performed efficiently by using the same
principle, shifting the pattern against itself.
3. Gaining efficiency. We have presented the pattern-matching algorithm in
a form that is rather easily proved correct; but as so often happens, this form is not
very efficient. In fact, the algorithm as presented above would probably not be
competitive with the naive algorithm on realistic data, even though the naive
algorithm has a worst-case time of order m times n instead of m plus n, because
the chance of this worst case is rather slim. On the other hand, a well-implemented
form of the new algorithm should go noticeably faster because there is no backing
up after a partial match.
It is not difficult to see the source of inefficiency in the new algorithm as
presented above: When the alphabet of characters is large, we will rarely have a
partial match, and the program will waste a lot of time discovering rather
awkwardly that text[k] ≠ pattern[1] for k = 1, 2, 3, .... When j = 1 and text[k] ≠
pattern[1], the algorithm sets j := next[1] = 0, then discovers that j = 0, then
increases k by 1, then sets j to 1 again, then tests whether or not 1 is ≤ m, and later
it tests whether or not 1 is greater than 0. Clearly we would be much better off
making j = 1 into a special case.
The algorithm also spends unnecessary time testing whether j > m or k > n.
A fully-matched pattern can be accounted for by setting pattern[m + 1] = "@"
for some impossible character @ that will never be matched, and by letting
next[m + 1] = −1; then a test for j < 0 can be inserted into a less-frequently
executed part of the code. Similarly we can for example set text[n + 1] = "⊥"
(another impossible character) and text[n + 2] = pattern[1], so that the test for
k > n needn’t be made very often. (See [17] for a discussion of such more or less
mechanical transformations on programs.)
The following form of the algorithm incorporates these refinements.
    a := pattern[1];
    pattern[m + 1] := ’@’; next[m + 1] := −1;
    text[n + 1] := ’⊥’; text[n + 2] := a;
    j := k := 1;
get started: comment j = 1;
    while text[k] ≠ a do k := k + 1;
    if k > n then go to input exhausted;
char matched: j := j + 1; k := k + 1;
loop: comment j > 0;
    if text[k] = pattern[j] then go to char matched;
    j := next[j];
    if j = 1 then go to get started;
    if j = 0 then
      begin
        j := 1; k := k + 1;
        go to get started;
      end;
    if j > 0 then go to loop;
    comment text[k − m] through text[k − 1] matched;
This program will usually run faster than the naive algorithm; the worst case
occurs when trying to find the pattern a b in a long string of a’s. Similar ideas can
be used to speed up the program which prepares the next table.
In a text-editor the patterns are usually short, so that it is most efficient to
translate the pattern directly into machine-language code which implicitly contains
the next table (cf. [3, Hack 179] and [24]). For example, the pattern in §1
at label "char matched". The test "if j > 0 then" is also removed from the
program. This forms linked lists for 1 ≤ j ≤ m of all places where the first j
characters of the pattern (but no more than j) are matched in the input.
Still another straightforward modification will find the longest initial match of
the pattern, i.e., the maximum j such that pattern[1] ... pattern[j] occurs in text.
In practice, the text characters are often packed into words, with say b
characters per word, and the machine architecture often makes it inconvenient
to access individual characters. When efficiency for large n is important on such
machines, one alternative is to carry out b independent searches, one for each
possible alignment of the pattern’s first character in the word. These searches can
treat entire words as "supercharacters", with appropriate masking, instead of
working with individual characters and unpacking them. Since the algorithm we
have described does not depend on the size of the alphabet, it is well suited to this
and similar alternatives.
Sometimes we want to match two or more patterns in sequence, finding an
occurrence of the first followed by the second, etc.; this is easily handled by
consecutive searches, and the total running time will be of order n plus the sum of
the individual pattern lengths.
We might also want to match two or more patterns in parallel, stopping as
soon as any one of them is fully matched. A search of this kind could be done with
multiple next and pattern tables, with one j pointer for each; but this would make
the running time kn plus the sum of the pattern lengths, when there are k patterns.
Hopcroft and Karp have observed (unpublished) that our pattern-matching
algorithm can be extended so that the running time for simultaneous searches is
proportional simply to n, plus the alphabet size times the sum of the pattern
lengths. The patterns are combined into a "trie" whose nodes represent all of the
initial substrings of one or more patterns, and whose branches specify the
appropriate successor node as a function of the next character in the input text.
For example, if there are four patterns {a b c a b, a b a b c, b c a c, b b c }, the trie is
shown in Fig. 1.
node  substring  if a   if b   if c
 0               1      7      0
 1    a          1      2      0
 2    ab         5      10     3
 3    abc        4      7      0
 4    abca       1      abcab  bcac
 5    aba        1      6      0
 6    abab       5      10     ababc
 7    b          1      10     8
 8    bc         9      7      0
 9    bca        1      2      bcac
10    bb         1      10     bbc
FIG. 1
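The transition structure of Fig. 1 can be recomputed mechanically. The sketch below is a brute-force illustration, not the Hopcroft and Karp construction itself (which the text does not spell out): each node is a proper prefix of some pattern, and the successor on a character is either a fully matched pattern or the longest suffix of the extended string that is again a node. The names build_machine and first_match are hypothetical, and the sketch assumes that no pattern is a proper prefix of another.

```python
def build_machine(patterns, alphabet):
    """Return (nodes, table); table[node, ch] is a node or a matched pattern."""
    patterns = set(patterns)
    # Nodes are the proper prefixes of the patterns, including the empty one.
    nodes = {p[:i] for p in patterns for i in range(len(p))}
    table = {}
    for node in nodes:
        for ch in alphabet:
            s = node + ch
            # Longest suffix of s that is a complete pattern, if any.
            hit = next((s[i:] for i in range(len(s)) if s[i:] in patterns), None)
            if hit is not None:
                table[node, ch] = hit
                continue
            while s not in nodes:       # the empty string is always a node
                s = s[1:]
            table[node, ch] = s
    return nodes, table


def first_match(table, patterns, text):
    """Scan text once; return (pattern, end position) of the first match."""
    node = ""
    for pos, ch in enumerate(text, start=1):
        node = table[node, ch]
        if node in patterns:
            return node, pos
    return None


PATTERNS = {"abcab", "ababc", "bcac", "bbc"}
NODES, TABLE = build_machine(PATTERNS, "abc")
print(len(NODES))                               # 11 nodes, as in Fig. 1
print(TABLE["ab", "b"], TABLE["abca", "c"])     # bb bcac
print(first_match(TABLE, PATTERNS, "aababcb"))  # ('ababc', 6)
```

This brute-force table construction is far slower than the bound quoted in the text; it is meant only for checking small examples.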
TABLE 1
         j =  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
pattern[j] =  a  b  a  a  b  a  b  a  a  b  a  a  b  a  b  a  a  b  a  b  a
      f[j] =  0  1  1  2  2  3  4  3  4  5  6  7  5  6  7  8  9 10 11 12  8
   next[j] =  0  1  0  2  1  0  4  0  2  1  0  7  1  0  4  0  2  1  0 12  0
If we extend this pattern to $\phi_\infty$, we obtain infinite sequences f[j] and next[j]
having the same general character. It is possible to prove by induction that
(2) $f[j] = j - F_{k-1}$ for $F_k \le j < F_{k+1}$,
because of the following remarkable near-commutative property of Fibonacci
strings:
(3) $\phi_{n-2}\phi_{n-1} = c(\phi_{n-1}\phi_{n-2})$ for $n \ge 3$,
where $c(\alpha)$ denotes changing the two rightmost characters of $\alpha$. For example,
$\phi_6$ = a b a a b a b a and $c(\phi_6)$ = a b a a b a a b. Equation (3) is obvious when
n = 3; and for n > 3 we have
$c(\phi_{n-2}\phi_{n-1}) = \phi_{n-2}c(\phi_{n-1}) = \phi_{n-2}\phi_{n-3}\phi_{n-2} = \phi_{n-1}\phi_{n-2}$
by induction; hence $c(\phi_{n-1}\phi_{n-2}) = c(c(\phi_{n-2}\phi_{n-1})) = \phi_{n-2}\phi_{n-1}$.
Equation (3) implies that
(4) $\mathit{next}[F_k - 1] = F_{k-1} - 1$ for $k \ge 3$.
Therefore if we have a mismatch when j = $F_8 - 1$ = 20, our algorithm might set
j := next[j] for the successive values 20, 12, 7, 4, 2, 1, 0 of j. Since $F_k$ is $\phi^k/\sqrt{5}$
rounded to the nearest integer, it is possible to have up to $\sim\log_\phi m$ consecutive
iterations of the j := next[j] loop.
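The chain 20, 12, 7, 4, 2, 1, 0 can be reproduced with a few lines of Python. This is an illustrative sketch under two assumptions that are not stated in the text as given here: that the Fibonacci strings are defined by $\phi_1$ = b, $\phi_2$ = a, $\phi_n = \phi_{n-1}\phi_{n-2}$ (which agrees with the value of $\phi_6$ quoted above and with the pattern of Table 1), and that next is obtained from f by the usual rule of inheriting next[f[j]] when the characters agree. All function names are hypothetical.

```python
def fibonacci_string(n):
    """phi_n under the assumed definition phi_1 = b, phi_2 = a."""
    prev, cur = "b", "a"
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur if n >= 2 else prev


def next_table(pattern):
    """1-based next table (index 0 unused), computed in O(m) steps."""
    m = len(pattern)
    p = " " + pattern
    f = [0] * (m + 1)
    nxt = [0] * (m + 1)
    for j in range(1, m):
        t = f[j]
        while t > 0 and p[j] != p[t]:
            t = nxt[t]
        f[j + 1] = t + 1
        nxt[j + 1] = nxt[t + 1] if p[j + 1] == p[t + 1] else t + 1
    return nxt


def next_chain(nxt, j):
    """Successive values of j under repeated j := next[j], down to 0."""
    chain = [j]
    while j > 0:
        j = nxt[j]
        chain.append(j)
    return chain


phi8 = fibonacci_string(8)
print(phi8)                                  # abaababaabaababaababa
print(next_chain(next_table(phi8), 20))      # [20, 12, 7, 4, 2, 1, 0]
```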
We shall now show that Fibonacci strings actually are the worst case, i.e., that
$\log_\phi m$ is also an upper bound. First let us consider the concept of periodicity in
strings. We say that p is a period of $\alpha$ if
(5) $\alpha[i] = \alpha[i+p]$ for $1 \le i \le |\alpha| - p$.
It is easy to see that p is a period of $\alpha$ if and only if
(6) $\alpha = (\alpha_1\alpha_2)^k\alpha_1$
for some $k \ge 0$, where $|\alpha_1\alpha_2| = p$ and $\alpha_2 \ne \epsilon$. Equivalently, p is a period of $\alpha$ if and
only if
(7) $\alpha\theta_1 = \theta_2\alpha$
for some $\theta_1$ and $\theta_2$ with $|\theta_1| = |\theta_2| = p$. Condition (6) implies (7) with $\theta_1 = \alpha_2\alpha_1$ and
$\theta_2 = \alpha_1\alpha_2$. Condition (7) implies (6), for we define $k = \lfloor |\alpha|/p \rfloor$ and observe that if
k > 0, then $\alpha = \theta_2\beta$ implies $\beta\theta_1 = \theta_2\beta$ and $\lfloor |\beta|/p \rfloor = k - 1$; hence, reasoning
inductively, $\alpha = \theta_2^k\alpha_1$ for some $\alpha_1$ with $|\alpha_1| < p$, and $\alpha_1\theta_1 = \theta_2\alpha_1$. Writing $\theta_2 = \alpha_1\alpha_2$
yields (6).
The relevance of periodicity to our algorithm is clear once we consider what it
means to shift a pattern. If pattern[1] ... pattern[j − 1] = $\alpha$ ends with
pattern[1] ... pattern[i − 1] = $\beta$, we have
(8) $\alpha = \beta\theta_1 = \theta_2\beta$
where $|\theta_1| = |\theta_2| = j - i$, so the amount of shift j − i is a period of $\alpha$.
The construction of i = next[j] in our algorithm implies further that
pattern[i], which is the first character of $\theta_1$, is unequal to pattern[j]. Let us assume
that $\beta$ itself is subsequently shifted leaving a residue $\gamma$, so that
(9) $\beta = \gamma\psi_1 = \psi_2\gamma$
where the first character of $\psi_1$ differs from that of $\theta_1$. We shall now prove that
(10) $|\alpha| > |\beta| + |\gamma|$.
If $|\beta| + |\gamma| \ge |\alpha|$, there is an overlap of $d = |\beta| + |\gamma| - |\alpha|$ characters between the
occurrences of $\beta$ and $\gamma$ in $\beta\theta_1 = \alpha = \theta_2\psi_2\gamma$; hence the first character of $\theta_1$ is
$\gamma[d+1]$. Similarly there is an overlap of d characters between the occurrences of
$\beta$ and $\gamma$ in $\theta_2\beta = \alpha = \gamma\psi_1\theta_1$; hence the first character of $\psi_1$ is $\beta[d+1]$. Since these
characters are distinct, we obtain $\gamma[d+1] \ne \beta[d+1]$, contradicting (9). This
establishes (10), and leads directly to the announced result:
THEOREM. The number of consecutive times that j := next[j] is performed,
while one text character is being scanned, is at most $1 + \log_\phi m$.
Proof. Let $L_r$ be the length of the shortest string $\alpha$ as in the above discussion
such that a sequence of r consecutive shifts is possible. Then $L_1 = 0$, $L_2 = 1$, and we
have $|\beta| \ge L_{r-1}$, $|\gamma| \ge L_{r-2}$ in (10); hence $L_r \ge F_{r+1} - 1$ by induction on r. Now if r
shifts occur we have $m \ge F_{r+1} \ge \phi^{r-1}$. ∎
The algorithm of §2 would run correctly in linear time even if f[j] were used
instead of next[j], but the analogue of the above theorem would then be false. For
example, the pattern $a^m$ leads to f[j] = j − 1 for 1 ≤ j ≤ m. Therefore if we matched
$a^m$ to the text $a^{m-1}b\alpha$, using f[j] instead of next[j], the mismatch text[m] ≠
pattern[m] would be followed by m occurrences of j := f[j] and m − 1 redundant
comparisons of text[m] with pattern[j], before k is advanced to m + 1.
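A small Python sketch makes the contrast concrete. It counts how often the inner assignment is executed for a single text character, once with the f table and once with the next table. The helper name is hypothetical, m = 8 is an arbitrary choice, and the next table for $a^m$ (all zeros) is derived here from the definition of next rather than quoted from the text.

```python
def shifts_for_one_text_char(pattern, table, j, text_char):
    """Count executions of j := table[j] while text_char is being scanned."""
    p = " " + pattern                  # 1-based indexing
    count = 0
    while j > 0 and text_char != p[j]:
        j = table[j]
        count += 1
    return count


m = 8
pattern = "a" * m
f = [0] + [j - 1 for j in range(1, m + 1)]   # f[j] = j - 1, index 0 unused
nxt = [0] * (m + 1)                          # next[j] = 0 for every j (derived)

# Mismatch of text character b against pattern[m]:
print(shifts_for_one_text_char(pattern, f, m, "b"))    # 8, i.e. m
print(shifts_for_one_text_char(pattern, nxt, m, "b"))  # 1
```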
The subject of periods in strings has several interesting algebraic properties,
but a reader who is not mathematically inclined may skip to §6 since the following
material is primarily an elaboration of some additional structure related to the
above theorem.
LEMMA 1. If p and q are periods of $\alpha$, and $p + q \le |\alpha| + \gcd(p, q)$, then
$\gcd(p, q)$ is a period of $\alpha$.
Proof. Let $d = \gcd(p, q)$, and assume without loss of generality that
$d < p < q = p + r$. We have $\alpha[i] = \alpha[i+p]$ for $1 \le i \le |\alpha| - p$ and $\alpha[i] = \alpha[i+q]$ for
$1 \le i \le |\alpha| - q$; hence $\alpha[i+r] = \alpha[i+q] = \alpha[i]$ for $1 + r \le i + r \le |\alpha| - p$, i.e.,
    a b c d e f g h i j k a b c d
    a b c d a f g h i j k a b c d a
    a b c d a b g h i j k a b c d a b
    a b c d a b c h i j k a b c d a b c
    a b c d a b c d i j k a b c d a b c d
    a b c d a b c d a j k a b c d a b c d a
    a b c d a b c d a b k a b c d a b c d a b
    a b c d a b c d a b c a b c d a b c d a b c
    a b c a a b c a a b c a b c a a b c a a b c a
    a a c a a a c a a a c a a c a a a c a a a c a a
    a a a a a a a a a a a a a a a a a a a a a a a a a
FIG. 2
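The rows of Fig. 2 are consistent with being the most general strings of lengths 15 through 25 that have both 11 and 15 as periods; this reading is an inference from the figure itself, since the sentence introducing it is not present in the text as given here. Under that assumption the rows can be regenerated by merging positions that the two periods force to be equal. The function name is hypothetical, and each class of positions is named after its leftmost position so that the letters match the figure.

```python
def most_general_string(n, periods):
    """Most general string of length n having every p in periods as a period."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for p in periods:
        for i in range(n - p):              # position i must equal position i + p
            a, b = find(i), find(i + p)
            if a != b:
                parent[max(a, b)] = min(a, b)   # leftmost position represents the class
    # Name each class after its leftmost position (works while that index < 26).
    return "".join(chr(ord("a") + find(i)) for i in range(n))


for n in range(15, 26):
    s = most_general_string(n, (11, 15))
    print(n, len(set(s)), s)
```

The second printed column is the number of distinct symbols; it drops from 11 at length 15 to 1 at length 25 = 11 + 15 − gcd(11, 15), the threshold of Lemma 1.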
Note that the number of degrees of freedom, i.e., the number of distinct symbols,
decreases by 1 at each step. It is not difficult to prove that the number cannot
decrease by more than 1 as we go from $|\alpha| = n - 1$ to $|\alpha| = n$, since the only new
When m ≤ 2 this property is easily checked, and when m > 2 it is equivalent by
induction to
$\sigma(m, m+n) = \sigma(m, n)\,\sigma(m - (n \bmod m), m)$ for 0 < m < n, m > 2.
Set n mod m = r, $\lfloor n/m \rfloor$ = q, and apply property (i).
By properties (ii) and (iii) of this lemma, $\sigma(p, p+q)$ minus its last two
characters is the string of length p + q − 2 having periods p and q. Note that
Fibonacci strings are just a very special case, since $\phi_n = \sigma(F_{n-1}, F_n)$. Another
property of the $\sigma$ strings appears in [15]. A completely different proof of Lemma 1
and its optimality, and a completely different definition of $\sigma(m, n)$, were given by
Fine and Wilf in 1965 [7]. These strings have a long history going back at least to
the astronomer Johann Bernoulli in 1772; see [25, §2.13] and [21].
If $\alpha$ is any string, let $P(\alpha)$ be its shortest period. Lemma 1 implies that all
periods q which are not multiples of $P(\alpha)$ must be greater than $|\alpha| - P(\alpha) +
\gcd(q, P(\alpha))$. This is a rather strong condition in terms of the pattern matching
algorithm, because of the following result.
LEMMA 3. Let $\alpha$ = pattern[1] ... pattern[j − 1] and let a = pattern[j]. In the
pattern matching algorithm, $f[j] = j - P(\alpha)$, and next[j] = j − q, where q is the
smallest period of $\alpha$ which is not a period of $\alpha a$. (If no such period exists, next[j] = 0.)
If $P(\alpha)$ divides $P(\alpha a)$ and $P(\alpha a) < j$, then $P(\alpha) = P(\alpha a)$. If $P(\alpha)$ does not divide
$P(\alpha a)$ or if $P(\alpha a) = j$, then $q = P(\alpha)$.
Proof. The characterizations of f[j] and next[j] follow immediately from the
definitions. Since every period of $\alpha a$ is a period of $\alpha$, the only nonobvious
statement is that $P(\alpha) = P(\alpha a)$ whenever $P(\alpha)$ divides $P(\alpha a)$ and $P(\alpha a) \ne j$. Let
$P(\alpha) = p$ and $P(\alpha a) = mp$; then the (mp)th character from the right of $\alpha$ is a, as is
the ((m − 1)p)th, ..., as is the pth; hence p is a period of $\alpha a$. ∎
Lemma 3 shows that the j := next[j] loop will almost always terminate
quickly. If $P(\alpha) = P(\alpha a)$, then q must not be a multiple of $P(\alpha)$; hence by Lemma
1, $P(\alpha) + q \ge j + 1$. On the other hand $q > P(\alpha)$; hence $q > \frac{1}{2}j$ and next[j] < $\frac{1}{2}j$. In
the other case $q = P(\alpha)$, we had better not have q too small, since q will be a period
in the residual pattern after shifting, and next[next[j]] will be < q. To keep the
loop running it is necessary for new small periods to keep popping up, relatively
prime to the previous periods.
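The characterization in Lemma 3 can be checked by brute force on the example pattern of §1. The Python sketch below computes periods directly from definition (5) and builds both tables from them; it is an illustration only, far slower than the O(m) method, and the names is_period and lemma3_tables are hypothetical.

```python
def is_period(s, p):
    """Definition (5), with 0-based indices: s[i] == s[i + p] wherever both exist."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def lemma3_tables(pattern):
    """Build f and next from periods, as Lemma 3 describes (index 0 unused)."""
    m = len(pattern)
    f = [0] * (m + 1)
    nxt = [0] * (m + 1)
    for j in range(2, m + 1):
        alpha, alpha_a = pattern[:j - 1], pattern[:j]
        periods = [p for p in range(1, j) if is_period(alpha, p)]
        f[j] = j - periods[0]                    # periods[0] is P(alpha)
        q = next((p for p in periods if not is_period(alpha_a, p)), None)
        nxt[j] = j - q if q is not None else 0
    return f, nxt


f, nxt = lemma3_tables("abcabcacab")
print(f[1:])    # [0, 1, 1, 1, 2, 3, 4, 5, 1, 2]
print(nxt[1:])  # [0, 1, 1, 0, 1, 1, 0, 5, 0, 1]
```

Both rows coincide with the f and next tables given earlier for this pattern.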
6. Palindromes. One of the most outstanding unsolved questions in the
theory of computational complexity is the problem of how long it takes to
determine whether or not a given string of length n belongs to a given context-free
language. For many years the best upper bound for this problem was $O(n^3)$ in a
general context-free language as $n \to \infty$; L. G. Valiant has recently lowered this to
$O(n^{\log_2 7})$. On the other hand, the problem isn’t known to require more than order
n units of time for any particular language. This big gap between $O(n)$ and
$O(n^{2.81})$ deserves to be closed, and hardly anyone believes that the final answer
will be $O(n)$.
Let $\Sigma$ be a finite alphabet, let $\Sigma^*$ denote the strings over $\Sigma$, and let
$P = \{\alpha\alpha^R \mid \alpha \in \Sigma^*\}$.
Here $\alpha^R$ denotes the reversal of $\alpha$, i.e., $(a_1 a_2 \ldots a_n)^R = a_n \ldots a_2 a_1$. Each string
$\pi$ in P is a palindrome of even length, and conversely every even palindrome over
such that $\alpha$ begins with $\beta$, or we can prove that no such $\beta$ exists, in O(n) steps. Then
we can determine in O(n) time whether or not a given string is in $L^*$.
Proof. Let $\alpha$ be any string, and suppose that the time required to test for
nonempty prefixes in L is ≤ Kn for all large n. We begin by testing $\alpha$’s initial
subsequences of lengths 1, 2, 4, ..., $2^k$, ..., and finally $\alpha$ itself, until finding a
prefix in L or until establishing that $\alpha$ has no such prefix. In the latter case, $\alpha$ is not
in $L^*$, and we have consumed at most $(K + K_1) + (2K + K_1) + (4K + K_1) + \cdots +
(|\alpha|K + K_1) < 2Kn + K_1 \log_2 n$ units of time for some constant $K_1$. But if we find a
nonempty prefix $\beta \in L$ where $\alpha = \beta\gamma$, we have used at most $4|\beta|K + K_1(\log_2 |\beta|)$
units of time so far. By the left cancellation property, $\alpha \in L^*$ if and only if $\gamma \in L^*$,
and since $|\gamma| = n - |\beta|$ we can prove by induction that at most $(4K + K_1)n$ units of
time are needed to decide membership in $L^*$, when n > 0. ∎
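The doubling strategy of the proof can be sketched in Python. This is an illustration of the control structure only: the prefix test used here is a brute-force stand-in for the linear-time test that the lemma assumes, the language is taken to be the even palindromes P, and both function names are hypothetical. Because of the brute-force test and the string slicing, the sketch does not achieve the O(n) bound.

```python
def even_palindrome_prefix(s):
    """Length of the shortest nonempty even-palindrome prefix of s, or None.

    Brute-force stand-in for the prefix test assumed by the lemma.
    """
    for length in range(2, len(s) + 1, 2):
        if s[:length] == s[:length][::-1]:
            return length
    return None


def in_L_star(alpha, find_prefix=even_palindrome_prefix):
    """Decide membership in L* by repeatedly stripping a prefix that lies in L."""
    while alpha:
        n = len(alpha)
        size = 1
        while True:
            # Test initial segments of lengths 1, 2, 4, ..., and finally alpha itself.
            size = min(size, n)
            found = find_prefix(alpha[:size])
            if found is not None or size == n:
                break
            size *= 2
        if found is None:
            return False          # no nonempty prefix in L, so alpha is not in L*
        alpha = alpha[found:]     # left cancellation: continue with the remainder
    return True


print(in_L_star("aabb"), in_L_star("abba"), in_L_star("abab"))  # True True False
```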
COROLLARY. P* can be recognized in O(n) time.
Note that the related language
$P_1^* = \{\pi \in \Sigma^* \mid \pi = \pi^R \text{ and } |\pi| \ge 2\}^*$
cannot be handled by the above techniques, since it contains both a a a b b b and
a a a b b b b a; the fundamental theorem of palstars fails with a vengeance. It is
an open problem whether or not $P_1^*$ can be recognized in O(n) time, although we
suspect that it can be done. Once the reader has disposed of this problem, he or
she is urged to tackle another language which has recently been introduced by S.
A. Greibach [11], since the latter language is known to be as hard as possible; no
context-free language can be harder to recognize except by a constant factor.
7. Historical remarks. The pattern-matching algorithm of this paper was
discovered in a rather interesting way. One of the authors (J. H. Morris) was
implementing a text-editor for the CDC 6400 computer during the summer of
1969, and since the necessary buffering was rather complicated he sought a
method that would avoid backing up the text file. Using concepts of finite
automata theory as a model, he devised an algorithm equivalent to the method
presented above, although his original form of presentation made it unclear that
the running time was O(m + n). Indeed, it turned out that Morris’s routine was too
complicated for other implementors of the system to understand, and he discovered
several months later that gratuitous "fixes" had turned his routine into a
shambles.
In a totally independent development, another author (D. E. Knuth) learned
early in 1970 of S. A. Cook’s surprising theorem about two-way deterministic
pushdown automata [5]. According to Cook’s theorem, any language recognizable
by a two-way deterministic pushdown automaton, in any amount of time,
can be recognized on a random access machine in O(n) units of time. Since D.
Chester had recently shown that the set of strings beginning with an even
palindrome could be recognized by such an automaton, and since Knuth couldn’t
imagine how to recognize such a language in less than about $n^2$ steps on a
conventional computer, Knuth laboriously went through all the steps of Cook’s
construction as applied to Chester’s automaton. His plan was to "distill off" what
was happening, in order to discover why the algorithm worked so efficiently. After
pondering the mass of details for several hours, he finally succeeded in abstracting
the mechanism which seemed to be underlying the construction, in the special case
of palindromes, and he generalized it slightly to a program capable of finding the
longest prefix of one given string that occurs in another.
(Note added April, 1976, on the conjecture in §6 that $P_1^*$ can be recognized in O(n) time.)
Zvi Galil and Joel Seiferas have recently resolved this conjecture affirmatively.
This was the first time in Knuth’s experience that automata theory had taught
him how to solve a real programming problem better than he could solve it before.
He showed his results to the third author (V. R. Pratt), and Pratt modified Knuth’s
data structure so that the running-time was independent of the alphabet size.
When Pratt described the resulting algorithm to Morris, the latter recognized it as
his own, and was pleasantly surprised to learn of the O(m + n) time bound, which
he and Pratt described in a memorandum [22]. Knuth was chagrined to learn that
Morris had already discovered the algorithm, without knowing Cook’s theorem;
but the theory of finite-state machines had been of use to Morris too, in his initial
conceptualization of the algorithm, so it was still legitimate to conclude that
automata theory had actually been helpful in this practical problem.
The idea of scanning a string without backing up while looking for a pattern,
in the case of a two-letter alphabet, is implicit in the early work of Gilbert [10]
dealing with comma-free codes. It also is essentially a special case of Knuth’s
LR(0) parsing algorithm [16] when applied to the grammar
for each a in the alphabet,
where $\alpha$ is the pattern. Diethelm and Roizen [6] independently discovered the
idea in 1971. Gilbert and Knuth did not discuss the preprocessing to build the next
table, since they were mainly concerned with other problems, and the preprocessing
algorithm given by Diethelm and Roizen was of order $m^2$. In the case
of a binary (two-letter) alphabet, Diethelm and Roizen observed that the
algorithm of §3 can be improved further: we can go immediately to "char
matched" after j := next[j] in this case if next[j] > 0.
A conjecture by R. L. Rivest led Pratt to discover the $\log_\phi m$ upper bound on
pattern movements between successive input characters, and Knuth showed that
this was best possible by observing that Fibonacci strings have the curious
properties proved in §5. Zvi Galil has observed that a real-time algorithm can be
obtained by letting the text pointer move ahead in an appropriate manner while
the j pointer is moving down [9].
In his lectures at Berkeley, S. A. Cook had proved that P* was recognizable
in O(n log n) steps on a random-access machine, and Pratt improved this to O(n)
using a preliminary form of the ideas in §6. The slightly more refined theory in the
present version of §6 is joint work of Knuth and Pratt. Manacher [20] found
another way to recognize palindromes in linear time, and Galil [9] showed how to
improve this to real time. See also Slisenko [23].
It seemed at first that there might be a way to find the longest common
substring of two given strings, in time O(m + n); but the algorithm of this paper
does not readily support any such extension, and Knuth conjectured in 1970 that
such efficiency would be impossible to achieve. An algorithm due to Karp, Miller,
and Rosenberg [13] solved the problem in O((m + n) log (m + n)) steps, and this
tended to support the conjecture (at least in the mind of its originator). However,
Peter Weiner has recently developed a technique for solving the longest common
substring problem in O(m + n) units of time with a fixed alphabet, using tree
structures in a remarkable new way [26]. Furthermore, Weiner’s algorithm has
the following interesting consequence, pointed out by E. McCreight" a text file can
be processed (in linear time) so that it is possible to determine exactly how much of
a pattern is necessary to identify a position in the text uniquely; as the pattern is
being typed in, the system can interrupt as soon as it "knows" what the rest of
the pattern must be! Unfortunately the time and space requirements for Weiner’s
algorithm grow with increasing alphabet size.
If we consider the problem of scanning finite-state languages in general, it is
known [1, §9.2] that the language defined by any regular expression of length
m is recognizable in O(mn) units of time. When the regular expression has the
form
the algorithm we have discussed shows that only O(m + n) units of time are
needed (considering $\Sigma^*$ as a character of length 1 in the expression). Recent work
by M. J. Fischer and M. S. Paterson [8] shows that regular expressions of the form
This postscript was added by D. E. Knuth in March, 1976, because of developments which
occurred after preprints of this paper were distributed.
    while k ≤ n do
      begin
        j := m;
        while j > 0 and text[k] = pattern[j] do
          begin
            j := j − 1; k := k − 1;
          end;
        if j = 0 then
          begin
            match found at (k);
            k := k + m + 1
          end else
            k := k + max(d[text[k]], dd[j]);
      end;
This program calls match found at (k) for all 0 ≤ k ≤ n − m such that
pattern[1] ... pattern[m] = text[k + 1] ... text[k + m]. There are two precomputed
tables, namely
The d table can clearly be set up in O(q + m) steps, and the dd table can be
precomputed in O(m) steps using a technique analogous to the method in §2
above, as we shall see. The Boyer-Moore paper [4] contains further exposition of
the algorithm, including suggestions for highly efficient implementation, and gives
both theoretical and empirical analyses. In the remainder of this section we shall
show how the above methods can be used to resolve some of the problems left
open in [4].
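The search loop and its two tables can be sketched in Python as follows. This is a minimal illustration and not the paper's O(m) table construction: the helper names (`build_d`, `build_dd`, `boyer_moore_search`) are hypothetical, and the tables are built by brute force from the usual definitions, assumed here to be d[a] = min{s | s = m or (0 ≤ s < m and pattern[m − s] = a)} and dd[j] = min{s + m − j | s ≥ 1 and ((s ≥ i or pattern[i − s] = pattern[i]) for j < i ≤ m)}.

```python
def build_d(pattern):
    """d[a]: distance from the right end of the pattern to the rightmost a.

    Characters absent from the pattern are handled by the caller (shift m).
    """
    m = len(pattern)
    d = {}
    for s in range(m - 1, -1, -1):      # larger s first, so the smallest s wins
        d[pattern[m - s - 1]] = s       # pattern[m - s] in 1-indexed notation
    return d


def build_dd(pattern):
    """Brute-force dd[1..m] straight from the (assumed) definition; index 0 unused."""
    m = len(pattern)
    p = " " + pattern                   # shift to 1-indexed access
    dd = [0] * (m + 1)
    for j in range(1, m + 1):
        s = 1
        # s = m always satisfies the condition, so this loop terminates
        while not all(s >= i or p[i - s] == p[i] for i in range(j + 1, m + 1)):
            s += 1
        dd[j] = s + m - j
    return dd


def boyer_moore_search(pattern, text):
    """Return every k with text[k:k+m] == pattern (pattern must be nonempty)."""
    m, n = len(pattern), len(text)
    p, t = " " + pattern, " " + text    # 1-indexed, as in the pseudocode
    d, dd = build_d(pattern), build_dd(pattern)
    found = []
    k = m
    while k <= n:
        j = m
        while j > 0 and t[k] == p[j]:   # scan right to left
            j -= 1
            k -= 1
        if j == 0:
            found.append(k)
            k += m + 1
        else:
            k += max(d.get(t[k], m), dd[j])
    return found


print(boyer_moore_search("abcabcacab", "babcbabcabcaabcabcabcacabc"))
```

The brute-force `build_dd` takes cubic time in the worst case; it is meant only to make the definition concrete, while the search loop itself mirrors the program above line for line.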
First let us improve the original Boyer-Moore algorithm slightly by replacing
dd[j] by
\[ dd'[j] = \min\{\, s + m - j \mid s \ge 1 \text{ and } (s \ge j \text{ or } \text{pattern}[j-s] \ne \text{pattern}[j]) \text{ and } ((s \ge i \text{ or } \text{pattern}[i-s] = \text{pattern}[i]) \text{ for } j < i \le m) \,\}. \]
(This is analogous to using next[j] instead of f[j]; Boyer and Moore [4] credit the
improvement in this case to Ben Kuipers, but they do not discuss how to
determine dd’ efficiently.) The following program shows how the dd’ table can be
precomputed in O(m) steps; for purposes of comparison, the program also shows
how to compute dd, which actually turns out to require slightly more operations
than dd’
In practice one would, of course, compute only dd’, suppressing all references to
dd. The example in Table 2 illustrates most of the subtleties of this algorithm.
                          TABLE 2
    j          =   1   2   3   4   5   6   7   8   9  10  11
    pattern[j] =   b   a   d   b   a   c   b   a   c   b   a
    f[j]       =  10  11   6   7   8   9  10  11  11  11  12
    dd[j]      =  19  18  17  16  15   8   7   6   5   4   1
    dd'[j]     =  19  18  17  16  15   8  13  12   8  12   1
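As a cross-check on the dd' row of Table 2, the definition of dd'[j] given above can be evaluated directly. The sketch below is a brute-force evaluation (cubic time in the worst case), not the O(m) precomputation the text refers to; the function name `dd_prime_bruteforce` is hypothetical.

```python
def dd_prime_bruteforce(pattern):
    """Evaluate dd'[1..m] directly from its definition; index 0 is unused."""
    m = len(pattern)
    p = " " + pattern                        # 1-indexed access
    table = [0] * (m + 1)
    for j in range(1, m + 1):
        s = 1
        while True:
            # the character brought under the mismatch must differ from pattern[j]
            differs_at_j = s >= j or p[j - s] != p[j]
            # the already matched suffix must still match after shifting by s
            suffix_agrees = all(s >= i or p[i - s] == p[i]
                                for i in range(j + 1, m + 1))
            if differs_at_j and suffix_agrees:
                break                        # s = m always qualifies
            s += 1
        table[j] = s + m - j
    return table


print(dd_prime_bruteforce("badbacbacba")[1:])
# → [19, 18, 17, 16, 15, 8, 13, 12, 8, 12, 1]
```

For the pattern of Table 2 the definition gives dd'[11] = 1, since a shift of one place already puts a character different from pattern[11] under the mismatch.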
To prove correctness, one may show first that f[j] is analogous to the f[j] in § 2,
but with right and left of the pattern reversed; namely f[m] = m + 1, and for j < m
we have
being matched for the last time. Clearly $\sum_k (m'_k - m''_k) \le n$, so it remains to show that
$\sum_k m''_k \le 3n$.
Let $p_k$ be the amount by which the pattern would shift after the kth stage if the
d[a] heuristic were not present ($d[a] = -\infty$); then $p_k \le s_k$, and $p_k$ is a period of the
string matched at stage k.
Consider a value of k such that $m''_k > 0$, and suppose that the text characters
matched during the kth stage form the string $\alpha = \alpha_1\alpha_2$ where $|\alpha| = m_k$ and
$|\alpha_2| = m''_k + 2s_k$; hence the text characters in $\alpha_1$ are matched for the last time. Since
the pattern does not occur in the text, it must end with $x\alpha$ and the text scanned so
far must end with $z\alpha$, where $x \ne z$. At this point the algorithm will shift the pattern
right $s_k$ positions and will enter stage k + 1. We distinguish two cases: (i) The
pattern length m exceeds $m_k + p_k$. Then the pattern can be written $\theta\beta\alpha$, where
$|\beta| = p_k$; the last character of $\beta$ is x and the last character of $\theta$ is $y \ne x$, by definition
of dd'. Otherwise (ii) $m \le m_k + p_k$; the pattern then has the form $\beta\alpha$, where
$|\beta| \le p_k \le s_k$. By definition of $m''_k$ and the assumption that the pattern does not
occur in the text, we have $|\beta\alpha| > s_k + |\alpha_2|$, i.e., $|\alpha_1| > s_k - |\beta|$. In both cases (i) and
(ii), $p_k$ is a period of $\beta\alpha$.
Now consider the first subsequent stage k' during which the leftmost of the
$m''_k$ text characters tapped during stage k is matched again; we shall write $k \to k'$ when
the stages are in this relation. Suppose the mismatch occurs this time when text
character z' fails to match pattern character x'. If z' occurs in the text within $\alpha_1$,
regarding $\alpha$ as fixed in its stage k position, then x' cannot be within $\beta\alpha$, where $\beta\alpha$
now occurs in the stage k' position of the pattern, since $p_k$ is a period of $\beta\alpha$ and the
character $p_k$ positions to the right of x' is a z' (it matches a z' in the text). Thus x'
now appears within $\theta$. On the other hand, if z' occurs to the left of $\alpha$, we must have
$|\alpha_1| = 0$, since the characters of $\alpha_1$ are never matched again. In either event, case
(ii) above proves to be impossible. Hence case (i) always occurs when $m''_k > 0$, and
x' always appears within $\theta$.
To complete the argument, we shall show that $\sum_{k \to k'} m''_k$, for all fixed k', is at
most $3s_{k'}$. Let $p' = p_{k'}$ and let $\alpha'$ denote the pattern matched at stage k'. Let
$k_1 < \cdots < k_r$ be the values of k such that $k \to k'$. If $|\alpha'| + p' \le m$, let $\beta'\alpha'$ be the
rightmost $p' + |\alpha'|$ characters of the pattern. Otherwise let $\alpha''$ be the leftmost
$|\alpha'| + p' - m$ characters of $\alpha'$; and let $\beta'\alpha'$ be $\alpha''$ followed by the pattern. Note that
in both cases $\alpha'$ is an initial substring of $\beta'\alpha'$ and $|\beta'| = p'$. In both cases, the actions
of the algorithm during stages $k_1 + 1$ through k' are completely known if we are
given the pattern and $\beta'$, and if we know z' and the place within $\beta'$ where stage
$k_1 + 1$ starts matching. This follows from the fact that $\beta'$ by itself determines
the text, so that if we match the pattern against the string $z'\beta'\beta'\beta'\ldots$ (starting at
the specified place for stage $k_1 + 1$) until the algorithm first tries to match z' we will
know the length of $\alpha'$. (If $|\alpha'| < p'$ then $\beta'$ begins with $\alpha'$ and this statement holds
trivially; otherwise, $\alpha'$ begins with $\beta'$ and has period p'; hence $\beta'\beta'\beta'\ldots$ begins
with $\alpha'$.) Note that the algorithm cannot begin two different stages at exactly the
same position within $\beta'$, for then it would loop indefinitely, contradicting the fact
that it does terminate. This property will be our key tool for proving the desired
result.
Let the text strings matched during stages $k_1, \ldots, k_r$ be $\alpha_1, \ldots, \alpha_r$, and let
their periods determined as in case (i) be $p_1, \ldots, p_r$ respectively; we have $p_j < \tfrac12|\alpha_j|$
for $1 \le j \le r$. Suppose that during stage $k_j$ the mismatch of $x_j \ne z_j$ implies that the
pattern ends with $y_j\beta_j\alpha_j$, where $|\beta_j| = p_j$. We shall prove that $|\alpha_1| + \cdots + |\alpha_r| \le 3p'$.
First let us prove that $|\alpha_j| < p'$ for all j: We have observed that x' always occurs
within $\theta_j$; hence $y_j\beta_j\alpha_j$ occurs as a rightmost substring of $x'\alpha'$. If $|\alpha_j| \ge p'$ then
$p_j + p' \le |\beta_j\alpha_j|$; hence the character $p_j$ positions to the right of $y_j$ in $x'\alpha'$ is $x_j$, as is
the character $p_j + p'$ positions to the right of $y_j$. But the character p' positions to the
right of $y_j$ in $x'\alpha'$ is a $y_j$, since p' is a period of $x'\alpha'$; hence the character $p' + p_j$
positions to the right of $y_j$ is also $y_j$, contradicting $x_j \ne y_j$.
Since $|\alpha_j| < p'$, each string $\alpha_j$ for $j \ge 2$ appears somewhere within $\beta'$, when $\beta'$
is regarded as a cyclic string, joined end-for-end. (It follows from the definition of
$k \to k'$ that $z_j\alpha_j$ is a substring of $\alpha'$ for $j \ge 2$.) We shall prove that the rightmost
halves of these strings, namely the rightmost $\lceil \tfrac12|\alpha_j| \rceil$ characters as they appear in
$\beta'$, are disjoint. This implies that $\tfrac12|\alpha_2| + \cdots + \tfrac12|\alpha_r| \le p'$, and the proof will be
complete (since $|\alpha_1| \le p'$).
Suppose therefore that the right half of the appearance of $\alpha_i$ overlaps the
right half of the appearance of $\alpha_j$ within $\beta'$, for some distinct $i, j \ge 2$, where the rightmost
character of $\alpha_j$ is within $\alpha_i$. This means that the algorithm at stage $k_j$ begins to
match characters starting within $\alpha_i$ at least $p_i$ characters to the right of $z_i$ where
$z_i\alpha_i$ appears in $\beta'$, when the text $\alpha'$ is treated modulo p'. (Recall that $p_i < \tfrac12|\alpha_i|$.)
The pattern ends with $x_i\alpha_i$, and $p_i$ is a period of $x_i\alpha_i$. The algorithm must work
correctly when the text equals the pattern, so there must come a stage, before
shifting the pattern to the right of the appearance of $\alpha_i$, where the algorithm scans
left until hitting $z_i$. At this point, call it stage k'', there must be a mismatch of
$z_i \ne x_i$, since $p_i$ or more characters have been matched. (The character $p_i$ positions
to the right of $z_i$ is $x_i$, by periodicity.) Hence k'' < k'; and it follows that $k'' = k_j$. (If
$k'' > k_j$ we have $z_j\alpha_j$ entirely contained within $\alpha''$, but then $k_j \to k'$ implies that
k'' = k'.) Now $k'' = k_j$ implies that $z_i = z_j$ and $x_i = x_j$. We shall obtain a contradiction
by showing that the algorithm "synchronizes" its stage $k_i + 1$ behavior with its
stage $k_j + 1$ behavior, modulo p', causing an infinite loop as remarked above. The
main point is that the dd' table will specify shifting the pattern $p_i$ steps, so that $y_i$ is
brought into the position corresponding to $z_i$, in stage $k_j$ as well as in stage $k_i$. (Any
lesser shift brings an $x_i$ into position $p_i$ spaces to the right of $z_i$; hence it puts $y = x_i$
into the position corresponding to $z_i$, by periodicity, contradicting $x_i \ne y$.) The
amount of shift depends on the maximum of the d and dd' entries, and the d entry
will be chosen (in either $k_j$ or $k_i$) if and only if $z_i$ is not a character of $\beta_i$; but in this
case, the d entry will also specify the same shift both for stage $k_i$ and stage $k_j$.
The constant 6 in the above theorem is probably much too large, and the
above proof seems to be much too long; the reader is invited to improve the
theorem in either or both respects. An interesting example of the rather complex
behavior possible with this algorithm occurs when the pattern is bO and the text is
Oa4r for large r, where
COROLLARY. The worst case running time of the Boyer-Moore algorithm with
dd' replacing dd is O(n + rm) character comparisons, if the pattern occurs r times in
the text.
Proof. Let T(n, r) be the worst case running time as a function of n and r,
when m is fixed. The theorem implies that $T(n, 0) \le 7n$, counting the mismatched
characters as well as the matched ones. Furthermore, if r > 0 and if the first
appearance of the pattern ends at position $n_0$ we have $T(n, r) \le 7(n_0 - 1) + m
+ T(n - n_0 + m - 1, r - 1)$. It follows that $T(n, r) \le 7n + 8rm - 14r$. □
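The last step of this proof is an induction on r. The following worked derivation is a sketch of that step, assuming only the bound $T(n, 0) \le 7n$ and the recurrence stated in the proof, with $n_0$ the position where the first appearance of the pattern ends:

```latex
\begin{align*}
T(n, r) &\le 7(n_0 - 1) + m + T(n - n_0 + m - 1,\; r - 1) \\
        &\le 7(n_0 - 1) + m + 7(n - n_0 + m - 1) + 8(r - 1)m - 14(r - 1) \\
        &= 7n + 8rm - 14r + (-7 - 7 + 14) + (m + 7m - 8m) \\
        &= 7n + 8rm - 14r .
\end{align*}
```

The base case r = 0 is the theorem itself, and the leftover constants cancel exactly, so the bound holds with equality in the inductive step.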
When the Boyer-Moore algorithm implicitly shifts the pattern to the right, it
forgets all it "knows" about characters already matched; this is why the linearity
theorem is not trivial. A more complex algorithm can be envisaged, with a finite
number of states corresponding to which text characters are known to match the
pattern in its current position; when in state q we fetch the character
x := text[k − t[q]], then we set k := k + s[q, x] and go to state q'[q, x]. For example, consider
the pattern a b a c b a b a, and the specification of t, s, and q' in Table 3; exactly 41
distinguishable states can arise. An asterisk (*) in that table shows where the
pattern has been fully matched.
The number of states in this generalization of the Boyer-Moore algorithm
can be rather large, as the example shows, but the patterns which occur most often
in practice probably do not imply many states. The number of states is always less
than $2^m$, and perhaps a much smaller upper bound is possible; it is unclear which
patterns of a given length lead to the most states, and it does not seem obvious that
this maximum number of states is exponential in m.
If the characters of the pattern are distinct, say $a_1 a_2 \ldots a_m$, this generalization
of the Boyer-Moore algorithm leads to exactly $\tfrac12(m^2 + m)$ states. (Namely,
all states of the form $\bullet \ldots \bullet\, a_k\, \bullet \ldots \bullet\, a_{j+1} \ldots a_m$, for $0 \le k < j \le m$, with $a_k$
suppressed if k = 0.) By merging several of these states we obtain the following
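simple algorithm, which uses a table c[x] where
\[ c[x] = \begin{cases} m - j, & \text{if } x = a_j; \\ -1, & \text{if } x \notin \{a_1, \ldots, a_m\}. \end{cases} \]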
The algorithm works only when all pattern characters are distinct, but it improves
slightly on the Boyer-Moore technique in this important special case.
    j := k := m;
    while k ≤ n do
      begin i := c[text[k]];
        if i < 0 then j := m
        else if i = 0 then
          begin for i := 1 step 1 until m − 1 do
              if text[k − i] ≠ pattern[m − i] then go to nomatch;
            match found at (k − m);
          nomatch: j := m;
          end else if i + j ≥ m then j := i else j := m;
        k := k + j;
      end;
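The distinct-character algorithm can be sketched in Python as follows. The function name `search_distinct` is hypothetical, and the table c is an assumption inferred from how the program uses it: c[x] = m − j when x is the jth pattern character, and a negative value when x does not occur in the pattern.

```python
def search_distinct(pattern, text):
    """Find all k with text[k:k+m] == pattern, for a pattern of distinct characters."""
    m, n = len(pattern), len(text)
    if m == 0 or len(set(pattern)) != m:
        raise ValueError("pattern must be nonempty with all characters distinct")
    p, t = " " + pattern, " " + text             # 1-indexed, as in the pseudocode
    c = {ch: m - idx for idx, ch in enumerate(pattern, start=1)}
    found = []
    j = k = m                                    # j remembers the previous shift
    while k <= n:
        i = c.get(t[k], -1)
        if i < 0:                                # character not in the pattern
            j = m
        elif i == 0:                             # last pattern character: verify the rest
            if all(t[k - h] == p[m - h] for h in range(1, m)):
                found.append(k - m)
            j = m
        elif i + j >= m:                         # shifting by i is consistent with what we know
            j = i
        else:                                    # text[k - j] rules out every shift below m
            j = m
        k += j
    return found


print(search_distinct("abc", "ababcabc"))        # → [2, 5]
```

The test `i + j >= m` uses the one fact carried over from the previous step: after a shift of j < m the character at text[k − j] is known to be the pattern character at position m − j, and a shift of i would place a different pattern character under it unless that position falls off the left end of the pattern.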
Let us close this section by making a preliminary investigation into the
question of "fastest" pattern matching in strings, i.e., optimum algorithms. What
algorithm minimizes the number of text characters examined, over all conceivable
algorithms for the problem we have been considering? In order to make this
question nontrivial, we shall ask for the minimum average number of characters
examined when finding all occurrences of the pattern in the text, where the
average is taken uniformly with respect to strings of length n over a given
alphabet. (The minimum worst case number of characters examined is of no
interest, since it is between n − m and n for all patterns³; therefore we ask for the
minimum average number. It might be argued that the minimum average number,
taken over random strings, is of little interest, since people rarely search in
random strings; they usually search for patterns that actually appear. However,
the random-string model is a reasonable approximation when we consider those
stretches of text that do not contain the pattern, and the algorithm obviously must
examine every character in those places where the pattern does occur.)
The case of patterns of length 2 can be solved exactly; it is somewhat
surprising to find that the analysis is not completely trivial even in this case.
Consider first the pattern a b where a ≠ b. Let q be the alphabet size, q ≥ 2. Let
f(n) denote the minimum average number of characters examined by an
algorithm which finds all occurrences of the pattern in a random text of length n;
and let g(n) denote the minimum average number of characters examined in a
random text of length n + 1 which is known to begin with a, not counting the
examination of the known first character. These functions can be computed by the
following recurrence relations:
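\[ f(0) = f(1) = g(0) = 0, \qquad g(1) = 1, \]
\[ f(n) = 1 + \min_{1 \le k \le n} \Bigl( \tfrac{1}{q}\bigl(g(k-1) + g(n-k)\bigr) + \bigl(1 - \tfrac{1}{q}\bigr)\bigl(f(k-1) + f(n-k)\bigr) \Bigr), \]
\[ g(n) = 1 + \tfrac{1}{q}\, g(n-1) + \bigl(1 - \tfrac{1}{q}\bigr) f(n-1), \qquad n \ge 2. \]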
The recurrence for f follows by considering which character is examined first; the
recurrence for g follows from the fact that the second character must be examined
in any case, so it can be examined first without loss of efficiency. It can be shown
that the minimum is always assumed for k = 2; hence we obtain the closed form
solution
\[ f(n) = \frac{n(q^2 + q - 1)}{q(2q - 1)} - \frac{(q - 1)(q^2 + 2q - 1)}{q(2q - 1)^2} + \frac{(1 - q)^n}{q^{n-3}(q - 1)(2q - 1)^2} \]
for n ≥ 1. (The minimum over 1 ≤ k ≤ n occurs for k = 2, whenever n ≥ 2 and q ≥ 2.)
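The closed form can be checked numerically against the recurrences. The sketch below uses exact rational arithmetic and assumes the recurrences in the form f(n) = 1 + min over 1 ≤ k ≤ n of (1/q)(g(k−1) + g(n−k)) + (1 − 1/q)(f(k−1) + f(n−k)) and g(n) = 1 + (1/q)g(n−1) + (1 − 1/q)f(n−1), with f(0) = f(1) = g(0) = 0 and g(1) = 1; the function names are hypothetical.

```python
from fractions import Fraction


def min_average_tables(q, n_max):
    """Tabulate f(0..n_max) and g(0..n_max) from the recurrences, exactly."""
    inv = Fraction(1, q)
    f = [Fraction(0), Fraction(0)]
    g = [Fraction(0), Fraction(1)]
    for n in range(2, n_max + 1):
        best = min(inv * (g[k - 1] + g[n - k]) + (1 - inv) * (f[k - 1] + f[n - k])
                   for k in range(1, n + 1))
        g_n = 1 + inv * g[n - 1] + (1 - inv) * f[n - 1]
        f.append(1 + best)
        g.append(g_n)
    return f, g


def f_closed(q, n):
    """Closed form for f(n), intended for n >= 1."""
    q = Fraction(q)
    return (n * (q * q + q - 1) / (q * (2 * q - 1))
            - (q - 1) * (q * q + 2 * q - 1) / (q * (2 * q - 1) ** 2)
            + (1 - q) ** n / (q ** (n - 3) * (q - 1) * (2 * q - 1) ** 2))


f, g = min_average_tables(2, 5)
print([str(x) for x in f])      # f(0..5) over a binary alphabet
print(str(f_closed(2, 5)))      # agrees with f(5)
```

With these recurrences one also finds f(n) = g(n − 1) + 1/q for n ≥ 2, which is how the two sequences collapse into a single linear recurrence with characteristic roots 1 and (1 − q)/q.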
³ This is clear when we must find all occurrences of the pattern; R. L. Rivest has recently proved it
also for algorithms which stop after finding one occurrence. (Information Processing Letters, to appear.)
TABLE 3. Entries s[q, x], q'[q, x]. [Table body not recoverable.]
\[ f(n) = 1 + \min_{1 \le k \le n} \Bigl( \tfrac{1}{q}\bigl(g(k-1) + g(n-k)\bigr) + \bigl(1 - \tfrac{1}{q}\bigr)\bigl(f(k-1) + f(n-k)\bigr) \Bigr), \qquad n \ge 2; \]
the sense of minimum average text characters inspected to find all matches in a
random string:
    k := 2;
    while k ≤ n do
      begin c := text[k];
        if c = pattern[2] and text[k − 1] = pattern[1]
          then match found at (k − 2);
        while c = pattern[1] do
          begin k := k + 1; c := text[k];
            if c = pattern[2] then match found at (k − 2);
          end;
        k := k + 2;
      end;
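A Python sketch of the length-2 procedure follows. The function name `search_length2` is hypothetical, and the sketch adds an explicit bounds check where the pseudocode reads text[k] after incrementing k, on the assumption that the original relies on some character being available just past the end of the text.

```python
def search_length2(pattern, text):
    """Find all k with text[k:k+2] == pattern, inspecting few characters."""
    if len(pattern) != 2:
        raise ValueError("pattern must have length 2")
    n = len(text)
    t = " " + text                       # 1-indexed, as in the pseudocode
    first, second = pattern[0], pattern[1]
    found = []
    k = 2
    while k <= n:
        c = t[k]
        # text[k - 1] is inspected only when text[k] could end a match
        if c == second and t[k - 1] == first:
            found.append(k - 2)
        # while the current character could start a match, look one step ahead
        while c == first:
            k += 1
            if k > n:                    # guard added in this sketch
                break
            c = t[k]
            if c == second:
                found.append(k - 2)
        k += 2                           # text[k + 1] is skipped for now
    return found


print(search_length2("ab", "aababb"))    # → [1, 3]
print(search_length2("aa", "aaa"))       # → [0, 1]
```

The same loop handles both a pattern of two distinct characters and a pattern of two equal characters, since it refers only to pattern[1] and pattern[2].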
For patterns of length 3 the recurrence relations become more complex; they
depend on more than simply the length of the strings and knowledge about
characters at the boundaries. The determination of an optimum strategy in this
case remains an open problem. The algorithm sketched at the beginning of this
section shows that an average of O(n(log m)/m) bit inspections suffices over a
binary alphabet. Clearly $\lfloor n/m \rfloor$ is a lower bound, since the algorithm must inspect
at least one bit in any block of m consecutive bits. The pattern $a^m$ can be handled
with O(n/m) bit inspections on the average; but it seems reasonable to conjecture
that patterns of length m exist for arbitrarily large m, such that an average of at
least cn(log m)/m bits must be inspected for all large n. Here c denotes a positive
constant, independent of m and n.
REFERENCES
[1] ALFRED V. AHO, JOHN E. HOPCROFT AND JEFFREY D. ULLMAN, The Design and Analysis
of Computer Algorithms, Addison-Wesley, Reading, Mass., 1974.
[2] ALFRED V. AHO AND MARGARET J. CORASICK, Efficient string matching: An aid to
bibliographic search, Comm. ACM, 18 (1975), pp. 333-340.
[3] M. BEELER, R. W. GOSPER AND R. SCHROEPPEL, HAKMEM, Memo No. 239, M.I.T.
Artificial Intelligence Laboratory, Cambridge, Mass., 1972.
[4] ROBERT S. BOYER AND J. STROTHER MOORE, A fast string searching algorithm, manuscript
dated December 29, 1975; Stanford Research Institute, Menlo Park, Calif., and Xerox Palo
Alto Research Center, Palo Alto, Calif.
[5] S. A. COOK, Linear time simulation of deterministic two-way pushdown automata, Information
Processing 71, North-Holland, Amsterdam, 1972, pp. 75-80.
[6] PASCAL DIETHELM AND PETER ROIZEN, An efficient linear search for a pattern in a string,
unpublished manuscript dated April, 1972; World Health Organization, Geneva, Switzer-
land.
[7] N. J. FINE AND H. S. WILF, Uniqueness theorems for periodic functions, Proc. Amer. Math. Soc.,
16 (1965), pp. 109-114.
[8] MICHAEL J. FISCHER AND MICHAEL S. PATERSON, String matching and other products,
SIAM-AMS Proc., vol. 7, American Mathematical Society, Providence, R.I., 1974,
pp. 113-125.
[9] ZvI GALIL, On converting on-line algorithms into real-time and on real-time algorithms for
string-matching and palindrome recognition, SIGACT News, 7 (1975), No. 4, pp. 26-30.
[10] E. N. GILBERT, Synchronization of binary messages, IRE Trans. Information Theory, IT-6
(1960), pp. 470-477.
[11] SHEILA A. GREIBACH, The hardest context-free language, this Journal, 2 (1973), pp. 304-310.
[12] MALCOLM C. HARRISON, Implementation of the substring test by hashing, Comm. ACM, 14
(1971), pp. 777-779.
[13] RICHARD M. KARP, RAYMOND E. MILLER AND ARNOLD L. ROSENBERG, Rapid identification
of repeated patterns in strings, trees, and arrays, ACM Symposium on Theory of
Computing, vol. 4, Association for Computing Machinery, New York, 1972, pp. 125-136.
[14] DONALD E. KNUTH, Fundamental Algorithms, The Art of Computer Programming, Vol. 1,
Addison-Wesley, Reading, Mass., 1968; 2nd edition 1973.
[15] ———, Sequences with precisely k + 1 k-blocks, Solution to problem E2307, Amer. Math.
Monthly, 79 (1972), pp. 773-774.
[16] ———, On the translation of languages from left to right, Information and Control, 8 (1965), pp.
607-639.
[17] ———, Structured programming with go to statements, Computing Surveys, 6 (1974),
pp. 261-301.
[18] DONALD E. KNUTH, JAMES H. MORRIS, JR. AND VAUGHAN R. PRATT, Fast pattern
matching in strings, Tech. Rep. CS440, Computer Science Department, Stanford Univ.,
Stanford, Calif., 1974.
[19] R. C. LYNDON AND M. P. SCHÜTZENBERGER, The equation $a^M = b^N c^P$ in a free group,
Michigan Math. J., 9 (1962), pp. 289-298.
[20] GLENN MANACHER, A new linear-time on-line algorithm for finding the smallest initial
palindrome of a string, J. Assoc. Comput. Mach., 22 (1975), pp. 346-351.
[21] A. MARKOFF, Sur une question de Jean Bernoulli, Math. Ann., 19 (1882), pp. 27-36.
[22] J. H. MORRIS, JR. AND VAUGHAN R. PRATT, A linear pattern-matching algorithm, Tech. Rep.
40, Univ. of California, Berkeley, 1970.
[23] A. O. SLISENKO, Recognition of palindromes by multihead Turing machines, Dokl. Steklov
Math. Inst., Akad. Nauk SSSR, 129 (1973), pp. 30-202. (In Russian.)
[24] KEN THOMPSON, Regular expression search algorithm, Comm. ACM, 11 (1968), pp. 419-422.
[25] B. A. VENKOV, Elementary Number Theory, Wolters-Noordhoff, Groningen, the Netherlands,
1970.
[26] PETER WEINER, Linear pattern matching algorithms, IEEE Symposium on Switching and
Automata Theory, vol. 14, IEEE, New York, 1973, pp. 1-11.