String Searching Over Small Alphabets
TABLE I. Shift table on the left and the previously matched and the current reading positions for the pattern "CABAB" on the right. A "*" indicates the current reading position; "X" marks the already matched characters. [Table body not recoverable from the source.]

TABLE II. State transition table for the pattern "CABAB". The triplets contain the shift amount, the next state index, and the new reading position, in that order. Bold typeset indicates a full match. [Table body not recoverable from the source.]
To the right of the shift table we also indicate the previously matched character positions and the current reading position as an aid to the reader. We use the reading position as an index when performing the lookup; clearly, the number of matched characters could also have been used by reversing the row order. We chose this representation because of its closer visual resemblance to the corresponding state transition table of our improved algorithm.

Here, we are not primarily concerned about the cost of the preprocessing of the pattern. Note, however, that the tables used in the original version of the Boyer–Moore algorithm are linear in the pattern length m and can be computed in O(m + σ) time (if the number of bits needed to represent the entries is also considered, the complexity grows by a logarithmic factor). The shift table of Algorithm 1 has mσ entries, where σ denotes the alphabet size. A naive implementation can compute the table in O(m²σ) complexity, but one may hope for an O(mσ) solution.

B. State transition table

In this section, we modify Algorithm 1 to use the lookup table to do more of the bookkeeping needed inside the loop. The advantage of the new formulation will become apparent when we introduce our improved algorithm. One may also expect the additional setup costs to be well compensated by the more compact loop body.

In effect, we realize a finite automaton with the lookup table. The Knuth–Morris–Pratt algorithm [10] and variants of the Boyer–Moore algorithm are naturally implemented with a finite automaton [2]. We emphasize that in the latter case the characters of the input text are not read in sequential order.

Algorithm 2 The Boyer–Moore string search algorithm implemented with a state transition table.
Input: m character long pattern P, n character long text T.
Output: The first matching position, if any.
 1: Calculate the state transition table D from the pattern P.
 2: Set pos = 0, state = 0 and rpos = m - 1.
 3: while pos ≤ n - m do
 4:   Read character T[pos + rpos] into c.
 5:   (shift, state, rpos, match) = D[state, c].
 6:   if match then
 7:     return pos.
 8:   end if
 9:   pos = pos + shift.
10: end while
11: return no match.

The new table of Algorithm 2 determines not only the shift amount, but also the next state, the new reading position and whether or not a full match has been found. Using the pattern "CABAB", the state transition table appears in Table II. Notice that the transition from every state is always made either to the next state (according to the indexing), or to the starting state, which has index zero.

Algorithm 2 can be easily adapted to find all matches of the pattern in the text; the state transition table does not need to be modified.

During the execution of the Boyer–Moore algorithm a mismatch moves the relative reading position to the last character of the pattern. No information is kept about the previously read (and matched) characters, unless they are in a single block at the end of the pattern. This suggests that for random texts and patterns, the shift amount is linear in the alphabet size [3], [13] (and limited by the pattern length).

One natural attempt for improvement is to allow the algorithm to remember the previously matched blocks, facilitating longer shifts. A more complicated preprocessing of the pattern (producing an automaton) or building a data structure on the fly is needed in this case. Our experiments indicate an improvement of a constant factor in the expected shift amounts for every additional remembered block. Therefore, for longer patterns we need more and more complicated bookkeeping and/or memory management in order to have the shift amount increase with the pattern size.

One of the reasons for the success of the classical Boyer–Moore algorithm is that it allows very fast implementations using a tight loop with a small memory footprint. The memory requirements and the additional processing needed by the larger data structures mentioned above (see the related [6]) can diminish the gains provided by the larger shift amounts.

C. New algorithm

Our proposed algorithm manages to achieve larger shifts, remembering only two blocks of the previously matched characters, hence limiting the bookkeeping requirements and memory use.

In case of a mismatch and subsequent shift, we do not move the relative reading position to the end of the pattern. Instead, we attempt to read a character at one end of a block of already matched characters, thereby extending the block. We pick the right end, if possible.

The shift may cause a matching block to reach or pass the beginning of the pattern.
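Before turning to the details of the new algorithm, the table-driven loop of Algorithm 2 can be made concrete with a short Python sketch. This is an illustration rather than the authors' implementation: the state is simply the number of pattern characters already matched at the right end of the alignment, each table entry holds the shift amount, next state, new reading position and a match flag (as in Table II), a naive quadratic construction stands in for the preprocessing, and the names (`build_table`, `bm_search`) are our own.

```python
def build_table(P, alphabet):
    """Naive construction of the transition table for Algorithm 2.

    State k means: the last k characters of the pattern already match.
    Entry (shift, next_state, next_rpos, match) is indexed by (k, c)."""
    m = len(P)
    D = {}
    for k in range(m):
        rpos = m - 1 - k                       # position to be read next
        for c in alphabet:
            if c == P[rpos]:                   # successful comparison
                if k + 1 == m:
                    D[k, c] = (0, 0, m - 1, True)         # full match
                else:
                    D[k, c] = (0, k + 1, rpos - 1, False)
            else:                              # mismatch: minimal safe shift
                s = 1
                while not all(j - s < 0 or P[j - s] == (c if j == rpos else P[j])
                              for j in range(rpos, m)):
                    s += 1
                D[k, c] = (s, 0, m - 1, False)
    return D

def bm_search(P, T, alphabet):
    """The tight loop of Algorithm 2; alphabet must cover every text character."""
    D = build_table(P, alphabet)
    m, n = len(P), len(T)
    pos, state, rpos = 0, 0, m - 1
    while pos <= n - m:
        shift, state, rpos, match = D[state, T[pos + rpos]]
        if match:
            return pos
        pos += shift
    return -1                                  # no match
```

Every transition leads either to the next state or back to state zero, mirroring the structure noted for Table II; the shift values here are the minimal consistent ones, which may differ from the classical Boyer–Moore shift function.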
Fig. 1. Block configurations for the new algorithm. [Figure body not recoverable from the source.]

TABLE III. State transition table for pattern "CABAB" based on the implicit tracking of two blocks. The triplets contain the shift amount, the next state index, and the new reading position in that order. The full match is indicated by bold typesetting. On the right we show the previously matched characters marked with an "X", while a "*" indicates the current reading position. [Table body not recoverable from the source.]

Consider the following situation: [example not recoverable from the source]. The pattern in this example is "BCBCCABC" and the first row depicts the text with "."-s indicating the unread characters. The "*" marks the character that will be read next. If a "B" is read, then we shift the pattern to the position indicated below and prepare to read the character at the marked position.

The "B" we have just read has moved out of scope due to the shift, leaving the two characters at positions 6 and 7 to match the pattern. Another "B" at position 13 would prompt a shift by one in the classical Boyer–Moore algorithm; however, positions 6 and 7 rule out that alignment and a shift by 5 becomes possible.

In general, our algorithm remembers two matching blocks of characters: the left block always starts at the beginning of the pattern, and we attempt to extend the right block with every read. If possible, we extend the right block to the right. In case of an empty right block, we read the character aligned with the last character of the pattern. Both blocks may degenerate into the empty string, leaving the possibilities illustrated in Figure 1. Solid bars indicate the already matched characters, while the current reading position is marked by a downward pointing arrow.

The first row shows the starting state, which also arises whenever all previously matched positions are shifted out of scope. The second row shows the result of three successful matches. The third row indicates that we read on the right end of the right block, if possible. The examples in rows 4, 5, 6 and 7 show the possibilities with a nonempty left block; in particular, row 7 depicts a state when a successful match implies a full match of the whole pattern.

Next, we formally describe the information maintained about the previously read and matched characters (recall the alignment position and the relative reading position from Algorithm 2):
- The number of characters k in the left block: the text agrees with the pattern at relative positions 0 through k - 1. If k = 0, then the left block is empty; this is also the initial state.
- The right block is described by its endpoints r1 and r2: the text agrees with the pattern at relative positions r1 through r2. Note that r1 > r2 indicates an empty block.
- The current reading position (relative to the alignment) is calculated from k, r1 and r2.

When the text character specified by the current alignment and the relative reading position does not agree with the pattern, we shift with the smallest possible value such that in the new alignment the pattern agrees with all the previously read characters. A shift by the length of the pattern guarantees no overlap with previously read characters, thereby proving the existence of the minimum. In case of a character match, we extend the right block accordingly.

We handle the transitions between the possible block configurations with particular attention to the cases when we reach the end of the pattern or the left block.

For an efficient implementation, we build a state transition table, as discussed in Section 2-B, that only implicitly contains the information on the two blocks of already matched characters. In case of the previously examined pattern "CABAB", Table III shows the table. On the right, we also show the blocks of matching characters and the current reading position for reference.

We will refer to Algorithm 2 using the above-described state transition table as 2BLOCK.

The state transition table for 2BLOCK has polynomially many entries and can be computed in polynomial time. We propose improvements in Section 2-E to reduce the table size in order to help practical implementations. First, we will discuss the theoretical properties of 2BLOCK.
D. Correctness and Complexity

The behavior of Algorithm 2 almost solely depends on the properties of the corresponding table. For 2BLOCK the following properties are crucial:
A) The reads do not fragment the blocks, meaning that after the shift (a zero shift amount is also allowed) two blocks will still be able to represent the set of still matching characters.
B) The shift amounts stored in the table are the minimal possible, considering the previously matched characters.
C) The set of characters the algorithm implicitly keeps track of (the two blocks in case of 2BLOCK) are exactly those characters of the text which were read and match the pattern assuming the current alignment.

Property A ensures that the state transition table for 2BLOCK is well defined; its validity follows from the description in Section 2-C. Property B is also true by definition; it ensures that the first match (if it exists) is indeed found by the algorithm.

Case analysis based on the arrangement of the blocks can verify that Property C is preserved during the execution of the loop of Algorithm 2 and that it trivially holds at the start of the algorithm. Notice that the classical Boyer–Moore algorithm does not have Property C, because it "forgets" previous matches which may still be in scope.

We also claim that 2BLOCK is linear in the length of the text. In fact, we know more: no character of the text is read twice. Note that for the classical Boyer–Moore algorithm, the best theoretical result claims no more than three reads for any character of the text [9]. Some variants reduced the number of reads of any character of the text to at most two [7], [8].

Property C above is instrumental to establish our claim. According to this property, the set of previously read characters of the text which overlap the current pattern alignment are in the two blocks the algorithm implicitly keeps track of. Since we read a character at one end of the right block, it follows that no previously read character can ever be read again.

The above argument applies to any variant of Algorithm 2 satisfying Property C. In order to establish this property for a particular state transition table, one may produce a witness consisting of the set of matching characters for every state. For 2BLOCK we approached the problem from the opposite direction: we derived the state transition table from the witnessing blocks of matching characters.

E. Preprocessing and Further Improvements

We generate the state transition table of 2BLOCK by starting with the initial state corresponding to both blocks being empty, and then we add the new states (corresponding to rows in Table III) one at a time as needed. The process of creating the new block configurations for any possible character read and then calculating the shift amount is quite straightforward. If the configuration has been created before, then it must be referenced; otherwise it is added. Finiteness guarantees termination of the process. The discussion of a detailed implementation is beyond the scope of this paper.

Our tests confirmed that a large state transition table hurts the performance, especially when the cache memory is exhausted. The table generation itself may also be of significant cost.

Consider a state of the state transition table created for the pattern "CABAB", as shown in Table III, whose left block consists of a single matched character; any shift moves this character out of scope. Keeping track of this left block only marginally increases the average shift amount. Of course, if we eliminate this state, then we also remove the assurance that every character is read at most once.

While inspecting larger tables, it is apparent that there are states with the reading position close to the beginning of the string with no characters matched ahead of that position. [Example not recoverable from the source.] Assuming that the text is random, the expected next shift from such a state is small. In comparison, a read at the end of the pattern, without assuming any other matching characters, yields larger expected shift amounts. Removing states like these may not only reduce the table size, but may also increase the expected shift amount. Note that elimination of certain states as we suggest may also yield a state transition table which implicitly tracks the single right block only.

Finally, we notice that many states have the property that a large shift occurs in case of any subsequent mismatch. Typically, the remaining character positions each trigger the addition of a new state with one more matching position in the blocks. This scenario can be represented in a more compact manner by allowing the states to indicate that any further mismatch allows a large shift. The main loop of Algorithm 3 shows how we accommodate these smart state tables. We acknowledge that the shift amount of a smart state is not optimal; it cannot be larger than the smallest possible shift considering all the characters.

Algorithm 3 The Boyer–Moore string searching variant implemented with a smart state transition table.
Input: m character long pattern P, n character long text T.
Output: The first matching position, if any.
 1: Calculate the state transition table D from the pattern P.
 2: Set pos = 0, state = 0 and rpos = m - 1.
 3: while pos ≤ n - m do
 4:   Read character T[pos + rpos] into c.
 5:   (shift, state, rpos, match, smart) = D[state, c].
 6:   if match then
 7:     return pos.
 8:   end if
 9:   if smart and the remaining characters match then
10:     return pos.
11:   end if
12:   pos = pos + shift.
13: end while
14: return no match.
TABLE IV. The shift amount achieved by the various algorithms, averaged from ten random executions.

  m      BM  2BLOCK     CUT   SMART    SCUT
 10    3.86    4.35    4.48    3.47    3.80
 20    5.26    6.98    7.27    5.73    6.18
 30    5.37    9.51    9.80    7.69    8.08
 40    6.73   11.64   12.14    9.67   10.46
 50    6.74   14.01   14.25   11.89   12.39
100    9.22   24.08   24.34   20.73   21.47
150   10.49   33.52   33.82   29.43   30.52
200   11.51   42.42   42.38   37.01   37.97

TABLE V. The number of states used by the various algorithms, averaged from ten random executions.

  m      BM  2BLOCK     CUT   SMART    SCUT
 10      10      55      25      19       9
 20      20     186      69      34      19
 30      30     398     122      49      25
 40      40     697     198      64      33
 50      50    1096     275      88      38
100     100    4263    1126     159      80
150     150    9477    2370     233     121
200     200   16443    4145     305     160

TABLE VI. Running times of the various algorithms (in seconds), executing ten different searches on the same text.

  m      BM  2BLOCK     CUT   SMART    SCUT
 10    3.62    3.18    3.32    4.12    3.90
 20    3.04    2.58    2.66    3.00    2.93
 30    3.18    2.56    2.58    2.81    2.78
 40    2.74    2.38    2.28    2.46    2.38
 50    2.71    2.46    2.21    2.30    2.27
100    2.50    8.57    2.64    2.13    2.04
150    2.38   52.14    9.05    1.59    1.60
200    2.33  172.53   30.28    1.28    1.29

3. EXPERIMENTS

We implemented various versions of our algorithm using the ideas described in Sections 2-C and 2-E. 2BLOCK and CUT are based on Algorithm 2, using a state transition table that implicitly tracks two blocks. SMART and SCUT are based on Algorithm 3, employing smart states whenever any subsequent mismatch results in a shift not less than half the pattern length. CUT and SCUT avoid creating states with the rightmost matching character position in the left half of the pattern; the transition is made to the initial state instead. We have also included Algorithm 1 (BM) in the experiments.

We ran our experiments using randomly generated patterns and text over a four character alphabet. We slightly modified the algorithms to find all matches of the pattern.

Tables IV, V and VI summarize the results averaged from ten executions. We observe that the average shift amount achieved by the Boyer–Moore algorithm falls behind when compared to the new algorithms. Note that CUT has improved both the shift amounts and the table size when compared to 2BLOCK. SMART and SCUT trade the shift amounts in exchange for the use of fewer states.

In practical implementations, the increased memory use caused by a large state table can substantially affect performance. Table VI shows execution times of the algorithms performing ten string searches consecutively. Note that our implementations have not been aggressively optimized: the state table generation uses a naive algorithm, and the table itself could be implemented using less memory. We did not fine tune the heuristics deciding the cutoff points for eliminating states in CUT, SMART and SCUT. However, the trend for each algorithm is indicative: the performance of 2BLOCK and CUT will suffer for long patterns. The exact point when SMART and SCUT overtake them depends on implementation details, the exact heuristics chosen and machine specifics.

4. FUTURE WORK

The new algorithms increase the average shift amounts as the pattern size increases. A state of the art, highly optimized implementation of the algorithms is the next logical step to pursue.

The improvements discussed in Section 2-E do not exhaust all the possibilities. Future research may consider the use of probability calculations when creating the state transition table, with attention to the character distribution as well. The fact that we track two blocks of previously matched characters may be revisited, as well as whether the left block has to be anchored to the beginning of the string.

Open theoretical questions include the complexity of the CUT, SMART, and SCUT algorithms.

REFERENCES

[1] R. A. Baeza-Yates. String searching algorithms revisited. Lecture Notes in Computer Science, 382:75-96, 1989.
[2] R. A. Baeza-Yates, C. Choffrut, and G. H. Gonnet. On Boyer–Moore automata. Algorithmica, 12:268-292, October 1994.
[3] R. A. Baeza-Yates and M. Régnier. Average running time of the Boyer–Moore–Horspool algorithm. Theoretical Computer Science, 92:19-31, January 1992.
[4] R. Boyer and J S. Moore. A fast string searching algorithm. Communications of the ACM, 20:762-772, 1977.
[5] D. Cantone and S. Faro. Fast-Search: a new efficient variant of the Boyer–Moore string matching algorithm. Lecture Notes in Computer Science, 2647:47-58, 2003.
[6] C. Choffrut and Y. Haddad. String-matching with OBDDs. Theoretical Computer Science, 320:187-198, June 2004.
[7] L. Colussi. Fastest pattern matching in strings. Journal of Algorithms, 16:163-189, March 1994.
[8] M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string-matching algorithms. Algorithmica, 12:247-267, October 1994.
[9] L. J. Guibas and A. M. Odlyzko. A new proof of the linearity of the Boyer–Moore string searching algorithm. SIAM Journal on Computing, 9:672-682, 1980.
[10] D. Knuth, J. H. Morris Jr., and V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6:323-350, 1977.
[11] T. Lecroq. A variation on the Boyer–Moore algorithm. Theoretical Computer Science, 92:119-144, January 1992.
[12] M. E. Nebel. Fast string matching by using probabilities: on an optimal mismatch variant of Horspool's algorithm. Theoretical Computer Science, 359:329-343, August 2006.
[13] R. Schaback. On the expected sublinearity of the Boyer–Moore algorithm. SIAM Journal on Computing, 17(4):648-658, 1988.
[14] J. Tarhio. A sublinear algorithm for two-dimensional string matching. Pattern Recognition Letters, 17:833-838, July 1996.