String Finding1
Jun-ichi Aoe
Department of Information Science and System Engineering,
The University of Tokushima,
Minami-Josanjima-Cho,
Tokushima-City
770, Japan.
Abstract
This paper describes a method of implementing a static transition table of a string pattern
matching machine to locate all occurrences of a finite number of keywords in a text string. The
scheme combines the fast access of an array representation with the compactness of a list
structure. Each transition can be computed from the present data structure in O(1) time and the
storage is as small as that of the list structure. The construction and pattern matching programs
associated with the present data structure are provided and the efficiency is evaluated by
empirical results.
Key Words: A string pattern matching algorithm, information retrieval, data structure,
space saving, static set of keywords, text-scanning.
I. Introduction
A string pattern matching machine has been applied to the subprocess of many
information retrieval models: for example, a lexical analyzer, code optimization and code
generation of a compiler [2],[9],[11],[12],[13],[18],[23]; a library bibliographic search [1],[4];
text-editing; a spell checker [19]; filtering high frequency words in natural language
processing [7],[21]; and so on. Recently, Aho et al. [1] presented an efficient string pattern
matching algorithm to locate all occurrences of any of a finite number of keywords in a text
string and Knuth et al. [16] presented a fast matching algorithm to find all occurrences of one
given keyword in a text string. This paper discusses an efficient technique for implementing a
machine based on that of Aho et al.
The process of the matching machine occupies a considerable portion of the total running time
of these models, since it is a low level process that must look at the input one symbol at a time.
Therefore, a fast implementation of the transition table of the string pattern matching machine
should be selected. A matrix representation, indexed by states and input symbols, is the fastest,
but it can take up too much space. A list representation consisting of the transitions out of each
state can achieve a space saving over the matrix representation, but is slower. The data
structure for the finite state machine suggested by S. C. Johnson [14] and shown by Aho et al. [2]
is more complex to implement, but it combines the time efficiency of the matrix representation
with the space efficiency of the list representation. This data structure uses three
one-dimensional arrays to represent the transition table of a finite state machine. It has been
A. Formal Definition
In this paper a string is simply a finite sequence of symbols. Let K be a finite set of strings
which we shall call keywords. A string pattern matching machine for K is a program which
takes as input the text string w and produces as output the locations in w at which keywords of
K appear as substrings. In order to simplify the discussion we consider the matching machine
which takes as input one specified string x, called a word, of the text string w and produces as
output whether x is in K or not. To avoid confusion between words like THE and THEN, let us
add a special endmarker symbol, #, to the ends of all keywords, so that no prefix of a keyword can be
a keyword itself. A set of these keywords is denoted as K. We define a string pattern matching
machine as M = (S, I, g), where
(1) S is a finite set of states. Each state is represented by an integer greater than zero.
One state (usually 1) is designated as the initial state in S.
(2) I is a finite set of input symbols.
(3) g is a function from $S \times (I \cup \{\#\})$ to $S \cup \{\mathrm{fail}\}$, called a goto function.
A state s with no transitions leaving it is called an accepting state. Namely, x# is in K if
and only if a sequence of the transitions from 1 to an accepting state s spells out x#. The
transition labeled a in I from s to s' indicates that g(s, a) = s'. The absence of a transition
indicates fail. The transition graph defined by the goto function is called a goto graph. The following
(usual) conventions hold in this paper.
$a, b, c, d \in I \cup \{\#\}$;
$x, y, z \in (I \cup \{\#\})^{*}$.
Let $\varepsilon$ be the empty string. The goto function for K is extended to strings by the
conditions
$g(s, \varepsilon) = s$,
$g(s, ax) = g(g(s, a), x)$.
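The two conditions above amount to a loop that applies g once per symbol. The following is a minimal sketch in C under stated assumptions: the names goto_table, goto_string and build_demo_table, the table sizes, and the single demonstration path are all hypothetical, and the convention that a fail result stays fail for the rest of the string is an assumption of the sketch.

```c
#include <assert.h>

#define FAIL 0          /* fail is encoded as state 0 */
#define NSTATES 8       /* hypothetical table size for this sketch */
#define NSYMBOLS 128    /* hypothetical: 7-bit character codes */

/* goto_table[s][a] holds g(s, a); entries never set remain FAIL. */
static int goto_table[NSTATES][NSYMBOLS];

/* Extended goto function on strings:
   g(s, empty) = s and g(s, ax) = g(g(s, a), x).
   Assumption: once a transition fails, the result stays FAIL. */
static int goto_string(int s, const char *x)
{
    while (s != FAIL && *x != '\0') {
        unsigned char a = (unsigned char)*x;
        s = (a < NSYMBOLS) ? goto_table[s][a] : FAIL;
        ++x;
    }
    return s;
}

/* Fills the table with the single path 1 -a-> 2 -b-> 3 -#-> 4. */
static void build_demo_table(void)
{
    goto_table[1]['a'] = 2;
    goto_table[2]['b'] = 3;
    goto_table[3]['#'] = 4;
}
```

After build_demo_table(), goto_string(1, "ab#") follows the whole path, while a string that leaves the path yields FAIL.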
The goto function g can be produced by the following program based on an algorithm of Aho
et al. [1]. In this program, the function g is implemented by a two-dimensional array GOTO, where
the entry for state s and input symbol a is denoted as GOTO[s][a] and the fail value is
denoted as a zero entry. We assume that GOTO[s][a] = 0 if g(s, a) has not yet been defined. The
function ENTER(key_word) inserts into the goto graph a path that spells out "key_word".
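#include <stdio.h>
#include <string.h>

/* GOTO, TRUE and FALSE are assumed to be declared elsewhere;
   GOTO[s][a] == 0 means that g(s, a) has not yet been defined. */
int new_state;                      /* largest state number used so far */

int ENTER(char key_word[]);

int main(void)
{
    char key_word[256];             /* buffer size is an assumption */

    new_state = 1;
    do {                            /* read keywords until an empty line */
        if (fgets(key_word, sizeof key_word, stdin) == NULL)
            key_word[0] = '\0';
        key_word[strcspn(key_word, "\n")] = '\0';
    } while (ENTER(key_word) == TRUE);
    return 0;
}

/* Insert into the goto graph a path that spells out key_word. */
int ENTER(char key_word[])
{
    int s = 1, j = 0;

    if (strlen(key_word) == 0) return(FALSE);
    /* follow the part of the path that already exists */
    while (GOTO[s][(unsigned char)key_word[j]] != 0) {
        s = GOTO[s][(unsigned char)key_word[j]];
        ++j;
    }
    /* create one new state for each remaining symbol */
    while (key_word[j] != '\0') {
        ++new_state;
        GOTO[s][(unsigned char)key_word[j]] = new_state;
        s = new_state;
        ++j;
    }
    return(TRUE);
}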
The following program summarizes the behavior of a matching program by using the array
GOTO.
/* Return TRUE if word (ending in '#') is spelled out by the goto graph. */
int GOTO_MATCH(char word[])
{
    int s, next_state, input_index;

    s = 1;
    input_index = -1;
    do {
        ++input_index;
        next_state = GOTO[s][(unsigned char)word[input_index]];
        if (next_state == 0) return(FALSE);     /* fail transition */
        s = next_state;
    } while (word[input_index] != '#');         /* stop after the endmarker */
    return(TRUE);
}
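To illustrate why the endmarker # keeps THE and THEN apart, the construction and matching steps can be combined into one small self-contained sketch. The names trie, last_state, sym, enter_keyword and match_word, the table sizes, and the restriction to ASCII input are assumptions made for this illustration only; they are not part of the programs given in the text.

```c
#include <assert.h>
#include <string.h>

#define MAX_STATES 64     /* hypothetical capacity for this sketch */
#define NUM_SYMBOLS 128   /* hypothetical: 7-bit character codes */

static int trie[MAX_STATES][NUM_SYMBOLS];   /* 0 encodes fail */
static int last_state = 1;                  /* state 1 is the initial state */

/* Maps a character to a column index; ASCII input is assumed. */
static int sym(char c) { return (unsigned char)c & 0x7F; }

/* Inserts a keyword that already ends in '#'; returns 0 on bad input
   or when the table is full, 1 otherwise. */
static int enter_keyword(const char *key)
{
    int s = 1;
    size_t j = 0, len = strlen(key);

    if (len == 0 || key[len - 1] != '#')
        return 0;
    /* Follow the part of the path that already exists. */
    while (key[j] != '\0' && trie[s][sym(key[j])] != 0) {
        s = trie[s][sym(key[j])];
        ++j;
    }
    /* Create one new state per remaining symbol. */
    while (key[j] != '\0') {
        if (last_state + 1 >= MAX_STATES)
            return 0;
        ++last_state;
        trie[s][sym(key[j])] = last_state;
        s = last_state;
        ++j;
    }
    return 1;
}

/* Returns 1 if word (ending in '#') is a stored keyword, else 0. */
static int match_word(const char *word)
{
    int s = 1;
    size_t i;

    for (i = 0; word[i] != '\0'; ++i) {
        s = trie[s][sym(word[i])];
        if (s == 0) return 0;           /* fail transition */
        if (word[i] == '#') return 1;   /* endmarker consumed */
    }
    return 0;                           /* no endmarker in word */
}
```

With THE# and THEN# entered, the word THE# is accepted through the # transition that leaves the state reached by THE, while THEM# fails at the symbol M.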
III. Efficient Implementation
A. A Compact Array Structure
[Definition 3] We define the double-array and TAIL as being valid for K if the following
conditions are satisfied for the goto graph.
(1) For s in $S_M$ and s' in ($S_M \cup S_P$), g(s, a) = s' if and only if BASE[s] + a = s' and CHECK[s'] = s.
(2) For s in $S_P$ such that STR[s] = $b_1 b_2 \cdots b_m$ ($0 \le m$), BASE[s] is negative and
TAIL[p] = $b_1$, TAIL[p + 1] = $b_2$, ..., TAIL[p + m - 1] = $b_m$
for p = -BASE[s].
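Condition (1) means that one addition and one comparison decide a transition. The sketch below illustrates this on a deliberately tiny fragment. The function names da_goto and symbol_code, the bound BC_LIMIT, and the array contents are assumptions of the sketch; the five nonzero entries are the ones quoted in the cat# computation of this paper (BASE[1] = 1, BASE[4] = 6, BASE[7] = -16, CHECK[4] = 1, CHECK[7] = 4), with symbols coded a = 1, ..., z = 26, # = 27.

```c
#include <assert.h>

#define FAIL 0
#define BC_LIMIT 8     /* hypothetical: largest index used by this fragment */

/* Only the entries for the path 1 -c-> 4 -a-> 7 are filled in. */
static int BASE[BC_LIMIT + 1]  = {0, 1, 0, 0, 6, 0, 0, -16, 0};
static int CHECK[BC_LIMIT + 1] = {0, 0, 0, 0, 1, 0, 0,   4, 0};

/* Codes a..z as 1..26 and '#' as 27; returns 0 for any other character.
   Contiguous letter codes (as in ASCII) are assumed. */
static int symbol_code(char c)
{
    if (c >= 'a' && c <= 'z') return c - 'a' + 1;
    if (c == '#') return 27;
    return 0;
}

/* g(s, a) = s' if and only if BASE[s] + a = s' and CHECK[s'] = s.
   A negative BASE[s] marks a state whose remaining symbols live in TAIL,
   so it has no transitions in the double-array. */
static int da_goto(int s, int a)
{
    int t;

    if (s < 1 || s > BC_LIMIT || a < 1 || BASE[s] < 0)
        return FAIL;
    t = BASE[s] + a;
    if (t > BC_LIMIT || CHECK[t] != s)
        return FAIL;
    return t;
}
```

Here da_goto(1, symbol_code('c')) yields state 4 and da_goto(4, symbol_code('a')) yields state 7, whereas any other symbol out of state 1 fails the CHECK comparison.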
[Figure: the double-array (BASE, CHECK) and TAIL for a separate state s' with STR[s'] = b1 b2 ... bm, where p = -BASE[s'] and TAIL[p], ..., TAIL[p + m - 1] hold b1, b2, ..., bm.]
[Figure: goto graph for the example keyword set, with each separate state s labelled by STR[s] (for example, STR[6] = opos#, STR[12] = sename#).]
Since the presented data structure requires state numbers to be related, for s and s' in S such
that g(s, a) = s', the processing of state s must precede that of state s' to obtain the relations.
Thus, the function BUILT_BCT, which constructs the double-array and TAIL, processes the
multistates in the first step and the separate states in the next. We assume that the
double-array has enough available zero entries to reduce the GOTO array.
The function BUILT_BCT is given below; it utilizes the following functions and variables.
• new[s]: The entry of the array new that denotes the new state number for state s.
• function IN_QUEUE(s): Store state s in the queue memory; the queue is initially empty.
• function OUT_QUEUE(): Remove one entry from the queue memory and return it. If the
queue is empty, then return FALSE.
• m: The number of input symbols.
• n: The number of states.
BUILT_BCT()
{
int s, next_state, tail_index;
s = OUT_QUEUE();
}while(s != FALSE);
The for-statement of line (a-1) selects as the row displacement BASE[new[s]] for state s the
smallest value such that no nonfail value GOTO[s][a] in row s of the array GOTO is mapped to the
same position as any nonfail value in a previous row. The for-statement of line (a-2) defines
the nonfail values GOTO[s][a] on the double-array, while renumbering the states, and stores the
next governing state in the queue. The for-statement of line (a-3) stores STR[s] for each
separate state in TAIL while keeping BASE[new[s]] consistent with condition (2) of Definition 3.
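The search for the smallest row displacement can be sketched on its own. The code below is an illustration under stated assumptions, not the BUILT_BCT function: find_base, place_row, BC_MAX and the convention that a zero CHECK entry marks a free slot are introduced here for the example, the search is assumed to start at displacement 1, and the queue handling, state renumbering and TAIL construction are left out.

```c
#include <assert.h>

#define NUM_SYMBOLS 27    /* a..z and '#', coded 1..27 */
#define BC_MAX 64         /* hypothetical capacity of BASE and CHECK */

static int CHECK[BC_MAX + 1];    /* 0 marks a free slot */

/* Returns the smallest base >= 1 such that every nonfail symbol of the row
   lands on a free slot, or 0 if no base fits within BC_MAX.
   row[a] != 0 means the state has a transition on symbol a. */
static int find_base(const int row[NUM_SYMBOLS + 1])
{
    int base, a, fits;

    for (base = 1; base + NUM_SYMBOLS <= BC_MAX; ++base) {
        fits = 1;
        for (a = 1; a <= NUM_SYMBOLS; ++a) {
            if (row[a] != 0 && CHECK[base + a] != 0) {
                fits = 0;           /* collides with a previous row */
                break;
            }
        }
        if (fits) return base;
    }
    return 0;
}

/* Claims the slots of the row for state s at the chosen base. */
static void place_row(int s, int base, const int row[NUM_SYMBOLS + 1])
{
    int a;

    for (a = 1; a <= NUM_SYMBOLS; ++a)
        if (row[a] != 0)
            CHECK[base + a] = s;
}
```

Because base and symbol codes both start at 1, slot 1 is never claimed, which leaves it available for the initial state. Rows with few transitions tend to slide into gaps left by earlier rows, which is where the space saving comes from.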
(Example 2) Figure 4 shows the double-array and TAIL for K" as shown in Fig. 4. In this
example, the numerical values for a, b, c, ..., z, # are regarded as 1, 2, 3, ..., 26, 27,
respectively.
index    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
CHECK    0   1   1   1   2   3   4   3  10   4   4  28  12   3   4   0   2  14   2   2   2

index   22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42
BASE    48 -19  -6 -22 -29   0   9 -36 -38 -51   0   0   0 -34 -41   0 -44   0   0   0 -47
CHECK   13  17   2  17  14   0  15  15  15  13   0   0   0  10  15   0  15   0   0   0  15

index    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
TAIL     b   #   #   #   #   k   #   s   e   n   a   m   e   #   #   t   #   #   l   y   #

index   22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42
TAIL     o   p   o   s   #   f   #   m   a   i   l   #   l   #   n   #   r   p   #   o   d

index   43  44  45  46  47  48  49  50  51  52
TAIL     #   w   n   #   h   #   q   #   r   #
To keep a state number from exceeding the maximum index of the arrays BASE and CHECK, this
maximum, denoted by BC-SIZE, is taken to be the largest index of a nonzero entry of the array
CHECK; the matching program below reads it from BASE[0].
Suppose that the double-array and TAIL are placed in main memory as global variables. Then,
a matching program is summarized by the following function BCT_MATCH(word) that returns
TRUE if "word" is in K, otherwise FALSE.
/* Return TRUE if word (ending in '#') is in K, otherwise FALSE.
   BASE, CHECK and TAIL are global arrays; BASE[0] is compared
   against next_state as the upper bound BC-SIZE. */
int BCT_MATCH(char word[])
{
    int s, next_state, input_index, tail_index;

    s = 1;
    input_index = -1;
    do {
        ++input_index;
        next_state = BASE[s] + word[input_index];
/*(b-1)*/ if (next_state > BASE[0] || CHECK[next_state] != s) return(FALSE);
        s = next_state;
    } while (BASE[s] >= 0);
/*(b-2)*/ if (word[input_index] == '#') return(TRUE);
    tail_index = -BASE[s];
    ++input_index;
    do {
/*(b-3)*/ if (TAIL[tail_index] != word[input_index]) return(FALSE);
        ++tail_index;
        ++input_index;
    } while (word[input_index-1] != '#');
    return(TRUE);
}
(Example 3) Consider the retrieval of keyword cat# by using the double-array and TAIL for
K" as shown in Fig. 4. Note that BC-SIZE is 42. BCT_MATCH("cat#") returns TRUE by the
following computations.
1) For input_index = 0.
next_state = BASE[s] + word[input_index] = BASE[1] + word[0] = 1 + c = 4
next_state = 4 < BC-SIZE = 42 and CHECK[next_state] = CHECK[4] = 1
s = next_state = 4
BASE[s] = BASE[4] = 6 > 0
2) For input_index = 1.
next_state = BASE[s] + word[input_index] = BASE[4] + word[1] = 6 + a = 7
next_state = 7 < BC-SIZE = 42 and CHECK[next_state] = CHECK[7] = 4
s = next_state = 7
BASE[s] = BASE[7] = -16 < 0
3) For input_index = 2.
tail_index = -BASE[7] = 16
TAIL[tail_index] = TAIL[16] = t = word[input_index] = word[2]
4) For input_index = 3.
TAIL[tail_index] = TAIL[17] = # = word[input_index] = word[3]
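The computation for cat# can be replayed with a small self-contained program. The sketch below keeps only the entries that the example actually touches (BASE[1] = 1, BASE[4] = 6, BASE[7] = -16, CHECK[4] = 1, CHECK[7] = 4, TAIL[16] = t, TAIL[17] = #). The names bct_match and symbol_code, the reduced bound BC_LIMIT = 7 stored in BASE[0], and the extra guards against a missing endmarker are assumptions added for the illustration.

```c
#include <assert.h>

#define BC_LIMIT 7       /* hypothetical: this fragment uses indices up to 7 */
#define TAIL_LIMIT 17    /* last TAIL index used by this fragment */

static int  BASE[BC_LIMIT + 1]  = {BC_LIMIT, 1, 0, 0, 6, 0, 0, -16};
static int  CHECK[BC_LIMIT + 1] = {0, 0, 0, 0, 1, 0, 0, 4};
static char TAIL[TAIL_LIMIT + 1] = { [16] = 't', [17] = '#' };

/* Codes a..z as 1..26 and '#' as 27; 0 for anything else (ASCII assumed). */
static int symbol_code(char c)
{
    if (c >= 'a' && c <= 'z') return c - 'a' + 1;
    if (c == '#') return 27;
    return 0;
}

/* Returns 1 if word (ending in '#') is stored, else 0. */
static int bct_match(const char *word)
{
    int s = 1, i = -1, t, p, a;

    /* Walk the double-array while the state has a nonnegative BASE. */
    do {
        ++i;
        a = symbol_code(word[i]);
        if (a == 0) return 0;            /* not an input symbol, or end of word */
        t = BASE[s] + a;
        if (t > BASE[0] || CHECK[t] != s) return 0;
        s = t;
    } while (BASE[s] >= 0);
    if (word[i] == '#') return 1;        /* endmarker already consumed */

    /* Compare the rest of the word with the string stored in TAIL. */
    p = -BASE[s];
    ++i;
    do {
        if (p > TAIL_LIMIT || TAIL[p] != word[i]) return 0;
        ++p;
        ++i;
    } while (word[i - 1] != '#');
    return 1;
}
```

For cat# the walk visits states 1, 4 and 7, then compares t and # against TAIL[16] and TAIL[17], as in steps 1) to 4) above; cab# is rejected at TAIL[16].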
IV. Evaluation
A. Theoretical Observation
• Size of the double-array
As discussed by Aho et al. [2] and Tarjan et al. [24], it is difficult to evaluate BC-SIZE
theoretically by using the external parameters n' (the number of multistates and separate states) and m
(the number of input symbols). However, because of the sparse relations on the states and input
symbols of the transition table, we can assume that BC-SIZE is proportional to n' + cm for a
constant c, where cm is equal to the number of redundant indexes on the arrays BASE and CHECK.
The value c is called a redundant coefficient and will be evaluated by empirical observations.
B. Simulation Results
V. Conclusion
References
[1] A. V. Aho and M. J. Corasick, "Efficient string matching: An aid to bibliographic search,"
Comm. ACM, vol. 18, pp. 333-340, June 1975.
[2] A. V. Aho, R. Sethi and J. D. Ullman, Compilers: Principles, Techniques, and Tools,
Addison-Wesley, Reading, Mass., Ch. 3, pp. 144-146, 1986.
[3] J. Aoe, Y. Yamamoto and R. Shimada, "An efficient method for storing and retrieving finite
state machines," (in Japanese), IECE Trans., vol. J65-D, pp. 1235-1242, Oct. 1982, (in English),
Electronica Japonica, vol. 13.
[4] ---, "A method for improving string pattern matching machines," IEEE Trans. Softw. Eng.,
vol. SE-10, pp. 116-120, Jan. 1984.
[5] ---, "An efficient algorithm of reducing sparse matrices by row displacements,"
(in Japanese), Trans. IPS Japan, vol. 26, pp. 211-218, Mar. 1985.
[6] ---, "An efficient implementation of static string pattern matching machines," Proc. of the
First Int. Conf. on Supercomputing, pp. 491-498, Dec. 1985.
[7] J. Aoe and M. Fujikawa, "An efficient representation of hierarchical semantic primitives -An
aid to machine translation systems-," Proc. of the Second Int. Conf. on Supercomputing,
pp. 361-370, May 1987.
[8] F. Berman, E. Bock, E. Dittert, M. J. O'Donnell and D. Plank, "Collections of functions for
perfect hashing," SIAM J. Comput., vol. 15, pp. 604-618, Feb. 1986.
[9] R. G. G. Cattell, "Automatic derivation of code generators from machine descriptions,"
ACM Trans. Prog. Lang. Syst., vol. 4, pp. 173-190, Jan. 1982.
[10] R. J. Cichelli, "Minimal perfect hash functions made simple," Comm. ACM, vol. 23, pp. 17-19, Jan.
1980.
[11] J. W. Davidson and C. W. Fraser, "The design and application of a retargetable peephole
optimizer," ACM Trans. Prog. Lang. Syst., vol. 4, pp. 21-36, Jan. 1982.
[12] S. L. Graham, "Table-driven code generation," Computer, vol. 13, pp. 25-34, Aug. 1980.
[13] J. Harada, "Pascal compiler by table driven lexical and syntax analyzers," (in Japanese),
Tokushima Univ., Graduation Thesis, 1983.
[14] S. C. Johnson, "YACC-yet another compiler-compiler," CSTR 32, Bell Lab., N. J., pp. 1-34,
1975.
[15] G. Jaeschke, "Reciprocal hashing: A method for generating minimal perfect hashing
functions," Comm. ACM, vol. 24, pp. 829-833, Dec. 1981.
[16] D. E. Knuth, J. H. Morris, and V. R. Pratt, "Fast pattern matching in strings," SIAM J.
Comput., vol. 6, pp. 323-350, June 1977.
[17] D. E. Knuth, The Art of Computer Programming, vol. I, Fundamental Algorithms, pp. 295-304,
Addison-Wesley, Reading, Mass.; Ibid., vol. III, Sorting and Searching, pp. 481-505, 1973.
[18] M. E. Lesk, "Lex-a lexical analyzer generator," CSTR 39, Bell Lab., N. J., pp. 1-13, Oct. 1975.
[19] J. L. Peterson, Computer Programs for Spelling Correction, Lecture Notes in Comput.
Sci., Springer-Verlag, New York, 1980.
[20] T. Sato, "COBOL Technique," (in Japanese), Tokyo-Denki-Daigaku, 1970.
[21] B. A. Sheil, "Median split trees: A fast lookup technique for frequently occurring keys,"
Comm. ACM, vol. 21, pp. 947-959, Nov. 1978.
[22] R. Sprugnoli, "Perfect hashing functions: a single probe retrieving method for static sets,"
Comm. ACM, vol. 20, pp. 841-850, Nov. 1977.
[23] A. S. Tanenbaum, H. Staveren and J. W. Stevenson, "Using peephole optimization on
intermediate code," ACM Trans. Prog. Lang. Syst., vol. 4, pp. 21-36, Jan. 1982.
[24] R. E. Tarjan and A. C. Yao, "Storing a sparse table," Comm. ACM, vol. 22, pp. 606-611, Nov.
1979.
[25] N. Wirth, "The programming language Pascal," Acta Inf., vol. 1, pp. 35-63, Jan. 1971.