
An Efficient Implementation of String Pattern Matching Machines for a Finite
Number of Keywords

Jun-ichi Aoe
Department of Information Science and System Engineering,
The University of Tokushima,
Minami-Josanjima-Cho,
Tokushima-City
770, Japan.

Abstract

This paper describes a method of implementing a static transition table of a string pattern
matching machine that locates all occurrences of a finite number of keywords in a text string. The
scheme combines the fast access of an array representation with the compactness of a list
structure. Each transition can be computed from the presented data structure in O(1) time, and the
storage is as small as that of the list structure. The construction and pattern matching programs
associated with the presented data structure are provided, and the efficiency is evaluated by
empirical results.

Key Words: string pattern matching algorithm, information retrieval, data structure,
space saving, static set of keywords, text scanning.

I. Introduction

A string pattern matching machine has been applied to the subprocess of many
information retrieval models: for example, a lexical analyzer, code optimization and code
generation of a compiler [2],[9],[11],[12],[13],[18],[23]; a library bibliographic search [1],[4];
text editing; a spell checker [19]; filtering high frequency words in natural language
processing [7],[21]; and so on. Recently, Aho et al. [1] presented an efficient string pattern
matching algorithm to locate all occurrences of any of a finite number of keywords in a text
string, and Knuth et al. [16] presented a fast matching algorithm to find all occurrences of one
given keyword in a text string. This paper discusses an efficient technique for implementing a
machine based on Aho et al. [1].
The process of the matching machine occupies a significant portion of the total running time
of these models, since it is a low level process that must look at the input one symbol at a time.
Therefore, a fast implementation of the transition table of the string pattern matching machine
should be selected. A matrix representation, indexed by states and input symbols, is the fastest,
but it can take up too much space. A list representation consisting of the transitions out of each
state can achieve a space saving over the matrix representation, but is slower. The data
structure for the finite state machine suggested by S. C. Johnson [14] and shown by Aho et al. [2]
is more complex to implement, but it combines the time efficiency of the matrix representation
with the space efficiency of the list representation. This data structure uses three
one-dimensional arrays to represent the transition table of a finite state machine. It has been
widely used through the compiler-compilers LEX [18] and YACC [14] in UNIX systems [2].

Although the presented data structure is related to the three-array structure, the following
improvements over the three-array structure are introduced by restricting the transition table
of the finite state machine to that of the pattern matching machine.
1) One array is removed from the three-array structure by defining interdependent relations among
the state numbers.
2) The states with more than one nonfail value are stored in the fast access structure obtained by 1),
and the states with only one nonfail value are encoded in a string memory.
The major functions of the construction and matching programs associated with the presented
data structure are given in the language C. The presented scheme is evaluated briefly by
theoretical observation, and the evaluation is supported by simulation results for various
sets of keywords.

II. A Pattern Matching Machine

A. Formal Definition

In this paper a string is simply a finite sequence of symbols. Let K be a finite set of strings
which we shall call keywords. A string pattern matching machine for K is a program which
takes as input the text string w and produces as output the locations in w at which keywords of
K appear as substrings. In order to simplify the discussion we consider the matching machine
which takes as input one specified string x, called a word, of the text string w and produces as
output whether x is in K or not. To avoid confusion between words like THE and THEN, let us
add a special endmarker symbol, #, to the ends of all keywords, so no prefix of a keyword can be
a keyword itself. A set of these keywords is denoted as K. We define a string pattern matching
machine as $M = (S, I, g)$, where
(1) S is a finite set of states. Each state is represented by an integer greater than zero.
One state (usually 1) is designated as the initial state in S.
(2) I is a finite set of input symbols.
(3) g is a function from $S \times (I \cup \{\#\})$ to $S \cup \{\mathrm{fail}\}$, called a goto function.
A state s with no transitions leaving it is called an accepting state. Namely, x# is in K if
and only if a sequence of transitions from 1 to an accepting state s spells out x#. The
transition labeled a from s to s' indicates that g(s, a) = s'. The absence of a transition
indicates fail. The transition graph of the goto function is called a goto graph. The following
(usual) conventions hold in this paper.
$$a, b, c, d \in I \cup \{\#\}; \qquad x, y, z \in (I \cup \{\#\})^{*}.$$
Let $\varepsilon$ be the empty string. The goto function for K is extended to strings by the
conditions
$$g(s, \varepsilon) = s, \qquad g(s, ax) = g(g(s, a), x).$$
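
The two extension conditions amount to a loop that consumes one symbol at a time. The following is a minimal sketch of that reading in C, not the paper's program: the dense table `goto_tab`, the bounds `MAX_STATE` and `MAX_SYMBOL`, and the helpers `add_path` and `goto_string` are all hypothetical names introduced here, and fail is encoded as 0.

```c
#include <stddef.h>

#define FAIL       0      /* assumed encoding of the fail value */
#define MAX_STATE  64     /* assumed bound, large enough for the sketch */
#define MAX_SYMBOL 256    /* assumed: one column per unsigned char value */

static int goto_tab[MAX_STATE][MAX_SYMBOL];  /* hypothetical table, 0 = fail */
static int last_state = 1;                   /* state 1 is the initial state */

/* Insert a path that spells out `key` from the initial state. */
static void add_path(const char *key)
{
    int s = 1;
    for (size_t j = 0; key[j] != '\0'; ++j) {
        unsigned char a = (unsigned char)key[j];
        if (goto_tab[s][a] == FAIL)
            goto_tab[s][a] = ++last_state;   /* create a new state */
        s = goto_tab[s][a];
    }
}

/* g(s, epsilon) = s and g(s, ax) = g(g(s, a), x), written as a loop. */
static int goto_string(int s, const char *x)
{
    for (size_t j = 0; x[j] != '\0' && s != FAIL; ++j)
        s = goto_tab[s][(unsigned char)x[j]];
    return s;
}
```

With the keywords the# and then# inserted, `goto_string(1, "the#")` ends in an accepting state, while `goto_string(1, "them#")` returns `FAIL` as soon as no transition on m exists.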

B. Construction of the goto function

The goto function g can be produced by the following program based on an algorithm of Aho
et al. [1]. In this program, the function g is implemented by a two-dimensional array GOTO, where
the entry for state s and input symbol a is denoted as GOTO[s][a] and the fail value is
denoted as a zero entry. We assume that GOTO[s][a] = 0 if g(s, a) has not yet been defined. The
function ENTER(key_word) inserts into the goto graph a path that spells out "key_word".


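#include <stdio.h>
#include <string.h>

#define TRUE  1
#define FALSE 0
#define MAX_STATE  1024             /* assumed table bounds */
#define MAX_SYMBOL 128

int GOTO[MAX_STATE][MAX_SYMBOL];    /* a zero entry denotes fail */
int new_state;

int ENTER(char key_word[]);

int main(void)
{
    char key_word[256];

    new_state = 1;
    /* read one keyword per line until an empty line or end of input */
    while (fgets(key_word, sizeof key_word, stdin) != NULL) {
        key_word[strcspn(key_word, "\n")] = '\0';
        if (ENTER(key_word) == FALSE)
            break;
    }
    return 0;
}

int ENTER(char key_word[])
{
    int s = 1, j = 0;

    if (strlen(key_word) == 0) return FALSE;
    /* follow the existing path as far as it goes */
    while (key_word[j] != '\0' && GOTO[s][(unsigned char)key_word[j]] != 0) {
        s = GOTO[s][(unsigned char)key_word[j]];
        ++j;
    }
    /* create new states for the rest of the keyword */
    while (key_word[j] != '\0') {
        ++new_state;
        GOTO[s][(unsigned char)key_word[j]] = new_state;
        s = new_state;
        ++j;
    }
    return TRUE;
}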

The following program summarizes the behavior of a matching program by using the array
GOTO.

int GOTO_MATCH(char word[])
{
    int s, next_state, input_index;

    s = 1;
    input_index = -1;
    do {
        ++input_index;
        next_state = GOTO[s][(unsigned char)word[input_index]];
        if (next_state == 0) return FALSE;     /* no transition: fail */
        s = next_state;
    } while (word[input_index] != '#');        /* stop after the endmarker */
    return TRUE;
}


III. Efficient Implementation

A. A Compact Array Structure

A triple-array structure using three one-dimensional arrays BASE, CHECK and NEXT, as
shown in Fig. 1, was defined by S. C. Johnson [14] (the details are found in Aho et al. [2]) as a
static implementation scheme for the transition tables of YACC [2],[14] and LEX [2],[18]. The goto
value g(s, a) = s' is computed from the triple-array by the following steps.
First, the index t of the array CHECK is computed as t = BASE[s] + a.
If CHECK[t] is equal to the current state s, then g(s, a) becomes NEXT[t]; otherwise g(s, a) fails.
The excellent feature of the triple-array structure is that the indexes to the arrays NEXT
and CHECK are efficiently computed from the array BASE and the numerical value of the input
symbol. Thus, it achieves fast retrieval from the transition table.
By restricting the use of the triple-array to the transitions of a string pattern matching
machine, the following alternative data structure, called a double-array, can be defined.
For s and s' in S, g(s, a) = s' if and only if BASE and CHECK for K satisfy
s' = BASE[s] + a and CHECK[s'] = s.
In the double-array structure, the next state s' can be computed from
BASE[s] and the current input symbol a, so the array NEXT can be removed from the
triple-array.
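
The difference between the two lookups can be sketched in C as follows. This is an illustration under stated assumptions rather than the paper's code: the function names `triple_goto` and `double_goto`, the explicit `size` bound, and the encoding of fail as 0 are introduced here for the example.

```c
#define FAIL 0   /* assumed encoding of the fail value */

/* Triple-array lookup: t = BASE[s] + a, valid only if CHECK[t] == s,
 * and the next state is read from NEXT[t]. */
static int triple_goto(const int *base, const int *check, const int *next,
                       int size, int s, int a)
{
    int t = base[s] + a;
    if (t < 1 || t > size || check[t] != s)
        return FAIL;
    return next[t];
}

/* Double-array lookup: the index t is itself the next state,
 * so no NEXT array is needed. */
static int double_goto(const int *base, const int *check,
                       int size, int s, int a)
{
    int t = base[s] + a;
    if (t < 1 || t > size || check[t] != s)
        return FAIL;
    return t;
}
```

The double-array version is only possible because state numbers are chosen so that s' = BASE[s] + a, which is why the construction has to renumber the states.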
The goto graph has many states for a large set of keywords, so it is important to make the
double-array more compact. In the interest of space saving, we define the following terms on
the goto graph.

[Definition 1] The following states are defined on the goto graph.

1) For a key xay in K, we define the state s' such that g(1, xa) = s' as a separate state if a is a
sufficient symbol for distinguishing the key xay from all others in K.
2) Each state (except the separate state) on a path from the initial state to the separate state is
called a multistate.
3) Each state (except the separate state) on a path from the separate state to the accepting state
is called a single-state.
Let $S_P$, $S_M$ and $S_I$ be the sets of separate states, multistates and single-states, respectively.

[Definition 2] A string x# such that g(s, x#) = s' for s in $S_P$ and s' in $S_I$ is called a
single-string for the separate state s, denoted by STR[s]. If the separate state s has no
transition, then STR[s] = $\varepsilon$.
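
One way to read Definitions 1 and 2 is that the separate state of a key is reached by the shortest prefix that no other key shares, and the single-string is whatever remains of the key after that prefix. The C sketch below follows that reading, which is an assumption made here; the helpers `common_prefix`, `separate_depth` and `single_string` are hypothetical and work directly on the keyword list rather than on the goto graph.

```c
#include <stddef.h>
#include <string.h>

/* Length of the longest common prefix of two strings. */
static size_t common_prefix(const char *u, const char *v)
{
    size_t i = 0;
    while (u[i] != '\0' && u[i] == v[i])
        ++i;
    return i;
}

/*
 * Number of symbols read from the initial state to reach the separate
 * state of keys[idx]: one more than the longest prefix shared with any
 * other key.  Assumes the keys are distinct and each ends in '#', so
 * that no key is a prefix of another.
 */
static size_t separate_depth(const char *const keys[], size_t n, size_t idx)
{
    size_t longest = 0;
    for (size_t k = 0; k < n; ++k) {
        if (k == idx)
            continue;
        size_t c = common_prefix(keys[idx], keys[k]);
        if (c > longest)
            longest = c;
    }
    return longest + 1;
}

/* The single-string is the remainder of the key after that prefix. */
static const char *single_string(const char *const keys[], size_t n, size_t idx)
{
    return keys[idx] + separate_depth(keys, n, idx);
}
```

For the keys adb#, apply#, apropos# and awk#, this gives the single-strings b#, ly#, opos# and k#. When two keys differ only in their last symbol, the remainder is empty, which corresponds to the case STR[s] = $\varepsilon$.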

Fig. 1 Triple-array structure for g(s, a) = s'.

We propose that the transitions from $S_M \times (I \cup \{\#\})$ to $(S_M \cup S_P)$ be stored in the
double-array and that those from $(S_P \cup S_I) \times (I \cup \{\#\})$ to $S_I$ be stored as
single-strings in a string memory, called TAIL. This is a well-known technique [17]; there is,
however, a problem in how to determine the following:
a) Whether a given state is a separate state or not.
b) The location in TAIL from which a single-string is taken.
This problem can easily be solved by using additional arrays, but they can take up too much
space. The following modified double-array and TAIL enable us to solve the problem without
using extra space.

[Definition 3] We define the double-array and TAIL as being valid for K if the following
conditions are satisfied for the goto graph.
(1) For s in $S_M$ and s' in $(S_M \cup S_P)$, g(s, a) = s' if and only if BASE[s] + a = s' and CHECK[s'] = s.
(2) For s in $S_P$ such that STR[s] = $b_1 b_2 \ldots b_m$ ($0 \le m$), BASE[s] is negative and
TAIL[p] = $b_1$, TAIL[p + 1] = $b_2$, ..., TAIL[p + m - 1] = $b_m$
for p = -BASE[s].

In this structure, as depicted in Fig. 2, BASE[s] serves two purposes: its negative sign marks
s as a separate state, and its magnitude locates the single-string in TAIL. The goto graph on
$(S_M \cup S_P) \times (I \cup \{\#\})$ is called the reduced goto graph.

[Figure: for a separate state s' with STR[s'] = $b_1 b_2 \ldots b_m$, the entry BASE[s'] gives
p = -BASE[s'], the position in TAIL at which $b_1 b_2 \ldots b_m$ is stored.]

Fig. 2 Double-array and TAIL structure.

(Example 1) Figure 3 shows the reduced goto graph and the single-strings for

K' = {adb#, apply#, apropos#, ar#, as#, at#, awk#, basename#, bc#, biff#, binmail#, cat#, co#,
ccat#, cd#, checkeq#, checknr#, chfn#, chgrp#, chmod#, chown#, chsh#},
which is a subset of UNIX 4.2BSD commands. In Fig. 3 the multistates are 1, 2, 4, 11, 14, 17, 19,
23, 24, 25, 26 and the other states are the separate states.
[Figure: the reduced goto graph, drawn as a tree of labeled transitions from the initial state.
Each separate state s is annotated with its single-string, as far as legible:
STR[3] = b#, STR[5] = ly#, STR[6] = opos#, STR[7] = #, STR[8] = #, STR[9] = #, STR[10] = k#,
STR[12] = sename#, STR[13] = #, STR[15] = f#, STR[16] = ail#, STR[18] = t#, STR[20] = t,
STR[21] = t#, STR[22] = #, STR[27] = q#, STR[28] = r#, STR[30] = rp#, STR[31] = od#,
STR[32] = wn#, STR[33] = h#.]

Fig. 3 The reduced goto graph for K'.

B. Construction of the Reduced Data Structure

Since the presented data structure requires state numbers to be related, for s and s' in S such
that g(s, a) = s', state s must be processed before state s' to obtain the relations.
Thus, the function BUILT_BCT, which constructs the double-array and TAIL, processes the
multistates in the first step and the separate states in the next. We assume that the
double-array has enough available zero entries to hold the reduced GOTO array.

The function BUILT_BCT is given below, where it utilizes the following functions and variables.
• new[s]: The entry of the array new that denotes the new state number for state s.
• function IN_QUEUE(s): Store state s in the queue memory. The queue is initially empty.
• function OUT_QUEUE(): Remove one entry from the queue memory and return it. If the
queue is empty, then return FALSE.
• m: The number of input symbols.
• n: The number of states.
• SEPARATE[s]: The entry of the array SEPARATE, which is TRUE if state s is a separate state
and FALSE otherwise.

/* Uses strcpy, strcat and strlen from <string.h>; BASE, CHECK, TAIL, GOTO,
   new, STR, SEPARATE, m and n are global, as described above. */
void BUILT_BCT(void)
{
    int s, j, next_state, tail_index;

    /* Routine for multistates */

    s = 1; new[s] = 1;
    do {
        BASE[new[s]] = 1;
    overlap:
/*(a-1)*/ for (j = 1; j <= m; ++j)
            if (GOTO[s][j] != 0 && CHECK[BASE[new[s]] + j] != 0) {
                ++BASE[new[s]];          /* slot taken: try the next displacement */
                goto overlap;
            }
/*(a-2)*/ for (j = 1; j <= m; ++j)
            if (GOTO[s][j] != 0) {
                CHECK[BASE[new[s]] + j] = new[s];
                next_state = GOTO[s][j];
                new[next_state] = BASE[new[s]] + j;
                if (SEPARATE[next_state] == FALSE)
                    IN_QUEUE(next_state);
            }
        s = OUT_QUEUE();
    } while (s != FALSE);

    /* Routine for separate states */

    tail_index = 1;
    strcpy(TAIL, "#");   /* TAIL[0] = '#' is regarded as a dummy entry */
/*(a-3)*/ for (s = 1; s <= n; ++s)
        if (SEPARATE[s] == TRUE) {
            BASE[new[s]] = -tail_index;
            strcat(TAIL, STR[s]);
            tail_index += strlen(STR[s]);
        }
}

The for-statement of line (a-1) selects as the row displacement BASE[new[s]] for state s the
smallest value such that no nonfail value GOTO[s][j] in row s of the array GOTO is mapped to the
same position as any nonfail value in a previous row. The for-statement of line (a-2) defines
the nonfail values GOTO[s][j] on the double-array, while renumbering the states, and stores the
next states in the queue. The for-statement of line (a-3) stores STR[s] for each
separate state in TAIL while keeping the condition on BASE[new[s]] in Definition 3-(2).
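
The displacement search of line (a-1) is a first-fit placement of one sparse row over the slots already claimed in CHECK. The following C sketch isolates that step. It is illustrative only: `first_fit` and `place_row` are hypothetical names, the explicit `size` bound is an assumption (BUILT_BCT instead assumes the arrays are large enough), and `row[j]` stands for GOTO[s][j].

```c
/*
 * Smallest displacement b >= 1 such that, for every symbol j in 1..m with
 * row[j] != 0, the slot check[b + j] is still free (zero).
 * Returns 0 if no displacement fits within check[1..size].
 */
static int first_fit(const int row[], int m, const int check[], int size)
{
    for (int b = 1; b + m <= size; ++b) {
        int j;
        for (j = 1; j <= m; ++j)
            if (row[j] != 0 && check[b + j] != 0)
                break;              /* collision: try the next displacement */
        if (j > m)
            return b;               /* every nonfail entry found a free slot */
    }
    return 0;
}

/* Claim the slots of one row for state s, as line (a-2) does with CHECK. */
static void place_row(const int row[], int m, int check[], int b, int s)
{
    for (int j = 1; j <= m; ++j)
        if (row[j] != 0)
            check[b + j] = s;
}
```

Because only nonfail entries must avoid each other, a later row can reuse a small displacement whenever its nonfail symbols fall into gaps left by earlier rows, which is what keeps the arrays compact.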

(Example 2) Figure 4 shows the double-array and TAIL for K' as shown in Fig. 3. In this
example, the numerical values for a, b, c, ..., z, # are regarded as 1, 2, 3, ..., 26, 27,
respectively.

Index    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
BASE     1   1   5   6  -1  -8 -16 -15 -34   8 -18   2   9  12  23   0   7 -27  -3  -4  -5
CHECK    0   1   1   1   2   3   4   3  10   4   4  28  12   3   4   0   2  14   2   2   2

Index   22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42
BASE    48 -19  -6 -22 -29   0   9 -36 -38 -51   0   0   0 -34 -41   0 -44   0   0   0 -47
CHECK   13  17   2  17  14   0  15  15  15  13   0   0   0  10  15   0  15   0   0   0  15

Index    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
TAIL     b   #   #   #   #   k   #   s   e   n   a   m   e   #   #   t   #   #   l   y   #

Index   22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42
TAIL     o   p   o   s   #   f   #   m   a   i   l   #   l   #   n   #   r   p   #   o   d

Index   43  44  45  46  47  48  49  50  51  52
TAIL     #   w   n   #   h   #   q   #   r   #

Fig. 4 The double-array and TAIL for K'.

C. A Matching Program by the Reduced Data Structure

To keep a state number from exceeding the maximum index of the arrays BASE and
CHECK, denoted by BC-SIZE, we take BC-SIZE to be the maximum index of the nonzero entries of
the array CHECK; the matching program below reads this bound from BASE[0].
Suppose that the double-array and TAIL are placed in main memory as global variables. Then
a matching program is summarized by the following function BCT_MATCH(word), which returns
TRUE if "word" is in K and FALSE otherwise.


int BCT_MATCH(char word[])
{
    int s, next_state, input_index, tail_index;

    s = 1;
    input_index = -1;
    /* walk the double-array until a separate state is reached */
    do {
        ++input_index;
        next_state = BASE[s] + word[input_index];
/*(b-1)*/ if (next_state > BASE[0] || CHECK[next_state] != s) return FALSE;
        s = next_state;
    } while (BASE[s] >= 0);
/*(b-2)*/ if (word[input_index] == '#') return TRUE;
    /* compare the rest of the word with the single-string in TAIL */
    tail_index = -BASE[s];
    ++input_index;
    do {
/*(b-3)*/ if (TAIL[tail_index] != word[input_index]) return FALSE;
        ++tail_index;
        ++input_index;
    } while (word[input_index - 1] != '#');
    return TRUE;
}

In this program, line (b-1) returns FALSE when a mismatch is detected on the double-array.

The first do-while loop terminates if BASE[s] is negative, that is, if state s is a separate state. If
the current input symbol is equal to '#' at line (b-2), then TRUE is immediately returned because
STR[s] = $\varepsilon$. Line (b-3) in the second do-while loop returns FALSE when a mismatch is
detected on TAIL, and the loop terminates when the matching on TAIL succeeds.

(Example 3) Consider the retrieval of the keyword cat# by using the double-array and TAIL for
K' as shown in Fig. 4. Note that BC-SIZE is 42. BCT_MATCH("cat#") returns TRUE by the
following computations.
1) For input_index = 0.
next_state = BASE[s] + word[input_index] = BASE[1] + word[0] = 1 + c = 1 + 3 = 4
next_state = 4 < BC-SIZE = 42 and CHECK[next_state] = CHECK[4] = 1
s = next_state = 4
BASE[s] = BASE[4] = 6 > 0
2) For input_index = 1.
next_state = BASE[s] + word[input_index] = BASE[4] + word[1] = 6 + a = 6 + 1 = 7
next_state = 7 < BC-SIZE = 42 and CHECK[next_state] = CHECK[7] = 4
s = next_state = 7
BASE[s] = BASE[7] = -16 < 0
3) For input_index = 2.
tail_index = -BASE[7] = 16
TAIL[tail_index] = TAIL[16] = t = word[input_index] = word[2]
4) For input_index = 3.
TAIL[tail_index] = TAIL[17] = # = word[input_index] = word[3]
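
The trace above can be replayed with a small C sketch. It is not the paper's BCT_MATCH: it loads only the entries of Fig. 4 that the trace actually touches (BASE[1], BASE[4], BASE[7], CHECK[4], CHECK[7], TAIL[16] and TAIL[17]) and leaves every other entry at zero, so it recognizes cat# and nothing else. The helpers `symbol_code`, `load_example` and `bct_match_sketch` are hypothetical names, and the symbol coding a = 1, ..., z = 26, # = 27 follows Example 2.

```c
#define BC_SIZE 42                 /* BC-SIZE as stated in Example 3 */

static int  BASE[BC_SIZE + 1];     /* sparse copy: unused entries stay 0 */
static int  CHECK[BC_SIZE + 1];
static char TAIL[64];

/* Map 'a'..'z' to 1..26 and '#' to 27; anything else maps to 0. */
static int symbol_code(char ch)
{
    if (ch >= 'a' && ch <= 'z')
        return ch - 'a' + 1;
    return ch == '#' ? 27 : 0;
}

/* Load only the entries used by the cat# trace. */
static void load_example(void)
{
    BASE[1] = 1;   BASE[4] = 6;   BASE[7] = -16;
    CHECK[4] = 1;  CHECK[7] = 4;
    TAIL[16] = 't'; TAIL[17] = '#';
}

static int bct_match_sketch(const char *word)
{
    int s = 1, i = 0;

    /* double-array part: stop at a separate state (negative BASE) */
    do {
        int t = BASE[s] + symbol_code(word[i]);
        if (t < 1 || t > BC_SIZE || CHECK[t] != s)
            return 0;
        s = t;
        ++i;
    } while (BASE[s] >= 0);
    if (word[i - 1] == '#')
        return 1;                  /* empty single-string */

    /* TAIL part: compare symbol by symbol up to the endmarker */
    for (int p = -BASE[s]; ; ++p, ++i) {
        if (TAIL[p] != word[i])
            return 0;
        if (word[i] == '#')
            return 1;
    }
}
```

After `load_example()`, `bct_match_sketch("cat#")` follows states 1, 4 and 7 and then compares t# against TAIL[16..17], exactly as in steps 1) to 4); a word such as cab# fails at the first TAIL comparison.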


IV. Evaluation

A. Theoretical Observation

• Size of the double-array
As discussed by Aho et al. [2] and Tarjan et al. [24], it is difficult to evaluate BC-SIZE
theoretically by using the external parameters n' (the number of multistates and separate states)
and m (the number of input symbols). However, because of the sparse relations on the states and input
symbols of the transition table, we can assume that BC-SIZE is proportional to n' + cm for a
constant c, where cm is equal to the number of redundant indexes on the arrays BASE and CHECK.
The value c is called a redundant coefficient and will be evaluated by empirical observations.

• Construction Time of the double-array

The worst-case time complexity of the function BUILT_BCT depends on the first
for-statement of line (a-1). The maximum value of BASE[new[s]] in the for-loop of function
BUILT_BCT becomes n' + cm, so the worst-case time of confirming
(GOTO[s][j] != 0 && CHECK[BASE[new[s]] + j] != 0)
for each s is proportional to $m(n' + cm) = n'm + cm^{2}$. Hence, the worst-case time complexity of
constructing the double-array is $O(n'^{2}m + cn'm^{2})$.

B. Simulation Results

Table I presents the simulation results on a Sun-3/260 workstation for the following sets of
keywords.
KW1: The reserved words for Pascal [25].
KW2: Main city names in Japan.
KW3: The reserved words for COBOL [20].
KW4: Commands in UNIX System V.
KW5: Commands in UNIX 4.2BSD.
KW6: Main city names in the world.
KW7: English words in UNIX's spell dictionary.
In this simulation, the number m of input symbols is 128 and every entry in the
double-array requires two bytes, but the double-array for KW7 is divided into two to keep the
entries of the double-array at two bytes. In the storage results, TRANSI stands for the average
storage of one transition on $(S_M \cup S_P) \times (I \cup \{\#\})$, and BCT/SFK stands for the ratio of
the size of the double-array and TAIL to that of the source file of keywords. In the time results,
Matching stands for the average matching time for each keyword. The results show that, as the
number of keywords increases, the number of bytes representing one transition approaches
4.0 and the total size of BASE, CHECK and TAIL approaches that of the source file of
keywords. This depends on the following features.
1) The extremely small value of the redundant coefficient c.
2) The tree structure of the goto graph, which merges common prefixes into the
same transitions.
3) A compact string memory TAIL representing all transitions on single-states, which account for
60% to 76% of the total number of states.
The list forms [17],[19] representing a tree structure require from three to five bytes per
transition, so we can say that the presented data structure is compact.
As may readily be seen in the results, the matching is very fast and the
construction can be performed at a practical speed.
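
The BCT/SFK ratio can be re-derived from the storage rows of Table I. As a worked check on two columns, using only values printed in the table (sizes in kilobytes):

```latex
\[
\left.\frac{\text{BASE, CHECK and TAIL}}{\text{Source File}}\right|_{\mathrm{KW1}}
  = \frac{0.25 + 0.11}{0.18} = \frac{0.36}{0.18} = 2.00,
\qquad
\left.\frac{\text{BASE, CHECK and TAIL}}{\text{Source File}}\right|_{\mathrm{KW7}}
  = \frac{162 + 60}{196} = \frac{222}{196} \approx 1.13 .
\]
```

The ratio falls from 2.00 for the smallest set to about 1.13 for the largest, which is the trend the text describes as the total size approaching that of the source file.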


Table I Simulation results.

                                  KW1     KW2     KW3     KW4     KW5     KW6      KW7
Number and Length
  Number of Keywords               35      45     310     603     657   1,480   23,976
  Number of Multistates            17      24     301     296     394     981   16,518
  Number of Single-States         109     216     947   1,630   1,730   7,281   59,941
  Number of Total States          161     285   1,558   2,529   2,781   9,742  100,235
  Average Length of Keywords      5.1     6.4     7.5     6.7     6.9     9.5      8.2
Storage (kilobytes)
  BASE and CHECK                 0.25    0.31    2.69    3.75    4.45    9.89      162
  TAIL                           0.11    0.22    0.95    1.63    1.73    7.28       60
  BASE, CHECK and TAIL           0.36    0.53    3.64    5.38    6.18   17.17      222
  TRANSI                         4.81    4.55    4.42    4.17    4.23    4.02     4.00
  Source File of Keywords        0.18    0.29    2.33    4.02    4.54   14.08      196
  BCT/SFK                        2.00    1.83    1.56    1.34    1.36    1.22     1.13
Redundant Coefficient
  c                              0.08    0.06    0.48    0.30    0.48    0.16     0.11
Time
  Matching (milliseconds)        31.4    31.3    30.6    30.3    31.6    29.1     31.7
  Construction (seconds)          1.2     1.6    10.1    21.1    23.0    52.3      840

V. Conclusion

An efficient implementation of a string pattern matching machine has been described by
improving the three-array structure of the compiler-compilers LEX [18] and YACC [14] in UNIX
systems. It has been shown by empirical results for various sets of keywords that the
obtained data structure is very compact and that the presented matching program is very
fast.
The presented data structure has been used for a lexical analyzer [13]; for filtering high frequency
words in natural language processing [7]; and for a Roman-Hiragana translator [7] of a Japanese
word processor, which converts Roman characters into the corresponding Hiragana characters.
The presented method can be applied to a static key search in place of perfect
hashing [8],[10],[15],[17],[22] and to the reduction algorithm of sparse matrices [5],[24]. It would
be a very interesting study to update the double-array dynamically and to extend the use of the
double-array to the transition tables of a finite state machine [2] associated with parsing
tables [14].
The developed program will be provided on request to any reader. Please feel free to
contact me.

References

[1] A. V. Aho and M. J. Corasick, "Efficient string matching: An aid to bibliographic search,"
Comm. ACM, vol. 18, pp. 333-340, June 1975.
[2] A. V. Aho, R. Sethi and J. D. Ullman, Compilers: Principles, Techniques, and Tools,
Addison-Wesley, Reading, Mass., Ch. 3, pp. 144-146, 1986.
[3] J. Aoe, Y. Yamamoto and R. Shimada, "An efficient method for storing and retrieving finite
state machines," (in Japanese), IECE Trans., vol. J65-D, pp. 1235-1242, Oct. 1982; (in English),
Electronica Japonica, vol. 13.
[4] ---, "A method for improving string pattern matching machines," IEEE Trans. Softw. Eng.,
vol. SE-10, pp. 116-120, Jan. 1984.
[5] ---, "An efficient algorithm of reducing sparse matrices by row displacements,"
(in Japanese), Trans. IPS Japan, vol. 26, pp. 211-218, Mar. 1985.
[6] ---, "An efficient implementation of static string pattern matching machines," Proc. of the
First Int. Conf. on Supercomputing, pp. 491-498, Dec. 1985.
[7] J. Aoe and M. Fujikawa, "An efficient representation of hierarchical semantic primitives: An
aid to machine translation systems," Proc. of the Second Int. Conf. on Supercomputing,
pp. 361-370, May 1987.
[8] F. Berman, E. Bock, E. Dittert, M. J. O'Donnell and D. Plank, "Collections of functions for
perfect hashing," SIAM J. Comput., vol. 15, pp. 604-618, Feb. 1986.
[9] R. G. G. Cattell, "Automatic derivation of code generators from machine descriptions,"
ACM Trans. Prog. Lang. Syst., vol. 4, pp. 173-190, Jan. 1982.
[10] R. J. Cichelli, "Minimal perfect hash functions made simple," Comm. ACM, vol. 23, pp. 17-19,
Jan. 1980.
[11] J. W. Davidson and C. W. Fraser, "The design and application of a retargetable peephole
optimizer," ACM Trans. Prog. Lang. Syst., vol. 4, pp. 21-36, Jan. 1982.
[12] S. L. Graham, "Table-driven code generation," Computer, vol. 13, pp. 25-34, Aug. 1980.
[13] J. Harada, "Pascal compiler by table driven lexical and syntax analyzers," (in Japanese),
Tokushima Univ., Graduation Thesis, 1983.
[14] S. C. Johnson, "YACC: yet another compiler-compiler," CSTR 32, Bell Lab., N. J., pp. 1-34,
1975.
[15] G. Jaeschke, "Reciprocal hashing: A method for generating minimal perfect hashing
functions," Comm. ACM, vol. 24, pp. 829-833, Dec. 1981.
[16] D. E. Knuth, J. H. Morris and V. R. Pratt, "Fast pattern matching in strings," SIAM J.
Comput., vol. 6, pp. 323-350, June 1977.
[17] D. E. Knuth, The Art of Computer Programming, vol. I, Fundamental Algorithms, pp. 295-304,
Addison-Wesley, Reading, Mass.; ibid., vol. III, Sorting and Searching, pp. 481-505, 1973.
[18] M. E. Lesk, "Lex: a lexical analyzer generator," CSTR 39, Bell Lab., N. J., pp. 1-13, Oct. 1975.
[19] J. L. Peterson, Computer Programs for Spelling Correction, Lecture Notes in Comput.
Sci., Springer-Verlag, New York, 1980.
[20] T. Sato, "COBOL Technique," (in Japanese), Tokyo-Denki-Daigaku, 1970.
[21] B. A. Sheil, "Median split trees: A fast lookup technique for frequently occurring keys,"
Comm. ACM, vol. 21, pp. 947-959, Nov. 1978.
[22] R. Sprugnoli, "Perfect hashing functions: a single probe retrieving method for static sets,"
Comm. ACM, vol. 20, pp. 841-850, Nov. 1977.
[23] A. S. Tanenbaum, H. Staveren and J. W. Stevenson, "Using peephole optimization on
intermediate code," ACM Trans. Prog. Lang. Syst., vol. 4, pp. 21-36, Jan. 1982.
[24] R. E. Tarjan and A. C. Yao, "Storing a sparse table," Comm. ACM, vol. 22, pp. 606-611, Nov.
1979.
[25] N. Wirth, "The programming language Pascal," Acta Inf., vol. 1, pp. 35-63, Jan. 1971.

166 pages
IE1005 Part 1 Review Yr202223
No ratings yet
IE1005 Part 1 Review Yr202223
4 pages
Keffiyeh_Made in China
No ratings yet
Keffiyeh_Made in China
36 pages
NB 2024 08 09 45
No ratings yet
NB 2024 08 09 45
48 pages
Digital System Design
No ratings yet
Digital System Design
3 pages
Do's and Don'Ts of Business Writing
No ratings yet
Do's and Don'Ts of Business Writing
2 pages


utilized by many users as compiler-compilers LEX[18] and YACC[14] in UNIX systems[2].

II. A Pattern Matching Machine

B. Construction of the goto function

A triple-array structure using three one-dimensional arrays BASE, CHECK and NEXT as

[Definition 1] The following states are defined on the goto graph.

[Definition 2] A string x# such that g(s, x#) = s' for s in Sp and s' in Sx is called a single-string

Fig. 1 Triple-array structure (BASE, CHECK, NEXT) for g(s, a) = s'.
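The lookup that a triple-array supports can be sketched as follows. This is a minimal illustration under the usual convention for such tables: the slot for g(s, a) is t = BASE[s] + a, the slot belongs to s only if CHECK[t] = s, and NEXT[t] then holds s'. The toy tables, the symbol coding (a = 1, b = 2), the table size and the name goto_triple are hypothetical and are not taken from the paper.

```c
#include <stddef.h>

#define FAIL (-1)
#define TABLE_SIZE 8

/* Hypothetical toy machine (not from the paper): state 1 is the initial
   state and symbols are coded a = 1, b = 2.
   Transitions: g(1,a) = 2, g(1,b) = 3, g(2,b) = 4. */
static const int BASE[5]            = {0, 0, 2, 0, 0};          /* indexed by state */
static const int CHECK[TABLE_SIZE]  = {0, 1, 1, 0, 2, 0, 0, 0}; /* owner of each slot */
static const int NEXT[TABLE_SIZE]   = {0, 2, 3, 0, 4, 0, 0, 0}; /* target state */

/* Return g(s, a), or FAIL when no transition is defined.
   One addition and one comparison, so the cost does not depend
   on the number of states or symbols. */
int goto_triple(int s, int a)
{
    int t = BASE[s] + a;                     /* candidate slot */
    if (t < 0 || t >= TABLE_SIZE || CHECK[t] != s)
        return FAIL;                         /* slot empty or owned by another state */
    return NEXT[t];
}
```

With these tables, goto_triple(1, 2) yields state 3, while goto_triple(2, 1) fails because slot 3 is not owned by state 2. Rows of different states may interleave in the same arrays, which is where the space saving over a full two-dimensional table comes from.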

We propose that the transitions from S~ × (I ∪ {#}) to (S~ ∪ Sp) be stored in the double-array

In this elaborate structure as depicted in Fig. 2, BASE[r] has two values to indicate a

Fig. 2 Double-array and TAIL structure.
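A double-array with a TAIL can be sketched along the following lines. The sketch assumes that a transition g(s, a) = t is valid when t = BASE[s] + a and CHECK[t] = s (so the slot index itself is the next state), and that a negative BASE[s] marks a separate state whose remaining single-string is stored in TAIL starting at position -BASE[s]. The sign convention is an assumption suggested by the negative BASE entries in the figures; the keyword set {"ba", "cab"}, the symbol coding and the name is_keyword are hypothetical.

```c
#include <string.h>

#define DA_SIZE 8

/* Hypothetical toy data for the keywords "ba" and "cab" (not from the paper).
   Symbol codes: '#' = 1, 'a' = 2, 'b' = 3, 'c' = 4. State 1 is the initial state.
   g(1,'b') = 4 and g(1,'c') = 5; both targets are separate states. */
static const int  BASE[DA_SIZE]  = {0, 1, 0, 0, -1, -3, 0, 0};
static const int  CHECK[DA_SIZE] = {0, 0, 0, 0,  1,  1, 0, 0};
static const char TAIL[]         = " a#ab#";  /* position 1: "a#", position 3: "ab#" */

static int code(char c)
{
    if (c == '#') return 1;
    if (c >= 'a' && c <= 'c') return c - 'a' + 2;
    return 0;                                 /* outside the toy alphabet */
}

/* Return 1 if word is one of the stored keywords, 0 otherwise. */
int is_keyword(const char *word)
{
    int s = 1;
    size_t i = 0, n = strlen(word);

    if (strchr(word, '#') != NULL)
        return 0;                             /* '#' is reserved as the endmarker */

    for (;;) {
        if (BASE[s] < 0) {                    /* separate state: finish in TAIL */
            const char *rest = TAIL + (-BASE[s]);
            size_t r = n - i;
            return strncmp(rest, word + i, r) == 0 && rest[r] == '#';
        }
        int a = (i < n) ? code(word[i]) : 1;  /* use the endmarker after the last symbol */
        if (a == 0)
            return 0;
        int t = BASE[s] + a;
        if (t < 0 || t >= DA_SIZE || CHECK[t] != s)
            return 0;                         /* mismatch on the double-array */
        if (i == n)
            return 1;                         /* endmarker transition accepted */
        s = t;                                /* the slot index is the next state */
        ++i;
    }
}
```

Only the branching part of the goto graph occupies BASE and CHECK; the unbranched remainder of each keyword is kept as a plain string in TAIL and compared directly.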

(Example 1) Figure 3 shows the reduced goto graph and the single-strings for

Fig. 3 The reduced goto graph for K:

B. Construction of The Reduced Data Structure

• SEPARATE[s]: The entry of an array SEPARATE having TRUE if state s is a separate state,

/* Routine for multistates */

/*(a-2)*/ for(j=1; j <= m; ++j)

/* Routine for separate state */

BASE 1 1 5 6 -1 -8 -16 -15 -34 8 -18 2 9 12 23 0 7 -27 -3 -4 -5

Fig. 4 The double-array and TAIL for K'.

B. A Matching Program by The Reduced Data Structure

In this program, line (b-1) returns FALSE when a mismatch is detected on the double-array.

* Construction Time of the double-array

Table I represents the simulation results on a Sun-3/260 workstation for the following sets of

Table I Simulation results.

KW1 KW2 KW3 KW4 KW5 KW6 KW7

Number and Length

c 0.08 0.06 0.48 0.30 0.48 0.16 0.11

An efficient implementation of a string pattern matching machine has been described by
