Chapter 3 implementation_of_lexical_analysis

The document discusses the implementation of lexical analysis using regular expressions and finite automata. It covers the notation for regular expressions, the process of tokenizing input, handling ambiguities, and error management. Additionally, it explains the differences between deterministic and nondeterministic finite automata, their acceptance criteria, and the conversion from regular expressions to finite automata.

Uploaded by ahmed301saeed
Copyright: © All Rights Reserved

Implementation of Lexical Analysis

Chapter 3

Notation
• There is variation in regular expression notation
• At least one: A+ ≡ AA*
• Union: A | B ≡ A + B
• Option: A? ≡ A + ε
• Range: [a-z] ≡ ‘a’ + ‘b’ + … + ‘z’
• Excluded range: [^a-z] ≡ complement of [a-z]



Regular Expressions in Lexical Specification
• Last lecture: a specification for the predicate s ∈ L(R), where L(R) is a set of strings
• But a yes/no answer is not enough!
• Instead: partition the input into tokens
c1c2c3 c4c5c6c7 …
• We adapt regular expressions to this goal

Regular Expressions => Lexical Spec. (1)
1. Write a rexp for the lexemes of each token
• Number = digit+
• Keyword = ‘if’ + ‘else’ + …
• Identifier = letter (letter + digit)*
• OpenPar = ‘(’
• …

Regular Expressions => Lexical Spec. (2)
2. Construct R, matching all lexemes for all tokens
R = Keyword + Identifier + Number + …
  = R1 + R2 + …

Regular Expressions => Lexical Spec. (3)
3. Let input be x1…xn
For 1 ≤ i ≤ n check x1…xi ∈ L(R)
4. If success, then we know that x1…xi ∈ L(Rj) for some j
5. Remove x1…xi from input and go to (3)

Ambiguities (1)
• There are ambiguities in the algorithm
• How much input is used? What if
– x1…xi ∈ L(R) and also
– x1…xk ∈ L(R), with k ≠ i
• Rule: Pick longest possible string in L(R)
– The “maximal munch”

Ambiguities (2)
• Which token is used? What if
– x1…xi ∈ L(Rj) and also
– x1…xi ∈ L(Rk), with k ≠ j
where R = R1 + R2 + R3 + …, for example
Keyword = ‘if’ + ‘else’ + …
Identifier = letter (letter + digit)*
• Rule: use rule listed first (j if j < k)
– Treats “if” as a keyword, not an identifier

Error Handling
• What if no rule matches a prefix of input?
x1…xi ∉ L(Rj)
• Problem: Can’t just get stuck …
• Solution:
– Write a rule matching all “bad” strings
– Put it last (lowest priority)

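The tokenizing loop, the two disambiguation rules and the catch-all error rule from the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule list, the token names `Whitespace` and `Error`, and the function name `tokenize` are hypothetical, and Python's `re` module stands in for the L(Rj) membership test.

```python
import re

# Hypothetical rules in priority order; the last one is the catch-all error rule.
RULES = [
    ("Keyword", r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Number", r"[0-9]+"),
    ("OpenPar", r"\("),
    ("Whitespace", r"\s+"),
    ("Error", r"."),  # matches any single "bad" character, so the loop never gets stuck
]
COMPILED = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in RULES]


def tokenize(text):
    """Split text into (token, lexeme) pairs using maximal munch and rule priority."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len  # remove x1…xi from the input and repeat
    return tokens
```

With these rules, `tokenize("if iffy")` treats `if` as a Keyword (same length, listed first) and `iffy` as an Identifier (longer match wins), while `tokenize("a#")` reports `#` through the Error rule.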
Summary
• Regular expressions provide a concise notation for string patterns
• Use in lexical analysis requires small extensions
– To resolve ambiguities
– To handle errors
• Good algorithms known
– Require only single pass over the input
– Few operations per character (table lookup)

Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• A finite automaton consists of
– An input alphabet Σ
– A finite set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions state →(input) state

Finite Automata
• Transition
s1 →(a) s2
• Is read
In state s1 on input “a” go to state s2
• If end of input and in accepting state => accept
• Otherwise => reject
– Terminates in a state s that is NOT an accepting state (s ∉ F)
– Gets stuck

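Finite Automata State Graphs
• A state
• The start state
• An accepting state
• A transition, labelled with its input symbol (here a)
[Figure: the graphical symbol for each of the four items above]
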
A Simple Example
• A finite automaton that accepts only “1”
[Figure: state graph with states A and B and a transition labelled 1]
• Accepts ‘1’
• Rejects ‘0’

Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
[Figure: state graph with states A and B and a transition labelled 0 from A to B]
• Accepts ‘110’

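As a rough illustration of how such an automaton is executed, the sketch below encodes the "any number of 1's followed by a single 0" machine as a dictionary. The state names A and B come from the slide; the self-loop on A for input 1 is an assumption inferred from the description, and the name `accepts` is hypothetical.

```python
# Transitions of the DFA: (state, input symbol) -> next state.
# Assumption: A loops on 1 and moves to B on 0; B has no outgoing transitions.
TRANSITIONS = {
    ("A", "1"): "A",
    ("A", "0"): "B",
}
START = "A"
ACCEPTING = {"B"}


def accepts(text):
    """Return True if the automaton ends in an accepting state after reading text."""
    state = START
    for symbol in text:
        if (state, symbol) not in TRANSITIONS:
            return False  # the automaton gets stuck: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING  # end of input: accept only in an accepting state
```

For example, `accepts("110")` is true, while `accepts("11")` ends in the non-accepting state A and `accepts("100")` gets stuck in B.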
And Another Example
• Alphabet {0,1}
• What language does this recognize?
[Figure: state graph with transitions labelled 1, 0, 0, 0, 1, 1]
• Select the regular language that denotes the same language as this finite automaton
– (0 + 1)*
– (1* + 0)(1 + 0)
– 1* + (01)* + (001)* + (000*1)*
– (0 + 1)*00

Epsilon Moves
• Another kind of transition: ε-moves
A →(ε) B
• Machine can move from state A to state B without reading input

Deterministic and Nondeterministic Automata
• Deterministic Finite Automata (DFA)
– One transition per input per state
– No ε-moves
• Nondeterministic Finite Automata (NFA)
– Can have multiple transitions for one input in a given state
– Can have ε-moves

Execution of Finite Automata
• A DFA can take only one path through the state graph
– Completely determined by input
• NFAs can choose
– Whether to make ε-moves
– Which of multiple transitions for a single input to take

Acceptance of NFAs
• An NFA can get into multiple states
[Figure: NFA with states A, B and C and transitions labelled 1, 0 and 0]
• Input: 1 0 0
• Possible states: {A} {A,B} {A,B,C}
• Rule: NFA accepts if it can get to a final state

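One way to make the acceptance rule concrete is to track the whole set of possible states while reading the input. The sketch below does this in Python. The edge list is an assumption, chosen so that it reproduces the state sets {A}, {A,B}, {A,B,C} listed on the slide, and C is assumed to be the final state; the names `run` and `accepts` are hypothetical.

```python
# NFA transitions: (state, input symbol) -> set of possible next states.
# Assumed edges: A loops on 1 and on 0, A also goes to B on 0, B goes to C on 0.
NFA = {
    ("A", "1"): {"A"},
    ("A", "0"): {"A", "B"},
    ("B", "0"): {"C"},
}
START = "A"
FINAL = {"C"}  # assumption: C is the only final state


def run(text):
    """Return the set of possible states after each input symbol."""
    current = {START}
    history = []
    for symbol in text:
        following = set()
        for state in current:
            following |= NFA.get((state, symbol), set())
        current = following
        history.append(current)
    return history


def accepts(text):
    """The NFA accepts if at least one possible state at end of input is final."""
    states = run(text)[-1] if text else {START}
    return bool(states & FINAL)
```

Under these assumptions `run("100")` yields `[{"A"}, {"A", "B"}, {"A", "B", "C"}]`, and the input is accepted because the last set contains C.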
NFA vs. DFA (1)
• NFAs and DFAs recognize the same set of languages (regular languages)
• DFAs are faster to execute
– There are no choices to consider
• NFAs are, in general, smaller
– Sometimes exponentially smaller

NFA vs. DFA (2)
• For a given language NFA can be simpler than DFA
[Figure: state graphs of an NFA and of a DFA for the same language over {0,1}]
• DFA can be exponentially larger than NFA

Regular Expressions to Finite Automata
• High-level sketch
Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA

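Regular Expressions to NFA (1)
• For each kind of rexp, define an NFA
– Notation: NFA for rexp M
[Figure: box labelled M with a start state and a final state]
• For ε
[Figure: a single ε-move from the start state to the final state]
• For input a
[Figure: a single transition labelled a from the start state to the final state]
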
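Regular Expressions to NFA (2)
• For AB
[Figure: NFA for A connected by an ε-move to the NFA for B]
• For A + B
[Figure: ε-moves from a new start state into the NFAs for A and B, and ε-moves from each of them into a new final state]
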
Regular Expressions to NFA (3)
• For A*
[Figure: NFA for A with ε-moves that allow skipping A or repeating it]

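The three slides above define one NFA fragment per kind of regular expression. A minimal sketch of that idea in Python follows. The class name `NFABuilder`, its method names and the (start, final) pair representation are all hypothetical, and the exact placement of ε-moves is one common choice rather than necessarily the one drawn in the figures.

```python
import itertools


class NFABuilder:
    """Builds an NFA fragment by fragment; a fragment is a (start, final) state pair."""

    def __init__(self):
        self.edges = []                 # (source, label, target); label None means ε
        self._ids = itertools.count()

    def _state(self):
        return next(self._ids)

    def epsilon(self):
        s, f = self._state(), self._state()
        self.edges.append((s, None, f))
        return s, f

    def symbol(self, a):
        s, f = self._state(), self._state()
        self.edges.append((s, a, f))
        return s, f

    def concat(self, A, B):
        # AB: ε-move from the final state of A to the start state of B.
        self.edges.append((A[1], None, B[0]))
        return A[0], B[1]

    def union(self, A, B):
        # A + B: new start with ε-moves into A and B, ε-moves out to a new final state.
        s, f = self._state(), self._state()
        self.edges += [(s, None, A[0]), (s, None, B[0]),
                       (A[1], None, f), (B[1], None, f)]
        return s, f

    def star(self, A):
        # A*: ε-moves allow skipping A entirely or going through it again.
        s, f = self._state(), self._state()
        self.edges += [(s, None, A[0]), (s, None, f), (A[1], None, s)]
        return s, f

    def accepts(self, fragment, text):
        """Simulate the fragment on text by tracking the set of possible states."""
        def closure(states):
            seen, stack = set(states), list(states)
            while stack:
                q = stack.pop()
                for src, label, dst in self.edges:
                    if src == q and label is None and dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
            return seen

        current = closure({fragment[0]})
        for ch in text:
            moved = {dst for src, label, dst in self.edges
                     if src in current and label == ch}
            current = closure(moved)
        return fragment[1] in current


# Usage: the expression (1+0)*1.
b = NFABuilder()
nfa = b.concat(b.star(b.union(b.symbol("1"), b.symbol("0"))), b.symbol("1"))
```

Building (1+0)*1 this way creates ten states, and `b.accepts(nfa, "0101")` is true while `b.accepts(nfa, "10")` is false.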
Example of RegExp -> NFA conversion
• Consider the regular expression (1+0)*1
• The NFA is
[Figure: NFA with states A through J; C goes to E on 1, D goes to F on 0, I goes to J on 1; the remaining transitions are ε-moves]

NFA to DFA: The Trick
• Simulate the NFA
• Each state of DFA
= a non-empty subset of states of the NFA
• Start state
= ε-closure of the start state of NFA
• Add a transition S →(a) S’ to DFA iff
– S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well
• Final states
= Subsets that include at least one final state of NFA

ε-closure of a state
ε-closure(B) = {B,C,D}
ε-closure(G) = {A,B,C,D,G,H,I}

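The two closures above can be computed with a simple worklist. In the Python sketch below, the ε-edge list is an assumption: it was reconstructed so that it reproduces the closures stated on the slide for the (1+0)*1 NFA, and the name `epsilon_closure` is hypothetical.

```python
# ε-moves of the NFA, keyed by source state (assumed edge list).
EPSILON = {
    "A": {"B", "H"},
    "B": {"C", "D"},
    "E": {"G"},
    "F": {"G"},
    "G": {"A", "H"},
    "H": {"I"},
}


def epsilon_closure(states):
    """Return every state reachable from `states` using only ε-moves."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        state = worklist.pop()
        for target in EPSILON.get(state, ()):
            if target not in closure:
                closure.add(target)
                worklist.append(target)
    return closure
```

With this edge list, `epsilon_closure({"B"})` gives {B, C, D} and `epsilon_closure({"G"})` gives {A, B, C, D, G, H, I}.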
NFA -> DFA Example
[Figure: the NFA for (1+0)*1 with states A through J, repeated while the DFA is built up step by step]
• Start state: ABCDHI
• ABCDHI on 0 goes to FGHIABCD
• ABCDHI on 1 goes to EJGHIABCD
• FGHIABCD on 0 goes to FGHIABCD
• FGHIABCD on 1 goes to EJGHIABCD
• EJGHIABCD on 0 goes to FGHIABCD
• EJGHIABCD on 1 goes to EJGHIABCD

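The construction traced in this example can be sketched as a breadth-first exploration of subsets. The Python code below is an illustration only: the NFA edge lists are assumptions reconstructed to agree with the subsets named in the example, J is assumed to be the final NFA state, and the function names are hypothetical.

```python
from collections import deque

# Assumed NFA for (1+0)*1: ε-moves and labelled moves.
EPSILON = {"A": {"B", "H"}, "B": {"C", "D"}, "E": {"G"},
           "F": {"G"}, "G": {"A", "H"}, "H": {"I"}}
MOVES = {("C", "1"): {"E"}, ("D", "0"): {"F"}, ("I", "1"): {"J"}}
START, FINAL, ALPHABET = "A", {"J"}, "01"


def epsilon_closure(states):
    closure, worklist = set(states), list(states)
    while worklist:
        for target in EPSILON.get(worklist.pop(), ()):
            if target not in closure:
                closure.add(target)
                worklist.append(target)
    return frozenset(closure)


def subset_construction():
    """Return (start subset, transition table, accepting subsets) of the DFA."""
    start = epsilon_closure({START})
    table, queue = {}, deque([start])
    while queue:
        subset = queue.popleft()
        if subset in table:
            continue  # already expanded
        table[subset] = {}
        for symbol in ALPHABET:
            moved = set()
            for state in subset:
                moved |= MOVES.get((state, symbol), set())
            target = epsilon_closure(moved)
            if target:  # DFA states are non-empty subsets only
                table[subset][symbol] = target
                queue.append(target)
    accepting = {subset for subset in table if subset & FINAL}
    return start, table, accepting


def name(subset):
    """Readable label for a subset, with state names in alphabetical order."""
    return "".join(sorted(subset))
```

Running `subset_construction()` yields three DFA states. Written in alphabetical order they are ABCDHI, ABCDFGHI (FGHIABCD on the slide) and ABCDEGHIJ (EJGHIABCD on the slide), and only the last one is accepting.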
Implementation
• A DFA can be implemented by a 2D table T
– One dimension is “states”
– Other dimension is “input symbol”
– For every transition Si →(a) Sk define T[i,a] = k
• DFA “execution”
– If in state Si and input a, read T[i,a] = k and skip to state Sk
– Very efficient

Table Implementation of a DFA
[Figure: DFA with states S, T and U and transitions labelled 0 and 1]

     0   1
S    T   U
T    T   U
U    T   U

Implementation (Cont.)
• NFA -> DFA conversion is at the heart of tools such as flex
• But, DFAs can be huge
• In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations

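DFA for recognizing two relational operators
[Figure: transition diagram]
• State 0 (start) on > goes to state 6
• State 6 on = goes to state 7: return(SYMBOL, >=)
• State 6 on other goes to state 8*: return(SYMBOL, >)
• In state 8 we have accepted “>” and have read an “other” character that must be unread, that is, the input pointer moves one character back (marked with *).
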
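DFA of Pascal relational operators
[Figure: transition diagram]
• State 0 (start) on < goes to state 1
• State 1 on = goes to state 2: return(SYMBOL, <=)
• State 1 on > goes to state 3: return(SYMBOL, <>)
• State 1 on other goes to state 4*: return(SYMBOL, <)
• State 0 on = goes to state 5: return(SYMBOL, =)
• State 0 on > goes to state 6
• State 6 on = goes to state 7: return(SYMBOL, >=)
• State 6 on other goes to state 8*: return(SYMBOL, >)
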
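A hand-coded scanner for this diagram might look like the Python sketch below. It is an illustration under assumptions: the function name `scan_relop` and its return convention are hypothetical, and instead of reading the "other" character and then unreading it, the sketch peeks at it without advancing, which has the same effect as the retract step marked with *.

```python
def scan_relop(text, pos=0):
    """Return ((kind, lexeme), next_pos) for an operator at pos, or (None, pos)."""
    def peek(i):
        # The empty string at end of input plays the role of "other".
        return text[i] if i < len(text) else ""

    first = peek(pos)
    if first == "<":                              # state 1
        second = peek(pos + 1)
        if second == "=":
            return ("SYMBOL", "<="), pos + 2      # state 2
        if second == ">":
            return ("SYMBOL", "<>"), pos + 2      # state 3
        return ("SYMBOL", "<"), pos + 1           # state 4*: "other" is not consumed
    if first == "=":
        return ("SYMBOL", "="), pos + 1           # state 5
    if first == ">":                              # state 6
        if peek(pos + 1) == "=":
            return ("SYMBOL", ">="), pos + 2      # state 7
        return ("SYMBOL", ">"), pos + 1           # state 8*: "other" is not consumed
    return None, pos
```

For instance, `scan_relop("<>x")` returns `(("SYMBOL", "<>"), 2)`, and `scan_relop(">a")` returns `(("SYMBOL", ">"), 1)`, leaving the `a` to be read by the next token.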
DFA for recognizing id and keyword
[Figure: transition diagram]
• State 0 (start) on letter goes to state 9
• State 9 on letter or digit stays in state 9
• State 9 on other goes to state 10*: return(get_token(), install_id())
• get_token() returns either a KEYWORD or ID based on the type of the token
• install_id(): if the token is an ID, its lexeme is inserted into the symbol table (only one record for each lexeme), and the lexeme of the token is returned

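The id/keyword diagram can be sketched as follows. Everything specific here is an assumption for illustration: the keyword set, the list-based symbol table and the name `scan_word` are hypothetical, and this `install_id` returns the 1-based table key (as in the `<id, 1>` tokens shown later in the deck) rather than the lexeme itself.

```python
KEYWORDS = {"if", "then", "else", "begin", "end"}  # hypothetical keyword set
symbol_table = []                                  # one record per distinct lexeme


def install_id(lexeme):
    """Insert lexeme into the symbol table if it is new; return its 1-based key."""
    if lexeme not in symbol_table:
        symbol_table.append(lexeme)
    return symbol_table.index(lexeme) + 1


def scan_word(text, pos=0):
    """Return ((kind, attribute), next_pos), or (None, pos) if no letter starts at pos."""
    if pos >= len(text) or not text[pos].isalpha():
        return None, pos                           # state 0 needs a letter
    end = pos + 1
    while end < len(text) and text[end].isalnum():
        end += 1                                   # state 9: loop on letter or digit
    lexeme = text[pos:end]                         # state 10*: "other" is not consumed
    if lexeme in KEYWORDS:
        return ("KEYWORD", lexeme), end
    return ("ID", install_id(lexeme)), end
```

Scanning `rate` twice returns the same key both times, because the table keeps only one record per lexeme.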
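DFA of Pascal Unsigned Numbers
[Figure: transition diagram with states 0, 11, 12, 13, 14, 15, 16 and 17. Main path: start 0 on digit to 11, on . to 12, on digit to 13, on E to 14, on +|- to 15, on digit to 16, on other to 17*. States 11, 13 and 16 loop on digit; further edges are labelled E, digit and other.]
• On reaching the accepting state: return(NUM, lexeme of the number)
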
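A scanner for the number pattern digit+ (. digit+)? (E (+|-)? digit+)? can be sketched in Python as below. This is not a state-by-state transcription of the diagram: the function name `scan_unsigned_number` is hypothetical, and the sketch assumes that when a fraction or exponent is incomplete (for example `12.` or `3E+`), the scanner falls back to the longest valid number seen so far.

```python
def scan_unsigned_number(text, pos=0):
    """Return (("NUM", lexeme), next_pos), or (None, pos) if no number starts at pos."""
    def digits(i):
        # Advance past a run of decimal digits and return the new index.
        while i < len(text) and text[i] in "0123456789":
            i += 1
        return i

    end = digits(pos)
    if end == pos:
        return None, pos                      # the first character must be a digit
    # Optional fraction: '.' followed by at least one digit.
    if end < len(text) and text[end] == ".":
        fraction_end = digits(end + 1)
        if fraction_end > end + 1:
            end = fraction_end
    # Optional exponent: 'E', an optional sign, then at least one digit.
    if end < len(text) and text[end] == "E":
        i = end + 1
        if i < len(text) and text[i] in "+-":
            i += 1
        exponent_end = digits(i)
        if exponent_end > i:
            end = exponent_end
    return ("NUM", text[pos:end]), end        # the "other" character is not consumed
```

For example, `scan_unsigned_number("12.5E+3;")` returns the lexeme `12.5E+3` and stops before the semicolon.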
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
fi (a == f(x)) …
• However, it may be able to recognize errors like:
d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence

Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token
• Delete one character from the remaining input
• Insert a missing character into the remaining input
• Replace a character by another character
• Transpose two adjacent characters

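Of the strategies above, panic mode is the simplest to sketch. In the Python example below, the token patterns and the name `tokenize_with_panic_mode` are hypothetical; the sketch skips characters until some token pattern matches again (or whitespace is reached) and records what was skipped.

```python
import re

# Hypothetical token patterns: identifiers, integers, and a few one-character symbols.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[=+*()]")


def tokenize_with_panic_mode(text):
    """Return (tokens, errors); each error is (position, skipped characters)."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(text, pos)
        if match:
            tokens.append(match.group())
            pos = match.end()
            continue
        # Panic mode: ignore successive characters until a token can start here.
        start = pos
        while (pos < len(text) and not text[pos].isspace()
               and not TOKEN.match(text, pos)):
            pos += 1
        errors.append((start, text[start:pos]))
    return tokens, errors
```

With these patterns, `tokenize_with_panic_mode("a = $$b")` returns the tokens `a`, `=`, `b` and reports `$$` as skipped at position 4.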
Lexical Analyzer in Perspective (Revisited)
[Figure: the source program feeds the lexical analyzer; the parser requests “get next token” and receives a token <type, attribute>; both use the symbol table]

Symbol Table
key  lexeme    type  …
1    position  real  …
2    initial   real  …
3    rate      real  …

Source: position = initial + rate * 60
Output: <id, 1> <op, = > <id, 2> <op, + > <id, 3> <op, * > <num, 60 >

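The token stream shown for `position = initial + rate * 60` can be reproduced with a small sketch. The token patterns and the name `lex` are hypothetical, and the symbol table here stores only lexemes (the type column is filled in by later compiler phases and is not modelled).

```python
import re

# Hypothetical token patterns: identifiers, integer literals, and the operators = + *.
TOKEN = re.compile(
    r"\s*(?:(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<num>[0-9]+)|(?P<op>[=+*]))"
)


def lex(source):
    """Return (tokens, symbol_table); an id token carries its 1-based table key."""
    symbol_table = []
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"no token matches at position {pos}")
        kind = match.lastgroup            # name of the alternative that matched
        lexeme = match.group(kind)
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme) + 1))
        elif kind == "num":
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append(("op", lexeme))
        pos = match.end()
    return tokens, symbol_table
```

Calling `lex("position = initial + rate * 60")` produces the seven tokens listed above and a symbol table containing position, initial and rate under keys 1, 2 and 3.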
Using Buffer to Enhance Efficiency
[Figure: buffer holding E = M * C * * 2 eof; “lexeme beginning” marks the start of the current token and “forward” scans ahead to find a pattern match]

if forward at end of first half then begin
    reload second half ;    /* Block I/O */
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half ;     /* Block I/O */
    move forward to beginning of first half
end
else forward := forward + 1 ;

Algorithm: Buffered I/O with Sentinels
[Figure: buffer holding E = M * eof in the first half and C * * 2 eof eof in the second half, with a sentinel eof at the end of each half; “lexeme beginning” marks the start of the current token and “forward” scans ahead to find a pattern match]

forward := forward + 1 ;
if forward is at eof then begin
    if forward at end of first half then begin
        reload second half ;    /* Block I/O */
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half ;     /* Block I/O */
        move forward to beginning of first half
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis
end
• 2nd eof ⇒ no more input!

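The sentinel scheme can be simulated in Python to see why one comparison per character is usually enough. This is a sketch under assumptions: the class name `SentinelBuffer`, the half size `N = 4` and the use of `"\0"` as the eof sentinel are all hypothetical choices, and the extra check after a reload (for input that ends exactly on a half boundary) is an addition not spelled out in the pseudocode.

```python
import io

EOF = "\0"  # sentinel character; assumed never to occur in the source text
N = 4       # size of each buffer half (tiny here, to exercise the reloads)


class SentinelBuffer:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [half 1: N chars][EOF sentinel][half 2: N chars][EOF sentinel]
        self.buf = [EOF] * (2 * N + 2)
        self._load(0)
        self.forward = -1

    def _load(self, start):
        chunk = self.stream.read(N)  # one block read per half
        for i in range(N):
            # A short read leaves an EOF inside the half, marking the end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Advance forward and return the next character, or None at end of input."""
        self.forward += 1
        if self.buf[self.forward] == EOF:        # the only test in the common case
            if self.forward == N:                # sentinel at end of first half
                self._load(N + 1)
                self.forward += 1
            elif self.forward == 2 * N + 1:      # sentinel at end of second half
                self._load(0)
                self.forward = 0
            else:
                return None                      # eof within a half: no more input
            if self.buf[self.forward] == EOF:
                return None                      # the reloaded half was empty
        return self.buf[self.forward]


def read_all(text):
    """Read text back through the buffer, one character at a time."""
    buffer = SentinelBuffer(io.StringIO(text))
    out = []
    while (ch := buffer.next_char()) is not None:
        out.append(ch)
    return "".join(out)
```

Reading `"E = M * C ** 2"` through `read_all` returns the same string, even though the input crosses several half boundaries.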
