Lecture02 Scanning 2
Kenneth C. Louden
2. Scanning (Lexical Analysis)
PART TWO
Contents
PART ONE
2.1 The Scanning Process
2.2 Regular Expression
2.3 Finite Automata
PART TWO
2.4 From Regular Expressions to DFAs
2.5 Implementation of a TINY Scanner
2.6 Use of Lex to Generate a Scanner Automatically
2.4 From Regular Expressions to DFAs
Main Purpose
• Study an algorithm:
– Translating a regular expression into a DFA via an NFA.
Regular Expression → NFA → DFA → Program
Contents
• From a Regular Expression to an NFA
• From an NFA to a DFA
• Simulating an NFA using the Subset Construction
• Minimizing the Number of States in a DFA
2.4.1 From a Regular Expression to an NFA
The Idea of Thompson’s Construction
• Use ε-transitions
– to “glue together” the machine of each piece of a regular
expression
– to form a machine that corresponds to the whole expression
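• Basic regular expressions
– The NFAs for the basic regular expressions of the form a, ε, or φ
[Figure: two-state NFAs for a and for ε, each with a single transition from the start state to the accepting state]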
• Concatenation: construct an NFA equivalent to rs
– Connect the accepting state of the machine of r to the start state of the machine of s by an ε-transition.
– The start state of the machine of r becomes the start state of the new machine, and the accepting state of the machine of s becomes its accepting state.
– This machine accepts L(rs) = L(r)L(s) and so corresponds to the regular expression rs.
[Figure: the machine of r joined to the machine of s by an ε-transition]
• Choice among alternatives: construct an NFA equivalent to r | s
– Add a new start state and a new accepting state, and connect them to the machines of r and s using ε-transitions.
– This machine accepts the language L(r|s) = L(r) ∪ L(s), and so corresponds to the regular expression r|s.
[Figure: a new start state with ε-transitions to the machines of r and s, each of which has an ε-transition to a new accepting state]
• Repetition: given a machine that corresponds to r, construct a machine that corresponds to r*
– Add two new states, a start state and an accepting state.
– The repetition is afforded by the new ε-transition from the accepting state of the machine of r to its start state.
– Draw an ε-transition from the new start state to the new accepting state, so that the empty string is accepted.
– This construction is not unique; simplifications are possible in many cases.
[Figure: NFA for r* with a new start state and a new accepting state, joined to the machine of r by ε-transitions]
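The constructions for basic expressions, concatenation, choice, and repetition can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the textbook's code: the names Frag, symbol, concat, alt, star, and accepts are hypothetical, states are numbered by a counter, and None stands for ε. The accepts helper exists only to check the resulting machines.

```python
from itertools import count

EPS = None          # label used for ε-transitions (an assumption of this sketch)
_ids = count(1)     # supplies fresh state numbers


class Frag:
    """An NFA fragment with exactly one start state and one accepting state."""

    def __init__(self, start, accept, edges):
        self.start = start
        self.accept = accept
        self.edges = edges  # list of (source, label, target) triples


def symbol(a):
    """Basic regular expression: a single transition on the character a."""
    s, t = next(_ids), next(_ids)
    return Frag(s, t, [(s, a, t)])


def concat(r, s):
    """rs: ε-transition from the accepting state of r to the start state of s."""
    return Frag(r.start, s.accept, r.edges + s.edges + [(r.accept, EPS, s.start)])


def alt(r, s):
    """r|s: new start and accepting states glued on with four ε-transitions."""
    start, accept = next(_ids), next(_ids)
    glue = [(start, EPS, r.start), (start, EPS, s.start),
            (r.accept, EPS, accept), (s.accept, EPS, accept)]
    return Frag(start, accept, r.edges + s.edges + glue)


def star(r):
    """r*: new start and accepting states, a loop back, and a bypass."""
    start, accept = next(_ids), next(_ids)
    glue = [(start, EPS, r.start), (r.accept, EPS, accept),
            (r.accept, EPS, r.start),   # repetition
            (start, EPS, accept)]       # zero occurrences
    return Frag(start, accept, r.edges + glue)


def accepts(nfa, text):
    """Check a string by tracking the set of states the NFA could be in."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            u = stack.pop()
            for src, label, dst in nfa.edges:
                if src == u and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({nfa.start})
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = closure(moved)
    return nfa.accept in current


# The ab|a example from the slides, built bottom-up.
ab_or_a = alt(concat(symbol("a"), symbol("b")), symbol("a"))
print(accepts(ab_or_a, "ab"), accepts(ab_or_a, "a"), accepts(ab_or_a, "b"))
```

Each combinator only adds ε-transitions and at most two states, which is why the resulting machines are larger than necessary but trivially correct.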
Examples of NFA Construction
Example 2.12: Translate the regular expression ab|a into an NFA.
[Figure: the machines for a and b, their concatenation ab, and the complete NFA for ab|a]
Example 2.13: Translate the regular expression letter(letter|digit)* into an NFA.
[Figure: the machines for letter and digit, the choice letter|digit, the repetition (letter|digit)*, and the complete NFA for letter(letter|digit)*]
2.4.2 From an NFA to a DFA
Goal and Methods
• Goal
– Given an arbitrary NFA, construct an equivalent DFA. (i.e.,
one that accepts precisely the same strings)
• Some methods
– (1) Eliminating ε-transitions
• ε-closure: the set of all states reachable by ε-transitions from a state
or states
– (2) Eliminating multiple transitions from a state on a single input
character.
• Keeping track of the set of states that are reachable by matching a
single character
– Both these processes lead us to consider sets of states instead of
single states. Thus, it is not surprising that the DFA we construct
has sets of states of the original NFA as its states.
The Algorithm Called Subset Construction
• The ε-closure of a set of states:
– The ε-closure of a single state s is the set of states reachable from s by a series of zero or more ε-transitions; we write this set as $\bar{s}$.
• Example 2.14: the regular expression a*
[Figure: NFA for a* with states 1 to 4 and transitions 1 --ε--> 2, 2 --a--> 3, 3 --ε--> 4, 3 --ε--> 2, 1 --ε--> 4]
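The ε-closure of a set of states is the union of the ε-closures of each individual state:
$\bar{S} = \bigcup_{s \in S} \bar{s}$
For example, in the NFA for a*:
$\overline{\{1,3\}} = \bar{1} \cup \bar{3} = \{1,2,4\} \cup \{2,3,4\} = \{1,2,3,4\}$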
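The ε-closure can be computed with a simple worklist. The following is a minimal sketch: the dictionary encoding of the a* NFA, the empty string as the ε label, and the name eps_closure are assumptions made for illustration.

```python
EPS = ""  # label standing for ε in this sketch

# NFA for a* with states 1..4: state -> list of (label, target)
NFA = {
    1: [(EPS, 2), (EPS, 4)],
    2: [("a", 3)],
    3: [(EPS, 2), (EPS, 4)],
    4: [],
}


def eps_closure(states, nfa=NFA):
    """Return every state reachable from `states` by zero or more ε-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for label, t in nfa[s]:
            if label == EPS and t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


print(sorted(eps_closure({1})))     # closure of state 1
print(sorted(eps_closure({1, 3})))  # closure of the set {1, 3}
```

Because the closure of a set is the union of the closures of its members, starting the worklist with all members at once gives the same result as computing each closure separately.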
The Subset Construction Algorithm
(1) Compute the ε-closure of the start state of M; this becomes the start state of the new DFA $\bar{M}$.
(2) For this set, and for each subsequent set, compute transitions on characters a as follows.
Given a set S of states and a character a in the alphabet, compute the set
$S'_a = \{\, t \mid \text{for some } s \in S \text{ there is a transition from } s \text{ to } t \text{ on } a \,\}$.
Then compute $\overline{S'_a}$, the ε-closure of $S'_a$.
This defines a new state in the subset construction, together with a new transition $S \xrightarrow{a} \overline{S'_a}$.
(3) Continue with this process until no new states or transitions are created.
(4) Mark as accepting those states constructed in this manner that contain an accepting state of M.
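Steps (1) to (4) can be sketched in Python as follows. This is an illustration under stated assumptions: the NFA is the Thompson machine for ab|a with states numbered 1 to 8, the empty string stands for ε, and the names eps_closure and subset_construction are hypothetical.

```python
EPS = ""  # label standing for ε in this sketch

# NFA for ab|a: state -> list of (label, target)
NFA = {
    1: [(EPS, 2), (EPS, 6)],
    2: [("a", 3)],
    3: [(EPS, 4)],
    4: [("b", 5)],
    5: [(EPS, 8)],
    6: [("a", 7)],
    7: [(EPS, 8)],
    8: [],
}
START, ACCEPTING = 1, {8}


def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for label, t in NFA[s]:
            if label == EPS and t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(alphabet):
    start = eps_closure({START})                 # step (1)
    dfa = {}                                     # DFA state -> {char: DFA state}
    worklist = [start]
    while worklist:                              # step (3): until nothing new
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:                       # step (2)
            moved = {t for s in S for label, t in NFA[s] if label == a}
            if not moved:
                continue                         # no a-transition: implicit error
            target = eps_closure(moved)
            dfa[S][a] = target
            worklist.append(target)
    accepting = {S for S in dfa if S & ACCEPTING}  # step (4)
    return start, dfa, accepting


start, dfa, accepting = subset_construction("ab")
for S, row in dfa.items():
    for a, T in row.items():
        print(sorted(S), a, sorted(T))
```

Using frozenset for DFA states lets each set of NFA states serve directly as a dictionary key, so a set that has already been built is recognized and not processed again.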
Examples of Subset Construction
[Figure: NFA for a* with states 1 to 4]
M      ε-closure of M ($\bar{S}$)     $S'_a$
1      {1,2,4}                        3
3      {2,3,4}                        3
Resulting DFA: {1,2,4} --a--> {2,3,4}, {2,3,4} --a--> {2,3,4}
[Figure: NFA for ab|a with states 1 to 8]
Resulting DFA: {1,2,6} --a--> {3,4,7,8} --b--> {5,8}
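[Figure: NFA for letter(letter|digit)* with states 1 to 10]
M      ε-closure of M ($\bar{S}$)     $S'_{letter}$     $S'_{digit}$
1      {1}                            2
2      {2,3,4,5,7,10}                 6                 8
6      {4,5,6,7,9,10}                 6                 8
8      {4,5,7,8,9,10}                 6                 8
Resulting DFA:
{1} --letter--> {2,3,4,5,7,10}
{2,3,4,5,7,10} --letter--> {4,5,6,7,9,10}, --digit--> {4,5,7,8,9,10}
{4,5,6,7,9,10} --letter--> {4,5,6,7,9,10}, --digit--> {4,5,7,8,9,10}
{4,5,7,8,9,10} --letter--> {4,5,6,7,9,10}, --digit--> {4,5,7,8,9,10}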
2.4.3 Simulating an NFA using the Subset Construction
One Way of Simulating an NFA
• NFAs can be implemented in similar ways to DFAs, except that NFAs are nondeterministic
– There are potentially many different sequences of transitions that must be tried.
– A program must store up transitions that have not yet been tried and backtrack to them on failure.
Another Way of Simulating an NFA
• Use the subset construction
– Instead of constructing all the states of the associated DFA,
– construct only the state at each point that is indicated by the next input character.
• The advantage: there is no need to construct the entire DFA
– Example: for an input consisting of the single character a, construct the start state {1,2,6} and then the second state {3,4,7,8}, to which we move to match the a.
– Since no b follows, accept without ever generating the state {5,8}.
[Figure: DFA {1,2,6} --a--> {3,4,7,8} --b--> {5,8}]
• The disadvantage: a state may be constructed many times if the path contains loops
– Example: given the input string r2d3, the sequence of states is as shown below:
{1} --r--> {2,3,4,5,7,10} --2--> {4,5,7,8,9,10} --d--> {4,5,6,7,9,10} --3--> {4,5,7,8,9,10}
• If these states are constructed as the transitions occur, then all the states of the DFA have been constructed, and the state {4,5,7,8,9,10} has even been constructed twice
– This is less efficient than constructing the entire DFA in the first place
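The on-the-fly simulation can be sketched as follows. This is a minimal sketch, assuming the state numbering 1 to 10 for the letter(letter|digit)* NFA that matches the state sets above; the names char_class, eps_closure, and simulate are hypothetical. The returned trace makes the repeated construction of a state visible.

```python
EPS = ""  # label standing for ε in this sketch

# NFA for letter(letter|digit)*: state -> list of (label, target)
NFA = {
    1: [("letter", 2)],
    2: [(EPS, 3)],
    3: [(EPS, 4), (EPS, 10)],
    4: [(EPS, 5), (EPS, 7)],
    5: [("letter", 6)],
    6: [(EPS, 9)],
    7: [("digit", 8)],
    8: [(EPS, 9)],
    9: [(EPS, 4), (EPS, 10)],
    10: [],
}
START, ACCEPTING = 1, {10}


def char_class(ch):
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for label, t in NFA[s]:
            if label == EPS and t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def simulate(text):
    """Build only the DFA states that the input actually visits."""
    current = eps_closure({START})
    trace = [current]
    for ch in text:
        kind = char_class(ch)
        moved = {t for s in current for label, t in NFA[s] if label == kind}
        current = eps_closure(moved)   # constructed on demand, every time
        trace.append(current)
    return bool(current & ACCEPTING), trace


accepted, trace = simulate("r2d3")
print(accepted)
for state in trace:
    print(sorted(state))
```

Caching each constructed set in a dictionary keyed by (state, character class) would avoid the repeated work, at which point the simulation gradually turns into the full subset construction.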
2.4.4 Minimizing the Number of States in a DFA
Why Minimize?
• The process of deriving a DFA algorithmically from a regular expression has the unfortunate property that
– the resulting DFA may be more complex than necessary.
• Example: the DFA derived for the regular expression a* has two states, while an equivalent DFA needs only one
[Figure: the two-state DFA derived for a*, and an equivalent one-state DFA with an a-transition to itself]
An Important Result from Automata Theory for Minimizing
• Given any DFA, there is an equivalent DFA containing a minimum number of states, and this minimum-state DFA is unique (except for renaming of states)
Algorithm for Obtaining the Minimum-State DFA
1. Begin with the most optimistic assumption: partition the states into two sets, one containing all the accepting states and the other containing all the non-accepting states.
2. Given this partition of the states of the original DFA, consider the transitions on each character a of the alphabet.
(1) If all accepting states have transitions on a to accepting states, then this defines an a-transition from the new accepting state (the set of all the old accepting states) to itself.
(2) If all accepting states have transitions on a to non-accepting states, then this defines an a-transition from the new accepting state to the new non-accepting state (the set of all the old non-accepting states).
(3) On the other hand, if there are two accepting states s and t that have transitions on a that land in different sets, then no a-transition can be defined for this grouping of the states. We say that a distinguishes the states s and t, and the set containing them must be split according to where their a-transitions land.
(4) We must also consider error transitions to an error state that is non-accepting. If there are accepting states s and t such that s has an a-transition to another accepting state, while t has no a-transition at all (i.e., an error transition), then a distinguishes s and t.
3. If any further sets are split, we must return and repeat the process from the beginning. This process continues until either all sets contain only one element (in which case, we have shown the original DFA to be minimal) or until no further splitting of sets occurs.
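The splitting process can be sketched as a partition refinement. This is a minimal sketch, not the textbook's code: the function name minimize, the state names A to D, and the dictionary encoding of the transition function are assumptions. A missing transition is treated as a move to the error state, represented by None.

```python
def block_of(state, partition):
    """Index of the set containing `state`, or None for the error state."""
    for i, block in enumerate(partition):
        if state in block:
            return i
    return None


def minimize(states, alphabet, delta, accepting):
    """Return the sets of states that the minimum-state DFA merges."""
    # Optimistic start: accepting states versus non-accepting states.
    partition = [set(accepting), set(states) - set(accepting)]
    partition = [b for b in partition if b]
    while True:
        refined = []
        for block in partition:
            groups = {}
            for s in block:
                # Where does s land, set by set, on each character?
                signature = tuple(block_of(delta.get((s, a)), partition)
                                  for a in alphabet)
                groups.setdefault(signature, set()).add(s)
            refined.extend(groups.values())   # distinguished states are split
        if len(refined) == len(partition):    # no further splitting occurred
            return refined
        partition = refined


# The four-state DFA for letter(letter|digit)*; A is the start state.
states = ["A", "B", "C", "D"]
alphabet = ["letter", "digit"]
delta = {
    ("A", "letter"): "B",
    ("B", "letter"): "C", ("B", "digit"): "D",
    ("C", "letter"): "C", ("C", "digit"): "D",
    ("D", "letter"): "C", ("D", "digit"): "D",
}
blocks = minimize(states, alphabet, delta, accepting={"B", "C", "D"})
print(sorted(sorted(b) for b in blocks))
```

Here the three accepting states are never distinguished by letter or digit, so they collapse into a single state and the result has two states.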
Examples of Minimizing a DFA
Example 2.18: The regular expression letter(letter|digit)*
[Figure: the four-state DFA with states {1}, {2,3,4,5,7,10}, {4,5,6,7,9,10}, and {4,5,7,8,9,10}]
[Figure: the minimum-state DFA with states 1 and 2: 1 --letter--> 2, 2 --letter--> 2, 2 --digit--> 2]
Example 2.19: The regular expression (a|ε)b*
[Figure: the DFA for (a|ε)b* with transitions on a and b, and the minimum-state DFA in which states 2 and 3 are combined into one state]
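2.5 Implementation of a TINY Scanner
The TINY language
[Figure: DFA for TINY numbers, identifiers, and one-character symbols, with states START, INNUM, INID, and DONE; transitions on digit, letter, [other], and the symbols + - * / = < ( ) ; with actions such as return PLUS and return SEMI]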
The DFAs for the Tokens of TINY
• The DFA is extended by adding comments, white space, and assignment
• The DFA treats reserved words the same as identifiers; the scanner then looks up each identifier in a table of reserved words
[Figure: DFA of the TINY scanner with states START, INNUM, INID, INASSIGN, INCOMMENT, and DONE; transitions on white space, digit, letter, :, =, {, }, and [other]]
Ways to Translate a DFA or NFA into Code
A better method:
• Use a variable to maintain the current state.
• Write the transitions as a doubly nested case statement inside a loop,
• where the first case statement tests the current state and the nested second level tests the input character.
[Code figure: nested case statement beginning "case state of", for a DFA with states 1, 2, and 3]
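The state-variable method can be sketched in Python, with if/elif chains standing in for the two levels of case statements. This is an illustration under stated assumptions: the DFA is the three-state identifier machine (1 = start, 2 = in identifier, 3 = accept), state 0 is used as an error state, and the function name scan_identifier is hypothetical.

```python
def scan_identifier(text):
    """Return the identifier at the start of `text`, or None if there is none."""
    state, pos = 1, 0
    while state in (1, 2):
        ch = text[pos] if pos < len(text) else ""  # "" stands for end of input
        if state == 1:                      # outer case: current state
            if ch.isalpha():                # inner case: input character
                pos += 1                    # advance the input
                state = 2
            else:
                state = 0                   # error: no transition
        elif state == 2:
            if ch.isalpha() or ch.isdigit():
                pos += 1
                state = 2
            else:
                state = 3                   # [other]: accept without consuming
    return text[:pos] if state == 3 else None


print(scan_identifier("x1 := 5"))
print(scan_identifier("9a"))
```

The DFA is hard-wired into the control flow here, so every change to the automaton means editing code by hand.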
The table-driven code scheme assumes:
• The transitions are kept in a transition array T indexed by states and input characters;
• The transitions that advance the input (i.e., those not marked with brackets in the table) are given by the Boolean array Advance, also indexed by states and input characters;
• Accepting states are given by the Boolean array Accept, indexed by states.
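A generic driver built on T, Advance, and Accept might look like the following. This is a sketch only: the tables are dictionaries rather than arrays, they encode the three-state identifier DFA as an assumed example, characters are grouped into classes by the hypothetical helper char_class, and ERROR = 0 is an assumed error state.

```python
ERROR = 0  # assumed error state: entered when T has no entry


def char_class(ch):
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"  # also covers end of input, passed in as ""


# Transition table for identifiers: 1 = start, 2 = in identifier, 3 = accept.
T = {(1, "letter"): 2, (2, "letter"): 2, (2, "digit"): 2, (2, "other"): 3}
# The [other] transition is the bracketed one: it does not advance the input.
Advance = {(1, "letter"): True, (2, "letter"): True,
           (2, "digit"): True, (2, "other"): False}
Accept = {1: False, 2: False, 3: True}


def run(text):
    """Drive the DFA from the tables; return the matched prefix or None."""
    state, pos = 1, 0
    while state != ERROR and not Accept[state]:
        ch = text[pos] if pos < len(text) else ""
        key = (state, char_class(ch))
        new_state = T.get(key, ERROR)
        if Advance.get(key, False):
            pos += 1
        state = new_state
    return text[:pos] if state != ERROR and Accept[state] else None


print(run("x1+2"))
print(run("42"))
```

The driver loop never changes; a different DFA only needs different tables, at the cost of the memory the tables occupy.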
2.6.1 Lex Conventions for Regular Expressions
Conventions
• Single characters, or strings of characters, are matched by writing the characters in sequence.
• Metacharacters are matched as actual characters by surrounding them in quotes. Quotes may also be written around characters that are not metacharacters, where they have no effect.
– To match a left parenthesis, we must write "("
– An alternative is to use the backslash metacharacter \
– To match the character sequence (* we have to write \(\* or "(*"
• Metacharacters: *, +, (, ), |, ?
– Example: the set of strings of a’s and b’s that begin with either aa or bb and have an optional c at the end.
– (aa|bb)(a|b)*c? or ("aa"|"bb")("a"|"b")*"c"?
• The Lex convention for character classes (sets of characters) is to write them between square brackets.
– The above example can be written as:
– (aa|bb)[ab]*c?
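The pattern can be tried out with Python's re module, whose syntax agrees with Lex for *, |, ?, parentheses, and square brackets. This is only an illustration, not Lex itself: Python does not treat double quotes as a quoting metacharacter, and the helper name matches is hypothetical.

```python
import re

# Strings of a's and b's that begin with aa or bb and end with an optional c.
pattern = re.compile(r"(aa|bb)[ab]*c?")


def matches(s):
    """True if the whole string s belongs to the language of the pattern."""
    return pattern.fullmatch(s) is not None


for s in ["aab", "bbabac", "aa", "abc", "aacc"]:
    print(s, matches(s))
```

Note that fullmatch tests the entire string, whereas a Lex-generated scanner looks for the longest match at the current input position.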
• Ranges of characters are written using a hyphen
– The expression [0-9] means in Lex any of the digits zero through nine.
• A period is a metacharacter that represents a set of characters:
– It represents any character except a newline.
• Complementary sets are written in this notation using the caret ^ as the first character inside the brackets
– [^0-9abc] means any character that is not a digit and is not one of the letters a, b, or c.
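Ranges, the period, and complementary sets behave the same way in Python's re module, so the three conventions can be checked there. This is an illustration in Python rather than Lex, and the variable names are hypothetical.

```python
import re

digit = re.compile(r"[0-9]")              # range: any digit zero through nine
any_char = re.compile(r".")               # period: any character except a newline
not_digit_abc = re.compile(r"[^0-9abc]")  # complement: not a digit, not a, b, or c

print(bool(digit.fullmatch("7")))
print(bool(any_char.fullmatch("x")), bool(any_char.fullmatch("\n")))
print(bool(not_digit_abc.fullmatch("z")), bool(not_digit_abc.fullmatch("a")))
```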
• One curious feature is that inside square brackets (representing a character class), most of the metacharacters lose their special status and do not need to be quoted.
– We can write [-+] instead of ("+"|"-") (but not [+-], because of the metacharacter use of - to express a range of characters).
– [."?] means any of the three characters period, quotation mark, or question mark.
• Some characters, however, are still metacharacters even inside the square brackets, and to get the actual character, we must precede the character by a backslash.
– [\^\\] means either of the actual characters ^ or \.
• A further important metacharacter convention in Lex is the use of curly brackets to denote names of regular expressions.
– A regular expression can be given a name, and these names can be used in other regular expressions as long as there are no recursive references.
– Example:
nat [0-9]+
signedNat ("+"|"-")?{nat}
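Named definitions work by textual substitution, which can be imitated in Python. This is a sketch of the idea, not of how Lex is implemented: the definitions dictionary and the function expand are hypothetical, and the sign is written with Python's backslash escape instead of Lex quotes.

```python
import re

definitions = {
    "nat": r"[0-9]+",
    "signedNat": r"(\+|-)?{nat}",
}


def expand(pattern, defs):
    """Replace each {name} by its parenthesized definition, recursively.

    Assumes there are no recursive references, as Lex requires.
    """
    def substitute(match):
        return "(" + expand(defs[match.group(1)], defs) + ")"

    # Only {identifier} is treated as a name, so counts like {2,3} are left alone.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)


signed_nat = re.compile(expand(definitions["signedNat"], definitions))
for s in ["-42", "+7", "13", "--1", "+"]:
    print(s, signed_nat.fullmatch(s) is not None)
```

Wrapping each substituted definition in parentheses keeps operators such as ? or * in the surrounding expression from binding to only part of the definition.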
The Table of Conventions
Pattern    Meaning
a          the character a
a?         an optional a
a|b        a or b
(a)        a itself
[abc]      any of the characters a, b, or c
THANKS