2-Introduction to Compilation and Lexical Analysis-19!07!2024
DESIGN
Objective
• To provide fundamental knowledge of various language
translators.
• To make students familiar with lexical analysis and
parsing techniques.
• To understand the various actions carried out in
semantic analysis.
• To make the students get familiar with how the
intermediate code is generated.
• To understand the principles of code optimization
techniques and code generation.
• To provide foundation for study of high-performance
compiler design.
Outcomes
• Apply the skills on devising, selecting, and using tools
and techniques towards compiler design
• Develop language specifications using context free
grammars (CFG).
• Apply the ideas, the techniques, and the knowledge
acquired for the purpose of developing software
systems.
• Constructing symbol tables and generating
intermediate code.
• Obtain insights on compiler optimization and code
generation
Syllabus
• Module: 1 Introduction to Compilation and Lexical Analysis 7 hours
• Introduction to LLVM - Structure and Phases of a Compiler-Design Issues-
Patterns Lexemes-Tokens-Attributes-Specification of Tokens-Extended
Regular Expression- Regular expression to Deterministic Finite Automata
(Direct method) - Lex - A Lexical Analyzer Generator
• Module: 2 Syntax Analysis 8 hours
• Role of Parser- Parse Tree - Elimination of Ambiguity – Top Down Parsing
– Recursive Descent Parsing - LL (1) Grammars – Shift Reduce Parsers-
Operator Precedence Parsing - LR Parsers, Construction of SLR Parser
Tables and Parsing- CLR Parsing- LALR Parsing
• Module: 3 Semantic Analysis 5 hours
• Syntax Directed Definition – Evaluation Order - Applications of Syntax
Directed Translation - Syntax Directed Translation Schemes -
Implementation of L-attributed Syntax Directed Definition
• Module: 4 Intermediate Code Generation 5 hours
• Variants of Syntax trees - Three Address Code- Types – Declarations -
Procedures - Assignment Statements - Translation of Expressions -
Control Flow - Back Patching- Switch Case Statements.
• Module: 5 Code Optimization 6 hours
• Loop optimizations- Principal Sources of Optimization -Introduction to
Data Flow Analysis - Basic Blocks - Optimization of Basic Blocks -
Peephole Optimization- The DAG Representation of Basic Blocks -Loops in
Flow Graphs - Machine Independent Optimization Implementation of a
naïve code generator for a virtual Machine- Security checking of virtual
machine code
• Module: 6 Code Generation 5 hours
• Issues in the design of a code generator- Target Machine- Next-Use
Information – Register Allocation and Assignment- Runtime Organization-
Activation Records.
• Module: 7 Parallelism 7 hours
• Parallelization- Automatic Parallelization- Optimizations for Cache
Locality and Vectorization- Domain Specific Languages-Compilation-
Instruction Scheduling and Software Pipelining- Impact of Language
Design and Architecture Evolution on Compilers Static Single Assignment
• Module: 8 Contemporary Issues 2 hours
Text Books & References
• Text Book
• A. V. Aho, Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman, Compilers:
Principles, techniques, & tools, 2007, Second Edition, Pearson Education,
Boston.
• Reference Books
• Watson, Des. A Practical Approach to Compiler Construction. Germany,
Springer International Publishing, 2017
Content - Module -1
• Introduction to Compilation And Lexical
Analysis
• Introduction to LLVM
• Structure and Phases of a Compiler
• Design Issues
• Patterns Lexemes
• Tokens-Attributes
• Specification of Tokens
• Extended Regular Expression
• Regular expression to Deterministic Finite Automata (Direct
method)
• Lex - A Lexical Analyzer Generator
Translator
• A translator is a program that takes one form of
program as input and converts it into another
form.
• Types of translators are:
1. Compiler
2. Interpreter
3. Assembler
[Figure: Source Program → Translator → Target Program, with Error Messages as a second output]
Compiler
• A compiler is a program that reads a program written
in source language and translates it into an equivalent
program in target language.
[Figure: phases of a compiler: lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, code generation]
Lexical analysis
• Lexical Analysis is also called linear analysis or scanning.
• Lexical Analyzer divides the given source statement into the tokens.
• It reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
• Lexical analyzer represents the lexemes in the form of tokens.
• Ex: Position = initial + rate * 60 would be grouped into the following tokens:
Position (identifier)
= (Assignment symbol)
initial (identifier)
+ (Plus symbol)
rate (identifier)
* (Multiplication symbol)
60 (Number)
• After lexical analysis the statement is represented as: id1 = id2 + id3 * 60
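The grouping of characters into lexemes and tokens can be sketched in a few lines of Python. This is only an illustration: the token class names (`identifier`, `assign`, `plus`, `times`, `number`) and the function `tokenize` are assumptions made for this sketch, not names taken from the slides.

```python
import re

# Hypothetical token classes for this sketch; each pairs a name with a pattern.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("assign", r"="),
    ("plus", r"\+"),
    ("times", r"\*"),
    ("skip", r"\s+"),          # white space separates lexemes and is dropped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Scan source left to right and return (token class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = tokenize("Position = initial + rate * 60")
for token_class, lexeme in tokens:
    print(f"{lexeme} ({token_class})")
```

A real lexical analyzer would also enter each identifier into the symbol table and return a reference to that entry; this sketch only shows the grouping step.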
Syntax analysis
Position = initial + rate*60
Semantic analysis
• Semantic analyzer determines the meaning of a source string.
• It performs following operations:
1. matching of parenthesis in the expression.
[Figure: syntax tree for id1 = id2 + id3 * 60, and the same tree after semantic analysis with inttoreal applied to 60]
Intermediate code generator
• Two important properties of intermediate code:
1. It should be easy to produce.
2. Easy to translate into target program.
• Intermediate form can be represented using “three address code”.
• Three address code consists of a sequence of instructions, each of which has at most three operands.
[Figure: syntax tree for id1 = id2 + id3 * inttoreal(60) with the inner nodes labeled t1, t2, t3]
Intermediate code
t1 = inttoreal(60)
t2 = id3 * t1
t3 = t2 + id2
id1 = t3
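One possible way to produce such three address code is a post-order walk over the syntax tree that creates a new temporary for every inner node. The sketch below assumes a nested-tuple tree representation and a helper named `generate_tac`; both are illustrative choices, not part of the slides.

```python
import itertools

def generate_tac(tree):
    """Emit three-address instructions for (target, expression).

    Leaves are strings; inner nodes are (operator, operand, ...) tuples.
    """
    code = []
    counter = itertools.count(1)

    def visit(node):
        if isinstance(node, str):          # identifier or constant: no code needed
            return node
        op, *children = node
        operands = [visit(child) for child in children]
        temp = f"t{next(counter)}"         # one new temporary per inner node
        if len(operands) == 1:             # unary operator such as inttoreal
            code.append(f"{temp} = {op}({operands[0]})")
        else:
            code.append(f"{temp} = {operands[0]} {op} {operands[1]}")
        return temp

    target, expr = tree
    code.append(f"{target} = {visit(expr)}")
    return code

# id1 = id2 + id3 * inttoreal(60); the product is listed as the first operand
# of + so that the temporaries are numbered in the same order as on the slide.
tree = ("id1", ("+", ("*", "id3", ("inttoreal", "60")), "id2"))
tac = generate_tac(tree)
print("\n".join(tac))
```

Every emitted instruction has at most three operands (one result and up to two arguments), which is the defining property of the form.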
Code optimization
• It improves the intermediate code.
• This is necessary to have a faster execution of code or less consumption of memory.
Intermediate code
t1 = inttoreal(60)
t2 = id3 * t1
t3 = t2 + id2
id1 = t3
Code generation
• The intermediate code instructions are translated into a sequence of machine instructions.
Output of code optimization
t1 = id3 * 60.0
id1 = id2 + t1
Output of code generation
MOV id3, R2
MUL #60.0, R2
MOV id2, R1
ADD R2, R1
MOV R1, id1
Register use: id3 → R2, id2 → R1
Symbol table
• Symbol tables are data structures that are used by compilers
to hold information about source-program constructs.
• It is used to store information about the occurrences of
various entities such as, objects, classes, variable names,
functions, etc.,
• It is used by both analysis phase and synthesis phase.
• Symbol table is used for the following purposes
• It is used to store the name of all the entities in a structured form at
one place
• It is used to verify if a variable has been declared
• It is used to determine the scope of a name
• It is used to implement type checking by verifying assignments and
expression in the source code are semantically correct.
• Symbol table can be a linear (linked list) or hash table
• It maintains an entry for each name as,
• <symbol name, type, attribute>
Eg.
• static int age;
• <age, int, static>
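A hash-table symbol table with one entry per name could look roughly like the following sketch. The class name `SymbolTable` and its methods `insert` and `lookup` are hypothetical; a production compiler would also track scopes, which is left out here.

```python
class SymbolTable:
    """Hash-table symbol table holding one <name, type, attribute> entry per name."""

    def __init__(self):
        self._entries = {}                     # a Python dict is a hash table

    def insert(self, name, type_, attribute=None):
        """Add a new entry; a second declaration of the same name is an error."""
        if name in self._entries:
            raise KeyError(f"{name} is already declared")
        self._entries[name] = (name, type_, attribute)

    def lookup(self, name):
        """Return the entry for name, or None if it has not been declared."""
        return self._entries.get(name)

table = SymbolTable()
table.insert("age", "int", "static")           # static int age;
print(table.lookup("age"))
print(table.lookup("height"))                  # undeclared name
```

The `lookup` result is what later phases would use to verify that a variable has been declared and to check its type.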
Phases of compiler
[Figure: the source program passes through the analysis phase (lexical analysis, syntax analysis, semantic analysis) and the synthesis phase (intermediate code, code optimization, code generation) to give the target program; the symbol table and error detection and recovery are connected to the phases]
Symbol table
Variable Name   Type    Address
Position        Float   0001
Initial         Float   0005
Rate            Float   0009
Exercise 1
• Write output of all the phases of compiler for following
statements:
1. x = b-c*2
2. I=p*n*r/100
Grouping of Phases
Front end & back end
Front end
• Depends primarily on source language and largely independent of the target machine.
• It includes following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Creation of symbol table
Back end
• Depends on the target machine and does not depend on the source program.
• It includes following phases:
1. Code optimization
2. Code generation phase
3. Error handling and symbol table operation
Difference between compiler & interpreter
Compiler | Interpreter
Scans the entire program and translates it as a whole into machine code. | Translates the program one statement at a time.
It generates intermediate code. | It does not generate intermediate code.
An error is displayed after entire program is checked. | An error is displayed for every instruction interpreted, if any.
Memory requirement is more. | Memory requirement is less.
Example: C compiler | Example: Basic, Python, Ruby
Context of Compiler (Cousins of compiler)
• In addition to compiler, many other system programs are required to generate absolute machine code.
• These system programs are:
• Preprocessor
• Assembler
• Linker
• Loader
[Figure: Skeletal Source Program → Preprocessor → Source Program → Compiler → Target Assembly Program → Assembler → Relocatable Object Code → Linker / Loader (together with Libraries & Object Files) → Absolute Machine Code]
Preprocessor
Some of the tasks performed by the preprocessor:
Assembler
Assembler is a translator which takes the assembly program as input and produces relocatable object code.
Linker
Linker makes a single program from several files.
Symbol Table
= Operator1
Tokens
sum Identifier2
+ Operator2
45 Constant1
Lexemes
Lexemes of identifier: total, sum
Lexemes of operator: =, +
Lexemes of constant: 45
Attributes of Tokens
• When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
• For example, the pattern for token number matches both 0 and 1.
• Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token.
• Normally, information about an identifier, e.g., its lexeme, its type, and the location at which it is first found, is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
The token names and associated attribute values for the Fortran statement
E = M * C ** 2
are written below as a sequence of pairs.
<id, pointer to symbol-table entry for E>
<assign op>
<id, pointer to symbol-table entry for M>
<mult op>
<id, pointer to symbol-table entry for C>
<exp op>
<float> <id, limitedSquaare> <(> <id, x> <)> <{>
<float> <id, x>
<return> <(> <id, x> <op,"<="> <num, -10.0> <op, "||"> <id, x> <op, ">="> <num, 10.0>
<)> <op, "?"> <num, 100> <op, ":"> <id, x> <op, "*"> <id, x> <}>
Input buffering
• There are mainly two techniques for input buffering:
1. Buffer pairs
2. Sentinels
Buffer Pair
• The lexical analysis scans the input string from left to right one
character at a time.
• Buffer divided into two N-character halves, where N is the
number of characters on one disk block.
[Figure: buffer pair holding E = M * C * * 2 eof, with the pointers lexeme_beginning and forward]
• Pointer Lexeme Begin, marks the beginning of the current
lexeme.
• Pointer Forward, scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to
character at its right end.
• Lexeme Begin is set to the character immediately after the
lexeme just found.
• If forward pointer is at the end of first buffer half then second
is filled with N input characters.
• If forward pointer is at the end of second buffer half then first
is filled with N input characters.
Sentinels
[Figure: buffer pair holding E = M * C * * 2 eof, with the pointers lexeme_beginning and forward]
• In buffer pairs we must check, each time we move the forward
pointer that we have not moved off one of the buffers.
• Thus, for each character read, we make two tests.
• We can combine the buffer-end test with the test for the
current character.
• We can reduce the two tests to one if we extend each buffer to
hold a sentinel character at the end.
• The sentinel is a special character that cannot be part of the
source program, and a natural choice is the character EOF.
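The effect of the sentinel can be sketched as follows. The sketch assumes the NUL character as the sentinel, a tiny half size `N = 4` so that reloads are easy to follow, and a generator named `read_chars`; all three are choices made for this illustration only.

```python
EOF = "\0"   # assumed sentinel: a character that cannot occur in the source
N = 4        # tiny half size for the demonstration; really one disk block

def read_chars(source):
    """Yield the characters of source from two N-character halves with sentinels."""
    chunks = [source[i:i + N] for i in range(0, len(source), N)]
    chunks.append("")                      # reading past the input gives nothing
    next_chunk = iter(chunks)
    # Layout: half 0 in [0, N), sentinel at N, half 1 in [N+1, 2N+1), sentinel at 2N+1.
    buffer = [EOF] * (2 * N + 2)

    def load(half):
        start = half * (N + 1)
        data = next(next_chunk)
        buffer[start:start + N] = list(data) + [EOF] * (N - len(data))

    load(0)
    forward = 0
    while True:
        ch = buffer[forward]
        if ch != EOF:                      # the only test on the common path
            yield ch
            forward += 1
        elif forward == N:                 # sentinel at the end of the first half
            load(1)
            forward = N + 1
        elif forward == 2 * N + 1:         # sentinel at the end of the second half
            load(0)
            forward = 0
        else:                              # EOF inside a half: real end of input
            return

print("".join(read_chars("E = M * C ** 2")))
```

Only when the character read is the sentinel does the code check which buffer end was reached, so ordinary characters cost one test instead of two.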
Specification of tokens
Strings and languages
Term | Definition
Prefix of s | A string obtained by removing zero or more trailing symbols of string s. e.g., ban is a prefix of banana.
Suffix of s | A string obtained by removing zero or more leading symbols of string s. e.g., nana is a suffix of banana.
Substring of s | A string obtained by removing a prefix and a suffix from s. e.g., nan is a substring of banana.
Proper prefix, suffix and substring of s | The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
2. LD is the set of 520 strings of length two, each consisting of one letter followed
by one digit.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
| - union
* - ‘Zero or more occurrence of’
Regular expression
• A regular expression is a sequence of characters that define
a pattern.
Notational shorthands
1. One or more instances: +
2. Zero or more instances: *
3. Zero or one instances: ?
4. Alphabets: Σ
Rules to define regular expression
1. ε is a regular expression that denotes {ε}, the set containing empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a. (r)|(s) is a regular expression denoting L(r) ∪ L(s)
b. (r)(s) is a regular expression denoting L(r)L(s)
c. (r)* is a regular expression denoting (L(r))*
d. (r) is a regular expression denoting L(r)
Regular expression
• L = Zero or More Occurrences of a = a* = {𝜖, a, aa, aaa, aaaa, aaaaa, …..} (infinite)
• L = One or More Occurrences of a = a+ = {a, aa, aaa, aaaa, aaaaa, …..} (infinite)
Precedence and associativity of operators
Operator         Precedence   Associativity
Kleene *         1            left
Concatenation    2            left
Union |          3            left
Under these conventions, for example, we may replace the regular expression
(a)|((b)*(c)) by a|b*c. Both expressions denote the set of strings that are
either a single a or zero or more b's followed by one c.
25. Even no. of 0
R.E. = 1*(01*01*)*
26. String should have odd length
R.E. = (0|1)((0|1)(0|1))*
27. String should have even length
R.E. = ((0|1)(0|1))*
28. String start with 0 and has odd length
R.E. = 0((0|1)(0|1))*
30. String start with 1 and has even length
R.E. = 1(0|1)((0|1)(0|1))*
31. All string begins or ends with 00 or 11
Strings: 00101, 10100, 110, 01011 …
R.E. = (00|11)(0|1)* | (0|1)*(00|11)
Regular expression examples
31. Language of all string containing both 11 and 00 as substring
Strings: 0011, 1100, 100110, 010011 …
32. String ending with 1 and not contain 00
Strings: 011, 1101, 1011 ….
R.E. = (1|01)+
33. Language of C identifier
Strings: area, i, redious, grade1 ….
R.E. = (_|L)(_|L|D)*
where L is Letter and D is digit
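Regular expressions over {0, 1} like the ones above can be checked by brute force: enumerate every binary string up to some length and compare the pattern against a direct statement of the property. The sketch below transcribes the patterns into Python's `re` syntax; the pattern for an even number of 0s is written with a leading `1*` so that strings containing no 0 at all are covered, and the helper name `mismatches` and the length bound of 8 are assumptions of this sketch.

```python
import re
from itertools import product

# name -> (pattern in Python re syntax, predicate stating the property directly)
EXAMPLES = {
    "even number of 0": (r"1*(01*01*)*", lambda s: s.count("0") % 2 == 0),
    "odd length": (r"(0|1)((0|1)(0|1))*", lambda s: len(s) % 2 == 1),
    "even length": (r"((0|1)(0|1))*", lambda s: len(s) % 2 == 0),
    "starts with 0, odd length": (
        r"0((0|1)(0|1))*", lambda s: s.startswith("0") and len(s) % 2 == 1),
    "starts with 1, even length": (
        r"1(0|1)((0|1)(0|1))*", lambda s: s.startswith("1") and len(s) % 2 == 0),
    "begins or ends with 00 or 11": (
        r"(00|11)(0|1)*|(0|1)*(00|11)",
        lambda s: s[:2] in ("00", "11") or s[-2:] in ("00", "11")),
    "ends with 1, no 00": (
        r"(1|01)+", lambda s: s.endswith("1") and "00" not in s),
}

def mismatches(pattern, predicate, max_len=8):
    """Return the binary strings on which pattern and predicate disagree."""
    bad = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if (re.fullmatch(pattern, s) is not None) != predicate(s):
                bad.append(s)
    return bad

results = {name: mismatches(p, pred) for name, (p, pred) in EXAMPLES.items()}
for name, bad in results.items():
    print(name, "->", "ok" if not bad else bad[:5])
```

Such a check cannot prove a pattern correct for all lengths, but it quickly exposes a pattern that misses short strings.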
Algebraic laws for regular expressions
Regular definition
• A regular definition gives names to certain regular expressions
and uses those names in other regular expressions.
• Regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
……
dn → rn
Example: unsigned numbers
digits → digit digit*
optional_fraction → . digits | 𝜖
optional_exponent → (E (+ | - | 𝜖) digits) | 𝜖
num → digits optional_fraction optional_exponent
Example
C identifiers are strings of letters, digits, and underscores. Here is
a regular definition for the language of C identifiers. We shall
conventionally use italics for the symbols defined in regular
definitions.
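A regular definition maps naturally onto named pattern strings in which each name is built only from earlier names. The sketch below does this for unsigned numbers and C identifiers; the names `letter_` and `c_id` and the helper functions are assumptions of this sketch, and white space inside numbers is not handled.

```python
import re

# Each name is defined only in terms of earlier names, as in a regular definition.
digit = r"[0-9]"
digits = rf"{digit}{digit}*"                     # digits -> digit digit*
optional_fraction = rf"(\.{digits})?"            # . digits | ε
optional_exponent = rf"(E[+-]?{digits})?"        # (E (+ | - | ε) digits) | ε
num = rf"{digits}{optional_fraction}{optional_exponent}"

letter_ = r"[A-Za-z_]"                           # assumed: a letter or underscore
c_id = rf"{letter_}({letter_}|{digit})*"         # id -> letter_ (letter_ | digit)*

def is_num(text):
    return re.fullmatch(num, text) is not None

def is_id(text):
    return re.fullmatch(c_id, text) is not None

for text in ["5280", "39.37", "1.894E-4", "1.", "grade1", "1grade"]:
    print(text, is_num(text), is_id(text))
```

Because the names are expanded textually, a definition must never refer to itself or to a later name, which is exactly the restriction that keeps a regular definition regular.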
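[Figure: legend of transition diagram symbols for a state, a transition, a start state, and a final state]
Transition Diagram : Relational operator
State 0 (start): on < go to state 1; on = go to state 5; on > go to state 6
State 1: on = go to state 2; on > go to state 3; on other go to state 4
State 2 (final): return (relop,LE)
State 3 (final): return (relop,NE)
State 4 (final): return (relop,LT)
State 5 (final): return (relop,EQ)
State 6: on = go to state 7; on other go to state 8
State 7 (final): return (relop,GE)
State 8 (final): return (relop,GT)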
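A transition diagram like this one can be hard coded directly, with one branch per state. The function name `get_relop` and its return convention (the token pair plus the position after the lexeme) are assumptions of this sketch; in states 4 and 8 the "other" character is looked at but not consumed.

```python
def get_relop(text, pos=0):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands for 'other' at end of input

    ch = peek(pos)                                # state 0
    if ch == "<":                                 # state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2       # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2       # state 3
        return ("relop", "LT"), pos + 1           # state 4: 'other' is not consumed
    if ch == "=":
        return ("relop", "EQ"), pos + 1           # state 5
    if ch == ">":                                 # state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2       # state 7
        return ("relop", "GT"), pos + 1           # state 8: 'other' is not consumed
    return None

for sample in ["<=", "<>", "<a", "=", ">=", ">1"]:
    print(sample, get_relop(sample))
```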
Recognition of Reserved Words and
Identifiers
The transition diagram used to search for identifier lexemes will also recognize the keywords
if, then, and else of our running example.
There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. When we find an identifier, a call to
installID places it in the symbol table if it is not already there and returns a pointer to the
symbol-table entry for the lexeme found. The function getToken examines the symbol table
entry for the lexeme found, and returns whatever token name the symbol table says this lexeme
represents: either id or one of the keyword tokens.
2. Create separate transition diagrams for each keyword. Note that such a transition diagram
consists of states representing the situation after each successive letter of the keyword is seen,
followed by a test for a "nonletter-or-digit", i.e., any character that cannot be the continuation
of an identifier.
Transition diagram : Unsigned number
[Figure: transition diagram for unsigned numbers with edges labeled digit and E]
Examples of unsigned numbers:
5280
39.37
1.894 E - 4
2.56 E + 7
45 E + 6
96 E 2
Hard coding and automatic generation lexical
analyzers
• Lexical analysis is about identifying the pattern from the input.
• To recognize the pattern, transition diagram is constructed.
• It is known as hard coding lexical analyzer.
• Example: to represent identifier in ‘C’, the first character must
be letter and other characters are either letter or digits.
• To recognize this pattern, hard coding lexical analyzer will
work with a transition diagram.
• The automatic generation lexical analyzer takes special
notation as input.
• For example, lex compiler tool will take regular expression as
input and finds out the pattern matching to that regular
expression.
[Figure: transition diagram for identifiers: Start → state 1, on Letter go to state 2, loop in state 2 on Letter or digit, then state 3]
Finite Automata
• Finite Automata are recognizers.
• FA simply say “Yes” or “No” about each possible input string.
• A finite automaton is a mathematical model consisting of:
1. Set of states
2. Set of input symbol
3. A transition function move
4. Initial state
5. Final states or accepting states
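The five components listed above are enough to simulate a deterministic automaton. The sketch below uses the five-state DFA for (a|b)*abb whose transition table is derived later in these slides; the names `MOVE`, `START`, `ACCEPTING`, and `accepts` are assumptions of this sketch.

```python
# Set of states: A..E; input symbols: a, b.
# Transition function "move" as a table: (state, symbol) -> next state.
MOVE = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
START = "A"            # initial state
ACCEPTING = {"E"}      # final (accepting) states

def accepts(text):
    """Answer yes (True) or no (False) for the input string."""
    state = START
    for symbol in text:
        if (state, symbol) not in MOVE:    # symbol outside the input alphabet
            return False
        state = MOVE[(state, symbol)]
    return state in ACCEPTING

for text in ["abb", "aababb", "ab", ""]:
    print(repr(text), accepts(text))
```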
Types of finite automata
• Types of finite automata are:
• Deterministic finite automata (DFA): have for each state exactly one edge leaving out for each symbol.
• Nondeterministic finite automata (NFA): There are no restrictions on the edges leaving a state. There can be several with the same symbol as label and some edges can be labeled with 𝜖.
[Figure: an example DFA and an example NFA, each with states 1 to 4 and edges labeled a and b]
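Regular expression to NFA using Thompson's rule
1. For 𝜖, construct the NFA
start → i --𝜖--> f
2. For a in Σ, construct the NFA
start → i --a--> f
3. For regular expression st, construct the NFA in which N(s) is followed by N(t)
start → i [N(s)] [N(t)] f
Ex: ab
start → 1 --a--> 2 --b--> 3
4. For regular expression s|t, construct the NFA
start → i --𝜖--> [N(s)] --𝜖--> f
i --𝜖--> [N(t)] --𝜖--> f
Ex: (a|b)
1 --𝜖--> 2 --a--> 3 --𝜖--> 6
1 --𝜖--> 4 --b--> 5 --𝜖--> 6
5. For regular expression s*, construct the NFA
start → i --𝜖--> [N(s)] --𝜖--> f
i --𝜖--> f, and an 𝜖 edge from the end of N(s) back to its start
Ex: a*
1 --𝜖--> 2 --a--> 3 --𝜖--> 4, with 3 --𝜖--> 2 and 1 --𝜖--> 4
Regular expression to NFA using Thompson's rule
• a*b
1 --𝜖--> 2 --a--> 3 --𝜖--> 4 --b--> 5, with 3 --𝜖--> 2 and 1 --𝜖--> 4
• b*ab
1 --𝜖--> 2 --b--> 3 --𝜖--> 4 --a--> 5 --b--> 6, with 3 --𝜖--> 2 and 1 --𝜖--> 4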
Exercise
Convert following regular expression to NFA:
1. abba
2. bb(a)*
3. (a|b)*
4. a* | b*
5. a(a)*ab
6. aa*+ bb*
7. (a+b)*abb
8. 10(0+1)*1
9. (a+b)*a(a+b)
10.(0+1)*010(0+1)*
11.(010+00)*(10)*
12. 100(1)*00(0+1)*
Conversion from NFA to DFA using subset construction method
Subset construction algorithm
Input: An NFA N.
Output: A DFA D accepting the same language.
Method: Algorithm constructs a transition table Dtran for D. We use
the following operations:
OPERATION | DESCRIPTION
𝜖-closure(s) | Set of NFA states reachable from NFA state s on 𝜖-transitions alone.
𝜖-closure(T) | Set of NFA states reachable from some NFA state s in T on 𝜖-transitions alone.
move(T, a) | Set of NFA states to which there is a transition on input symbol a from some NFA state s in T.
Subset construction algorithm
initially 𝜖-closure(s0) be the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U = 𝜖-closure(move(T, a));
        if U is not in Dstates then
            add U as unmarked state to Dstates;
        Dtran[T, a] = U
    end
end
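The algorithm above can be sketched in Python as follows. The NFA used here is the usual Thompson-style NFA for (a|b)*abb with states 0 to 10; the dictionary encoding, the use of `None` as the 𝜖 label, and the function names are assumptions of this sketch.

```python
EPS = None  # label used for 𝜖-transitions in this sketch

# NFA for (a|b)*abb: state -> {label: set of next states}
NFA = {
    0: {EPS: {1, 7}}, 1: {EPS: {2, 4}}, 2: {"a": {3}}, 3: {EPS: {6}},
    4: {"b": {5}}, 5: {EPS: {6}}, 6: {EPS: {1, 7}},
    7: {"a": {8}}, 8: {"b": {9}}, 9: {"b": {10}}, 10: {},
}

def eps_closure(states):
    """States reachable from the given states on 𝜖-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """States reachable on one transition labeled symbol."""
    return {t for s in states for t in NFA[s].get(symbol, ())}

def subset_construction(start, alphabet):
    start_set = eps_closure({start})
    dstates, dtran, unmarked = [start_set], {}, [start_set]
    while unmarked:
        T = unmarked.pop(0)                    # mark T
        for a in alphabet:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)              # add U as an unmarked state
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran

dstates, dtran = subset_construction(0, "ab")
names = {state: "ABCDE"[i] for i, state in enumerate(dstates)}
table = {(names[T], a): names[U] for (T, a), U in dtran.items()}
for state in dstates:
    print(names[state], sorted(state), table[(names[state], "a")], table[(names[state], "b")])
```

A DFA state is accepting when it contains an accepting NFA state, here state 10.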
Conversion from NFA to DFA
Regular expression: (a|b)*abb
[Figure: NFA for (a|b)*abb with states 0 to 10; states 1 to 6 form the part for (a|b)*, followed by 7 --a--> 8 --b--> 9 --b--> 10]
𝜖-Closure(0) = {0, 1, 7, 2, 4} = {0,1,2,4,7} ---- A
A = {0, 1, 2, 4, 7}
Move(A,a) = {3,8}
𝜖-Closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(A,b) = {5}
𝜖-Closure(Move(A,b)) = {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C
B = {1, 2, 3, 4, 6, 7, 8}
Move(B,a) = {3,8}
𝜖-Closure(Move(B,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(B,b) = {5,9}
𝜖-Closure(Move(B,b)) = {5, 6, 7, 1, 2, 4, 9} = {1,2,4,5,6,7,9} ---- D
C = {1, 2, 4, 5, 6, 7}
Move(C,a) = {3,8}
𝜖-Closure(Move(C,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(C,b) = {5}
𝜖-Closure(Move(C,b)) = {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C
D = {1, 2, 4, 5, 6, 7, 9}
Move(D,a) = {3,8}
𝜖-Closure(Move(D,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(D,b) = {5,10}
𝜖-Closure(Move(D,b)) = {5, 6, 7, 1, 2, 4, 10} = {1,2,4,5,6,7,10} ---- E
E = {1, 2, 4, 5, 6, 7, 10}
Move(E,a) = {3,8}
𝜖-Closure(Move(E,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(E,b) = {5}
𝜖-Closure(Move(E,b)) = {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C
Transition Table
States                  a   b
A = {0,1,2,4,7}         B   C
B = {1,2,3,4,6,7,8}     B   D
C = {1,2,4,5,6,7}       B   C
D = {1,2,4,5,6,7,9}     B   E
E = {1,2,4,5,6,7,10}    B   C
[Figure: DFA with states A to E drawn from the transition table]
Note:
• Accepting state in NFA is 10
• 10 is element of E
• So, E is acceptance state in DFA
Exercise
• Convert following regular expression to DFA using subset
construction method:
1. (a+b)*a(a+b)
2. (a+b)*ab*a
DFA optimization
1. Construct an initial partition Π of the set of states with two
groups: the accepting states F and the non-accepting states S − F.
2. Apply the repartition procedure to Π to construct a new
partition Πnew.
3. If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat step (2)
with Π = Πnew.
Repartition procedure:
for each group G of Π do begin
    partition G into subgroups such that two states s and t
    of G are in the same subgroup if and only if for all
    input symbols a, states s and t have transitions on a
    to states in the same group of Π;
    replace G in Πnew by the set of all subgroups formed
end
4. Choose one state in each group of the partition Πfinal as the
representative for that group. The representatives will be the
states of M'. Let s be a representative state, and suppose on
input a there is a transition of M from s to t. Let r be the
representative of t's group. Then M' has a transition from s to r on a.
Let the start state of M' be the representative of the group
containing start state s0 of M, and let the accepting states of M' be
the representatives that are in F.
5. If M' has a dead state d, then remove d from M'. Also remove any
state not reachable from the start state.
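DFA optimization
Transition Table
States   a   b
A        B   C
B        B   D
C        B   C
D        B   E
E        B   C
• Initial partition of {A, B, C, D, E}: nonaccepting states {A, B, C, D} and accepting states {E}
• On input b, D goes to E while A, B, C stay inside {A, B, C, D}: {A, B, C} {D} {E}
• On input b, B goes to D while A and C go to C: {A, C} {B} {D} {E}
• Now no more splitting is possible.
• If we chose A as the representative for group (AC), C is replaced by A in the transition table.
Optimized Transition Table
States   a   b
A        B   A
B        B   D
D        B   E
E        B   A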
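The partition refinement can be sketched as below for the five-state DFA of this example. The function name `minimize`, the dictionary encoding of the DFA, and the choice of the alphabetically smallest state as representative are assumptions of this sketch.

```python
DFA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}

def minimize(dfa, accepting, alphabet):
    """Refine the partition {accepting, non-accepting} until it stops changing."""
    partition = [g for g in (set(accepting), set(dfa) - set(accepting)) if g]
    while True:
        def group_of(state):
            return next(i for i, g in enumerate(partition) if state in g)

        new_partition = []
        for group in partition:
            subgroups = {}
            for s in sorted(group):
                # States stay together only if every symbol leads to the same group.
                key = tuple(group_of(dfa[s][a]) for a in alphabet)
                subgroups.setdefault(key, set()).add(s)
            new_partition.extend(subgroups.values())
        if len(new_partition) == len(partition):   # nothing was split
            return new_partition
        partition = new_partition

groups = minimize(DFA, ACCEPTING, "ab")
rep = {s: min(g) for g in groups for s in g}       # representative of each group
reduced = {rep[s]: {a: rep[t] for a, t in row.items()} for s, row in DFA.items()}
print(sorted(sorted(g) for g in groups))
for state in sorted(reduced):
    print(state, reduced[state]["a"], reduced[state]["b"])
```

Since refinement only ever splits groups, an unchanged number of groups means the partition itself is unchanged, which is the termination test used here.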
Conversion from regular expression to DFA
Rules to compute nullable, firstpos, lastpos
• nullable(n)
• The subtree at node n generates languages including the empty string.
• firstpos(n)
• The set of positions that can match the first symbol of a string generated by
the subtree at node n.
• lastpos(n)
• The set of positions that can match the last symbol of a string generated by
the subtree at node n.
• followpos(i)
• The set of positions that can follow position i in the tree.
Rules to compute nullable, firstpos, lastpos
A leaf labeled by 𝜖:
    nullable(n) = true
    firstpos(n) = ∅
    lastpos(n) = ∅
A leaf with position i:
    nullable(n) = false
    firstpos(n) = {i}
    lastpos(n) = {i}
An or node n with children c1 and c2 (n = c1 | c2):
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n) = lastpos(c1) ∪ lastpos(c2)
A concatenation node n with children c1 and c2 (n = c1 . c2):
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if (nullable(c1)) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n) = if (nullable(c2)) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
A star node n with child c1 (n = c1*):
    nullable(n) = true
    firstpos(n) = firstpos(c1)
    lastpos(n) = lastpos(c1)
Rules to compute followpos
1. If n is concatenation node with left child c1 and right child
c2 and i is a position in lastpos(c1), then all positions in
firstpos(c2) are in followpos(i)
2. If n is a star node and i is a position in lastpos(n), then all
positions in firstpos(n) are in followpos(i)
Conversion from regular expression to DFA
Step 3: Calculate lastpos
[Figure: syntax tree for (a|b)*abb# with leaf positions a = 1, b = 2, a = 3, b = 4, b = 5, # = 6; each node is annotated with firstpos on its left and lastpos on its right]
Node                     firstpos   lastpos
leaf a (position 1)      {1}        {1}
leaf b (position 2)      {2}        {2}
a | b                    {1,2}      {1,2}
(a|b)*                   {1,2}      {1,2}
leaf a (position 3)      {3}        {3}
(a|b)* . a               {1,2,3}    {3}
leaf b (position 4)      {4}        {4}
(a|b)*a . b              {1,2,3}    {4}
leaf b (position 5)      {5}        {5}
(a|b)*ab . b             {1,2,3}    {5}
leaf # (position 6)      {6}        {6}
(a|b)*abb . #            {1,2,3}    {6}
Step 4: Calculate followpos
• Concatenation node with lastpos(c1) = {5} and firstpos(c2) = {6}: followpos(5) = {6}
• Concatenation node with lastpos(c1) = {4} and firstpos(c2) = {5}: followpos(4) = {5}
• Concatenation node with lastpos(c1) = {3} and firstpos(c2) = {4}: followpos(3) = {4}
• Concatenation node with lastpos(c1) = {1,2} and firstpos(c2) = {3}: 3 is in followpos(1) and followpos(2)
• Star node n with firstpos(n) = {1,2} and lastpos(n) = {1,2}: 1 and 2 are in followpos(1) and followpos(2)
Position   followpos
5          6
4          5
3          4
2          1,2,3
1          1,2,3
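Conversion from regular expression to DFA
Initial state = firstpos of root = {1,2,3} ----- A
State A
δ((1,2,3),a) = followpos(1) U followpos(3) = (1,2,3) U (4) = {1,2,3,4} ----- B
δ((1,2,3),b) = followpos(2) = (1,2,3) ----- A
State B
δ((1,2,3,4),a) = followpos(1) U followpos(3) = (1,2,3) U (4) = {1,2,3,4} ----- B
δ((1,2,3,4),b) = followpos(2) U followpos(4) = (1,2,3) U (5) = {1,2,3,5} ----- C
State C
δ((1,2,3,5),a) = followpos(1) U followpos(3) = (1,2,3) U (4) = {1,2,3,4} ----- B
δ((1,2,3,5),b) = followpos(2) U followpos(5) = (1,2,3) U (6) = {1,2,3,6} ----- D
State D
δ((1,2,3,6),a) = followpos(1) U followpos(3) = (1,2,3) U (4) = {1,2,3,4} ----- B
δ((1,2,3,6),b) = followpos(2) = (1,2,3) ----- A
States          a   b
A={1,2,3}       B   A
B={1,2,3,4}     B   C
C={1,2,3,5}     B   D
D={1,2,3,6}     B   A
• D contains position 6 of the end marker #, so D is the accepting state.
[Figure: DFA with states A to D drawn from the transition table]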
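The direct method can be sketched end to end for the augmented expression (a|b)*abb#. The syntax tree is written by hand as nested tuples; the tuple encoding and the names `analyse`, `build_dfa`, `symbol_at`, and `followpos` are assumptions of this sketch, and no parser for regular expressions is included.

```python
def leaf(symbol, pos): return ("leaf", symbol, pos)
def cat(left, right): return ("cat", left, right)
def alt(left, right): return ("or", left, right)
def star(child): return ("star", child)

# (a|b)*abb# with positions 1..6
TREE = cat(cat(cat(cat(star(alt(leaf("a", 1), leaf("b", 2))),
                       leaf("a", 3)), leaf("b", 4)), leaf("b", 5)), leaf("#", 6))

symbol_at = {}     # position -> input symbol
followpos = {}     # position -> set of positions

def analyse(n):
    """Return (nullable, firstpos, lastpos) of node n and fill followpos."""
    kind = n[0]
    if kind == "leaf":
        _, symbol, pos = n
        symbol_at[pos] = symbol
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "star":
        _, first, last = analyse(n[1])
        for i in last:                         # rule 2: star node
            followpos[i] |= first
        return True, first, last
    n1, f1, l1 = analyse(n[1])
    n2, f2, l2 = analyse(n[2])
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    for i in l1:                               # rule 1: concatenation node
        followpos[i] |= f2
    return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)

def build_dfa(tree, alphabet):
    _, first, _ = analyse(tree)
    start = frozenset(first)                   # initial state = firstpos(root)
    dstates, dtran, unmarked = [start], {}, [start]
    while unmarked:
        S = unmarked.pop(0)
        for a in alphabet:
            U = frozenset().union(*(followpos[p] for p in S if symbol_at[p] == a))
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(S, a)] = U
    return dstates, dtran

dstates, dtran = build_dfa(TREE, "ab")
accepting = [S for S in dstates if 6 in S]     # states containing the position of #
for S in dstates:
    print(sorted(S), sorted(dtran[(S, "a")]), sorted(dtran[(S, "b")]))
```

For this expression the method yields four states, one fewer than the subset construction gave for the same language before optimization.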
Example 2
Convert the following RE
to DFA (Direct Method):
ba(a+b)*ab
Practice Problems
Convert the following regular expressions to
deterministic finite automata:
1. (a|b)*
2. (a*|b*)*
3. ((ε|a)|b*)*
4. (a|b)*abb(a|b)*
Solutions to practice problems
[Figure: DFAs for (a*|b*)* and (a|b)*]
Conversion from regular expression to
DFA
Construct DFA for following regular expression:
1. (c | d)*c
2. (a+b)*+(a.c)*