2-Introduction to Compilation and Lexical Analysis-19!07!2024

Uploaded by

Aashish Mahato


BCSE307L – COMPILER DESIGN
Objective
• To provide fundamental knowledge of various language
translators.
• To make students familiar with lexical analysis and
parsing techniques.
• To understand the various actions carried out in
semantic analysis.
• To make the students get familiar with how the
intermediate code is generated.
• To understand the principles of code optimization
techniques and code generation.
• To provide foundation for study of high-performance
compiler design.
Outcomes
• Apply the skills on devising, selecting, and using tools
and techniques towards compiler design
• Develop language specifications using context free
grammars (CFG).
• Apply the ideas, the techniques, and the knowledge
acquired for the purpose of developing software
systems.
• Constructing symbol tables and generating
intermediate code.
• Obtain insights on compiler optimization and code
generation
Syllabus
• Module: 1 Introduction to Compilation and Lexical Analysis 7 hours
• Introduction to LLVM - Structure and Phases of a Compiler-Design Issues-
Patterns Lexemes-Tokens-Attributes-Specification of Tokens-Extended
Regular Expression- Regular expression to Deterministic Finite Automata
(Direct method) - Lex - A Lexical Analyzer Generator
• Module: 2 Syntax Analysis 8 hours
• Role of Parser- Parse Tree - Elimination of Ambiguity – Top Down Parsing
– Recursive Descent Parsing - LL (1) Grammars – Shift Reduce Parsers-
Operator Precedence Parsing - LR Parsers, Construction of SLR Parser
Tables and Parsing- CLR Parsing- LALR Parsing
• Module: 3 Semantic Analysis 5 hours
• Syntax Directed Definition – Evaluation Order - Applications of Syntax
Directed Translation - Syntax Directed Translation Schemes -
Implementation of L-attributed Syntax Directed Definition
• Module: 4 Intermediate Code Generation 5 hours
• Variants of Syntax trees - Three Address Code- Types – Declarations -
Procedures - Assignment Statements - Translation of Expressions -
Control Flow - Back Patching- Switch Case Statements.
• Module: 5 Code Optimization 6 hours
• Loop optimizations- Principal Sources of Optimization -Introduction to
Data Flow Analysis - Basic Blocks - Optimization of Basic Blocks -
Peephole Optimization- The DAG Representation of Basic Blocks -Loops in
Flow Graphs - Machine Independent Optimization Implementation of a
naïve code generator for a virtual Machine- Security checking of virtual
machine code
• Module: 6 Code Generation 5 hours
• Issues in the design of a code generator- Target Machine- Next-Use
Information – Register Allocation and Assignment- Runtime Organization-
Activation Records.
• Module: 7 Parallelism 7 hours
• Parallelization- Automatic Parallelization- Optimizations for Cache
Locality and Vectorization- Domain Specific Languages-Compilation-
Instruction Scheduling and Software Pipelining- Impact of Language
Design and Architecture Evolution on Compilers Static Single Assignment
• Module: 8 Contemporary Issues 2 hours
Text Books & References
• Text Book
• A. V. Aho, Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman, Compilers:
Principles, techniques, & tools, 2007, Second Edition, Pearson Education,
Boston.

• Reference Books
• Watson, Des. A Practical Approach to Compiler Construction. Germany,
Springer International Publishing, 2017
Content - Module -1
• Introduction to Compilation And Lexical
Analysis
• Introduction to LLVM
• Structure and Phases of a Compiler
• Design Issues
• Patterns Lexemes
• Tokens-Attributes
• Specification of Tokens
• Extended Regular Expression
• Regular expression to Deterministic Finite Automata (Direct
method)
• Lex - A Lexical Analyzer Generator
Translator
• A translator is a program that takes one form of
program as input and converts it into another
form.
• Types of translators are:
1. Compiler
2. Interpreter
3. Assembler
[Figure: Source Program → Translator → Target Program, with Error Messages as a side output]
Compiler
• A compiler is a program that reads a program written
in source language and translates it into an equivalent
program in target language.

[Figure: Source Program → Compiler → Target Program (machine code), with Error Messages as a side output]
Source program used in the figure:
void main()
{
    int a=1,b=2,c;
    c=a+b;
    printf("%d",c);
}
Interpreter
• An interpreter is also a program that reads a program written in a source language, but it translates and executes the program line by line rather than translating it as a whole.
[Figure: Source Program → Interpreter → translated output produced line by line, with Error Messages as a side output; the source program is the same void main() example shown for the compiler]
Assembler
• Assembler is a translator which takes the assembly
code as an input and generates the machine code as an
output.
[Figure: Assembly Code → Assembler → Machine Code, with Error Messages as a side output]
Assembly code used in the figure:
MOV id3, R1
MUL #2.0, R1
MOV id2, R2
MUL R2, R1
MOV id1, R2
ADD R2, R1
MOV R1, id1
Analysis Synthesis model of compilation
• There are two parts of compilation.
1. Analysis Phase
2. Synthesis Phase
[Figure: Source Code → Analysis Phase → Intermediate Representation → Synthesis Phase → Target Code]
Analysis phase & Synthesis phase
Analysis Phase
• The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
• The analysis phase consists of three sub-phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
Synthesis Phase
• The synthesis part constructs the desired target program from the intermediate representation.
• The synthesis phase consists of the following sub-phases:
1. Code optimization
2. Code generation
Phases of compiler
[Figure: Compiler, split into two groups.
Analysis phase: Lexical analysis, Syntax analysis, Semantic analysis.
Synthesis phase: Intermediate code generation, Code optimization, Code generation.]
Lexical analysis
• Lexical analysis is also called linear analysis or scanning.
• The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
• The lexical analyzer represents each lexeme in the form of a token, so it divides the given source statement into tokens.
• Ex: Position = initial + rate * 60 would be grouped into the following tokens:
Position (identifier)
= (assignment symbol)
initial (identifier)
+ (plus symbol)
rate (identifier)
* (multiplication symbol)
60 (number)
• Output of lexical analysis: id1 = id2 + id3 * 60
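The grouping above can be sketched with a small regular-expression scanner. This is only an illustrative sketch in Python: the token class names and the helpers `tokenize` and `to_id_form` are assumptions made for this example, not part of the course material.

```python
import re

# Hypothetical token classes for this sketch (names are not from the slides).
TOKEN_SPEC = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("assign",     r"="),
    ("plus",       r"\+"),
    ("times",      r"\*"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (token, lexeme) pairs, dropping white space."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def to_id_form(tokens):
    """Rewrite identifiers as id1, id2, ... in order of first appearance."""
    table, out = {}, []
    for kind, lexeme in tokens:
        if kind == "identifier":
            table.setdefault(lexeme, f"id{len(table) + 1}")
            out.append(table[lexeme])
        else:
            out.append(lexeme)
    return " ".join(out)
```

Calling `to_id_form(tokenize("Position = initial + rate * 60"))` yields the id-form of the statement that the later phases work on.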
Syntax analysis
• Syntax analysis is also called parsing or hierarchical analysis.
• It takes the tokens produced by the lexical analyzer as input and generates the parse tree.
• The syntax analyzer checks that the token stream follows the grammar of the language and reports any syntax error it finds.
• If the code is error free, the syntax analyzer generates the tree.
Input from lexical analysis (for Position = initial + rate*60): id1 = id2 + id3 * 60
Tree produced by syntax analysis:
      =
     / \
  id1   +
       / \
    id2   *
         / \
      id3   60
Semantic analysis
• The semantic analyzer determines the meaning of a source string.
• It performs the following operations:
1. Matching of parentheses in the expression.
2. Matching of if..else statements.
3. Performing arithmetic operations that are type compatible.
4. Checking the scope of operation.
*Note: Consider id1, id2 and id3 are real.
Tree after semantic analysis (the integer 60 is converted to real):
      =
     / \
  id1   +
       / \
    id2   *
         / \
      id3   inttoreal
              |
              60
Intermediate code generator
• Two important properties of intermediate code:
1. It should be easy to produce.
2. It should be easy to translate into the target program.
• Intermediate form can be represented using "three address code".
• Three address code consists of a sequence of instructions, each of which has at most three operands.
Intermediate code for the tree produced by semantic analysis:
t1 = inttoreal(60)
t2 = id3 * t1
t3 = t2 + id2
id1 = t3
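As a rough illustration of how three address code can be produced from the tree, the sketch below walks a tuple-based tree bottom-up and creates one temporary per operation. The tree encoding and the names `gen_tac` and `translate_assignment` are assumptions made for this example. Operands are emitted in tree order, so the addition appears as `id2 + t2` rather than `t2 + id2`; the two are equivalent.

```python
from itertools import count

def gen_tac(node, code, temps):
    """Emit three address code for an expression tree.

    Leaves are strings, ("inttoreal", x) is a conversion and
    (op, left, right) is a binary operation. Returns the name that
    holds the value of the node.
    """
    if isinstance(node, str):
        return node
    if node[0] == "inttoreal":
        arg = gen_tac(node[1], code, temps)
        temp = f"t{next(temps)}"
        code.append(f"{temp} = inttoreal({arg})")
        return temp
    op, left, right = node
    left_name = gen_tac(left, code, temps)
    right_name = gen_tac(right, code, temps)
    temp = f"t{next(temps)}"
    code.append(f"{temp} = {left_name} {op} {right_name}")
    return temp

def translate_assignment(target, expr):
    """Return the list of three address instructions for target = expr."""
    code, temps = [], count(1)
    result = gen_tac(expr, code, temps)
    code.append(f"{target} = {result}")
    return code

# Tree for id1 = id2 + id3 * inttoreal(60)
tree = ("+", "id2", ("*", "id3", ("inttoreal", "60")))
for line in translate_assignment("id1", tree):
    print(line)
```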
Code optimization
• It improves the intermediate code.
• This is necessary to have faster execution of code or less consumption of memory.
Intermediate code (input):
t1 = inttoreal(60)
t2 = id3 * t1
t3 = t2 + id2
id1 = t3
After code optimization:
t1 = id3 * 60.0
id1 = id2 + t1
Code generation
• The intermediate code instructions are translated into a sequence of machine instructions.
Optimized code (input):
t1 = id3 * 60.0
id1 = id2 + t1
After code generation:
MOV id3, R2
MUL #60.0, R2
MOV id2, R1
ADD R2, R1
MOV R1, id1
Register use: id3 → R2, id2 → R1
Symbol table
• Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
• A symbol table stores information about the occurrences of various entities such as objects, classes, variable names, functions, etc.
• It is used by both analysis phase and synthesis phase.
• Symbol table is used for the following purposes
• It is used to store the name of all the entities in a structured form at
one place
• It is used to verify if a variable has been declared
• It is used to determine the scope of a name
• It is used to implement type checking by verifying assignments and
expression in the source code are semantically correct.
• A symbol table can be implemented as a linear structure (linked list) or as a hash table.
• It maintains an entry for each name in the form:
• <symbol name, type, attribute>
E.g.
• static int age;
• <age, int, static>
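A minimal sketch of a hash-table symbol table follows, assuming entries of the form <symbol name, type, attribute> as above. The class `SymbolTable` and its method names are hypothetical, and the stack of dictionaries is just one simple way to support the "declared?" and "scope of a name" checks listed earlier.

```python
class SymbolTable:
    """Hash-table symbol table with nested scopes (innermost scope last)."""

    def __init__(self):
        self.scopes = [{}]          # start with the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_, attribute=None):
        """Add <name, type, attribute> to the current scope."""
        scope = self.scopes[-1]
        if name in scope:
            raise KeyError(f"{name} already declared in this scope")
        scope[name] = (name, type_, attribute)

    def lookup(self, name):
        """Return the entry from the nearest enclosing scope, or None if undeclared."""
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

table = SymbolTable()
table.declare("age", "int", "static")   # static int age;
print(table.lookup("age"))
```

A lookup that returns None corresponds to the "variable has not been declared" check.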
Phases of compiler
[Figure: Source program → Lexical analysis → Syntax analysis → Semantic analysis → Intermediate code → Code optimization → Code generation → Target program. The phases are grouped into the analysis phase and the synthesis phase, and every phase interacts with the symbol table and with error detection and recovery.]
Symbol table shown in the figure:
Variable Name    Type     Address
Position         Float    0001
Initial          Float    0005
Rate             Float    0009
Exercise 1
• Write output of all the phases of compiler for following
statements:
1. x = b-c*2
2. I=p*n*r/100
Grouping of Phases
Front end & back end (Grouping of phases)
Front end
• Depends primarily on source language and largely independent of the target machine.
• It includes following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Creation of symbol table

Back end
• Depends on the target machine and does not depend on the source program.
• It includes the following phases:
1. Code optimization
2. Code generation phase
3. Error handling and symbol table operation
Difference between compiler & interpreter
Compiler | Interpreter
Scans the entire program and translates it as a whole into machine code. | Translates the program one statement at a time.
It generates intermediate code. | It does not generate intermediate code.
An error is displayed after the entire program is checked. | An error, if any, is displayed for every instruction interpreted.
Memory requirement is more. | Memory requirement is less.
Example: C compiler | Example: Basic, Python, Ruby
Context of Compiler (Cousins of compiler)
• In addition to the compiler, many other system programs are required to generate absolute machine code.
• These system programs are:
• Preprocessor
• Assembler
• Linker
• Loader
[Figure: Skeletal Source Program → Preprocessor → Source Program → Compiler → Target Assembly Program → Assembler → Relocatable Object Code → Linker / Loader (together with Libraries & Object Files) → Absolute Machine Code]
Context of compiler (Cousins of compiler)
Preprocessor
• Some of the tasks performed by the preprocessor:
1. Macro processing: Allows the user to define macros. Ex: #define PI 3.14159265358979323846
2. File inclusion: A preprocessor may include header files into the program. Ex: #include<stdio.h>
3. Rational preprocessor: It provides built-in macros for constructs like the while statement or the if statement.
4. Language extensions: Add capabilities to the language by using built-in macros.
• Ex: the language Equel is a database query language embedded in C. Statements beginning with ## are taken by the preprocessor to be database access statements unrelated to C and are translated into procedure calls on routines that perform the database access.
Context of compiler (Cousins of compiler)
Compiler
• A compiler is a program that reads a program written in a source language and translates it into an equivalent program in a target language.
Context of compiler (Cousins of compiler)
Assembler
• An assembler is a translator which takes the assembly program (mnemonics) as input and generates the machine code as output.
Context of compiler (Cousins of compiler)
Linker
• The linker makes a single program from several files of relocatable machine code.
• These files may have been the result of several different compilations, and one or more may be library files.
Loader
• The process of loading consists of:
• Taking relocatable machine code
• Altering the relocatable addresses
• Placing the altered instructions and data in memory at the proper location.
Pass structure
• One complete scan of a source program is called a pass.
• A pass includes reading an input file and writing to the output file.
• In a single pass compiler, analysis of a source statement is immediately followed by synthesis of the equivalent target statement.
• In a two pass compiler, intermediate code is generated between the analysis and synthesis phases.
• It is difficult to compile the source program in a single pass because of forward references.
Pass structure
Forward reference: A forward reference of a program entity is a reference to the entity which precedes its definition in the program.
• This problem can be solved by postponing the generation of target code until more information concerning the entity becomes available.
• It leads to the multi pass model of compilation.
Pass I:
• Perform analysis of the source program and note relevant information.
Pass II:
• Perform synthesis of the target program using the information noted in Pass I.
Types of compiler
1. One pass compiler
• It is a type of compiler that completes the whole compilation process in one pass.
2. Two pass compiler
• It is a type of compiler that completes the whole compilation process in two passes.
• It generates intermediate code.
3. Incremental compiler
• A compiler which recompiles only the changed lines of the source code and updates the object code.
4. Native code compiler
• A compiler used to compile source code for the same type of platform it runs on.
5. Cross compiler
• A compiler used to compile source code for a different kind of platform.
Token, Pattern & Lexemes
Interaction of scanner & parser
[Figure: Source Program → Lexical Analyzer → (Token) → Parser. The parser sends "Get next token" to the lexical analyzer, and both use the Symbol Table.]
• Upon receiving a "Get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.
• The lexical analyzer also strips out comments and white space, in the form of blanks, tabs, and newline characters, from the source program.
Why separate lexical analysis & parsing?
1. Simplicity in design.

2. Improves compiler efficiency.

3. Enhance compiler portability.

Token, Pattern & Lexemes
Token
A sequence of characters having a collective meaning is known as a token.
Categories of tokens:
1. Identifier
2. Keyword
3. Operator
4. Special symbol
5. Constant
Pattern
The set of rules associated with a token is called a pattern.
Example: "non-empty sequence of digits", "letter followed by letters and digits"
Lexemes
The sequence of characters in a source program matched with a pattern for a token is called a lexeme.
Example: Rate, DIET, count, Flag
Example: Token, Pattern & Lexemes
C code:
printf("Total = %d\n", score);
• printf and score are lexemes matching the pattern for token id,
• "Total = %d\n" is a lexeme matching literal
In many programming languages, the following classes cover most or all of the tokens:
Example: Token, Pattern & Lexemes
Example: total = sum + 45
Tokens:
total    Identifier1
=        Operator1
sum      Identifier2
+        Operator2
45       Constant1
Lexemes:
Lexemes of identifier: total, sum
Lexemes of operator: =, +
Lexemes of constant: 45
Attributes of Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
For example, the pattern for token number matches both 0 and 1.
Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token.
Normally, information about an identifier, e.g., its lexeme, its type, and the location at which it is first found, is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
The token names and associated attribute values for the Fortran statement
E = M * C ** 2
are written below as a sequence of pairs.
<id, pointer to symbol-table entry for E>
<assign op>
<id, pointer to symbol-table entry for M>
<mult op>
<id, pointer to symbol-table entry for C>
<exp op>
<number, integer value 2>
A second example, the token sequence for a small C function:
<float> <id, limitedSquare> <(> <id, x> <)> <{>
<float> <id, x>
<return> <(> <id, x> <op,"<="> <num, -10.0> <op, "||"> <id, x> <op, ">="> <num, 10.0>
<)> <op, "?"> <num, 100> <op, ":"> <id, x> <op, "*"> <id, x> <}>
Input buffering
• There are mainly two techniques for input buffering:
1. Buffer pairs
2. Sentinels
Buffer pairs
• The lexical analyzer scans the input string from left to right one character at a time.
• The buffer is divided into two N-character halves, where N is the number of characters on one disk block.
[Figure: a buffer pair holding the input E = M * C ** 2 eof, with the pointers lexeme_beginning and forward marked]
• Pointer lexeme_beginning marks the beginning of the current lexeme.
• Pointer forward scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to the character at its right end.
• lexeme_beginning is set to the character immediately after the lexeme just found.
• If the forward pointer is at the end of the first buffer half, then the second half is filled with N input characters.
• If the forward pointer is at the end of the second buffer half, then the first half is filled with N input characters.
Buffer pairs
[Figure: the same buffer pair, showing successive positions of forward and the position of lexeme_beginning]
Code to advance forward pointer
if forward at end of first half then begin
    reload second half;
    forward := forward + 1;
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half;
end
else forward := forward + 1;
Sentinels
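[Figure: the buffer pair holding E = M * C ** 2, with an eof sentinel at the end of each half and an eof marking the end of the input; pointers lexeme_beginning and forward are marked]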
• In buffer pairs we must check, each time we move the forward
pointer that we have not moved off one of the buffers.
• Thus, for each character read, we make two tests.
• We can combine the buffer-end test with the test for the
current character.
• We can reduce the two tests to one if we extend each buffer to
hold a sentinel character at the end.
• The sentinel is a special character that cannot be part of the
source program, and a natural choice is the character EOF.
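The sentinel idea can be sketched as follows. This is an illustrative Python model only: the class `TwoBuffer`, the tiny half size `N = 4` and the use of "\0" as the sentinel character are assumptions chosen to keep the example small, and the lexeme_beginning pointer is left out.

```python
import io

EOF = "\0"   # sentinel; assumed never to occur in the source text
N = 4        # tiny half size so that reloads actually happen

class TwoBuffer:
    """Buffer pair with a sentinel slot after each half.

    Layout: [0 .. N-1] first half, [N] sentinel,
            [N+1 .. 2N] second half, [2N+1] sentinel.
    """

    def __init__(self, stream):
        self.stream = stream
        self.buf = [EOF] * (2 * N + 2)
        self._load(0)
        self.forward = 0

    def _load(self, start):
        """Fill one half; unused positions hold EOF to mark the end of input."""
        chunk = self.stream.read(N)
        for i in range(N):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the next source character, or None at end of input."""
        while True:
            ch = self.buf[self.forward]
            if ch != EOF:                      # the only test on the common path
                self.forward += 1
                return ch
            if self.forward == N:              # sentinel closing the first half
                self._load(N + 1)
                self.forward = N + 1
            elif self.forward == 2 * N + 1:    # sentinel closing the second half
                self._load(0)
                self.forward = 0
            else:                              # EOF inside a half: real end of input
                return None

reader = TwoBuffer(io.StringIO("E = M * C ** 2"))
chars = []
while (c := reader.next_char()) is not None:
    chars.append(c)
print("".join(chars))
```

For an ordinary character only the comparison with the sentinel is made; the buffer-end tests run only when a sentinel is actually seen.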
Specification of tokens
Strings and languages
Term | Definition
Prefix of s | A string obtained by removing zero or more trailing symbols of string s. e.g., ban is a prefix of banana.
Suffix of s | A string obtained by removing zero or more leading symbols of string s. e.g., nana is a suffix of banana.
Substring of s | A string obtained by removing a prefix and a suffix from s. e.g., nan is a substring of banana.
Proper prefix, suffix and substring of s | The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
Subsequence of s | A string obtained by removing zero or more, not necessarily contiguous, symbols from s. e.g., baaa is a subsequence of banana.
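The definitions above translate directly into a few lines of Python. The function names are chosen for this sketch only, and the empty string "" plays the role of ε.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All substrings of s (remove a prefix and a suffix), without duplicates."""
    return sorted({s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)})

def proper(parts, s):
    """Keep only the parts that are neither the empty string nor s itself."""
    return [p for p in parts if p != "" and p != s]

def is_subsequence(t, s):
    """True if t can be obtained from s by deleting zero or more symbols."""
    remaining = iter(s)
    return all(ch in remaining for ch in t)

print(proper(prefixes("banana"), "banana"))
print(is_subsequence("baaa", "banana"))
```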
Exercise
• Write prefix, suffix, substring, proper prefix, proper suffix and
subsequence of following string:
String: Compiler
Operations on languages
Operation | Definition
Union of L and M, written L ∪ M | L ∪ M = { s : s is in L or s is in M }
Concatenation of L and M, written LM | LM = { st : s is in L and t is in M }
Kleene closure of L, written L* | L* = L⁰ ∪ L¹ ∪ L² ∪ … (zero or more concatenations of L)
Positive closure of L, written L+ | L+ = L¹ ∪ L² ∪ L³ ∪ … (one or more concatenations of L)
Example:
Let L be the set of letters {A,B,…,Z,a,b,…,z} and let D be the set of digits {0,1,…,9}.
1. L ∪ D is the set of letters and digits: a language with 62 strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L⁴ is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
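The counts in the example can be checked with ordinary Python sets. The helper names `concat`, `power` and `closure_up_to` are assumptions for this sketch; because L* and D+ are infinite, `closure_up_to` only builds a finite slice of the closure.

```python
import string

L = set(string.ascii_letters)   # 52 letters
D = set(string.digits)          # 10 digits

def concat(X, Y):
    """Concatenation XY: every string of X followed by every string of Y."""
    return {x + y for x in X for y in Y}

def power(X, n):
    """X^n: n-fold concatenation of X with itself; X^0 is {""} (i.e. {ε})."""
    result = {""}
    for _ in range(n):
        result = concat(result, X)
    return result

def closure_up_to(X, n, positive=False):
    """Finite slice of X* (or X+ if positive): strings built from at most n pieces."""
    start = 1 if positive else 0
    out = set()
    for i in range(start, n + 1):
        out |= power(X, i)
    return out

print(len(L | D))          # strings in L ∪ D
print(len(concat(L, D)))   # strings in LD
```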
Regular Expression & Regular Definition
Regular expression for representing the C language identifiers:
letter_ (letter_ | digit)*
| - union
* - 'zero or more occurrences of'
Regular expression
• A regular expression is a sequence of characters that define
a pattern.
Notational shorthands
1. One or more instances: +
2. Zero or more instances: *
3. Zero or one instances: ?
4. Alphabets: Σ
Rules to define regular expression
1. ε is a regular expression that denotes {ε}, the set containing the empty string.
2. If a is a symbol in Σ then a is a regular expression that denotes {a}.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a. (r)|(s) is a regular expression denoting L(r) ∪ L(s)
b. (r)(s) is a regular expression denoting L(r)L(s)
c. (r)* is a regular expression denoting (L(r))*
d. (r) is a regular expression denoting L(r)
The language denoted by a regular expression is said to be a regular set.
Regular expression a*
• L = zero or more occurrences of a = a*
Strings: ε, a, aa, aaa, aaaa, aaaaa, ….. (infinite)
Regular expression a+
• L = one or more occurrences of a = a+
Strings: a, aa, aaa, aaaa, aaaaa, ….. (infinite)
Precedence and associativity of operators
Operator         Precedence    Associativity
Kleene *         1             left
Concatenation    2             left
Union |          3             left
Under these conventions, for example, we may replace the regular expression (a)|((b)*(c)) by a|b*c. Both expressions denote the set of strings that are either a single a or zero or more b's followed by one c.
Regular expression examples
1. 0 or 1
Strings: 0, 1    R.E. = 0|1
2. 0 or 11 or 111
Strings: 0, 11, 111    R.E. = 0|11|111
3. String having zero or more a.
Strings: ε, a, aa, aaa, aaaa, …..    R.E. = a*
4. String having one or more a.
Strings: a, aa, aaa, aaaa, …..    R.E. = a+
5. Regular expression over {a, b, c} that represents all strings of length 3.
Strings: abc, bca, bbb, cab, aba, ….    R.E. = (a|b|c)(a|b|c)(a|b|c)
6. All binary strings
Strings: 0, 11, 101, 10101, 1111, …    R.E. = (0|1)+
Regular expression examples
7. 0 or more occurrences of either a or b or both
Strings: ε, a, aa, abab, bab, …    R.E. = (a|b)*
8. 1 or more occurrences of either a or b or both
Strings: a, aa, abab, bab, bbbaaa, …    R.E. = (a|b)+
9. Binary no. ends with 0
Strings: 0, 10, 100, 1010, 11110, …    R.E. = (0|1)*0
10. Binary no. ends with 1
Strings: 1, 101, 1001, 10101, …    R.E. = (0|1)*1
11. Binary no. starts and ends with 1
Strings: 11, 101, 1001, 10101, …    R.E. = 1(0|1)*1
12. String starts and ends with same character
Strings: 00, 101, aba, baab, …
Regular expression examples
13. All strings of a and b starting with a
R.E. = a(a|b)*
14. String of 0 and 1 ends with 00
R.E. = (0|1)*00
15. String ends with abb
R.E. = (a|b)*abb
16. String starts with 1 and ends with 0
R.E. = 1(0|1)*0
17. All binary strings with at least 3 characters where the 3rd character is zero
R.E. = (0|1)(0|1)0(0|1)*
18. Language which consists of exactly two b's over the set {a, b}
R.E. = a*ba*ba*
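One way to gain confidence in an answer is to compare the regular expression with a plain-language predicate on every short string. The sketch below does this with Python's `re` module, whose `|` and `*` operators have the same meaning as in the examples; the helpers `strings_over` and `agrees` and the length bound of 7 are assumptions of this sketch, and agreement up to a bound is a sanity check, not a proof.

```python
import re
from itertools import product

def strings_over(alphabet, max_len):
    """Yield every string over alphabet with length 0 .. max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees(regex, alphabet, predicate, max_len=7):
    """True if regex and predicate accept exactly the same strings up to max_len."""
    pattern = re.compile(regex)
    return all(
        (pattern.fullmatch(w) is not None) == predicate(w)
        for w in strings_over(alphabet, max_len)
    )

# Exactly two b's over {a, b}
print(agrees(r"a*ba*ba*", "ab", lambda w: w.count("b") == 2))
# Strings ending with abb
print(agrees(r"(a|b)*abb", "ab", lambda w: w.endswith("abb")))
# At least 3 characters and the 3rd character is zero
print(agrees(r"(0|1)(0|1)0(0|1)*", "01", lambda w: len(w) >= 3 and w[2] == "0"))
```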
Regular expression examples
24. The language over {a, b, c} where the number of a's should be a multiple of 3
Strings: aaa, baaa, bacaba, aaaaaa, ..    R.E. = ((b|c)*a(b|c)*a(b|c)*a(b|c)*)*
25. Even no. of 0
R.E. = (1*01*01*)*
26. String should have odd length
R.E. = (0|1)((0|1)(0|1))*
27. String should have even length
R.E. = ((0|1)(0|1))*
28. String starts with 0 and has odd length
R.E. = 0((0|1)(0|1))*
30. String starts with 1 and has even length
R.E. = 1(0|1)((0|1)(0|1))*
31. All strings that begin or end with 00 or 11
Strings: 00101, 10100, 110, 01011, …    R.E. = (00|11)(0|1)* | (0|1)*(00|11)
Regular expression examples
31. Language of all strings containing both 11 and 00 as substring
Strings: 0011, 1100, 100110, 010011, …
32. String ending with 1 and not containing 00
Strings: 011, 1101, 1011, ….    R.E. = (1|01)+
33. Language of C identifiers
Strings: area, i, redious, grade1, ….    R.E. = (_ + L)(_ + L + D)*
where L is a letter and D is a digit
Algebraic laws for regular expressions
Regular definition
• A regular definition gives names to certain regular expressions and uses those names in other regular expressions.
• A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
……
dn → rn
where each di is a distinct name & each ri is a regular expression.
• Example: Regular definition for identifier
letter → A|B|C|………..|Z|a|b|………..|z
digit → 0|1|…….|9
id → letter (letter | digit)*
Regular definition example
• Example: Unsigned Pascal numbers
3
5280
39.37
6.336E4
1.894E-4
2.56E+7
Regular Definition
digit → 0|1|…..|9
digits → digit digit*
optional_fraction → .digits | 𝜖
optional_exponent → (E(+|-|𝜖)digits) | 𝜖
num → digits optional_fraction optional_exponent
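The regular definition can be transcribed name by name into Python's `re` syntax. This is a sketch; the constant names mirror the definition, while `is_num` is a helper introduced only for this example. An optional part `x | 𝜖` is written as `(x)?`.

```python
import re

DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}{DIGIT}*"                  # digits -> digit digit*
OPTIONAL_FRACTION = rf"(\.{DIGITS})?"         # .digits | ε
OPTIONAL_EXPONENT = rf"(E[+-]?{DIGITS})?"     # (E(+|-|ε)digits) | ε
NUM = re.compile(rf"{DIGITS}{OPTIONAL_FRACTION}{OPTIONAL_EXPONENT}")

def is_num(text):
    """True if the whole text is an unsigned number under the definition above."""
    return NUM.fullmatch(text) is not None

for lexeme in ["3", "5280", "39.37", "6.336E4", "1.894E-4", "2.56E+7", "1.", ".5"]:
    print(lexeme, is_num(lexeme))
```

Under this definition a fraction needs digits on both sides of the point, so `1.` and `.5` are rejected.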
Example
C identifiers are strings of letters, digits, and underscores. Here is a regular definition for the language of C identifiers. We shall conventionally use italics for the symbols defined in regular definitions.
letter_ -> A|B|….|Z|a|b|…|z|_
digit -> 0|1|…|9
id -> letter_ (letter_ | digit)*
Transition Diagram
• A stylized flowchart is called a transition diagram.
• A circle is a state.
• An arrow between circles is a transition.
• An arrow labeled "start" entering a circle marks the start state.
• A double circle is a final state.
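Transition Diagram : Relational operator
State 0 (start):
  on <  go to state 1
  on =  go to state 5, return (relop,EQ)
  on >  go to state 6
State 1:
  on =      go to state 2, return (relop,LE)
  on >      go to state 3, return (relop,NE)
  on other  go to state 4, return (relop,LT)
State 6:
  on =      go to state 7, return (relop,GE)
  on other  go to state 8, return (relop,GT)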
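The transition diagram can be hand-coded as nested tests, one per state. The function below is an illustrative sketch: the name `relop`, the `(token, attribute), next_position` return shape and the use of None for "no relational operator here" are assumptions. On the "other" edges the extra character is not consumed, so the next position moves by one only.

```python
def relop(text, pos=0):
    """Recognize a relational operator starting at text[pos].

    Returns ((token, attribute), next_pos), or None if no operator starts here.
    """
    def peek(i):
        return text[i] if i < len(text) else ""

    ch = peek(pos)
    if ch == "<":                                   # state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2         # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2         # state 3
        return ("relop", "LT"), pos + 1             # state 4: "other" is not consumed
    if ch == "=":
        return ("relop", "EQ"), pos + 1             # state 5
    if ch == ">":                                   # state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2         # state 7
        return ("relop", "GT"), pos + 1             # state 8: "other" is not consumed
    return None

print(relop("a <= b", 2))
print(relop("a < b", 2))
```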
Recognition of Reserved Words and Identifiers
Keywords such as if, then, and else look like identifiers, so the transition diagram that searches for identifier lexemes will also recognize the keywords if, then, and else of our running example.

There are two ways that we can handle reserved words that look like identifiers:

1. Install the reserved words in the symbol table initially. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. The function getToken examines the symbol-table entry for the lexeme found and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens.

2. Create separate transition diagrams for each keyword. Note that such a transition diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a "nonletter-or-digit", i.e., any character that cannot be the continuation of an identifier.
Transition diagram : Unsigned number
[Figure: transition diagram for unsigned numbers with states 1 to 10; edges are labeled start, digit, ".", E, "+ or -" and other]
Example lexemes accepted by the diagram:
3
5280
39.37
1.894 E - 4
2.56 E + 7
45 E + 6
96 E 2
Hard coding and automatic generation lexical analyzers
• Lexical analysis is about identifying patterns in the input.
• To recognize a pattern, a transition diagram is constructed.
• This approach is known as a hard coded lexical analyzer.
• Example: to represent an identifier in 'C', the first character must be a letter and the other characters are either letters or digits.
• To recognize this pattern, a hard coded lexical analyzer will work with a transition diagram.
• An automatically generated lexical analyzer takes a special notation as input.
• For example, the lex compiler tool will take a regular expression as input and find the pattern matching that regular expression.
[Figure: transition diagram for identifiers with states 1, 2 and 3; edges are labeled Start, Letter, and Letter or digit]
Finite Automata
• Finite Automata are recognizers.
• FA simply say “Yes” or “No” about each possible input string.
• A finite automaton is a mathematical model consisting of:
1. Set of states
2. Set of input symbol
3. A transition function move
4. Initial state
5. Final states or accepting states
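The five components map naturally onto a small data structure. The sketch below is an assumption-laden example: the dictionary layout, the function `accepts` and the particular four-state DFA (which accepts strings over {a, b} ending in abb) are chosen for illustration and are not taken from a specific slide.

```python
# A DFA given by its five components (hypothetical example machine).
DFA = {
    "states": {1, 2, 3, 4},
    "alphabet": {"a", "b"},
    "move": {                       # transition function: (state, symbol) -> state
        (1, "a"): 2, (1, "b"): 1,
        (2, "a"): 2, (2, "b"): 3,
        (3, "a"): 2, (3, "b"): 4,
        (4, "a"): 2, (4, "b"): 1,
    },
    "start": 1,
    "accepting": {4},
}

def accepts(dfa, text):
    """Answer "yes" (True) or "no" (False) for one input string."""
    state = dfa["start"]
    for symbol in text:
        if symbol not in dfa["alphabet"]:
            return False            # symbol outside the input alphabet
        state = dfa["move"][state, symbol]
    return state in dfa["accepting"]

for word in ["abb", "aabb", "ababb", "ab", "abba", ""]:
    print(repr(word), accepts(DFA, word))
```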
Types of finite automata
• Types of finite automata are:
• Deterministic finite automata (DFA): have, for each state, exactly one edge leaving for each symbol.
[Figure: example DFA with states 1 to 4 and edges labeled a and b]
• Nondeterministic finite automata (NFA): there are no restrictions on the edges leaving a state. There can be several edges with the same symbol as label, and some edges can be labeled with 𝜖.
[Figure: example NFA with states 1 to 4 and edges labeled a and b]
Regular expression to NFA using Thompson's rule
1. For 𝜖, construct the NFA
   start → i --𝜖--> f
2. For a in Σ, construct the NFA
   start → i --a--> f
3. For regular expression st, join N(s) and N(t) in sequence
   start → i [N(s)] [N(t)] f
   Ex: ab
   start → 1 --a--> 2 --b--> 3
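Regular expression to NFA using Thompson's rule
4. For regular expression s|t
   A new start state i has 𝜖-edges to the start states of N(s) and N(t), and the final states of N(s) and N(t) have 𝜖-edges to a new final state f.
   Ex: (a|b)
   1 --𝜖--> 2 --a--> 3 --𝜖--> 6
   1 --𝜖--> 4 --b--> 5 --𝜖--> 6
5. For regular expression s*
   A new start state i has an 𝜖-edge into N(s) and an 𝜖-edge directly to a new final state f; the end of N(s) has an 𝜖-edge back to its own start and an 𝜖-edge to f.
   Ex: a*
   1 --𝜖--> 2 --a--> 3 --𝜖--> 4
   3 --𝜖--> 2
   1 --𝜖--> 4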
Regular expression to NFA using Thompson's rule
• a*b
   1 --𝜖--> 2 --a--> 3 --𝜖--> 4 --b--> 5
   3 --𝜖--> 2
   1 --𝜖--> 4
• b*ab
   1 --𝜖--> 2 --b--> 3 --𝜖--> 4 --a--> 5 --b--> 6
   3 --𝜖--> 2
   1 --𝜖--> 4
Exercise
Convert following regular expression to NFA:
1. abba
2. bb(a)*
3. (a|b)*
4. a* | b*
5. a(a)*ab
6. aa*+ bb*
7. (a+b)*abb
8. 10(0+1)*1
9. (a+b)*a(a+b)
10.(0+1)*010(0+1)*
11.(010+00)*(10)*
12. 100(1)*00(0+1)*
Conversion from NFA to DFA using subset construction method
Subset construction algorithm
Input: An NFA N.
Output: A DFA D accepting the same language.
Method: The algorithm constructs a transition table Dtran for D. We use the following operations:
OPERATION | DESCRIPTION
𝜖-closure(s) | Set of NFA states reachable from NFA state s on 𝜖-transitions alone.
𝜖-closure(T) | Set of NFA states reachable from some NFA state s in T on 𝜖-transitions alone.
move(T, a) | Set of NFA states to which there is a transition on input symbol a from some NFA state s in T.
Subset construction algorithm
initially 𝜖-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U = 𝜖-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        Dtran[T, a] = U
    end
end
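The algorithm can be sketched in Python as below. The NFA used here is an assumption: it is the usual Thompson-style NFA for (a|b)*abb with states 0 to 10, written as a dictionary from (state, label) to a set of states, with the empty string standing for 𝜖. The names `eps_closure`, `move` and `subset_construction` are chosen for this sketch.

```python
EPS = ""   # label used for 𝜖-edges in this sketch

# Assumed NFA for (a|b)*abb, states 0..10, start state 0, final state 10.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},   (9, "b"): {10},
}

def eps_closure(states):
    """NFA states reachable from the given states on 𝜖-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """NFA states reachable on one transition labeled symbol from the given states."""
    return {t for s in states for t in NFA.get((s, symbol), ())}

def subset_construction(start, alphabet):
    """Return (Dstates, Dtran) for the DFA built from the NFA."""
    first = eps_closure({start})
    dstates, unmarked, dtran = [first], [first], {}
    while unmarked:
        T = unmarked.pop(0)                 # mark T
        for a in alphabet:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)           # add U as an unmarked state
                unmarked.append(U)
            dtran[T, a] = U
    return dstates, dtran

dstates, dtran = subset_construction(0, "ab")
for T in dstates:
    print(sorted(T), {a: sorted(dtran[T, a]) for a in "ab"})
```

A DFA state is accepting when it contains the NFA final state, which is state 10 in this assumed NFA.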
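Conversion from NFA to DFA
Regular expression: (a|b)*abb
[Figure: NFA for (a|b)*abb with states 0 to 10]
𝜖-edges: 0→1, 0→7, 1→2, 1→4, 3→6, 5→6, 6→1, 6→7
a-edges: 2→3, 7→8
b-edges: 4→5, 8→9, 9→10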
Conversion from NFA to DFA
𝜖-Closure(0) = {0, 1, 7, 2, 4}
             = {0,1,2,4,7} ---- A
Conversion from NFA to DFA
A = {0, 1, 2, 4, 7}
Move(A,a) = {3,8}
𝜖-Closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8}
                     = {1,2,3,4,6,7,8} ---- B
States | a | b
A = {0,1,2,4,7} | B |
B = {1,2,3,4,6,7,8} | |
Conversion from NFA to DFA
A = {0, 1, 2, 4, 7}
Move(A,b) = {5}
𝜖-Closure(Move(A,b)) = {5, 6, 7, 1, 2, 4}
                     = {1,2,4,5,6,7} ---- C
States | a | b
A = {0,1,2,4,7} | B | C
B = {1,2,3,4,6,7,8} | |
C = {1,2,4,5,6,7} | |
Conversion from NFA to DFA
B = {1, 2, 3, 4, 6, 7, 8}
Move(B,a) = {3,8}
𝜖-Closure(Move(B,a)) = {3, 6, 7, 1, 2, 4, 8}
                     = {1,2,3,4,6,7,8} ---- B
States | a | b
A = {0,1,2,4,7} | B | C
B = {1,2,3,4,6,7,8} | B |
C = {1,2,4,5,6,7} | |
Conversion from NFA to DFA
𝜖

a States a b
𝜖 𝜖
2 3