Unit 1: Compiler Design

The document discusses compiler design and concepts related to compilers. It defines a compiler as a translator that converts a high-level language into machine language. It describes the main phases of compilation as lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. It also discusses single-pass and multi-pass compilers, bootstrapping, finite state machines, regular expressions, and optimization of deterministic finite automata.


Unit 1

Compiler Design
Introduction to Compiler
• A compiler is a translator that converts a high-level
language into machine language.
• The high-level language is written by the developer; the
machine language is what the processor can understand.
• A compiler is also used to report errors to the programmer.
• The main purpose of a compiler is to translate code
written in one language into another without changing the
meaning of the program.
• When you execute a program written in a HLL
programming language, the execution happens in two parts.
• In the first part, the source program is compiled and
translated into an object program (low-level language).
• In the second part, the object program is translated into
the target program through the assembler.
Fig 1 : Execution process of source program in Compiler
Compiler Phases

• The compilation process consists of a sequence of
phases. Each phase takes the source program in
one representation and produces output in another
representation. Each phase takes its input from the
previous stage.
• The various phases of a compiler are:
Fig 2 : phases of compiler
• Lexical Analysis:
The lexical analyzer phase is the first phase of the compilation
process. It takes source code as input, reads the
source program one character at a time, and groups the
characters into meaningful lexemes. The lexical analyzer
represents these lexemes in the form of tokens.
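As a sketch of this phase, here is a minimal regular-expression-based tokenizer in Python (the token names and patterns are illustrative assumptions, not taken from any particular compiler):

```python
import re

# Illustrative token patterns (names and patterns are assumptions)
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]

def tokenize(source):
    """Read the source one match at a time and emit (token, lexeme) pairs."""
    pattern = "|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, source):
        if m.lastgroup != "SKIP":          # whitespace is discarded
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("sum = a + 42"))
```

Each character of `sum = a + 42` is consumed exactly once, and the lexemes come out paired with their token names.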
• Syntax Analysis
Syntax analysis is the second phase of the compilation
process. It takes tokens as input and generates a parse
tree as output. In the syntax analysis phase, the parser
checks whether the expression formed by the tokens is
syntactically correct.
• Semantic Analysis
Semantic analysis is the third phase of the compilation process. It
checks whether the parse tree follows the rules of the language.
The semantic analyzer keeps track of identifiers, their types, and
expressions. The output of the semantic analysis phase is the
annotated syntax tree.
• Intermediate Code Generation
In intermediate code generation, the compiler translates the
source code into an intermediate code. The intermediate code
lies between the high-level language and the machine
language. It should be generated in such a way that it can
easily be translated into the target machine code.
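Three-address code is a common intermediate form. Below is a minimal sketch that flattens a nested expression into three-address instructions (the tuple representation of expressions and the temporary names t1, t2, ... are assumptions for illustration):

```python
def to_tac(expr, temps=None, code=None):
    """Flatten a nested (op, left, right) tuple into three-address code."""
    if temps is None:
        temps, code = [0], []
    if isinstance(expr, str):          # a leaf: variable or constant
        return expr, code
    op, left, right = expr
    l, _ = to_tac(left, temps, code)
    r, _ = to_tac(right, temps, code)
    temps[0] += 1
    t = f"t{temps[0]}"                 # fresh temporary for this result
    code.append(f"{t} = {l} {op} {r}")
    return t, code

# a + b * c  →  t1 = b * c ; t2 = a + t1
result, code = to_tac(("+", "a", ("*", "b", "c")))
for line in code:
    print(line)
```

For `a + b * c` this emits `t1 = b * c` followed by `t2 = a + t1`, mirroring operator precedence.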
• Code Optimization
Code optimization is an optional phase. It is used to improve
the intermediate code so that the output of the program
runs faster and takes less space. It removes
unnecessary lines of code and rearranges the sequence of
statements to speed up program execution.
• Code Generation
Code generation is the final stage of the compilation process.
It takes the optimized intermediate code as input and maps
it to the target machine language. The code generator translates
the intermediate code into the machine code of the specified
computer.
Compiler Passes

A pass refers to the number of times the compiler goes
through the source code. There are single-pass compilers and
multi-pass compilers. A single-pass compiler goes through the
program only once; in other words, it allows the source code
to pass through each compilation unit only once, and it
immediately translates each code section into its final
machine code.
A multi-pass compiler goes through the source code several
times; in other words, it allows the source code to pass
through each compilation unit several times. Each pass takes
the result of the previous pass as input and creates
intermediate outputs, so the code improves with each
pass. The final code is generated after the final pass. Multi-
pass compilers perform additional tasks such as intermediate
code generation, machine-dependent code optimization, and
machine-independent code optimization.
Difference Between Phases and Passes of Compiler
Definition
Phases refer to the units or steps in the compilation process.
Passes, in contrast, refer to the total number of times the
compiler goes through the source code before converting it
into the target machine code. Thus, this is the main difference
between the phases and passes of a compiler.
There are six main phases in the compilation process, while
there are two types of compilers: single-pass and multi-pass.
Hence, this is another difference between the phases and
passes of a compiler.
Conclusion
A compiler is special software that converts a high-level
language into machine language. The main difference between
phases and passes of a compiler is that phases are the steps
in the compilation process, while passes are the number of
times the compiler traverses the source code.
Bootstrapping

• Bootstrapping is widely used in compiler development.
• Bootstrapping is used to produce a self-hosting
compiler, i.e. a compiler that can compile its own
source code.
• A bootstrap compiler is used to compile the compiler, and
then this compiled compiler can be used to compile
everything else, as well as future versions of itself.
A compiler can be characterized by three languages:
1. Source Language
2. Target Language
3. Implementation Language
The T-diagram shows a compiler SCIT for source language S
and target language T, implemented in language I.
Follow these steps to produce a compiler for a new language L
on machine A:
1. Create a compiler SCAA for a subset S of the desired
language L, written in language A; this compiler
runs on machine A.
2. Create a compiler LCSA for the full language L, written in
the subset S of L.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a
compiler for language L, which runs on machine A and produces
code for machine A.

The process described by the T-diagrams is called bootstrapping.


Finite state machine

• A finite state machine is used to recognize patterns.
• A finite automaton takes a string of symbols as
input and changes its state accordingly. When a
desired symbol is found in the input, the transition
occurs.
• During a transition, the automaton can either move to the
next state or stay in the same state.
• An FA has two outcomes: accept or reject. When
the input string has been successfully processed and the
automaton has reached a final state, the string is accepted;
otherwise it is rejected.
A finite automaton consists of the following:

Q: finite set of states
∑: finite set of input symbols
q0: initial state
F: set of final states
δ: transition function

The transition function can be defined as
δ: Q x ∑ → Q
FA is classified into two types:

1. DFA (deterministic finite automata)
2. NDFA (non-deterministic finite automata)
DFA

DFA stands for Deterministic Finite Automata.
Deterministic refers to the uniqueness of the
computation: in a DFA, each input character takes the
machine to exactly one state. A DFA does not allow the
null move, which means it cannot change state without
consuming an input character.
A DFA has five tuples {Q, ∑, q0, F, δ}:
Q: set of all states
∑: finite set of input symbols, where δ: Q x ∑ → Q
q0: initial state
F: set of final states
δ: transition function
Example
See an example of deterministic finite automata:
Q = {q0,q1,q2}
∑ = {0, 1}
q0 = {q0}
F = {q2}
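The transition function δ for this example appears in a figure that is not reproduced here, so the table below is a hypothetical one: a DFA over {0, 1} with these three states that accepts strings containing the substring "01". Simulating a DFA is a direct loop over the five tuples:

```python
# Hypothetical transition table (the original figure is not reproduced):
# this DFA accepts strings that contain the substring "01".
delta = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q2", ("q2", "1"): "q2",
}

def dfa_accepts(string, start="q0", final={"q2"}):
    """Run the DFA: exactly one next state per (state, symbol) pair."""
    state = start
    for symbol in string:
        state = delta[(state, symbol)]
    return state in final

print(dfa_accepts("1101"))   # contains "01" → True
print(dfa_accepts("110"))    # no "01"  → False
```

Note that the loop never branches: determinism means `delta` has exactly one entry per (state, symbol) pair.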
NDFA
• NDFA refers to Non-Deterministic Finite Automata. For a particular
input, it can transition to any number of states. An NDFA accepts the
NULL move, which means it can change state without reading a symbol.

• An NDFA also has five tuples, the same as a DFA, but it has a
different transition function.

• The transition function of an NDFA can be defined as:

δ: Q x ∑ → 2^Q
Example :
See an example of non deterministic finite automata:
Q = {q0, q1, q2}  
∑ = {0, 1}  
q0 = {q0}  
F = {q2}  
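Again the transitions were given in a figure; assuming, for illustration, an NFA with these states that accepts strings ending in "01", the simulation tracks a set of possible states, reflecting δ: Q x ∑ → 2^Q:

```python
# Hypothetical NFA transitions (the original figure is not reproduced):
# this NFA accepts strings that end in "01". Note each (state, symbol)
# pair maps to a SET of states — δ: Q x ∑ → 2^Q.
delta = {
    ("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}

def nfa_accepts(string, start="q0", final={"q2"}):
    """Track the set of all states the NFA could be in simultaneously."""
    current = {start}
    for symbol in string:
        nxt = set()
        for state in current:
            nxt |= delta.get((state, symbol), set())
        current = nxt
    return bool(current & final)

print(nfa_accepts("1101"))   # ends in "01" → True
print(nfa_accepts("0110"))   # ends in "10" → False
```

Missing (state, symbol) entries simply contribute no successor states, which is how non-determinism "dies out" on a wrong guess.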
Regular expression

• A regular expression is a pattern that defines a set of
strings. It is used to denote regular languages.
• It is also used to match character combinations in
strings. String-searching algorithms use such patterns to
find operations on strings.
• In a regular expression, x* means zero or more
occurrences of x. It can generate {ε, x, xx, xxx,
xxxx, .....}.
• In a regular expression, x+ means one or more
occurrences of x. It can generate {x, xx, xxx, xxxx, .....}.
Operations on Regular Language

The various operations on regular languages are:

• Union: If L and M are two regular languages, then their
union L ∪ M is also regular.
L ∪ M = {s | s is in L or s is in M}

• Concatenation: If L and M are two regular languages,
then their concatenation L.M is also regular.
L.M = {st | s is in L and t is in M}

• Kleene closure: If L is a regular language, then its
Kleene closure L* is also a regular language.
L* = zero or more occurrences of strings from language L.
Example
Write the regular expression for the language:
L = {ab^n w : n ≥ 3, w ∈ (a, b)+}
Solution:
Each string of language L starts with "a" followed by at
least three b's, then at least one more "a" or "b"; the
strings look like abbba, abbbbbba, abbbbbbbb,
abbbb.....a.
So the regular expression is:
r = ab^3b*(a+b)+
Here + is the positive closure, i.e. (a+b)+ = (a+b)* − ε
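The regular expression ab^3b*(a+b)+ can be checked with Python's re module, rewriting the alternation (a+b) as the character class [ab]:

```python
import re

# a, then at least three b's, then at least one symbol from {a, b}:
# ab^3 b* (a+b)+ written in Python regex syntax.
pattern = re.compile(r"ab{3,}[ab]+")

for s in ["abbba", "abbbbbba", "abbb", "abba"]:
    print(s, bool(pattern.fullmatch(s)))
```

Note that "abbb" is rejected: the (a+b)+ part requires at least one symbol after the three b's, so w may not be empty.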
Optimization of DFA

To optimize a DFA, follow these steps:

Step 1: Remove all states that are unreachable from the
initial state via any sequence of transitions of the DFA.

Step 2: Draw the transition table for the remaining states.

Step 3: Split the transition table into two tables, T1
and T2. T1 contains all the final states and T2 contains the
non-final states.

Step 4: Find the similar rows in T1 such that:

1. δ(q, a) = p
2. δ(r, a) = p

That is, find two states which have the same transitions
on a and b, and remove one of them.

• Step 5: Repeat step 4 until no similar rows remain
in transition table T1.
• Step 6: Repeat steps 4 and 5 for table T2.
• Step 7: Combine the reduced T1 and T2 tables.
The combined transition table is the transition table of the
minimized DFA.
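The table-splitting steps above are a form of partition refinement. Below is a compact sketch (Moore's minimization algorithm, run on a small illustrative DFA in which q1 and q2 are equivalent; the DFA itself is an assumption, not the one from the figure):

```python
def minimize(states, alphabet, delta, finals):
    """Split {final, non-final}, then keep refining until stable (Moore)."""
    partition = [set(finals), set(states) - set(finals)]
    changed = True
    while changed:
        changed = False
        new_partition = []
        for block in partition:
            # group states whose transitions land in the same blocks
            groups = {}
            for s in block:
                key = tuple(
                    next(i for i, b in enumerate(partition) if delta[(s, a)] in b)
                    for a in alphabet
                )
                groups.setdefault(key, set()).add(s)
            new_partition.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = new_partition
    return partition

# Illustrative DFA where q1 and q2 behave identically
delta = {
    ("q0", "0"): "q1", ("q0", "1"): "q2",
    ("q1", "0"): "q3", ("q1", "1"): "q3",
    ("q2", "0"): "q3", ("q2", "1"): "q3",
    ("q3", "0"): "q3", ("q3", "1"): "q3",
}
blocks = minimize({"q0", "q1", "q2", "q3"}, ["0", "1"], delta, {"q3"})
print(blocks)   # q1 and q2 end up in the same block
```

Each block of the final partition becomes one state of the minimized DFA, which is exactly what combining the reduced T1 and T2 tables does.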
Example
Solution:
• Step 1: In the given DFA, q2 and q4 are unreachable
states, so remove them.
• Step 2: Draw the transition table for the rest of the states.
Step 3:
Now divide the rows of the transition table into two sets:
1. One set contains the rows which start from non-final
states.
2. The other set contains the rows which start from final
states.

Step 4: Set 1 has no similar rows, so set 1 stays the same.
Step 5: In set 2, row 1 and row 2 are similar, since q3
and q5 transit to the same states on 0 and 1. So remove q5
and replace q5 by q3 in the rest.
Step 6: Now combine set 1 and set 2.

Now this is the transition table of the minimized DFA.


LEX
• Lex is a program that generates a lexical analyzer. It is
used with the YACC (Yet Another Compiler Compiler)
parser generator.
• The lexical analyzer is a program that transforms an
input stream into a sequence of tokens.
• Lex reads the specification and produces C source
code implementing the lexical analyzer.
The function of Lex is as follows:
First, the lexical-analyzer specification is written as a
program lex.l in the Lex language. Then the Lex compiler
runs the lex.l program and produces a C program lex.yy.c.
Finally, the C compiler compiles the lex.yy.c program and
produces an object program a.out.
a.out is the lexical analyzer that transforms an input stream
into a sequence of tokens.
Lex file format
A Lex program is separated into three sections by %%
delimiters. The format of Lex source is as follows:

{ definitions }
%%
{ rules }
%%
{ user subroutines }
• Definitions include declarations of constants, variables,
and regular definitions.
• Rules are statements of the form p1 {action1} p2
{action2} .... pn {actionn},
where pi describes a regular expression
and actioni describes the action the lexical analyzer
should take when pattern pi matches a lexeme.
• User subroutines are auxiliary procedures needed by
the actions. The subroutines can be compiled separately
and loaded with the lexical analyzer.
Formal grammar

• A formal grammar is a set of rules. It is used to identify
correct or incorrect strings of tokens in a language. The
formal grammar is represented as G.
• A formal grammar is used to generate all possible strings
over the alphabet that are syntactically correct in the
language.
• Formal grammar is used mostly in the syntax analysis
phase (parsing), particularly during compilation.
A formal grammar G is written as follows:
G = <V, N, P, S>
Where:
N describes a finite set of non-terminal symbols.
V describes a finite set of terminal symbols.
P describes a set of production rules.
S is the start symbol.
Example:
V = {a, b}, N = {S, R, B}
Production rules:
S = bR
R = aR
R = aB
B = b
Through these productions we can produce strings like bab, baab,
baaab, etc.
These productions describe strings of the shape ba^n b (n ≥ 1).

Fig : Formal grammar


BNF Notation

• BNF stands for Backus-Naur Form. It is used to write
a formal representation of a context-free grammar. It is
also used to describe the syntax of a programming
language.
• BNF notation is basically just a variant of a context-free
grammar.
In BNF, productions have the form:
Left side → definition
Where leftside ∈ (Vn ∪ Vt)+ and definition ∈ (Vn ∪ Vt)*. In
BNF, the leftside contains one non-terminal.
We can define several productions with the same
leftside. The alternatives are separated by the vertical
bar symbol "|".
Consider, for example, the productions:
S → aSa
S → bSb
S → c
In BNF, we can represent the above grammar as follows:
S → aSa | bSb | c
YACC

• YACC stands for Yet Another Compiler Compiler.
• YACC provides a tool to produce a parser for a given
grammar.
• YACC is a program designed to compile an LALR(1)
grammar, where LALR means Look-Ahead LR
(left-to-right scan with look-ahead).
• It is used to produce the source code of the syntactic
analyzer of the language produced by an LALR(1)
grammar.
• The input of YACC is a rule set or grammar, and the output
is a C program.
These are some points about YACC:
Input: a CFG - file.y
Output: a parser y.tab.c (yacc)
• The output file "file.output" contains the parsing tables.
• The file "file.tab.h" contains declarations.
• The parser function is called yyparse().
• The parser expects to use a function called yylex() to get
tokens.
The basic operational sequence is as follows:
1. gram.y — the file containing the desired grammar in YACC format.
2. YACC — the YACC program processes this file.
3. y.tab.c — the C source program created by YACC.
4. C compiler — compiles the generated C source.
5. a.out — the executable file that will parse the grammar given in
gram.y.
Context free grammar
A context-free grammar is a formal grammar which is used to generate all
possible strings in a given formal language.
A context-free grammar G can be defined by four tuples as:
G = (V, T, P, S)
Where G describes the grammar,
T describes a finite set of terminal symbols,
V describes a finite set of non-terminal symbols,
P describes a set of production rules, and
S is the start symbol.
In a CFG, the start symbol is used to derive the string. You
can derive the string by repeatedly replacing a non-
terminal by the right-hand side of a production, until all
non-terminals have been replaced by terminal symbols.
Example:
L = {wcw^R | w ∈ (a, b)*}
Production rules:
S → aSa
S → bSb
S → c
Now check that the string abbcbba can be derived from the given CFG:
S ⇒ aSa
  ⇒ abSba
  ⇒ abbSbba
  ⇒ abbcbba
By applying the productions S → aSa and S → bSb recursively, and
finally applying the production S → c, we get the string abbcbba.
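The derivation above can also be replayed mechanically by rewriting the non-terminal S step by step:

```python
# Replay the derivation S => aSa => abSba => abbSbba => abbcbba
# by replacing S with the chosen right-hand side at each step.
steps = ["aSa", "bSb", "bSb", "c"]   # productions applied, in order

sentential = "S"
for rhs in steps:
    sentential = sentential.replace("S", rhs, 1)
    print(sentential)
```

The final sentential form contains only terminals, so abbcbba is in the language.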
Capabilities of CFG

The various capabilities of CFG are:

• Context-free grammar is useful for describing most
programming languages.
• If the grammar is properly designed, an efficient parser
can be constructed from it automatically.
• Using associativity and precedence information,
suitable grammars for expressions can be constructed.
• Context-free grammar is capable of describing nested
structures, like balanced parentheses, matching begin-end
pairs, corresponding if-then-else's, and so on.
Derivation
• A derivation is a sequence of production rule applications. It
is used to obtain the input string through these production
rules. During parsing we have to take two decisions. These
are as follows:
• We have to decide which non-terminal is to be
replaced.
• We have to decide the production rule by which the
non-terminal will be replaced.
• There are two common orders for choosing which
non-terminal to replace.
Left-most Derivation

In the left-most derivation, the input is scanned and
replaced with the production rules from left to right. So in a
left-most derivation we read the input string from left to
right.
Example:
Production rules:
S = S + S  
S = S - S  
S = a | b |c  
Input:
a-b+c
The left-most derivation is:
S = S + S  
S = S - S + S  
S = a - S + S  
S = a - b + S  
S = a - b + c  
Right-most Derivation

In the right-most derivation, the input is scanned and
replaced with the production rules from right to left. So in a
right-most derivation we read the input string from right
to left.
Example : S = S + S  
S = S - S  
S = a | b |c  
Input:
a–b+c
The right-most derivation is:
S = S - S  
S = S - S + S  
S = S - S + c  
S = S - b + c  
S = a - b + c  
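Both derivation orders can be replayed mechanically; the sketch below rewrites either the leftmost or the rightmost S at each step, and the step lists mirror the two examples above:

```python
def derive(steps, leftmost=True):
    """Rewrite S step by step, replacing the leftmost or rightmost S."""
    sentential = "S"
    trace = []
    for rhs in steps:
        if leftmost:
            sentential = sentential.replace("S", rhs, 1)   # leftmost S
        else:
            i = sentential.rindex("S")                     # rightmost S
            sentential = sentential[:i] + rhs + sentential[i + 1:]
        trace.append(sentential)
    return trace

print(derive(["S+S", "S-S", "a", "b", "c"], leftmost=True))
print(derive(["S-S", "S+S", "c", "b", "a"], leftmost=False))
```

Both traces end in the same string "a-b+c": the two derivations differ only in the order of the rewriting, not in the result.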
Parse tree

• A parse tree is the graphical representation of symbols. The
symbols can be terminals or non-terminals.
• In parsing, the string is derived using the start symbol.
The root of the parse tree is that start symbol.
• A parse tree follows the precedence of operators. The
deepest sub-tree is traversed first, so the operator in a
parent node has lower precedence than the operator in its
sub-tree.
The parse tree follows these points:

• All leaf nodes have to be terminals.
• All interior nodes have to be non-terminals.
• An in-order traversal gives the original input string.
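The last point can be checked on a small parse tree for a - b + c (the nested-tuple representation of the tree is an assumption for illustration):

```python
# Parse tree for "a-b+c" with grammar S = S+S | S-S | a | b | c:
# tuples are (left, root, right), leaves are terminal strings.
tree = (("a", "-", "b"), "+", "c")

def inorder(node):
    """An in-order traversal of the parse tree recovers the input string."""
    if isinstance(node, str):
        return node
    left, root, right = node
    return inorder(left) + root + inorder(right)

print(inorder(tree))   # → a-b+c
```

Note also that "-" sits deeper in the tree than "+", matching the left-to-right evaluation of a - b + c.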
Construct a parse tree for E → E + E | E * E | id
Construct a parse tree for S → SS* | SS+ | a
Ambiguity

A grammar is said to be ambiguous if there exists more
than one leftmost derivation, more than one rightmost
derivation, or more than one parse tree for a given input
string. If the grammar is not ambiguous, then it is called
unambiguous.
Example:
S = aSb | SS
S = ε
For the string aabb, the above grammar generates two
parse trees.
If a grammar is ambiguous, it is not good for compiler
construction. No method can automatically detect and
remove ambiguity in general, but you can remove ambiguity
by rewriting the whole grammar without ambiguity.
Problem:

Check whether the grammar G with production rules
X → X+X | X*X | X | a
is ambiguous or not.
Solution:
Let's find the derivation trees for the string
"a+a*a". It has two leftmost derivations.
Derivation 1 − X → X+X → a+X → a+X*X → a+a*X → a+a*a
Parse tree 1 −
Derivation 2 − X → X*X → X+X*X → a+X*X → a+a*X → a+a*a
Parse tree 2 −

Since there are two parse trees for the single string
"a+a*a", the grammar G is ambiguous.
