CD Course Material
COURSE PLANNING
DOCUMENT
Prepared by
COMPILER DESIGN (CS 115)
Pre-requisite
• Knowledge of automata theory
• Context free languages
• Computer architecture
• Data structures and simple graph algorithms
• Logic or algebra
Learning Resources
• Textbooks, Class Notes
Text Books
1. Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman, "Compilers: Principles, Techniques and Tools", 16th Indian Reprint, Pearson Education Asia, ISBN 81-7808-046-X, 2004.
2. D. M. Dhamdhere, "Compiler Construction", 2nd Edition, Macmillan India Ltd, ISBN 0333-90406-0, 1997.
Reference Books
1. John J. Donovan, "Systems Programming", McGraw-Hill.
2. Leland L. Beck, "System Software – An Introduction to Systems Programming", Addison-Wesley.
Additional Resources (links etc)
1. books.google.co.in (Computers, Programming, General)
2. www.amazon.com (Books, Computers and Technology)
3. http://nptel.iitm.ac.in
Reading materials:
1. Online Video links
How to Contact Instructor:
Technology Requirements:
• Learning management system (Google classroom, etc.)
Overview of Course:
• What is the course about: its purpose?
Compiler design principles provide an in-depth view of translation and optimization
process. Compiler design covers basic translation mechanism and error detection &
recovery. It includes lexical, syntax, and semantic analysis as front end, and code
generation and optimization as back-end
• What are the general topics or focus?
1. Phases of compiler
2. Lexical Analysis
3. Parsing Techniques
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation
• Why would students want to take this course and learn this material?
1. Helps the student improve problem-solving skills
2. Helps in learning further programming languages
3. Helps in developing a compiler
4. As the course is logic oriented, students will be able to improve their logical thinking
Methods of instruction
• Lecture (chalk & talk / ICT)
Workload
• Estimated amount of time a student needs to spend on course readings (per week): 2 hours
• Estimated amount of time a student needs to spend on homework for practicing problems (per week): 2 hours
Assessment
S. No | Assessment Methodology | No. of assessments | Weightage in marks | Marks scaled to
1 | Quizzes | 5 | 5 |
2 | Class test | -- | -- | 5
3 | Assignment | -- | -- |
4 | Course Activity | -- | -- | --
5 | Attendance | -- | -- | 5
6 | Internal exams | 2 | 20 | 20
7 | SEE | -- | -- | 70
(Rows 1–6 constitute the CIE component; row 7 is the SEE.)
Note:
• Class test / Quiz – the marks allotted for the quiz will be credited to the assignment component
• Since the assessment is conducted online, the results will be displayed to the students immediately.
Absentees for class assessments / Quiz: in case the student is absent, a structured enquiry problem will be given as an assignment with a deadline. If the assignment is not submitted in time, he/she will be given zero marks.
Key concepts:
1. Compiler
2. Assembler, Translator
3. Lexical Analysis
4. Syntax Analysis
5. Semantic Analysis
6. Intermediate Code Generator
7. Code Optimizer
8. Code Generator
LESSON PLAN
Course Outcomes (COs) mapped to Program Outcomes (PO1-PO12, PSO1, PSO2); entries indicate correlation strength (3 = high, 2 = medium):
1. Illustrate the different phases of a compiler, and implement practical aspects of automata theory (3, 3, 2, 2, 3, 2)
2. Interpret storage organization and allocation strategies for dynamic storage systems (3, 2, 2, 2, 2, 2)
3. Analyze the knowledge of different phases in designing a compiler (3, 2, 2, 2)
4. Apply code generation and optimization techniques (2, 3, 3, 2, 2, 2, 2, 3, 2)
Course Syllabus
UNIT I
Introduction to Compiling: Compiler, Phases of a compiler, Analysis of the source
program, Cousins of the compiler, grouping of phases, Compiler writing tools.
Lexical Analysis: The role of the lexical analyzer, Specification of tokens, Recognition of tokens, A language for specifying lexical analyzers, Finite automata, Optimization of DFA-based pattern matchers.
UNIT II
Syntax Analysis: The role of a parser, Context-free grammars, writing a grammar, Parsing,
Ambiguous grammar, Elimination of Ambiguity, Classification of parsing techniques
Top down parsing: Back Tracking, Recursive Descent parsing, FIRST ( ) and FOLLOW ( )
- LL Grammars, Non-Recursive descent parsing, Error recovery in predictive parsing.
UNIT III
Bottom Up parsing: SR parsing, Operator Precedence Parsing, LR grammars, LR Parsers –
Model of an LR Parsers, SLR parsing, CLR parsing, LALR parsing, Error recovery in LR
Parsing, handling ambiguous grammars.
UNIT IV
Syntax Directed Translation: Syntax Directed Definition, S-attributed definitions, L-
attributed definitions, Attribute grammar, S-attributed grammar, L-attributed grammar.
Semantic Analysis: Type Checking, Type systems, Type expressions, Equivalence of type
expressions.
Intermediate Code Generation: Construction of syntax trees, Directed Acyclic Graph,
Three Address Codes.
UNIT V
Runtime Environments: Storage organization, Storage-allocation strategies, Symbol tables,
Activation records.
Code Optimization: The principal sources of optimization, Basic blocks and Flow graphs,
data-flow analysis of flow graphs.
Code Generation: Issues in the design of a code generator, the target machine code, Next-
use information, a simple code generator, Code-generation algorithm.
TEXT BOOKS
1. Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman, "Compilers: Principles, Techniques and Tools", 16th Indian Reprint, Pearson Education Asia, ISBN 81-7808-046-X, 2004.
2. D. M. Dhamdhere, "Compiler Construction", 2nd Edition, Macmillan India Ltd, ISBN 0333-90406-0, 1997.
REFERENCE BOOKS
1. John J. Donovan, "Systems Programming", McGraw-Hill.
2. Leland L. Beck, "System Software – An Introduction to Systems Programming", Addison-Wesley.
WEB LINKS
1. books.google.co.in (Computers, Programming, General)
2. www.amazon.com (Books, Computers and Technology)
3. http://nptel.iitm.ac.in
LESSON PLAN
13 Optimization of DFA-based pattern matchers Chalk & Talk
Quiz will be conducted for UNIT I through Google classroom / Google forms
UNIT-II
14 Syntax Analysis: The role of a parser Chalk & Talk
15 Context-free grammars Think-Pair-Share
16 Writing a grammar; Parsing Chalk & Talk
17 & 18 Ambiguous grammar, Elimination of Ambiguity Brain storming
19 Classification of parsing techniques Chalk & Talk
20 Top down parsing –Back Tracking Chalk & Talk
21 Recursive Descent parsing Chalk & Talk
22&23 FIRST( ) and FOLLOW( )- LL Grammars Role Play
24 Non-Recursive descent parsing Chalk & Talk
25 Error recovery in predictive parsing Chalk & Talk
Quiz will be conducted for UNIT II through Google classroom / Google forms
LL(k) problem solving using Think-Pair-Share activity
UNIT-III
26 Bottom Up parsing- SR parsing Chalk & Talk
27 Operator Precedence Parsing Chalk & Talk
28 LR grammars Chalk & Talk
29 LR Parsers – Model of an LR Parsers Chalk & Talk
30 & 31 SLR parsing Chalk & Talk
32 &33 CLR parsing Chalk & Talk
34 LALR parsing Chalk & Talk
35 Error recovery in LR Parsing Chalk & Talk
36 Handling ambiguous grammars Chalk & Talk
Quiz will be conducted for UNIT III through Google classroom / Google forms
LR Grammars problem solving using Think-Pair-Share activity
I Mid Term Examinations
UNIT-IV
37 Syntax Directed Translation Chalk & Talk
38 Syntax-directed definition Chalk & Talk
39 S-attributed definitions, L-attributed definitions Chalk & Talk
40 Attribute grammar Chalk & Talk
41 S-attributed grammar, L-attributed grammar Chalk & Talk
42 Semantic Analysis: Type Checking Chalk & Talk
43 Type systems, Type expressions, Equivalence of type expressions Chalk & Talk
44 Intermediate Code Generation Chalk & Talk
45 Construction of syntax trees Chalk & Talk
46 Directed acyclic graph Chalk & Talk
47 Three address codes Chalk & Talk
Quiz will be conducted for UNIT IV through Google classroom / Google forms
UNIT-V
48 Runtime Environments PPT
49 Storage organization PPT
50 Storage-allocation strategies PPT
51 Symbol tables PPT
52 Activation records PPT
53 & 54 Code Optimization: The principal sources of optimization PPT
55 Basic blocks and Flow graphs PPT
56 Data-flow analysis of flow graphs PPT
57 Code Generation: Issues in the design of a code generator PPT
58 The target machine code PPT
59 Next-use information, A simple code generator PPT
60 Code-generation algorithm PPT
Quiz will be conducted for UNIT V through Google classroom / Google forms
II Mid Term Examinations
Compiler Design - Introduction to Compiling, Lexical Analysis
UNIT-1
Introduction to Compiling: Compiler, Phases of a compiler, Analysis of the source
program, Cousins of the compiler, grouping of phases, Compiler writing tools.
Lexical Analysis: The role of the lexical analyzer, Specification of tokens, Recognition of tokens, A language for specifying lexical analyzers, Finite automata, Optimization of DFA-based pattern matchers.
UNIT WISE PLAN
S. No. | Topic Learning Outcomes | COs | Blooms Levels
(Figure: a compiler takes a Source Program as input and produces a Target Program as output.)
A compiler may appear complex because of the variety of tasks it is supposed to perform. Despite this apparent complexity, the basic tasks that any compiler must perform are essentially the same.
• The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
• The synthesis part constructs the desired target program from the intermediate
representation.
In Compiling, analysis part consists of three phases:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
In Compiling, synthesis part consists of three phases:
1. Intermediate code generator
2. Code optimization
3. Code generator
Lexical analysis:
In a compiler, linear analysis is called lexical analysis or scanning. The lexical analysis phase reads the characters in the source program and groups them into tokens: sequences of characters having a collective meaning.
Example:
Source program – position := initial + rate * 60
Identifiers – position, initial, rate
Operators – +, *
Assignment symbol – :=
Number – 60
Blanks – eliminated.
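For illustration, a minimal C sketch of a scanner for this statement; the token class names and the fixed input string are illustrative assumptions, not a prescribed design:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum { T_ID, T_ASSIGN, T_PLUS, T_STAR, T_NUM, T_EOF };   /* illustrative token classes */

static const char *input = "position := initial + rate * 60";
static int pos = 0;

/* Returns the class of the next token and copies its lexeme; blanks are skipped. */
int next_token(char lexeme[], int cap) {
    while (input[pos] == ' ') pos++;                      /* blanks are eliminated */
    if (input[pos] == '\0') return T_EOF;
    int start = pos;
    if (isalpha((unsigned char)input[pos])) {             /* identifier: letter (letter|digit)* */
        while (isalnum((unsigned char)input[pos])) pos++;
    } else if (isdigit((unsigned char)input[pos])) {      /* number */
        while (isdigit((unsigned char)input[pos])) pos++;
    } else if (input[pos] == ':' && input[pos + 1] == '=') {
        pos += 2;                                         /* assignment symbol := */
    } else {
        pos++;                                            /* single-character operator + or * */
    }
    int len = pos - start;
    if (len >= cap) len = cap - 1;
    memcpy(lexeme, input + start, len);
    lexeme[len] = '\0';
    if (isalpha((unsigned char)input[start])) return T_ID;
    if (isdigit((unsigned char)input[start])) return T_NUM;
    if (lexeme[0] == ':') return T_ASSIGN;
    return lexeme[0] == '+' ? T_PLUS : T_STAR;
}

int main(void) {
    char lexeme[32];
    for (int t; (t = next_token(lexeme, sizeof lexeme)) != T_EOF; )
        printf("token class %d, lexeme \"%s\"\n", t, lexeme);
    return 0;
}

Running this prints one line per token: the three identifiers, the assignment symbol, the two operators, and the number, with the blanks consumed silently.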
Syntax analysis:
Hierarchical analysis is called parsing or syntax analysis. It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. They are represented using a syntax tree as shown in Fig. 1.2.2.
• A syntax tree is the tree generated as a result of syntax analysis in which the interior
nodes are the operators and the exterior nodes are the operands.
• This analysis shows an error when the syntax is incorrect.
Example (intermediate code for position := initial + rate * 60):
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Code Optimization
The code optimization phase attempts to improve the intermediate code, so that faster
running machine codes will result. Some optimizations are trivial. There is a great variation
in the amount of code optimization different compilers perform. In those that do the most,
called “optimizing compilers”, a significant fraction of the time of the compiler is spent on
this phase.
Example:
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation
The final phase of the compiler is the generation of target code, consisting normally
of relocatable machine code or assembly code. Memory locations are selected for each of the
variables used by the program. Then, intermediate instructions are each translated into a
sequence of machine instructions that perform the same task. A crucial aspect is the
assignment of variables to registers.
Example:
MOVF id3, r2
MULF #60.0, r2
MOVF id2, r1
ADDF r2, r1
MOVF r1, id1
Symbol table management
An essential function of a compiler is to record the identifiers used in the source
program and collect information about various attributes of each identifier. A symbol table is
a data structure containing a record for each identifier, with fields for the attributes of the
identifier. The data structure allows us to find the record for each identifier quickly and to
store or retrieve data from that record quickly. When an identifier in the source program is
detected by the lexical analyzer, the identifier is entered into the symbol table.
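As an illustration, a minimal C sketch of such a data structure: a fixed-size hash table whose records are chained to resolve collisions. The field names and the single type attribute are assumptions, not the notes' prescribed layout:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                /* number of buckets (a prime) */

struct symbol {
    char *name;                       /* the identifier's lexeme */
    char *type;                       /* one example attribute, e.g. "real" */
    struct symbol *next;              /* chain for collision resolution */
};

static struct symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Find the record for an identifier quickly, or return NULL if absent. */
struct symbol *lookup(const char *name) {
    for (struct symbol *p = table[hash(name)]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;
}

/* Enter an identifier when the lexical analyzer detects it. */
struct symbol *insert(const char *name, const char *type) {
    unsigned h = hash(name);
    struct symbol *p = malloc(sizeof *p);
    p->name = strdup(name);           /* strdup is POSIX (and C23) */
    p->type = strdup(type);
    p->next = table[h];
    table[h] = p;
    return p;
}

int main(void) {
    insert("position", "real");
    printf("%s\n", lookup("position") ? lookup("position")->type : "not found");
    return 0;
}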
Let us first understand how a program, using C compiler, is executed on a host machine.
• User writes a program in C language (high-level language).
• The C compiler compiles the program and translates it to assembly program (low-
level language).
• An assembler then translates the assembly program into machine code (object).
• A linker tool is used to link all the parts of the program together for execution
(executable machine code).
• A loader loads all of them into memory and then the program is executed.
The cousins of the compiler are: 1. Preprocessor 2. Assembler 3. Loader and Link-editor
1.4.1. Preprocessor
A preprocessor is a program that processes its input data to produce output that is
used as input to another program. The output is said to be a preprocessed form of the input
data, which is often used by some subsequent programs like compilers.
They may perform the following functions:
1. Macro processing
2. File Inclusion
3. Rational Preprocessors
4. Language extension
1. Macro processing:
A macro is a rule or pattern that specifies how a certain input sequence should be
mapped to an output sequence according to a defined procedure. The mapping process that
instantiates a macro into a specific output sequence is known as macro expansion.
2. File Inclusion:
Preprocessor includes header files into the program text. When the preprocessor finds
an #include directive it replaces it by the entire content of the specified file.
3. Rational Preprocessors:
These processors change older languages with more modern flow-of-control and data-
structuring facilities.
4. Language extension:
These processors attempt to add capabilities to the language by what amounts to built-
in macros. For example, the language Equel is a database query language embedded in C.
1.4.2. Assembler
Assembler creates object code by translating assembly instruction mnemonics into
machine code. There are two types of assemblers:
• One-pass assemblers go through the source code once and assume that all symbols
will be defined before any instruction that references them.
• Two-pass assemblers create a table with all symbols and their values in the first pass,
and then use the table in a second pass to generate code
1.5. ASSEMBLER:
Programmers found it difficult to write or read programs in machine language. They began to use a mnemonic (symbol) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler program is called the source program; the output is a machine language translation (object program).
INTERPRETER:
An interpreter is a program that appears to execute a source program as if it were machine
language.
Lexeme
A collection or group of characters forming a token is called a lexeme. A lexeme is a sequence of characters in the source program that is matched by the pattern for the token. For example, in the Pascal statement const pi = 3.1416; the substring pi is a lexeme for the token identifier.
Patterns
A pattern is a rule describing a set of lexemes that can represent a particular token in
source program. The pattern for the token const in the above table is just the single string
const that spells out the keyword.
Token | Lexeme | Pattern
const | const | const
if | if | if
relation | <, <=, =, <>, >, >= | < or <= or = or <> or > or >=
id | pi | letter followed by letters and digits
num | 3.14 | any numeric constant
literal | "core" | any characters between " and " except "
Fig. 1.9.2 Example of Token, Lexeme and Pattern
Certain language conventions impact the difficulty of lexical analysis. Languages such as FORTRAN require certain constructs to appear in fixed positions on the input line. Thus the alignment of a lexeme may be important in determining the correctness of a source program.
Attributes of Token
The lexical analyzer returns to the parser a representation for the token it has found.
The representation is an integer code if the token is a simple construct such as a left
parenthesis, comma, or colon. The representation is a pair consisting of an integer code and a
pointer to a table if the token is a more complex element such as an identifier or constant.
The integer code gives the token type, the pointer points to the value of that token.
Pairs are also returned whenever we wish to distinguish between instances of a token.
The attributes influence the translation of tokens.
i. Constant : value of the constant
LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, which means that there is no way to recognize a lexeme as a valid token for the lexer. Syntax errors, on the other hand, are thrown by the parser when a given set of already recognized valid tokens does not match any of the right sides of the grammar rules. Simple panic-mode error handling requires that we return to a high-level parsing function when a parsing or lexical error is detected.
Error Recovery Strategies in Lexical Analysis
The following are the error-recovery actions in lexical analysis:
1. Deleting an extraneous character
2. Inserting a missing character
3. Replacing an incorrect character by a correct character
4. Transposing two adjacent characters
5. Panic mode recovery: Deletion of successive characters from the token until error is
resolved
In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For
example, banana is a string of length six. The empty string, denoted ε, is the string of length
zero.
Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the
end of string s. For example, ban is a prefix of banana
2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s. For example, nana is a suffix of banana
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example,
nan is a substring of banana
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself
5. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s. For example, baan is a subsequence of banana
Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
The following example shows the operations on languages: Let L = {0,1} and S = {a,b,c}
1. Union : L U S={0,1,a,b,c}
2. Concatenation : L.S={0a,1a,0b,1b,0c,1c}
3. Kleene closure : L* = {ε, 0, 1, 00, 01, 10, 11, ...}
4. Positive closure : L+ = {0, 1, 00, 01, 10, 11, ...}
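A small C demonstration of the two finite operations above; the closures are infinite, so only union and concatenation are enumerated (this sketch is an added illustration):

#include <stdio.h>

int main(void) {
    const char *L[] = {"0", "1"};            /* language L */
    const char *S[] = {"a", "b", "c"};       /* language S */

    printf("L U S = {");                     /* union: all strings of either language */
    for (int i = 0; i < 2; i++) printf(" %s", L[i]);
    for (int j = 0; j < 3; j++) printf(" %s", S[j]);
    printf(" }\n");

    printf("L.S = {");                       /* concatenation: each L-string followed by each S-string */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            printf(" %s%s", L[i], S[j]);
    printf(" }\n");
    return 0;
}

The program prints L U S = { 0 1 a b c } and L.S = { 0a 0b 0c 1a 1b 1c }, the same sets as above.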
Regular Expressions
• Each regular expression r denotes a language L(r)
• Here are the rules that define the regular expressions over some alphabet Σ and the
languages that those expressions denote:
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is
the empty string
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with ‘a’ in its one position
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
a) (r)|(s) is a regular expression denoting L(r) U L(s)
b) (r)(s) is a regular expression denoting L(r)L(s)
c) (r)* is a regular expression denoting (L(r))*
d) (r) is a regular expression denoting L(r)
Regular set
A language that can be defined by a regular expression is called a regular set. If two regular
expressions r and s denote the same regular set, we say they are equivalent and write r = s.
There are a number of algebraic laws for regular expressions that can be used to manipulate
into equivalent forms.
For instance, r | s = s | r is commutative; r | (s | t) = (r | s) | t is associative.
Regular Definitions
Giving names to regular expressions is referred to as a Regular definition. If Σ is an alphabet
of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ U {dl, d2,. . . , di-l}.
Example: Identifiers is the set of strings of letters and digits beginning with a letter. Regular
definition for this set:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | …. | 9
id → letter ( letter | digit ) *
Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to
introduce notational short hands for them.
1. One or more instances: the unary postfix operator + means one or more instances; if r is a regular expression, (r)+ denotes the language (L(r))+.
2. Zero or one instance: the unary postfix operator ? means zero or one instance; r? denotes the language L(r) U {ε}.
3. Character Classes:
• The notation [abc] where a, b and c are alphabet symbols denotes the regular expression
a|b|c
• Character class such as [a – z] denotes the regular expression a | b | c | d | ….|z
• We can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*
Non-regular Set
A language which cannot be described by any regular expression is a non-regular set.
Example: The set of all strings of balanced parentheses and repeating strings cannot be
described by a regular expression. This set can be specified by a context-free grammar.
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as
well as the lexemes denoted by relop, id, and num.
Consider the following grammar fragment:
stmt → if expr then stmt
| if expr then stmt else stmt | ε
Transition diagrams
Transition Diagram has a collection of nodes or circles, called states. Each state represents a
condition that could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns. Edges are directed from one state of the transition diagram to
another. Each edge is labeled by a symbol or set of symbols. If we are in a state s and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
The above transition diagram is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code.
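A sketch of that idea in C for the identifier diagram above; the buffer helpers GETCHAR and RETRACT are stand-ins for the scanner's real input routines, and the sample input is an assumption:

#include <ctype.h>
#include <stdio.h>

static const char *buf = "rate2 + 60";     /* assumed input buffer */
static int fwd = 0;                        /* the forward pointer */

static int  GETCHAR(void) { int c = (unsigned char)buf[fwd]; fwd++; return c; }  /* '\0' marks the end */
static void RETRACT(void) { fwd--; }       /* give back one character */

/* Transition diagram for id = letter ( letter | digit )*:
   state 0 on a letter goes to state 1; state 1 loops on letters and digits;
   any other character leads to the accepting state, where we retract once. */
int recognize_id(void) {
    int c = GETCHAR();                              /* state 0 */
    if (!isalpha(c)) { RETRACT(); return 0; }       /* fail: not an identifier */
    do { c = GETCHAR(); } while (isalnum(c));       /* state 1 */
    RETRACT();                                      /* accept: un-read the lookahead */
    return 1;
}

int main(void) {
    printf("identifier recognized: %d (forward stops at '%c')\n", recognize_id(), buf[fwd]);
    return 0;
}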
Recognizing Numbers
The diagram below is from the second edition. It is essentially a combination of the three
diagrams in the first edition.
• Lex
• YACC
Lex is a computer program that generates lexical analyzers. Lex is commonly used with the
yacc parser generator.
Creating a lexical analyzer
p1 {action1}
p2 {action2}
…
pn {actionn}
where pi is a regular expression and actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme. Actions are written in C code.
The next, translation rules, section gives the patterns of the lexemes that the lexer will
recognize and the actions to be performed upon recognition. Normally, these actions
include returning a token name to the parser and often returning other information
about the token via the shared variable yylval.
• User subroutines are auxiliary procedures needed by the actions. These can be
compiled separately and loaded with the lexical analyzer. If a return is not specified
the lexer continues executing and finds the next lexeme present.
YACC provides a general tool for describing the input to a computer program. The YACC user specifies the structure of the input, together with code to be invoked as each such structure is recognized.
YACC turns such a specification into a subroutine that handles the input process;
frequently, it is convenient and appropriate to have most of the flow of control in the user's
application handled by this subroutine.
Finite Automata is one of the mathematical models that consist of a number of states
and edges. It is a transition diagram that recognizes a regular expression or grammar.
There are two types of Finite Automata:
ii. There is at most one transition from each state on any input
iii. For each symbol a and state s, there is at most one labeled edge a leaving s. i.e.
transition function is from pair of state-symbol to state (not set of states)
A DFA is a five-tuple denoted by
M = (Q, Σ, δ, q0, F)
Q: set of all states
Σ: set of input symbols (symbols which the machine takes as input)
q0: initial state (starting state of the machine)
F: set of final states
δ: transition function, defined as δ : Q × Σ → Q
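As a hedged illustration of these five tuples in code, the following C sketch simulates the standard DFA for (a|b)*abb, assuming states 0-3 with F = {3}; the transition table encodes δ:

#include <stdio.h>

/* delta[state][symbol]: symbol 0 stands for 'a', symbol 1 for 'b'. */
static const int delta[4][2] = {
    {1, 0},    /* state 0 */
    {1, 2},    /* state 1 */
    {1, 3},    /* state 2 */
    {1, 0},    /* state 3, the accepting state */
};

int accepts(const char *w) {
    int q = 0;                                   /* q0, the initial state */
    for (; *w; w++) {
        if (*w != 'a' && *w != 'b') return 0;    /* symbol not in Σ */
        q = delta[q][*w - 'a'];                  /* apply δ */
    }
    return q == 3;                               /* accept iff we end in F */
}

int main(void) {
    printf("%d %d\n", accepts("abab"), accepts("ababb"));   /* prints 0 1 */
    return 0;
}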
• ε- transitions are allowed in NFAs. In other words, we can move from one state to
another one without consuming any symbol.
• A NFA accepts a string x, if and only if there is a path from the starting state to one of
accepting states such that edge labels along this path spell out x.
The regular expression is converted into minimized DFA by the following procedure:
Regular expression → NFA → DFA → Minimized DFA
• This is one way to convert a regular expression into a NFA
• There can be other ways (much efficient) for the conversion
Algorithm 1.13.1:
Thompson’s Construction is a simple and systematic method
• It guarantees that the resulting NFA will have exactly one final state, and one start
state.
• Construction starts from simplest parts (alphabet symbols).
• To create a NFA for a complex regular expression, NFAs of its sub-expressions are
combined to create its NFA.
• To recognize an empty string ε
Example:
For a RE (a | b) * a, the NFA construction is shown below.
Fig 1.14.1 A Lex program is turned into a transition table and actions, which are used
by a finite-automaton simulator
These components are:
• A transition table for the automaton.
• Those functions that are passed directly through Lex to the output
• The actions from the input program, which appear as fragments of code to be invoked
at the appropriate time by the automaton simulator.
Example: We shall illustrate the ideas of this section with the following simple, abstract
example:
In particular, string abb matches both the second and third patterns, but we shall consider it a lexeme for pattern p2, since that pattern is listed first in the above Lex program. Input strings such as aabbb have many prefixes that match the third pattern. The Lex rule is to take the longest, so we continue reading b's until another a is met, whereupon we report the lexeme to be the initial a's followed by as many b's as there are.
If the lexical analyzer simulates an NFA, then it must read input beginning at the point in its input which we have referred to as lexemeBegin. As it moves the pointer called forward ahead in the input, it calculates the set of states it is in at each point, following the simulation algorithm.
Eventually, the NFA simulation reaches a point on the input where there are no next states.
At that point, there is no hope that any longer prefix of the input would ever get the NFA to
an accepting state; rather, the set of states will always be empty. Thus, we are ready to decide
on the longest prefix that is a lexeme matching some pattern.
Fig 1.14.5 Sequence of sets of states entered when processing input aaba
We look backwards in the sequence of sets of states, until we find a set that includes one
or more accepting states. If there are several accepting states in that set, pick the one
associated with the earliest pattern pi in the list from the Lex program. Move
the forward pointer back to the end of the lexeme, and perform the action Ai associated with
pattern pi.
Fig 1.14.6 transition graph for DFA handling the patterns a, abb, and a*b+
To begin our discussion of how to go directly from a regular expression to a DFA, we must first dissect the NFA construction of Algorithm 1.13.1 and consider the roles played by various states. We call a state of an NFA important if it has a non-ε out-transition. Notice that the subset construction uses only the important states in a set T when it computes ε-closure(move(T, a)), the set of states reachable from T on input a. That is, the set of states move(s, a) is nonempty only if state s is important. During the subset construction, two sets of NFA states can be identified (treated as if they were the same set) if they have the same important states, and either both have an accepting state or neither does.
When the NFA is constructed from a regular expression by Algorithm 1.13.1, we can say
more about the important states. The only important states are those introduced as initial
states in the basis part for a particular symbol position in the regular expression. That is, each
important state corresponds to a particular operand in the regular expression.
The constructed NFA has only one accepting state, but this state, having no out-
transitions, is not an important state. By concatenating a unique right endmarker # to a
regular expression r, we give the accepting state for r a transition on #, making it an important
state of the NFA for ( r ) # . In other words, by using the augmented regular expression ( r ) #
, we can forget about accepting states as the subset construction proceeds; when the
construction is complete, any state with a transition on # must be an accepting state.
The important states of the NFA correspond directly to the positions in the regular
expression that hold symbols of the alphabet. It is useful, as we shall see, to present the
regular expression by its syntax tree, where the leaves correspond to operands and the interior
nodes correspond to operators. An interior node is called a cat-node, or-node, or star-node if
it is labeled by the concatenation operator (dot), union operator |, or star operator *,
respectively. We can construct a syntax tree for a regular expression just as we did for
arithmetic expressions.
Example 1: Figure 1.15.1 shows the syntax tree for the regular expression of our running
example. Cat-nodes are represented by circles.
Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we attach a unique integer.
position of its symbol. Note that a symbol can have several positions; for instance, a has
positions 1 and 3 in Fig. 1.15.1. The positions in the syntax tree correspond to the important
states of the constructed NFA.
Example 2: Figure 1.15.2 shows the NFA for the same regular expression as Fig. 1.15.1,
with the important states numbered and other states represented by letters. The numbered
states in the NFA and the positions in the syntax tree correspond in a way we shall soon see.
SOLVED PROBLEMS
1. Construct finite automata for the Regular expression (b|ab*ab*)*
7. Explain how the following statement will be translated into every phase:
sum := oldsum + rate * 50.
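A sketch of the expected answer, following the position := initial + rate * 60 example worked earlier (assuming sum, oldsum and rate are real variables):

Lexical analysis: the statement becomes id1 := id2 + id3 * 50, with sum, oldsum and rate entered into the symbol table.
Syntax analysis: a syntax tree is built with := at the root, id1 as its left child, and + as its right child; + has children id2 and *, and * has children id3 and 50.
Semantic analysis: the integer 50 is converted to real, giving id1 := id2 + id3 * inttoreal(50).
Intermediate code generation:
temp1 := inttoreal(50)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Code optimization:
temp1 := id3 * 50.0
id1 := id2 + temp1
Code generation:
MOVF id3, r2
MULF #50.0, r2
MOVF id2, r1
ADDF r2, r1
MOVF r1, id1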
18. Which of the following strings is not generated by the following grammar? ( )
S → SaSbS|ε
a) aabb b) abab c) aababb d) aaabbb
19. ______________are an important notation for specifying lexeme patterns
20. What is the Regular Expression Matching Zero or More Specific Characters ( )
a) + b) # c) * d) &
21. What is the Regular Expression Matching One or More Specific Characters ( )
a) + b) # c) * d) &
22. The _____________put all the executable object files into main memory for execution.( )
a) Text Editor b) Assembler c) Linker d) Loader
23. Regular expression (x|y)(x|y) denotes the set ( )
a) {xy,xy} b) {xx,xy,yx,yy} c) {x,y} d) {x,y,xy}
24. A compiler for a high-level language that runs on one machine and produces code for a
different machine is called ( )
a) Optimizing compiler b) One pass compiler
c) Cross compiler d) Multipass compiler
25. The output of lexical analyzer is ( )
a) A set of Regular Expressions b) Syntax Tree
c) Set of Tokens d) String Character
26. What is the regular expression to print character literally ( )
a) “c” b){c} c)c+ d)c$
27. In which phase the concept of grammar is used in compilation ( )
a) Lexical analysis b) Parser
c) Code generation d) Code optimization
28. The set of all strings over ∑ = {a,b} in which all strings having bbbb as substring is
a) (a+b)* bbbb (a+b)* b) (a+b)* bb (a+b)*bb
c) bbb(a+b)* d) bb (a+b)*
29. The set of all strings over ∑ ={a,b} in which a single a is followed by any number of b’s
or a single b followed by any number of a’s is ( )
a) ab* | ba* b) ab*ba* c) a*b + b*a d) None of the mentioned
30. Regular expressions are used to represent which language ( )
a) Recursive language b) Context free language
c) Regular language d) All of the mentioned
31. The set of all strings over ∑ = {a,b} in which strings consisting a’s and b’s and ending
with in bb is ( )
a) ab b) a*bbb c) (a+b)* bb d) All of the mentioned
32. Which of the following regular expression denotes zero or more instances of a or b?
a) a|b b) (a|b)* c) (ab)* d) a*b
33. The string (a)|((b)*(c)) is equivalent to ( )
a) Empty b) abcabc c) b*c|a d) None of the mentioned
34. Which of the following is not a cousin of compiler ( )
a. Assembler b. Linker c. Sentinel d. Loader
35. Output file of Lex is _____ ? ( )
a) Myfile.e b) Myfile.yy.c c) Myfile.lex d) Myfile.obj
36. The number of tokens in the following C statement is ( )
printf("i = %d, &i = %x", i, &i);
a. 3 b. 26 c. 10 d. 21
37. __________ accept the stream of character as input and produces stream of token as
output.
a. Parser b. Lexical analyzer c. Scanner d. b and c
38. The sequence of characters in the program that is matched by pattern for a token is known
as____________________.
a) Lexeme b) Regular Expression c) Loader d) Scanner
39. LEX specification consists of__________parts. ( )
a. 1 b. 2 c.3 d. 4
40. The Regular Expression a+ denotes________________________________ ( )
a. Set of all string of one (or) more number of a’s
b. Set of all string of zero (or) more number of a’s
c. Set of all string of two consecutive a’s
d. All the above
41. The Regular Expression a? denotes________________________________ ( )
a. Set of all string of one (or) more number of a’s
b. Set of all string of zero (or) more number of a’s
c. Set of all string of two consecutive a’s
d. Set of all string of zero (or) one number of a’s
42. The Regular Expression a* denotes________________________________ ( )
a. Set of all string of one (or) more number of a’s
SHORT QUESTIONS
UNIT –I
INTRODUCTION TO COMPILING & LEXICAL ANALYSIS
S. No | Short Question | CO Addressing | Blooms Level | Marks
1 | Define Compiler? | 1 | 1 | 2
2 | Illustrate the cousins of compiler | 1 | 1 | 2
3 | What is the role of preprocessors | 1 | 1 | 2
4 | Define Assembler? | 1 | 1 | 2
5 | Illustrate the role of Loader in program compilation | 1 | 1 | 2
6 | Illustrate the role of Linker in program compilation | 1 | 1 | 2
7 | Define Interpreter? | 1 | 1 | 2
8 | Compare the Compiler and Interpreter | 1 | 1 | 2
9 | List the phases of compiler | 1 | 1 | 2
10 | Define Scanner | 1 | 1 | 2
11 | Define Parser | 1 | 1 | 2
12 | Define Symbol table? | 1 | 1 | 2
13 | List the phases of the synthesis part of the compiler | 1 | 1 | 2
14 | List the phases of the analysis part of the compiler | 1 | 1 | 2
15 | What is the role of semantic analyzer? | 1 | 1 | 2
16 | Differentiate Scanner and Parser | 1 | 2 | 2
17 | Define the two main parts of compilation? | 1 | 1 | 2
18 | What is the role of error Handler? | 1 | 1 | 2
19 | What are the tools used to construct scanner and parser | 1 | 1 | 2
20 | What is pass and phase? | 1 | 1 | 2
21 | List the compiler writing tools. | 1 | 1 | 2
22 | Grammars are used to create parse trees. Justify whether the above statement is true or false. Why? | 1 | 2 | 2
23 | How does the symbol table interact with the Lexical analyzer? | 1 | 2 | 2
24 | What is the role of Lexical analyzer | 1 | 1 | 2
25 | Define regular expression? Give an example | 1 | 2 | 2
26 | What is lexeme? | 1 | 1 | 2
27 | What is token? | 1 | 1 | 2
28 | Classify tokens in the expression int a,b; | 1 | 2 | 2
29 | Identify the relation among Token, Pattern and Lexeme | 1 | 1 | 2
30 | Write a regular expression for a floating point number | 1 | 2 | 2
31 | Justify why the lexical analyzer strips out some tokens and statements | 1 | 2 | 2
32 | What are Lexical errors? Give an example | 1 | 2 | 2
33 | Describe the possible strings for the following regular expressions: i) a(a|b)*a ii) (a|b)*a(a|b)(a|b) | 1 | 2 | 2
34 | Lexeme is a sequence of characters and Token is the output of the Lexical analyzer. Justify the above statement. | 1 | 2 | 2
35 | Construct NFA for (a|b)*ab | 1 | 2 | 2
36 | Determine whether the following regular expressions derive the same strings or not: (ab)* and a*b* | 1 | 2 | 2
37 | What is meant by Kleene Closure? Give an example. | 1 | 2 | 2
4. What are the two main parts of compilation? What do they perform?
The two main parts are
• Analysis part breaks up the source program into constituent pieces and creates
an intermediate representation of the source program.
• Synthesis part constructs the desired target program from the intermediate
representation
19. What is the need for separating the analysis phase into lexical analysis and parsing?
(Or) What are the issues of lexical analyzer?
Simpler design is perhaps the most important consideration. The separation of lexical
analysis from syntax analysis often allows us to simplify one or the other of these
phases.
Compiler efficiency is improved.
Compiler portability is enhanced.
23. What is a regular expression? State the rules that define a regular expression.
Regular expression is a method to describe a regular language.
Rules:
1) ε is a regular expression that denotes {ε}, that is, the set containing the empty string
2) If a is a symbol in Σ, then a is a regular expression that denotes {a}
3) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting L(r) U L(s)
b) (r)(s) is a regular expression denoting L(r)L(s)
c) (r)* is a regular expression denoting (L(r))*
d) (r) is a regular expression denoting L(r)
LONG QUESTIONS
S. No | Long Question | CO Addressing | Blooms Level | Marks
1 | Define compiler? Explain the various phases of a compiler in detail | 1 | 2 | 10
2 | Explain compiler writing tools in detail | 1 | 2 | 5
3 | What is a regular expression? Explain the different operators used in the construction of regular expressions with examples | 1 | 3 | 5
4 | Explain, with the example statement a := b*c-d, all the phases of a compiler | 1 | 3 | 10
5 | a. Explain cousins of a Compiler b. Describe how various phases could be combined as a pass in a compiler | 1 | 3 | 10
6 | Explain the role of the Lexical Analyzer in detail with an example source code | 1 | 2 | 10
7 | Explain in detail the Language for specifying lexical Analyzers | 1 | 2 | 10
8 | Construct DFA for the given regular expression (a|b)*abb | 1 | 4 | 10
9 | Explain the general format of a LEX program with an example | 1 | 2 | 5
10 | Explain how the following statement will be translated in every phase: a := b+c*60 | 1 | 3 | 10
11 | Explain the phases of a compiler, and how the following statement will be translated in every phase. | 1 | 2 | 10
1. The tokens in the C statement printf("i = %d, &i = %x", i, &i); are:
1. printf
2. (
3. "i = %d, &i = %x"
4. ,
5. i
6. ,
7. &
8. i
9. )
10. ;
2. A lexical analyzer uses the following patterns to recognize three tokens T1, T2, and
T3 over the alphabet {a,b,c}.
T1: a?(b∣c)*a
T2: b?(a∣c)*b
T3: c?(b∣a)*c
Note that ‘x?’ means 0 or 1 occurrence of the symbol x. Note also that the analyzer
outputs the token that matches the longest possible prefix.
If the string bbaacabc is processed by the analyzer, which one of the following is the sequence of tokens it outputs?
(A) T1T2T3
(B) T1T1T3
(C) T2T1T3
(D) T3T3
Answer: (D)
Explanation: 0 or 1 occurrence of the symbol x.
T1 : (b+c)* a + a(b+c)* a
T2 : (a+c)* b + b(a+c)* b
T3 : (b+a)* c + c(b+a)* c
Given String : bbaacabc
The longest matching prefix is "bbaac" (which can be generated by T3).
The remaining part (after the prefix), "abc", can also be generated by T3.
So, the answer is T3T3
3. In a compiler, keywords of a language are recognized during ( )
A. parsing of the program
B. the code generation
C. the lexical analysis of the program
D. dataflow analysis
Answer: (C)
Explanation: Lexical analysis is the process of converting a sequence of characters into a
sequence of tokens. A token can be a keyword.
4. The lexical analysis for a modern computer language such as Java needs the power of
which one of the following machine models in a necessary and sufficient sense?
A. Finite state automata
B. Deterministic pushdown automata
C. Non-Deterministic pushdown automata
D. Turing Machine
Answer (A)
Explanation: Lexical analysis is the first step in compilation. In lexical analysis, program is
divided into tokens. Lexical analyzers are typically based on finite state automata. Tokens
can typically be expressed as different regular expressions:
An identifier is given by [a-zA-Z][a-zA-Z0-9]*
The keyword if is given by if.
Integers are given by [+-]?[0-9]+.
Answer: (B)
Explanation: Type checking is done at semantic analysis phase and parsing is done at
syntax analysis phase. And we know Syntax analysis phase comes before semantic
analysis. So Option (B) is False.
UNIT II
Syntax Analysis: The role of a parser, Context-free grammars, writing a grammar, Parsing,
Ambiguous grammar, Elimination of Ambiguity, Classification of parsing techniques
Top down parsing: Back Tracking, Recursive Descent parsing, FIRST ( ) and FOLLOW ( )
- LL Grammars, Non-Recursive descent parsing, Error recovery in predictive parsing.
2. SYNTAX ANALYSIS
Syntax analysis is the second phase of the compiler. It gets the input from the tokens
and generates a syntax tree or parse tree.
Advantages of grammar for syntactic specification:
1. A grammar gives a precise and easy-to-understand syntactic specification of a
programming language.
2. An efficient parser can be constructed automatically from a properly designed
grammar.
3. A grammar imparts a structure to a source program that is useful for its translation
into object code and for the detection of errors.
4. New constructs can be added to a language more easily when there is a grammatical
description of the language.
A parser for a grammar is a program that takes as input a string w (a sequence of tokens obtained from the lexical analyzer) and produces as output either a parse tree for w, if w is a valid sentence of the grammar, or an error message indicating that w is not a valid sentence of the given grammar. The goal of the parser is to determine the syntactic validity of a source string. If the string is valid, a tree is built for use by the subsequent phases of the compiler. The tree reflects the sequence of derivations or reductions used during parsing; hence, it is called a parse tree. If the string is invalid, the parser has to issue a diagnostic message identifying the nature and cause of the errors in the string. Every elementary subtree in the parse tree corresponds to a production of the grammar.
There are two ways of identifying an elementary sub tree:
1. By deriving a string from a non-terminal or
2. By reducing a string of symbols to a non-terminal.
The two types of parsers employed are:
a. Top down parser: builds parse trees from the top (root) to the bottom (leaves)
b. Bottom up parser: builds parse trees from the leaves and works up to the root
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer
and verifies that the string can be generated by the grammar for the source language. It
reports any syntax errors in the program. It also recovers from commonly occurring errors so
that it can continue processing its input.
Error productions:
The parser is constructed using an augmented grammar with error productions. If an error production is used by the parser, appropriate error diagnostics can be generated to indicate the erroneous construct recognized in the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to
find a parse tree for a string y, such that the number of insertions, deletions and changes of
tokens is as small as possible. However, these methods are in general too costly in terms of
time and space.
Terminals: These are the basic symbols from which strings are formed.
Non-Terminals: These are the syntactic variables that denote a set of strings. These help to
define the language generated by the grammar.
Start Symbol: One non-terminal in the grammar is denoted as the “Start-symbol” and the
set of strings it denotes is the language defined by the grammar.
Productions: It specifies the manner in which terminals and non-terminals can be
combined to form strings. Each production consists of a non-terminal, followed by an arrow,
followed by a string of non-terminals and terminals.
The language generated by a grammar G, denoted L(G), is a context-free language. Two grammars G1 and G2 are equivalent if they produce the same language.
Consider a derivation of the form S ⇒* α. If α contains non-terminals, it is called a sentential form of G. If α does not contain non-terminals, it is called a sentence of G.
Derivation is a process that generates a valid string with the help of grammar by replacing the
non-terminals on the left with the string on the right side of the production.
Example:
Consider the following grammar for arithmetic expressions:
E→E+E|E*E|(E)|-E| id
To generate the valid string -(id+id) from the grammar, the steps are:
1. E ⇒ -E
2. ⇒ -(E)
3. ⇒ -(E+E)
4. ⇒ -(id+E)
5. ⇒ -(id+id)
In the above derivation, E is the start symbol and -(id+id) is the required sentence (consisting only of terminals). Strings such as E, -E, -(E), . . . are called sentential forms.
Types of derivations:
The two types of derivation are:
1. Left most derivation
2. Right most derivation.
• In leftmost derivations, the leftmost non-terminal in each sentential form is always chosen first for replacement.
• In rightmost derivations, the rightmost non-terminal in each sentential form is always chosen first for replacement.
Example:
Given grammar G : E → E+E | E*E | ( E ) | - E | id Sentence to be derived : - (id+id)
Left Most Derivation
E ⇒ -E
⇒ -(E)
⇒ -(E+E)
⇒ -(id+E)
⇒ -(id+id)
PARSE TREE
• Inner nodes of a parse tree are non-terminal symbols
• The leaves of a parse tree are terminal symbols
• A parse tree can be seen as a graphical representation of a derivation
A grammar that produces more than one parse for some sentence is said to be ambiguous
grammar.
Example:
Given grammar G: E → E+E | E*E | (E) | - E | id
The sentence id+ id* id has the following two distinct leftmost derivations:
First leftmost derivation:  E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
Second leftmost derivation: E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
The two corresponding trees are,
Regular Expression
It is used to describe the tokens of programming languages.
It is used to check whether the given input is valid or not using a transition diagram.
The transition diagram has set of states and edges.
It has no start symbol.
It is useful for describing the structure of lexical constructs such as identifiers, constants,
keywords, and so forth.
Each parsing method can handle grammars only of a certain form hence, the initial grammar
may have to be rewritten to make it parsable.
Reasons for using the regular expression to define the lexical syntax of a language
• The lexical rules of a language are simple and RE is used to describe them
• Regular expressions provide a more concise and easier to understand notation
for tokens than grammars
• Efficient lexical analyzers can be constructed automatically from RE than from
grammars
• Separating the syntactic structure of a language into lexical and non lexical parts
provides a convenient way of modularizing the front end into two manageable-sized
components
Ambiguity of a grammar, which produces more than one parse tree for some leftmost or rightmost derivation, can be eliminated by rewriting the grammar.
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use precedence of
operators as follows:
^ (right to left)
/,* (left to right)
-,+ (left to right)
We get the following unambiguous grammar:
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
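As a worked check of this grammar, the string id+id*id now has only one leftmost derivation:
E ⇒ E+T ⇒ T+T ⇒ F+T ⇒ G+T ⇒ id+T ⇒ id+T*F ⇒ id+F*F ⇒ id+G*F ⇒ id+id*F ⇒ id+id*G ⇒ id+id*id
so * is forced to bind more tightly than +, and the ambiguity of the original grammar disappears.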
Consider this example, G: stmt → if expr then stmt | if expr then stmt else stmt | other
This grammar is ambiguous, since the string if E1 then if E2 then S1 else S2 has the following two parse trees for leftmost derivation:
Left factoring:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing. When it is not clear which of two alternative productions to use to expand
a non-terminal A, we can rewrite the A-productions to defer the decision until we have seen
enough of the input to make the right choice.
If there is any production A → αβ1 | αβ2 , it can be rewritten as
A → αA’
A’ → β1 | β2
Consider the grammar, G: S → iEtS | iEtSeS | a
E→b
Left factored, this grammar becomes
S → iEtSS’ | a
S’ → eS | ε
E→b
2.6. PARSING
• Top-down parsing: A parser can start with the start symbol and try to transform it to
the input string. Example: LL Parsers.
• Bottom-up parsing: A parser can start with input and attempt to rewrite it into the start
symbol. Example: LR Parsers.
• This parsing method may involve backtracking, that is, making repeated scans of
the input
Step2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second
symbol of w ‘a’ and consider the next leaf ‘A’. Expand A using the first alternative.
Step3:
The second symbol ‘a’ of w also matches with second leaf of tree. So advance the input
pointer to third symbol of w ‘d’. But the third leaf of tree is b which does not match with the
input symbol d. Hence discard the chosen production and reset the pointer to second position.
This is called backtracking.
Step4:
Now try the second alternative for A.
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating the left-recursion the grammar becomes,
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Now we can write the procedure for grammar as follows:
Recursive procedure:
Procedure E()
begin
T( );
EPRIME( );
End
Procedure EPRIME( )
begin
  if input_symbol = '+' then
  begin
    ADVANCE( );
    T( );
    EPRIME( );
  end
end
Procedure T( )
begin
F( );
TPRIME( );
End
Procedure TPRIME( )
begin
  if input_symbol = '*' then
  begin
    ADVANCE( );
    F( );
    TPRIME( );
  end
end
Procedure F( )
begin
  if input_symbol = 'id' then
    ADVANCE( );
  else if input_symbol = '(' then
  begin
    ADVANCE( );
    E( );
    if input_symbol = ')' then
      ADVANCE( );
    else
      ERROR( );
  end
  else
    ERROR( );
end
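For comparison, a compact C sketch of the same recursive-descent parser; the single-character token 'i' standing for id, the lookahead variable, and the error handling are assumptions for illustration:

#include <stdio.h>
#include <stdlib.h>

/* Grammar: E -> T E'; E' -> + T E' | ε; T -> F T'; T' -> * F T' | ε; F -> ( E ) | i */
static const char *input = "i+i*i";
static char lookahead;

static void error(void) { printf("syntax error\n"); exit(1); }
static void match(char t) { if (lookahead == t) lookahead = *++input; else error(); }

static void E(void), Eprime(void), T(void), Tprime(void), F(void);

static void E(void)      { T(); Eprime(); }
static void Eprime(void) { if (lookahead == '+') { match('+'); T(); Eprime(); } }   /* else ε */
static void T(void)      { F(); Tprime(); }
static void Tprime(void) { if (lookahead == '*') { match('*'); F(); Tprime(); } }   /* else ε */
static void F(void) {
    if (lookahead == 'i') match('i');
    else if (lookahead == '(') { match('('); E(); match(')'); }
    else error();
}

int main(void) {
    lookahead = *input;            /* prime the lookahead */
    E();
    if (lookahead == '\0') printf("accepted\n");
    else error();
    return 0;
}

On the input i+i*i the parser prints accepted, mirroring the trace of the procedures shown below.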
Trace of the recursive procedures on the input string id+id*id:
PROCEDURE INPUT STRING
E( ) id+id*id
T( ) id+id*id
F( ) id+id*id
ADVANCE( ) id+id*id
TPRIME( ) id+id*id
EPRIME( ) id+id*id
ADVANCE( ) id+id*id
T( ) id+id*id
F( ) id+id*id
ADVANCE( ) id+id*id
TPRIME( ) id+id*id
ADVANCE( ) id+id*id
F( ) id+id*id
ADVANCE( ) id+id*id
TPRIME( ) id+id*id
Predictive Parsing
The parser is controlled by a program that considers X, the symbol on top of stack, and a, the
current input symbol. These two symbols determine the parser action. There are three
possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next
input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing table M. This entry will either be an X-production of the grammar or an error entry.
If M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU, with U on top.
If M[X, a] = error, the parser calls an error recovery routine.
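A hedged C sketch of this driver loop for the expression grammar that follows; the stack layout, the single letters N and U standing for E' and T', and the function M encoding the parsing table are all assumptions for illustration:

#include <stdio.h>
#include <string.h>

/* M(X, a) returns the right side of the production in table entry M[X, a],
   "" for an ε-production, or NULL for an error entry.
   Non-terminals: E, N (= E'), T, U (= T'), F; terminals: i + * ( ) $.  */
static const char *M(char X, char a) {
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "TN" : NULL;
    case 'N': return a == '+' ? "+TN" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "FU" : NULL;
    case 'U': return a == '*' ? "*FU" : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return a == 'i' ? "i" : (a == '(') ? "(E)" : NULL;
    default:  return NULL;
    }
}

int parse(const char *w) {                     /* w must end with '$' */
    char stack[100] = "$E";                    /* start symbol above the end marker */
    int top = 1;
    for (;;) {
        char X = stack[top], a = *w;
        if (X == '$' && a == '$') return 1;    /* case 1: successful completion */
        if (X == a) { top--; w++; continue; }  /* case 2: pop X and advance input */
        const char *rhs = strchr("ENTUF", X) ? M(X, a) : NULL;   /* case 3 */
        if (rhs == NULL) return 0;             /* error entry (or terminal mismatch) */
        top--;                                 /* pop X, then push the right side reversed */
        for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
            stack[++top] = rhs[i];
    }
}

int main(void) {
    printf("%d %d\n", parse("i+i*i$"), parse("i+*i$"));   /* prints 1 0 */
    return 0;
}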
Example:
Consider the following grammar:
E→E+T|T
T→T*F|F
F→(E)|id
After eliminating left-recursion the grammar is
E →TE’
E’ → +TE’ | ε
T →FT’
T’ → *FT’ | ε
F → (E)|id
First( ):
FIRST(E) = { ( , id}
FIRST(E’) ={+ , ε }
FIRST(T) = { ( , id}
FIRST(T’) = {*, ε }
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }
Stack Implementation
The parsing table entries are single entries. So each location has not more than one entry.
This type of grammar is called LL(1) grammar.
Consider this following grammar:
S→iEtS | iEtSeS| a
E→b
After left factoring, we have
S→iEtSS’|a
S’→ eS | ε
E→b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e, ε }
FIRST(E) = { b}
FOLLOW(S) = { $ ,e }
FOLLOW(S’) = { $ ,e }
FOLLOW(E) = {t}
Parsing table:
Since the entry M[S’, e] contains more than one production (S’ → eS and S’ → ε), the grammar is not an LL(1) grammar.
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguous grammar
2. Construct FIRST() and FOLLOW() for all non-terminals
3. Construct predictive parsing table
4. Parse the given input string using stack and parsing table
• An error is detected during the predictive parsing when the terminal on top of the
stack does not match the next input symbol, or when nonterminal A on top of the
stack, a is the next input symbol, and parsing table entry M[A,a] is empty
Panic-mode error recovery
• Panic-mode error recovery is based on the idea of skipping symbols on the input until a token in a selected set of synchronizing tokens appears
Fig. 2.14.2. Parsing and error recovery moves made by a predictive parser
The above discussion of panic-mode recovery does not address the important issue of error messages. The compiler designer must supply informative error messages that not only describe the error but also draw attention to where the error was discovered.
Phrase - level Recovery
Phrase-level error recovery is implemented by filling in the blank entries in the predictive
parsing table with pointers to error routines. These routines may change, insert, or delete
symbols on the input and issue appropriate error messages. They may also pop from the
stack. Alteration of stack symbols or the pushing of new symbols onto the stack is
questionable for several reasons. First, the steps carried out by the parser might then not
correspond to the derivation of any word in the language at all. Second, we must ensure that
there is no possibility of an infinite loop. Checking that any recovery action eventually results
in an input symbol being consumed (or the stack being shortened if the end of the input has
been reached) is a good way to protect against such loops.
SOLVED PROBLEMS
Problem 1
Table-based LL(1) Predictive Top-Down Parsing. Consider the following CFG G =
(N={S, A, B, C, D}, T={a,b,c,d}, P, S) where the set of productions P is given below:
S→A
A → BC | DBC
B → Bb | ε
C→c|ε
D→a|d
Answers:
i. No because it is left-recursive. You can expand B using a production with B as the left-
most symbol without consuming any of the input terminal symbols. To eliminate this left
recursion we add another non-terminal symbol, B’ and productions as follows:
S→A
A → BC | DBC
B → bB’ | ε
B’ → bB’ | ε
C→c|ε
D→a|d
ii. FIRST(S) = { a, b, c, d, ε }
FOLLOW(S) = { $ }
FIRST(A) = { a, b, c, d, ε }
FOLLOW(A) = { $ }
FIRST(B) = { b, ε }
FOLLOW(B) = { c, $ }
FIRST(B’) = { b, ε }
FOLLOW(B’) ={ c, $ }
FIRST(C) = { c, ε }
FOLLOW(C) = { $ }
FIRST(D) = { a, d }
FOLLOW(D) = { b, c, $ }
Non-terminals A, B, B’, C and S are all nullable.
iii. The stack and input are as shown below using the predictive, table-driven parsing
algorithm:
Problem 2
Eliminate left recursion and perform left factoring for the following grammar. S → ( )
S→a
S→(A)
A→S
A→A,S
Answer
S → ( S’ | a
S’ → ) | A )
A → S A’
A’ → , S A’ | ε
Problem 3
Eliminate immediate left recursion for the following grammar
E → E+T | T
T → T*F | F
F → (E) | id
Answer
The rule to eliminate left recursion: A → Aα | β can be converted to A → βA’ and A’ → αA’ | ε. So, the grammar after eliminating left recursion is
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
SHORT QUESTIONS
SYNTAX ANALYSIS & TOP DOWN PARSING
S. No | Short Question | CO Addressing | Blooms Level | Marks
1 | What is the role of parser | 1 | 1 | 2
2 | Define Context-free grammars | 1 | 1 | 2
3 | What is Left Recursion? | 2 | 1 | 2
4 | Define Backtracking | 2 | 1 | 2
5 | Explain the procedure of eliminating Left Recursion from a grammar | 2 | 2 | 2
6 | Eliminate the left recursion for the given grammar: A → Aa | Aad | bd | 2 | 2 | 2
1. What is the output of the syntax analysis phase? What are the three general types of parsers for grammars?
A parse tree is the output of the syntax analysis phase. General types of parsers:
1) Universal parsing
2) Top-down
3) Bottom-up
2. What are the different strategies that a parser can employ to recover from a
syntactic error?
• Panic mode
• Phrase level
• Error productions
• Global correction
On discovering an error, a parser may perform local correction on the remaining input;
that is, it may replace a prefix of the remaining input by some string that allows the
parser to continue. This is known as phrase level error recovery.
6. Define leftmost and rightmost derivations, parse trees, and ambiguous grammars.
• Derivations in which only the leftmost nonterminal in any sentential form is replaced
at each step are termed leftmost derivations
• Derivations in which the rightmost nonterminal in any sentential form is replaced at
each step are termed rightmost (or canonical) derivations.
A parse tree may be viewed as a graphical representation for a derivation that filters out
the choice regarding replacement order. Each interior node of a parse tree is labeled by
some nonterminal A and that the children of the node are labeled from left to right by
symbols in the right side of the production by which this A was replaced in the
derivation. The leaves of the parse tree are terminal symbols.
• A grammar that produces more than one parse tree for some sentence is said to be
ambiguous
• An ambiguous grammar is one that produces more than one leftmost or rightmost
derivation for the same sentence
Example: E → E+E | E*E | id
11. Why do we use regular expressions to define the lexical syntax of a language?
i. The lexical rules of a language are frequently quite simple, and to describe them we
do not need a notation as powerful as grammars
ii. Regular expressions generally provide a more concise and easier to understand
notation for tokens than grammars
iii. More efficient lexical analyzers can be constructed automatically from regular
expressions than from arbitrary grammars
iv. Separating the syntactic structure of a language into lexical and non lexical parts
provides a convenient way of modularizing the front end of a compiler into two
manageable-sized components
12. How does a top-down parser construct a parse tree?
Starting with the root, labeled with the starting nonterminal, top-down construction of a
parse tree repeatedly performs the following steps:
i. At node n, labeled with non terminal “A”, select one of the productions for “A” and
construct children at n for the symbols on the right side of the production
ii. Find the next node at which a sub tree is to be constructed
Recursive Descent Parsing is top down method of syntax analysis in which we execute a
set of recursive procedures to process the input. A procedure is associated with each
nonterminal of a grammar.
LONG QUESTIONS
(CO addressed, Blooms level and marks are shown in brackets after each question.)

1. Construct the LL(1) table for the given grammar: [CO 2, Blooms 3, 10 marks]
   S → ABC
   A → aA | C
   B → b
   C → c
2. Consider the given grammar E → E+E | E-E | E*E | E/E | a | b. Obtain the leftmost and rightmost derivations for the string a+b*a+b. [CO 1, Blooms 3, 5 marks]
3. Define ambiguous grammar. Test whether the following grammar is ambiguous or not: [CO 1, Blooms 3, 5 marks]
   E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id
4. Write a non-recursive descent parser for the grammar, where or, and, not, (, ), true, false are terminals of the grammar: [CO 2, Blooms 4, 10 marks]
   bexpr → bexpr or bterm | bterm
   bterm → bterm and bfactor | bfactor
   bfactor → not bfactor | (bexpr) | true | false
5. Check whether the following grammar is an LL(1) grammar or not: [CO 2, Blooms 4, 10 marks]
   S → iEtS | iEtSeS | a
   E → b
6. Explain the FIRST( ) and FOLLOW( ) techniques with a suitable example. [CO 2, Blooms 2, 10 marks]
7. Construct the predictive parser for the following grammar and show whether the grammar is LL(1) or not: [CO 2, Blooms 3, 10 marks]
   S → (L) | a
   L → L,S | S
8. Consider the grammar S → (L) | a; L → L,S | S. [CO 2, Blooms 3, 10 marks]
   i) What are the terminals, non-terminals and start symbol?
   ii) Construct leftmost and rightmost derivations for the string (a, (a, a)).
   iii) Construct FIRST( ) and FOLLOW( ).
9. Eliminate ambiguity from the given grammar, construct FIRST( ) and FOLLOW( ), and check whether the grammar is LL(1) or not: [CO 2, Blooms 4, 10 marks]
   E → E + T | T
   T → T * F | F
   F → ( E ) | id
10. Construct a predictive parsing table for the grammar and parse the string id+id*id: [CO 2, Blooms 4, 10 marks]
   E → E + T | T
   T → T * F | F
   F → ( E ) | id
11. Construct the non-recursive predictive parser for the following grammar and show whether the grammar is LL(1) or not: [CO 2, Blooms 4, 10 marks]
   S → AA
   A → aA | b
12. Which of the following grammars generates the language L = { a^i b^j | i ≠ j }?
a. S →AC | CB
C → aC | a | b
A → aA | ε
B →Bb | ε
b. S → aS | Sb | a | b
c. S →AC | CB
C → aCb
A → aA | ε
B → Bb | ε
d. S →AC | CB
C → aCb | ε
A → aA | a
B → Bb | b
SOLUTION
• Language L contains the strings {abb, aab, abbb, aabbb, aaabb, aa, bb, ...}; i.e., all
a's appear before b's in a string, and the number of a's is not equal to the number of
b's, so i ≠ j.
• Grammars a, b and c also generate the string "ab", where i = j, and many more
strings with i = j; hence these grammars do not generate the language L, because for
a string that belongs to L the exponent i must not equal the exponent j.
• Grammar d: this grammar never generates a string with an equal number of a's and
b's, i.e. i = j. Hence this grammar generates the language L.
• Hence (d) is the correct option.
Match all items in Group 1 with the correct option from Group 2:
Group 1                 | Group 2
P. Regular expression   | 1. Syntax analysis
Q. Pushdown automata    | 2. Code generation
R. Dataflow analysis    | 3. Lexical analysis
S. Register allocation  | 4. Code optimization
a. P-4, Q-1, R-2, S-3    b. P-3, Q-1, R-4, S-2
c. P-3, Q-4, R-1, S-2    d. P-2, Q-1, R-4, S-3
Answer: (b). Regular expressions specify tokens (lexical analysis), pushdown automata recognize context-free syntax (syntax analysis), dataflow analysis drives code optimization, and register allocation is part of code generation.
B→e
c. S → Aa | B | ε
A → Bd | Sc | ε
B→d
d. S →Aa |Bb | c
A → bd | ε
B → Ae | ε
10. Which of the following derivations does a top-down parser use while parsing an
input string? The input is assumed to be scanned in left to right order
(a) Leftmost derivation
(b) Leftmost derivation traced out in reverse
(c) Rightmost derivation
(d) Rightmost derivation traced out in reverse
Answer (a)
Top-down parsing (LL)
In top-down parsing, we start with the start symbol and compare the right sides of the
different productions against the first piece of input to see which of the productions should
be used. A top-down parser is called an LL parser because it parses the input from Left to
right and constructs a Leftmost derivation of the sentence.
12. The grammar A → AA | (A) | ε is not suitable for predictive-parsing because the
grammar is
(A) ambiguous
(B) left-recursive
(C) right-recursive
(D) an operator-grammar
Answer: (A)
Explanation: Since the given grammar can have infinitely many parse trees for the string ε,
the grammar is ambiguous, and A → AA also has left recursion.
For predictive parsing, a grammar should be:
• free from ambiguity
• free from left recursion
• left-factored
The given grammar contains both ambiguity and left recursion, so it cannot have a
predictive parser. Ambiguity is the more fundamental defect here, so option (A) is a
stronger answer than option (B).
13. Which of the following suffices to convert an arbitrary CFG to an LL(1)
grammar?
(A) Removing left recursion alone
(B) Factoring the grammar alone
(C) Removing left recursion and factoring the grammar
(D) None of these
Answer: (D)
Explanation: Removing left recursion and factoring the grammar do not suffice to convert an
arbitrary CFG to LL(1) grammar.
Consider the grammar S → iEtSS' | a, S' → eS | ε, E → b. In its predictive parse table M,
the entries M[S’, e] and M[S’, $] respectively are
(A) {S’ → e S} and {S’ → e}
(B) {S’ → e S} and {}
(C) {S’ → ε} and {S’ → ε}
(D) {S’ → e S, S’→ ε} and {S’ → ε}
Answer: (D)
Explanation: Here representing the parsing table as M[ X , Y ], where X represents rows(
Non terminals) and Y represents columns(terminals).
Here are the rules to fill the parsing table.
For each distinct production rule A->α, of the grammar, we need to apply the given rules:
Rule 1: if A –> α is a production, for each terminal ‘a’ in FIRST(α), add A–>α to M[ A , a ]
Rule 2: if ‘ ε ‘ is in FIRST(α), add A –> α to M [ A , b ] for each ‘b’ in FOLLOW(A).
As Entries have been asked corresponding to Non-Terminal S’, hence we only need to
consider its productions to get the answer.
For S’ → eS, according to rule 1, this production rule should be placed at the entry M[ S’,
FIRST(eS) ], and from the given grammar, FIRST(eS) ={e}, hence S’->eS is placed in the
parsing table at entry M[S’ , e].
Similarly,
For S' → ε, as FIRST(ε) = {ε}, rule 2 applies; therefore this production rule should be
placed in the parsing table at the entries M[S', FOLLOW(S')], and FOLLOW(S') =
FOLLOW(S) = {e, $}. Hence S' → ε is placed at entries M[S', e] and M[S', $].
Therefore Answer is option D.
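Rules 1 and 2 can be exercised on this grammar directly. The sketch below is a minimal illustration: the grammar encoding, the eps marker, and the FOLLOW sets (hard-coded from the discussion above) are assumptions, and the multiply-defined entry M[S', e] it prints is exactly the conflict behind answer (D).

grammar = {
    'S':  [['i', 'E', 't', 'S', "S'"], ['a']],
    "S'": [['e', 'S'], ['eps']],
    'E':  [['b']],
}
follow = {'S': {'e', '$'}, "S'": {'e', '$'}, 'E': {'t'}}   # from the text above

def first_seq(seq):
    """FIRST of a sentential form (this grammar has no left recursion)."""
    if not seq or seq == ['eps']:
        return {'eps'}
    head = seq[0]
    if head not in grammar:                 # terminal
        return {head}
    out = set()
    for rhs in grammar[head]:
        out |= first_seq(rhs)
    if 'eps' in out:
        out = (out - {'eps'}) | first_seq(seq[1:])
    return out

table = {}
for a, rhss in grammar.items():
    for rhs in rhss:
        fs = first_seq(rhs)
        for t in fs - {'eps'}:              # Rule 1: add A -> alpha to M[A, a]
            table.setdefault((a, t), []).append(rhs)
        if 'eps' in fs:                     # Rule 2: add to M[A, b] for b in FOLLOW(A)
            for t in follow[a]:
                table.setdefault((a, t), []).append(rhs)

print(table[("S'", 'e')])   # [['e', 'S'], ['eps']] -- two entries: not LL(1)
print(table[("S'", '$')])   # [['eps']]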
UNIT III
Bottom Up parsing: SR parsing, Operator Precedence Parsing, LR grammars, LR Parsers –
Model of an LR Parsers, SLR parsing, CLR parsing, LALR parsing, Error recovery in LR
Parsing, handling ambiguous grammars.
3. BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going towards
the root is called bottom-up parsing. A general type of bottom-up parser is a shift-reduce
parser.
Handles:
A handle of a string is a substring that matches the right side of a production, and whose
reduction to the non-terminal on the left side of the production represents one step along the
reverse of a rightmost derivation.
Example:
Consider the grammar:
E→E+E
E→E*E
E→(E)
E→id
And the input string id1+id2*id3
1. Shift-Reduce Conflict:
Example:
Consider the grammar:
E→E+E | E*E | id and input id+id*id
2. Reduce-reduce conflict:
Consider the grammar:
M → R+R | R+c | R
R→c
and input c+c
Viable prefixes:
• α is a viable prefix of the grammar if there is a w such that αw is a right sentential form.
• The set of prefixes of right sentential forms that can appear on the stack of a shift-
reduce parser are called viable prefixes.
• The set of viable prefixes is a regular language.
Problem-01:
Parse the input string id – id x id using a shift-reduce parser for the grammar E → E – E | E x E | id.
Solution-
Stack | Input | Action
$ | id – id x id $ | Shift
$ id | – id x id $ | Reduce E → id
$ E | – id x id $ | Shift
$ E – | id x id $ | Shift
$ E – id | x id $ | Reduce E → id
$ E – E | x id $ | Shift
$ E – E x | id $ | Shift
$ E – E x id | $ | Reduce E → id
$ E – E x E | $ | Reduce E → E x E
$ E – E | $ | Reduce E → E – E
$ E | $ | Accept
Problem-02:
Consider the following grammar-
S → ( L ) | a
L → L , S | S
Parse the input string ( a , ( a , a ) ) using a shift-reduce parser.
Solution-
Stack | Input | Action
$ | ( a , ( a , a ) ) $ | Shift
$ ( | a , ( a , a ) ) $ | Shift
$ ( a | , ( a , a ) ) $ | Reduce S → a
$ ( S | , ( a , a ) ) $ | Reduce L → S
$ ( L | , ( a , a ) ) $ | Shift
$ ( L , | ( a , a ) ) $ | Shift
$ ( L , ( | a , a ) ) $ | Shift
$ ( L , ( a | , a ) ) $ | Reduce S → a
$ ( L , ( S | , a ) ) $ | Reduce L → S
$ ( L , ( L | , a ) ) $ | Shift
$ ( L , ( L , | a ) ) $ | Shift
$ ( L , ( L , a | ) ) $ | Reduce S → a
$ ( L , ( L , S | ) ) $ | Reduce L → L , S
$ ( L , ( L | ) ) $ | Shift
$ ( L , ( L ) | ) $ | Reduce S → ( L )
$ ( L , S | ) $ | Reduce L → L , S
$ ( L | ) $ | Shift
$ ( L ) | $ | Reduce S → ( L )
$ S | $ | Accept
Problem-03:
Consider the following grammar-
S → T L ;
T → int | float
L → L , id | id
Parse the input string int id , id ; using a shift-reduce parser.
Solution-
Stack | Input | Action
$ | int id , id ; $ | Shift
$ int | id , id ; $ | Reduce T → int
$ T | id , id ; $ | Shift
$ T id | , id ; $ | Reduce L → id
$ T L | , id ; $ | Shift
$ T L , | id ; $ | Shift
$ T L , id | ; $ | Reduce L → L , id
$ T L | ; $ | Shift
$ T L ; | $ | Reduce S → T L ;
$ S | $ | Accept
Problem-04:
Considering the string "10201", design a shift-reduce parser for the following grammar-
S → 0 S 0 | 1 S 1 | 2
Solution:
Stack | Input | Action
$ | 1 0 2 0 1 $ | Shift
$ 1 | 0 2 0 1 $ | Shift
$ 1 0 | 2 0 1 $ | Shift
$ 1 0 2 | 0 1 $ | Reduce S → 2
$ 1 0 S | 0 1 $ | Shift
$ 1 0 S 0 | 1 $ | Reduce S → 0 S 0
$ 1 S | 1 $ | Shift
$ 1 S 1 | $ | Reduce S → 1 S 1
$ S | $ | Accept
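A shift-reduce loop for this small grammar can be sketched directly. The greedy reduce-first strategy below (reduce whenever the top of the stack matches some right side) happens to work for S → 0S0 | 1S1 | 2, but it is only an illustration; a real parser makes these decisions from a parsing table.

productions = [('S', ['0', 'S', '0']), ('S', ['1', 'S', '1']), ('S', ['2'])]

def parse(tokens):
    stack, i = [], 0
    while True:
        for lhs, rhs in productions:
            if stack[-len(rhs):] == rhs:          # handle on top of the stack?
                del stack[-len(rhs):]
                stack.append(lhs)
                print('reduce', lhs, '->', ' '.join(rhs), '| stack:', ''.join(stack))
                break
        else:
            if i < len(tokens):                   # no handle: shift the next symbol
                stack.append(tokens[i])
                i += 1
                print('shift        | stack:', ''.join(stack))
            else:
                return stack == ['S']             # accept iff only S remains

print(parse(list('10201')))    # True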
Example 1:
This is an example of an operator grammar:
E → E+E | E*E | id
However, the grammar given below is not an operator grammar, because two non-
terminals are adjacent to each other:
S → SAS | a
A → bSb | b
We can convert it into an operator grammar by substituting for A:
S → SbSbS | SbS | a
A → bSb | b
Operator precedence parser –
An operator precedence parser is a bottom-up parser that interprets an operator-
precedence grammar. This parser is only used for operator grammars. Ambiguous grammars
are not allowed for any parser except the operator precedence parser.
There are two methods for determining what precedence relations should hold between a pair
of terminals:
1. Use the conventional associativity and precedence of the operators.
2. The second method of selecting operator-precedence relations is first to construct an
unambiguous grammar for the language, a grammar that reflects the correct
associativity and precedence in its parse trees.
This parser relies on the following three precedence relations: ⋖, ≐, ⋗
a ⋖ b This means a "yields precedence to" b.
a ⋗ b This means a "takes precedence over" b.
a ≐ b This means a "has the same precedence as" b.
If there are n operators, then the size of the relation table will be n*n and the space
complexity will be O(n²). To reduce the size of the table, an operator precedence function
table is used.
The operator precedence parsers usually do not store the precedence table with the relations;
rather they are implemented in a special way. Operator precedence parsers use precedence
functions that map terminal symbols to integers, and so the precedence relations between the
symbols are implemented by numerical comparison. The parsing table can be encoded by two
precedence functions f and g that map terminal symbols to integers. We select f and g such
that:
1. f(a) < g(b) whenever a yields precedence to b
2. f(a) = g(b) whenever a and b have the same precedence
3. f(a) > g(b) whenever a takes precedence over b
Example – Consider the following grammar:
E → E + E | E * E | ( E ) | id
The directed graph representing the precedence functions:
• fid → g* → f+ → g+ → f$
• gid → f* → g* → f+ → g+ → f$
Since there is no cycle in the graph, we can build the function table.
The size of the table is 2n.
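The numerical comparison itself is trivial once f and g are in hand. In the sketch below the values are one consistent assignment for the terminals id, +, * and $ (derived from path lengths in a graph like the one above); other consistent assignments exist, so treat the numbers as an illustrative assumption.

f = {'id': 4, '+': 2, '*': 4, '$': 0}   # one consistent assignment, not the only one
g = {'id': 5, '+': 1, '*': 3, '$': 0}

def relation(a, b):
    """Two arrays of size n replace the n*n relation table."""
    if f[a] < g[b]:
        return '<.'     # a yields precedence to b: shift
    if f[a] > g[b]:
        return '.>'     # a takes precedence over b: reduce
    return '=.'

print(relation('+', '*'))   # <.  (* binds tighter, so keep shifting)
print(relation('*', '+'))   # .>  (reduce the * handle first)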
One disadvantage of the function table is that even though there are blank entries in the
relation table, the corresponding function-table entries are non-blank. Blank entries signal
errors; hence the error-detection capability of the relation table is greater than that of the
function table. Null (ε-) productions are likewise not allowed in operator grammars.
Advantages –
1. It can easily be constructed by hand
2. It is simple to implement this type of parsing
Disadvantages –
1. It is hard to handle tokens like the minus sign (-), which has two different precedences
(depending on whether it is unary or binary)
2. It is applicable only to small class of grammars
Example 2:
Consider the grammar:
E → EAE | (E) | -E | id
A→+|-|*|/|↑
Since the right side EAE has three consecutive non-terminals, the grammar can be written as
follows: E → E+E | E-E | E*E | E/E | E↑E | -E | id
Operator precedence relations:
There are three disjoint precedence relations, namely:
⋖ – yields precedence
≐ – same precedence
⋗ – takes precedence
The relations give the following meaning:
a ⋖ b – a yields precedence to b
a ≐ b – a has the same precedence as b
a ⋗ b – a takes precedence over b
Rules for binary operators:
1. If operator θ1 has higher precedence than operator θ2, then make
θ1 ⋗ θ2 and θ2 ⋖ θ1
2. If operators θ1 and θ2 are of equal precedence, then make
θ1 ⋗ θ2 and θ2 ⋗ θ1 if the operators are left associative
Example:
Consider the grammar E → E+E | E-E | E*E | E/E | E↑E | (E) | id and the input string
id+id*id. The implementation is as follows:
3.3. LR GRAMMARS
LR grammars are parsed using one of three table-construction methods:
i. SLR(1)
ii. CLR(1)
iii. LALR(1)
3.4. LR PARSERS
An efficient bottom-up syntax analysis technique that can be used for a large class of CFGs
is called LR(k) parsing. The 'L' is for left-to-right scanning of the input, the 'R' for
constructing a rightmost derivation in reverse, and the 'k' for the number of input symbols
of lookahead. When 'k' is omitted, it is assumed to be 1.
Advantages of LR parsing:
• It recognizes virtually all programming language constructs for which CFG can be
written
• It is an efficient non-backtracking shift-reduce parsing method
• The class of grammars that can be parsed using LR methods is a proper superset of
the class of grammars that can be parsed with predictive parsers
• It detects a syntactic error as soon as possible
Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a typical programming-language
grammar. A specialized tool, called an LR parser generator, is needed. Example: YACC.
Types of LR parsing method:
1. SLR- Simple LR
• Easiest to implement, least powerful.
2. CLR- Canonical LR
• Most powerful, most expensive.
3. LALR- Look-Ahead LR
• Intermediate in size and cost between the other two methods.
Action: The parsing program determines sm, the state currently on top of stack, and ai,
the current input symbol. It then consults action[sm,ai] in the action table which can have one
of four values:
1. Shift s, where s is a state,
2. Reduce by a grammar production A → β,
3. Accept,
4. Error.
Goto: The function goto takes a state and grammar symbol as arguments and produces a
state.
LR Parsing algorithm:
Input: An input string w and an LR parsing table with functions action and goto for grammar G.
Output: If w is in L(G), a bottom-up parse for w; otherwise, an error indication.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the
input buffer. The parser then executes the following program:

set ip to point to the first input symbol of w$;
repeat forever begin
    let s be the state on top of the stack and a the symbol pointed to by ip;
    if action[s, a] = shift s' then begin
        push a then s' on top of the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2*|β| symbols off the stack;
        let s' be the state now on top of the stack;
        push A then goto[s', A] on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then return
    else error()
end
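The program above maps almost line for line onto code. Below is a minimal Python sketch of the driver; the tuple encodings of the action and goto tables are assumptions made for illustration, since real tables come from an LR construction (or a generator such as YACC).

def lr_parse(tokens, action, goto_table):
    """action[(state, terminal)] is ('shift', s), ('reduce', A, rhs) or ('accept',)."""
    stack = [0]                                  # the initial state s0
    tokens = list(tokens) + ['$']
    i = 0
    while True:
        s, a = stack[-1], tokens[i]
        act = action.get((s, a))
        if act is None:
            raise SyntaxError(f'no action for state {s} on {a!r}')
        if act[0] == 'shift':
            stack += [a, act[1]]                 # push the symbol, then the state
            i += 1
        elif act[0] == 'reduce':
            lhs, rhs = act[1], act[2]
            if rhs:                              # pop 2*|rhs| entries (none for eps)
                del stack[-2 * len(rhs):]
            stack += [lhs, goto_table[(stack[-1], lhs)]]
            print('output:', lhs, '->', ' '.join(rhs) or 'eps')
        else:                                    # 'accept'
            return True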
Augment Grammar
Augmented grammar G` is generated by adding one extra production to the given
grammar G. It helps the parser to identify when to stop parsing and announce
acceptance of the input.
Example
Given grammar
1. S → AA
2. A → aA | b
The Augment grammar G` is represented by
1. S`→ S
2. S → AA
3. A → aA | b
Canonical Collection of LR(0) items
An LR (0) item is a production G with dot at some position on the right side of the
production. LR(0) items is useful to indicate that how much of the input has been scanned up
to a given point in the process of parsing.
In the LR (0), we place the reduce node in the entire row.
Example
Given grammar:
1. S → AA
2. A → aA | b
Add Augment Production and insert '•' symbol at the first position for every production in G
1. S` → •S
2. S → •AA
3. A → •aA
4. A → •b
I0 State:
Add Augment production to the I0 State and Compute the Closure
I0 = Closure (S` → •S)
Add all productions starting with S in to I0 State because "•" is followed by the non-terminal.
So, the I0 State becomes
I0 = S` → •S
S → •AA
Add all productions starting with "A" in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.
I0= S` → •S
S → •AA
A → •aA
A → •b
I1= Go to (I0, S) = closure (S` → S•) = S` → S•
Here, the Production is reduced so close the State.
I1= S` → S•
I2= Go to (I0, A) = closure (S → A•A)
Add all productions starting with A in to I2 State because "•" is followed by the non-terminal.
So, the I2 State becomes
I2 =S→A•A
A → •aA
A → •b
Go to (I2,a) = Closure (A → a•A) = (same as I3)
Go to (I2, b) = Closure (A → b•) = (same as I4)
I3= Go to (I0,a) = Closure (A → a•A)
Add productions starting with A in I3.
A → a•A
A → •aA
A → •b
Go to (I3, a) = Closure (A → a•A) = (same as I3)
Go to (I3, b) = Closure (A → b•) = (same as I4)
I4= Go to (I0, b) = closure (A → b•) = A → b•
I5= Go to (I2, A) = Closure (S → AA•) = S → AA•
I6= Go to (I3, A) = Closure (A → aA•) = A → aA•
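Closure and Goto, as used in the construction above, can be sketched compactly. An item is encoded here as a (lhs, rhs, dot) tuple; the encoding is an assumption made for illustration.

grammar = {"S'": [['S']], 'S': [['A', 'A']], 'A': [['a', 'A'], ['b']]}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:   # "•" before a non-terminal
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], tuple(prod), 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    moved = [(l, r, d + 1) for (l, r, d) in items if d < len(r) and r[d] == symbol]
    return closure(moved)

I0 = closure({("S'", ('S',), 0)})
I2 = goto(I0, 'A')                    # {S -> A.A, A -> .aA, A -> .b}
for lhs, rhs, dot in sorted(I2):
    print(lhs, '->', ' '.join(rhs[:dot]) + '.' + ' '.join(rhs[dot:]))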
Drawing DFA:
The DFA contains the 7 states I0 to I6.
LR(0) Table
o If a state goes to some other state on a terminal, it corresponds to a shift move.
o If a state goes to some other state on a variable (non-terminal), it corresponds to a goto move.
o If a state contains a final item, write the reduce action across the entire row.
Explanation:
o I0 on S is going to I1 so write it as 1.
o I0 on A is going to I2 so write it as 2.
o I2 on A is going to I5 so write it as 5.
o I3 on A is going to I6 so write it as 6.
o I0, I2and I3on a are going to I3 so write it as S3 which means that shift 3.
o I0, I2 and I3 on b are going to I4 so write it as S4 which means that shift 4.
o I4, I5 and I6 all contain final items because the • is at the right-most end, so write
the reduce action in those rows using the production number.
LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of the right
side. For example, production A → XYZ yields the four items:
A→.XYZ
A → X . YZ
A → XY . Z
A → XYZ .
Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I by
the two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α•Bβ is in closure(I) and B → γ is a production, then add the item B → •γ to
closure(I), if it is not already there. Apply this rule until no more new items can be added.
Example: Consider the expression grammar:
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id
The string to be parsed is id + id * id
The moves of the LR parser are shown on the next page. For reference:
FOLLOW(E) = { + , $ }
FOLLOW(T) = { * , + , $ }
FOLLOW(F) = { * , + , $ }
T → .T * F
T → .F
F → .( E )
F → .id
I5:
F → id.
I6:
E → E + .T
T → .T * F
T → .F
F → .( E )
F → .id
I7:
T → T * .F
F → .( E)
F → .id
I8:
F → ( E .)
E → E. + T
I9:
E → E + T.
T → T. * F
I10:
T → T * F.
I11:
F → ( E ).
If the right-most column is now traversed upwards, and the productions by which the reduce
steps occur are arranged in that sequence, then that will constitute a rightmost derivation of
the string by this grammar. This highlights the bottom-up nature of SLR parsing.
An LR(1) item indicates how much of a production we have already seen (a A), what we
could expect next (B e), and a lookahead that agrees with what should follow
in the input if we ever reduce by the production S → a A B e. By incorporating such
lookahead information into the item concept, we can make wiser reduce decisions. The
lookahead of an LR(1) item is used directly only when considering reduce actions (i.e., when
the • marker is at the right end).
The core of an LR(1) item [S → a A · B e, c] is the LR(0) item S → a A · B e. Different
LR(1) items may share the same core. For example, if we have two LR(1) items of the form
• [A → α ·, a] and
• [B → α ·, b],
we take advantage of the lookahead to decide which reduction to use. (The same setting
would perhaps produce a reduce/reduce conflict in the SLR approach.)
Validity
The notion of validity changes. An item [A → β1 · β2, a] is valid for a viable prefix α β1 if
there is a rightmost derivation that yields α A a w which in one step yields α β1β2 a w
Initial item
To get the parsing started, we begin with the initial item of
[S’ → · S, $].
Here $ is a special character denoting the end of the string.
Closure
Closure is more refined. If [A → α · B β, a] belongs to the set of items, and B → γ is a
production of the grammar, then we add the item [B → · γ, b] for all b in FIRST(β a).
Every state is closed according to Closure.
Goto
Goto is the same. A state containing [A → α · X β, a] will move to a state containing [A → α
X · β, a] with label X.
Every state has transitions according to Goto.
Shift actions
The shift actions are the same. If [A → α · b β, a] is in state Ik and Ik moves to state Im with
label b, then we add the action
action[k, b] = "shift m"
Reduce actions
The reduce actions are more refined. If [A→α., a] is in state Ik, then we add the action:
"Reduce A → α" to action[Ik, a]. Observe that we don’t use information from FOLLOW(A)
anymore. The goto part of the table is as before.
Add all productions starting with A in I2 State because "•" is followed by the non-
terminal. So, the I2 State becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
I3= Go to (I0, a) = Closure ( A → a•A, a/b )
Add all productions starting with A in I3 State because "•" is followed by the non-
terminal. So, the I3 State becomes
I3= A → a•A, a/b
A → •aA, a/b
A → •b, a/b
Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)
I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b
I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)
Add all productions starting with A in I6 State because "•" is followed by the non-
terminal. So, the I6 State becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) A → aA•, $
If we analyze then LR (0) items of I3 and I6 are same but they differ only in their
lookahead.
I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b}
I6= { A → a•A, $
A → •aA, $
A → •b, $ }
Clearly I3 and I6 are same in their LR (0) items but differ in their lookahead, so we can
combine them and called as I36.
I36 = { A → a•A, a/b/$
A → •aA, a/b/$
A → •b, a/b/$ }
The I4 and I7 are same but they differ only in their look ahead, so we can combine them
and called as I47.
I47 = {A → b•, a/b/$}
The I8 and I9 are same but they differ only in their look ahead, so we can combine them
and called as I89.
I89 = {A → aA•, a/b/$}
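The merging step can be sketched as follows. An LR(1) item is encoded as (lhs, rhs, dot, lookahead), an assumption for illustration; states whose LR(0) cores coincide are grouped and their lookaheads unioned, exactly as I3/I6, I4/I7 and I8/I9 were combined above.

from collections import defaultdict

def core(state):
    """The LR(0) core of a state: its items with the lookaheads stripped."""
    return frozenset((l, r, d) for (l, r, d, _) in state)

def merge_states(states):
    groups = defaultdict(list)
    for st in states:
        groups[core(st)].append(st)              # same core => same group
    merged = []
    for grp in groups.values():
        las = defaultdict(set)
        for st in grp:
            for l, r, d, a in st:
                las[(l, r, d)].add(a)            # union the lookaheads
        merged.append({(l, r, d, frozenset(a)) for (l, r, d), a in las.items()})
    return merged

I3 = {('A', ('a', 'A'), 1, 'a'), ('A', ('a', 'A'), 1, 'b'),
      ('A', ('a', 'A'), 0, 'a'), ('A', ('a', 'A'), 0, 'b'),
      ('A', ('b',), 0, 'a'), ('A', ('b',), 0, 'b')}
I6 = {('A', ('a', 'A'), 1, '$'), ('A', ('a', 'A'), 0, '$'), ('A', ('b',), 0, '$')}
print(merge_states([I3, I6]))    # a single state I36 with lookaheads {a, b, $}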
Drawing DFA:
With iS on top of the stack and else as the first input symbol, should we shift else onto the
stack (i.e., shift e) or reduce if expr then stmt (i.e., reduce by S → iS)? The answer is that we
should shift else, because it is "associated" with the previous then. In the terminology of
grammar (4.67), the e on the input, standing for else, can only form part of the body
beginning with the iS now on the top of the stack. If what follows e on the input cannot be
parsed as an S, completing body iSeS, then it can be shown that there is no other parse
possible.
We conclude that the shift/reduce conflict in I4 should be resolved in favor of shift on input
e. The SLR parsing table constructed from the sets of items of Fig. 4.48, using this resolution
of the parsing-action conflict in I4 on input e, is shown in Fig. 4.51. Productions 1 through 3
are S → iSeS, S → iS, and S → a, respectively.
An LR parser will detect an error when it consults the parsing action table and finds an error
entry. Errors are never detected by consulting the goto table. An LR parser will announce an
error as soon as there is no valid continuation for the portion of the input thus far scanned. A
canonical LR parser will not make even a single reduction before announcing an error. SLR
and LALR parsers may make several reductions before announcing an error, but they will
never shift an erroneous input symbol onto the stack.
In LR parsing, we can implement panic-mode error recovery as follows. We scan down the
stack until a state s with a goto on a particular nonterminal A is found. Zero or more input
symbols are then discarded until a symbol a is found that can legitimately follow A. The
parser then stacks the state GOTO(s, A) and resumes normal parsing. There might be more
than one choice for the nonterminal A. Normally these would be nonterminals representing
major program pieces, such as an expression, statement, or block. For example, if A is the
nonterminal stmt, a might be semicolon or }, which marks the end of a statement sequence.
This method of recovery attempts to eliminate the phrase containing the syntactic error. The
parser determines that a string derivable from A contains an error. Part of that string has
already been processed, and the result of this processing is a sequence of states on top of the
stack. The remainder of the string is still in the input and the parser attempts to skip over the
remainder of this string by looking for a symbol on the input that can legitimately follow A.
By removing states from the stack, skipping over the input, and pushing
GOTO(s, A) on the stack, the parser pretends that it has found an instance of A and resumes
normal parsing.
Phrase-level recovery is implemented by examining each error entry in the LR parsing table
and deciding on the basis of language usage the most likely programmer error that would
give rise to that error. An appropriate recovery procedure can then be constructed;
presumably the top of the stack and/or first input symbols would be modified in a way
deemed appropriate for each error entry.
In designing specific error-handling routines for an LR parser, we can fill in each blank entry
in the action field with a pointer to an error routine that will take the appropriate action
selected by the compiler designer. The actions may include insertion or deletion of symbols
from the stack or the input or both, or alteration and transposition of input symbols. We must
make our choices so that the LR parser will not get into an infinite loop. A safe strategy will
assure that at least one input symbol will be removed or shifted eventually, or that the stack
will eventually shrink if the end of the input has been reached. Popping a stack state that
covers a nonterminal should be avoided, because this modification eliminates from the stack
a construct that has already been successfully parsed.
E -> E + E | E * E | (E) | id
Figure 4.53 shows the LR parsing table from Fig. 4.49 for this grammar, modified for error
detection and recovery. We have changed each state that calls for a particular reduction on
some input symbols by replacing error entries in that state by the reduction. This change has
the effect of postponing the error detection until one or more reductions are made, but the
error will still be caught before any shift move takes place. The remaining blank entries from
Fig. 4.49 have been replaced by calls to error routines.
e3: Called from states 1 or 6 when expecting an operator, and an id or right parenthesis is
found. Push state 4 (corresponding to symbol +) onto the stack; issue diagnostic "missing
operator."
e4: Called from state 6 when the end of the input is found.
Push state 9 (for a right parenthesis) onto the stack; issue diagnostic "missing right
parenthesis."
On the erroneous input id + ), the sequence of configurations entered by the parser is shown
in Fig. 4.54.
SOLVED PROBLEMS
1. Construct SLR(1) Parsing table for the given grammar
S→E
E→E+T|T
T→T*F|F
F → id
Add Augment Production and insert '•' symbol at the first position for every production in
G
S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id
I0 State:
Add Augment production to the I0 State and Compute the Closure
I0 = Closure (S` → •E)
Add all productions starting with E in to I0 State because "." is followed by the non-
terminal. So, the I0 State becomes
I0 = S` → •E
E → •E + T
E → •T
Add all productions starting with T and F in modified I0 State because "." is followed by
the non-terminal. So, the I0 State becomes.
I0= S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id
I1= Go to (I0, E) = closure (S` → E•, E → E• + T)
I2= Go to (I0, T) = closure (E → T•, T → T• * F)
I3= Go to (I0, F) = Closure ( T → F• ) = T → F•
Explanation:
First (E) = First (E + T) ∪ First (T)
First (T) = First (T * F) ∪ First (F)
First (F) = {id}
First (T) = {id}
First (E) = {id}
Follow (E) = First (+T) ∪ {$} = {+, $}
Follow (T) = First (*F) ∪ First (F)
= {*, +, $}
Follow (F) = {*, +, $}
o I1 contains the final item which drives S → E• and follow (S) = {$}, so action {I1, $}
= Accept
o I2 contains the final item which drives E → T• and follow (E) = {+, $}, so action {I2,
+} = R2, action {I2, $} = R2
o I3 contains the final item which drives T → F• and follow (T) = {+, *, $}, so action
{I3, +} = R4, action {I3, *} = R4, action {I3, $} = R4
o I4 contains the final item which drives F → id• and follow (F) = {+, *, $}, so action
{I4, +} = R5, action {I4, *} = R5, action {I4, $} = R5
o I7 contains the final item which drives E → E + T• and follow (E) = {+, $}, so action
{I7, +} = R1, action {I7, $} = R1
I8 contains the final item which drives T → T * F• and follow (T) = {+, *, $}, so action
{I8, +} = R3, action {I8, *} = R3, action {I8, $} = R3.
Construct the CLR(1) parsing table for the given grammar
S → AA
A → aA
A → b
Add Augment Production, insert '•' symbol at the first position for every production in G
and also add the lookahead
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add Augment production to the I0 State and Compute the Closure
I0 = Closure (S` → •S)
Add all productions starting with S in to I0 State because "." is followed by the non-
terminal. So, the I0 State becomes
I0 = S` → •S, $
S → •AA, $
Add all productions starting with A in modified I0 State because "." is followed by the
non-terminal. So, the I0 State becomes.
I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $
I2= Go to (I0, A) = closure ( S → A•A, $ )
Add all productions starting with A in I2 State because "." is followed by the non-
terminal. So, the I2 State becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
I3= Go to (I0, a) = Closure ( A → a•A, a/b )
Add all productions starting with A in I3 State because "." is followed by the non-
terminal. So, the I3 State becomes
Drawing DFA:
6. Find the SLR parsing table for the given grammar and parse the sentence (a+b)*c.
E →E+E | E*E | (E) | id.
Answer
Given grammar:
1. E →E+E
2. E →E*E
3. E →(E)
4. E →id
Augmented grammar
E’ →E
E →E+E
E →E*E
E →(E)
E →id
I0: E’ →.E
E →.E+E
E →.E*E
E →.(E)
E →.id
I1: goto(I0, E)
E’ →E.
E →E.+E
E →E.*E
I2: goto(I0, ()
E → (.E)
E →.E+E
E →.E*E
E →.(E)
E →.id
I3: goto(I0, id)
E →id.
I4: goto(I1, +)
E →E+.E
E →.E+E
E →.E*E
E →.(E)
E →.id
I5: goto(I1, *)
E →E*.E
E →.E+E
E →.E*E
E →.(E)
E →.id
I6: goto(I2, E)
E → (E.)
E →E.+E
E →E.*E
I7: goto(I4, E)
E →E+E.
E →E.+E
E →E.*E
I8: goto(I5, E)
E →E*E.
E →E.+E
E →E.*E
SHORT QUESTIONS

S. No | Short Question | CO Addressing | Blooms Level | Marks
1 | What do you mean by handle pruning? | 2 | 1 | 2
2 | What is bottom-up parsing? Explain with an example. | 2 | 1 | 2
3 | What are the actions in shift-reduce parsing? | 2 | 1 | 2
4 | Define LR(0) items in bottom-up parsing. | 2 | 2 | 2
5 | How is LR different from LL? | 2 | 2 | 2
6 | CLR is a more powerful parsing technique. Justify the statement. | 2 | 2 | 2
7 | Explain the types of LR parsers. | 2 | 2 | 2
8 | Write the conflicts of shift-reduce parsing. | 2 | 2 | 2
9 | List the techniques of bottom-up parsing. | 2 | 2 | 2
10 | What is the difference between top-down and bottom-up parsing? | 2 | 1 | 2
11 | Why is CLR more powerful than SLR & LALR? | 2 | 2 | 2
12 | What is meant by a lookahead? | 2 | 1 | 2
13 | What is a shift-reduce conflict? | 2 | 3 | 2
14 | What is a reduce-reduce conflict? | 2 | 1 | 2
15 | What do you mean by an item set in LR(0)? | 2 | 1 | 2
16 | What is the difference between LR(0) and LR(1)? | 2 | 2 | 2
17 | What do you mean by closure of item sets? | 2 | 1 | 2
18 | Write the canonical collection of LR items. | 2 | 1 | 2
19 | Define the Closure and Goto functions. | 2 | 2 | 2
20 | What is the difference between LR(0) and SLR? | 2 | 2 | 2
21 | Left recursion does not affect bottom-up parsing. Justify. | 2 | 2 | 2
22 | What is the difference between CLR and LALR? | 2 | 1 | 2
23 | CLR is more powerful than the other LR parsers. Justify. | 2 | 1 | 2
24 | What is an operator grammar? | 2 | 2 | 2
25 | What do you mean by an augmented grammar? | 2 | 1 | 2
26 | Define handle. | 2 | 1 | 2
27 | Write short notes on YACC. | 2 | 1 | 2
28 | What are kernel & non-kernel items? | 2 | 1 | 2
29 | What is phrase-level error recovery? | 2 | 1 | 2
30 | Define LR(0) items. | 2 | 1 | 2
Parsing method in which construction starts at the leaves and proceeds towards the
root is called as Bottom Up Parsing.
A general style of bottom-up syntax analysis, which attempts to construct a parse tree
for an input string beginning at the leaves and working up towards the root.
• A handle of a string is a substring that matches the right side of a production and
whose reduction to the nonterminal on the left side of the production represents one
step along the reverse of a rightmost derivation.
• The process of obtaining rightmost derivation in reverse is known as Handle Pruning.
• The set of prefixes of right sentential forms that can appear on the stack of a shift-
reduce parser are called viable prefixes.
• A viable prefix is a prefix of a right sentential form that does not continue past
the right end of the rightmost handle of that sentential form.
i. It is hard to handle tokens like the minus sign, which has two different
precedences.
ii. Since the relationship between a grammar for the language being parsed and the
operator – precedence parser itself is tenuous, one cannot always be sure the
parser accepts exactly the desired language.
iii. Only a small class of grammars can be parsed using operator precedence
techniques.
There are two points in the parsing process at which an operator-precedence parser can
discover the syntactic errors:
i. If no precedence relation holds between the terminal on top of the stack and the
current input
ii. If a handle has been found, but there is no production with this handle as a right
side
9. LR (k) parsing stands for what?
The “L” is for left-to-right scanning of the input, the “R” for constructing a rightmost
derivation in reverse, and the k for the number of input symbols of look ahead that are
used in making parsing decisions.
• The function goto takes a state and grammar symbol as arguments and produces a
state.
• The goto function of a parsing table constructed from a grammar G is the transition
function of a DFA that recognizes the viable prefixes of G.
Ex: goto(I, X), where I is a set of items and X is a grammar symbol, is defined to be the
closure of the set of all items [A → αX.β] such that [A → α.Xβ] is in I.
i. The set of items which includes the initial item S' → .S, together with all items whose
dots are not at the left end, are known as kernel items.
ii. The set of items which have their dots at the left end, except the initial item S' → .S,
are known as non-kernel items.
15. Why SLR and LALR are more economical to construct than canonical LR?
For a comparison of parser size, the SLR and LALR tables for a grammar always
have the same number of states, and this number is typically several hundred states
for a language like Pascal. The canonical LR table would typically have several
thousand states for the same size language. Thus, it is much easier and more
economical to construct SLR and LALR tables than the canonical LR tables.
LONG QUESTIONS
(CO addressed, Blooms level and marks are shown in brackets after each question.)

1. Construct the SLR(1) parsing table for the given grammar and parse the string ( )( ): [CO 2, Blooms 4, 10 marks]
   S → S ( S )
   S → ε
2. Construct the SLR(1) parsing table for the following grammar: [CO 2, Blooms 4, 10 marks]
   S → CC
   C → cC
   C → d
3. Check whether the given grammar is CLR(1) or not: [CO 2, Blooms 4, 10 marks]
   S → AS
   S → b
   A → SA
   A → a
4. Construct an LALR(1) parser for the following grammar: [CO 2, Blooms 4, 10 marks]
   S → L = R
   S → R
   L → * R
   L → id
   R → L
5. Show that the given grammar is LL(1) but not SLR(1): [CO 2, Blooms 4, 10 marks]
   S → AaAb | BbBa
   A → ε
   B → ε
6. Show that the given grammar is SLR(1) but not LL(1): [CO 2, Blooms 4, 10 marks]
   S → SA | A
   A → a
7. Discuss error recovery in LR parsing. [CO 2, Blooms 2, 10 marks]
8. Explain CLR parsing and justify how it is more powerful than the other LR parsers. [CO 2, Blooms 2, 10 marks]
9. Explain the common conflicts that can be encountered in a shift-reduce parser, with an example. [CO 2, Blooms 3, 10 marks]
10. Determine whether the grammar below is SLR(1) or not: [CO 2, Blooms 4, 10 marks]
   S → AS | b
   A → SA | a
11. Consider the grammar E → E + E | E * E | (E) | id. Show the sequence of moves made by the shift-reduce parser on the input id1+id2*id3 and determine whether the given string is accepted. [CO 2, Blooms 4, 10 marks]
Check whether the given grammar is LR(0) or SLR(1):
   E → E + T | T
   T → i
24. Check whether the given grammar is LR(0) or SLR(1): [CO 2, Blooms 4, 10 marks]
   E → T + E | T
   T → i
25. Show whether the given grammar is CLR(1) or not: [CO 2, Blooms 4, 10 marks]
   S → SA | A
   A → a
Explanation:
First(aSa) = a
First(bS) = b
First(c) = c
All are mutually disjoint, i.e., there is no common terminal between them, so the given
grammar is LL(1).
As the grammar is LL(1) it will also be LR(1), since LR parsers are more powerful than
LL(1) parsers and all LL(1) grammars are also LR(1).
So option C is correct.
2. An LALR(1) parser for a grammar G can have shift-reduce (S-R) conflicts if and
only if
(A) The SLR(1) parser for G has S-R conflicts
(B) The LR(1) parser for G has S-R conflicts
(C) The LR(0) parser for G has S-R conflicts
(D) The LALR(1) parser for G has reduce-reduce conflicts
Answer: (B)
Explanation:
Both LALR(1) and LR(1) parsers use LR(1) sets of items to form their parsing tables, and
LALR(1) states can be found by merging LR(1) states that have the same set of first
components (the same core) in their items.
i.e., if the LR(1) parser has two states I and J with items A → a.bP, x and A → a.bP, y
respectively, where x and y are lookahead symbols, then, as these items are the same with
respect to their first component, they can be merged into one single state, say K, taking the
union of the lookahead symbols. After merging, state K will have the single item
A → a.bP, x/y. This is how LALR(1) states are formed (i.e., by merging the states of LR(1)).
Consider the following grammar:
S -> S * E
S -> E
E -> F + E
E -> F
F -> id
Consider the following LR(0) items corresponding to the grammar above.
(i) S -> S * .E
(ii) E -> F. + E
(iii) E -> F + .E
Given the items above, which two of them will appear in the same set in the canonical sets-
of-items for the grammar?
(A) (i) and (ii)
(B) (ii) and (iii)
(C) (i) and (iii)
(D) None of the above
Answer: (D)
Explanation: Let’s make the LR(0) set of items. First we need to augment the grammar with
the production rule S’ -> .S , then we need to find closure of items in a set to complete a set.
Below are the LR(0) sets of items.
6. Consider the grammar defined by the following production rules, with two
operators * and +:
S → T * P
T → U | T * U
P → Q + P | Q
Q → id
U → id
Which one of the following is TRUE?
(A) + is left associative, while ∗ is right associative
(B) + is right associative, while ∗ is left associative
(C) Both + and ∗ are right associative
(D) Both + and ∗ are left associative
Answer: (B)
Explanation: We can determine associativity by looking at the grammar.
Consider the 2nd production: T → T * U.
T generates T * U recursively (left recursion), so * is left associative.
Similarly, P → Q + P is right recursive, so + is right associative.
So option B is correct.
13. Which of the following is essential for converting an infix expression to postfix
form efficiently?
(A) An operator stack
(B) An operand stack
(C) An operand stack and an operator stack
(D) A parse tree
Answer: (A)
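A minimal sketch of why an operator stack alone suffices: operands can be emitted the moment they are read, so only operators (and left parentheses) ever wait on the stack. The precedence map is an illustrative assumption.

def to_postfix(tokens):
    prec = {'+': 1, '-': 1, '*': 2, '/': 2}
    out, ops = [], []
    for t in tokens:
        if t in prec:
            while ops and ops[-1] in prec and prec[ops[-1]] >= prec[t]:
                out.append(ops.pop())       # pop operators of >= precedence
            ops.append(t)
        elif t == '(':
            ops.append(t)
        elif t == ')':
            while ops[-1] != '(':
                out.append(ops.pop())
            ops.pop()                       # discard the '('
        else:
            out.append(t)                   # operand: emit immediately
    return out + ops[::-1]                  # flush the remaining operators

print(to_postfix(['a', '+', 'b', '*', 'c']))   # ['a', 'b', 'c', '*', '+']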
15. Consider the following expression grammar. The semantic rules for expression
calculation are stated next to each grammar production.
E → number    { E.val = number.val }
  | E '+' E   { E(1).val = E(2).val + E(3).val }
  | E '×' E   { E(1).val = E(2).val × E(3).val }
The above grammar and the semantic rules are fed to a yacc tool (which is an LALR (1)
parser generator) for parsing and evaluating arithmetic expressions. Which one of the
following is true about the action of yacc for the given grammar?
(A) It detects recursion and eliminates recursion
(B) It detects reduce-reduce conflict, and resolves
(C) It detects shift-reduce conflict, and resolves the conflict in favor of a shift over a reduce
action
(D) It detects shift-reduce conflict, and resolves the conflict in favor of a reduce over a shift
action
Answer: (C)
Explanation:
Background: yacc resolves conflicts using the following rules:
• in a shift/reduce conflict, shift is preferred over reduce;
• in a reduce/reduce conflict, the reduction by the production listed first is preferred.
You can answer this question directly by constructing the LALR(1) parse table, though that
is time-consuming. To answer it faster, one can see intuitively that this grammar will
certainly have a shift-reduce conflict; in that case, since this is a single-choice question,
option (C) must be the right answer.
The fool-proof approach is to generate the LALR(1) parse table, which is a lengthy process.
Once we have the parse table, we can clearly see that:
i. a reduce/reduce conflict will not arise in the given grammar;
ii. the shift/reduce conflict will be resolved by giving preference to shift, hence making the
expression calculator right associative.
According to the above conclusions, the correct option is (C).
UNIT IV
• Note: Terminal symbols are assumed to have synthesized attributes supplied by the
lexical analyzer.
• Procedure calls (e.g., print in the examples below) define values of dummy synthesized
attributes of the non-terminal on the left-hand side of the production.
• Example: Consider the grammar for arithmetic expressions. The syntax-directed
definition associates with each non-terminal a synthesized attribute called val.
Note: A parse tree showing the values of its attributes is called an annotated parse tree.
Inherited Attributes
• Inherited Attributes are useful for expressing the dependence of a construct on the
context in which it appears.
• It is always possible to rewrite a syntax directed definition to use only synthesized
attributes, but it is often more natural to use both synthesized and inherited attributes.
• Evaluation Order: Inherited attributes cannot be evaluated by a simple preorder
traversal of the parse tree:
• Unlike synthesized attributes, the order in which the inherited attributes of the
children are computed is important, because:
• Inherited attributes of the children can depend on both left and right siblings!
Example: Let us consider the syntax directed definition with both inherited and synthesized
attributes for the grammar for “type declarations”:
• The non terminal T has a synthesized attribute, type, determined by the keyword in
the declaration.
• The production D → T L is associated with the semantic rule L.in := T.type, which
sets the inherited attribute L.in.
Dependency Graphs
Evaluation Order
• The evaluation order of the semantic rules is obtained from a topological sort of
the dependency graph.
• Topological sort: any ordering m1, m2, ..., mk such that if mi → mj is a link in
the dependency graph, then mi appears before mj in the ordering.
• Any topological sort of a dependency graph gives a valid order in which to evaluate
the semantic rules.
Topological Order:
Example:
Build the dependency graph for the parse-tree of real id1, id2, id3
Strictly non-circular definitions are formalisms for which an attribute evaluation order can
be fixed at compiler-construction time.
• They form a class that is less general than the class of non-circular definitions.
• In the following we illustrate two kinds of strictly non-circular definitions: S-
Attributed and L-Attributed Definitions.
• Extra fields are added to the stack to hold the values of synthesized attributes.
• In the simple case of just one attribute per grammar symbol the stack has two fields:
state and val
Example: Consider the S-attributed definitions for the arithmetic expressions. To evaluate
attributes the parser executes the following code
• The variable ntop is set to the new top of the stack. After a reduction is done, top is set
to ntop: when a reduction A → α is done with |α| = r, then ntop = top − r + 1
• During a shift action both the token and its value are pushed into the stack.
• The following Figure shows the moves made by the parser on input 3*5+4n.
– Stack states are replaced by their corresponding grammar symbol;
– Instead of the token digit the actual value is shown.
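The reduce-time attribute computation can be sketched as follows. The Entry class and the hard-coded stack contents (the configuration reached just before reducing E → E + T on the input 3*5+4n) are assumptions made for illustration.

class Entry:
    """One parser-stack slot: the grammar symbol plus its val attribute."""
    def __init__(self, symbol, val=None):
        self.symbol, self.val = symbol, val

def reduce_plus(stack):
    """Reduce by E -> E + T: pop |rhs| = 3 entries and push E with its val."""
    t, _plus, e = stack.pop(), stack.pop(), stack.pop()
    stack.append(Entry('E', e.val + t.val))      # E.val := E1.val + T.val

stack = [Entry('E', 15), Entry('+'), Entry('T', 4)]   # state after parsing 3*5 + 4
reduce_plus(stack)
print(stack[-1].symbol, stack[-1].val)                # E 19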
• L-Attributed Definitions are a class of syntax-directed definitions whose attributes can
always be evaluated by a single traversal of the parse tree.
• The following procedure evaluates L-Attributed Definitions by mixing postorder
(synthesized) and preorder (inherited) traversal.
TRANSLATION SCHEMES:
Example:
Consider the translation scheme for the L-attributed definition for "type declarations":
D → T { L.in := T.type } L
T → int { T.type := integer }
T → real { T.type := real }
L → { L1.in := L.in } L1 , id { addtype(id.entry, L.in) }
L → id { addtype(id.entry, L.in) }
The parse-tree with semantic actions for the input real id1, id2, id3 is:
Traversing the Parse-Tree in depth-first order (PostOrder) we can evaluate the attributes
Example:
A simple translation scheme that converts infix expressions to the corresponding postfix
expressions.
E→TR
R → + T { print(“+”) } R1
R→ε
T → id { print(id.name) }
For example, the input a+b+c is translated to ab+c+.
The depth first traversal of the parse tree (executing the semantic actions in that order) will
produce the postfix representation of the infix expression.
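The scheme can be executed directly as a recursive-descent translator, with each semantic action performed at the point where it appears in its production. A minimal Python sketch, assuming single-character identifiers as a simplification:

def translate(tokens):
    out, pos = [], 0

    def T():                         # T -> id { print(id.name) }
        nonlocal pos
        out.append(tokens[pos])
        pos += 1

    def R():                         # R -> + T { print('+') } R | eps
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == '+':
            pos += 1                 # match '+'
            T()
            out.append('+')          # the embedded semantic action
            R()
        # else: R -> eps, nothing to do

    T(); R()                         # E -> T R
    return ''.join(out)

print(translate(list('a+b+c')))      # ab+c+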
The right parts of the CFG productions carry the semantic rules that specify how the grammar
should be interpreted. In a rule such as E → E + T { E.val := E.val + T.val }, the values of the
non-terminals E and T are added together and the result is copied to the attribute of the
left-hand non-terminal E.
Semantic attributes may be assigned to their values from their domain at the time of parsing
and evaluated at the time of assignment or conditions. Based on the way the attributes get
their values, they can be broadly divided into two categories: synthesized attributes and
inherited attributes.
The design of a type checker for a language is based on information about the syntactic
constructs in the language, the notion of types, and the rules for assigning types to language
constructs.
For example: “if both operands of the arithmetic operators of +,- and * are of type integer,
then the result is of type integer ”
The type of a language construct will be denoted by a “type expression”. A type expression is
either a basic type or is formed by applying an operator called a type constructor to other type
expressions. The sets of basic types and constructors depend on the language to be checked.
The following are the definitions of type expressions:
1. Basic types such as boolean, char, integer, real are type expressions. A special basic
type, type_error , will signal an error during type checking; void denoting “the
absence of a value” allows statements to be checked
2. Since type expressions may be named, a type name is a type expression
3. A type constructor applied to type expressions is a type expression
Constructors include:
Arrays: If T is a type expression then array (I,T) is a type expression denoting the
type of an array with elements of type T and index set I.
For example:
type row = record
address: integer;
lexeme: array[1..15] of char
end;
var table: array[1...101] of row;
Declares the type name row representing the type expression record((address X
integer) X (lexeme X array(1..15,char))) and the variable table to be an array of
records of this type.
Pointers: If T is a type expression, then pointer(T) is a type expression denoting the
type “pointer to an object of type T”. For example, var p: ↑ row declares variable p to
have type pointer(row).
Functions: A function in programming languages maps a domain type D to a range
type R. The type of such function is denoted by the type expression D → R
4. Type expressions may contain variables whose values are type expressions.
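Type expressions fit naturally into a small algebraic datatype with structural equality. The sketch below is a minimal illustration; the class names and the check_add helper are assumptions, not a fixed API.

from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Basic:
    name: str                       # 'integer', 'real', 'char', 'boolean', ...

@dataclass(frozen=True)
class Array:
    index: tuple                    # the index set I, e.g. (1, 15)
    elem: Any                       # element type T

@dataclass(frozen=True)
class Pointer:
    to: Any

@dataclass(frozen=True)
class Function:
    domain: Any                     # D -> R
    range: Any

INTEGER, TYPE_ERROR = Basic('integer'), Basic('type_error')

def check_add(t1, t2):
    """'If both operands of + are of type integer, the result is integer.'"""
    return INTEGER if t1 == t2 == INTEGER else TYPE_ERROR

row = Array((1, 15), Basic('char'))       # array(1..15, char)
p = Pointer(row)                          # pointer(array(1..15, char))
print(p)
print(check_add(INTEGER, INTEGER))        # Basic(name='integer')
print(check_add(INTEGER, Basic('real')))  # Basic(name='type_error')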
Error Recovery
Since type checking has the potential for catching errors in program, it is desirable for
type checker to recover from errors, so it can check the rest of the input. Error handling has to
be designed into the type system right from the start; the type checking rules must be
prepared to cope with errors.
INTRODUCTION
The front end translates a source program into an intermediate representation from which the
back end generates target code. The benefits of using a machine-independent intermediate
form are:
1. Retargeting is facilitated. That is, a compiler for a different machine can be created by
attaching a back end for the new machine to an existing front end.
2. A machine-independent code optimizer can be applied to the intermediate representation.
INTERMEDIATE LANGUAGES
Three ways of intermediate representation:
• Postfix notation
• Three address code
• Syntax tree
The semantic rules for generating three-address code from common programming language
constructs are similar to those for constructing syntax trees or for generating postfix notation.
Postfix Notation –
The ordinary (infix) way of writing the sum of a and b is with operator in the middle: a + b
The postfix notation for the same expression places the operator at the right end as ab +. In
general, if e1 and e2 are any postfix expressions, and + is any binary operator, the result of
applying + to the values denoted by e1 and e2 is postfix notation by e1e2 +. No parentheses
are needed in postfix notation because the position and arity (number of arguments) of the
operators permit only one way to decode a postfix expression. In postfix notation the operator
follows the operand.
Three-Address Code –
A statement involving no more than three references (two for operands and one for result) is
known as three address statement. A sequence of three address statements is known as three
address code. Three address statement is of the form x = y op z , here x, y, z will have address
(memory location). Sometimes a statement might contain less than three references but it is
still called three address statements.
Example – The three address code for the expression a + b * c + d:
T1 = b * c
T2 = a + T1
T3 = T2 + d
T1, T2, T3 are temporary variables.
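Producing such code from an expression tree is a short recursive walk that allocates a fresh temporary for each interior node. A minimal Python sketch; the tuple-based tree encoding ('+', left, right) is an assumption made for illustration.

temp_count = 0

def new_temp():
    global temp_count
    temp_count += 1
    return f'T{temp_count}'

def gen(node, code):
    """Return the name holding node's value, appending instructions to code."""
    if isinstance(node, str):            # a leaf: variable or constant
        return node
    op, left, right = node
    l, r = gen(left, code), gen(right, code)
    t = new_temp()
    code.append(f'{t} = {l} {op} {r}')
    return t

code = []
gen(('+', ('+', 'a', ('*', 'b', 'c')), 'd'), code)    # a + b * c + d
print('\n'.join(code))    # T1 = b * c ; T2 = a + T1 ; T3 = T2 + d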
Syntax Tree –
A syntax tree is nothing more than a condensed form of a parse tree. The operator and keyword
nodes of the parse tree are moved to their parents, and a chain of single productions is
replaced by a single link. In a syntax tree the internal nodes are operators and the child nodes
are operands. To form a syntax tree, put parentheses in the expression; this way it is easy to
recognize which operand should come first.
Example –
x = (a + b * c) / (a – b * c)
Three-address code
Fig. 4.11.3 Three-Address Code corresponding to the syntax tree and DAG
The reason for the term “Three-Address Code” is that each statement usually contains three
addresses, two for the operands and one for the result.
Three address code is a type of intermediate code which is easy to generate and can be easily
converted to machine code. It makes use of at most three addresses and one operator to
represent an expression and the value computed at each instruction is stored in temporary
variable generated by compiler. The compiler decides the order of operation given by three
address code.
General representation –
a = b op c
Where a, b or c represents operands like names, constants or compiler generated temporaries
and op represents the operator
Three-address code is commonly implemented using one of three representations:
1. Quadruples
2. Triples
3. Indirect Triples
1. Quadruple –
• It is structure with consist of 4 fields namely op, arg1, arg2 and result. op denotes the
operator and arg1 and arg2 denotes the two operands and result is used to store the
result of the expression.
• The contents of fields arg1, arg2 and result are normally pointers to the symbol-
entries for the names represented by these fields. If so, temporary names must be
entered into the symbol table as they are created
Advantage –
• Easy to rearrange code for global optimization.
• One can quickly access value of temporary variables using symbol table.
Disadvantage –
• Contain lot of temporaries.
• Temporary variable creation increases time and space complexity.
2. Triples –
A triple has only three fields: op, arg1 and arg2. The result of an operation is referred to
by the position (index) of the triple that computes it, so no explicit temporary names are
stored.
Disadvantage –
• Temporaries are implicit and it is difficult to rearrange the code.
• It is difficult to optimize, because optimization involves moving intermediate code.
When a triple is moved, any other triple referring to it must be updated as well.
3. Indirect Triples –
This representation uses an additional list of pointers to the triples; statements are
reordered by rearranging this list rather than the triples themselves, and with the help of
the pointers one can directly access a symbol table entry.
Syntax trees are abstract or compact representation of parse trees. They are also called as
Abstract Syntax Trees.
Example-
NOTE-
Syntax trees are called as Abstract Syntax Trees because-
• They are abstract representation of the parse trees.
• They do not provide every characteristic information from the real syntax.
• For example- no rule nodes, no parenthesis etc.
Example:
Considering the following grammar-
E→E+T|T
T→TxF|F
F → ( E ) | id
Generate the following for the string id + id x id
• Parse tree
• Syntax tree
• Directed Acyclic Graph (DAG)
Solution-
Parse Tree-
Syntax Tree-
Properties-
• Reachability relation forms a partial order in DAGs.
• Both transitive closure & transitive reduction are uniquely defined for DAGs.
• Topological Orderings are defined for DAGs.
Applications-
DAGs are used for the following purposes-
• To determine the expressions which have been computed more than once (called
common sub-expressions).
• To determine the names whose computation has been done outside the block but used
inside the block.
• To determine the statements of the block whose computed value can be made
available outside the block.
• To simplify the list of Quadruples by not executing the assignment instructions x:=y
unless they are necessary and eliminating the common sub-expressions.
Construction of DAGs-
Following rules are used for the construction of DAGs-
Rule-01:
In a DAG,
• Interior nodes always represent the operators.
• Exterior nodes (leaf nodes) always represent the names, identifiers or constants
Rule-02:
While constructing a DAG,
• A check is made to find if there exists any node with the same value.
• A new node is created only when there does not exist any node with the same value.
• This action helps in detecting the common sub-expressions and avoiding the re-
computation of the same.
Rule-03:
The assignment instructions of the form x:=y are not performed unless they are necessary.
Example:
1. Consider the following expression and construct a DAG for it-
(a+b)x(a+b+c)
Solution-
Three Address Code for the given expression is-
T1 = a + b
T2 = T1 + c
T3 = T1 x T2
Now, the Directed Acyclic Graph is:
This illustrates how the construction scheme of a DAG identifies the common sub-expression
and helps in eliminating its re-computation later.
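Rule-02 above is, in effect, value numbering: each (op, left, right) combination is hashed, and a node is created only when no identical node exists. A minimal Python sketch; the encoding is an assumption made for illustration.

nodes, table = [], {}     # nodes[i] is the i-th DAG node; table maps key -> index

def node(key):
    if key not in table:              # Rule-02: reuse an identical existing node
        table[key] = len(nodes)
        nodes.append(key)
    return table[key]

def leaf(name):
    return node(('leaf', name, None))

def op(o, left, right):
    return node((o, left, right))

a, b, c = leaf('a'), leaf('b'), leaf('c')
t1 = op('+', a, b)                    # a + b
t2 = op('+', t1, c)                   # (a + b) + c shares the a+b node
t3 = op('x', op('+', a, b), t2)       # the second a+b maps to the same node as t1
print(t1, t3)                         # op('+', a, b) returned the same id both times
print(nodes)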
SOLVED PROBLEMS
1. Write quadruple, triples and indirect triples for following expression : (x + y) * (y +
z) + (x + y + z)
Explanation – The three address code is:
t1 = x + y
t2 = y + z
t3 = t1 * t2
t4 = t1 + z
t5 = t3 + t4
Quadruple representation:
#  Op  Arg1  Arg2  Result
1  +   x     y     t1
2  +   y     z     t2
3  *   t1    t2    t3
4  +   t1    z     t4
5  +   t3    t4    t5
Triple representation:
#    Op  Arg1  Arg2
(1)  +   x     y
(2)  +   y     z
(3)  *   (1)   (2)
(4)  +   (1)   z
(5)  +   (3)   (4)
Indirect triple representation:
Triples:
#     Op  Arg1  Arg2
(14)  +   x     y
(15)  +   y     z
(16)  *   (14)  (15)
(17)  +   (14)  z
(18)  +   (16)  (17)
List of pointers to the triples:
#    Statement
(1)  (14)
(2)  (15)
(3)  (16)
(4)  (17)
(5)  (18)
Construct a syntax tree for the following arithmetic expression:
( a + b ) * ( c – d ) + ( ( e / f ) * ( a + b ) )
Solution-
Step-01:
We convert the given arithmetic expression into a postfix expression as-
(a+b)*(c–d)+((e/f)*(a+b))
ab+ * ( c – d ) + ( ( e / f ) * ( a + b ) )
ab+ * cd- + ( ( e / f ) * ( a + b ) )
ab+ * cd- + ( ef/ * ( a + b ) )
ab+ * cd- + ( ef/ * ab+ )
ab+ * cd- + ef/ab+*
ab+cd-* + ef/ab+*
ab+cd-*ef/ab+*+
Step-02:
We draw a syntax tree for the above postfix expression.
Steps Involved
Start pushing the symbols of the postfix expression into the stack one by one.
When an operand is encountered,
• Push it into the stack.
When an operator is encountered,
• Pop the top two sub-trees from the stack, make them the right and left children of a new
node labelled with the operator, and push this new sub-tree back onto the stack.
Continue in the same manner, drawing the syntax tree simultaneously; a C sketch of this stack
algorithm follows.
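A compact C sketch of the stack algorithm just described, assuming single-character operands
and binary operators only (the type and helper names are illustrative):

#include <ctype.h>
#include <stdlib.h>

struct tnode {
    char symbol;                 /* operand letter or operator */
    struct tnode *left, *right;
};

/* Build a syntax tree from a postfix string such as "ab+cd-*ef/ab+*+". */
struct tnode *build_tree(const char *postfix)
{
    struct tnode *stack[64];
    int top = 0;

    for (const char *p = postfix; *p; p++) {
        struct tnode *n = malloc(sizeof *n);
        n->symbol = *p;
        n->left = n->right = NULL;
        if (!isalnum((unsigned char)*p)) {
            /* operator: pop two sub-trees, right one first */
            n->right = stack[--top];
            n->left  = stack[--top];
        }
        stack[top++] = n;        /* push operand or new sub-tree */
    }
    return stack[--top];         /* root of the complete syntax tree */
}

For the postfix string derived in Step-01, build_tree("ab+cd-*ef/ab+*+") returns the root +
node of the syntax tree.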
Solution-
Directed Acyclic Graph for the given expression is-
Solution-
Directed Acyclic Graph for the given block is-
SHORT QUESTIONS
S. No   Short Questions                                                       CO Addressing   Blooms level   Marks
1       What is syntax directed definition?                                   4               1              2
2       Explain the usage of syntax directed definition?                      4               1              2
3       What are the two types of attribute grammars?                         4               1              2
4       Define Annotated parse tree                                           4               1              2
5       What is the purpose of semantic analysis in a compiler?               2               1              2
30      What is DAG?                                                          4               1              2
31      Design DAG for the given expression a := b*c + b*-c                   4               2              2
32      Construct a DAG for the expression a := a + 30                        4               2              2
33      What are the functions used to create the nodes of syntax trees?      4               2              2
34      What is the role of Intermediate code generator?                      4               1              2
35      Write the quadruple notation for the expression (a+b)-(c/d)*e         4               2              2
36      Write the triple notation for the expression (a+b)-(c/d)*e            4               2              2
37      Write the indirect triple notation for the expression (a+b)-(c/d)*e   4               2              2
38      Write the intermediate representation for the expression (a+b)-(c/d)*e 4              2              2
4. What is a syntax tree? Draw the syntax tree for the assignment statement a := b *
-c + b * -c.
• A syntax tree depicts the natural hierarchical structure of a source program.
• Syntax tree:
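The figure is easy to reconstruct in text. A sketch of the syntax tree, writing uminus for
unary minus (note that, unlike a DAG, the tree duplicates the common sub-expression b * -c):

          assign
         /      \
        a        +
               /   \
             *       *
            / \     / \
           b uminus b uminus
                |        |
                c        c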
9. State quadruple
A quadruple is a record structure with four fields, which we call op, arg1, arg2 and
result.
A sequence of actions taken on entry to and exit from each procedure is known as
calling sequence.
15. What is the intermediate code representation for the expression a or b and not c?
(Or) Translate a or b and not c into three address code.
Three-address sequence is
t1:= not c
t2:= b and t1
t3:= a or t2
16. What are the methods of representing a syntax tree?
i. Each node is represented as a record with a field for its operator and additional
fields for pointers to its children
ii. Nodes are allocated from an array of records and the index or position of the
node serves as the pointer to the node
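Both representations can be sketched in C; the field layout below is illustrative, and the
node-building functions follow the mknode/mkleaf style commonly used in textbook treatments
of syntax trees:

#include <stdlib.h>

/* Representation (i): each node is a record with a field for its
   operator and additional fields for pointers to its children. */
struct node {
    int op;                    /* operator code, 0 for a leaf     */
    const char *name;          /* leaf label, NULL for interiors  */
    struct node *left, *right; /* children, NULL for leaves       */
};

struct node *mknode(int op, struct node *l, struct node *r)
{
    struct node *n = malloc(sizeof *n);
    n->op = op; n->name = NULL; n->left = l; n->right = r;
    return n;
}

struct node *mkleaf(const char *name)
{
    struct node *n = mknode(0, NULL, NULL);
    n->name = name;
    return n;
}

/* Representation (ii): nodes allocated from an array of records;
   the index of a node serves as the pointer to it. */
struct anode { int op; const char *name; int left, right; };
struct anode pool[256];        /* pool[i].left == -1 marks a leaf */

With representation (i), the tree for a + b is built simply as
mknode('+', mkleaf("a"), mkleaf("b")).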
LONG QUESTIONS
CO Blooms Mark
S. No Long Questions
Addressing level s
1 Explain Syntax Directed Definition in detail 4 2 10
2 Explain Syntax Directed Translation in detail 4 2 10
Construct a parse, syntax tree and annotated
parse tree for given input String 5*6+7
S→EN
3 4 4 10
E→E+T|T
T→T*F|F
F → ( E ) | id
Design the dependency graph for following
grammar
4 S→T List 4 4 10
T→int|float|char|double
List → List , id | id
Draw the syntax tree and DAG for the
5 4 4 10
expression (a*b)+(c-d)*(a*b)+b
What is syntax tree? Write syntax directed
definition for constructing a syntax tree for an
6 expression using the given grammar 4 4 10
E→E+T|E-T|T
T→(E)|id|num
Define translation scheme and mention how it is
7 4 4 10
different from syntax-directed definition
8 Consider the following grammar G 4 5 10
E1→E
E→E+n|n
Give the parsing action of a bottom up parser
for following string n+n.
Explain why every S-attributed definition is L-
9 4 2 5
attributed.
Explain in detail about how an L-attributed
10 grammar can be converted into a translation 4 2 10
scheme.
With the help of an example, explain how to
11 4 3 10
evaluate an SDD at the nodes of a parse tree?
Give annotated parse trees for the following
expressions:
12 a) (3 + 4 ) * ( 5 + 6 ) n 4 3 10
b) 1 * 2 * 3 * (4 + 5) n.
c) (9 + 8 * (7 + 6 ) + 5) * 4 n .
13 Explain the concept of type conversion. 4 3 5
Explain how declaration is done using syntax
14 4 3 10
directed translation?
Write the quadruples, triples and indirect triples
for the following expressions.
15 4 3 10
a. (a+b)/(c-d)*(e/f + g).
b. (a+b-c/d/e).
a. Construct the syntax tree for the expression
16 (p+q) + (r-s) * (p*q) - q. 4 4 10
b. Give the applications of DAG.
Give DAG representation scheme for the
17 following expression. ( ( a - b ) * c ) – d 4 4 10
Answer: (D)
Explanation: Background required to solve the question – Syntax Directed Translation and
Parse Tree Construction.
2. Consider the grammar with the following translation rules and E as the start
symbol.
Explanation: Left associativity for an operator * in a grammar is enforced by making sure
that, for a production rule like S → S1 * S2, S2 never produces an expression containing *.
On the other hand, to ensure right associativity, S1 should never produce an expression
containing *. In the given grammar, both '#' and '&' are left-associative.
3. Consider the following translation scheme:
S → T R
R → + T {print ('+');} R | ε
T → num {print (num.val);}
Here num is a token that represents an integer and num.val represents the corresponding
integer value. For an input string '9 + 5 + 2', this translation scheme will print
(A) 9 + 5 + 2
(B) 9 5 + 2 +
(C) 9 5 2 + +
(D) + + 9 5 2
Answer: (B)
Explanation: Construct the parse tree for 9+5+2 top-down, using a leftmost derivation.
Steps:
1) Expand S → T R
2) Apply T → num
3) Apply R → + T ... R
4) Apply T → num
5) Apply R → + T ... R
6) Apply T → num
7) Apply R → ε
Executing the print statements in the order they are encountered in this parse tree prints
9 5 + 2 +
4. Which one of the following statements is FALSE?
(A) Context-free grammar can be used to specify both lexical and syntax rules.
(B) Type checking is done before parsing.
(C) High-level language programs can be translated to different Intermediate
Representations.
(D) Arguments to a function can be passed using the program stack.
Answer: (B)
Explanation: Type checking is done in the semantic analysis phase and parsing is done in the
syntax analysis phase, and syntax analysis comes before semantic analysis. So option (B) is
false; all the other options are correct.
5. RUN-TIME ENVIRONMENTS
Control stack:
A control stack is used to keep track of live procedure activations. The idea is to push
the node for an activation onto the control stack as the activation begins and to pop the node
when the activation ends. The contents of the control stack are related to paths to the root of
the activation tree. When node n is at the top of control stack, the stack contains the nodes
along the path from n to the root.
Binding of names:
Even if each name is declared once in a program, the same name may denote different
data objects at run time. “Data object” corresponds to a storage location that holds values.
The term environment refers to a function that maps a name to a storage location. The term
state refers to a function that maps a storage location to the value held there. When an
environment associates storage location s with a name x, we say that x is bound to s. This
association is referred to as a binding of x.
Fig. 5.1.1 Typical subdivision of run-time memory into code and data areas
• Run-time storage comes in blocks, where a byte is the smallest unit of addressable
memory. Four bytes form a machine word. Multibyte objects are stored in consecutive
bytes and given the address of the first byte.
• The storage layout for data objects is strongly influenced by the addressing constraints
of the target machine.
• A character array of length 10 needs only enough bytes to hold 10 characters, but a
compiler may allocate 12 bytes to satisfy alignment constraints, leaving 2 bytes unused.
• This unused space due to alignment considerations is referred to as padding.
STATIC ALLOCATION:
In static allocation, names are bound to storage as the program is compiled, so there is
no need for a run-time support package. Since the bindings do not change at run time,
every time a procedure is activated, its names are bound to the same storage locations.
Therefore values of local names are retained across activations of a procedure.
That is, when control returns to a procedure the values of the locals are the same as
they were when control left the last time. From the type of a name, the compiler decides the
amount of storage for the name and decides where the activation records go. At compile time,
we can fill in the addresses at which the target code can find the data it operates on.
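This retention of values across activations is exactly the behaviour a statically allocated
local gives; a small C illustration:

#include <stdio.h>

/* count is bound to one fixed storage location when the program is
   compiled, so its value survives across activations of visit(). */
void visit(void)
{
    static int count = 0;   /* statically allocated local */
    count++;
    printf("activation %d\n", count);
}

int main(void)
{
    visit();   /* prints "activation 1" */
    visit();   /* prints "activation 2": the old value was retained */
    return 0;
}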
Calling sequences:
Procedure calls are implemented by what is called a calling sequence, which consists of
code that allocates an activation record on the stack and enters information into its fields.
A return sequence is similar: code that restores the state of the machine so the calling
procedure can continue its execution after the call. The code in a calling sequence is often
divided between the calling procedure (caller) and the procedure it calls (callee).
When designing calling sequences and the layout of activation records, the following
principles are helpful (a sketch of such a layout follows the list):
• Values communicated between caller and callee are generally placed at the beginning
of the callee's activation record, so they are as close as possible to the caller's
activation record.
• Fixed-length items are generally placed in the middle. These include the control link,
the access link, and the machine status fields.
• Items whose size may not be known early enough are placed at the end of the
activation record. The most common example is a dynamically sized array, where
the value of one of the callee's parameters determines the length of the array.
• We must locate the top-of-stack pointer judiciously. A common approach is to
have it point to the end of the fixed-length fields in the activation record. Fixed-length
data can then be accessed by fixed offsets, known to the intermediate-code
generator, relative to the top-of-stack pointer.
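A C sketch of a general activation record laid out according to these principles; the field
names and sizes are illustrative only, since actual layouts are machine- and
compiler-specific:

/* Illustrative activation-record layout, ordered as in the principles
   above: communicated values first, fixed-length status fields in the
   middle, variable-length data at the end. */
struct activation_record {
    long  return_value;         /* value passed back to the caller   */
    long  actual_params[4];     /* values communicated by the caller */
    void *control_link;         /* points to the caller's record     */
    void *access_link;          /* for access to non-local names     */
    long  saved_machine_status; /* return address, saved registers   */
    long  locals[8];            /* fixed-length local data           */
    long  temporaries[];        /* variable-length area at the end   */
};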
Heap allocation parcels out pieces of contiguous storage, as needed for activation
records or other objects. Pieces may be deallocated in any order, so over time the heap
will consist of alternating areas that are free and in use.
Fig. 2.10 Records for live activations need not be adjacent in heap
• The record for an activation of procedure r is retained when the activation ends.
• Therefore, the record for the new activation q(1 , 9) cannot follow that for s physically.
• If the retained activation record for r is deallocated, there will be free space in the heap
between the activation records for s and q.
Symbol table is an important data structure used in a compiler. The symbol table is used to
store information about the occurrence of various entities such as objects, classes, variable
names, interfaces, function names etc. It is used by both the analysis and synthesis phases.
Operations
The symbol table provides the following operations:
insert()
The insert() operation is used more frequently in the analysis phase, when the tokens are
identified and names are stored in the table. It is used to insert information into the
symbol table, such as a unique name occurring in the source code.
lookup()
In the symbol table, lookup() operation is used to search a name. It is used to determine:
• The existence of symbol in the table.
• The declaration of the symbol before it is used.
• Check whether the name is used in the scope.
• Initialization of the symbol.
• Checking whether the name is declared multiple times.
The basic format of the lookup() function is as follows:
lookup (symbol)
This format varies according to the programming language. A minimal sketch of both
operations follows.
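A minimal C sketch of the two operations, assuming a chained hash table with a single type
attribute per entry (real symbol tables store many more attributes; strdup is POSIX):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

struct entry {
    char *name;          /* the identifier itself           */
    int   type;          /* one attribute, e.g. a type code */
    struct entry *next;  /* chain for hash collisions       */
};

static struct entry *table[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* lookup(): search a name; NULL means "not declared". */
struct entry *lookup(const char *name)
{
    for (struct entry *e = table[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* insert(): add a new name together with its attributes. */
struct entry *insert(const char *name, int type)
{
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    e->name = strdup(name);   /* strdup is POSIX, not ISO C */
    e->type = type;
    e->next = table[h];       /* push onto the bucket's chain */
    table[h] = e;
    return e;
}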
INTRODUCTION
The code produced by the straight forward compiling algorithms can often be made to
run faster or take less space, or both. This improvement is achieved by program
transformations that are traditionally called optimizations. Compilers that apply code-
improving transformations are called optimizing compilers.
3. The transformation must be worth the effort. It does not make sense for a compiler
writer to expend the intellectual effort to implement a code-improving transformation
and have the compiler expend the additional time compiling source programs if this
effort is not repaid when the target programs are executed. "Peephole"
transformations of this kind are simple enough and beneficial enough to be included
in any compiler.
Function-Preserving Transformations
There are a number of ways in which a compiler can improve a program without
changing the function it computes.
Function preserving transformations examples:
• Common sub expression elimination
• Copy propagation,
• Dead-code elimination
• Constant folding
The other transformations come up primarily when global optimizations are performed.
Common sub-expression elimination:
Frequently, a program will include several calculations of the same value, such as an offset
into an array. Some of the duplicate calculations cannot be avoided by the programmer
because they lie below the level of detail accessible within the source language.
Copy Propagation:
Assignments of the form f := g are called copy statements, or copies for short. The idea
behind the copy-propagation transformation is to use g for f, wherever possible after the
copy statement f := g. Copy propagation means the use of one variable instead of another.
This may not appear to be an improvement, but as we shall see it gives us an opportunity to
eliminate x in the example below.
For example:
x = Pi;
A = x * r * r;
The optimization using copy propagation can be done as follows:
A = Pi * r * r;
Here the variable x is eliminated.
Dead-Code Eliminations:
A variable is live at a point in a program if its value can be used subsequently;
otherwise, it is dead at that point. A related idea is dead or useless code, statements that
compute values that never get used. While the programmer is unlikely to introduce any dead
code intentionally, it may appear as the result of previous transformations.
Example:
i = 0;
if (i == 1)
{
    a = b + 5;
}
Here, the 'if' statement is dead code because the condition i == 1 will never be satisfied.
Loop Optimizations:
In loops, especially in the inner loops, programs tend to spend the bulk of their time.
The running time of a program may be improved if the number of instructions in an inner
loop is decreased, even if we increase the amount of code outside that loop.
Three techniques are important for loop optimization:
• Code motion, which moves code outside a loop (see the sketch after this list);
• Induction-variable elimination, which we apply to eliminate redundant induction
variables from inner loops;
• Reduction in strength, which replaces an expensive operation by a cheaper one, such
as a multiplication by an addition.
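A classic code-motion illustration, sketched in C (process() is a hypothetical placeholder
for the loop body's real work):

int process(int i);   /* placeholder for the loop body */

void before_motion(int i, int limit)
{
    /* Before: the loop-invariant expression limit - 2 is
       re-evaluated on every iteration of the test. */
    while (i <= limit - 2)
        i = process(i);
}

void after_motion(int i, int limit)
{
    /* After code motion: the invariant computation is hoisted
       out of the loop and performed exactly once. */
    int t = limit - 2;
    while (i <= t)
        i = process(i);
}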
Induction Variables:
Loops are usually processed inside out. For example consider the loop around B3.
Note that the values of j and t4 remain in lock-step; every time the value of j decreases by 1,
that of t4 decreases by 4 because 4*j is assigned to t4. Such identifiers are called induction
variables.
When there are two or more induction variables in a loop, it may be possible to get rid
of all but one, by the process of induction-variable elimination. For the inner loop around B3
in Fig. 5.5.1 we cannot get rid of either j or t4 completely; t4 is used in B3 and j in B4.
However, we can illustrate reduction in strength and a part of the process of
induction-variable elimination. Eventually j will be eliminated when the outer loop B2–B5
is considered.
Example:
As the relationship t4 := 4*j surely holds after such an assignment to t4, and t4 is
not changed elsewhere in the inner loop around B3, it follows that just after the statement
j := j-1 the relationship t4 := 4*j-4 must hold. We may therefore replace the assignment
t4 := 4*j by t4 := t4-4. The only problem is that t4 does not have a value when we enter
block B3 for the first time. Since we must maintain the relationship t4 = 4*j on entry to the
block B3, we place an initialization of t4 at the end of the block where j itself is
initialized, shown by the dashed addition to block B1 in Fig. 5.6.1.
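In three-address form the transformation just described looks roughly as follows (t4, j and
the block names follow the figure; the surrounding code is elided):

Before (inside the loop around B3):
    j  := j - 1
    t4 := 4 * j
    t5 := a[t4]

After strength reduction (with t4 := 4*j placed at the end of B1):
    j  := j - 1
    t4 := t4 - 4
    t5 := a[t4]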
Reduction in Strength:
Reduction in strength replaces expensive operations by equivalent cheaper ones on
the target machine. Certain machine instructions are considerably cheaper than others and can
often be used as special cases of more expensive operators. For example, x² is invariably
cheaper to implement as x*x than as a call to an exponentiation routine. Fixed-point
multiplication or division by a power of two is cheaper to implement as a shift. Floating-point
division by a constant can be implemented as multiplication by a constant, which may be
cheaper.
PEEPHOLE OPTIMIZATION
A statement-by-statement code-generations strategy often produces target code that
contains redundant instructions and suboptimal constructs. The quality of such target code
can be improved by applying “optimizing” transformations to the target program.
A simple but effective technique for improving the target code is peephole
optimization, a method for trying to improve the performance of the target program by
examining a short sequence of target instructions (called the peephole) and replacing these
instructions by a shorter or faster sequence, whenever possible.
The peephole is a small, moving window on the target program. The code in the
peephole need not be contiguous, although some implementations do require this. It is
characteristic of peephole optimization that each improvement may spawn opportunities for
additional improvements.
Unreachable Code:
Another opportunity for peephole optimizations is the removal of unreachable
instructions. An unlabeled instruction immediately following an unconditional jump may be
removed. This operation can be repeated to eliminate a sequence of instructions. For
example, consider the source fragment:
#define debug 0
....
if ( debug ) {
    print debugging information
}
In the intermediate representation the if-statement may be translated as:
    if debug = 1 goto L1
    goto L2
L1: print debugging information
L2: ............................    (a)
One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter what
the value of debug, (a) can be replaced by:
    if debug ≠ 1 goto L2
    print debugging information
L2: ............................    (b)
As debug is set to 0 at the beginning of the program, constant propagation replaces (b) by:
    if 0 ≠ 1 goto L2
    print debugging information
L2: ............................    (c)
As the argument of the first statement of (c) evaluates to a constant true, it can be
replaced by goto L2. Then all the statements that print debugging aids are manifestly
unreachable and can be eliminated one at a time.
Flow-of-Control Optimizations:
The unnecessary jumps can be eliminated in either the intermediate code or the target
code by the following types of peephole optimizations. We can replace the jump sequence
    goto L1
    ....
L1: goto L2                         (d)
by the sequence
    goto L2
    ....
L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the statement
L1: goto L2, provided it is preceded by an unconditional jump. Similarly, the sequence
    if a < b goto L1
    ....
L1: goto L2                         (e)
can be replaced by
    if a < b goto L2
    ....
L1: goto L2                         (f)
While the number of instructions in (e) and (f) is the same, we sometimes skip the
unconditional jump in (f), but never in (e). Thus (f) is superior to (e) in execution time.
Algebraic Simplification:
There is no end to the amount of algebraic simplification that can be attempted
through peephole optimization. Only a few algebraic identities occur frequently enough that
it is worth considering implementing them. For example, statements such as
x := x+0 or
x := x * 1
Reduction in Strength:
As described earlier under loop optimization, reduction in strength replaces expensive
operations by equivalent cheaper ones on the target machine: x² is invariably cheaper to
implement as x*x than as a call to an exponentiation routine, fixed-point multiplication or
division by a power of two is cheaper to implement as a shift, and floating-point division by
a constant can be implemented as multiplication by a constant. Typical peephole replacements
are:
x² → x * x
i := i + 1 → i++
i := i - 1 → i--
1. Structure-Preserving Transformations:
The primary structure-preserving transformations on basic blocks are:
• Common sub-expression elimination
• Dead-code elimination
• Renaming of temporary variables
• Interchange of two independent adjacent statements
2. Algebraic Transformations:
The compiler writer should examine the language specification carefully to determine
what rearrangements of computations are permitted, since computer arithmetic does not
always obey the algebraic identities of mathematics. Thus, a compiler may evaluate x*y-x*z
as x*(y-z) but it may not evaluate a+(b-c) as (a+b)-c.
Consider, for example, the source code for the dot product of two vectors: if the second and
fourth expressions of its basic block compute the same expression, the block can be
transformed so that the common expression is computed only once.
d) Interchange of statements:
Suppose a block has the following two adjacent statements:
t1 : = b + c
t2 : = x + y
We can interchange the two statements without affecting the value of the block if and only if
neither x nor y is t1 and neither b nor c is t2.
2. Algebraic transformations:
Algebraic transformations can be used to change the set of expressions computed by a basic
block into an algebraically equivalent set.
Examples:
i. x := x + 0 or x := x * 1 can be eliminated from a basic block without changing the set
of expressions it computes.
ii. The exponentiation statement x := y ** 2 can be replaced by x := y * y.
Flow Graphs
• Flow graph is a directed graph containing the flow-of-control information for the set
of basic blocks making up a program.
• The nodes of the flow graph are basic blocks. It has a distinguished initial node.
• E.g.: Flow graph for the vector dot product is given as follows:
• B1 is the initial node. B2 immediately follows B1, so there is an edge from B1 to B2.
The target of the jump from the last statement of B1 is the first statement of B2, so
there is an edge from B1 (last statement) to B2 (first statement).
• B1 is the predecessor of B2, and B2 is a successor of B1.
Loops
A loop is a collection of nodes in a flow graph such that
1. All nodes in the collection are strongly connected.
2. The collection of nodes has a unique entry.
• A loop that contains no other loops is called an inner loop.
DATA FLOW ANALYSIS
It is the analysis of the flow of data in a control flow graph, i.e., the analysis that
determines the information regarding the definition and use of data in a program. With the
help of this analysis, optimization can be done. In general, it is a process in which values
are computed using data flow analysis. The data flow property represents information which
can be used for optimization.
Basic Terminologies –
Definition Point: a point in a program containing some definition.
Reference Point: a point in a program containing a reference to a data item.
Advantage –
• It is used to eliminate common sub expressions.
Example –
Advantage –
• It is used in constant and variable propagation.
Live variable – A variable is said to be live at some point p if along some path starting at
p the variable is used before it is redefined; otherwise it is dead at that point.
Example –
Advantage –
• It is useful for register allocation.
• It is used in dead code elimination.
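A small three-address sketch of the definition (assuming the names are not used after the
last line shown):

    a := b + c        (b and c are live just before this statement)
    d := a + 1        (a is live from its definition above up to this use)
    a := 5            (a is dead between the previous use and this point:
                       it is redefined here before any further use)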
Busy Expression – An expression is busy along a path iff its evaluation occurs along that
path and none of its operands is defined before that evaluation along the path.
Advantage –
• It is used for performing code movement optimization.
5. Register allocation
• Instructions involving register operands are shorter and faster than those involving
operands in memory. The use of registers is subdivided into two subproblems:
1. Register allocation – the set of variables that will reside in registers at each
point in the program is selected.
2. Register assignment – the specific register in which each such variable will
reside is picked.
3. Certain machines require even-odd register pairs for some operands and results.
For example, consider a division instruction of the form D x, y where x, the
dividend, is held in the even register of an even/odd register pair and y is the
divisor; after the division, the even register holds the remainder and the odd
register holds the quotient.
6. Evaluation order
• The order in which the computations are performed can affect the efficiency of the
target code. Some computation orders require fewer registers to hold intermediate
results than others
• Familiarity with the target machine and its instruction set is a prerequisite for
designing a good code generator.
• The target computer is a byte-addressable machine with 4 bytes to a word.
• It has n general-purpose registers, R0, R1, . . . , Rn-1.
• It has two-address instructions of the form:
op source, destination where, op is an op-code, and source and destination are
data fields. It has the following op-codes.
For example: MOV R0, M stores contents of Register R0 into memory location M.
Instruction costs:
• Instruction cost = 1+cost for source and destination address modes. This cost
corresponds to the length of the instruction.
• Address modes involving registers have cost zero.
• Address modes involving memory location or literal have cost one.
• Instruction length should be minimized if space is important. Doing so also minimizes
the time taken to fetch and perform the instruction.
For example: MOV R0, R1 copies the contents of register R0 into R1. It has cost one, since
it occupies only one word of memory.
• The three-address statement a := b + c can be implemented by many different instruction
sequences:
i.  MOV b, R0
    ADD c, R0
    MOV R0, a        cost = 6
ii. MOV b, a
    ADD c, a         cost = 6
iii. Assuming R0, R1 and R2 contain the addresses of a, b, and c:
    MOV *R1, *R0
    ADD *R2, *R0     cost = 2
(Each instruction in (i) costs 1 + 1 + 0 = 2, giving 6 in total; in (ii) each costs
1 + 1 + 1 = 3; in (iii) each costs 1 + 0 + 0 = 1, since all operands are in registers.)
• In order to generate good code for target machine, we must utilize its addressing
capabilities efficiently.
Static allocation
Implementation of call statement:
The codes needed to implement static allocation are as follows:
MOV #here + 20, callee.static_area /*It saves return address*/
GOTO callee.code_area /*It transfers control to the target code for the called
procedure */
where,
callee.static_area - Address of the activation record
callee.code_area - Address of the first instruction for called procedure
#here + 20 - Literal return address which is the address of the instruction
following GOTO.
Implementation of return statement:
GOTO *callee.static_area /* It transfers control to the return address saved at the
beginning of the activation record */
Stack allocation
Static allocation can become stack allocation by using relative addresses for storage in
activation records. In stack allocation, the position of an activation record is not known
until run time; this position is stored in a register, so words in the activation record
can be accessed as offsets from the value in this register.
The codes needed to implement stack allocation are as follows:
Initialization of stack:
MOV #stackstart , SP /* initializes stack */
Code for the first procedure
HALT /* terminate execution */
• If the name in a register is no longer needed, then we remove the name from the register
and the register can be used to store some other names.
5. YACC Builds up
a) SLR parsing Table b) Canonical LR parsing Table
c) LALR parsing Table d) None of the mentioned
Answer: c
Explanation: YACC provides a general tool for describing the input to a computer program;
the parser YACC builds uses an LALR parsing table.
6. Object program is a
a) Program written in machine language
b) Translated into machine language
c) Translation of high-level language into machine language
d) None of the mentioned
Answer: c
Explanation: An object program is produced when a compiler or assembler translates a source
program into machine language.
1. In a single pass assembler, most of the forward references can be avoided by putting the
restriction
a) On the number of strings/literals
b) Code segment to be defined after data segment
c) On unconditional jumps
d) None of the mentioned
Answer: b
Explanation: A single pass assembler scans the program only once and creates the equivalent
binary program.
7. Pass I
a) Assign address to all statements b) Save the values assigned to all labels for use in pass 2
c) Perform some processing d) All of the mentioned
Answer: d
Explanation: In pass 1 of an assembler, all of the above-mentioned functions are performed.
8. Which table is a permanent database that has an entry for each terminal symbol?
a) Terminal Table b) Literal Table c) Identifier Table d) None of the mentioned
Answer: a
Explanation: The terminal table is a permanent database that has an entry for each terminal
symbol, such as arithmetic operators, keywords, and punctuation characters like ';' and ',';
its fields include the name of the symbol.
SHORT QUESTIONS
S. No   Short Questions                                                       CO Addressing   Blooms level   Marks
1       What is activation record?                                            3               1              2
2       What is Code optimization?                                            5               1              2
3       What is Code generator?                                               5               1              2
4       List common methods for associating actual and formal parameters?     3               1              2
5       Define back patching?                                                 3               1              2
6       List different data structures used for symbol table?                 4               1              2
7       Write the steps to search an entry in the hash table?                 4               1              2
8       Write general activation record?                                      3               1              2
9       What are storage allocation strategies?                               3               1              2
10      What is storage organization?                                         3               1              2
11      List the principle sources of optimization?                           5               1              2
12      List 3 areas of code optimization?                                    5               1              2
13      List the local optimization techniques?                               5               1              2
14      List the loop optimization techniques?                                5               1              2
15      List the global optimization techniques?                              5               1              2
16      Define local optimization?                                            5               1              2
17      Define constant folding?                                              5               1              2
18      List the advantages of the organization of code optimizer?            5               2              2
19      Define Common Sub expressions?                                        5               1              2
20      Explain Dead Code elimination?                                        5               1              2
21      Define Reduction in strength?                                         5               1              2
22      Define peephole optimization?                                         5               1              2
23      List the different data flow properties?                              5               1              2
24      Explain inner loops?                                                  5               2              2
25      Define flow graph?                                                    5               2              2
LONG QUESTIONS
S. No   Long Questions                                                        CO Addressing   Blooms level   Marks
1       Explain in detail about the storage organization                      3               2              10
2       Explain in detail about the storage allocation strategies             3               2              10
3       Define and explain in detail about the symbol table                   4               2              10
4       What are Activation Records? Explain in detail                        3               2              10
5       What is Code Optimization? Explain principle sources of optimization  5               2              10
6       Explain Local optimization in detail                                  5               2              10
7       Explain peephole optimization                                         5               2              10
8       Explain Loop optimization in detail                                   5               2              10
9       Discuss about the following: I. Copy propagation
        II. Dead code elimination III. Code motion                            5               2              10
10      Explain different schemes of storing name attribute in symbol table   4               2              10
11      Explain various Global optimization techniques                        5               2              10
SOLUTION
i. The statement is false, since global variables are required for recursion with static
storage. This is due to the unavailability of a stack in static storage.
ii. This is true.
iii. In dynamic allocation a heap structure is used, so it is false.
iv. False, since recursion can be implemented.
v. The statement is completely true.
So only II & V are true.
Hence (A) is the correct option.
3. What data structure in a complier is used for managing information about variables
and their attributes?
(A) Abstract syntax tree (B) Symbol table
(C) Semantic stack (D) Parse table
SOLUTION
Symbol table is used for storing the information about variables and their attributes by
compiler.
Hence (B) is correct option.
SOLUTION
Dynamic memory allocation is maintained by the heap data structure. So, to allow dynamic
data structures, a heap is required.
Hence (C) is the correct option.
7. Some code optimizations are carried out on the intermediate code because
(A) they enhance the portability of the compiler to other target processors
(B) program analysis is more accurate on intermediate code than on machine code
(C) the information from dataflow analysis cannot otherwise be used for optimization
(D) the information from the front end cannot otherwise be used for optimization
Answer: (A)
Explanation: Option (B) is also true, but the main purpose of carrying out some code
optimizations on the intermediate code is to enhance the portability of the compiler to
other target processors. So option (A) is more suitable here.
10. Which of the following class of statement usually produces no executable code when
compiled?
(A) Declaration
(B) Assignment statements
(C) Input and output statements
(D) Structural statements
Answer: (A)
PART – B
Answer any FIVE questions
All questions carry equal marks
5 x 10
1. a) With a neat sketch of diagram explain about phases of compiler.
b) Discuss about cousins of compiler.
2. Construct DFA directly from regular expression (a|b)*abb(a|b)*(a|b).
3. Consider the grammar
S → ( L ) | a
L → L , S | S
Eliminate left recursion from the grammar and construct predictive parser for the grammar and
show the behaviour of the parser for the sentence (a,(a,a)).
4. Consider the following grammar
E → E + T | T
T → T * F | F
F → F * | a | b
Construct SLR parsing tables for this grammar
5. Consider the grammar
E → E1 + T
E → E1 - T
E → T
T → ( E )
T → num
Write translation scheme (top down translation) for the grammar and eliminate left recursion
from the translation scheme
6. Translate the expression (a+b)*(c+d)+(a+b+c) into
a) Quadruples b) triples c) indirect triples
7. Use the code generation algorithm to generate code for the following C program
main()
{
    int i;
    int a[10];
    while (i <= 10)
        a[i] = 0;
}
8. Discuss about principle sources of optimization.
*
PART – B
Answer any FIVE questions
All questions carry equal marks
7 x 10
1. Explain different phases of compiler by showing the output of each phase using
following statement, float m,n; m=m*55 + n + 3;
2. Briefly explain about loaders and linkers.
3. Discuss top-down parsing and bottom-up parsing with suitable examples.
4. Explain shift-reduce parsing with appropriate example.
5. Write a syntax directed translation scheme to construct a syntax tree. Explain with an
example.
6. What is a type expression? Explain equivalence of type expressions with an example.
7. Compare different storage allocation strategies.
8. Write short notes on peephole optimization techniques.
*
PART – B
Answer any FIVE questions
All questions carry equal marks
8 x 10
1. a) What kind of source program errors would be detected during lexical analysis?
b) Explain the reasons for separating lexical analysis phase from syntax analysis.
2. Eliminate ambiguities from the following grammar.
S → iEtSeS | iEtS | a
E → b | c | d
where a, b, c, d, e, i, t are terminals.
3. Suppose that the type of each identifier is a subrange of integers. For expressions with
the operators +, -, *, div and mod, as in Pascal, write type checking rules that assign to
each subexpression the subrange its value must lie in.
4. Discuss and analyze about all the allocation strategies in run-time storage environment.
5. What are the legal evaluation orders and names for the values at the nodes of the DAG
for the following?
d=b+c
b=b c
a=c–d
e=a+b
6. What is a flow graph? Explain how flow graph can be constructed for a given program.
7. Write a procedure for constructing deterministic finite automata from a non-
deterministic automata, explain with an example.
8. What is bottom up parsing? Explain various bottom up parsing techniques.
*
PART – B
Answer any FIVE questions
All questions carry equal marks
9 x 10
1. What is the procedure for recognizing tokens? Explain.
2. What are the conditions to be satisfied for LL(1) grammar?
3. Differentiate between LL and LR parsers.
4. With the help of an example, explain how the syntax trees will be constructed?
5. List and explain the rules for type checking.
6. What are the different dynamic storage allocation techniques? Explain.
7. What is the importance of Dag representation? Explain in detail.
8. Draw the DAG for the following expressions:
a=b+c
b=a–d
c=b+c
d=a–d
*
PART – B
Answer any FIVE questions
All questions carry equal marks
10 x 10
1. Explain the various phases of a compiler.
2. a) Draw an NFA that accepts (a|b)* a
b) Explain the role of Transition Diagrams in the construction of a lexical analyzer.
3. Generate SLR parsing table for the following grammar
E → E + T | T
T → T * F | F
F → a | b
4. a) Discuss Static Versus Dynamic Storage Allocation strategies.
b) Explain about Activation Records.
5. a) Explain Three Address Code with an example. [4]
b) Explain Flow Graph with an example block of code. [6]
6. Explain optimization of basic blocks in code generation.
7. a) Briefly explain the logical structure of a compiler front end. [4]
b) Give DAG representation scheme for the following expression. [6]
((a-b)*c)–d
8. Write about:
a) Copy Propagation [3]
b) Dead-Code Elimination [3]
c) Type Conversions [4]
*
PART – B
Answer any FIVE questions
All questions carry equal marks
5 x 10
1. What is the importance of a lexical analyser? Explain in detail.
2. Differentiate between lexical analysis and syntactic analysis.
3. List and explain the steps involved in constructing SLR table.
4. With the help of an example, explain how to evaluate an SDD at the nodes of a parse
tree?
5. Explain the concept of type conversion.
6. List and explain the storage allocation strategies.
7. Write short notes on:
a) Assignment statements
b) Boolean expressions
8. How to generate code from a given DAG expression?
*
PART – B
Answer any FIVE questions
All questions carry equal marks
11 x 10 = 50
1. What is compiler? Explain the phases of a compiler.
2. Write about the role of the lexical analyzer.
3. Briefly explain "L-attributed definitions" in bottom-up evaluation of inherited attributes.
4. Explain type conversions with a suitable example.
5. Explain briefly storage allocation techniques.
6. Explain Boolean expressions.
7. List and discuss the issues in the design of a code generator.
8. Write about Code-improving Transformations.
*
PART – B
Answer any FIVE questions
All questions carry equal marks
12 x 10 = 50
1. Give a brief note on analysis of the source program.
2. Briefly explain specification of tokens.
3. Explain Bottom-Up parsing with an example.
4. Write about Symbol tables.
5. Explain briefly storage organization.
6. Explain three-address code and its types.
7. Write about backpatching.
8. Briefly explain loops in flow graphs.
*
1. Compiler - A program that reads a program written in one language and translates it into
an equivalent program in another language.
2. Analysis part - Breaks up the source program into constituent pieces and creates an
intermediate representation of the source program.
3. Synthesis part - Constructs the desired target program from the intermediate
representation.
4. Structure editor - Takes as input a sequence of commands to build a source program.
Pretty printer - analyses a program and prints it in such a way that the structure of the
program becomes clearly visible.
5. Static checker - Reads a program, analyses it and attempts to discover potential bugs
without running the program.
6. Linear analysis - This is the phase in which the stream of characters making up the
source program is read from left to right and grouped into tokens, which are sequences of
characters having a collective meaning.
7. Hierarchical analysis - This is the phase in which characters or tokens are grouped
hierarchically into nested collections with collective meaning.
8. Semantic analysis - This is the phase in which certain checks are performed to ensure
that the components of a program fit together meaningfully.
9. Loader - Is a program that performs the two functions: Loading and Link editing
10. Loading - Taking relocatable machine code, altering the relocatable address and placing
the altered instructions and data in memory at the proper locations.
11. Link editing - Makes a single program from several files of relocatable machine code.
12. Preprocessor - Produces input to compilers and expands macros into source language
statements.
13. Symbol table - A data structure containing a record for each identifier, with fields for the
attributes of the identifier. It allows us to find the record for each identifier quickly and to
store or retrieve data from that record quickly.
14. Assembler - A program which translates assembly language into machine language.