QA of Compiler
1. Define parser.
Ans. A parser (syntax analyzer) takes the stream of tokens produced by the lexical analyzer and groups them into grammatical phrases according to the grammar, typically building a parse tree. Several grammar issues affect how a parser can be built:
Ambiguity:- Ambiguity can make it difficult to translate from concrete syntax to
abstract syntax, and can affect how an expression is interpreted.
Left recursion:- Left recursion makes top-down (predictive) parsing impossible without
rewriting the grammar, because the parser would recurse forever on the left-recursive
nonterminal.
Left factoring:- Left factoring rewrites productions that share a common prefix
(A → αβ | αγ becomes A → αA', A' → β | γ) so that a predictive parser can choose
an alternative with a single token of lookahead.
Ans. Separating lexical and syntax analyzer makes the parser simpler and more efficient, and
improves the portability of the compiler:
Simplicity:-Separating the details of lexical analysis from the syntax analyzer makes
the syntax analyzer smaller and less complex.
Efficiency:-Separating the lexical and syntax analyzers allows for optimization of the
lexical analyzer.
Portability:- Input-device-specific peculiarities can be confined to the lexical
analyzer, so the parser itself remains portable.
4. Define CFG?/what do you mean by CFG?
Ans. A context-free grammar (CFG) is a formal system that defines the structure of a
language by describing how to form sentences:
Definition:- A CFG is a set of rules that specify how to replace symbols or groups of
symbols with other symbols.
Use:-CFGs are used to describe programming languages, natural languages, and
parser programs in compilers.
For example, a CFG for the language of palindromes might include the rules
P → ǫ, P → 0, P → 1, P → 0P0, and P → 1P1.
Components: A CFG is defined by four tuples: G = (V, T, P, S):
V: The set of nonterminal symbols, represented by capital letters
T: The set of terminal symbols, represented by lowercase letters
P: The set of production rules, which replace nonterminal symbols with
strings of terminals and/or nonterminals
S: The start symbol, which is used to derive the string
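The palindrome CFG above can be turned directly into a recognizer. Below is a minimal sketch in C of a recursive recognizer for P → ǫ | 0 | 1 | 0P0 | 1P1; the function names are invented, and the code assumes its input contains only 0s and 1s.

```c
/* Recursive recognizer for the palindrome CFG
   P -> e | 0 | 1 | 0P0 | 1P1 (illustrative sketch). */
#include <string.h>

/* Returns 1 if s[lo..hi) can be derived from P, else 0. */
static int derives_P(const char *s, int lo, int hi) {
    if (hi - lo <= 1)            /* P -> e, P -> 0, P -> 1 */
        return 1;
    if (s[lo] == s[hi - 1])      /* P -> 0P0 or P -> 1P1 */
        return derives_P(s, lo + 1, hi - 1);
    return 0;
}

int in_language(const char *s) {
    return derives_P(s, 0, (int)strlen(s));
}
```

Each recursive call consumes one production from the grammar, so the recursion mirrors a derivation.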
5. Define the terms language translator and compiler
Ans. Language translator:- A program that translates a program written in one language
into a program in another language. The translator preserves the functional or logical
structure of the original code.
Compiler:- A type of language translator that converts a high-level language program into
machine code. Compilers translate the entire source code into machine code in one step.
Here are some other types of language translators:
Interpreter: Translates high-level languages into machine code one line at a time.
Assembler: Converts assembly language code (a low-level symbolic language) into machine
code.
Language translators are used because computers only understand machine language, which
is made up of 0s and 1s.
6. What is a flow graph? Explain with example.
Ans. A flow graph is a directed graph that represents the flow of control among the basic
blocks of a program. It shows how program control is passed among the blocks.
Here's how a flow graph works: nodes represent basic blocks, and edges represent the
possible control-flow paths between them. Flow graphs are used to represent the logic of a program.
There are also other types of graphs that are related to flow graphs, including:
Signal-flow graph:-A directed graph that represents a set of linear algebraic equations. The
nodes represent variables, and the branches represent the coefficients relating the variables.
Data flow graph:-A representation of a machine-language program. It consists of nodes
called actors or operators connected by directed arcs or links. The arcs contain tokens or
data values.
Ans. The primary object code forms are relocatable object code and absolute object
code. Relocatable code is the most common, since it allows flexibility in memory
placement during linking; absolute code is directly executable at a fixed memory
location.
Intermediate Object Code: Some compilers may generate an intermediate representation
of the program before producing final object code, facilitating optimization and code
generation.
Object File Formats:- Different operating systems and compilers may utilize different
object file formats (like COFF, ELF, PE) to store object code, including additional
information like symbol tables and debugging data.
Ans. An abstract syntax tree (AST) and a directed acyclic graph (DAG) are both used to
represent intermediate code, but they differ in how they represent the structure of a program:
Abstract syntax tree (AST):- A simplified parse tree that retains the
syntactic structure of code. An AST is a tree representation of the
abstract syntactic structure of text written in a formal language. Each
node in the tree denotes a construct occurring in the text.
Directed acyclic graph (DAG):- A graphical representation of symbolic
expressions in which any two provably equal expressions share a single
node. A DAG is a directed graph with no directed cycles. DAGs are
built like trees, except that operands which are reused are linked to
a single shared node.
DAGs and ASTs are both essential techniques used in compiler design
to optimize and translate source code into machine code or other intermediate
representations.
9. Define left recursion. Is the following grammar left recursive? E → E + E | E * E | a | b
Ans. A grammar is left-recursive if and only if there exists a nonterminal symbol that can
derive a sentential form with itself as the leftmost symbol. The given grammar is left
recursive: in E → E + E and E → E * E, the nonterminal E appears as the leftmost symbol
of its own right-hand side.
Ans. Hashing is a technique that converts a key or string of characters into a shorter, fixed-
length value. This process is used to make it easier to find or use the original string. Hashing
is used in a variety of applications, including:
Hash tables: Hashing is commonly used to set up hash tables, which are array-based
structures that store key-value pairs.
Digital forensics and data security: Hashing algorithms, such as Message Digest 5 (MD5)
and Secure Hashing Algorithm (SHA) 1 and 2, are used in these fields.
Universities and libraries: Hashing is used to assign unique identifiers to students and
books, which can then be used to retrieve information about them.
Database management systems:- Hashing can be used to calculate the direct position of a
data record on a disk without using an index structure.
Hashing works by using a hash function to generate a new value based on a mathematical
algorithm
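As a sketch of such a mathematical algorithm, here is a djb2-style string hash in C, one common (but not the only) choice for symbol tables; the table size 101 is an arbitrary illustrative prime.

```c
/* djb2-style string hash: h = h * 33 + c for each character. */
unsigned long djb2_hash(const char *key) {
    unsigned long h = 5381;
    for (; *key; key++)
        h = h * 33 + (unsigned char)*key;
    return h;
}

/* Map the hash to a slot in a table of 101 buckets (illustrative size). */
unsigned long bucket_of(const char *key) {
    return djb2_hash(key) % 101;
}
```

The modulo step is what turns an arbitrary-size key into a fixed, direct position, as in the database example above.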
12. What do you mean by activation record?
Ans. An activation record is a contiguous block of storage that
manages information required by a single execution of a procedure.
When you enter a procedure, you allocate an activation record, and
when you exit that procedure, you de-allocate it. Basically, it stores
the status of the current activation of the procedure. So, whenever a
function call occurs, a new activation record is created and pushed
onto the top of the stack. It remains on the stack until the execution
of that function finishes. Once the procedure is completed and control
returns to the calling function, its activation record is popped off
the stack.
If a procedure is called, an activation record is pushed into the stack,
and it is popped when the control returns to the calling function.
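The layout of an activation record is machine- and compiler-dependent; the C struct below is only an illustrative sketch of typical fields, together with a toy push/pop discipline mirroring procedure entry and exit. All names are invented.

```c
/* Illustrative fields of an activation record (layout varies by
   compiler and machine; names here are invented). */
struct activation_record {
    int   actual_params[4];   /* arguments passed by the caller     */
    void *return_address;     /* where to resume in the caller      */
    void *control_link;       /* caller's record (dynamic link)     */
    void *access_link;        /* for non-local names (static link)  */
    int   saved_registers[4]; /* machine state to restore           */
    int   locals[8];          /* local variables of the procedure   */
};

static struct activation_record stack_area[16];  /* toy run-time stack */
static int sp = 0;                               /* stack pointer      */

/* Procedure entry: push a new activation record. */
struct activation_record *proc_enter(void) { return &stack_area[sp++]; }

/* Procedure exit: pop the record of the finished activation. */
void proc_exit(void) { sp--; }

int current_depth(void) { return sp; }
```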
13. Give the full form and definition of DAG?
Ans . refer Q.8
14. What is Intermediate code?
Ans. Intermediate code is a machine-independent representation of a program generated
during the compilation process, acting as a bridge between the high-level source code and
the machine code, allowing for easier optimization and portability across different computer
architectures; essentially, it's a simplified version of the original program that is easier to
manipulate before being translated into the final executable code for a specific machine.
Machine-independent: Unlike machine code, which is specific to a particular processor,
intermediate code can be processed without considering the target hardware.
Used in compilers: A compiler generates intermediate code during the translation process,
allowing for optimizations to be performed on this representation before generating the final
machine code.
Different forms: Intermediate code can be represented in various formats like syntax trees,
three-address code (TAC), or quadruples, depending on the compiler design.
Benefits: Portability: Enables code to be compiled and run on different machines with
minimal changes.
Optimization: Provides a convenient level of abstraction for applying optimizations before
generating machine code.
Code analysis: Facilitates static analysis and error detection during compilation
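As a concrete illustration of one such form, the C fragment below models quadruples for the statement a = b + c; the struct and field names are assumptions for the sketch, not a standard layout.

```c
/* Quadruple form of intermediate code: (op, arg1, arg2, result).
   Field names are illustrative. */
struct quad {
    const char *op, *arg1, *arg2, *result;
};

/* a = b + c  becomes:  t1 := b + c ; a := t1 */
static const struct quad code[] = {
    { "+", "b",  "c", "t1" },
    { "=", "t1", "",  "a"  },
};

int quad_count(void) { return (int)(sizeof code / sizeof code[0]); }
```

Because each quadruple names its result explicitly, later passes can reorder or rewrite instructions without re-parsing expressions.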
15. What is input buffering?
Ans. Input buffering is an important concept in compiler design that refers to the way in
which the compiler reads input from the source code. In many cases, the compiler reads
input one character at a time, which can be a slow and inefficient process. Input buffering is
a technique that allows the compiler to read input in larger chunks, which can improve
performance and reduce overhead.
Buffer Pairs:- Because of the amount of time taken to process characters and the large
number of characters that must be processed during the compilation of a large source
program, specialized buffering techniques have been developed to reduce the amount of
overhead required to process a single input character. An important scheme involves two
buffers that are alternately reloaded
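A minimal sketch of the buffer-pair idea in C: each half of the buffer ends with a sentinel ('\0' here), and the halves are reloaded alternately. Reading from a string stands in for reading from a source file, and the half-buffer size is deliberately tiny for illustration.

```c
/* Sketch of the two-buffer ("buffer pairs") scheme. Each half ends
   with a sentinel so the lexer tests for a reload only when it sees
   one. Input comes from a string instead of a file, purely for
   illustration. */
#include <string.h>

#define HALF 8                      /* illustrative half-buffer size */

static char buf[2 * HALF + 2];      /* two halves, each with a sentinel */
static const char *src;             /* stands in for the source file */

/* Reload one half of the buffer and terminate it with a '\0' sentinel. */
static int reload(int half) {
    int n = 0;
    char *dst = buf + half * (HALF + 1);
    while (n < HALF && *src)
        dst[n++] = *src++;
    dst[n] = '\0';                  /* sentinel marks end of this half */
    return n;
}

/* Count characters delivered to the lexer across reloads. */
int scan_all(const char *source) {
    int count = 0, half = 0, n;
    src = source;
    while ((n = reload(half)) > 0) {
        count += n;                 /* a real lexer would tokenize here */
        half = 1 - half;            /* switch to the other half */
    }
    return count;
}
```

The benefit is that the lexer checks for "end of buffer" only at a sentinel character, not on every advance of the input pointer.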
PART-B
1. Define LL(1) grammar. Is the following grammar LL(1)?
G: S → iEtS | iEtSe | a ; E → b
Also write the rules for computing FIRST() & FOLLOW().
Ans. A grammar is LL(1) if it can be parsed top-down, scanning the input Left to right,
producing a Leftmost derivation, with 1 token of lookahead; equivalently, for every pair of
productions A → α | β, FIRST(α) and FIRST(β) are disjoint (and if β derives ε, FIRST(α)
must also be disjoint from FOLLOW(A)).
Rules for computing FIRST(X):
1. If X is a terminal, FIRST(X) = {X}.
2. If X → ε is a production, add ε to FIRST(X).
3. If X → Y1 Y2 … Yk, add FIRST(Y1) (minus ε); if Y1 can derive ε, also add FIRST(Y2), and so on.
Rules for computing FOLLOW(A):
1. Place $ in FOLLOW(S), where S is the start symbol.
2. For a production B → αAβ, everything in FIRST(β) except ε is in FOLLOW(A).
3. For B → αA, or B → αAβ where β derives ε, everything in FOLLOW(B) is in FOLLOW(A).
For G: FIRST(S) = {i, a}, FIRST(E) = {b}; FOLLOW(S) = {$, e}, FOLLOW(E) = {t}.
The grammar is not LL(1): the productions S → iEtS and S → iEtSe both begin with i, so
the parsing-table entry M[S, i] holds two productions (the classic dangling-else conflict).
2. What is LALR(1) grammar ? construct LALR parsing table for the following grammar
SAA, AaA/b
Ans. Definitions are at the top of the YACC input file. They include header files and any
information about the regular definitions or tokens. We define tokens using the % sign,
whereas code specific to C, such as the header file inclusions, is placed
between %{ and %}.
Some examples are as follows:
%token ID
{% #include <stdio.h> %}
Rules are placed between %% and %%. They define the actions taken when we scan the
tokens. They execute whenever a token in the input stream matches the grammar. Any
action in C is placed between curly brackets ({}).
Auxiliary routines includes functions that we may require in the rules section. Here, we
write a function in regular C syntax. This section includes the main() function, in which
the yyparse() function is always called.
The yyparse() function reads the tokens, performs the actions, and returns to the main when it
reaches the end of the file or when an error occurs. It returns 0 if the parsing is successful,
and 1 if it is unsuccessful.
Example
The following code is an example of a YACC program of a simple calculator taking two
operands. The .y extension file contains the YACC code for the parser generation, which
uses the .l extension file that includes the lexical analyzer.
file.y
Lexfile.l
1 %{
2 #include <ctype.h>
3 #include <stdio.h>
4 int yylex();
5 void yyerror();
6 int tmp=0;
7 %}
8
9 %token num
10 %left '+' '-'
11 %left '*' '/'
12 %left '(' ')'
13
14 %%
15
16 line :exp {printf("=%d\n",$$); return 0;};
17
18 exp :exp '+' exp {$$ =$1+$3;}
19 | exp '-' exp {$$ =$1-$3;}
20 | exp '*' exp {$$ =$1*$3;}
21 | exp '/' exp {$$ =$1/$3;}
22 | '(' exp ')' {$$=$2;}
23 | num {$$=$1;};
24
25 %%
26
27 void yyerror(){
28 printf("The arithmetic expression is incorrect\n");
29 tmp=1;
30 }
31 int main(){
32 printf("Enter an arithmetic expression(can contain +,-,*,/ or parenthesis):\n");
33 yyparse();
34 }
35
An example of YACC and Lex code for a calculator
Explanation
In the file.y YACC file:
Lines 1–7: We initialize the header files with the function definitions.
Lines 9–12: We initialize the tokens for the grammar.
Lines 16–23: We define the grammar used to build the calculator.
Lines 27–34: We define an error function with the main function.
In lexfile.l Lex file:
Lines 1–7: The header files are initialized along with the YACC file included as a header
file.
Lines 11–18: The regular definition of the expected tokens is defined such as the
calculator input will contain digits 0-9.
Execution
To execute the code, we need to type the following commands in the terminal below:
We type lex lexfile.l and press "Enter" to compile the Lex file.
We type yacc -d yaccfile.y and press "Enter" to compile the YACC file.
We type gcc -Wall -o output lex.yy.c y.tab.c and press "Enter" to generate the C files for
execution.
We type ./output to execute the program.
Ans. Symbol Tables:- A symbol table is a data structure used by compilers and interpreters to
store information about variables, functions, and other identifiers in a program. Efficient
organization is crucial for fast lookups during compilation or interpretation.
Several strategies exist for organizing symbol tables, each with trade-offs in terms of space
and time complexity:
Linear List: A simple approach using an array or linked list. Searching is linear (O(n));
insertion and deletion are relatively easy. Suitable for small programs but inefficient for
large ones.
Hash Table: Uses a hash function to map identifiers to table entries. Average-case
search, insertion, and deletion are O(1), but the worst case can be O(n) (if many collisions
occur). Requires careful choice of hash function to minimize collisions.
Binary Search Tree (BST): Organizes symbols in a tree structure, allowing for
efficient searching (O(log n) on average), insertion, and deletion. However, performance
degrades to O(n) if the tree becomes unbalanced. Self-balancing BSTs (like AVL trees
or red-black trees) maintain balance, ensuring O(log n) performance in all cases.
Trie: A tree-like structure where each node represents a character. Efficient for prefix-
based searches (finding all identifiers starting with a given prefix). Space consumption
can be high.
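Of these organizations, a hash table with separate chaining is the most common in practice; the C sketch below is a minimal illustrative version (fixed-size names, an invented hash constant and table size, no deletion).

```c
/* Minimal chained hash table for a symbol table (illustrative sketch). */
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 31

struct sym {
    char name[32];
    int  addr;            /* e.g., a run-time address or type tag */
    struct sym *next;     /* collision chain */
};

static struct sym *table[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Search the chain in the identifier's bucket. */
struct sym *lookup(const char *name) {
    struct sym *p;
    for (p = table[hash(name)]; p; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Insert at the head of the bucket's chain. */
struct sym *insert(const char *name, int addr) {
    unsigned h = hash(name);
    struct sym *p = calloc(1, sizeof *p);
    strncpy(p->name, name, sizeof p->name - 1);
    p->addr = addr;
    p->next = table[h];
    table[h] = p;
    return p;
}
```

Chaining keeps insertion O(1) and degrades gracefully under collisions, which is why most compiler symbol tables use some variant of it.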
8. Describe bootstrapping in detail.
Ans. Bootstrapping is the technique of using a compiler for a subset of a language to
compile a compiler for the full language, eventually producing a compiler written in (and
for) the language itself. Three languages characterize any compiler:
1. Source Language
2. Target Language
3. Implementation Language
The steps are:
1. Create a compiler SCAA for a subset S of the desired language L, written in language A;
this compiler runs on machine A.
2. Write a compiler LCSA for the full language L in the subset language S, producing code
for machine A.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler
for language L, which runs on machine A and produces code for machine
A.
Ans. A basic block is a straight-line code sequence with no branches in except to the entry
and no branches out except at the end. A basic block is a set of statements that always execute
one after another, in sequence. The first task is to partition a sequence of three-address code
into basic blocks. A new basic block begins with the first instruction, and instructions are added
until a jump or a label is met. In the absence of a jump, control moves consecutively from one
instruction to the next. The idea is standardized in the algorithm below:
Algorithm: Partitioning three-address code into basic blocks.
Input: A sequence of three-address instructions.
Process: The instructions of the intermediate code that are leaders are determined. The following
rules are used for finding leaders:
1. The first three-address instruction of the intermediate code is a leader.
2. Instructions that are targets of unconditional or conditional jump/goto statements are leaders.
3. Instructions that immediately follow unconditional or conditional jump/goto statements are
leaders.
For each leader, its basic block consists of the leader itself and all instructions up to, but not
including, the next leader.
Basic blocks are sequences of instructions in a program that have no branches except at the entry and
exit. Example 1:
The following sequence of three-address statements forms a basic block:
t1 := a*a
t2 := a*b
t3 := 2*t2
t4 := t1+t3
t5 := b*b
t6 := t4 +t5
A three address statement x:= y+z is said to define x and to use y and z. A name in a basic block is said
to be live at a given point if its value is used after that point in the program, perhaps in another basic
block.
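The leader rules can be sketched in C over a toy instruction encoding (a single jump_target field, -1 meaning "not a jump"); the encoding is invented for illustration, and conditional and unconditional jumps are not distinguished.

```c
/* Mark leaders per the three rules; leaders[i] is set to 1 if
   instruction i starts a basic block. */
struct instr { int jump_target; };   /* -1 if not a jump */

void find_leaders(const struct instr *code, int n, int *leaders) {
    int i;
    for (i = 0; i < n; i++) leaders[i] = 0;
    if (n > 0) leaders[0] = 1;                 /* rule 1: first instruction */
    for (i = 0; i < n; i++) {
        if (code[i].jump_target >= 0) {
            leaders[code[i].jump_target] = 1;  /* rule 2: jump target */
            if (i + 1 < n) leaders[i + 1] = 1; /* rule 3: after a jump */
        }
    }
}
```

Once leaders are marked, each block simply runs from one leader up to (but not including) the next.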
Structure-Preserving Transformations:
Dead code is defined as that part of the code that never executes during the program run.
For optimization, such dead code is eliminated. Code that is never executed still costs
translation time and space, so eliminating it increases the speed of the program, and the
compiler does not have to translate the dead code.
Example:
// Program with dead code
int main()
{
    int x = 2;
    if (x > 2)
        cout << "code";          // Dead code: the condition is never true
    else
        cout << "Optimization";
    return 0;
}
// Optimized program, dead code eliminated
int main()
{
    int x = 2;
    cout << "Optimization";
    return 0;
}
Countless algebraic transformations can be used to change the set of expressions computed by a basic block
into an algebraically equivalent set. Some of the algebraic transformation on basic blocks includes:
1. Constant Folding
2. Copy Propagation
3. Strength Reduction
1. Constant Folding:
Solve the constant terms which are continuous so that compiler does not need to solve this expression.
Example:
x = 2 * 3 + y ⇒ x = 6 + y (Optimized code)
2. Copy Propagation:
It is of two types: Variable Propagation and Constant Propagation.
Variable Propagation:
x = y; z = x + 2 ⇒ z = y + 2 (Optimized code)
Constant Propagation:
x = 3; z = x + a ⇒ z = 3 + a (Optimized code)
3. Strength Reduction:
Replace expensive statement/ instruction with cheaper ones.
x = 2 * y (costly) ⇒ x = y + y (cheaper)
x = 2 * y (costly) ⇒ x = y << 1 (cheaper)
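Constant folding can be sketched as a bottom-up pass over an expression tree, as in this minimal C version (the node encoding is invented for illustration; only + and * are handled):

```c
#include <stdlib.h>

/* Expression-tree nodes: 'n' = number leaf, 'v' = variable leaf,
   otherwise a binary operator ('+' or '*'). Encoding is invented. */
struct expr {
    char op;
    int  val;              /* meaningful when op == 'n' */
    struct expr *l, *r;
};

struct expr *num(int v) {
    struct expr *e = calloc(1, sizeof *e);
    e->op = 'n'; e->val = v; return e;
}
struct expr *make_var(void) {
    struct expr *e = calloc(1, sizeof *e);
    e->op = 'v'; return e;
}
struct expr *bin(char op, struct expr *l, struct expr *r) {
    struct expr *e = calloc(1, sizeof *e);
    e->op = op; e->l = l; e->r = r; return e;
}

/* Fold constants bottom-up: "num OP num" becomes a single num node,
   so 2 * 3 + y is rewritten to 6 + y. */
struct expr *fold(struct expr *e) {
    if (e->op == 'n' || e->op == 'v') return e;   /* leaves are done */
    e->l = fold(e->l);
    e->r = fold(e->r);
    if (e->l->op == 'n' && e->r->op == 'n')
        return num(e->op == '+' ? e->l->val + e->r->val
                                : e->op == '*' ? e->l->val * e->r->val : 0);
    return e;
}
```

Building x = 2 * 3 + y as bin('+', bin('*', num(2), num(3)), make_var()) and folding it yields the tree for 6 + y, matching the example above.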
11. Construct a DAG for the basic block whose code is given below:-
D := B * C
E := A + B
B := B * C
A := E – D
Ans. Since the first and third statements compute B * C from the same original values of B
and C, they share a single * node in the DAG: both D and the redefined B become labels of
that node. E labels the + node with children A and the original B, and the redefined A labels
the − node whose children are the + node (E) and the * node (D).
Ans. The final phase in compiler model is the code generator. It takes as input an
intermediate representation of the source program and produces as output an equivalent
target program. The code generation techniques presented below can be used whether or not
an optimizing phase occurs before code generation.
Fig. 4.1 Position of code generator
1. Input to code generator: The input to the code generation consists of the intermediate
representation of the source program produced by front end, together with information in the
symbol table to determine run-time addresses of the data objects denoted by the names in the
intermediate representation.
• Intermediate representation can be :
a. Linear representation such as postfix notation
b. Three address representation such as quadruples
c. Virtual machine representation such as stack machine code
d. Graphical representations such as syntax trees and dags.
• Prior to code generation, the front end must have scanned, parsed and translated the
source program into intermediate representation, along with necessary type checking.
Therefore, input to code generation is assumed to be error-free.
2. Target program: The output of the code generator is the target program. The output may be:
a. Absolute machine language. It can be placed in a fixed memory location and can be
executed immediately.
b. Relocatable machine language. It allows subprograms to be compiled separately.
c. Assembly language. Code generation is made easier.
3. Memory management:
• Names in the source program are mapped to addresses of data objects in run-time memory by
the front end and code generator.
• It makes use of symbol table, that is, a name in a three-address statement refers to a symbol-
table entry for the name.
• Labels in three-address statements have to be converted to addresses of instructions. For
example, j: goto i generates a jump instruction as follows:
* If i < j, a backward jump instruction with target address equal to the location of the code
for quadruple i is generated.
* If i > j, the jump is forward. We must store, on a list for quadruple i, the location of the first
machine instruction generated for quadruple j. When i is processed, the machine locations of
all instructions that jump forward to i are filled in.
4. Instruction selection:
• The instructions of target machine should be complete and uniform.
• Instruction speeds and machine idioms are important factors when efficiency of target
program is considered.
• The quality of the generated code is determined by its speed and size.
• For example, the three-address statements in (a) can be translated into the target code in (b)
as shown below:
(a)
a := b + c
d := a + e
(b)
MOV b, R0
ADD c, R0
MOV R0, a
MOV a, R0
ADD e, R0
MOV R0, d
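The translation pattern for x := y + z can be written as a tiny emitter in C; the output format simply mirrors the MOV/ADD listing style used in these notes, and a real code generator would track register contents (so d := a + e could reuse R0) instead of reloading for every statement.

```c
#include <stdio.h>

/* Emit target code for "x := y + z" in the pattern
   MOV y,R0 / ADD z,R0 / MOV R0,x. Returns the number of characters
   written. This naive version reloads R0 for every statement. */
int emit_add(char *out, const char *x, const char *y, const char *z) {
    return sprintf(out, "MOV %s, R0\nADD %s, R0\nMOV R0, %s\n", y, z, x);
}
```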
5. Register allocation
• Instructions involving register operands are shorter and faster than those involving operands in
memory. The use of registers is subdivided into two subproblems :
1. Register allocation - the set of variables that will reside in registers at a point in the program is
selected.
2. Register assignment - the specific register in which each such variable will reside is picked.
3. Certain machines require even-odd register pairs for some operands and results. For example,
consider a division instruction of the form D x, y, where x, the dividend, occupies the even
register of an even/odd register pair and y is the divisor; after the division, the even register
holds the remainder and the odd register holds the quotient.
6. Evaluation order
• The order in which the computations are performed can affect the efficiency of the target code.
Some computation orders require fewer registers to hold intermediate results than others.
13. What is the basic task of scanning? What are the difficulties found in delimiter oriented
scanning? How can this be removed?
Ans. Syntax-directed translation (SDT) is a set of productions that have semantic rules
embedded inside them. Syntax-directed translation helps in the semantic analysis phase of
the compiler. An SDT has semantic actions along with the productions in the grammar. The
notes below cover postfix SDTs and postfix translation schemes with a parser-stack
implementation, SDTs with actions inside the production, eliminating left recursion from an
SDT, and SDTs for L-attributed definitions.
The syntax-directed translation which has its semantic actions at the end of the production is
called the postfix translation scheme.
This type of translation of SDT has its corresponding semantics at the last in the RHS of the
production.
SDTs which contain the semantic actions at the right ends of the production are
called postfix SDTs.
Example of Postfix SDT
S ⇢ A#B{S.val = A.val * B.val}
A ⇢B@1{A.val = B.val + 1}
B ⇢num{B.val = num.lexval}
Postfix SDTs are implemented when the semantic actions are at the right end of the production
and with the bottom-up parser(LR parser or shift-reduce parser) with the non-terminals having
synthesized attributes.
The parser stack contains the record for the non-terminals in the grammar and their
corresponding attributes.
The non-terminal symbols of the production are pushed onto the parser stack.
If the attributes are synthesized and semantic actions are at the right ends then attributes of
the non-terminals are evaluated for the symbol in the top of the stack.
When the reduction occurs at the top of the stack, the attributes are available in the stack,
and after the action occurs these attributes are replaced by the corresponding LHS non-
terminal and its attribute.
Now, the LHS non-terminal and its attributes are at the top of the stack.
Production
A ⇢ BC{A.str = B.str . C.str}
B ⇢a {B.str = a}
C ⇢b{C.str = b}
Initially, the parser stack (top at the right):
  B      C       ← non-terminals
  B.str  C.str   ← synthesized attributes
              ⇡
        top of stack
After the reduction by A ⇢ BC, B and C with their attributes are replaced by A and its
attribute. Now the stack:
  A       ← non-terminal
  A.str   ← synthesized attribute
       ⇡
  top of stack
When the semantic actions are present anywhere on the right side of the production then it
is SDT with action inside the production.
It is evaluated and actions are performed immediately after the left non-terminal is processed.
This type of SDT includes both S-attributed and L-attributed SDTs.
If the SDT is parsed by a bottom-up parser, actions are performed immediately after a
non-terminal appears at the top of the parser stack.
If the SDT is parsed by a top-down parser, actions are performed before the expansion of
the non-terminal, or when the terminal is checked against the input.
Example of SDT with action inside the production
S ⇢ A +{print '+'} B
A ⇢ {print 'num'}B
B ⇢ num{print 'num'}
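The parser-stack mechanism described above can be mimicked with an explicit value stack in C: each shift pushes a token's attribute, and each reduction pops the RHS attributes and pushes the synthesized attribute of the LHS. Evaluating a postfix expression this way is an illustrative stand-in for an LR parser's semantic stack, not a full parser.

```c
#include <ctype.h>

/* Evaluate a postfix expression over single digits with + and *.
   The value stack plays the role of the parser stack's synthesized
   attributes: a reduction pops the RHS values and pushes the LHS value. */
int eval_postfix(const char *s) {
    int stack[64], top = 0;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[top++] = *s - '0';       /* "shift": push attribute */
        } else {
            int r = stack[--top];          /* "reduce": pop RHS attributes */
            int l = stack[--top];
            stack[top++] = (*s == '+') ? l + r : l * r;
        }
    }
    return stack[0];                       /* attribute of the start symbol */
}
```

When the input is exhausted, the single value left on the stack is the synthesized attribute of the start symbol, exactly as in the A.str example above.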
Ans do self
The term LR(k) parser: here L refers to left-to-right scanning, R refers to the
rightmost derivation in reverse, and k refers to the number of unconsumed "look ahead"
input symbols used in making parsing decisions. Typically, k is 1 and is often
omitted. A context-free grammar is called LR(k) if an LR(k) parser exists for it. The parser
reduces the input from the left; reading the sequence of reductions in reverse order yields
a rightmost derivation.
1. The stack is empty, and we are looking to reduce by the augmented rule S' → S.
2. A "." in a rule represents how much of the rule is already on the stack.
3. A dotted item, or simply an item, is a production rule with a dot indicating how much of
the RHS has so far been recognized. The closure of an item is used to see which production
rules can be used to expand the current structure. It is calculated as follows:
Rules for LR parser: The rules of the LR parser are as follows.
1. The first item from the given grammar rules adds itself to the first closed set.
2. If an item of the form A → α . B β is present in the closure, where the symbol after
the dot is a non-terminal B, add the production rules of B with the dot preceding the
first symbol.
3. Repeat step 2 for all newly added items.
LR parser algorithm: The LR parsing algorithm is the same for every LR parser; only the
parsing table differs from parser to parser. It consists of the following components:
1. Input buffer - It contains the given string, and it ends with a $ symbol.
2. Stack - It holds grammar symbols and parser states.
3. Parsing table - The ACTION and GOTO tables that drive shift and reduce decisions.
4. Output - The sequence of reductions performed (the parse).
(LR parser diagram)
17. Classify the errors and discuss the errors in each phase of compiler.
See Question 9 of Part A.
18. What is symbol table? Write the procedure to store the names in symbol table.
Ans. Do self
19. Explain bottom up parsing?
Ans.
20. Write short note on global data flow analysis
Ans.
PART-C
1. Writing short notes on
a. Nesting depth & access link
b. Data structures used in symbol table
c. Static versus dynamic storage allocation
2. What is LEX? Discuss the usage of LEX in Lexical Analyzer generation
3. Generate the three address code for the following code fragment
while(a > b)
{
if( c > d)
x = y + z;
else
x = y – z;
}
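A possible translation, in the three-address style used elsewhere in these notes (the label names L1, L2, L3, Lnext are invented):

```
L1:    if a > b goto L2
       goto Lnext
L2:    if c > d goto L3
       t2 := y - z          ; else branch
       x := t2
       goto L1
L3:    t1 := y + z          ; then branch
       x := t1
       goto L1
Lnext: ...
```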
4. Explain the different storage allocation strategies.
5. Explain the following terms:-
i. Register descriptor
ii. Address descriptor
iii. Instruction costs
6. Consider the following grammar G :-
E → E + T | T
T → T F | F
F → F * | a | b
a. Construct the SLR parsing table for this grammar.
b. Construct the LALR parsing table.
7. Define syntax directed definition. Explain the various forms of syntax directed definition.
8. Translate the arithmetic expression:-
(a + b) * (c + d) + (a + b + c) into
a. Syntax tree
b. Three address code
c. Quadruple
d. Triples
9. Consider the following basic block and then construct the DAG for it
t1 = a + b
t2 = c + d
t3 = e – t2
t4 = t1 - t3
10. Explain different storage allocation strategies
11. Consider the following LL(1) grammar describing a certain sort of nested lists:-
S → TS | Ꜫ
T → U.T | U
U → x | y | [S]
i. Left factor this grammar.
ii. Give the FIRST and FOLLOW sets for each non-terminal in the grammar obtained in part (i).
iii. Using this information, construct an LL parsing table for the grammar obtained in
part (i).
12. (a) Calculate the canonical collection of sets of LR(0) items for the grammar given below:-
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
Ans.
(b) Calculate the canonical collection of sets of LR(1) items for the grammar given below:-
S' → S
S → CC
C → cC | d
13. For the assignment statement X = (a + b) * (c + d), construct the translation scheme and
an annotated parse tree. Also, differentiate between ‘call by value’ and ‘call by reference’
with example.
14. Explain peephole optimization in detail .
15. Explain the symbol table management system in detail.
16. Explain parsing techniques with a hierarchical diagram.
17. Construct the syntax tree and postfix notation for the following expression:
(a + (h * c)^ d – e / (f + g ))
18. What is a common subexpression and how is it eliminated? Explain with the help of an
appropriate example.