Compiler Design Unit I 2025

The document outlines the principles of compiler design, including the history, necessity, and various types of compilers. It details the compilation process, phases of compilation, and features of compilers, as well as error detection and reporting mechanisms. Additionally, it discusses the analysis and synthesis model, symbol-table management, and common semantic errors encountered during compilation.


Principles of compiler design

III Year D Section


Preliminaries Required
• Basic knowledge of programming languages.
• Basic knowledge of context-free grammars (CFGs).
• Knowledge of a high-level programming language for the programming assignments.

Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
"Compilers: Principles, Techniques, and Tools",
Pearson Education, 2014.
History of Compilers
Important landmarks in the history of compilers:
•The word "compiler" was first used in the early 1950s by Grace Murray Hopper.
•The first compiler was built by John Backus and his group at IBM between 1954 and 1957.
•COBOL was the first programming language to be compiled on multiple platforms, in 1960.
•Scanning and parsing issues were studied throughout the 1960s and 1970s to provide complete solutions.
Why Compilers?
• Machine structures became too complex, and software management too difficult, to continue with low-level languages.

• A compiler is a program that translates from one language to another.

• It must preserve the semantics of the source.

• It should create an efficient version of the program in the target language.
COMPILATION

• Translating high-level code to machine code


– Accepts source code as input and returns machine code
– Compilation phases include:
• Lexical Analysis (Scanner)
– Groups the program’s characters into “tokens”
• Parser (Syntax)
– Analyses the grammatical structure of the program
• Intermediate code generator (Semantics)
– Generates correct intermediate code
• Optimizer
– Improves code execution efficiency and memory requirements
• Code generator
– Produces the object file containing the machine code
COMPILERS
A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.

source program → COMPILER → target program
(normally a program written in        (normally the equivalent program in
a high-level programming language)    machine code: a relocatable object file)
              ↓
       error messages
COMPILATION VS. INTERPRETATION

Compiler:
• Takes the entire program as input.
• Intermediate object code is generated.
• Conditional control statements execute faster.
• Memory requirement is more (since object code is generated).
• The program need not be compiled every time.
• Errors are displayed after the entire program is checked.
• Programming languages like C and C++ use compilers.

Interpreter:
• Takes a single instruction as input.
• No intermediate object code is generated.
• Conditional control statements execute slower.
• Memory requirement is less.
• Every time, the higher-level program is converted into a lower-level program.
• Errors are displayed for every instruction interpreted (if any).
• Programming languages like Python and Ruby use interpreters.
Features of Compilers
•Correctness
•Speed of compilation
•Preserving the correct meaning of the code
•Speed of the target code
•Recognizing legal and illegal program constructs
•Good error reporting/handling
•Code debugging help
Types of Compiler
•Native code compiler
•Cross Compiler
•Bootstrap compiler
•Single Pass Compilers
•Two Pass Compilers
•Multipass Compilers
•JIT Compiler
•Source to source compiler
Native code compiler
A compiler may produce binary output to run on the same computer and operating system on which it runs. This type of compiler is called a native code compiler.

Cross Compiler
A cross compiler is a compiler that runs on one machine and produces object code for another machine.

Bootstrap compiler
A compiler that has been implemented in its own language is called a bootstrap compiler, or self-hosting compiler.
Single Pass Compiler

source code → Compiler → Target code

In a single-pass compiler, the source code is transformed directly into machine code in one pass. Pascal, for example, was designed to permit single-pass compilation.
Two Pass Compiler

Source code → Front End → IR → Back End → Target code

A two-pass compiler is divided into two sections:

Front end: maps legal source code into an Intermediate Representation (IR).

Back end: maps the IR onto the target machine.

The two-pass method also simplifies the retargeting process and allows multiple front ends.
Multipass Compilers

Source code → Front End → IR → Middle End → IR → Back End → Machine code
                                                  ↓ Errors

The multipass compiler processes the source code or syntax tree of a program several times.

It divides a large program into multiple small passes and processes them, developing multiple intermediate codes.

Each pass takes the output of the previous pass as its input, so the compiler requires less memory at any one time.

A multipass compiler is also known as a 'wide compiler'.

Example: GCC, Turbo C++


JIT Compiler
A just-in-time (JIT) compiler compiles code at run time; this approach is used for the Java programming language and Microsoft .NET.

Source to source compiler
A source-to-source compiler takes a high-level language as input and produces a high-level language as output. Example: OpenMP source-to-source translators.

List of compiler
1. Ada compiler
2. ALGOL compiler
3. BASIC compiler
4. C# compiler
5. C compiler
6. C++ compiler
7. COBOL compiler
8. Smalltalk compiler
9. Java compiler
COMPILATION PROCESS
Other Applications
• In addition to the development of a compiler, the techniques used in compiler design are applicable to many problems in computer science.

– Techniques used in a lexical analyzer can be used in text editors, information retrieval systems, code analysis tools, security tools, and pattern recognition programs.
– Techniques used in a parser can be used in a query processing system such as SQL.
– Many of the techniques used in compiler design can be used in Natural Language Processing (NLP) systems.
Major Parts of Compilers
There are two major parts of a compiler:
Analysis and Synthesis
• In the analysis phase, an intermediate representation is created from the given source program.
– The Lexical Analyzer, Syntax Analyzer, and Semantic Analyzer are the parts of this phase.
• In the synthesis phase, the equivalent target program is created from this intermediate representation.
– The Intermediate Code Generator, Code Generator, and Code Optimizer are the parts of this phase.
Analysis-Synthesis Model
• Compilation: Analysis & Synthesis
• Analysis:
– Break source program into pieces
– Intermediate representation
– Hierarchical structure: syntax tree
• Node: operation
• Leaf: arguments
• Synthesis: construct target program from
tree
The Analysis-Synthesis Model of Compilation:

source code → Front End → IR → Back End → machine code
                        ↓ errors
Analysis and Synthesis

D = A + B * C        (A, B, C, D are variables; + and * are operators)

Analysis produces the Intermediate Representation:
T1 = B * C
T2 = A + T1
D = T2

Synthesis produces the machine code:
Load R1,B
Load R2,C
MUL R1,R2
Store T1,R1
Load R1,A
Load R2,T1
ADD R1,R2
Store T2,R1
Store D,T2
Phases of A Compiler

Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program

• Each phase transforms the source program from one representation into another representation.

• The phases communicate with the error handlers.

• The phases communicate with the symbol table.
The Phases of A Compiler
Lexical Analyzer
• Lexical Analyzer reads the source program
character by character and returns the tokens of
the source program.
• A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters, and so on).
• Puts information about identifiers into the symbol
table.
• Regular expressions are used to describe tokens
(lexical constructs).
• A (Deterministic) Finite State Automaton can be
used in the implementation of a lexical analyzer.
Lexical Analysis
• Linear analysis: lexical analysis, scanning
❖ e.g., position := initial + rate * 60
1. Identifier position
2. Assignment symbol ":="
3. Identifier initial
4. "+" sign
5. Identifier rate
6. "*" sign
7. Number 60
Operator → + | - | * | %

relop → < | <= | = | <>| > | >=

id → letter (letter|digit)*

num → digit+ (.digit+)? (E(+|-)?digit+ )?

delim → blank | tab | newline

ws → delim+
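The token classes above can be recognized by a small hand-written scanner. Below is a minimal sketch in C, assuming simplified patterns (identifiers, unsigned integers, the ":=" symbol, and single-character operators); the function name next_token and the token codes are illustrative, not from the text:

```c
#include <ctype.h>
#include <string.h>

enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_OP, TOK_EOF };

/* Scan one token starting at s[*pos]; copy its text into lexeme
   and return its kind.  A sketch of the id / num / operator
   patterns on the slide, not a full lexical analyzer. */
int next_token(const char *s, int *pos, char *lexeme) {
    int i = *pos, j = 0;
    while (s[i] == ' ' || s[i] == '\t' || s[i] == '\n') i++;   /* ws -> delim+ */
    if (s[i] == '\0') { *pos = i; lexeme[0] = '\0'; return TOK_EOF; }
    if (isalpha((unsigned char)s[i])) {            /* id -> letter (letter|digit)* */
        while (isalnum((unsigned char)s[i])) lexeme[j++] = s[i++];
        lexeme[j] = '\0'; *pos = i; return TOK_ID;
    }
    if (isdigit((unsigned char)s[i])) {            /* num -> digit+ (no fraction here) */
        while (isdigit((unsigned char)s[i])) lexeme[j++] = s[i++];
        lexeme[j] = '\0'; *pos = i; return TOK_NUM;
    }
    if (s[i] == ':' && s[i + 1] == '=') {          /* assignment symbol ":=" */
        strcpy(lexeme, ":="); *pos = i + 2; return TOK_ASSIGN;
    }
    lexeme[0] = s[i]; lexeme[1] = '\0'; *pos = i + 1;  /* single-char operator */
    return TOK_OP;
}
```

Scanning the running example "position := initial + rate * 60" with this sketch yields exactly the seven tokens listed above.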
Syntax Analyzer
• A Syntax Analyzer creates the syntactic
structure (generally a parse tree) of the
given program.

• A syntax analyzer is also called a parser.


• A parse tree describes a syntactic structure.

• Hierarchical analysis:
– Group tokens into grammatical phrases
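Grouping tokens into grammatical phrases can be sketched with a recursive-descent parser: one function per grammar rule. A minimal example in C for arithmetic expressions (the grammar and function names are illustrative, not from the text; for brevity it evaluates the input rather than building an explicit parse tree):

```c
#include <ctype.h>

/* Recursive-descent parser for the grammar
     expr   -> term  (('+'|'-') term)*
     term   -> factor (('*'|'/') factor)*
     factor -> number | '(' expr ')'
   Each function mirrors one rule, so the call structure
   follows the hierarchical (syntactic) structure of the input. */
static const char *p;               /* cursor into the input string */

static int expr(void);              /* forward declaration */

static int factor(void) {
    if (*p == '(') { p++; int v = expr(); if (*p == ')') p++; return v; }
    int v = 0;
    while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
    return v;
}

static int term(void) {
    int v = factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int r = factor();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

static int expr(void) {
    int v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int r = term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

int parse_eval(const char *src) { p = src; return expr(); }
```

Because term is nested below expr, "2+3*4" is grouped as 2+(3*4): operator precedence falls out of the grammar's hierarchy.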
Syntax Analysis
Semantic Analysis
• Checks for semantic errors
• Gathers type information for code generation
• Uses the hierarchical structure to identify operators and operands
• Performs type checking
– e.g., using a real number to index an array (an error)
– Type conversion
e.g., inttoreal(60) if initial is a real number
Semantic Analysis
Intermediate Code Generation
• Represents the source program for an abstract machine
• Should be easy to produce and easy to translate into the target program
• Three-address code (consists of a sequence of instructions, each of which has at most three operands)
– temp2 := id3 * temp1
– every memory location can act like a register
Intermediate Code Generation
• It has several properties:
1. Each instruction has at most one operator in addition to the assignment.
2. The compiler must generate a temporary name to hold the value computed by each instruction.
3. Some three-address instructions have fewer than three operands.

temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
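The sequence above can be executed literally to check its meaning: one operator per instruction, one temporary per result. A sketch in C, with id1, id2, and id3 standing for position, initial, and rate as in the running example (the function name is illustrative):

```c
/* Three-address code for  position := initial + rate * 60,
   executed literally: each temporary holds one intermediate value. */
double run_three_address(double id2, double id3) {
    double temp1 = (double)60;    /* temp1 := inttoreal(60) */
    double temp2 = id3 * temp1;   /* temp2 := id3 * temp1   */
    double temp3 = id2 + temp2;   /* temp3 := id2 + temp2   */
    double id1   = temp3;         /* id1   := temp3         */
    return id1;
}
```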
Code Optimization
• Improves the intermediate code
• Produces faster-running machine code
– temp1 := id3 * 60.0
  id1 := id2 + temp1
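The step shown, replacing inttoreal(60) with the constant 60.0 and eliminating a temporary, is an instance of constant folding: evaluating at compile time any operation whose operands are all constants. A minimal sketch over a toy instruction form (the struct layout and names are illustrative assumptions, not from the text):

```c
/* Toy three-address instruction: result := lhs op rhs, where an
   operand is either a literal constant or a name.  fold() rewrites
   an instruction whose operands are both constants into a plain
   constant assignment, a minimal form of constant folding. */
typedef struct {
    int    is_const;      /* 1 if the operand is a literal */
    double value;         /* valid when is_const           */
    char   name[8];       /* valid when !is_const          */
} Operand;

typedef struct {
    char    op;           /* '+', '-', '*', or '=' (plain copy) */
    Operand lhs, rhs;     /* rhs unused when op is '='          */
} Instr;

void fold(Instr *ins) {
    if (ins->op == '=' || !ins->lhs.is_const || !ins->rhs.is_const)
        return;                              /* nothing to fold */
    double v = 0;
    switch (ins->op) {
        case '+': v = ins->lhs.value + ins->rhs.value; break;
        case '-': v = ins->lhs.value - ins->rhs.value; break;
        case '*': v = ins->lhs.value * ins->rhs.value; break;
    }
    ins->op = '=';                           /* becomes: result := v */
    ins->lhs.is_const = 1;
    ins->lhs.value = v;
}
```

A real optimizer applies many such rewrites over the whole intermediate code, but each one is this simple in principle.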
Code Generation
• Generates relocatable machine code or assembly code
– MOVF id3, R2
  MULF #60.0, R2
  MOVF id2, R1
  ADDF R2, R1
  MOVF R1, id1
Translation of A Statement
Symbol-table Management
• To record the identifiers in source program
– Identifier is detected by lexical analysis and then is
stored in symbol table
• To collect the attributes of identifiers
(not by lexical analysis)
– Storage allocation : memory address
– Types
– Scope (where it is valid, local or global)
– Arguments (in case of procedure names)
• Number and types of arguments
• Call by reference or by address
• Return types
Symbol-table Management
• Semantic analysis uses type information to check the type consistency of identifiers
• Code generation uses storage-allocation information to generate proper relocatable address code
Symbol table
Example

int a, b; float c; char z;

Symbol Name   Type    Address
a             int     1000
b             int     1002
c             float   1004
z             char    1008

(assuming a 2-byte int, a 4-byte float, and a 1-byte char)
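The addresses follow from the declaration order and the type sizes: each symbol gets the next free address, advanced by the size of its type. A sketch of this storage-allocation step (the sizes of 2, 4, and 1 bytes for int, float, and char match the example above but are assumptions, not universal; real compilers take sizes and alignment from the target machine):

```c
#include <string.h>

/* One symbol-table entry: name, type, and the storage-allocation
   attribute (address) that the compiler records for it. */
typedef struct {
    char name[16];
    char type[8];
    int  address;
} Symbol;

static int type_size(const char *t) {
    if (strcmp(t, "int") == 0)   return 2;   /* assumed sizes, per the example */
    if (strcmp(t, "float") == 0) return 4;
    return 1;                                /* char */
}

/* Assign consecutive addresses starting at base; return the
   first free address after the last symbol. */
int allocate(Symbol *tab, int n, int base) {
    int addr = base;
    for (int i = 0; i < n; i++) {
        tab[i].address = addr;
        addr += type_size(tab[i].type);
    }
    return addr;
}
```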
Example

extern double test (double x);

double sample (int count)
{
    double sum = 0.0;
    for (int i = 1; i <= count; i++)
        sum += test((double) i);
    return sum;
}

Symbol name   Type               Scope
test          function, double   extern
x             double             function parameter
sample        function, double   global
count         int                function parameter
sum           double             block local
i             int                for-loop statement

Error Detection and Reporting
• Lexical phase: the characters could not form any token
• Syntax and semantic analysis handle a large fraction of errors
• Syntax phase: the tokens violate the structure rules
• Semantic phase: the operations have no meaning
Lexical phase errors
These errors are detected during the lexical analysis phase. Typical lexical errors are:
•Exceeding the length limit of an identifier or numeric constant
•Appearance of illegal characters
•Unmatched strings

void main()
{
    int x = 10, y = 20;
    char *a;
    a = &x;
    x = 1xab;
}
In this code, 1xab is neither a number nor an identifier, so this code will show a lexical error.

printf("Sona");$
This is a lexical error, since an illegal character $ appears at the end of the statement.

This is a comment */
This is a lexical error, since the end of the comment is present but the beginning is not.
Syntactic phase errors
These errors are detected during the syntax analysis phase. Typical syntax errors are:
•Errors in structure
•Missing operators
•Misspelled keywords
•Unbalanced parentheses

if (number = 200)
    cout << "number is equal to 200";
else
    cout << "number is not equal to 200";
In this code, the if condition uses the equal sign, which is actually the assignment operator, not the relational operator that tests for equality.

int a = 5 // semicolon is missing

x = (3 + 5; // missing closing parenthesis ')'

y = 3 + * 5; // missing argument between '+' and '*'
Semantic errors
These errors are detected during semantic
analysis phase. Typical semantic errors are
•Type mismatch
•Undeclared variable
•Reserved identifier misuse.
•Multiple declaration of variable in a scope.
•Accessing an out of scope variable.
•Actual and formal parameter mismatch.

The following tasks should be performed in semantic analysis:
•Scope resolution
•Type checking
•Array-bound checking
Semantic errors

int i;
void f (int m)
{
    m = t;
}
In this code, t is undeclared, which is why it produces a semantic error.

int a = "hello"; // the types String and int are not compatible

String s = "...";
int a = 5 - s; // the - operator does not support arguments of type String

int a = "value";
This should not raise an error in the lexical or syntax analysis phase, as it is lexically and structurally correct, but it should generate a semantic error, as the type of the assignment differs. These rules are set by the grammar of the language and evaluated in semantic analysis.
Accessing an Out-of-Scope Variable
int main()
{
if (1)
{
int x = 10;
}
printf("Value of x: %d\n", x);
return 0;
}

Attempting to access x outside the block results in a semantic error because it is out of scope.
Correct code
int x = 0; // Declare 'x' in the main function's scope

void calculateDifference()
{
    int result = 50;
    {
        int result = 20;
        printf("Inner result: %d\n", result);
    }
    printf("Outer result: %d\n", result);
}

int main()
{
    calculateDifference();
    printf("Trying to access result: %d\n", result);
    return 0;
}

Shadowing:
∙The inner result in the nested block shadows the outer result. Both are valid but refer to different variables.
∙Shadowing: a variable declared in a nested block with the same name as an outer variable hides the outer variable.
Out-of-Scope Access:
∙The result variable declared in calculateDifference is not accessible in the main function.
Actual and Formal Parameter Mismatch
void add(int a, int b)
{
printf("Sum: %d\n", a + b);
}
int main()
{
add(5);
return 0;
}

The function add is defined to take two integer parameters (int a and int b).
In the main function, add(5) is called with only one argument.
This causes a parameter mismatch, since the number of arguments passed (actual parameters) does not match the number expected by the function (formal parameters).
float calculateAverage (int numbers[], int size, float multiplier)
{
    float sum = 0;
    for (int i = 0; i < size; i++)
    {
        sum += numbers[i];
    }
    return (sum / size) * multiplier;
}

int main()
{
    int nums[] = {10, 20, 30};
    calculateAverage(10, 3);
    return 0;
}
First Argument Mismatch:
calculateAverage expects an array as the first argument (int
numbers[]), but 10 (an integer) is passed instead.
Missing Argument:
The function expects three arguments, but only two are
provided.
Type Mismatch:
If a third argument were added but not of type float, the
program would still produce a mismatch error.
Multiple Declarations in the Same Scope

int main()
{
int x = 10; // First declaration of 'x'
int x = 20; // Second declaration of 'x' in the same scope
printf("Value of x: %d\n", x);
return 0;
}

1. The variable x is declared twice in the same scope (the main function), which is not allowed.
2. The compiler cannot resolve which x to use, as this leads to ambiguity.
3. This results in a semantic error during the compilation phase.
Reserved Identifier Misuse
// Using a double underscore prefix (reserved)
int __result = 42;
// Using a leading underscore with an uppercase letter (reserved)

void _CalculateSum()
{
printf("This is a reserved identifier misuse.\n");
}
// Using an identifier starting with an underscore in the global namespace (reserved)
int _globalValue = 100;

int main()
{
    printf("Result: %d\n", __result);            // Accessing a reserved identifier
    _CalculateSum();                             // Calling a reserved identifier
    printf("Global Value: %d\n", _globalValue);  // Accessing another reserved identifier
    return 0;
}
__result:
Identifiers starting with a double underscore (__) are reserved for the
compiler and standard library at all scopes.

_CalculateSum:
Identifiers starting with a single underscore followed by an uppercase
letter (_C) are reserved for use in the global namespace.

_globalValue:
Identifiers starting with a single underscore (_) are reserved for the
implementation in the global namespace.

Avoid reserved prefixes:
• __ (double underscore) for any identifier.
• _ followed by an uppercase letter.
Reserved for standard use:
These reserved identifiers are meant for the compiler or standard library, and
violating this rule can lead to undefined behavior or conflicts during compilation
or runtime.
The Context of a Compiler (Cousins of the Compiler)
In addition to a compiler, several other programs may be required to create an executable target program. A source program may be divided into modules stored in separate files.

The task of collecting the source program is performed by a program called a preprocessor.
The preprocessor may also expand shorthands, called macros, into source language statements.

Preprocessors
• Macro processing
• File inclusion
• Rational preprocessors
• Language extension
Macro preprocessor:
A macro has two parts:

1. Macro definition
2. Macro use
– a single construct stands for a larger construct
– may contain formal parameters

Ex: #define f(x) x*x*x    (macro definition)

Macro use:
ex: printf("%d", f(10)) expands to printf("%d", 10*10*10)

Before compilation all macros are replaced
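This replacement happens before compilation proper, so its effect can be observed directly: the compiler only ever sees the expanded arithmetic, never f(10). A minimal sketch (the function name cube10 is illustrative):

```c
/* The preprocessor substitutes the macro body verbatim wherever
   the macro is used, before compilation. */
#define f(x) x*x*x

/* The compiler sees this body as: return 10*10*10; */
int cube10(void) {
    return f(10);
}
```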


2. File inclusion:
#include <stdio.h> – header files are included into the source file.

3. Rational preprocessor:
Augments older programming languages that lack modern flow-of-control or data-structuring facilities with such features.

4. Language extension:
Attempts to add capabilities to the language by what amounts to built-in macros.
The language EQUEL is a database query language embedded in C; statements beginning with ## are taken by the preprocessor to be database-access statements.
Assembler

Assembly code is a mnemonic version of machine code, in which names are used instead of binary codes for operations, and names are also given to memory addresses.

Assembler:
Assembly code contains mnemonics, i.e., opcodes and operands.
b = a + 2
MOV a, R1
ADD #2, R1
MOV R1, b

• Load the content of address a into R1
• Add the constant 2
• Store R1 into address b
Two-Pass Assembly
The simplest form of assembler makes two passes
over the input, where a pass consists of reading an
input file once.
• First pass
– Find all identifiers, determine their storage locations, and store them in the symbol table (here a word is assumed to be 4 bytes)
  Identifier   Address
  a            0
  b            4
• Second pass
– Translate each operation code into its sequence of bits, and translate each identifier representing a location into its address.
– The result is relocatable machine code.
Loaders and Link-Editors

Loader: takes relocatable machine code and alters the relocatable addresses.
(A loader is a program responsible for loading executable programs into memory for execution. The loader reads the object code of a program, which is usually in binary form, and copies it into memory.)
Link-editors
(A linker or link editor is a computer system program that takes one or more object files (generated by a compiler or an assembler) and combines them into a single executable file, library file, or another "object" file.)
– External references
• Library files, routines provided by the system, any other program
Grouping of Phases

• Front and back end of a compiler
• Passes
• Reducing the number of passes
Grouping of Phases
Front and back end of a compiler
– The front end consists of those phases that depend primarily on the source language and are independent of the target language.
– Some amount of code optimization can also be done in the front end.
– The back end consists of those phases that depend entirely on the target language and are independent of the source language.
The front-end/back-end model is advantageous:

– Keeping the same front end and attaching different back ends, one can produce a compiler for the same source language on different machines.

– Keeping different front ends and the same back end, one can compile several different languages on the same machine.
• Passes
– One complete scan of the source program is called a pass. It includes reading an input file and writing to an output file.
– Many phases can be grouped into one pass.
– It is difficult to compile the source program in a single pass, because the program may contain forward references.
– It is desirable to have relatively few passes, because it takes time to read and write intermediate files.
– If we group several phases into one pass, we may be forced to keep the entire program in memory, so the memory requirement may be large.
Reducing the number of passes
• In the first pass the source program is scanned completely, and the output is an easy-to-use form equivalent to the source program, along with additional information about storage allocation.

• It is possible to leave blank slots for missing information and fill in each slot when the information becomes available.

• Hence more than one pass may be required in the compiling process:
– one pass for scanning and parsing,
– one pass for semantic analysis,
– a third pass for code generation and target code optimization.

• C and Pascal permit one-pass compilation.
• Modula-2 requires two passes.
Compiler Construction Tools
Writing a compiler is a tedious and time-consuming task.
There are specialized tools that help in implementing the various phases of a compiler.
These tools are called compiler construction tools.
Various compiler construction tools are given below:
– Scanner generator
– Parser generator
– Syntax-directed translation engine
– Automatic code generator
– Data flow engine
Scanner generator :
•A scanner generator is a tool that automatically creates a
scanner (or lexical analyzer) for a programming language.
•It takes a set of patterns (written as regular expressions) that
describe different parts of the language, like keywords, numbers,
or symbols.
•It generates code that can read a program's text, recognize
these patterns, and break the text into small pieces called
tokens.
•Unix has a scanner-generator utility called LEX.
•The basic organization of the resulting lexical analyzer is a finite automaton.
Popular Tools
•Lex/Flex: For C/C++.
•JFlex: For Java.
•ANTLR (Another Tool for Language Recognition): Generates both lexical
analyzers and parsers.
Parser generators:
• A parser generator is a tool that automatically creates a
parser for a programming language or structured data.
• It takes a set of grammar rules (usually written in a formal
syntax like context-free grammar) that define the structure of
the language.
• It generates code that can read and check whether a given
program follows these rules and organizes the program into a
structured format, often as a parse tree or syntax tree.

Popular Tools
• Yacc/Bison: For C/C++.
• ANTLR: For many languages like Java, Python, and C#.
• PLY (Python Lex-Yacc): For Python.
• CUP (Constructor of Useful Parsers): Generates LALR
parsers for Java.
Syntax-directed translation engine:
• A Syntax-Directed Translation (SDT) engine is a tool that
helps convert input from one form to another while following
a set of grammar rules combined with specific actions.
• It uses grammar rules (to define the structure of a
language) and attaches actions (to describe what to do for
each rule).
• These actions can perform tasks like building a syntax tree,
generating intermediate code, or calculating values.
• These tools help create intermediate representations of
code, like three-address code or abstract syntax trees.

Popular Tools
• LLVM (Low-Level Virtual Machine): Provides an infrastructure for building
compilers and optimizing intermediate code.
• GCC (GNU Compiler Collection): Includes tools for generating intermediate
representations.
Automatic code generator:
• An Automatic Code Generator is a tool that creates
machine code or intermediate code for a computer program
automatically.
• It takes high-level instructions (like syntax trees or
intermediate representations created during compilation)
and translates them into low-level code (like assembly
language or binary code) that the computer can execute.
• These tools assist in generating machine code from
intermediate representations.

Popular Tools:
• LLVM: Again, used for backend code generation.
• SPIM/MIPS Simulators: Used for generating assembly code targeting
the MIPS architecture.
• Keystone: A lightweight assembler framework.
Data flow engine:
• A Data Flow Engine is a system or tool that processes data
by following a defined flow or sequence of operations,
where the output of one step becomes the input for the next.
• It focuses on how data moves and transforms through
different stages in a program or system.
• Each step (or node) in the flow performs a specific operation
on the data, like filtering, aggregating, or analyzing.
• These focus on optimizing intermediate code or machine
code.

Popular Tools
• LLVM: Also used extensively for optimization.
• Polly: An LLVM-based tool for polyhedral optimization.
• Open64: A high-performance compiler with strong optimization capabilities.
Analysis Tools (software tools that manipulate source programs)

Structure editors
– match begin..end, do..while, etc.
Pretty printers
– beautify: special colors and fonts for different portions
Static checkers
– check syntax errors, statement-reachability analysis
Interpreters
– process the program line by line

Structure editor:
•Input is a sequence of commands
•Creates and modifies programs
•Can check that the input is correctly formed
•Supplies keywords automatically
•Finds matching parentheses and the corresponding begin…end statements
•Its output is similar to the output of the analysis phase of a compiler

Interpreter:
Performs the operations in the source program.
Pretty printers:
Print the program so that it is clearly visible:
– comments in special fonts
– statements indented in proportion to the depth of their nesting

Static checker:
Identifies bugs without running the program:
– detects parts of the source code that can never be executed (unreachable code)
– detects variables used before being defined
– performs type checking
Places where compiler techniques are used
Text formatters
•Input is a sequence of characters
•Commands indicate paragraphs, figures, and mathematical structures such as superscripts and subscripts
Silicon compilers
•Used in signal analysis in circuits
•The input language contains logical (0/1) signals or groups of signals in a switching circuit
•The output is a circuit design in an appropriate language
Query interpreters
•Translate a predicate (containing relational and boolean operators) into commands to search a database for records satisfying that predicate.
