Unit 1,2 PDF
● A Cross Compiler runs on a machine ‘A’ and produces code for another
machine ‘B’. It is capable of creating code for a platform other than the one on
which the compiler is running.
● A Source-to-source Compiler (also called a transcompiler or transpiler) is a
compiler that translates source code written in one programming language into
source code of another programming language.
● Assembler – For every platform (hardware + OS) we have an assembler.
Assemblers are not universal, since each platform has its own. The output of the
assembler is called an object file. It translates assembly language to machine code.
● Relocatable Machine Code – It can be loaded at any memory address and run.
The addresses within the program are encoded in such a way that the code still
works when the program is moved in memory.
● Loader/Linker – It converts the relocatable code into absolute code and tries to
run the program, resulting in a running program or an error message (or
sometimes both). The linker combines a variety of object files into a single
file to make it executable. The loader then loads it into memory and executes it.
Analysis Phase – An intermediate representation is created from the given source code:
1. Lexical Analyzer
2. Syntax Analyzer
3. Semantic Analyzer
● The Lexical analyzer divides the program into “tokens”, the Syntax analyzer
recognizes “sentences” in the program using the syntax of the language, and the
Semantic analyzer checks the static semantics of each construct.
Synthesis Phase – An equivalent target program is created from the intermediate
representation:
1. Intermediate Code Generator
2. Code Optimizer
3. Code Generator
● The Intermediate Code Generator generates “abstract” code, the Code Optimizer
optimizes the abstract code, and the final Code Generator translates the abstract
intermediate code into specific machine instructions.
Compiler construction tools assist in the creation of an entire compiler or its
parts. Commonly used tools include:
1. Parser Generator –
It produces syntax analyzers (parsers) from input based on a
grammatical description of a programming language, i.e. a context-free
grammar. It is useful because the syntax analysis phase is highly complex and
consumes a lot of manual effort and time.
Example: PIC, EQM
2. Scanner Generator –
It generates lexical analyzers from an input that consists of regular-expression
descriptions of the tokens of a language. It generates a finite
automaton to recognize the regular expressions.
Example: Lex
Symbol Table – It is a data structure created and maintained by the compiler that
contains all the identifiers’ names along with their types. It helps the compiler
function smoothly by letting it find identifiers quickly.
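As a rough illustration (a toy sketch; real compilers typically use hash tables and handle nested scopes), a symbol table can be as simple as a list of (name, type) pairs:

#include <stdio.h>
#include <string.h>

/* One symbol table entry: an identifier's name and its type. */
struct symbol { char name[32]; char type[16]; };
static struct symbol table[128];
static int count = 0;

static void insert(const char *name, const char *type) {
    strcpy(table[count].name, name);
    strcpy(table[count].type, type);
    count++;
}

/* Linear lookup; returns NULL for an undeclared identifier. */
static const char *lookup(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].type;
    return NULL;
}

int main(void) {
    insert("a", "int");
    insert("b", "int");
    printf("a : %s\n", lookup("a"));   /* prints "a : int" */
    return 0;
}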
The compiler has two modules, namely the front end and the back end. The front end
consists of the lexical analyzer, syntax analyzer, semantic analyzer and intermediate
code generator. The rest are assembled to form the back end.
1. Lexical Analyzer – It reads the program and converts it into tokens, i.e. it
converts a stream of lexemes into a stream of tokens. Tokens are defined
by regular expressions which are understood by the lexical analyzer. It also
removes whitespace and comments.
2. Syntax Analyzer – It is sometimes called the parser. It constructs the parse
tree: it takes the tokens one by one and uses a Context Free Grammar to
construct the parse tree.
Why Grammar?
The rules of a programming language can be entirely represented by a few
productions. Using these productions we can represent what the program
actually is. The input has to be checked for whether it is in the desired format or
not.
A syntax error can be detected at this level if the input is not in accordance
with the grammar.
3. Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful
or not, and produces a verified parse tree. It also does type checking,
label checking and flow-control checking.
4. Intermediate Code Generator – It generates intermediate code, a form
that can be readily translated into machine code. There are many popular
intermediate codes, e.g. three-address code (a short illustrative example
follows this list). Intermediate code is converted to machine language using
the last two phases, which are platform dependent.
Up to the intermediate code, the compiler is the same for every platform; after
that, it depends on the platform. To build a new compiler we don’t need to build
it from scratch: we can take the intermediate code from an already existing
compiler and build the last two parts.
5. Code Optimizer – It transforms the code so that it consumes fewer
resources and runs faster, without altering the meaning of the code being
transformed. Optimisation can be categorized into two types:
machine dependent and machine independent.
6. Target Code Generator – The main purpose of the target code generator is to
write code that the machine can understand; it also performs register allocation,
instruction selection, etc. The output is dependent on the type of assembler.
This is the final stage of compilation.
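For illustration, a statement such as a = b + c * d could be translated into three-address code of the following form (the temporary names t1 and t2 are made up for the example):

t1 = c * d
t2 = b + t1
a = t2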
Lexical Analysis is the first phase of the compiler, also known as the scanner. It
converts the high-level input program into a sequence of tokens.
What is a token?
A lexical token is a sequence of characters that can be treated as a unit in the grammar
of the programming language.
Example of tokens: keywords (int, return, ...), identifiers, constants, operators and
punctuation symbols.
Example of non-tokens: comments, preprocessor directives, macros, blanks, tabs
and newlines.
The sequence of characters forming a token is called a lexeme, eg- “float”,
“abs_zero_Kelvin”, “=”, “-”, “273”, “;” .
The lexical analyzer also:
1. Removes comments and whitespace.
2. Helps in generating error messages by providing the row number and
column number of the error.
The lexical analyzer identifies errors with the help of the finite automaton and
the grammar of the given language on which it is based (e.g. C, C++), and gives the row
number and column number of the error.
Suppose we pass a statement through the lexical analyzer –
int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'
Note that the whitespace and the comment have been discarded.
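As a rough picture of what the scanner does internally, here is a minimal C sketch (illustrative only, not a production scanner) that splits the statement above into tokens while tracking the row and column numbers used in error messages:

#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *src =
        "int main()\n{\n    // 2 variables\n    int a, b;\n"
        "    a = 10;\n    return 0;\n}\n";
    const char *p = src;
    int row = 1, col = 1;
    while (*p) {
        if (*p == '\n') { row++; col = 1; p++; continue; }        /* new line */
        if (isspace((unsigned char)*p)) { col++; p++; continue; } /* drop blanks */
        if (p[0] == '/' && p[1] == '/') {                         /* drop // comment */
            while (*p && *p != '\n') p++;
            continue;
        }
        int startcol = col;
        const char *s = p;
        if (isalpha((unsigned char)*p) || *p == '_') {            /* keyword/identifier */
            while (isalnum((unsigned char)*p) || *p == '_') { p++; col++; }
        } else if (isdigit((unsigned char)*p)) {                  /* number literal */
            while (isdigit((unsigned char)*p)) { p++; col++; }
        } else {                                                  /* punctuation/operator */
            p++; col++;
        }
        printf("%d:%d\t'%.*s'\n", row, startcol, (int)(p - s), s);
    }
    return 0;
}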
Flex (Fast Lexical Analyzer Generator) is a tool for generating lexical analyzers. It is
frequently used together with the Berkeley Yacc parser generator or the GNU Bison
parser generator. Flex and Bison are both more flexible than Lex and Yacc and produce
faster code.
Bison produces a parser from the input file provided by the user. The function yylex() is
automatically generated by flex when it is provided with a .l file, and this yylex()
function is called by the parser to retrieve tokens from the token stream.
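As a minimal sketch of this interface (in a real build, the Bison-generated parser itself makes these calls, and a scanner generated by flex from a .l file must be linked in):

/* The flex-generated yylex() returns the next token's integer code,
   and 0 at end of input. */
extern int yylex(void);

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)
        ;   /* the parser would shift/reduce based on tok here */
    return 0;
}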
When a source program is given to the compiler, the compiler processes it in several
phases, starting from lexical analysis (which scans the input and divides it into tokens)
through to target code generation.
Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the
syntactical structure of the given input, i.e. whether the given input is in the correct
syntax (of the language in which the input has been written) or not. It does so by
building a data structure, called a Parse tree or Syntax tree. The parse tree is
constructed by using the pre-defined Grammar of the language and the input string. If
the given input string can be produced with the help of the parse tree (in the derivation
process), the input string is found to be in the correct syntax. If not, an error is reported
by the syntax analyzer.
Example:
S -> cAd
A -> bc|a
Now the parser attempts to construct a syntax tree from this grammar for the given
input string. It uses the given production rules and applies them as needed to generate
the string. To generate the string “cad” it first applies S -> cAd and then tries the first
alternative for A, i.e. A -> bc. Since the string produced this way is “cbcd”, not “cad”,
the parser needs to backtrack and apply the other production rule available for A,
i.e. A -> a, and the string “cad” is produced.
Thus, the given input can be produced by the given grammar, therefore the input is
correct in syntax.
But backtracking was needed to get the correct syntax tree, which is a complex
process to implement. There is an easier way to solve this, which we shall see later.
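To make the backtracking concrete, here is a minimal C sketch (an illustration for this toy grammar only, not a general parser) of top-down parsing with backtracking for S -> cAd, A -> bc | a: each alternative is tried in order, and the input position is restored whenever an alternative fails.

#include <stdio.h>

static const char *ip;   /* current position in the input */

/* A -> bc | a : try the alternatives in order, restoring ip on failure */
static int matchA(void) {
    const char *save = ip;
    if (ip[0] == 'b' && ip[1] == 'c') { ip += 2; return 1; }   /* A -> bc */
    ip = save;                                                 /* backtrack */
    if (ip[0] == 'a') { ip += 1; return 1; }                   /* A -> a */
    ip = save;
    return 0;
}

/* S -> c A d */
static int matchS(void) {
    if (*ip != 'c') return 0;
    ip++;
    if (!matchA()) return 0;
    if (*ip != 'd') return 0;
    ip++;
    return 1;
}

int main(void) {
    ip = "cad";   /* A -> bc fails first, then A -> a succeeds */
    printf("%s\n", (matchS() && *ip == '\0') ? "accepted" : "rejected");
    return 0;
}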
A grammar used for parsing must satisfy certain properties. During compilation, the
parser uses the grammar of the language to make a parse tree (or derivation tree) out
of the source code. The grammar used must be unambiguous; an ambiguous grammar
must not be used for parsing.
Based on the nature of the recursion, a recursive CFG can be classified as
left-recursive or right-recursive.
Note: A linear grammar is a context-free grammar that has at most one non-terminal on
the right-hand side of each production.
Let L be a Context Free Language (CFL). If every Context Free Grammar G with
L = L(G) is ambiguous, then L is said to be inherently ambiguous. An ambiguous
grammar would not be useful for a programming language, because two (or more)
parse tree structures for the same string (program) imply two different meanings
(executable programs) for the program. An inherently ambiguous language would
likewise be unusable as a programming language, because we would not have any way
of fixing a unique structure for all its programs.
To remove ambiguity, we disambiguate the grammar, i.e., rewrite the grammar such
that there is only one derivation or parse tree possible for each string of the language
which the grammar represents.
In the syntax analysis phase, a compiler verifies whether or not the tokens generated by
the lexical analyzer are grouped according to the syntactic rules of the language. This is
done by a parser. The parser obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar for the source language. It detects and
reports any syntax errors and produces a parse tree from which intermediate code can
be generated.
The process of deriving the string from the given grammar is known as derivation
(parsing).
Depending upon how the derivation is done, we have two kinds of parsers:
Top-down parsing attempts to build the parse tree from the root to the leaves. A
top-down parser starts from the start symbol and proceeds towards the string. It follows
leftmost derivation: in a leftmost derivation, the leftmost non-terminal in each sentential
form is always the one chosen.
Parsing is classified into two categories, i.e. Top Down Parsing and Bottom-Up Parsing.
Top-Down Parsing is based on Left Most Derivation whereas Bottom Up Parsing is
dependent on Reverse Right Most Derivation.
The process of constructing the parse tree by starting from the root and going down to
the leaves is called Top-Down Parsing. The steps are:
1. Whenever a non-terminal is expanded for the first time, go with the first
alternative and compare it with the given input string.
2. If a match does not occur, go with the second alternative and compare it
with the given input string.
3. If matching again fails, go with the next alternative, and so on.
4. If matching occurs for at least one alternative, then the input string is
parsed successfully.
1. In LL(1), the first L stands for Left-to-right scanning of the input and the
second L stands for Left-most Derivation. The 1 stands for the number of
lookahead tokens used by the parser while parsing a sentence.
2. LL(1) parsing is constructed from a grammar which is free from left
recursion, common prefixes, and ambiguity.
3. An LL(1) parser depends on 1 lookahead symbol to predict the production
used to expand the parse tree.
4. This parser is non-recursive (see the table-driven sketch below).
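As a rough sketch of such a non-recursive, table-driven LL(1) parser, consider the toy grammar S -> aSb | c (a made-up example, not from these notes). Its parsing table has the entries M[S][a] = aSb and M[S][c] = c, and the parser drives an explicit stack instead of using recursion:

#include <stdio.h>

/* Toy grammar: S -> a S b | c.  Table: M[S][a] = aSb, M[S][c] = c. */
int parse(const char *input) {
    char stack[256];
    int top = 0;
    stack[top++] = '$';              /* bottom-of-stack marker */
    stack[top++] = 'S';              /* start symbol */
    const char *ip = input;
    while (top > 0) {
        char X = stack[--top];       /* pop */
        char a = *ip ? *ip : '$';    /* lookahead */
        if (X == '$')
            return a == '$';         /* accept iff the input is exhausted too */
        if (X == 'S') {              /* non-terminal: consult the table */
            if (a == 'a') {          /* S -> a S b: push the RHS reversed */
                stack[top++] = 'b';
                stack[top++] = 'S';
                stack[top++] = 'a';
            } else if (a == 'c') {   /* S -> c */
                stack[top++] = 'c';
            } else {
                return 0;            /* empty table entry: syntax error */
            }
        } else {                     /* terminal: must match the input */
            if (X != a) return 0;
            ip++;
        }
    }
    return 0;
}

int main(void) {
    printf("%d\n", parse("aacbb"));  /* 1: in the language */
    printf("%d\n", parse("aacb"));   /* 0: not in the language */
    return 0;
}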
Bottom-up parsing builds the parse tree from the leaves to the root. It can be defined as
an attempt to reduce the input string w to the start symbol of the grammar by tracing
out the rightmost derivation of w in reverse.
Eg. (an illustrative example): for the grammar S -> aABe, A -> Abc / b, B -> d, the input
string abbcde is reduced as abbcde -> aAbcde -> aAde -> aABe -> S.
A general shift-reduce parsing method is LR parsing. The L stands for scanning the input
from left to right and the R stands for constructing a rightmost derivation in reverse.
Benefits of LR parsing:
1. LR parsers can be constructed to recognize virtually all programming-language
constructs for which context-free grammars can be written.
2. The LR method is the most general non-backtracking shift-reduce parsing
method known.
3. An LR parser can detect a syntactic error as soon as it is possible to do so on a
left-to-right scan of the input.
Important Notes
1. Even though the CLR parser does not have RR conflicts, the LALR parser may
contain RR conflicts.
2. If n1, n2, n3 and n4 denote the numbers of states in the LR(0), SLR(1), LALR(1)
and CLR(1) parsers respectively, then n1 = n2 = n3 <= n4.
An operator grammar is a context-free grammar that has the property that no
production has either an empty right-hand side (a null production) or two adjacent
non-terminals in its right-hand side.
Examples –
E->E+E/E*E/id
However, the grammar given below is not an operator grammar, because two
non-terminals are adjacent to each other:
S->SAS/a
A->bSb/b
It can, however, be converted into an operator grammar:
S->SbSbS/SbS/a
A->bSb/b
An operator precedence parser is one of the bottom-up parsers; it interprets an
operator-precedence grammar. This parser is used only for operator grammars.
Ambiguous grammars are not allowed for any parser except the operator precedence
parser.
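As a minimal sketch (with the usual precedence relations assumed; 'i' stands for the token id), the ambiguous operator grammar E -> E+E / E*E / id above can be parsed by comparing the topmost terminal on the stack with the lookahead: shift on <, reduce on >.

#include <stdio.h>
#include <string.h>

/* Precedence relation between stack-top terminal b and lookahead a:
   '<' shift, '>' reduce, 'A' accept, 'E' error.  Terminals: + * i $  */
static char rel(char b, char a) {
    static const char *terms = "+*i$";
    static const char table[4][4] = {
        /*         +    *    i    $   */
        /* + */  {'>', '<', '<', '>'},
        /* * */  {'>', '>', '<', '>'},
        /* i */  {'>', '>', 'E', '>'},
        /* $ */  {'<', '<', '<', 'A'},
    };
    return table[strchr(terms, b) - terms][strchr(terms, a) - terms];
}

int parse(const char *input) {
    char stack[256];
    int top = 0;
    stack[top++] = '$';
    const char *ip = input;
    for (;;) {
        int t = top - 1;
        while (stack[t] == 'E') t--;          /* topmost terminal on the stack */
        char a = *ip ? *ip : '$';
        char r = rel(stack[t], a);
        if (r == 'A') return top == 2 && stack[1] == 'E';  /* accept iff stack is $E */
        if (r == 'E') return 0;
        if (r == '<') {                        /* shift */
            stack[top++] = a;
            if (*ip) ip++;
        } else {                               /* '>' : reduce the handle */
            if (stack[top - 1] == 'i') {
                stack[top - 1] = 'E';          /* E -> id */
            } else if (top >= 4 && stack[top - 1] == 'E' &&
                       (stack[top - 2] == '+' || stack[top - 2] == '*') &&
                       stack[top - 3] == 'E') {
                top -= 2;                      /* E -> E op E */
            } else {
                return 0;
            }
        }
    }
}

int main(void) {
    printf("%d\n", parse("i+i*i"));  /* 1: * groups tighter than + */
    printf("%d\n", parse("i+*i"));   /* 0: syntax error */
    return 0;
}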
The parser uses a context-free grammar to validate the input string and produce output
for the next phase of the compiler. The output could be either a parse tree or an
abstract syntax tree. To interleave semantic analysis with the syntax analysis phase of
the compiler, we use Syntax Directed Translation.
Syntax Directed Translations are augmented rules to the grammar that facilitate
semantic analysis. SDT involves passing information bottom-up and/or top-down the
parse tree in the form of attributes attached to the nodes.
Syntax directed translation rules use 1) lexical values of nodes, 2) constants, and 3)
attributes associated with the non-terminals in their definitions.
For understanding translation rules further, we take the SDT augmented to the [ E ->
E+T ] production rule, namely [ E -> E+T { E.val = E.val + T.val } ]. The translation rule in
consideration has val as an attribute for both of the non-terminals – E & T. The right-hand
side of the translation rule corresponds to attribute values of the right-side nodes of the
production rule, and vice-versa. Generalizing,
SDT are augmented rules to a CFG that associate 1) set of attributes to every node of
the grammar and 2) set of translation rules to every production rule using attributes,
constants and lexical values.
To evaluate translation rules, we can employ one depth-first traversal of the parse tree.
This is possible only because SDT rules don’t impose any specific order of evaluation,
as long as children’s attributes are computed before their parents’, for a grammar having
all synthesized attributes. Otherwise, we would have to figure out the best-suited plan to
traverse the parse tree and evaluate all the attributes in one or more traversals.
For better understanding, we will move bottom-up, in left-to-right fashion, to compute
the translation rules of our example.
The flow of information happens bottom-up, and all the children’s attributes are
computed before their parents, as discussed above. Right-hand-side nodes are
sometimes annotated with subscript 1 to distinguish them when the same non-terminal
appears on both sides of a production.
Additional Information
Synthesized Attributes are such attributes that depend only on the attribute values of
children nodes.
Thus [ E -> E+T { E.val = E.val + T.val } ] has a synthesized attribute val corresponding to
node E. If all the semantic attributes in an augmented grammar are synthesized, one
depth first search traversal in any order is sufficient for the semantic analysis phase.
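To make this concrete, here is a minimal C sketch (illustrative; T -> digit is an assumed production to keep it small) in which each parsing function returns the synthesized attribute val of its node, so E.val = E.val + T.val is computed bottom-up:

#include <stdio.h>
#include <ctype.h>

static const char *ip;

/* T -> digit { T.val = digit.lexval } */
static int T_val(void) {
    return isdigit((unsigned char)*ip) ? *ip++ - '0' : 0;
}

/* E -> E + T { E.val = E.val + T.val }, handled iteratively to
   sidestep the left recursion */
static int E_val(void) {
    int val = T_val();          /* leftmost T */
    while (*ip == '+') {
        ip++;
        val = val + T_val();    /* E.val = E.val + T.val */
    }
    return val;
}

int main(void) {
    ip = "3+4+2";
    printf("%d\n", E_val());    /* prints 9 */
    return 0;
}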
Inherited Attributes are attributes that depend on the attributes of the parent and/or
siblings.
Thus [ Ep -> E+T { Ep.val = E.val + T.val, T.val = Ep.val } ], where E and Ep are the same
production symbol annotated to differentiate between parent and child, has an
inherited attribute val corresponding to node T.
Types of attributes –
1. Synthesized attributes –
A Synthesized attribute is an attribute of the non-terminal on the left-hand
side of a production. Synthesized attributes represent information that is
being passed up the parse tree. The attribute can take value only from its
children (Variables in the RHS of the production).
For example, let’s say A -> BC is a production of a grammar; if A’s attribute is
dependent on B’s attributes or C’s attributes, then it is a synthesized
attribute.
2. Inherited attributes –
An inherited attribute is an attribute of a non-terminal on the right-hand
side of a production. It can take values from its parent and/or its siblings.
Types of SDTs:
1. S-attributed SDT :
○ If an SDT uses only synthesized attributes, it is called an
S-attributed SDT.
○ S-attributed SDTs are evaluated in bottom-up parsing, as the
values of the parent nodes depend upon the values of the child
nodes.
○ Semantic actions are placed at the rightmost end of the RHS.
2. L-attributed SDT:
○ If an SDT uses both synthesized attributes and inherited
attributes, with the restriction that an inherited attribute can
inherit values from left siblings only, it is called an L-attributed SDT.
○ Attributes in L-attributed SDTs are evaluated in a depth-first,
left-to-right manner.
○ Semantic actions can be placed anywhere in the RHS.
Note – If a definition is S-attributed, then it is also L-attributed, but NOT vice-versa.
Semantics
Semantics of a language provide meaning to its constructs, like tokens and
syntax structure. Semantics help interpret symbols, their types, and their
relations with each other. Semantic analysis judges whether the syntax
structure constructed in the source program derives any meaning or not.
For example:
● Scope resolution
● Type checking
● Array-bound checking
Semantic analyzer receives AST (Abstract Syntax Tree) from its previous
stage (syntax analysis).
The semantic analyzer attaches attribute information to the AST, which is then
called an attributed AST.
Attributes are two-tuple values: <attribute name, attribute value>.
For example:
<type, “integer”>
<presentvalue, “5”>
S-attributed SDT
If an SDT uses only synthesized attributes, it is called an S-attributed SDT.
These attributes are evaluated using S-attributed SDTs that have their
semantic actions written after the production (at the right-hand end).
L-attributed SDT
In L-attributed SDTs, a non-terminal can get values from its parent, child,
and sibling nodes, as in the following production:
S → ABC
S can take values from A, B, and C (synthesized). A can take values from S
only. B can take values from S and A. C can get values from S, A, and B. No
non-terminal can get values from the sibling to its right.
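For a concrete picture of this left-to-right flow, here is a minimal C sketch built on the classic declaration grammar D -> T L, T -> int | float, L -> id (, id)* (an assumed example, not from these notes): the type synthesized for T is passed sideways and down as the inherited attribute of L.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *ip;

static void skip_spaces(void) { while (*ip == ' ') ip++; }

/* T -> int | float : returns the synthesized attribute T.type */
static const char *T_type(void) {
    skip_spaces();
    if (strncmp(ip, "int", 3) == 0)   { ip += 3; return "int"; }
    if (strncmp(ip, "float", 5) == 0) { ip += 5; return "float"; }
    return NULL;
}

/* L -> id (, id)* : 'type' is the inherited attribute L.in */
static void L(const char *type) {
    for (;;) {
        skip_spaces();
        char name[32];
        int n = 0;
        while (isalnum((unsigned char)*ip) && n < 31) name[n++] = *ip++;
        name[n] = '\0';
        printf("declare %s : %s\n", name, type);  /* inherited attribute used here */
        skip_spaces();
        if (*ip != ',') break;
        ip++;                                     /* consume ',' */
    }
}

int main(void) {
    ip = "int a, b";
    const char *t = T_type();  /* synthesized upward from T */
    if (t) L(t);               /* inherited sideways/down into L */
    return 0;                  /* prints: declare a : int / declare b : int */
}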