CD Unit1 Notes
INTRODUCTION
LANGUAGE TRANSLATORS:
A language translator is a computer program that translates a program written in one (source) language into its
equivalent program in another (target) language. The source program is written in a high-level language, whereas
the target language can be anything from the machine language of a target machine (ranging from a
microprocessor to a supercomputer) to another high-level language.
[Figure: an interpreter takes the source program and user input and produces output directly.]
Figure 1.3: Context of a Compiler in the Language Processing System. The source program passes through a
preprocessor, compiler, assembler, and linker/loader (which uses library files and relocatable object files) to
produce the target machine code (for example, filename.exe).
TYPES OF COMPILERS:
Based on the specific input it takes and the output it produces, compilers can be classified
into the following types:
Traditional Compilers(C, C++, Pascal): These Compilers convert a source program in a HLL
into its equivalent in native machine code or object code.
Interpreters (LISP, SNOBOL, Java 1.0): These compilers first convert source code into
intermediate code, and then interpret (emulate) it to its equivalent machine code.
Cross-Compilers: These are the compilers that run on one machine and produce code for another
machine.
Incremental Compilers: These compilers separate the source into user-defined steps,
compiling/recompiling step by step and interpreting the steps in a given order.
Converters (e.g. COBOL to C++): These programs compile from one high-level
language to another.
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers from an
intermediate language (byte code, MSIL) to executable code or native machine code. They
perform type-based verification, which makes the executable code more trustworthy.
Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are the pre-compilers to the native
code for Java and .NET
Binary Compilers: These compilers translate object code of one platform into object code of
another platform.
PHASES OF A COMPILER:
Compiler Phases are the individual modules which are chronologically executed to perform their
respective Sub-activities, and finally integrate the solutions to give target code.
It is desirable to have relatively few phases, since it takes time to read and write intermediate files.
The following diagram (Figure 1.4) depicts the phases of a compiler through which a program passes
during compilation. A typical compiler therefore has the following phases.
In addition to these, it also has symbol table management and error handling.
The phases of a compiler are divided into two parts: the first three phases are called the
analysis part and the remaining three the synthesis part.
The synthesis part constructs the desired target program from the intermediate
representation and the information in the symbol table.
The analysis part is often called the front end of the compiler;
the synthesis part is the back end.
If we examine the compilation process in more detail, we see that it operates as a sequence
of phases, each of which transforms one representation of the source program to another.
A typical decomposition of a compiler into phases is shown in Figure 1.4.
In practice, several phases may be grouped together, and the intermediate representations
between the grouped phases need not be constructed explicitly.
Figure 1.4: Phases of a Compiler
PHASE, PASSES OF A COMPILER:
In some applications the compiler is organized into what are called passes,
where a pass is a collection of phases that converts the input from one representation to a
completely different representation. Each pass makes a complete scan of the input and produces its
output to be processed by the subsequent pass. A two-pass assembler is a familiar example.
LEXICAL ANALYZER (SCANNER): The scanner is the first phase; it works as the interface
between the compiler and the source language program and performs the following functions:
−− Reads the characters in the source program and groups them into a stream of tokens, in
which each token specifies a logically cohesive sequence of characters, such as an identifier,
a keyword, a punctuation mark, or a multi-character operator like :=.
−− The scanner generates a token-id, and also enters the identifier's name in the
symbol table if it does not already exist.
SYNTAX ANALYZER (PARSER): The parser interacts with the scanner and with its subsequent
phase, the semantic analyzer, and performs the following functions:
−− Groups the received and recorded token stream into syntactic structures,
usually into a structure called a parse tree, whose leaves are tokens.
−− The interior nodes of this tree represent streams of tokens that logically
belong together.
SEMANTIC ANALYZER: This phase receives the syntax tree as input and checks the
semantic correctness of the program. Though the tokens may be valid and syntactically correct, it
may happen that they are not correct semantically. Therefore the semantic analyzer checks the
semantics (meaning) of the statements formed.
INTERMEDIATE CODE GENERATOR: This phase produces an intermediate representation of the source program.
−− The syntactically and semantically correct structures are produced here in the form of
a syntax tree or DAG or some other sequential representation such as a matrix.
−− It should be easy to produce, and easy to translate into the target program.
Example intermediate code forms are three-address code, polish (postfix) notation, and syntax trees.
CODE OPTIMIZER: This phase is optional in some compilers, but is so useful and beneficial in
terms of saving development time, effort, and cost. This phase performs the following specific
functions:
→ Attempts to improve the intermediate code so as to obtain faster machine code. Typical functions
include loop optimization, removal of redundant computations, strength reduction, frequency
reduction, etc.
−−Sometimes the data structures used in representing the intermediate forms may also
be changed.
CODE GENERATOR: This is the final phase of the compiler and generates the target code,
normally consisting of the relocatable machine code or Assembly code or absolute machine code.
−Memory locations are selected for each variable used, and assignment of variables to
registers is done.
The Compiler also performs the Symbol table management and Error handling throughout the
compilation process. Symbol table is nothing but a data structure that stores different source
language constructs, and tokens generated during the compilation. These two interact with all
phases of the Compiler.
For example, suppose the source program is the assignment statement
position = initial + rate * 60
The following shows how the phases of the compiler process this statement.
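The following is a sketch of what each phase produces for this statement, in the style of the classical worked example; the token numbering and the temporary names t1, t2, t3 are illustrative, and the final target instructions assume a simple hypothetical register machine:

Lexical analysis   : <id,1> <=> <id,2> <+> <id,3> <*> <60>
Syntax analysis    : a syntax tree with = at the root, id1 on the left, and + (id2, * (id3, 60)) on the right
Semantic analysis  : inserts a conversion, 60 becomes inttofloat(60), since rate is a floating-point variable
Intermediate code  : t1 = inttofloat(60)
                     t2 = id3 * t1
                     t3 = id2 + t2
                     id1 = t3
Code optimization  : t1 = id3 * 60.0
                     id1 = id2 + t1
Code generation    : LDF  R2, id3
                     MULF R2, R2, #60.0
                     LDF  R1, id2
                     ADDF R1, R1, R2
                     STF  id1, R1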
COMPILER CONSTRUCTION TOOLS: Some commonly used compiler-construction tools include:
1. Parser generators that automatically produce syntax analyzers from a grammatical description of a
programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the
tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse tree
and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for translating
each operation of the intermediate language into the machine language for a target machine.
5. Data- flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data- flow analysis is a key part of code
optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various
phases of a compiler.
Compiler optimizations must meet the following design objectives:
1. The optimization must be correct, that is, preserve the meaning of the compiled program,
2. The optimization must improve the performance of many programs,
3. The compilation time must be kept reasonable, and
4. The engineering effort required must be manageable.
Thus, in studying compilers, we learn not only how to build a compiler, but also the general methodology of
solving complex and open-ended problems.
The Static/Dynamic Distinction: Among the most important issues that we face when designing a compiler
for a language is what decisions the compiler can make about a program. If a language uses a policy that
allows the compiler to decide an issue, then we say that the language uses a static policy or that the issue can
be decided at compile time. On the other hand, a policy that only allows a decision to be made when we
execute the program is said to be a dynamic policy. One issue is the scope of declarations. The scope of a
declaration of x is the region of the program in which uses of x refer to this declaration. A language uses
static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the
program. Otherwise, the language uses dynamic scope. With dynamic scope, as the program runs, the same
use of x could refer to any of several different declarations of x.
Most languages, including C and its family, use static scope. We consider static-scope rules for a
language with blocks, where a block is a grouping of declarations and statements. C uses braces { and }
to delimit a block; the alternative use of begin and end for the same purpose dates back to Algol.
A C program consists of a sequence of top-level declarations of variables and functions. Functions
may have variable declarations within them, where variables include local variables and parameters. The
scope of each such declaration is restricted to the function in which it appears.
The scope of a top-level declaration of a name x consists of the entire program that follows, with the
exception of those statements that lie within a function that also has a declaration of x.
A block is a sequence of declarations followed by a sequence of statements, all surrounded by braces.
a declaration D "belongs" to a block B if B is the most closely nested block containing D; that is, D is
located within B, but not within any block that is nested within B. The static-scope rule for variable
declarations in block-structured languages is as follows. If declaration D of name x belongs to block B,
then the scope of D is all of B, except for any blocks B' nested to any depth within B in which x is
redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x belongs to B'.
An equivalent way to express this rule is to focus on a use of a name x. Let B1, B2, ..., Bk be
all the blocks that surround this use of x, with Bk the smallest, nested within Bk-1, which is nested within
Bk-2, and so on. Search for the largest i such that there is a declaration of x belonging to Bi. This use of
x refers to the declaration in Bi; alternatively, this use of x is within the scope of the declaration in Bi.
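A small C++ sketch of these scope rules; the variable values and output statements are only illustrative:

#include <iostream>

int x = 1;                           // declaration D1: top-level x

int main() {
    std::cout << x << "\n";          // refers to D1: prints 1
    {                                // block B2
        int x = 2;                   // declaration D2: x is redeclared inside B2
        std::cout << x << "\n";      // refers to D2: prints 2
        {                            // block B3: no declaration of x of its own
            std::cout << x << "\n";  // still D2, the most closely nested enclosing declaration
        }
    }
    std::cout << x << "\n";          // outside B2 again: refers to D1, prints 1
    return 0;
}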
Explicit Access Control
Through the use of keywords like public, private, and protected, object-oriented languages such as C++
or Java provide explicit control over access to member names in a superclass. These keywords
support encapsulation by restricting access. Thus, private names are purposely given a scope that
includes only the method declarations and definitions associated with that class and any "friend" classes
(the C++ term). Protected names are accessible to subclasses. Public names are accessible from outside
the class.
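A minimal C++ sketch of the three access levels; the class and member names are hypothetical:

class Account {
public:
    double balance() const { return amount; }     // public: accessible from anywhere
protected:
    void applyFee(double fee) { amount -= fee; }  // protected: this class and its subclasses
private:
    double amount = 0.0;                          // private: only this class and its friends
    friend class Auditor;                         // the C++ "friend" notion mentioned above
};

class SavingsAccount : public Account {
public:
    void monthlyMaintenance() { applyFee(1.0); }  // ok: protected members are visible here
    // void reset() { amount = 0; }               // error: amount is private to Account
};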
Dynamic Scope
Any scoping policy is dynamic if it is based on factor(s) that can be known only when the program
executes. The term dynamic scope, however, usually refers to the following policy: a use of a
name x refers to the declaration of x in the most recently called procedure with such a declaration.
Dynamic scoping of this type appears only in special situations. We shall consider two examples of
dynamic policies: macro expansion in the C preprocessor and method resolution in object-oriented
programming.
Declarations and Definitions
Declarations tell us about the types of things, while definitions tell us about their values. Thus, int i is a
declaration of i, while i = 1 is a definition of i.
The difference is more significant when we deal with methods or other procedures. In C++, a method
is declared in a class definition, by giving the types of the arguments and result of the method (often called
the signature of the method). The method is then defined, i.e., the code for executing the method is given,
in another place. Similarly, it is common to define a C function in one file and declare it in other files
where the function is used.
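A short C++ sketch of the distinction; the file split and the names are illustrative:

// math_utils.h -- declarations: they give names and types only
extern int counter;            // declares counter; its definition lives elsewhere
int square(int n);             // declares (prototypes) square

// math_utils.cpp -- definitions: storage and code are provided here
int counter = 0;               // defines counter and gives it a value
int square(int n) {            // defines square: the executable code
    return n * n;
}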
Parameter Passing Mechanisms
In this section, we shall consider how the actual parameters (the parameters used in the call of a
procedure) are associated with the formal parameters (those used in the procedure definition). Which
mechanism is used determines how the calling-sequence code treats parameters. The great majority of
languages use either "call-by-value," or "call-by-reference," or both.
Call - by - Value
In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable). The
value is placed in the location belonging to the corresponding formal parameter of the called procedure.
This method is used in C and Java, and is a common option in C++, as well as in most other languages.
Call-by-value has the effect that all computation involving the formal parameters done by the called
procedure is local to that procedure, and the actual parameters themselves cannot be changed.
Note, however, that in C we can pass a pointer to a variable to allow that variable to be changed by the
callee. Likewise, array names passed as parameters in C, C++, or Java give the called procedure what
is in effect a pointer or reference to the array itself. Thus, if a is the name of an array of the calling
procedure, and it is passed by value to the corresponding formal parameter x, then an assignment such as
x[i] = 2 really changes the array element a[i]. The reason is that, although x gets a copy of the value of a,
that value is really a pointer to the beginning of the area of the store where the array named a is located.
Similarly, in Java, many variables are really references, or pointers, to the things they stand for. This
observation applies to arrays, strings, and objects of all classes. Even though Java uses call-by-value
exclusively, whenever we pass the name of an object to a called procedure, the value received by that
procedure is in effect a pointer to the object. Thus, the called procedure is able to affect the value of the
object itself.
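A C++ sketch of these call-by-value effects; the function and variable names are illustrative:

#include <iostream>

void byValue(int n, int x[]) {   // n is copied; x receives a copy of a pointer to the array
    n = 99;                      // changes only the local copy of n
    x[0] = 2;                    // changes the caller's array element a[0]
}

int main() {
    int k = 1;
    int a[3] = {7, 7, 7};
    byValue(k, a);
    std::cout << k << " " << a[0] << "\n";   // prints "1 2": k is unchanged, a[0] was modified
    return 0;
}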
Call - by - Reference
In call-by-reference, the address of the actual parameter is passed to the callee as the value of the
corresponding formal parameter. Uses of the formal parameter in the code of the callee are implemented
by following this pointer to the location indicated by the caller. Changes to the formal parameter thus
appear as changes to the actual parameter.
If the actual parameter is an expression, however, then the expression is evaluated before the call, and
its value stored in a location of its own. Changes to the formal parameter change this location, but can
have no effect on the data of the caller.
Call-by-reference is used for "ref" parameters in C++ and is an option in many other languages. It is
almost essential when the formal parameter is a large object, array, or structure. The reason is that
strict call-by-value requires that the caller copy the entire actual parameter into the space belonging
to the corresponding formal parameter. This copying gets expensive when the parameter is large. As
we noted when discussing call-by-value, languages such as Java solve the problem of passing arrays,
strings, or other objects by copying only a reference to those objects. The effect is that Java behaves
as if it used call-by-reference for anything other than a basic type such as an integer or real.
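A C++ sketch of a reference parameter (the same effect can be obtained in C by passing a pointer); the names are illustrative:

#include <iostream>

void increment(int &r) {          // r is another name for the caller's variable, not a copy
    r = r + 1;                    // changes the actual parameter
}

int main() {
    int count = 5;
    increment(count);
    std::cout << count << "\n";   // prints 6
    // increment(count + 1);      // rejected by C++: a plain reference cannot bind to an expression
    return 0;
}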
Call - by - Name
A third mechanism — call-by-name — was used in the early programming language Algol 60. It
requires that the callee execute as if the actual parameter were substituted literally for the formal
parameter in the code of the callee, as if the formal parameter were a macro standing for the actual
parameter (with renaming of local names in the called procedure, to keep them distinct). When the
actual parameter is an expression rather than a variable, some unintuitive behaviors occur, which is
one reason this mechanism is not favored today.
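Call-by-name itself is not available in C or C++, but a preprocessor macro gives a rough, illustrative sketch of literal substitution and of why expressions as actual parameters can behave unintuitively:

#include <iostream>

#define TWICE(x) ((x) + (x))    // the "formal parameter" x is substituted literally, macro-style

int calls = 0;
int next() { return ++calls; }  // an actual parameter with a side effect

int main() {
    int a = TWICE(5);        // expands to ((5) + (5)) == 10
    int b = TWICE(next());   // expands to ((next()) + (next())): the expression runs twice,
                             // so calls ends up as 2; call-by-value would evaluate it once
    std::cout << a << " " << b << " " << calls << "\n";   // prints "10 3 2"
    return 0;
}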
LEXICAL ANALYSIS
Upon receiving a "get next token" command from the parser, the lexical analyzer
reads the input characters until it can identify the next token. The LA returns to the parser
a representation of the token it has found. The representation is an integer code if the
token is a simple construct such as a parenthesis, comma, or colon.
The LA may also perform certain secondary tasks at the user interface. One such task
is stripping out from the source program comments and white space in the form of
blank, tab, and newline characters. Another is correlating error messages from the compiler
with the source program.
The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens.
From there, the "parser" proper turns those whole tokens into sentences of your grammar.
A parser does not give the nodes any meaning beyond structural cohesion. The next thing to do is
extract meaning from this structure (sometimes called contextual analysis).
INPUT BUFFERING:
Before discussing the problem of recognizing lexemes in the input, let us examine some
ways that the simple but important task of reading the source program can be sped up. This task
is made difficult by the fact that we often have to look one or more characters beyond the next
lexeme before we can be sure we have the right lexeme. There are many situations where we need
to look at least one additional character ahead. For instance, we cannot be sure we've seen the end
of an identifier until we see a character that is not a letter or digit, and therefore is not part of the
lexeme for id. In C, single-character operators like -, =, or < could also be the beginning of a
two-character operator like ->, ==, or <=. Thus, we shall introduce a two-buffer scheme that
handles large lookaheads safely. We then consider an improvement involving "sentinels" that
saves time checking for the ends of buffers.
Buffer Pairs
Because of the amount of time taken to process characters and the large number of characters that
must be processed during the compilation of a large source program, specialized buffering
techniques have been developed to reduce the amount of overhead required to process a single
input character. An important scheme involves two buffers that are alternately reloaded.
Figure 1.8: Using a Pair of Input Buffers
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
Using one system read command we can read N characters in to a buffer, rather than using one
system call per character. If fewer than N characters remain in the input file, then a special
character, represented by eof, marks the end of the source file and is different from any possible
character of the source program.
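A sketch of reloading one buffer of the pair with a single system call, assuming a POSIX-style read() and leaving one slot at the end of each buffer for the sentinel discussed below; the names and the sentinel value are illustrative:

#include <unistd.h>    // read(): one system call can fill a whole buffer
#include <cstddef>

const std::size_t N = 4096;         // buffer size, typically one disk block
char buf1[N + 1], buf2[N + 1];      // one extra slot at the end of each buffer for the sentinel
const char SENTINEL_EOF = '\0';     // stands in for the special eof character

// Reload one buffer of the pair with up to N characters using a single read(),
// and place the sentinel just past the last valid character.
char *reload(int fd, char *buf) {
    ssize_t got = read(fd, buf, N);
    if (got < 0) got = 0;           // treat a read error as end of input in this sketch
    buf[got] = SENTINEL_EOF;
    return buf + got;               // position of the sentinel / end of the valid data
}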
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent
we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found;
Once the next lexeme is determined, forward is set to the character at its right end. Then,
after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is
set to the character immediately after the lexeme just found. In the figure, we see forward has passed the
end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be retracted one
position to its left.
Advancing forward requires that we first test whether we have reached the end of one of
the buffers, and if so, we must reload the other buffer from the input, and move forward to the
beginning of the newly loaded buffer. As long as we never need to look so far ahead of the actual
lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than N, we
shall never overwrite the lexeme in its buffer before determining it.
If we use the above scheme as described, we must check, each time we advance forward,
that we have not moved off one of the buffers; if we do, then we must also reload the other buffer.
Thus, for each character read, we make two tests: one for the end of the buffer, and one to
determine what character is read (the latter may be a multi way branch). We can combine the
buffer-end test with the test for the current character if we extend each buffer to hold a sentinel
character at the end. The sentinel is a special character that cannot be part of the source program,
and a natural choice is the character eof. Figure 1.8 shows the same arrangement as Figure 1.7, but
with the sentinels added. Note that eof retains its use as a marker for the end of the entire input.
Any eof that appears other than at the end of a buffer means that the input is at an end.
switch ( *forward++ )
{
    case eof:
        if ( forward is at end of first buffer )
        {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if ( forward is at end of second buffer )
        {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    /* cases for the other characters */
}
Token      Sample Lexemes         Informal Description of Pattern
if         if                     the characters i, f
relop      <=, <>                 < or <= or = or <> or >= or >
id         pi                     letter followed by letters and digits
number     60, 3.14               any numeric constant
A pattern is a rule describing the set of lexemes that can represent a particular token
in source program.
LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, which
means that there is no way to recognise a lexeme as a valid token for the lexer.
Syntax errors, on the other hand, are thrown by the parser when a given sequence of
already recognised valid tokens does not match any of the right-hand sides of the grammar
rules. A simple panic-mode error handling system requires that we return to a high-
level parsing function when a parsing or lexical error is detected.
o ε is a regular expression denoting { ε }, that is, the language containing only the
empty string.
o For each 'a' in ∑, a is a regular expression denoting { a }, the language with
only one string consisting of the single symbol 'a'.
o If R and S are regular expressions, then (R)|(S), (R)(S), and (R)* are regular
expressions denoting L(R) ∪ L(S), L(R)L(S), and (L(R))*, respectively.
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
"equals" and <> is "not equals", because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names
of tokens as far as the lexical analyzer is concerned; the patterns for these tokens are
described using regular definitions.
digit  --> [0-9]
digits --> digit+
number --> digits (. digits)? (E [+-]? digits)?
letter --> [A-Za-z]
id     --> letter (letter | digit)*
if     --> if
then   --> then
else   --> else
relop  --> < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the "token" ws defined by:
ws --> (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that ,when we
recognize it, we do not return it to parser ,but rather restart the lexical analysis from the
character that follows the white space . It is the following token that gets returned to the
parser.
Lexeme Token Name Attribute Value
Any ws _ _
if If _
then Then _
else Else _
Any id       Id       pointer to table entry
Any number   Number   pointer to table entry
< Relop LT
<= Relop LE
= Relop EQ
<> Relop NE
if    = if
then  = then
else  = else
relop = < | <= | = | <> | > | >=
id    = letter (letter | digit)*
num   = digit+ (. digit+)? (E (+ | -)? digit+)?
2.10 AUTOMATA
DESCRIPTION OF AUTOMATA
Deterministic Automata
Non-Deterministic Automata.
DETERMINISTIC AUTOMATA
A deterministic finite automaton has at most one transition from each state on
any input. A DFA is a special case of an NFA in which:
1. no state has an ε-transition, and
2. each input symbol has at most one transition from any state.
A regular expression is converted into a minimized DFA by the following procedure:
regular expression → NFA → DFA → minimized DFA.
The finite automaton is called a DFA if there is only one path for a specific input
from the current state to the next state.
[Figure: DFA transition diagram with states S0, S1, S2 and edges labeled 'a'.]
From state S0, for input 'a', there is only one path, going to S2; similarly, from S0
there is only one path for a given input going to S1.
NON-DETERMINISTIC AUTOMATA: An NFA is a mathematical model that consists of:
A set of states S.
A set of input symbols ∑.
A transition function move that maps state-symbol pairs to sets of states.
A state s0 distinguished as the start (or initial) state.
A set of states F distinguished as accepting (or final) states.
An NFA may have any number of transitions from a state on a single input symbol.
An NFA can be diagrammatically represented by a labeled directed graph, called
a transition graph, in which the nodes are the states and the labeled edges
represent the transition function.
This graph looks like a transition diagram, but the same character can label two
or more transitions out of one state, and edges can be labeled by the special
symbol ε as well as by input symbols.
The transition graph for an NFA that recognizes the language ( a | b )* abb has start state 0,
with edges labeled a and b from state 0 back to itself, an edge labeled a from state 0 to state 1,
an edge labeled b from state 1 to state 2, and an edge labeled b from state 2 to the accepting state 3.
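As an illustration, the following C++ sketch recognizes the same language with the equivalent DFA, using a transition table; the state numbering is illustrative:

#include <iostream>
#include <string>

// DFA for (a|b)*abb: state 3 is accepting.
// Row = current state, column 0 = input 'a', column 1 = input 'b'.
int trans[4][2] = {
    {1, 0},   // state 0: 'a' -> 1, 'b' -> 0
    {1, 2},   // state 1: 'a' -> 1, 'b' -> 2
    {1, 3},   // state 2: 'a' -> 1, 'b' -> 3
    {1, 0}    // state 3: 'a' -> 1, 'b' -> 0
};

bool matches(const std::string &s) {
    int state = 0;
    for (char c : s) {
        if (c != 'a' && c != 'b') return false;   // only a and b are in the alphabet
        state = trans[state][c == 'b'];
    }
    return state == 3;                            // accept if we end in the final state
}

int main() {
    std::cout << matches("aabb") << " " << matches("abab") << "\n";   // prints "1 0"
    return 0;
}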
Lex is a tool used to generate a lexical analyzer; the input notation for the Lex tool is
referred to as the Lex language and the tool itself is the Lex compiler. Behind the scenes, the
Lex compiler transforms the input patterns into a transition diagram and generates code in a
file called lex.yy.c; this is a C program which, given to a C compiler, yields the object code. Here we need
to know how to write the Lex language. The structure of a Lex program is given below.
Structure of a LEX Program: A Lex program has the following form:
Declarations
%%
Translation rules
%%
Auxiliary functions
In the Translation rules section, we place pattern-action pairs, where each pair has the form
Pattern {Action}
The auxiliary function definitions section includes the definitions of functions used to install
identifiers and numbers in the symbol table.
%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
%%
int installID() {/* function to install the lexeme, whose first character is pointed to by yytext
                    and whose length is yyleng, into the symbol table; returns a pointer thereto */
}
int installNum() {/* similar to installID, but puts numerical constants into a separate table */}
Figure 1.10: Lex program for common tokens
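As a usage sketch (assuming a Unix-style Lex or Flex installation, with an illustrative file name tokens.l), the specification is run through Lex to produce lex.yy.c, which is then compiled and linked with the Lex library:

lex tokens.l          (or: flex tokens.l)
cc lex.yy.c -ll       (with Flex, link with -lfl instead)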