
UNIT-I

INTRODUCTION
LANGUAGE TRANSLATORS:

A language translator is a computer program that translates a program written in one (source) language into an equivalent program in another (target) language. The source program is usually written in a high-level language, while the target language can be anything from the machine language of a target machine (ranging from a microprocessor to a supercomputer) to another high-level language.

−− Two commonly used translators are the compiler and the interpreter.


1. Compiler: A compiler is a program that reads a program in one language, called the source language, and translates it into an equivalent program in another language, called the target language. In addition, it reports error information to the user.

If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.

Input → Target Program → Output

Figure 1.1: Running the target program

2. Interpreter: An interpreter is another commonly used language processor. Instead of producing a target program as a single translation unit, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.

Source Program, Input → Interpreter → Output

Figure 1.2: Running an interpreter


LANGUAGE PROCESSING SYSTEM:
Based on the input a translator takes and the output it produces, a language translator can be classified as one of the following.
Preprocessor: A preprocessor takes the skeletal source program as input and produces an extended version of it, which results from expanding macros and manifest constants (if any) and including header files in the source file. For example, the C preprocessor is a macro processor that is invoked automatically by the C compiler to transform the source before actual compilation. A preprocessor typically performs the following activities (a small example follows the list):
−− Collects all the modules if the source program is divided into modules stored in different files.
−− Expands shorthands / macros into source language statements.
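
As a minimal sketch (the file name, macro, and constant below are illustrative, not from the notes), the C preprocessor expands macros and includes headers before the compiler proper runs:

    /* before preprocessing (hypothetical file example.c) */
    #include <stdio.h>            /* header file inclusion     */
    #define PI 3.14159            /* manifest constant         */
    #define AREA(r) (PI*(r)*(r))  /* macro shorthand           */

    int main(void) {
        printf("%f\n", AREA(2.0));    /* expands to (3.14159*(2.0)*(2.0)) */
        return 0;
    }

With most C compilers, running cc -E example.c prints the modified source program, in which the directives have been replaced by their expansions.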
Compiler: A translator that takes as input a source program written in a high-level language and converts it into its equivalent target program in machine language. In addition, the compiler also
−− Reports to its user the presence of errors in the source program.
−− Helps the user rectify the errors and execute the code.
Assembler: A program that takes as input an assembly language program and converts it into its equivalent machine language code.
Loader / Linker: A program that takes relocatable code as input, collects library functions and relocatable object files, and produces the equivalent absolute machine code. Specifically,
−− Loading consists of taking the relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations.
−− Linking allows us to make a single program from several files of relocatable machine code. These files may have been the result of several different compilations, and one or more may be library routines provided by the system, available to any program that needs them.
The steps involved in a typical language processing system can be understood with following diagram.

Source Program [ Example: filename.C ]
        ↓
   Preprocessor
        ↓
Modified Source Program
        ↓
     Compiler
        ↓
Target Assembly Program
        ↓
    Assembler
        ↓
Relocatable Machine Code [ Example: filename.obj ]
        ↓
Linker / Loader  ←  Library files, Relocatable Object files
        ↓
Target Machine Code [ Example: filename.exe ]

Figure 1.3: Context of a compiler in a language processing system

TYPES OF COMPILERS:

Based on the specific input it takes and the output it produces, compilers can be classified into the following types:

Traditional Compilers (C, C++, Pascal): These compilers convert a source program in a HLL into its equivalent in native machine code or object code.

Interpreters (LISP, SNOBOL, Java 1.0): These first convert source code into intermediate code, and then interpret (emulate) it to produce the equivalent machine behavior.

Cross-Compilers: These are compilers that run on one machine and produce code for another machine.

Incremental Compilers: These compilers separate the source into user-defined steps, compiling/recompiling step by step and interpreting steps in a given order.

Converters (e.g., COBOL to C++): These programs compile from one high-level language to another.

Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers from an intermediate language (bytecode, MSIL) to executable or native machine code. They perform type-based verification, which makes the executable code more trustworthy.

Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are pre-compilers to native code for Java and .NET.

Binary Compilers: These compilers translate object code of one platform into object code of another platform.

PHASES OF A COMPILER:

Compiler phases are the individual modules that are executed in sequence to perform their respective sub-activities, and whose results are finally integrated to give the target code.

It is desirable to have relatively few phases, since it takes time to read and write intermediate files. The following diagram (Figure 1.4) depicts the phases a compiler goes through during compilation. A typical compiler has the following phases:

1. Lexical Analyzer (Scanner),


2. Syntax Analyzer (Parser),
3. Semantic Analyzer,
4. Intermediate Code Generator(ICG),
5. Code Optimizer(CO) , and
6. Code Generator(CG).

In addition to these, the compiler also has symbol table management and error handling, which interact with all phases.
The phases of a compiler are divided into two parts: the first three phases are called the analysis part, and the remaining three are called the synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table.
The analysis part is often called the front end of the compiler; the synthesis part is the back end.
If we examine the compilation process in more detail, we see that it operates as a sequence of phases, each of which transforms one representation of the source program into another.
A typical decomposition of a compiler into phases is shown in Figure 1.4.
In practice, several phases may be grouped together, and the intermediate representations between the grouped phases need not be constructed explicitly.
Figure 1.4: Phases of a compiler
PHASE, PASSES OF A COMPILER:

In some applications the compiler is organized into what are called passes, where a pass is a collection of phases that convert the input from one representation to a completely different representation. Each pass makes a complete scan of the input and produces output to be processed by the subsequent pass. An example is a two-pass assembler.
LEXICAL ANALYZER (SCANNER): The scanner is the first phase; it works as the interface between the compiler and the source language program and performs the following functions:

−− Reads the characters of the source program and groups them into a stream of tokens, in which each token specifies a logically cohesive sequence of characters, such as an identifier, a keyword, a punctuation mark, or a multi-character operator like :=.

−− The character sequence forming a token is called a lexeme of the token.

−− The scanner generates a token-id, and also enters the identifier's name in the symbol table if it does not already exist there.

−− It also removes comments and unnecessary white space.

The format of a token is <token name, attribute value>.
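
A token can be represented in code as a small record holding the token name and an attribute (commonly a symbol-table pointer or a constant value). The sketch below is illustrative only; the enum values and struct layout are assumptions, not part of the notes:

    #include <stdio.h>

    /* token names (illustrative subset) */
    enum token_name { ID, NUMBER, RELOP, IF, THEN, ELSE };

    /* a token is a <token name, attribute value> pair */
    struct token {
        enum token_name name;
        int attribute;          /* e.g. symbol-table index or relop code */
    };

    int main(void) {
        struct token t = { ID, 42 };   /* identifier whose symbol-table entry is 42 */
        printf("<%d, %d>\n", t.name, t.attribute);
        return 0;
    }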

SYNTAX ANALYZER (PARSER): The parser interacts with the scanner and with its subsequent phase, the semantic analyzer, and performs the following functions:

−− Groups the received token stream into syntactic structures, usually a structure called a parse tree whose leaves are tokens.

−− The interior nodes of this tree represent streams of tokens that logically belong together.

−− In other words, it checks the syntax of the program elements.

SEMANTIC ANALYZER: This phase receives the syntax tree as input and checks the semantic correctness of the program. Even though the tokens are valid and syntactically correct, it may happen that they are not semantically correct. Therefore the semantic analyzer checks the semantics (meaning) of the statements formed.

−− The syntactically and semantically correct structures are produced here in the form of a syntax tree, a DAG, or some other sequential representation such as a matrix.

INTERMEDIATE CODE GENERATOR (ICG): This phase takes the syntactically and semantically correct structure as input, and produces an equivalent intermediate notation of the source program. The intermediate code should have two important properties:

−− It should be easy to produce, and easy to translate into the target program.

Example intermediate code forms are:

−− Three-address code,
−− Polish notation, etc.
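
As a sketch of three-address code (the temporary names t1 and t2 are illustrative, not from the notes), the assignment position = initial + rate * 60, used as the running example later in this unit, can be written so that each line has at most one operator on the right-hand side:

    /* three-address form of: position = initial + rate * 60 */
    t1 = rate * 60;
    t2 = initial + t1;
    position = t2;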

CODE OPTIMIZER: This phase is optional in some compilers, but it is very beneficial in terms of the time, effort, and cost it can save. This phase performs the following functions:

−− Attempts to improve the intermediate code so as to produce faster machine code. Typical techniques include loop optimization, removal of redundant computations, strength reduction, frequency reduction, etc.

−− Sometimes the data structures used to represent the intermediate forms may also be changed.
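
Continuing the hedged sketch above, a simple optimizer might fold the constant conversion and eliminate the copy through the extra temporary; the exact transformations depend on the optimizer, so this is only an illustration:

    /* before optimization */
    t1 = rate * 60;
    t2 = initial + t1;
    position = t2;

    /* after optimization: the constant is converted to 60.0 at compile time
       and the copy through t2 is eliminated */
    t1 = rate * 60.0;
    position = initial + t1;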

CODE GENERATOR: This is the final phase of the compiler; it generates the target code, normally consisting of relocatable machine code, assembly code, or absolute machine code.

−− Memory locations are selected for each variable used, and variables are assigned to registers.

−− Intermediate instructions are translated into a sequence of machine instructions.

The compiler also performs symbol table management and error handling throughout the compilation process. The symbol table is a data structure that stores the different source language constructs and tokens generated during compilation. These two components interact with all phases of the compiler.

For example, suppose the source program is an assignment statement; the following figure shows how the phases of the compiler process it. The input source program is position = initial + rate * 60.

Figure 1.5: Translation of an assignment statement.


Compiler Construction Tools:
The compiler writer, like any software developer, can profitably use modern software development environments containing tools such as language editors, debuggers, version managers, profilers, test harnesses, and so on.
In addition to these general software-development tools, other more specialized tools have been
created to help implement various phases of a compiler. These tools use specialized languages for
specifying and implementing specific components, and many use quite sophisticated algorithms.
The most successful tools are those that hide the details of the generation algorithm and produce
components that can be easily integrated into the remainder of the compiler.
Some commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical description of
a programming language.

2. Scanner generators that produce lexical analyzers from a regular-expression description of the
tokens of a language.

3. Syntax-directed translation engines that produce collections of routines for walking a parse tree
and generating intermediate code.

4. Code-generator generators that produce a code generator from a collection of rules for translating
each operation of the intermediate language into the machine language for a target machine.

5. Data- flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data- flow analysis is a key part of code
optimization.

6. Compiler-construction toolkits that provide an integrated set of routines for constructing various
phases of a compiler.

The Science of building a Compiler:


• A compiler must accept all source programs that conform to the specification of the language; the set
of source programs is infinite and any program can be very large, consisting of possibly millions of
lines of code. Any transformation performed by the compiler while translating a source program must
preserve the meaning of the program being compiled.
• Compiler writers thus have influence over not just the compilers they create, but all the programs that
their compilers compile. This leverage makes writing compilers particularly rewarding; however, it
also makes compiler development challenging.

Modelling in compiler design and implementation:


• The study of compilers is mainly a study of how we design the right mathematical models and choose
the right algorithms.
• Some of most fundamental models are finite-state machines and regular expressions.
• These models are useful for describing the lexical units of programs (keywords, identifiers, and such) and for describing the algorithms used by the compiler to recognize those units.
• Also among the most fundamental models are context-free grammars, used to describe the syntactic
structure of programming languages such as the nesting of parentheses or control constructs.
• Similarly, trees are an important model for representing the structure of programs and their translation
into object code.
• The science of code optimization: The term "optimization" in compiler design refers to the attempts that a compiler makes to produce code that is more efficient than the obvious code.
• In modern times, the optimization of code that a compiler performs has become both more important
and more complex.
• It is more complex because processor architectures have become more complex, yielding more
opportunities to improve the way code executes.
• It is more important because massively parallel computers require substantial optimization, or their performance suffers by orders of magnitude.

Compiler optimizations must meet the following design objectives:

1. The optimization must be correct, that is, preserve the meaning of the compiled program,
2. The optimization must improve the performance of many programs,
3. The compilation time must be kept reasonable, and
4. The engineering effort required must be manageable.

Thus, in studying compilers, we learn not only how to build a compiler, but also the general methodology of
solving complex and open-ended problems.

Programming Language Basics:

1 The Static/Dynamic Distinction


2 Environments and States
3 Static Scope and Block Structure
4 Explicit Access Control
5 Dynamic Scope
6 Parameter Passing Mechanisms

The Static/Dynamic Distinction: Among the most important issues that we face when designing a compiler
for a language is what decisions can the compiler make about a program. If a language uses a policy that
allows the compiler to decide an issue, then we say that the language uses a static policy or that the issue can
be decided at compile time. On the other hand, a policy that only allows a decision to be made when we
execute the program is said to be a dynamic policy. One issue is the scope of declarations. The scope of a
declaration of x is the region of the program in which uses of x refer to this declaration. A language uses
static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the
program. Otherwise, the language uses dynamic scope. With dynamic scope, as the program runs, the same
use of x could refer to any of several different declarations of x.

Environments and States:


The environment is a mapping from names to locations in the store. Since variables refer to locations, we could alternatively define an environment as a mapping from names to variables.
The state is a mapping from locations in the store to their values. That is, the state maps l-values to their corresponding r-values, in the terminology of C. Environments change according to the scope rules of a language.
Static Scope and Block Structure

Most languages, including C and its family, use static scope. Here we consider static-scope rules for a language with blocks, where a block is a grouping of declarations and statements. C uses braces { and } to delimit a block; the alternative use of begin and end for the same purpose dates back to Algol.
A C program consists of a sequence of top-level declarations of variables and functions. Functions may have variable declarations within them, where variables include local variables and parameters. The scope of each such declaration is restricted to the function in which it appears.
The scope of a top-level declaration of a name x consists of the entire program that follows, with the exception of those statements that lie within a function that also has a declaration of x.
A block is a sequence of declarations followed by a sequence of statements, all surrounded by braces. A declaration D "belongs" to a block B if B is the most closely nested block containing D; that is, D is located within B, but not within any block that is nested within B. The static-scope rule for variable declarations in block-structured languages is as follows. If declaration D of name x belongs to block B, then the scope of D is all of B, except for any blocks B' nested to any depth within B in which x is redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x belongs to B'.
An equivalent way to express this rule is to focus on a use of a name x. Let B1, B2, ..., Bk be all the blocks that surround this use of x, with Bk the smallest, nested within Bk-1, which is nested within Bk-2, and so on. Search for the largest i such that there is a declaration of x belonging to Bi. This use of x refers to the declaration in Bi; alternatively, this use of x is within the scope of the declaration in Bi. A small C example follows.
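
The following minimal C sketch (the variable name is illustrative) shows the rule in action: each use of x refers to the most closely nested enclosing declaration.

    #include <stdio.h>

    int x = 1;                 /* declaration belonging to the top level        */

    int main(void) {
        int x = 2;             /* redeclares x; scope is the body of main       */
        {
            int x = 3;         /* redeclares x again; scope is this inner block */
            printf("%d\n", x); /* prints 3: nearest enclosing declaration       */
        }
        printf("%d\n", x);     /* prints 2: main's declaration is visible again */
        return 0;
    }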
Explicit Access Control
Through the use of keywords like public, private, and protected, object-oriented languages such as C++ or Java provide explicit control over access to member names in a superclass. These keywords support encapsulation by restricting access. Thus, private names are purposely given a scope that includes only the method declarations and definitions associated with that class and any "friend" classes (the C++ term). Protected names are accessible to subclasses. Public names are accessible from outside the class.

Dynamic Scope
Any scoping policy is dynamic if it is based on factors that can be known only when the program executes. The term dynamic scope, however, usually refers to the following policy: a use of a name x refers to the declaration of x in the most recently called procedure with such a declaration. Dynamic scoping of this type appears only in special situations. We shall consider two examples of dynamic policies: macro expansion in the C preprocessor and method resolution in object-oriented programming. A sketch of the macro-expansion case follows.
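
The following hedged C sketch (names are illustrative) shows how macro expansion behaves like dynamic scope: the x used in the macro body is whichever x is visible at the point where the macro is expanded, not where it is defined.

    #include <stdio.h>

    #define a (x + 1)          /* the macro body refers to some x, resolved at expansion time */

    int x = 2;

    void b(void) {
        int x = 1;
        printf("%d\n", a);     /* expands to (x + 1) with b's local x: prints 2  */
    }

    void c(void) {
        printf("%d\n", a);     /* expands to (x + 1) with the global x: prints 3 */
    }

    int main(void) { b(); c(); return 0; }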
Declarations and Definitions

Declarations tell us about the types of things, while definitions tell us about their values. Thus, int i is a declaration of i, while i = 1 is a definition of i.
The difference is more significant when we deal with methods or other procedures. In C++, a method is declared in a class definition by giving the types of the arguments and result of the method (often called the signature of the method). The method is then defined, i.e., the code for executing the method is given, in another place. Similarly, it is common to define a C function in one file and declare it in other files where the function is used.
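
A minimal C illustration of the distinction (the function and file names are hypothetical):

    /* util.h: declaration only -- gives the type (signature) of the function */
    int square(int n);

    /* util.c: definition -- supplies the code */
    int square(int n) { return n * n; }

    /* main.c: uses the declaration; the linker later resolves it to the definition */
    #include <stdio.h>
    #include "util.h"
    int main(void) { printf("%d\n", square(5)); return 0; }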
Parameter Passing Mechanisms

In this section, we shall consider how the actual parameters (the parameters used in the call of a
procedure) are associated with the formal parameters (those used in the procedure definition). Which
mechanism is used determines how the calling-sequence code treats parameters. The great majority of
languages use either "call-by-value," or "call-by-reference," or both.
Call-by-Value

In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable). The value is placed in the location belonging to the corresponding formal parameter of the called procedure. This method is used in C and Java, and is a common option in C++, as well as in most other languages. Call-by-value has the effect that all computation involving the formal parameters done by the called procedure is local to that procedure, and the actual parameters themselves cannot be changed.
Note, however, that in C we can pass a pointer to a variable to allow that variable to be changed by the callee. Likewise, array names passed as parameters in C, C++, or Java give the called procedure what is in effect a pointer or reference to the array itself. Thus, if a is the name of an array in the calling procedure, and it is passed by value to the corresponding formal parameter x, then an assignment such as x[i] = 2 really changes the array element a[i]. The reason is that, although x gets a copy of the value of a, that value is really a pointer to the beginning of the area of the store where the array named a is located.
Similarly, in Java, many variables are really references, or pointers, to the things they stand for. This observation applies to arrays, strings, and objects of all classes. Even though Java uses call-by-value exclusively, whenever we pass the name of an object to a called procedure, the value received by that procedure is in effect a pointer to the object. Thus, the called procedure is able to affect the value of the object itself.
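
A small C sketch (the function names are illustrative) showing both effects: a plain int parameter is copied, while an array parameter is in effect a pointer to the caller's array.

    #include <stdio.h>

    void bump(int n)        { n = n + 1; }     /* changes only the local copy       */
    void set_first(int x[]) { x[0] = 2; }      /* x is in effect a pointer to a     */

    int main(void) {
        int v = 10;
        int a[3] = { 7, 8, 9 };

        bump(v);            /* v is still 10: the callee changed its own copy  */
        set_first(a);       /* a[0] is now 2: the array itself was reachable   */

        printf("%d %d\n", v, a[0]);   /* prints "10 2" */
        return 0;
    }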
Call-by-Reference

In call-by-reference, the address of the actual parameter is passed to the callee as the value of the corresponding formal parameter. Uses of the formal parameter in the code of the callee are implemented by following this pointer to the location indicated by the caller. Changes to the formal parameter thus appear as changes to the actual parameter.
If the actual parameter is an expression, however, then the expression is evaluated before the call, and its value is stored in a location of its own. Changes to the formal parameter change this location, but can have no effect on the data of the caller.
Call-by-reference is used for reference ("ref") parameters in C++ and is an option in many other languages. It is almost essential when the formal parameter is a large object, array, or structure. The reason is that strict call-by-value requires that the caller copy the entire actual parameter into the space belonging to the corresponding formal parameter. This copying gets expensive when the parameter is large. As we noted when discussing call-by-value, languages such as Java solve the problem of passing arrays, strings, or other objects by copying only a reference to those objects. The effect is that Java behaves as if it used call-by-reference for anything other than a basic type such as an integer or real.
Call-by-Name

A third mechanism — call-by-name — was used in the early programming language Algol 60. It
requires that the callee execute as if the actual parameter were substituted literally for the formal
parameter in the code of the callee, as if the formal parameter were a macro standing for the actual
parameter (with renaming of local names in the called procedure, to keep them distinct). When the
actual parameter is an expression rather than a variable, some unintuitive behaviors occur, which is
one reason this mechanism is not favored today.

LEXICAL ANALYSIS

OVERVIEW OF LEXICAL ANALYSIS

o To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
o Secondly, having decided what the tokens are, we need some mechanism to recognize them in the input stream. This is done by the token recognizers, which are designed using transition diagrams and finite automata.

ROLE OF LEXICAL ANALYZER


The lexical analyzer (LA) is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA returns to the parser a representation for the token it has found. The representation will be an integer code if the token is a simple construct such as a parenthesis, comma, or colon.

The LA may also perform certain secondary tasks at the user interface. One such task is stripping out comments and white space (blank, tab, and newline characters) from the source program. Another is correlating error messages from the compiler with the source program.

LEXICAL ANALYSIS VS PARSING:

Lexical analysis: A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators, etc. The lexical analyzer (the "lexer") groups individual symbols from the source code file into tokens.

Parsing: A parser converts this list of tokens into a tree-like object that represents how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence); the parser turns whole tokens into sentences of the grammar. A parser does not give the nodes any meaning beyond structural cohesion; the next step is to extract meaning from this structure (sometimes called contextual analysis).

INPUT BUFFERING:

Before discussing the problem of recognizing lexemes in the input, let us examine some ways that the simple but important task of reading the source program can be sped up. This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. There are many situations where we need to look at least one additional character ahead. For instance, we cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id. In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=. Thus, we shall introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.

Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that
must be processed during the compilation of a large source program, specialized buffering
techniques have been developed to reduce the amount of overhead required to process a single
input character. An important scheme involves two buffers that are alternately reloaded.
Figure 1.7: Using a pair of input buffers

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is different from any possible character of the source program.

 Two pointers to the input are maintained:

1. The pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. The pointer forward scans ahead until a pattern match is found.

Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In Figure 1.7, we see that forward has passed the end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be retracted one position to its left.

Advancing forward requires that we first test whether we have reached the end of one of
the buffers, and if so, we must reload the other buffer from the input, and move forward to the
beginning of the newly loaded buffer. As long as we never need to look so far ahead of the actual
lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than N, we
shall never overwrite the lexeme in its buffer before determining it.

Sentinels to Improve Scanner Performance:

If we use the above scheme as described, we must check, each time we advance forward,
that we have not moved off one of the buffers; if we do, then we must also reload the other buffer.
Thus, for each character read, we make two tests: one for the end of the buffer, and one to
determine what character is read (the latter may be a multi way branch). We can combine the
buffer-end test with the test for the current character if we extend each buffer to hold a sentinel
character at the end. The sentinel is a special character that cannot be part of the source program,
and a natural choice is the character eof. Figure 1.8 shows the same arrangement as Figure 1.7, but
with the sentinels added. Note that eof retains its use as a marker for the end of the entire input.

Figure 1.8: Sentinels at the end of each buffer

Any eof that appears other than at the end of a buffer means that the input is at an end.

switch ( *forward++ ) {
    case eof:
        if ( forward is at end of first buffer ) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if ( forward is at end of second buffer ) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    /* cases for the other characters */
}

Figure 1.9: Use of switch-case for the sentinel

TOKEN, LEXEME, PATTERN:

Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
1) identifiers  2) keywords  3) operators  4) special symbols  5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Example: Description of tokens

    Token      Lexeme               Pattern
    const      const                const
    if         if                   if
    relation   <, <=, =, <>, >=, >  < or <= or = or <> or >= or >
    id         pi                   letter followed by letters and digits
    num        3.14                 any numeric constant
    literal    "core"               any characters between " and " except "

A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
LEXICAL ERRORS:

Lexical errors are the errors thrown by the lexer when it is unable to continue, which means that there is no way to recognize a lexeme as a valid token for the lexer. Syntax errors, on the other hand, are thrown by the parser when a given sequence of already recognized valid tokens does not match any of the right-hand sides of the grammar rules. A simple panic-mode error handling system requires that we return to a high-level parsing function when a parsing or lexical error is detected.

Error-recovery actions are:

i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.

2.2 REGULAR EXPRESSIONS

A regular expression is a formula that describes a possible set of strings.

Components of a regular expression:

    x         the character x
    .         any character, usually except a newline
    [xyz]     any one of the characters x, y, z, ...
    R?        an R or nothing (i.e., an optional R)
    R*        zero or more occurrences of R
    R+        one or more occurrences of R
    R1R2      an R1 followed by an R2
    R1|R2     either an R1 or an R2

A token is either a single string or one of a collection of strings of a certain type. If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.

Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In regular expression notation we would write:

    identifier = letter (letter | digit)*

Here are the rules that define regular expressions over an alphabet Σ:

o ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
o For each a in Σ, a is a regular expression denoting {a}, the language with only one string, consisting of the single symbol a.
o If R and S are regular expressions denoting the languages LR and LS, then

    (R) | (S) denotes LR ∪ LS,
    R.S denotes LR.LS, and
    R* denotes LR*.

2.3 REGULAR DEFINITIONS

For notational convenience, we may wish to give names to regular expressions and to define regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular definition provides a precise specification for this class of strings:

    letter --> [A-Za-z]
    digit  --> [0-9]
    id     --> letter (letter | digit)*

Example 1: ab*|cd? is equivalent to (a(b*)) | (c(d?)).
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns. Consider the grammar fragment:

    stmt --> if expr then stmt
           | if expr then stmt else stmt
    expr --> term relop term
           | term
    term --> id
           | number

For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is "not equals", because they present an interesting structure of lexemes. The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for the tokens are described using regular definitions:

    digit  --> [0-9]
    digits --> digit+
    number --> digits (. digits)? (E [+-]? digits)?
    letter --> [A-Za-z]
    id     --> letter (letter | digit)*
    if     --> if
    then   --> then
    else   --> else
    relop  --> < | > | <= | >= | = | <>

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the "token" ws defined by:

    ws --> (blank | tab | newline)+

Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.

    Lexeme        Token Name    Attribute Value
    any ws        -             -
    if            if            -
    then          then          -
    else          else          -
    any id        id            pointer to table entry
    any number    number        pointer to table entry
    <             relop         LT
    <=            relop         LE
    =             relop         EQ
    <>            relop         NE

2.4 TRANSITION DIAGRAM:


A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input while looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
If we are in a state s and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, we additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled "start" entering from nowhere. The transition diagram always begins in the start state before any input symbols have been used.
As an intermediate step in the construction of a lexical analyzer, we first produce a stylized flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.

The above transition diagram is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program that looks for the tokens specified by the diagrams; each state gets a segment of code (a small sketch follows the token definitions below).

    if    = if
    then  = then
    else  = else
    relop = < | <= | = | <> | > | >=
    id    = letter (letter | digit)*
    num   = digit+
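
As a hedged sketch (not from the notes) of how transition-diagram states become segments of code, the identifier diagram above might be hand-coded in C roughly as follows:

    #include <ctype.h>

    /* Returns the length of an identifier lexeme starting at s, or 0 if none.
       State 0: expect a letter; state 1: loop on letters/digits; accept on any
       other character, retracting one position (the '*' convention). */
    int match_id(const char *s) {
        int i = 0;
        if (!isalpha((unsigned char)s[i]))     /* state 0: must start with a letter */
            return 0;
        i++;
        while (isalnum((unsigned char)s[i]))   /* state 1: letters or digits        */
            i++;
        return i;                              /* accepting state reached           */
    }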
2.10 AUTOMATA

An automaton is defined as a system where information is transmitted and used for performing some functions without direct human participation.
1. An automaton in which the output depends only on the input is called an automaton without memory.
2. An automaton in which the output depends on the input and the state is called an automaton with memory.
3. An automaton in which the output depends only on the state of the machine is called a Moore machine.
4. An automaton in which the output depends on the state and the input at any instant of time is called a Mealy machine.

DESCRIPTION OF AUTOMATA

1. An automaton has a mechanism to read input from an input tape.
2. Any language is recognized by some automaton; hence these automata are basically language "acceptors" or "language recognizers".
Types of Finite Automata

Deterministic Automata
Non-Deterministic Automata.
DETERMINISTIC AUTOMATA

A deterministic finite automaton (DFA) has at most one transition from each state on any input. A DFA is a special case of an NFA in which:

1. it has no transitions on input ε, and
2. each input symbol has at most one transition from any state.

A DFA is formally defined by the 5-tuple notation M = (Q, Σ, δ, q0, F), where:

    Q  is a finite, non-empty set of states,
    Σ  is the input alphabet (the set of input symbols),
    q0 is the initial state, with q0 in Q,
    F  is the set of final states, a subset of Q, and
    δ  is the transition (mapping) function; using this function the next state can be determined.

A regular expression is converted into a minimized DFA by the following procedure:

Regular expression → NFA → DFA → Minimized DFA

A finite automaton is called a DFA if there is only one path for a specific input from the current state to the next state.

[Transition diagram: start state S0 with transitions labeled a to states S1 and S2.]

From state S0, for input a there is only one path, going to S2; similarly, from S0 there is only one path for the other input, going to S1.
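
To make the idea concrete, here is a hedged table-driven sketch in C (not from the notes) that simulates a DFA for the language (a | b)*abb, the example used in the NFA discussion below; the state numbering is an assumption.

    #include <stdio.h>

    /* DFA for (a|b)*abb: state 3 is the only accepting state.
       move[state][0] is the next state on 'a', move[state][1] on 'b'. */
    static const int move[4][2] = {
        {1, 0},   /* state 0 */
        {1, 2},   /* state 1 */
        {1, 3},   /* state 2 */
        {1, 0}    /* state 3 */
    };

    int accepts(const char *s) {
        int state = 0;
        for (; *s; s++) {
            if (*s == 'a')      state = move[state][0];
            else if (*s == 'b') state = move[state][1];
            else return 0;      /* symbol not in the alphabet */
        }
        return state == 3;
    }

    int main(void) {
        printf("%d %d\n", accepts("aabb"), accepts("abab"));   /* prints "1 0" */
        return 0;
    }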

2.11 NONDETERMINISTIC AUTOMATA

An NFA is a mathematical model that consists of:

    A set of states S.
    A set of input symbols Σ.
    A transition function move that maps state-symbol pairs to sets of states.
    A state s0 that is distinguished as the start (or initial) state.
    A set of states F distinguished as accepting (or final) states.

An NFA can be represented diagrammatically by a labeled directed graph, called a transition graph, in which the nodes are the states and the labeled edges represent the transition function.

This graph looks like a transition diagram, but the same character can label two or more transitions out of one state, and edges can be labeled by the special symbol ε as well as by input symbols.

The transition graph for an NFA that recognizes the language (a | b)*abb is shown below.

2.12 DEFINITION OF SYMBOL TABLE

A symbol table is an extensible array of records. Each identifier has an associated record that contains collected information about the identifier.

FUNCTION identify(identifier name)
RETURNING a pointer to identifier information, which contains:
    The actual string
    A macro definition
    A keyword definition
    A list of type, variable and function definitions
    A list of structure and union name definitions
    A list of structure and union field selector definitions
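
As a minimal sketch (the field names and the fixed-size array are assumptions for illustration only), a symbol-table record and lookup function in C might look like:

    #include <string.h>

    #define MAX_SYMS 256

    /* one record of collected information about an identifier */
    struct sym_entry {
        char name[64];      /* the actual string                  */
        int  kind;          /* e.g. macro, keyword, variable, ... */
        int  type;          /* type information, if any           */
    };

    static struct sym_entry table[MAX_SYMS];
    static int n_syms = 0;

    /* return a pointer to the identifier's record, inserting it if new */
    struct sym_entry *identify(const char *name) {
        for (int i = 0; i < n_syms; i++)
            if (strcmp(table[i].name, name) == 0)
                return &table[i];
        strncpy(table[n_syms].name, name, sizeof table[n_syms].name - 1);
        return &table[n_syms++];
    }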
LEX, the Lexical Analyzer Generator

Lex is a tool used to generate lexical analyzers. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code in a file called lex.yy.c; this is a C program that is passed to the C compiler, which produces the object code. Here we need to know how to write the Lex language. The structure of a Lex program is given below.
Structure of LEX Program : A Lex program has the following form:

Declarations

%%
Translation rules

%%

Auxiliary functions definitions


The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions. C declarations appear between %{ . . . %}.

In the translation rules section, we place pattern-action pairs, where each pair has the form

Pattern {Action}

The auxiliary functions section includes the definitions of functions used to install identifiers and numbers in the symbol table.

LEX Program Example:


%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
"<"       {yylval = LT; return(RELOP);}
"<="      {yylval = LE; return(RELOP);}
"="       {yylval = EQ; return(RELOP);}
"<>"      {yylval = NE; return(RELOP);}
">"       {yylval = GT; return(RELOP);}
">="      {yylval = GE; return(RELOP);}

%%

int installID() {/* function to install the lexeme, whose first character is pointed
                    to by yytext and whose length is yyleng, into the symbol table,
                    and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical constants into a
                     separate table */
}

Figure 1.10: Lex program for common tokens
