SSCD Mod2
Programming languages are used to describe computations to people and to machines. They are important because all the software running on all computers was written in some programming language. Before a program can run on a computer, however, it must be translated into a form that the computer can understand and execute. Language processors perform this translation. In the first section, we discuss language processors.
Definition: A language processor is a program that accepts a source language as input and translates it into another language called the target language. Language processors allow software development to proceed without much concern for the architecture of the machine, i.e., programmers can ignore machine-dependent details during development. A language processor translates and interprets a specified programming language.
Now, let us see “What are the various types of language processors?” The various types
of language processors are shown below:
Preprocessor
Compiler
Interpreter
Assembler
Hybrid compiler
Now, let us see “What is a compiler? What are the activities performed by the compiler?”
Definition: A compiler is a translator that accepts a source program written in a high-level language such as C/C++ and converts it into an object program in a low-level language such as assembly language. The various activities performed by the compiler are shown below:
Syntax errors are identified and appropriate error messages are displayed along with line numbers
An optional list file can be generated by the compiler
Each executable statement in the high-level language is translated into one or more machine language instructions
The HLL program is converted into assembly language or an object program if the program is syntactically correct.
If the target program is executable machine code, it can be called by the user to process inputs and produce outputs, as pictured below:
[Figure: source program → Compiler → target program; error messages are reported during compilation]
Definition: The preprocessor is a program that accepts a source program as input and prepares it for compilation. A source program is normally divided into modules stored in separate files. All these files are collected by the preprocessor with the help of the #include directive. The preprocessor may also expand shorthands, called macros, into source language statements with the help of the #define directive.
Thus, the expanded version of the source program is the input to the compiler. This is pictorially represented as shown below:
[Figure: the source program below passes through the preprocessor, which includes the contents of the file "stdio.h" and expands the macro MAX; the compiler then translates the expanded program into an object program in binary form]

Source program:
    #include <stdio.h>
    #define MAX 5
    void main()
    {
        printf("v=%d", MAX);
    }

Expanded program (after preprocessing):
    ..........    /* contents of the file "stdio.h" are included here */
    void main()
    {
        printf("v=%d", 5);
    }

Object program (binary form, after compilation):
    0101 0101
    1101 1101
    ..........
Definition: An interpreter is a translator that accepts a source program written in a high-level language, converts it into machine code, and executes it. The interpreter converts one high-level language statement into machine code and then executes it immediately. The statements within a loop are translated into machine code each time control enters the loop. Whenever an error is encountered, execution of the program is halted and an error message is displayed. This can be pictorially represented as shown below:
[Figure: a high-level language source program (e.g., T.BAS) is input to the Interpreter; the object program generated is executed immediately]
Now, let us see “What are the differences between a compiler and an interpreter?” The differences between a compiler and an interpreter are shown below:
1) The compiler converts the entire source program into machine code, and the linker uses the machine code to create an executable file; the interpreter takes one statement at a time, translates it into machine code and executes that instruction.
2) The compiler does not execute the machine code it generates; the interpreter executes the machine code it generates.
3) The compiler requires more memory while translating; the interpreter requires less memory while translating.
4) The compiler is not required in order to execute the executable code; the interpreter is required whenever the program has to be executed.
5) The compiler is costlier; the interpreter is cheaper than the compiler.
6) Compiled programs run faster; interpreted code runs slower.
7) With a compiler, the object program is created only when there are no syntax or semantic errors in the program; with an interpreter, the program can be executed even if syntax errors are present in later statements, but execution stops whenever a syntax error is encountered.
Now, let us see “What is an assembler? What are the activities performed by the assembler?”
Definition: An assembler is a translator that accepts an assembly language program as input and converts it into the equivalent object program, reporting errors if any.
[Figure: assembly language program (file T.ASM on disk) → ASSEMBLER → object program (file T.OBJ on disk); errors are reported during assembly]
Now, let us see “What are the various functions of the assembler?” The various functions of the assembler are shown below:
Convert assembly language instructions to their machine language equivalent
Convert the data constants into equivalent machine representation. For example, if
decimal data 4095 is used as operand, it is converted into hexadecimal value
(machine representation) FFF.
Build the machine instruction using the appropriate instruction format.
Process the assembler directives. These directives guide the assembler while converting assembly language to machine language
Create the list file. The list file consists of source code along with address for each
instruction and corresponding object code.
Display appropriate errors on the screen so that the programmer can correct the
program and re-assemble it.
Create the object program. The object program consists of machine instructions and
instructions to the loader.
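As a small illustration of the data-constant conversion mentioned above (a sketch, not part of any particular assembler), the C program below prints the hexadecimal machine representation of the decimal constant 4095:

    #include <stdio.h>

    int main(void)
    {
        int operand = 4095;          /* decimal data constant         */
        printf("%X\n", operand);     /* prints FFF, its hexadecimal   */
        return 0;                    /* (machine) representation      */
    }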
Note: Now, let us see “What is portability with respect to software?” If software written for one machine (say, with an Intel processor) is copied to another machine (say, with a Motorola processor) and the software executes perfectly on both machines, then we say the software is portable. Otherwise, the software is not portable.
Ex1: The object program obtained from a C compiler on one machine (with an Intel processor) can be copied to another machine (with a Motorola processor). But the copied object program cannot be loaded and executed by the Motorola processor. So, we say the object program is not portable.
Ex2: Any C program may be copied from one machine (with an Intel processor) to another machine (with a Motorola processor). The C program can be compiled using the C compilers on both machines and executed on both machines. This is possible only if both machines have C compilers. So, we say that C programs are portable.
Ex3: For all Java programs, byte codes can be generated using Java compilers. Since any machine with a Java virtual machine can execute the byte codes, we say that all Java programs are portable.
Now, let us see “What are the cousins of the compiler?” or let us “Explain the language processing system”. During software development, in addition to a compiler, several other programs may be required to create an executable target program. All the programs that help in generating the required executable target program are called cousins of the compiler. Together, these cousins of the compiler represent the language processing system.
The language processing system, i.e., the various cousins of the compiler and their interaction, is shown in the block diagram below:
[Figure: the source program passes through the language processing system (preprocessor, compiler, assembler, linker/loader) to produce the target machine code]
Thus, the various programs that constitute language processing system (or cousins of the
compiler) are:
1) Preprocessor
2) Compiler
3) Assembler
4) Linker
5) Loader
The preprocessor, compiler and assembler have already been discussed; the linker and loader are discussed below.
Linker: Large programs are often compiled in pieces, producing object programs stored in separate files. The linker (or linkage editor) accepts one or more object programs generated by a compiler, links in the library files, and creates an executable file. The linker resolves external reference symbols, where code in one file refers to a location in another file.
Loader: A loader is the part of an operating system that is responsible for loading
programs into main memory for execution. This is one of the important stages in the
process of executing a program. Loading a program involves reading the contents of
executable file, loading the file containing the program text into memory, and then
carrying out other tasks to prepare the executable for running. Once loading is complete,
the operating system starts the program by passing control to the loaded program code
and execution begins.
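To make "resolving external references" concrete, here is a minimal two-file C sketch (the file names main.c and counter.c are illustrative); each file is compiled separately into an object file, and the linker patches the reference to counter when it combines them into a single executable:

    /* counter.c -- defines the global symbol 'counter' */
    int counter = 0;

    /* main.c -- uses 'counter', which is defined in the other file.
       Compiling main.c alone leaves an unresolved external reference
       in its object file; the linker resolves it when building the
       executable from both object files. */
    extern int counter;

    int main(void)
    {
        counter = counter + 1;
        return counter;
    }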
1.2 Analysis of source program
Now, let us “Discuss the analysis of the source program with respect to the compilation process”. The process of converting a high-level language program into the target language is called compilation. The entire compilation process is divided into two parts, namely analysis and synthesis.
Analysis part: This acts as the front end of the compiler and performs the various activities shown below:
Accepts the source program, breaks it into pieces, and imposes a grammatical structure on them
It then uses this structure to create an intermediate representation of the source program
It also displays appropriate error messages whenever a syntax error or semantic error is encountered during compilation. This helps the programmer correct the program.
It also gathers information about the source program and stores the variables and data in a data structure called the symbol table, along with the type of the data, line numbers, etc.
The symbol table thus obtained, along with the intermediate representation of the source program, is sent to the synthesis part.
Synthesis part: This is the back end of the compilation process and performs the
following activities:
It accepts the intermediate representation of source program along with symbol table
from the analysis part.
It generates the target program or object program as the output.
To understand the compilation process in more detail, this large system should be divided
into various modules with proper interfaces as discussed in the next section.
In this section, let us see “What are the different phases of the compiler?” or “Explain the
block diagram of the compiler”
The various phases of the compiler are listed below, followed by the block diagram:
Lexical analysis
Syntax analysis
Semantic analysis
Intermediate code generation
Code optimization
Code generation
[Figure: block diagram of the compiler — the source program passes through lexical analysis, syntax analysis and semantic analysis (the ANALYSIS part), then through intermediate code generation, code optimization and code generation, producing the target program]
Lexical analysis: This is the first phase of the compiler. It accepts the source program as input and produces tokens as output. During this phase, the lexical analyzer performs the following activities:
Removes white spaces (tabs, spaces and newline characters).
Removes comments from the program.
Interacts with the symbol table to insert variables and constants, and returns a token consisting of an integer code and a pointer to the variable or constant in the symbol table.
Detects invalid identifiers and interacts with error handler to display the error
messages.
Example 1.1: Show the various tokens generated by the lexical analyzer for the
following statement:
a=b+c
The various tokens generated by the lexical analyzer for the statement “a = b + c” are
shown below:
lexemes    tokens
a          <ID, 1>
=          ASSIGN
b          <ID, 2>
+          PLUS
c          <ID, 3>
Note: For all symbols and keywords used in the program, unique integer codes are assigned and made available to both the lexical analyzer and the syntax analyzer through a header file, as shown below:
#define ASSIGN 1
#define PLUS 2
#define MUL 3
#define LP 4
…. ….
#define ID 100
#define NUM 101
Syntax analysis: This is the second phase of the compiler. It accepts tokens from the lexical analyzer, imposes a grammatical structure on them, and checks whether the program is syntactically correct. The various activities performed by the parser are:
Detects syntax errors (such as undefined symbol, semicolon expected etc.)
Interacts with symbol table and the error handler to display appropriate error
messages
Generates parse tree (or syntax tree which is the compressed form of the parse tree)
Semantic analysis: It is the third phase of the compiler. Type checking and type
conversion is done by this phase. Other operations that are performed by this phase are:
Collects type information (such as int, float, char etc.) of all variables and constants
for subsequent code generation phase
Checks the type of each operand in an arithmetic expression and reports an error if any (for example, a subscript i used to index an array a should not be float or double)
Type conversion at the appropriate place is done in this phase. (For example,
conversion from 10 to 10.0)
Detects error whenever there is a mismatch in the type of arguments and parameters
in the function call and function header
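For instance, for the assignment x = i + 3.5, where i is declared int and x is float (the variable names are illustrative), the semantic analyzer inserts an implicit conversion; in the intermediate code this appears roughly as:

    t1 = inttofloat(i)
    t2 = t1 + 3.5
    x = t2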
Intermediate code generation: This is the fourth phase of the compiler. The syntax tree, which is the output of the semantic analyzer, is the input to this phase. Using this syntax tree, it generates intermediate code from which the target code can easily be generated. Some of the activities performed by this phase are:
High-level language statements such as while, if, switch etc. are translated into low-level conditional statements
Produces a stream of simple instructions that is fed to the next phase of the compiler. For example, the statement b = a*b + c is translated into the three-address code shown below:
b = a*b + c
t1 = a*b
t2 = t1 + c
b = t2
Note: The above intermediate code is input to code optimizer which is the next phase of
the compiler
Code optimization: This is the fifth phase of the compiler. The intermediate code
generated in the previous phase is the input. The output is another intermediate code that
does the same job as the original but saves time and space. That is, output of this phase is
the optimized code.
For example, for the statements
a = b + c*d
e = f + c*d
the optimizer produces:
t1 = c*d
a = b + t1
e = f + t1
Observe that computation of c*d is done only once. So, the code is optimized.
Code generation: This is the last phase of the compiler. The optimized code obtained
from the previous phase is the input. It converts the intermediate code into equivalent
assembly language instructions. This phase has to utilize the registers very efficiently to generate efficient code. For example, if the intermediate code generated by the previous phase is:
a = b + c
the corresponding assembly code generated is:
LOAD B
ADD C
STORE A
Symbol table: This is the module that can be accessed by all phases of the compiler. The information about all variables and constants, along with their types, is stored here.
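A minimal sketch of what a symbol-table entry and its insert/lookup routine might look like in C is shown below; the structure layout, sizes and function name are illustrative assumptions, not a fixed design:

    #include <string.h>

    #define MAX_SYMBOLS 100

    struct symbol {
        char name[32];      /* lexeme, e.g. "sum"                      */
        char type[10];      /* type information, e.g. "int" or "float" */
        int  lineNumber;    /* line where the identifier first appears */
    };

    static struct symbol table[MAX_SYMBOLS];
    static int symbolCount = 0;

    /* Return the position of name in the table, inserting it if absent. */
    int installSymbol(const char *name, const char *type, int lineNumber)
    {
        int i;
        for (i = 0; i < symbolCount; i++)
            if (strcmp(table[i].name, name) == 0)
                return i;                          /* already present */
        if (symbolCount == MAX_SYMBOLS)
            return -1;                             /* table full */
        strcpy(table[symbolCount].name, name);     /* insert a new entry */
        strcpy(table[symbolCount].type, type);
        table[symbolCount].lineNumber = lineNumber;
        return symbolCount++;
    }

The returned position is exactly the attribute value that appears in tokens such as <ID, 1> later in this chapter.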
Error handler: This routine is invoked whenever an error is encountered in any of the
phases. This routine attempts a correction so that subsequent statements need not be
discarded and many errors can be identified in a single compilation.
In this section, let us see “What is the difference between a phase and pass?”
Definition: Since compiler is a very complex program, to understand the design of the
compiler, it is logically divided into various sub-programs called modules. These
subprograms or modules are also called phases. Each phase or module accepts one
representation of the program as the input and generates another representation of the
program as the output. Thus, a phase is nothing but a module of the compiler that accepts
one representation of the program as the input and produces another representation of the
program as the output. The various phases of the compiler, along with their inputs and outputs, are pictorially shown in the block diagram given earlier.
One pass compiler: If the analysis and synthesis of the source program are done in one flow from the beginning to the end of the program, then the compiler is called a one pass compiler. In this case, all the phases of the compiler, from the lexical analysis phase to the code generation phase, are combined together as one unit that converts the source program into the object program in one single flow. This is pictorially represented as shown below:
[Figure: one pass compiler — the source file passes through lexical analysis, syntax analysis, semantic analysis and the remaining phases as one unit, producing the object program]
A single pass (one pass) compiler is beneficial for the following reasons:
Simplifies the job of writing a compiler
One pass compilers generally perform compilations faster than multi-pass
compilers
Two pass compiler: If the analysis of the source program is done in one pass and the synthesis of the source program is done in another pass, then the compiler is called a two pass compiler.
In this case, the first three phases i.e., lexical analysis phase, syntax analysis phase,
semantic analysis phase are implemented as one unit and constitute first pass of
compiler. This is often called front-end of compiler. The other three phases i.e.,
intermediate code generation phase, code optimizing phase and code generation
phase are implemented as second unit and constitute second pass of compiler. This is
often called back-end of compiler.
The output of the first pass is normally stored in a file.
The second pass accepts this file as the input and produces the object program.
[Figure: two pass compiler — Pass 1 (the analysis phase: lexical analysis, syntax analysis and semantic analysis) reads the source file and writes an intermediate file; Pass 2 (the code generator) reads the intermediate file and produces the object program]
Three pass compiler: If the analysis and synthesis of the source program are done in three different passes, then the compiler is called a three pass compiler. In this case, the first four phases, i.e., the lexical analysis phase through the intermediate code generation phase, are implemented as one unit (pass), the code optimizer as the second unit (pass), and code generation as the third unit (pass). All three passes are pictorially represented as shown below:
[Figure: three pass compiler — Pass 1 (lexical analysis, syntax analysis, semantic analysis and intermediate code generation) reads the source file and writes an intermediate code file; Pass 2 (the code optimizer) writes an optimized intermediate code file; Pass 3 (the code generator) produces the object program]
Pass 1: The front-end phases, i.e., lexical analysis, syntax analysis, semantic analysis and intermediate code generation, may be grouped into one unit called pass 1 of the compiler. Pass 1 accepts the source program and produces the intermediate code file. This file is internal and stored in main memory. In some compilers, this file is also stored on disk, and hence it is visible when we list the directory after pass 1.
Pass 2: The intermediate file created in pass 1 is the input to pass 2. It produces another intermediate code file which is optimized; each line in this file contains an optimized intermediate code instruction.
Pass 3: The optimized intermediate code file is the input to pass 3, which generates the object program.
Note: Even though single pass compiler has some advantages, it has some disadvantages
too. Some of the disadvantages are:
It is not possible to perform many of the sophisticated optimizations needed to
generate high quality code.
It is difficult to count the number of passes made by an optimizing compiler. For
example, one expression may be analyzed many times whereas another expression
may be analyzed only once.
All the above disadvantages are overcome using a multi-pass compiler. Now, let us see “What is a multi-pass compiler? Explain the need for multiple passes in a compiler.”
Definition: A multi-pass compiler is a compiler that processes the source code two or more times. Multi-pass compilers compile more slowly, but they can generate more efficient code.
On similar lines, by retaining the front end and changing the back end of the compiler, it is possible to produce compilers for different target machines.
If a feature of the language requires the compiler to make more than one pass over the source, then a multi-pass compiler must be used.
Now, let us “Show the output of each phase of the compiler for the assignment statement: sum = initial + value * 10”. The translation process is shown below (here id1, id2 and id3 denote the symbol-table entries for sum, initial and value respectively):

sum = initial + value * 10

Intermediate code generator
t1 = inttofloat(10)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code optimizer
t1 = id3 * 10.0
id1 = id2 + t1
Code generator
LDA #10.0
MUL ID3
ADD ID2
STA ID1
1.3 Compiler construction tools
In this section, let us see “What are compiler construction tools?” The compiler writer,
like any software developer, can profitably use modern software environments containing
tools such as editors, debuggers and so on. Some of the tools that help the programmer to
build the compiler very easily and efficiently are listed below:
Code generator generators: These produce a code generator, which translates the intermediate code into the target language. The target language may be an object program or an assembly language program.
Data flow analysis engines: These gather information about how values are transmitted from one part of a program to another. Data flow analysis is a key part of code optimization.
1.4 Lexical analysis
Definition: The process of reading the source program and converting it into a sequence of tokens is called lexical analysis.
Now, let us see “What is the role of lexical analyzer?” The lexical analyzer is the first
phase of the compiler. The various tasks that are performed by the lexical analyzer are:
Read a sequence of characters from the source program and produce the tokens.
The tokens thus generated are sent to the parser for syntax analysis. The parser is
also called syntax analyzer.
During this process, the lexical analyzer interacts with the symbol table to insert various identifiers and constants. Sometimes, information about an identifier is read from the symbol table to assist in determining the proper token to send to the parser.
The interaction between the lexical analyzer and the parser is pictorially represented as shown below:
[Figure: the source program enters the Scanner (lexical analyzer); the Parser (syntax analyzer) calls getNextToken() on the scanner and receives a token, then passes its result on to semantic analysis; both the scanner and the parser interact with the symbol table]
The parser calls the function getNextToken(), which is defined in the lexical analyzer; the calling sequence is sketched below:

Parser:                          Lexical analyzer:
    ....                         getNextToken()
    token = getNextToken();      {
    ....                             ....
                                     return token;
                                 }

The function getNextToken() of the lexical analyzer returns the token back to the parser for parsing.
If the token obtained is an identifier, it is entered into the symbol table along with various attribute values, and a token is returned as a pair consisting of an integer code denoted by ID and a pointer to the symbol-table entry for that identifier.
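A minimal sketch of how getNextToken() might look in C is shown below. It recognizes only identifiers and two single-character operators; the token codes ID, PLUS and ASSIGN are assumed to come from a token.h header like the ones shown in this chapter, and installID() is a hypothetical stub standing in for the symbol-table routine:

    #include <ctype.h>
    #include <stdio.h>
    #include "token.h"               /* token codes such as ID, PLUS, ASSIGN */

    FILE *src;                       /* source file, opened by the caller    */
    int attributeValue;              /* attribute of the last token returned */

    /* Stub: a real scanner would insert the lexeme into the symbol
       table and return its position. */
    static int installID(const char *lexeme)
    {
        (void)lexeme;
        return 1;
    }

    int getNextToken(void)
    {
        char lexeme[64];
        int i = 0;
        int ch = fgetc(src);

        while (ch == ' ' || ch == '\t' || ch == '\n')  /* skip white space */
            ch = fgetc(src);

        if (isalpha(ch)) {           /* identifier: letter (letter | digit)* */
            while (isalnum(ch) && i < 63) {
                lexeme[i++] = (char)ch;
                ch = fgetc(src);
            }
            lexeme[i] = '\0';
            ungetc(ch, src);         /* retract the lookahead character */
            attributeValue = installID(lexeme);
            return ID;
        }
        switch (ch) {                /* single-character tokens */
        case '+': return PLUS;
        case '=': return ASSIGN;
        case EOF: return 0;          /* no more tokens */
        }
        return -1;                   /* character not recognized here */
    }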
The other actions that are performed by the lexical analyzer are:
Removes comments from the program.
Remove white spaces such as blanks, tabs and newline characters from the
source program and then tokens are obtained.
Keep track of line numbers so as to associate line numbers with error
messages.
If any errors are encountered, the lexical analyzer displays appropriate error
messages along with line numbers
Preprocessing may be done during lexical analysis phase
Now, let us see “Why analysis portion of the compiler is separated into lexical analysis
and syntax analysis phase?” The various reasons for which lexical analysis is separated
from syntax analysis are shown below:
Simplicity of the design: The separation of lexical analysis and syntax analysis
allows us to simplify some of the tasks listed below:
A parser that deals with comments and white spaces would be more complex. So, to simplify this task and reduce the complexity of the parser, removal of comments and white spaces is done during lexical analysis
When designing a new language, the separation of first two phases results in
cleaner overall language design
In this section, before proceeding further, let us see “What is the meaning of patterns,
lexemes and tokens?” To explain the meaning of these three words, consider the
following statement:
char str[] = “hello”;
In the above statement, the patterns, lexemes and respective tokens are shown below (the token names are symbolic names defined using #define):

pattern          lexeme     token
keyword          char       CHAR
identifier       str        <ID, 1>
left bracket     [          LEFT_BRACKET
right bracket    ]          RIGHT_BRACKET
operator         =          ASSIGN
string           “hello”    LITERAL
symbol           ;          SEMI_COLON
Observe that lexemes are the input to the lexical analyzer, and the lexical analyzer produces tokens. The description of a lexeme is called a pattern. Now, let us study the meaning of tokens, patterns and lexemes in detail.
1.4.3.1 Tokens
Definition: A token is a pair consisting of token name and an optional attribute value.
The token names are basically integer codes represented using symbolic names written in
capital letters such as INT, FLOAT, SEMI_COLON etc (defined in the file token.h in
next page). The attribute values are optional and will not be present for keywords,
operators and symbols. The attribute values are present for all identifiers and constants.
For every keyword there is a unique token name. For example, INT, FLOAT, CHAR,
IF, WHILE etc
For every symbol there is a unique token name. For example, SEMI_COLON,
COLON, COMMA, LEFT_PARANTHESIS, RIGHT_PARANTHESIS, ASSIGN
etc.
For an identifier sum, the token is <ID, 1> where ID is the token name and 1 is the
position of the identifier in the symbol table
Whenever there is a request from the parser, the lexical analyzer sends the token. So,
tokens are output of the lexical analyzer and input to the syntax analyzer. The syntax
analyzer uses these tokens to check whether the program is syntactically correct or not by
deriving the tokens from the grammar. All the tokens are represented using symbolic
constants defined using #define directive as shown below:
Note: A simple header file “token.h” which is shared by both lexical analyzer and syntax
analyzer with various tokens and corresponding integer codes is shown below:
#define DOUBLE 5
#define FLOAT 6
………..
………..
………..
………..
/* TOKENS with corresponding integer codes for operators */
#define PLUS 50
#define MINUS 51
#define SUBTRACT 52
#define DIVIDE 53
………
………
/* TOKENS with corresponding integer codes for symbols */
#define LEFT_BRACKET 70
#define RIGHT_BRACKET 71
#define LEFT_PARANTHESIS 72
#define RIGHT_PARANTHESIS 73
#define SEMI_COLON 74
………
………
/* TOKENS with corresponding integer codes for identifiers and literals */
#define ID 100
#define LITERAL 101
1.4.3.2 Lexeme
Definition: A sequence of characters in the source program that matches the pattern for a token such as an identifier, a number, a relational operator, an arithmetic operator, or a symbol such as #, [, ], (, ) is called a lexeme. In other words, a lexeme is a string of characters read from the source file that corresponds to a token.
1.4.3.3 Patterns
Definition: A pattern is a description of the form that the lexemes of a token may take. For a keyword, the pattern is the keyword itself; for identifiers, the pattern is any string of letters and digits beginning with a letter.
Example 1.2: Identify the lexemes, patterns and tokens in the following statement: a = b * d;
Solution: The lexemes, patterns and tokens for the above expression are shown below:
a is a lexeme matching the pattern identifier and returning the token <ID, 1> where ID is the token name and 1 is the position of identifier a in the symbol table
= is a lexeme matching the pattern symbol '=' and returning the token ASSIGN
b is a lexeme matching the pattern identifier and returning the token <ID, 2> where ID is the token name and 2 is the position of identifier b in the symbol table
* is a lexeme matching the pattern symbol '*' and returning the token STAR
d is a lexeme matching the pattern identifier and returning the token <ID, 3> where ID is the token name and 3 is the position of identifier d in the symbol table
; is a lexeme matching the pattern semicolon and returning the token SEMI_COLON
Note: All capital letters in the above solution represent symbolic names of integers defined in the file “token.h” given in section 1.4.3.1.
Example: Identify the lexemes and tokens in the following statement: printf(“Simple Interest = %f\n”, si);
Solution: The lexemes, patterns and tokens for the given printf statement are shown below:
printf is a lexeme matching the pattern identifier and returning the token <ID, 1> where ID is the token name and 1 is the position of the identifier printf in the symbol table
The character '(' is a lexeme matching the pattern symbol and returning the token LEFT_PARANTHESIS
The sequence of characters “Simple Interest = %f\n” is a lexeme matching the pattern string and returning the token <LITERAL, 2> where LITERAL is the token name and 2 is the position of the literal in the symbol table
The character ',' is a lexeme matching the pattern symbol and returning the token COMMA
si is a lexeme matching the pattern identifier and returning the token <ID, 3> where ID is the token name and 3 is the position of the identifier si in the symbol table
The character ';' is a lexeme matching the pattern symbol and returning the token SEMI_COLON
We know that a token is a pair consisting of token name and optional attribute value.
Now, let us see “What is the need for returning attributes for tokens along with token
name?”
When more than one lexeme matches the pattern, the lexical analyzer must provide
additional information about the type of lexeme matched, to subsequent compiler phases.
For example, the pattern for an identifier matches both sum and pos.
The lexical analyzer returns the token ID for both lexemes. For the token ID, we need to associate more information, such as:
the lexeme itself (for example, sum or pos)
the type of the identifier (such as int, float, char, double etc.)
the location at which the identifier was found (say, line number 10 or 20); this information is required to display appropriate error messages along with line numbers
All the above information must be kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier. Here, we use the position of the identifier in the symbol table as the attribute value.
It is very important for the code generator to know which lexeme was found in the source
program to generate the actual code and to associate correct value to that variable. Thus,
lexical analyzer returns token name and its attribute value. The token name is used by the
parser to check for syntax errors, whereas the attribute value is used during translation
into machine code.
Example 1.3: Give the token names and attribute values for the following statement:
E = M * C ** 2
where ** in the FORTRAN language is used as the exponent operator.
Solution: The various token names and attribute values for the above statement are shown below:

<ID, 1000> <ASSIGN> <ID, 1010> <STAR> <ID, 1020> <POWER> <NUM, 2>

Symbol Table
address    lexeme    type     line no.
1000       E         float    2
1010       M         float    2
1020       C         float    2
Example 1.4: Give the token names and attribute values for the following statement:
si = p * t * r / 100
Solution: Assume the variables si, p, t and r are already declared and available in the symbol table at addresses 1000, 1010, 1020 and 1030 respectively. The token names and associated attribute values for the above statement are shown below:

<ID, 1000> <ASSIGN> <ID, 1010> <STAR> <ID, 1020> <STAR> <ID, 1030> <DIVIDE> <NUM, 100>
Note: In the above set of tokens, the second component of each pair such as 1000, 1010,
1020 and 1030 are the attribute values which contain the addresses to the symbol table.
Now, let us see “Why input buffering is required?” Input buffering is very essential for
the following reasons:
Since lexical analyzer is the first phase of the compiler, it is the only phase of the
compiler that reads the source program character-by-character and consumes
considerable time in reading the source program. Thus, the speed of lexical analysis is
a concern while designing the compiler.
Lexical analyzers may have to look ahead one or more characters beyond the next lexeme before the right lexeme can be determined.
For this reason, we use the concept of input buffering where a block of 1024 or 4096 or
more characters are read in one memory read operation and stored in the array to speed
up the process.
Definition: The method of reading a block of characters (1K or 4K or more bytes) from
the disk in one read operation and storing in memory (normally in the form of an array)
for further processing and faster accessing is called input buffering. The memory (an
array) where a block of characters read from the disk is stored is called a buffer. Now, instead of reading character-by-character from the disk, we can directly read from the buffer to identify the lexemes for the tokens. This enhances the speed of the lexical analyzer, as character-by-character reading from the disk consumes a considerable amount of time. The principle behind input buffering can be explained using:
Buffer pairs
Sentinel
Buffer pairs: In this input buffering technique, two buffers are used as shown below:
[Figure: two N-character buffers scanned by the lexemeBeginning and input pointers]
The size of each buffer is N where N is usually the size of the disk block. If size of
disk block is 4K, in one read operation 4096 characters can be read into the buffer
using one system command rather than using one system call per character which
consumes lot of time.
If less than 4096 characters are present in the disk, then a special character
represented by eof is stored in buffer indicating no-more characters are present in the
disk file to read.
Two pointers to the input are maintained as shown below:
1) A pointer lexemeBeginning is used and it is pointing to the current lexeme.
2) Another pointer called input pointer is used to scan ahead until the pattern is
found.
Now, all the characters starting from lexemeBeginning till the input pointer
(excluding it) is the current lexeme. The token corresponding to this lexeme must be
returned by the lexical analyzer.
Once the token is found, the pointer lexemeBeginning is moved to the input pointer, ready to identify the next lexeme.
As the input pointer moves forward, if end of buffer is encountered, read one more
block whose size is N from the disk and load into other buffer.
Now, let us see “What is the disadvantage of input buffering with buffer pairs?” or “Explain the use of sentinels in recognizing the tokens”. In the buffer-pairs technique, each time we move the input pointer to the right, we must check that we have not moved off the end of the buffer; if we have, the other buffer must be loaded. Thus, for each character read we have to make two tests: the first test determines whether we are at the end of the buffer, and the second test determines what character was read.
It is possible to combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
Sentinels: A sentinel is a special character that cannot be the part of the source program.
So, the natural choice is to use a character eof which acts as sentinel as well as end of the
input program. The input buffering technique with sentinels also uses two buffers with
sentinel added at the end, as shown below:
[Figure: two N-character buffers, each terminated by an eof sentinel, scanned by the lexemeBeginning and input pointers]
The size of each buffer is N where N is usually the size of the disk block. If size of
disk block is 4K, in one read operation 4096 characters can be read into the buffer
using one system command rather than using one system call per character which
consumes lot of time.
Irrespective of number of characters stored in the buffer, last character of each buffer
is eof.
Note that eof retains its use as a marker for the end of the entire input. Any eof that
appears other than at the end of a buffer means that the input is at an end.
The use of the two pointers lexemeBeginning and input pointer, and the method of accessing a lexeme, remain the same as in buffer pairs.
The lookahead code with sentinels is sketched below:
switch (*inputPointer++)
{
case eof:
    if (inputPointer is at end of first buffer) {
        reload second buffer;
        inputPointer = beginning of second buffer;
    }
    /* similarly for the end of the second buffer;
       otherwise the eof marks the real end of the input */
}
Observe from the above code that, instead of the two tests needed in the buffer-pair technique, there is only one test per character, i.e., the test for the eof marker. A complete sketch in C follows.
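The pseudocode above can be realized in C. The sketch below is a minimal, self-contained version under the assumption that the NUL character '\0' never occurs in the source text and can therefore serve as the sentinel; the buffer size N, the file name and the helper names are illustrative:

    #include <stdio.h>

    #define N 4096                        /* size of each buffer half */

    static char buf[2 * N + 2];           /* two halves, each followed by a sentinel slot */
    static char *inputPointer;
    static FILE *src;

    /* Read up to N characters into the given half (0 or 1) and place the sentinel. */
    static void fillBuffer(int half)
    {
        size_t n = fread(buf + half * (N + 1), 1, N, src);
        buf[half * (N + 1) + n] = '\0';   /* sentinel right after the data */
    }

    /* Return the next source character, or EOF at the real end of input. */
    static int nextChar(void)
    {
        for (;;) {
            char c = *inputPointer++;
            if (c != '\0')
                return c;                            /* ordinary character */
            if (inputPointer == buf + N + 1)         /* end of first half  */
                fillBuffer(1);                       /* pointer is already at second half */
            else if (inputPointer == buf + 2 * N + 2) {
                fillBuffer(0);                       /* end of second half: reload first  */
                inputPointer = buf;
            }
            else
                return EOF;                          /* sentinel inside a half: real end  */
        }
    }

    int main(void)
    {
        int c, count = 0;
        src = fopen("source.c", "r");     /* hypothetical input file */
        if (src == NULL) return 1;
        fillBuffer(0);
        inputPointer = buf;
        while ((c = nextChar()) != EOF)
            count++;
        printf("%d characters read\n", count);
        fclose(src);
        return 0;
    }

Note that the only per-character test on the fast path is the comparison with the sentinel; the buffer-boundary tests run only when a sentinel is actually seen.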
The regular expressions are the notations for specifying the pattern. Even though they
cannot express all possible patterns, they are very effective in specifying those types of
patterns that are necessary for tokens. In the next section, let us see how to build lexical
analyzers by converting regular expressions to finite automata. The definition of some of
the terms are discussed here:
Definition: An alphabet is any finite set of symbols. An alphabet is denoted by the symbol ∑; typical symbols are letters, digits and punctuation characters.
Definition: A string is a finite sequence of symbols drawn from the alphabet ∑. For example, let ∑ = {a, b}. Some of the strings obtained from ∑ are: a, aa, ab, ba, bb, ... and so on.
Definition: A language is a set of strings over the alphabet ∑. For example, the language consisting of one or more a's is L = {a, aa, aaa, ...}.
The following table shows some examples of regular expressions and the languages corresponding to them.

Regular expression    Meaning
a*                    Strings consisting of any number of a's (zero or more a's)
a+                    Strings consisting of at least one a (one or more a's)
(a+b)                 Either one a or one b
(a+b)*                Strings of a's and b's of any length, including the NULL string
(a+b)*abb             Strings of a's and b's ending with the string abb
ab(a+b)*              Strings of a's and b's starting with the string ab
(a+b)*aa(a+b)*        Strings of a's and b's containing the substring aa
a*b*c*                Any number of a's (possibly none), followed by any number of b's (possibly none), followed by any number of c's (possibly none)
a+b+c+                At least one a, followed by at least one b, followed by at least one c
aa*bb*cc*             At least one a, followed by at least one b, followed by at least one c
(a+b)*(a+bb)          Strings of a's and b's ending with either a or bb
(aa)*(bb)*b           An even number of a's followed by an odd number of b's
(0+1)*000             Strings of 0's and 1's ending with three consecutive zeros (ending with 000)
(11)*                 Strings consisting of an even number of 1's
01* + 1               The string 1, or a zero followed by any number of 1's (possibly none)
(01)* + 1             The string 1, or strings of (01)'s repeated zero or more times
0(1* + 1)             A zero followed by any number of 1's
(1+ε)(00*1)*0*        Strings of 0's and 1's without any consecutive 1's
(0+10)*1*             Strings of 0's and 1's ending with any number of 1's (possibly none)
(a+b)(a+b)            Strings of a's and b's of length exactly 2
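As an aside, regular expressions of this kind can be experimented with in C using the POSIX regex library (which uses a slightly different notation: | instead of +, plus character classes). The sketch below tests a pattern for identifiers, letter (letter | digit)*, against a few candidate lexemes; the pattern and the candidate strings are illustrative:

    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t re;
        const char *candidates[] = { "sum", "pos1", "9lives" };
        int i;

        /* A letter followed by any number of letters or digits,
           anchored so the whole string must match. */
        regcomp(&re, "^[a-zA-Z][a-zA-Z0-9]*$", REG_EXTENDED | REG_NOSUB);

        for (i = 0; i < 3; i++)
            printf("%-8s : %s\n", candidates[i],
                   regexec(&re, candidates[i], 0, NULL, 0) == 0
                       ? "matches the identifier pattern"
                       : "does not match");

        regfree(&re);
        return 0;
    }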
A finite automaton represented as a transition diagram can easily be turned into code from which the tokens are obtained. Using such transition diagrams, we can construct the lexical analyzer directly. Now, let us see how to write lexical analyzers by looking at transition diagrams. The two functions that are used while designing the lexical analyzer are:
retract(): This function is invoked only when we want to unread the last character read. In a transition diagram this is indicated by an edge labeled other leading to a final state marked with *. The function unreads the last character read.
install_ID(): Once the identifier is identified, the function install_ID() is called. This
function checks whether the identifier is already there in the symbol table. If it is not
there, it is entered into the symbol table and returns the pointer to that entry. If it is
already there, it returns the pointer to that entry.
The transition diagram for identifiers is:
0 --letter--> 1;  1 --letter or digit--> 1;  1 --other--> 2* (retract)
State 1: If the input symbol is a letter or a digit, remain in state 1; on any character other than a letter or digit, go to state 2.
State 2: Since a character other than a letter or digit has been read before reaching state 2, push that character back using the function retract(), and return the token ID along with a pointer to the symbol-table entry for the recognized identifier by calling the function install_ID().
So, the complete lexical analyzer to identify the identifier using the above procedure is
shown below:
state = 0;
for (;;)
{
    switch (state)
    {
    case 0:
        ch = getchar();
        if (isalpha(ch)) state = 1;
        else state = 3;        /* not a letter: try other tokens (Example 1.6) */
        break;
    case 1:
        ch = getchar();
        if (isalpha(ch) || isdigit(ch)) state = 1;
        else state = 2;
        break;
    case 2:
        retract();
        return (ID, install_ID());
    }
}
Example 1.6: Design a lexical analyzer to identify the relational operators such as <, <=, >, >=, == and !=
Note: Observe from the previous problem that states 0, 1 and 2 are used to recognize the token corresponding to an identifier. From state 3 onwards, other tokens are identified. Now, we start from state 3 to recognize other tokens, such as the relational operators, as shown in the following transition diagram.
The transition diagram for the relational operators is:
3 --'<'--> 4;   4 --'='--> 5 (return LE);   4 --other--> 6* (retract, return LT)
3 --'>'--> 7;   7 --'='--> 8 (return GE);   7 --other--> 9* (retract, return GT)
3 --'='--> 10;  10 --'='--> 11 (return EQ)
3 --'!'--> 12;  12 --'='--> 13 (return NE)

In the above transition diagram, observe that there are no transitions defined from states 3, 10 and 12 for symbols other than those specified as labels. So, from states 3, 10 and 12, on any other symbol we enter state 14, which is used to identify some other token. The lexical analyzer for identifying relational operators is shown below:
state = 0;
for (;;)
{
    switch (state)
    {
    /* Cases 0-2: identify the identifier */
    case 3:
        ch = getchar();
        if (ch == '<') state = 4;
        else if (ch == '>') state = 7;
        else if (ch == '=') state = 10;
        else if (ch == '!') state = 12;
        else state = 14;
        break;
    case 4:
        ch = getchar();
        if (ch == '=') state = 5;
        else state = 6;
        break;
    case 5: return LE;
    case 6: retract();
            return LT;
    case 7:
        ch = getchar();
        if (ch == '=') state = 8;
        else state = 9;
        break;
    case 8: return GE;
    case 9: retract();
            return GT;
    case 10:
        ch = getchar();
        if (ch == '=') state = 11;
        else state = 14;
        break;
    case 11: return EQ;
    case 12:
        ch = getchar();
        if (ch == '=') state = 13;
        else state = 14;
        break;
    case 13: return NE;
    }
}
Example 1.7: Design a lexical analyzer to identify the keywords begin, else, end. Sketch the program segment to implement it, showing the first two states and one final state.
Note: Observe from the previous problem that states 0 to 13 are used to identify identifiers and relational operators. Let us identify the keywords begin, else and end using the following transition diagram, starting from state 14:
14 --'b'--> 15 --'e'--> 16 --'g'--> 17 --'i'--> 18 --'n'--> 19;  19 --other--> 20* (retract, return BEGIN)
14 --'e'--> 21
21 --'l'--> 22 --'s'--> 23 --'e'--> 24;  24 --other--> 25* (retract, return ELSE)
21 --'n'--> 26 --'d'--> 27;  27 --other--> 28* (retract, return END)
Similar to the previous two problems, we can write various cases to identify the tokens
BEGIN, ELSE and END. The lexical analyzer is implemented by writing the code for
first two states and one final state as shown below:
state = 0;
for (;;)
{
    switch (state)
    {
    /* Cases 0-13: identify other tokens */
    case 14:
        ch = getchar();
        if (ch == 'b')
            state = 15;
        else if (ch == 'e')
            state = 21;
        else
            state = 29;    /* some other token */
        break;
    case 15:
        ch = getchar();
        if (ch == 'e')
            state = 16;
        else
            state = 29;    /* other token */
        break;
    .......
    .......
    case 28: retract();
             return END;
    }
}
Example 1.8: Design a lexical analyzer to recognize an unsigned number. Sketch the program segment to implement it, showing the first two states and one final state.
The transition diagram for unsigned numbers is:
29 --digit--> 30;  29 --'.'--> 32
30 --digit--> 30;  30 --'.'--> 33;  30 --other--> 31* (retract, return NUM)
32 --digit--> 33
33 --digit--> 33;  33 --'E' or 'e'--> 35;  33 --other--> 34* (retract, return NUM)
35 --'+' or '-'--> 36;  35 --digit--> 37
36 --digit--> 37
37 --digit--> 37;  37 --other--> 38* (retract, return NUM)
Similar to the previous two problems, we can write the various cases to identify unsigned integers and floating point numbers represented using a decimal point and scientific notation. The lexical analyzer is implemented by writing the code for the first two states and one final state, as shown below:
state = 0;
for (;;)
{
    switch (state)
    {
    /* Cases 0-28: identify other tokens */
    case 29:
        ch = getchar();
        if (isdigit(ch))
            state = 30;
        else if (ch == '.')
            state = 32;
        else
            state = 39;    /* other token */
        break;
    case 30:
        ch = getchar();
        if (isdigit(ch))
            state = 30;
        else if (ch == '.')
            state = 33;
        else
            state = 31;
        break;
    .......
    .......
    case 31: retract();
             return (NUM, InstallNUM());
    }
}
Note: Here, InstallNUM() is a function which checks whether the number is already present in the numeric table. If it is not present, it is entered into the numeric table, and the function returns the token NUM along with a pointer to that number as the attribute value. Otherwise, it returns the token NUM along with a pointer to the existing entry as the attribute value.
Exercises
1) What is a language processor? What are the various types of language processors?
2) What is a compiler? What are the activities performed by the compiler?
3) What is an interpreter? What are the differences between a compiler and an interpreter?
4) What is an assembler? What are the activities performed by the assembler?
5) What is a hybrid compiler?
6) What are the cousins of the compiler? or Explain the language processing system
7) Discuss the analysis of the source program with respect to the compilation process
8) What are the different phases of the compiler? or Explain the block diagram of the compiler
9) Show the output of each phase of the compiler for the assignment statement: sum = initial + value * 10
10) What is the difference between a phase and a pass? What is a multi-pass compiler? Explain the need for multiple passes in a compiler.
13) Why is the analysis portion of the compiler separated into lexical analysis and syntax analysis phases?
16) Identify the lexemes and tokens in the following statement: printf(“Simple Interest = %f\n”, si);
17) What is the need for returning attributes for tokens along with the token name?
18) Give the token names and attribute values for the following statement: E = M * C ** 2, where ** in the FORTRAN language is used as the exponent operator.
19) Give the token names and attribute values for the following statement: si = p * t * r / 100
21) What is the disadvantage of input buffering with buffer pairs? Explain the use of sentinels in recognizing the tokens.
23) Design a lexical analyzer to identify the relational operators such as <, <=, >, >=, == and !=
24) Design a lexical analyzer to identify the keywords begin, else, end. Sketch the program segment to implement it, showing the first two states and one final state.
25) Design a lexical analyzer to recognize an unsigned number. Sketch the program segment to implement it, showing the first two states and one final state.