Compiler UNIT I
We can give names to regular expressions, and we can use these names as
symbols to define other regular expressions.
Specification of Tokens (…)
Examples of Regular Definitions
INPUT BUFFERING
Need for Buffering:
● The main task of the lexical analyzer is to read the input characters of the source program,
group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in
the source program.
● When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table.
● To ensure that the right lexeme is found, one or more characters may have to be looked
ahead beyond the next lexeme.
● Hence a two-buffer scheme is introduced to handle large lookaheads safely.
● The lexical analyzer not only identifies the lexemes but also pre-processes the source text,
e.g. removing comments and white space.
Lexical analyzers are divided into a cascade of two processes:
● Scanning: consists of the simple processes that do not require tokenization of the input,
such as deletion of comments and compaction of consecutive white-space characters into one.
● Lexical analysis: this is the more complex portion, where the scanner produces a sequence of
tokens as output.
Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the
buffer end, have been adopted.
There are three general approaches for the implementation of a lexical analyzer:
1. By using a lexical-analyzer generator: In this, the generator provides routines for reading
and buffering the input.
2. By writing the lexical analyzer in a conventional systems-programming language, using I/O
facilities of that language to read the input.
3. By writing the lexical analyzer in assembly language and explicitly managing the reading of
input.
Buffer Pairs
Specialized buffering techniques are used to reduce the overhead required to process an
input character when moving characters in the buffer.
● The scheme consists of two buffers, each of N characters, which are reloaded alternately.
● Two pointers are maintained: lexemeBegin and forward.
● lexemeBegin marks the beginning of the current lexeme, whose extent is yet to be
determined. forward scans ahead until a match for a pattern is found.
● Once a lexeme is found, forward is set to the character at its right end, and lexemeBegin is
then set to the character immediately after the lexeme just found.
● The current lexeme is the set of characters between the two pointers.
Sentinels
● A sentinel is used to combine the buffer-end check with the character check. Each time the
forward pointer is moved, a check must be made that we have not moved off one half of the
buffer; if we have, the other half must be reloaded. The sentinel is a special character that
cannot be part of the source program (the eof character is used as the sentinel).
● Without sentinels, the ends of the buffer halves require two tests for each advance of the
forward pointer:
Test 1: For the end of the buffer.
Test 2: To determine what character is read.
● The use of a sentinel reduces the two tests to one by extending each buffer half to hold a
sentinel character at the end.
Pseudocode for input buffering (...)
Pseudocode for input buffering (reduced no. of tests) (...)
Disadvantages:
● This scheme works well most of the time, but the amount of lookahead is limited.
● This limited lookahead may make it impossible to recognize tokens in situations where the
distance that the forward pointer must travel is more than the length of the bu er.
Advantages
● Most of the time, it performs only one test to see whether the forward pointer points to an eof.
● Only when it reaches the end of the bu er half or eof, it performs more tests.
● Since N input characters are encountered between eofs, the average number of tests per
input character is very close to 1.
FINITE STATE AUTOMATA
• A finite automaton, or finite state machine, is an abstract machine which has five
elements (a 5-tuple).
• It has a set of states and rules for moving from one state to another, depending on
the applied input symbol.
• It is an abstract model of a digital computer.
• The formal specification of the machine is { Q, Σ, q, F, δ }.
Q : Finite set of states.
Σ : set of Input Symbols.
q : Initial state.
F : set of Final States.
δ : Transition Function.
• Two types: Deterministic Finite Automata (DFA) and Nondeterministic Finite
Automata (NFA).
LEX: A tool for lexical analysis
● Lex is a tool used in the lexical-analysis phase to recognize tokens using regular expressions.
● Lex is a program that generates a lexical analyzer. It is used with the YACC parser generator.
● The lexical analyzer is a program that transforms an input stream into a sequence of
tokens.
● Lex reads the pattern specification and produces as output C source code implementing
the lexical analyzer.
The function of Lex is as follows:
● First, the programmer writes a Lex program lex.l in the Lex language. The Lex compiler then
runs on lex.l and produces a C program lex.yy.c.
● Next, the C compiler compiles lex.yy.c and produces an object program a.out.
● a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
LEX: File format
Definitions include declarations of constants, variables and regular definitions.
Rules define statements of the form p1 {action1} p2 {action2} .... pn {actionn},
where pi describes a regular expression and actioni describes what action the
lexical analyzer should take when pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions. The subroutines
can be compiled separately and loaded with the lexical analyzer.
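A minimal Lex specification illustrating the three sections might look like this (an illustrative sketch; the patterns and actions are not from the slides):

```lex
%{
/* Definitions section: C declarations and regular definitions */
#include <stdio.h>
int word_count = 0;            /* counts identifiers seen */
%}
digit   [0-9]
letter  [a-zA-Z]

%%
 /* Rules section: pattern {action} pairs */
{letter}({letter}|{digit})*   { word_count++; printf("ID: %s\n", yytext); }
{digit}+                      { printf("NUM: %s\n", yytext); }
[ \t\n]                       { /* skip white space */ }
.                             { printf("OTHER: %s\n", yytext); }
%%

/* User subroutines section */
int yywrap(void) { return 1; }   /* no more input files */
int main(void)   { yylex(); return 0; }
```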
Files created by LEX
● lex.l is an input file written in a language that describes the generation of a lexical
analyzer. The Lex compiler transforms lex.l into a C program known as lex.yy.c.
● lex.yy.c is compiled by the C compiler into a file called a.out.
● The output of the C compiler is the working lexical analyzer, which takes a stream of input
characters and produces a stream of tokens.
LEX Variable
● yyin is a variable of type FILE* and points to the input file. yyin is defined by LEX
automatically. If the programmer assigns an input file to yyin in the auxiliary functions
section, then yyin is set to point to that file; otherwise LEX assigns yyin to stdin
(console input).
LEX Variable
● yytext is of type char* and it contains the lexeme currently found. A lexeme is a
sequence of characters in the input stream that matches some pattern in the Rules
section. Each invocation of the function yylex() results in yytext carrying a pointer to
the lexeme found in the input stream by yylex(). The value of yytext will be
overwritten after the next yylex() invocation.
LEX Variable
● yyleng is a variable of type int and it stores the length of the lexeme pointed to
by yytext.
LEX Functions
● yylex() is a function of return type int. LEX automatically defines yylex() in
lex.yy.c but does not call it.
● The programmer must call yylex() in the Auxiliary functions section of the LEX
program.
● LEX generates code for the definition of yylex() according to the rules specified
in the Rules section.
● When yylex() is invoked, it reads the input pointed to by yyin and scans through
it looking for a matching pattern.
● When the input, or a part of it, matches one of the given patterns, yylex()
executes the corresponding action associated with that pattern as specified in the
Rules section.
LEX Functions
● LEX declares the function yywrap() of return type int in the file lex.yy.c.
● LEX does not provide any definition for yywrap(). yylex() makes a call to
yywrap() when it encounters the end of input.
● If yywrap() returns zero (indicating false), yylex() assumes there is more input
and continues scanning from the location pointed to by yyin.
● If yywrap() returns a non-zero value (indicating true), yylex() terminates the
scanning process and returns 0 (i.e. "wraps up").
● If the programmer wishes to scan more than one input file with the generated
lexical analyzer, this can be done simply by setting yyin to a new input file in
yywrap() and returning 0.
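The multi-file technique in the last bullet can be sketched as a fragment for the user subroutines section of a Lex program (the file names are hypothetical; this relies on the yyin variable described earlier and is not runnable on its own):

```c
/* Fragment for the user subroutines section of a Lex file.
 * After the first file is exhausted, switch yyin to the next
 * (hypothetical) file and return 0 so yylex() keeps scanning. */
static int current = 0;
static const char *files[] = { "first.txt", "second.txt" };

int yywrap(void) {
    if (++current < 2) {
        yyin = fopen(files[current], "r");
        if (yyin != NULL)
            return 0;      /* more input: continue scanning */
    }
    return 1;              /* no more input: wrap up */
}
```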
Disambiguation
● yylex() uses two important disambiguation rules in selecting the right action to
execute when more than one pattern matches a string in the given input:
1. The longest possible match is preferred: yylex() chooses the pattern that matches the
longest prefix of the remaining input.
2. If two patterns match the same longest lexeme, the pattern listed earlier in the Rules
section is chosen.