18CSC304J – Compiler Design (UNIT I)



Preliminaries Required
• Basic knowledge of programming languages.
• Basic knowledge of FSA and CFG.
• Knowledge of a high-level programming language for the
programming assignments.
Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”,
Addison-Wesley, 1986.



Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
  – Context-Free Grammars
  – Top-Down Parsing, LL Parsing
  – Bottom-Up Parsing, LR Parsing
• Syntax-Directed Translation
  – Attribute Definitions
  – Evaluation of Attribute Definitions
• Semantic Analysis, Type Checking
• Run-Time Organization
• Intermediate Code Generation
COMPILERS
• A compiler is a program that takes a program written in a source
language and translates it into an equivalent program in a target
language.

   source program  --->  COMPILER  --->  target program
   (normally a program written in a     (normally the equivalent program
   high-level programming language)     in machine code – a relocatable
                     |                  object file)
                     v
              error messages



Translators
(diagram)

Compiler vs Interpreter
(diagram)

Compilers vs Assemblers
(diagram)


Other Applications
• In addition to the development of a compiler, the techniques
used in compiler design can be applied to many problems in
computer science.
– Techniques used in a lexical analyzer can be used in text editors, information
retrieval systems, and pattern recognition programs.
– Techniques used in a parser can be used in a query processing system such as
SQL.
– Much software with a complex front end may need techniques used in
compiler design.
• For example, a symbolic equation solver which takes an equation as input must
parse the given input equation.
– Most of the techniques used in compiler design can be used in Natural
Language Processing (NLP) systems.


Analysis-Synthesis model of compilation
• There are two major parts of a compiler: analysis and synthesis.
• In the analysis phase, an intermediate representation is created from
the given source program.
– The Lexical Analyzer, Syntax Analyzer and Semantic Analyzer are the parts of this
phase.
• In the synthesis phase, the equivalent target program is created
from this intermediate representation.
– The Intermediate Code Generator, Code Generator, and Code Optimizer are the parts
of this phase.


Analysis of source program
Analysis is done in 3 phases:
1. Linear Analysis (Lexical):
   The stream of characters is read from left to right and grouped
   into tokens.
2. Hierarchical Analysis (Syntax):
   Tokens are grouped hierarchically into nested collections.
3. Semantic Analysis:
   Checks the inherent meaning of the code.


Language Processing System
(diagram)


1) Preprocessor:
converts HLL (high-level language) source into pure high-level
language. It includes all the header files and also expands any
macros. It is optional: for a language that does not support #include
directives or macros, a preprocessor is not required.
2) Compiler: takes pure high-level language as input and converts it
into assembly code.
3) Assembler: takes assembly code as input and converts it into
relocatable machine code (object code).


4) Linking and loading:
It has four functions:
1. Allocation: gets memory portions from the operating system and
stores the object data.
2. Relocation: maps relative addresses to physical addresses and
relocates the object code.
3. Linking: combines all the executable object modules into one
single executable file.
4. Loading: loads the executable file into memory for execution.


Phases of A Compiler
Grouping of Phases:
• Analysis Phase:
  Lexical, Syntax, Semantic
• Synthesis Phase:
  Intermediate Code Generation, Code Optimizer, Code Generator


Lexical Analyzer
• The Lexical Analyzer reads the source program character by
character and returns the tokens of the source program.
• A token describes a pattern of characters having the same meaning
in the source program (such as identifiers, operators, keywords,
numbers, delimiters and so on).
Ex: newval := oldval + 12   =>   tokens:
    newval   identifier
    :=       assignment operator
    oldval   identifier
    +        add operator
    12       a number
• Puts information about identifiers into the symbol table.
• Regular expressions are used to describe tokens (lexical
constructs).
• A (Deterministic) Finite State Automaton can be used in the
implementation of a lexical analyzer.
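To make the token example concrete, here is a minimal hand-written scanner sketch in C for the statement above; the token names, the nextchar handling, and the overall structure are illustrative assumptions, not the course's code.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token kinds for the "newval := oldval + 12" example. */
enum token { T_ID, T_ASSIGN, T_PLUS, T_NUM, T_EOF };

static const char *src;          /* input cursor */
static char lexeme[64];          /* text of the current token */

static enum token next_token(void) {
    while (*src == ' ' || *src == '\t') src++;   /* skip blanks */
    if (*src == '\0') return T_EOF;
    if (isalpha((unsigned char)*src)) {          /* identifier */
        int n = 0;
        while (isalnum((unsigned char)*src)) lexeme[n++] = *src++;
        lexeme[n] = '\0';
        return T_ID;
    }
    if (isdigit((unsigned char)*src)) {          /* number */
        int n = 0;
        while (isdigit((unsigned char)*src)) lexeme[n++] = *src++;
        lexeme[n] = '\0';
        return T_NUM;
    }
    if (src[0] == ':' && src[1] == '=') { src += 2; strcpy(lexeme, ":="); return T_ASSIGN; }
    if (*src == '+') { src++; strcpy(lexeme, "+"); return T_PLUS; }
    src++;                                       /* skip an unknown character */
    return next_token();
}

int main(void) {
    src = "newval := oldval + 12";
    for (enum token t; (t = next_token()) != T_EOF; )
        printf("%d\t%s\n", t, lexeme);           /* token kind and lexeme */
    return 0;
}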
Syntax Analyzer
• A Syntax Analyzer creates the syntactic structure (generally a
parse tree) of the given program.
• A syntax analyzer is also called a parser.
• A parse tree describes a syntactic structure.

Ex: parse tree for newval := oldval + 12

                     assgstmt
                 /      |      \
        identifier     :=     expression
            |                /     |     \
         newval      expression    +    expression
                         |                  |
                    identifier            number
                         |                  |
                      oldval               12

• In a parse tree, all terminals are at leaves.
• All inner nodes are non-terminals in a context-free grammar.
Syntax Analyzer (CFG)
• The syntax of a language is specified by a context-free grammar
(CFG).
• The rules in a CFG are mostly recursive.
• A syntax analyzer checks whether a given program satisfies the
rules implied by a CFG or not.
– If it satisfies them, the syntax analyzer creates a parse tree for the given program.
• Ex: We use BNF (Backus-Naur Form) to specify a CFG:
assgstmt   -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression
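As a sketch of how a parser might check this grammar, the following C program hand-parses the token stream for newval := oldval + 12 by recursive descent; the left-recursive expression rule is rewritten with iteration, and all names are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

/* Token stream for "newval := oldval + 12", written out by hand here. */
typedef enum { T_ID, T_ASSIGN, T_PLUS, T_NUM, T_EOF } Tok;
static Tok toks[] = { T_ID, T_ASSIGN, T_ID, T_PLUS, T_NUM, T_EOF };
static int pos = 0;

static Tok peek(void) { return toks[pos]; }
static void expect(Tok t) {
    if (toks[pos] != t) { fprintf(stderr, "syntax error\n"); exit(1); }
    pos++;
}

/* expression -> (identifier | number) { + (identifier | number) }
   (the left-recursive rule rewritten with iteration)              */
static void expression(void) {
    if (peek() == T_ID) expect(T_ID); else expect(T_NUM);
    while (peek() == T_PLUS) {
        expect(T_PLUS);
        if (peek() == T_ID) expect(T_ID); else expect(T_NUM);
    }
}

/* assgstmt -> identifier := expression */
static void assgstmt(void) {
    expect(T_ID);
    expect(T_ASSIGN);
    expression();
}

int main(void) {
    assgstmt();
    expect(T_EOF);
    puts("parse OK");
    return 0;
}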
Syntax Analyzer versus Lexical Analyzer
• Which constructs of a program should be recognized by the
lexical analyzer, and which ones by the syntax analyzer?
– Both of them do similar things, but the lexical analyzer deals with simple,
non-recursive constructs of the language.
– The syntax analyzer deals with recursive constructs of the language.
– The lexical analyzer simplifies the job of the syntax analyzer.
– The lexical analyzer recognizes the smallest meaningful units (tokens) in a source
program.
– The syntax analyzer works on the smallest meaningful units (tokens) in a source
program to recognize meaningful structures in our programming language.


Parsing Techniques
• Depending on how the parse tree is created, there are different
parsing techniques.
• These parsing techniques are categorized into two groups:
– Top-Down Parsing,
– Bottom-Up Parsing
• Top-Down Parsing:
– Construction of the parse tree starts at the root, and proceeds towards the leaves.
– Efficient top-down parsers can be easily constructed by hand.
– Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
• Bottom-Up Parsing:
– Construction of the parse tree starts at the leaves, and proceeds towards the root.
– Normally, efficient bottom-up parsers are created with the help of some software
tools.
– Bottom-up parsing is also known as shift-reduce parsing.
– Operator-Precedence Parsing – simple, restrictive, easy to implement.
– LR Parsing – a much more general form of shift-reduce parsing: LR, SLR, LALR.
Semantic Analyzer
• A semantic analyzer checks the source program for semantic
errors and collects type information for the code generation.
• Type-checking is an important part of the semantic analyzer.
• Normally, semantic information cannot be represented by the
context-free languages used in syntax analyzers.
• Context-free grammars used in the syntax analysis are
integrated with attributes (semantic rules):
– the result is a syntax-directed translation,
– attribute grammars.
• Ex:
newval := oldval + 12
• The type of the identifier newval must match the type of the expression (oldval + 12).
Intermediate Code Generation
• A compiler may produce explicit intermediate code
representing the source program.
• This intermediate code is generally machine- (architecture-)
independent, but its level is close to the level of machine code.
• Ex:
newval := oldval * fact + 1
id1 := id2 * id3 + 1
MULT id2,id3,temp1        Intermediate Codes (Quadruples)
ADD  temp1,#1,temp2
MOV  temp2,,id1
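A simple in-memory representation for such quadruples might look like this in C (a sketch; the struct and field names are assumptions):

#include <stdio.h>

/* One quadruple: operator, two arguments, result.                 */
/* Example: MULT id2,id3,temp1  ->  {"MULT","id2","id3","temp1"}    */
struct quad {
    const char *op, *arg1, *arg2, *result;
};

int main(void) {
    struct quad code[] = {
        { "MULT", "id2",   "id3", "temp1" },
        { "ADD",  "temp1", "#1",  "temp2" },
        { "MOV",  "temp2", "",    "id1"   },   /* empty arg2 field */
    };
    for (int i = 0; i < 3; i++)
        printf("%-4s %-6s %-4s %s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}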
Code Optimizer (for Intermediate Code Generator)
• The code optimizer optimizes the code produced by the
intermediate code generator in terms of time and space.
• Ex: the three instructions above can be reduced to:
MULT id2,id3,temp1
ADD  temp1,#1,id1


Code Generator
• Produces the target language for a specific architecture.
• The target program is normally a relocatable object file
containing the machine codes.
• Ex:
(assume an architecture where at least one operand of each
instruction is a machine register)
MOVE id2,R1
MULT id3,R1
ADD  #1,R1
MOVE R1,id1
Symbol table
• It is a data structure created and maintained by compilers in
order to store information about the occurrence of various
entities such as variable names, function names, objects,
classes, interfaces, etc.
• The symbol table is used by both the analysis and the synthesis
parts of a compiler.
Uses:
• To store the names of all entities in a structured form in one
place.
• To verify that a variable has been declared.
• To implement type checking, by verifying that assignments and
expressions in the source code are semantically correct.
• To determine the scope of a name (scope resolution).
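A minimal sketch of such a table in C, using a fixed array with linear search (real compilers typically use hash tables and scope-aware structures; all names here are assumptions):

#include <stdio.h>
#include <string.h>

struct symbol {
    char name[32];
    char type[16];   /* e.g. "int", "float" */
};

static struct symbol table[256];
static int nsyms = 0;

/* Returns the symbol's index, inserting it on first occurrence. */
static int lookup_or_insert(const char *name, const char *type) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;                      /* already declared */
    strcpy(table[nsyms].name, name);
    strcpy(table[nsyms].type, type);
    return nsyms++;
}

int main(void) {
    lookup_or_insert("newval", "int");
    lookup_or_insert("oldval", "int");
    lookup_or_insert("newval", "int");     /* found, not re-inserted */
    printf("%d symbols\n", nsyms);         /* prints: 2 symbols */
    return 0;
}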
Error handler
• The tasks of the error-handling process are to detect each error,
report it to the user, and then devise and implement a recovery
strategy to handle the error.
• An error may appear, for example, as a blank entry in the symbol table.
Types of errors –
Run-time errors:
✔ errors which take place during the execution of a program, and
usually happen because of adverse system parameters or
invalid input data.
✔ Logic errors occur when executed code does not produce the
expected result. Logic errors are best handled by meticulous
program debugging.
Compile-time errors: classified on the next slide.
Error handler (…)
Classification of compile-time errors –
• Lexical: misspellings of identifiers, keywords or operators.
• Syntax: a missing semicolon or unbalanced parentheses.
• Semantic: incompatible value assignment or type mismatches
between operator and operand.
• Logical: unreachable code, infinite loop.


Error Recovery
Four common error-recovery strategies:
Panic mode
• When a parser encounters an error anywhere in the statement, it
ignores the rest of the statement by not processing the input from the
erroneous token up to a delimiter, such as a semicolon.
• This is the easiest way of error recovery, and it also prevents the
parser from entering infinite loops.
Statement mode
• When a parser encounters an error, it tries to take corrective
measures so that the rest of the inputs of the statement allow the
parser to parse ahead.
• For example, inserting a missing semicolon, or replacing a comma
with a semicolon.
Error Recovery (…)
Error productions
• Some common errors that may occur in the code are known to the
compiler designers.
• The designers can create an augmented grammar to be used, with
productions that generate the erroneous constructs when these
errors are encountered.
Global correction
• The parser considers the program in hand as a whole, tries to
figure out what the program is intended to do, and tries to find a
closest match for it which is error-free.
• When an erroneous input (statement) X is fed, it creates a parse
tree for some closest error-free statement Y.
• This may allow the parser to make minimal changes in the source
code.
Cousins of compiler
1. Preprocessor:
• A preprocessor is a program that processes its input data to
produce output that is used as input to another program.
• The output is said to be a preprocessed form of the input data,
which is often used by some subsequent programs like compilers.
• Preprocessors may perform the following functions: macro processing,
rational preprocessing, file inclusion, language extension.
• Macro processing:
✔ A macro is a rule or pattern that specifies how a certain input
sequence should be mapped to an output sequence according to a
defined procedure.
✔ The mapping process that instantiates a macro into a specific
output sequence is known as macro expansion.
Cousins of compiler (…)
• File Inclusion:
✔ The preprocessor includes header files into the program text.
✔ When the preprocessor finds an #include directive, it replaces it
with the entire content of the specified file.
• Rational Preprocessors:
✔ These processors augment older languages with more modern
flow-of-control and data-structuring facilities.
• Language extension:
✔ These processors attempt to add capabilities to the language by
what amounts to built-in macros.
✔ For example, the language Equel is a database query language
embedded in C.
Cousins of compiler (…)
2. Assembler
• An assembler creates object code by translating assembly
instruction mnemonics into machine code.
• There are two types of assemblers:
✔ One-pass assemblers go through the source code once and
assume that all symbols will be defined before any instruction that
references them.
✔ Two-pass assemblers create a table with all symbols and their
values in the first pass, and then use the table in a second pass to
generate code.


Cousins of compiler (…)
3. Linker and Loader
• A linker or link editor is a program that takes one or more objects
generated by a compiler and combines them into a single
executable program.
• Three tasks of the linker are:
1. Searching the program to find library routines used by the program,
e.g. printf(), math routines.
2. Determining the memory locations that code from each module
will occupy, and relocating its instructions by adjusting absolute
references.
3. Resolving references among files.
• A loader is the part of an operating system that is responsible for
loading programs into memory, one of the essential stages in the
process of starting a program.
Grouping of Phases: Front End
Phases: Lexical analysis, Syntax analysis, Semantic analysis,
Intermediate code generation.
• The front end comprises the phases that depend on the
source language and are independent of the target machine
(target language).
• It includes lexical and syntactic analysis, symbol table
management, semantic analysis and the generation of
intermediate code.
• Some code optimization can also be done by the front end.
• It also includes error handling for the phases concerned.


Grouping of Phases (…)
Back End
• Phases: Code optimizer and code generator.
• The back end comprises those phases of the compiler that
depend on the target machine and are independent of the source
language.
• This includes code optimization and code generation.
• In addition, it also encompasses error handling and
symbol table management operations.


Grouping of Phases (…)
Passes
• The phases of a compiler can be implemented in a single pass by
marking the primary actions, viz. reading of the input file and writing
to the output file.
• Several phases of the compiler are grouped into one pass in such a
way that the operations in each and every phase are incorporated
during the pass.
• Lexical analysis, syntax analysis, semantic analysis and
intermediate code generation might be grouped into one pass. If
so, the token stream after lexical analysis may be translated
directly into intermediate code.
Grouping of Phases (…)
Reducing the Number of Passes
• Minimizing the number of passes improves time efficiency, as
reading from and writing to intermediate files can be reduced.
• When grouping phases into one pass, the entire program has to
be kept in memory to ensure proper information flow to each
phase, because one phase may need information in a different
order than the order in which it is produced by the previous phase.
• The source program or target program differs from its internal
representation, so the memory needed for the internal form may be
larger than that needed for the input and output.


Compiler construction tools
Parser Generators
Input: grammatical description of a programming language.
Output: syntax analyzer.
A parser generator takes the grammatical description of a
programming language and produces a syntax analyzer.
Scanner Generators
Input: regular-expression description of the tokens of a language.
Output: lexical analyzer.
A scanner generator generates lexical analyzers from a regular-
expression description of the tokens of a language.
Syntax-Directed Translation Engines
Input: parse tree.
Output: intermediate code.
Compiler construction tools (…)
Automatic Code Generators
Input: intermediate language.
Output: machine language.
A code generator takes a collection of rules that define the
translation of each operation of the intermediate language into the
machine language for a target machine.
Data-flow Analysis Engines
A data-flow analysis engine gathers information about how values are
transmitted from one part of a program to each of the other parts.
Data-flow analysis is a key part of code optimization.
Compiler Construction Toolkits
The toolkits provide an integrated set of routines for the various
phases of a compiler.
LEXICAL ANALYSIS
• Lexical analysis is the first phase of the
compiler; the lexical analyzer is also known as a scanner.
• It converts the high-level input program into a
sequence of tokens.
• Lexical analysis can be implemented with
deterministic finite automata.
• The output is a sequence of tokens that is sent
to the parser for syntax analysis.
What is a token?
A lexical token is a sequence of characters that
can be treated as a unit in the grammar of the
programming language.
Examples:
Keywords, identifiers (variable names, function names), operators, separators.
Not tokens: comments, preprocessor directives, macros, blanks, tabs, newlines.

FSA for relational operators (diagram)
FSA for naming variables in Fortran (diagram)
FSA for floating-point numbers (diagram)
LEXICAL ANALYSIS (…)
Lexemes:
• The sequence of characters matched by a pattern to form the corresponding token, or a
sequence of input characters that comprises a single token, is called a lexeme.
• There are some predefined rules for every lexeme to be identified as a valid token.
• These rules are defined by grammar rules, by means of a pattern.
Patterns:
• A pattern explains what can be a token, and these patterns are defined by means of regular
expressions.
LEXICAL ANALYSIS (…)
Functions of the Lexical Analyser:
1. Tokenization, i.e. dividing the program into valid tokens.
2. Removing white-space characters.
3. Removing comments.
4. It also helps in generating error messages by providing row and column numbers.
Issues in Lexical Analysis:
Reasons for separating the analysis phase into lexical and syntax analysis:
1. Simplicity: separation of lexical and syntax analysis allows us to simplify these phases.
2. Efficiency: a large amount of time is spent reading the source program and partitioning it into
tokens. Separate lexical and syntax analyzers working in parallel can significantly improve
performance.
3. Portability: the lexical analysis phase mostly differs between languages while the other
phases are largely the same; e.g., to construct a compiler for C++ we would only have to
develop the lexical analysis phase and could reuse the other phases of a Pascal compiler.
Attributes for tokens (diagram)
Specification of Tokens
Alphabet: any finite set of symbols.
Eg: {0,1} is the set of binary alphabet symbols.
Strings: any finite sequence of alphabet symbols is called a string. The length of a string is the total
number of occurrences of alphabet symbols in it. A string of zero length is known as an empty string
and is denoted by ε.
Language: a language is considered to be a set of strings over some finite set of alphabet symbols.
Computer languages are considered as sets, and mathematically set operations can be
performed on them. Finite languages can be described by means of regular expressions.
Longest Match Rule
When the lexical analyzer reads the source code, it scans the code letter by letter; when it
encounters a whitespace, operator symbol, or special symbol, it decides that a word is complete.
Eg: int intvalue;
• The Longest Match Rule states that the lexeme scanned should be determined based on the
longest match among all the tokens available (so intvalue is one identifier, not the keyword
int followed by value).
• The lexical analyzer also follows rule priority, where a reserved word, e.g. a keyword, of a
language is given priority over user input.
Specification of Tokens (…)
Prefix of a string:
A part of a string obtained by deleting zero or more trailing symbols of that string.
Example: prefixes of the string “apple” are apple, appl, app, ap, etc.
Suffix of a string:
A part of a string obtained by deleting zero or more leading symbols of that string.
Example: suffixes of “apple” are apple, pple, ple, le, etc.
Substring:
Any string obtained by deleting a prefix and a suffix from a string is called a substring.
Example: a substring of “apple” is ppl, etc.
Proper prefix, suffix, or substring:
A proper prefix, suffix or substring of a string “s” is a part “x” that is a prefix, suffix or
substring of “s” such that x ≠ s.
Subsequence:
A string formed by deleting zero or more not necessarily contiguous symbols from a string is
called a subsequence.
Specification of Tokens (…)
Regular Expressions:
• Regular expressions can be used to specify the structure of the tokens used in a programming
language.
• Regular expressions define the patterns for the tokens.
• When such a pattern is compared against a string, the result is either true or false.
• The set of strings described by a regular expression is called a regular set, and the language
described by a regular expression is called a regular language.
Examples: (figure)
Specification of Tokens (…)
Operations on Regular Languages
• Union: if L and M are two regular languages, then their union L ∪ M is also a regular language.
L ∪ M = {s | s is in L or s is in M}
• Concatenation: if L and M are two regular languages, then their concatenation LM is also a
regular language.
LM = {st | s is in L and t is in M}
• Kleene closure: if L is a regular language, then its Kleene closure L* is also a regular language.
L* = zero or more occurrences of language L.
• Positive closure: if L is a regular language, then its positive closure L+ is also a regular
language.
L+ = one or more occurrences of language L.
• Optional (?): if L is a regular language, then L? is also a regular language.
L? = zero or one occurrence of language L.
Specification of Tokens (…)
Regular Definitions

We can give names to regular expressions, and we can use these names as
symbols to define other regular expressions.
Specification of Tokens (…)
Examples of Regular Definitions
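The slide's own figure is replaced here by a standard example in the notation of the Aho–Sethi–Ullman textbook cited above: regular definitions for identifiers and unsigned integers.

letter -> A | B | ... | Z | a | b | ... | z
digit  -> 0 | 1 | ... | 9
id     -> letter ( letter | digit )*
digits -> digit digit*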
INPUT BUFFERING
Need for buffering:
● The main task of the lexical analyzer is to read the input characters of the source program,
group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in
the source program.
● When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table.
● To ensure that the right lexeme is found, one or more characters have to be looked at beyond
the next lexeme.
● Hence a two-buffer scheme is introduced to handle large lookaheads safely.
● The lexical analyzer not only identifies the lexemes but also pre-processes the source text,
e.g. removing comments and white space.
Lexical analyzers are divided into a cascade of two processes:
● Scanning: consists of the simple processes that do not require tokenization of the input,
such as deletion of comments and compaction of consecutive white-space characters into one.
● Lexical analysis proper: the more complex portion, where the scanner produces the sequence
of tokens as output.
INPUT BUFFERING (...)
Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark
the buffer end, have been adopted.
There are three general approaches to the implementation of a lexical analyzer:
1. By using a lexical-analyzer generator: the generator provides routines for reading
and buffering the input.
2. By writing the lexical analyzer in a conventional systems-programming language, using the I/O
facilities of that language to read the input.
3. By writing the lexical analyzer in assembly language and explicitly managing the reading of
input.
INPUT BUFFERING (...)
Buffer Pairs
Specialized buffering techniques are used to reduce the amount of overhead required
to process an input character.

● Consists of two buffers, each of N-character size, which are reloaded alternately.
● Two pointers: lexemeBegin and forward.
● lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
forward scans ahead until a match for a pattern is found.
● Once a lexeme is found, forward is set to the character at its right end, and lexemeBegin is
then set to the character immediately after the lexeme just found.
● The current lexeme is the set of characters between the two pointers.
INPUT BUFFERING (...)
Sentinels
● A sentinel is used to make a check each time the forward pointer is moved: a check is
done to ensure that one half of the buffer has not been moved off; if it has, then the other half
must be reloaded. The sentinel is a special character that cannot be part of the source
program (the eof character is used as the sentinel).
● Without sentinels, the ends of the buffer halves require two tests for each advance of the
forward pointer:
Test 1: for the end of the buffer.
Test 2: to determine what character is read.
● The use of a sentinel reduces the two tests to one by extending each buffer half to hold a
sentinel character at the end.
INPUT BUFFERING (...)
Pseudocode for input buffering (figure)
Pseudocode for input buffering with a reduced number of tests (figure)
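Since the pseudocode figures are given only as slides, the following self-contained C sketch shows the two-buffer scheme with a sentinel after each half, in the spirit of the textbook's version; the buffer size, the use of '\0' as a stand-in for the eof sentinel, and all names are assumptions.

#include <stdio.h>
#include <string.h>

#define N 8                   /* demo size of each buffer half */
#define SENTINEL '\0'         /* stand-in for the eof sentinel */

static char buf[2 * (N + 1)]; /* two halves, each with a sentinel slot */
static char *forward;         /* the forward scanning pointer */
static const char *input;     /* the demo "source file" */
static size_t ipos, ilen;

/* Refill one half with up to N characters, then drop in the sentinel. */
static void load_half(char *half) {
    size_t n = (ilen - ipos < N) ? ilen - ipos : N;
    memcpy(half, input + ipos, n);
    ipos += n;
    half[n] = SENTINEL;
}

/* Advance forward: one test normally, extra tests only at a sentinel. */
static int next_char(void) {
    char c = *forward++;
    if (c != SENTINEL)
        return (unsigned char)c;
    if (forward == buf + N + 1) {          /* end of first half */
        load_half(buf + N + 1);
        forward = buf + N + 1;
        return next_char();
    }
    if (forward == buf + 2 * (N + 1)) {    /* end of second half */
        load_half(buf);
        forward = buf;
        return next_char();
    }
    return -1;                             /* sentinel inside a half: real eof */
}

int main(void) {
    input = "newval := oldval + 12";
    ilen = strlen(input);
    load_half(buf);
    forward = buf;
    for (int c; (c = next_char()) != -1; )
        putchar(c);
    putchar('\n');
    return 0;
}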
INPUT BUFFERING (...)
Disadvantages:
● This scheme works well most of the time, but the amount of lookahead is limited.
● This limited lookahead may make it impossible to recognize tokens in situations where the
distance that the forward pointer must travel is more than the length of the buffer.
Advantages:
● Most of the time, it performs only one test, to see whether the forward pointer points to an eof.
● Only when it reaches the end of a buffer half or eof does it perform more tests.
● Since N input characters are encountered between eofs, the average number of tests per
input character is very close to 1.
FINITE STATE AUTOMATA
• A finite automaton or finite state machine is an abstract machine which has five
elements or tuples.
• It has a set of states and rules for moving from one state to another, depending upon
the applied input symbol.
• It is an abstract model of a digital computer.
• The formal specification of the machine is { Q, Σ, q, F, δ }:
Q : finite set of states.
Σ : set of input symbols.
q : initial state.
F : set of final states.
δ : transition function.
• Two types: Deterministic Finite State Automata (DFA); Non-deterministic
Finite State Automata (NFA)
FINITE STATE AUTOMATA (…)
(diagrams)
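As a concrete instance of the five-tuple { Q, Σ, q, F, δ }, here is a table-driven DFA in C that accepts binary strings containing an even number of 1s; the machine itself is an illustrative assumption, not one of the slides' examples.

#include <stdio.h>

/* DFA M = {Q, Σ, q0, F, δ} accepting binary strings with an
   even number of 1s.  Q = {0,1}, Σ = {'0','1'}, q0 = 0, F = {0}. */
static const int delta[2][2] = {
    /* input:  '0' '1' */
    /* q0 */  { 0,  1 },
    /* q1 */  { 1,  0 },
};

static int accepts(const char *s) {
    int q = 0;                                 /* start state q0 */
    for (; *s; s++) {
        if (*s != '0' && *s != '1') return 0;  /* reject symbols outside Σ */
        q = delta[q][*s - '0'];                /* q = δ(q, a) */
    }
    return q == 0;                             /* accept iff q is in F */
}

int main(void) {
    printf("%d\n", accepts("1011"));           /* 0: three 1s */
    printf("%d\n", accepts("1001"));           /* 1: two 1s  */
    return 0;
}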
LEX: A tool for lexical analysis
● Lex is a tool used in the lexical analysis phase to recognize tokens using regular expressions.
● Lex is a program that generates lexical analyzers. It is used with the YACC parser generator.
● The lexical analyzer is a program that transforms an input stream into a sequence of
tokens.
● Lex reads its input specification and produces, as output, source code implementing
the lexical analyzer in the C language.
LEX: A tool for lexical analysis (…)
The function of Lex is as follows:
● First, a Lex program lex.l is written in the Lex language. The Lex compiler then runs on the
lex.l program and produces a C program, lex.yy.c.
● Next, the C compiler compiles the lex.yy.c program and produces an object program, a.out.
● a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
LEX: File format
A Lex source file has three sections: definitions, rules, and user subroutines.
Definitions include declarations of constants, variables and regular definitions.
Rules define statements of the form p1 {action1} p2 {action2} ... pn {actionN},
where pi describes a regular expression and actioni describes what action the
lexical analyzer should take when pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions. The subroutines can be
loaded with the lexical analyzer and compiled separately.
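An illustrative lex.l in this three-section format (an assumption, not the course's own example) that counts the lines and words on its input:

%{
#include <stdio.h>
int lines = 0, words = 0;            /* definitions section: C declarations */
%}
%%
\n           { lines++; }            /* rules section: pattern { action }  */
[^ \t\n]+    { words++; }
[ \t]+       { /* skip blanks */ }
%%
int yywrap(void) { return 1; }       /* user subroutines: no further input */
int main(void) {
    yylex();                         /* run the generated scanner */
    printf("%d lines, %d words\n", lines, words);
    return 0;
}

It could be built with something like lex count.l && cc lex.yy.c -o count, where count.l is the file above.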
Files created by LEX
● lex.l is an input file, written in a language which describes the generation of a lexical
analyzer. The Lex compiler transforms lex.l into a C program known as lex.yy.c.
● lex.yy.c is compiled by the C compiler into a file called a.out.
● The output of the C compiler is the working lexical analyzer, which takes a stream of input
characters and produces a stream of tokens.
LEX Variables
● yyin is a variable of type FILE* and
points to the input file. yyin is defined by
LEX automatically. If the programmer
assigns an input file to yyin in the
auxiliary functions section, then yyin is
set to point to that file. Otherwise LEX
assigns yyin to stdin (console input).
LEX Variables (…)
● yytext is of type char* and it contains
the lexeme currently found. A lexeme
is a sequence of characters in the
input stream that matches some
pattern in the Rules section. Each
invocation of the function yylex()
results in yytext carrying a pointer to
the lexeme found in the input stream
by yylex(). The value of yytext will be
overwritten after the next yylex()
invocation.
LEX Variables (…)
● yyleng is a variable of type int and it
stores the length of the lexeme pointed
to by yytext.
LEX Functions
● yylex() is a function of return type int. LEX
automatically defines yylex() in lex.yy.c but does
not call it.
● The programmer must call yylex() in the auxiliary
functions section of the LEX program.
● LEX generates code for the definition of yylex()
according to the rules specified in the Rules
section.
● When yylex() is invoked, it reads the input as
pointed to by yyin and scans through the input
looking for a matching pattern.
● When the input or a part of the input matches
one of the given patterns, yylex() executes the
corresponding action associated with the pattern
as specified in the Rules section.
LEX Functions (…)
● LEX declares the function yywrap() of return
type int in the file lex.yy.c.
● LEX does not provide any definition for yywrap().
yylex() makes a call to yywrap() when it
encounters the end of input.
● If yywrap() returns zero (indicating false), yylex()
assumes there is more input and continues
scanning from the location pointed to by yyin.
● If yywrap() returns a non-zero value (indicating
true), yylex() terminates the scanning process
and returns 0 (i.e. it “wraps up”).
● If the programmer wishes to scan more than one
input file using the generated lexical analyzer, it
can be done simply by setting yyin to a new
input file in yywrap() and returning 0.
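A sketch of that last point, written as it might appear in the user-subroutines section of a Lex specification (where yyin is in scope); the second file's name is a placeholder assumption:

/* User-subroutines section: scan the first input, then a second file. */
static int second_file_done = 0;

int yywrap(void) {
    if (!second_file_done) {
        FILE *f = fopen("file2.txt", "r");   /* placeholder file name */
        if (f != NULL) {
            yyin = f;                        /* redirect the scanner's input */
            second_file_done = 1;
            return 0;                        /* 0: more input, keep scanning */
        }
    }
    return 1;                                /* 1: no more input, wrap up */
}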
Disambiguation
● yylex() uses two important disambiguation rules in selecting the right action to
execute in case there is more than one pattern that matches a string in the given input:

1. Choose the first match.

2. The "longest match" is preferred.
Example: Token identification (figure)
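As an illustration in place of the figure (an assumed example, not the slide's own): for the statement int x = 42; a lexical analyzer would report the keyword int, the identifier x, the operator =, the constant 42, and the separator ;.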
