LT Unit 3 Notes 2017

Language Translators

Unit -3

Syllabus

Source Program Analysis: Compilers – Analysis of the Source Program – Phases of a Compiler
– Cousins of Compiler – Grouping of Phases – Compiler Construction Tools.
Lexical Analysis: Role of Lexical Analyzer – Input Buffering – Specification of Tokens –
Recognition of Tokens –A Language for Specifying Lexical Analyzer.

Text Book

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers: Principles, Techniques, and Tools”,
Addison-Wesley, 1988.

Compiler

A compiler is a program that can read a program in one language (the source language) and
translate it into an equivalent program in another language (the target language). It is also
expected that a compiler should make the target code efficient and optimized in terms of time
and space. An important role of the compiler is to report any errors in the source program that it
detects during the translation process.

Commonly, the source language is a high-level programming language (i.e., a problem-oriented
language), and the target language is a machine language or assembly language (i.e., a machine-
oriented language). Thus compilation is a fundamental concept in the production of software: it
is the link between the (abstract) world of application development and the low-level world of
application execution on machines.

Compiler design principles provide an in-depth view of the translation and optimization process.
Compiler design covers the basic translation mechanism and error detection and recovery. It includes
lexical, syntax, and semantic analysis as the front end, and code generation and optimization as the
back end.
-------

Language Processing System (Cousins of Compiler)

In addition to a compiler, several other programs may be required to create an executable target
program.

 A source program may be divided into modules stored in separate files.


 The task of collecting the source program is sometimes entrusted to a separate program,
called a preprocessor. The preprocessor may also expand shorthands, called macros, into
source language statements. The modified source program is then fed to a compiler.
 The compiler may produce an assembly-language program as its output, because
assembly language is easier to produce as output and is easier to debug.
 The assembly language is then processed by a program called an assembler that produces
relocatable machine code as its output.
 Large programs are often compiled in pieces, so the relocatable machine code may have
to be linked together with other relocatable object files and library files into the code that
actually runs on the machine. The linker resolves external memory addresses, where the
code in one file may refer to a location in another file.
 The loader then puts together all of the executable object files into memory for execution.

I) Preprocessor: A preprocessor is a program that processes its input data to produce output that is
used as input to another program. The preprocessor is executed before the actual compilation of code
begins. It may perform the following functions:
1. Macro processing 2. File inclusion 3. Rational preprocessing 4. Language extension

1. Macro processing: A macro is a rule or pattern that specifies how a certain input sequence (often a
sequence of characters) should be mapped to an output sequence (also often a sequence of characters)
according to a defined procedure.

Macro definitions (#define, #undef)

When the preprocessor encounters a #define directive, it replaces any occurrence of the identifier in
the rest of the code by the replacement text.

Example:
#define TABLE_SIZE 100
int table1[TABLE_SIZE];

After the preprocessor has replaced TABLE_SIZE, the code becomes equivalent to:

int table1[100];

2. File Inclusion

Preprocessor includes header files into the program text. When the preprocessor finds an #include
directive it replaces it by the entire content of the specified file. There are two ways to specify a file
to be included:

#include "file" and #include <file>

The only difference between both expressions is the places (directories) where the compiler is going
to look for the file.

In the first case where the file name is specified between double-quotes, the file is searched first
in the same directory that includes the file containing the directive. In case that it is not there, the
compiler searches the file in the default directories where it is configured to look for the standard
header files.
If the file name is enclosed between angle-brackets <> the file is searched directly where the
compiler is configured to look for the standard header files. Therefore, standard header files are
usually included in angle-brackets, while other specific header files are included using quotes.
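A minimal illustration of the two forms (the local header name defs.h and the macro MAX are hypothetical, used only for this example):

#include <stdio.h>    /* searched in the standard include directories          */
#include "defs.h"     /* searched first in the directory of the including file */

int main(void) {
    printf("MAX = %d\n", MAX);   /* MAX is assumed to be defined in defs.h */
    return 0;
}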

3. Rational Preprocessors:

These preprocessors augment older languages with more modern flow-of-control and data-structuring
facilities. For example, such a preprocessor might provide the user with built-in macros for
constructs like while-statements or if-statements, where none exist in the programming language
itself.

4. Language extension:

These preprocessors attempt to add capabilities to the language by what amounts to built-in macros.
For example, the language Equel is a database query language embedded in C. Statements beginning
with ## are taken by the preprocessor to be database-access statements.

II) Assembler:

Typically, a modern assembler creates object code by translating assembly instruction mnemonics
into opcodes, and by resolving symbolic names for memory locations and other entities. There are
two types of assemblers based on how many passes through the source are needed to produce the
executable program.
One-pass
Two-pass

One-pass assembler goes through the source code once and assumes that all symbols will be defined
before any instruction that references them. Two-pass assemblers create a table with all symbols and
their values in the first pass, and then use the table in a second pass to generate code.

III) Linkers and Loaders:

Compilers, assemblers and linkers usually produce code whose memory references are made relative
to an undetermined starting location that can be anywhere in memory (relocatable machine code). A
loader calculates appropriate absolute addresses for these memory locations and amends the code to
use these addresses. The process of loading consists of taking relocatable machine code, altering
the relocatable addresses and placing the altered instructions and data in memory at the
proper locations.

A linker combines object code (machine code that has not yet been linked) produced from compiling
and assembling many source programs, as well as standard library functions and resources supplied
by the operating system. This involves resolving references in each object file to external variables
and procedures declared in other files. A linker or link editor is a program that takes one or more
objects generated by a compiler and combines them into a single executable program.

------

ANALYSIS OF THE SOURCE PROGRAM

The analysis phase breaks up the source program into constituent pieces and creates an intermediate
representation of the source program. Analysis consists of three phases:
• Linear analysis
• Hierarchical analysis
• Semantic analysis

Linear analysis (Lexical analysis or Scanning) :

The lexical analysis phase reads the characters in the source program and groups them into tokens,
which are sequences of characters having a collective meaning.

Example: position := initial + rate * 10

Identifiers – position, initial, rate
Assignment symbol – :=
Operators – +, *
Number – 10
Blanks – eliminated

Hierarchical analysis (Syntax analysis or Parsing) :

It involves grouping the tokens of the source program hierarchically into nested collections that are
used by the compiler to synthesize output.

Semantic analysis :

This phase checks the source program for semantic errors and gathers type information for the
subsequent code-generation phase. An important component of semantic analysis is type checking.
Example: int to real conversion

---- ---

Analysis – Synthesis Model of Compilation

The process of compilation has two parts namely : Analysis and Synthesis

The analysis part is often called the front end of the compiler; the synthesis part is the back end
of the compiler.

Analysis :The analysis part breaks up the source program into constituent pieces and creates an
intermediate representation of the source program. The front end analyzes the source program,
determines its constituent parts, and constructs an intermediate representation of the program.
Typically the front end is independent of the target language.

Synthesis : The synthesis part constructs the desired target program from the intermediate
representation . The back end synthesizes the target program from the intermediate representation
produced by the front end. Typically the back end is independent of the source language.

Phases of a Compiler

A compiler operates in phases. A phase is a logically interrelated operation that takes the source program
in one representation and produces output in another representation. The different phases are as
follows:

1. Lexical analysis (“scanning”)


o Reads in program, groups characters into “tokens”
2. Syntax analysis (“parsing”)
o Structures token sequence according to grammar rules of the language.
3. Semantic analysis
o Checks semantic constraints of the language.
4. Intermediate code generation
o Translates to “lower level” representation.
5. Code optimization
o Improves code quality.
6. Final code generation.

Front end : machine independent phases


1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
Back end : machine dependent phases
5. Code Optimization
6. Target Code Generation

Lexical Analysis

The first phase of a compiler is called lexical analysis, linear analysis, or scanning. The lexical
analyzer reads the stream of characters making up the source program and groups the characters
into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as
output a token of the form (token-name, attribute-value), which it passes on to the subsequent
phase, syntax analysis.

For example, suppose a source program contains the assignment statement
position := initial + rate * 60

The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:

1. The identifier position


2. The assignment symbol :=
3. The identifier initial
4. The plus sign
5. The identifier rate
6. The multiplication sign
7. The number 60

The blanks separating the characters are eliminated during lexical analysis
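The (token-name, attribute-value) pairs handed to the parser can be pictured with a small C sketch; the enum names and the token stream shown in the comment are illustrative only, not part of the text:

/* Hypothetical token representation for: position := initial + rate * 60 */
enum token_name { ID, ASSIGN, PLUS, MUL, NUM };

struct token {
    enum token_name name;   /* token class                                           */
    int attribute;          /* e.g. symbol-table index for ID, literal value for NUM */
};

/* Stream produced by the lexical analyzer:
   (ID,1) (ASSIGN,-) (ID,2) (PLUS,-) (ID,3) (MUL,-) (NUM,60)
   where 1, 2, 3 are the symbol-table entries for position, initial, rate. */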

Syntax Analysis

The second phase of the compiler is syntax analysis or hierarchical analysis or parsing. In this
phase expressions, statements, declarations etc… are identified by using the results of lexical
analysis. The tokens from the lexical analyzer are grouped hierarchically into nested collections
with collective meaning. Syntax analysis is aided by using techniques based on formal grammar
of the programming language. This is represented using a parse tree.

The tokens from the lexical analyzer are grouped hierarchically into nested collections with
collective meaning called “Parse Tree” followed by syntax tree as output.
A syntax tree is a compressed representation of the parse tree in which the operators appear as
interior nodes and the operands as leaf nodes.

          :=
         /  \
      id1    +
            / \
         id2   *
              / \
           id3   60

Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation. An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands. For example, a binary arithmetic
operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the
operator is applied to a floating-point number and an integer, the compiler may convert the
integer into a floating-point number. For the above syntax tree, applying this conversion and treating
all the identifiers as real values, we get:

          :=
         /  \
      id1    +
            / \
         id2   *
              / \
           id3   inttoreal
                     |
                     60

Intermediate Code Generation

 Intermediate code should possess the following properties


 IC should be easily generated from the semantic representation of the source program
 Should be easy to translate the IC to Target Program
 Should be capable of holding the values computed during translation
 Should maintain precedence ordering of the source language
 Should be capable of holding the correct number of operands of the instruction.

An intermediate form called three-address code is considered, which consists of a sequence of
assembly-like instructions with three operands per instruction. Properties of three-address
instructions:

1. Each three-address assignment instruction has at most one operator on the right side.
2. The compiler must generate a temporary name to hold the value computed by a three-address
instruction.
3. Some three-address instructions may have fewer than three operands.

 Three-address code consists of a sequence of instructions, each of which has at most
three operands; e.g., A = B + C, A = B, Sum = 10.

temp1 = inttoreal(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
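One common concrete representation of three-address code is the quadruple; a minimal C sketch (the struct and field names are illustrative only):

/* A three-address instruction stored as a quadruple: result = arg1 op arg2 */
struct quad {
    char op[12];      /* operator, e.g. "+", "*", "inttoreal"   */
    char arg1[16];    /* first operand (name or temporary)      */
    char arg2[16];    /* second operand, may be empty           */
    char result[16];  /* name that receives the computed value  */
};

/* The sequence above, written as quadruples:
   { "inttoreal", "60",    "",      "temp1" },
   { "*",         "id3",   "temp1", "temp2" },
   { "+",         "id2",   "temp2", "temp3" },
   { "=",         "temp3", "",      "id1"   }                               */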

Code Optimization

 The machine-independent code-optimization phase attempts to improve the intermediate
code so that better target code will result. There is great variation in the amount of code
optimization different compilers perform. Those that do the most are called "optimizing
compilers." A significant amount of time is spent on this phase. There are simple
optimizations that significantly improve the running time of the target program without
slowing down compilation too much. Aim: to improve the intermediate code so as to
generate code that runs faster and/or occupies less space in memory.

 Trade-off: compilation speed vs. execution speed
 Two optimization techniques:
 Local optimization
 Elimination of common subexpressions, copy propagation
 Loop optimization
 Finding loop invariants and moving them out of the loop

Optimized Code

temp1 := id3 * 60.0
id1 := id2 + temp1

Code Generation

The code generator takes as input an intermediate representation of the source program and maps
it into the target language. If the target language is machine code, registers or memory locations
are selected for each of the variables used by the program. Then, the intermediate instructions are
translated into sequences of machine instructions that perform the same task.

 The final phase of the compiler is the generation of target code, consisting normally of
relocatable machine code or assembly code.
 Memory locations are selected for each of the variables used by the program. Then,
intermediate instructions are each translated into a sequence of machine instructions that
perform the same task.
 A crucial aspect is the assignment of variables to registers.

MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

Symbol-Table Management

 An essential function of a compiler is to record the variable names used in the source
program and collect information about various attributes of each name.
 These attributes may provide information about the storage allocated for a name, its type, its
scope (where in the program its value may be used), and in the case of procedure names,
such things as the number and types of its arguments, the method of passing each argument
(for example, by value or by reference), and the type returned.
 The symbol table is a data structure containing a record for each variable name, with fields
for the attributes of the name. When an identifier in the source program is detected by the lexical
analyzer, the identifier is entered into the symbol table (a small C sketch of such a record follows the table below).

 The data structure should be designed to allow the compiler to find the record for each name
quickly and to store or retrieve data from that record quickly.
Address   Symbol     Attribute    Memory Location
1         position   id1, real    1000
2         =          operator     1100
3         initial
4         +
5         rate
6         *
7         10
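A minimal C sketch of such a record and a linear-search lookup (the field names and sizes are illustrative; production compilers typically hash on the name):

#include <string.h>

struct sym_entry {
    char name[32];     /* lexeme, e.g. "position"                 */
    char type[16];     /* attribute such as "real" or "operator"  */
    int  location;     /* assigned memory location, 0 if none yet */
};

static struct sym_entry table[256];
static int sym_count = 0;

/* Return the index of name, inserting a new record if it is not present. */
int lookup_or_insert(const char *name) {
    for (int i = 0; i < sym_count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strncpy(table[sym_count].name, name, sizeof table[sym_count].name - 1);
    return sym_count++;
}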

Error Detection and Reporting

 Each phase can encounter errors. One of the features of the compiler is to detect and report errors.
 Lexical Analysis --- Characters may be misspelled
 Syntax Analysis --- Structure of the statement violates the rules of the language
 Semantic Analysis --- No meaning in the operation involved
 Intermediate Code Generation --- Operands have incompatible data types
 Code Optimizer --- Certain Statements may never be reached
 Code Generation --- Constant is too long
 Symbol Table --- Multiple declared variables

 The syntax and semantic analysis phases usually handle a large fraction of the errors
detectable by the compiler. The lexical phase can detect errors where the characters
remaining in the input do not form any token of the language. Errors when the token
stream violates the syntax of the language are determined by the syntax analysis phase.
During semantic analysis the compiler tries to detect constructs that have the right
syntactic structure but no meaning to the operation involved.

 After detecting an error, a phase must be able to recover from the error so that
compilation can proceed and allow further errors to be detected.

 A compiler which stops after detecting the first error is not useful. On detecting an error
the compiler must:

 report the error in a helpful way,
 correct the error if possible, and
 continue processing (if possible) after the error to look for further errors.

---- ----- -----


Grouping Of Phases

Activities from more than one phase are often grouped together. The phases are collected into a front
end and a back end.

Front End:
 The front end consists of those phases, or parts of phases, that depend primarily on the
source language and are largely independent of the target machine.
 Lexical and syntactic analysis, symbol-table creation, semantic analysis, and the generation of
intermediate code are included.
 A certain amount of code optimization can be done by the front end.
 It also includes the error handling that goes along with each of these phases.

Back End:
 The back end includes those portions of the compiler that depend on the target machine;
these portions do not depend on the source language.
 It includes aspects of the code-optimization phase and code generation, along with the
necessary error handling and symbol-table operations.

Passes:Several phases of compilation are usually implemented in a single pass consisting of


reading an input file and writing an output file.
 It is common for several phases to be grouped into one pass, and for the activity of these
phases to be interleaved during the pass.
 Eg: Lexical analysis, syntax analysis, semantic analysis and intermediate code generation
might be grouped into one pass. If so, the token stream after lexical analysis may be
translated directly into intermediate code.

Reducing the number of passes: It is desirable to have relatively few passes, since it takes
time to read and write intermediate files.
 On reducing the number of passes, the entire information of a pass has to be kept in memory;
this increases the memory space needed to store the information.
 Lexical analysis and syntax analysis are commonly grouped into one pass.
 Code generation cannot be done before intermediate code generation.
 Intermediate and target code generation can be combined using backpatching (the target
address of a branch instruction is left blank and filled in later, when the information
becomes available); a small sketch follows below.
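A minimal C sketch of the backpatching idea (the instruction format and function names are invented for illustration):

#include <stdio.h>

struct instr { const char *op; int target; };   /* target < 0 means "not yet known" */

static struct instr code[100];
static int next = 0;

int emit_jump(const char *op) {          /* emit a branch with an unfilled target     */
    code[next].op = op;
    code[next].target = -1;
    return next++;                       /* remember where to patch later             */
}

void backpatch(int at, int target) {     /* fill in the blank once the label is known */
    code[at].target = target;
}

int main(void) {
    int j = emit_jump("goto");           /* forward jump, destination unknown here    */
    /* ... instructions for the skipped-over code would be emitted here ... */
    backpatch(j, next);                  /* now the destination address is known      */
    printf("instruction %d jumps to %d\n", j, code[j].target);
    return 0;
}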
Compiler-Construction Tools

 Compiler Construction tools are the tools that have been created for automatic design of
specific compiler components. Some commonly used compiler-construction tools include
1. Parser generator

2. Scanner generator
3. Syntax-directed translation engine
4. Automatic code generator
5. Data flow engine

Parser generators
- produce syntax analyzers from input that is based on context-free grammar.
- Earlier, syntax analysis consumed a large fraction of the running time of a compiler and a
large fraction of the intellectual effort of writing a compiler.
- This phase is now considered one of the easiest to implement.
- Many parser generators utilize powerful parsing algorithms that are too complex to be
carried out by hand.

Scanner generators
- automatically generate lexical analyzers from a specification based on regular
expressions.
- The basic organization of the resulting lexical analyzer is a finite automaton.

Syntax-directed translation engines

- produce collections of routines that walk a parse tree and generate intermediate code.
- The basic idea is that one or more “translations” are associated with each node of the
parse tree.
- Each translation is defined in terms of translations at its neighbor nodes in the tree.

Automatic code generators

- Such a tool takes a collection of rules that define the translation of each operation of the
intermediate language into the machine language for a target machine.
- The rules must include sufficient detail to handle the different possible access
methods for data.

Data-flow analysis engines

- facilitate the gathering of information about how values are transmitted from one part of a
program to each other part.
- Data-flow analysis is a key part of code optimization.

Lexical Analysis

 Lexical analysis is the first phase of a compiler. It takes the input characters of the source
program, groups them into lexemes, and produces as output a token for each lexeme
in the source program. The lexical analyzer also removes any whitespace and comments in the
source code.
 If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works
closely with the syntax analyzer. It reads character streams from the source code, checks for
legal tokens, and passes the data to the syntax analyzer when the latter demands it.

It produces as output a sequence of tokens that the parser uses for syntax analysis. Upon receiving a
"get next token" command from the parser, the lexical analyzer reads input characters until it can
identify the next token.

Other functions of lexical analyzer

1. stripping out comments and whitespace (blank, newline, tab characters that are used to
separate tokens in the input).

2. Another task is correlating error messages generated by the compiler with the source program.
For instance, the lexical analyzer may keep track of the number of newline characters seen, so it
can associate a line number with each error message.

Separation of lexical analysis from syntax analysis

1. Simplicity – Techniques for lexical analysis are less complex than those required for syntax
analysis, so the lexical-analysis process can be simpler if it is separated. Also, removing the low-
level details of lexical analysis from the syntax analyzer makes the syntax analyzer both smaller
and cleaner.

2. Efficiency – Because lexical analysis requires a significant portion of total compilation time, it
pays to optimize the lexical analyzer, whereas optimizing the syntax analyzer is less fruitful.
Separation facilitates this selective optimization.

3. Portability – Because the lexical analyzer reads input program files and often includes
buffering of that input, it is somewhat platform dependent. However, the syntax analyzer can be
platform independent. It is always a good practice to isolate machine dependent parts of any
software system.

Important terms

 Lexeme : Smallest logical units or words of a program


 E.g. int, main, true, 10.0, * , +
 Tokens : Classes of similar Lexeme
 Category to which a lexeme belongs to
 E.g. 10.0 belongs to float, int belongs to keyword, * belongs to operator, a
belongs to identifier
 Token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are, 1) Identifiers 2) keywords 3) operators 4) special symbols
5)constants
 Pattern: informal or formal description of a token
 An identifier is a string, in which the first character is an alphabet and the
successive characters are either digits or alphabets
 Patterns can be used to automatically generate a lexical analyzer

Lexical Errors

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-
code error. For instance in the following C statement
fi ( a == f(x) ) ...

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared


function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return
the token id to the parser and let some other phase of the compiler (probably the parser in this
case) handle an error due to transposition of the letters.

However, suppose a situation arises in which the lexical analyzer is unable to proceed because
none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery
strategy is "panic mode" recovery: we delete successive characters from the remaining input
until the lexical analyzer can find a well-formed token at the beginning of what input is left.

Other possible error-recovery actions are:

1. Delete one character from the remaining input.


2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.

Input Buffering

 Input Buffering is done to optimize the working speed of the lexical analyser.

 Eg. if (c==10)

 Lexeme_beginning = Pointer that indicates the beginning of the lexeme


 Search_Pointer= Pointer that keeps track of the portion of the input string scanned. This
pointer is also called as Look Ahead Pointer

The lexical analyzer scans the characters of the source program one at a time to discover tokens.
Often, however, many characters beyond the next token may have to be examined before the
next token itself can be determined. For this and other reasons, it is desirable for the lexical
analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of,
say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead
pointer scans ahead of the beginning point until the token is discovered. We view the
position of each pointer as being between the character last read and the character next to be read.
In practice, each buffering scheme adopts one convention: a pointer is at the symbol last read or
at the symbol it is ready to read.

The distance which the lookahead pointer may have to travel past the actual token may be large. For
example, in a PL/I program we may see
DECLARE (ARG1, ARG2, ..., ARGn)

without knowing whether DECLARE is a keyword or an array name until we see the character that
follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead
pointer travels beyond the buffer half in which it began, the other half must be loaded with the next
characters from the source file. Since the buffer shown in the figure above is of limited size, there is an
implied constraint on how much lookahead can be used before the next token is discovered.

In the above example, if the lookahead traveled to the left half and all the way through the left half
to the middle, we could not reload the right half, because we would lose characters that had not yet
been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering
scheme, we cannot ignore the fact that lookahead is limited.

 One Buffer Scheme


 Disadvantage: it is useful only for small lexemes. If a lexeme crosses the buffer boundary,
overwriting is required to read the full lexeme.
 Buffer Pairs – Buffer divided into two N- Character Halves

• Two pointers into the input buffer are maintained: forward and lexeme_beginning.


• The string of characters between the two pointers is the current lexeme.
• Both the pointers point to the first character of the next lexeme to be found.
• Forward pointer scans ahead until a match for a pattern is found. It will point to the
character at its right end after finding the token.

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;

 Two-Buffer Scheme

 Two buffers are used alternately.
 When Buffer 1 is full, Buffer 2 is used, and vice versa.
 The pointer sp has to be incremented, and two tests have to be made for each character:
one for the end of the buffer, and one to determine what character is read.
 Sentinel character (a special character that is not part of the program)
 E.g., EOF – end of file.
If sp encounters EOF in a buffer, the next buffer is refilled and used. In this way the two tests are
combined: only one test is performed per character.

Sentinels

For each character read, we make two tests: one for the end of the buffer, and one to determine
what character is read. We can combine the buffer-end test with the test for the current character
if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character
that cannot be part of the source program, and a natural choice is the character eof. Note that eof
retains its use as a marker for the end of the entire input. Any eof that appears other than at the
end of a buffer means that the input is at an end.

forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
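The same double-buffer-with-sentinels scheme can be sketched in C as follows (the buffer size, the function names, and the use of EOF as the sentinel value are illustrative; a real scanner would also maintain a lexeme_beginning pointer):

#include <stdio.h>

#define N 4096                         /* size of each buffer half                 */

static char  buf[2 * N + 2];           /* two halves plus one sentinel slot each   */
static char *forward;                  /* lookahead pointer                        */
static FILE *src;                      /* source file being scanned                */

/* Refill one half and write an EOF sentinel after the characters actually read.  */
static void reload(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = (char)EOF;               /* EOF inside a half means real end of input */
}

void init_buffer(FILE *f) {            /* prime the first half before scanning     */
    src = f;
    reload(buf);
    forward = buf;
}

/* Return the next character, reloading the other half when a sentinel is hit.
   Assumes the sentinel value never occurs as an ordinary source character.       */
int next_char(void) {
    while (*forward == (char)EOF) {
        if (forward == buf + N) {                /* sentinel ending the first half  */
            reload(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) { /* sentinel ending the second half */
            reload(buf);
            forward = buf;
        } else {
            return EOF;                          /* eof within a half: end of input */
        }
    }
    return (unsigned char)*forward++;
}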

Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns.

Strings and Languages

 An alphabet is a finite set of symbols.


 A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
 A language is any countable set of strings over some fixed alphabet.
 In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
The length of a string s, usually written |s|, is the number of occurrences of symbols in s.
 For example, banana is a string of length six. The empty string, denoted ε, is the string of
length zero.

Operations on Language
 L and M are languages
 Union of L and M - L U M = { s | s is in L OR s is in M }
 Intersection of L and M - L ∩ M = { s | s is in L AND s is in M }

 Concatenation of L and M - LM = { st | s is in L and t is in M }
 Exponentiation of L - L^i = L L^(i-1), with L^0 = { ε }
 Kleene closure of L (zero or more concatenations): L* = L^0 U L^1 U L^2 U ...
 Positive closure of L (one or more concatenations): L+ = L^1 U L^2 U L^3 U ...
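For example, if L = {a, b} and M = {c}, then L U M = {a, b, c}, LM = {ac, bc},
L^2 = LL = {aa, ab, ba, bb}, L* = {ε, a, b, aa, ab, ba, bb, aaa, ...}, and L+ is L* with ε removed
(unless ε is already a member of L).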

Rules governing the languages

 If L and M are two languages, then
 L U M = M U L
 ∅ U L = L U ∅ = L
 ∅L = L∅ = ∅
 If M contains only the empty string, i.e. M = {ε}, then
{ε} L = L {ε} = L

Terms for Parts of Strings

The following string-related terms are commonly used:

1. A prefix of string s is any string obtained by removing zero or more symbols from the end of
s. For example, ban, banana, and ε are prefixes of banana.

2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s. For example, nana, banana, and ε are suffixes of banana.

3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, banana,
nan, and ε are substrings of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and
substrings, respectively, of s that are neither ε nor s itself.

5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive
positions of s. For example, baan is a subsequence of banana.

Regular Expressions

1. Each regular expression r denotes a language L(r).


2. Here are the rules that define the regular expressions over some alphabet Σ and the
languages that those expressions denote.
3. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the
empty string.

4. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language
with one string, of length one, with a in its one position.
5. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.

 (r)|(s) is a regular expression denoting the language L(r) U L(s).


 (r)(s) is a regular expression denoting the language L(r)L(s).
 (r)* is a regular expression denoting (L(r))*.
 (r) is a regular expression denoting L(r).

 The unary operator * has highest precedence and is left associative.


 Concatenation has second highest precedence and is left associative. | has lowest
precedence and is left associative.
 A language that can be defined by a regular expression is called a regular set.

If two regular expressions r and s denote the same regular set, we say they are equivalent and
write r = s. For instance, (a|b) = (b|a).
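For example, over the alphabet Σ = {a, b}: the expression a|b denotes {a, b}; (a|b)(a|b) denotes
{aa, ab, ba, bb}; a* denotes {ε, a, aa, aaa, ...}; and (a|b)* denotes the set of all strings of a's and b's,
including the empty string.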

Regular Expression Operations

 Three Basic Operations


 Choice among the alternates
 Indicated by meta character |
 RE = R |S
 L ( R |S) = L (R) U L (S)
 Concatenation – RS
 L (RS) = L(R) L(S)
 Repetition – Kleene Closure [ Finite Concatenation of Strings ]
 R*
 Precedence – Repetition, Concatenation, Choice

Rules for constructing RE over an alphabet Σ

 ε is a RE
 If 'a' is a symbol in Σ, then a is a regular expression
 If 'r' and 's' are regular expressions then
 r | s is a RE
 r s is a RE
 If 'r' is a regular expression then
 r* is a RE
 (r) is a RE

Axioms for RE

 The operator | is
 commutative: r | s = s | r
 associative: r | (s | t) = (r | s) | t = r | s | t
 The operator '.' (concatenation) is
 associative: r.(s.t) = (r.s).t
 distributive over |: r(s | t) = rs | rt
 εr = rε = r
 r*r* = (r*)* = r* = rr* | ε
 (r | s)* = (r*s*)* = (r*s)*r* = (r* | s*)*
 rr* = r*r
 (rs)*r = r(sr)*

Notational Shorthands

Certain constructs occur so frequently in regular expressions that it is convenient to introduce
notational shorthands for them.

One or more instances (+)


- The unary postfix operator + means "one or more instances of".
- If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that
denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a's.
- The operator + has the same precedence and associativity as the operator *.

Zero or one instance ( ?)


- The unary postfix operator ? means "zero or one instance of".
- The notation r? is a shorthand for r | ε.
- If r is a regular expression, then (r)? is a regular expression that denotes the language
L(r) U { ε }.

Character Classes
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a-z] denotes the regular expression a | b | c | d | ... | z.
- Identifiers can be described as strings generated by the regular expression

[A-Za-z][A-Za-z0-9]*

Regular Set
- A language denoted by a regular expression is said to be a regular set.

Non-regular Set
- A language which cannot be described by any regular expression.

Eg. The set of all strings of balanced parentheses and repeating strings cannot be described by a
regular expression. This set can be specified by a context-free grammar.

Transition Diagrams

As an intermediate step in the construction of a lexical analyzer, we first convert patterns into
stylized flowcharts, called "transition diagrams." Transition diagrams have a collection of nodes
or circles, called states. Each state represents a condition that could occur during the process of
scanning the input looking for a lexeme that matches one of several patterns. Edges are directed
from one state of the transition diagram to another. Each edge is labeled by a symbol or set of
symbols.

Some important conventions about transition diagrams are:

1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been
found. We always indicate an accepting state by a double circle, and if there is an action to be
taken — typically returning a token and an attribute value to the parser — we shall attach that
action to the accepting state.

2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does
not include the symbol that got us to the accepting state), then we shall additionally place a *
near that accepting state.

3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start,"
entering from nowhere.

4. The transition diagram always begins in the start state before any input symbols have been
read.
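As an illustration (not taken from the text), the transition diagram for identifiers of the form letter (letter | digit)* can be hand-coded directly as a small state machine in C:

#include <ctype.h>

/* Recognizer for the identifier pattern letter (letter | digit)*.
   States: 0 = start, 1 = inside an identifier (the accepting state). */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (state == 0) {
            if (isalpha(c)) state = 1;     /* edge labeled "letter"             */
            else return 0;
        } else {
            if (isalnum(c)) state = 1;     /* edge labeled "letter or digit"    */
            else return 0;                 /* a real scanner would retract here */
        }
    }
    return state == 1;                     /* accept only if a final state is reached */
}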

Lex A Lexical Analyzer Generator

Lex helps write programs whose control flow is directed by instances of regular expressions in
the input stream. It is well suited for editor-script type transformations and for segmenting input
in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments. The table is
translated to a program which reads an input stream, copying it to an output stream and
partitioning the input into strings which match the given expressions. As each such string is
recognized the corresponding program fragment is executed. The recognition of the expressions
is performed by a deterministic finite automaton generated by Lex. The program fragments
written by the user are executed in the order in which the corresponding regular expressions
occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the
longest match possible at each input point. If necessary, substantial lookahead is performed on
the input, but the input stream will be backed up to the end of the current partition, so that the
user has general freedom to manipulate it.

Introduction.

Lex is a program generator designed for lexical processing of character input streams. It accepts
a high-level, problem oriented specification for character string matching, and produces a
program in a general purpose language which recognizes regular expressions. The regular
expressions are specified by the user in the source given to Lex. The Lex written code recognizes
these expressions in an input stream and partitions the input stream into strings matching the
expressions. At the boundaries between strings program sections provided by the user are
executed. The Lex source file associates the regular expressions and the program fragments. As
each expression appears in the input to the program written by Lex, the corresponding fragment
is executed.

Lex is not a complete language, but rather a generator representing a new language feature which
can be added to different programming languages, called ‘‘host languages.’’

Lex can write code in different host languages. The host language is used for the output code
generated by Lex and also for the program fragments added by the user. Compatible run-time
libraries for the different host languages are also provided. This makes Lex adaptable to different
environments and different users. Each application may be directed to the combination of
hardware and host language appropriate to the task, the user’s background, and properties of
local implementations.

Lex turns the user's expressions and actions (called the source in this memo) into the host general-
purpose language; the generated program is named yylex. The yylex program will recognize
expressions in a stream (called the input in this memo) and perform the specified actions for each
expression as it is detected.

Source → Lex → yylex
Input → yylex → Output

An overview of Lex
For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of
lines.
%%
[ \t]+$ ;
is all that is required. The program contains a %% delimiter to mark the beginning of the rules,
and one rule. This rule contains a regular expression which matches one or more instances of the
characters blank or tab (written \t for visibility, in accordance with the C language convention) just
prior to the end of a line. The brackets indicate a character class made of blank and tab; the +
indicates ''one or more ...''; and the $ indicates ''end of line,'' as in QED. No action is specified, so
the program generated by Lex (yylex) will ignore these characters. Everything else will be copied.
To change any remaining string of blanks or tabs to a single blank, add another rule:
%%
[ \t]+$ ;
[ \t]+ printf (" ");
The finite automaton generated for this source will scan for both rules at once, observing at the
termination of the string of blanks or tabs whether or not there is a newline character, and

executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of
lines, and the second rule all remaining strings of blanks or tabs. Lex can be used alone for
simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be
used with a parser generator to perform the lexical analysis phase; it is particularly easy to
interface Lex and Yacc. Lex programs recognize only regular expressions; Yacc writes parsers
that accept a large class of context-free grammars, but requires a lower-level analyzer to recognize
input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a
preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser
generator assigns structure to the resulting pieces. The flow of control in such a case (which
might be the first half of a compiler, for example) is shown below. Additional programs, written
by other generators or by hand, can be added easily to programs written by Lex.
  lexical rules      grammar rules
        ↓                  ↓
       Lex                Yacc
        ↓                  ↓
Input → yylex  →  yyparse → Parsed input

Lex with Yacc


Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named,
so that the use of this name by Lex simplifies interfacing. Lex generates a deterministic finite
automaton from the regular expressions in the source. The automaton is interpreted, rather than
compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by
a Lex program to recognize and partition an input stream is proportional to the length of the
input. The number of Lex rules or the complexity of the rules is not important in determining
speed, unless rules which include forward context require a significant amount of rescanning.
What does increase with the number and complexity of rules is the size of the finite automaton,
and therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the actions to be performed as each
regular expression is found) are gathered as cases of a switch. The automaton interpreter directs
the control flow. Opportunity is provided for the user to insert either declarations or additional
statements in the routine containing the actions, or to add subroutines outside this action routine.
Lex is not limited to source which can be interpreted on the basis of one-character lookahead.
For example, if there are two rules, one looking for ab and another for abcdefg, and the input
stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. Such backup
is more costly than the processing of simpler languages.

Lex Source.
General format of Lex source is:
{definitions}
%%
{rules}
%%
{user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but
the first is required to mark the beginning of the rules. The absolute minimum Lex program is
%% (no definitions, no rules), which translates into a program which copies the input to the
output unchanged. In the outline of Lex programs shown, the rules represent the user's control
decisions; they are a table, in which the left column contains regular expressions and the right
column contains actions to be executed when the expressions are recognized. Thus an individual
rule might appear as
integer printf("found keyword INT");
to look for the string integer in the input stream and print the message ''found keyword INT''
whenever it appears. In this example the host procedural language is C and the C library function
printf is used to print the string. The end of the expression is indicated by the first blank or tab
character. If the action is merely a single C expression, it can just be given on the right side of the
line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more
useful example, suppose it is desired to change a number of words from British to American spelling.
Lex rules such as
colour printf ("color");
Mechanise printf ("mechanize");
petrol printf("gas");
would be a start.

Lex Regular Expressions.

A regular expression specifies a set of strings to be matched. It contains text characters (which
match the corresponding characters in the strings being compared) and operator characters
(which specify repetitions, choices, and other features). The letters of the alphabet and the digits
are always text characters; thus the regular expression integer matches the string integer
wherever it appears and the expression a57D looks for the string a57D

Metacharacter Matches
. any character except newline
\n newline
* zero or more copies of preceding expression
+ one or more copies of preceding expression
? zero or one copy of preceding expression
^ beginning of line
$ end of line
a|b a or b
(ab)+ one or more copies of ab (grouping)
"a+b" literal “a+b” (C escapes still work)
[ ] character class

Expression Matches
abc abc
abc* ab, abc, abcc, abccc, …
abc+ abc, abcc, abccc, …
a(bc)+ abc, abcbc, abcbcbc, …
a(bc)? a, abc
[abc] a, b, c

[a-z] any letter, a through z
[a\-z] a, -, z
[-az] -, a, z
[A-Za-z0-9]+ one or more alphanumeric characters
[ \t\n]+ whitespace
[^ab] anything except: a, b
[a^b] a, ^, b
[a|b] a, |, b
a|b a or b

name function

int yylex(void) call to invoke lexer, returns token


char *yytext pointer to matched string
yyleng length of matched string
yylval value associated with token
int yywrap(void) wrap-up, return 1 if done, 0 if not done
FILE *yyout output file
FILE *yyin input file
INITIAL initial start condition

BEGIN condition switch start condition


ECHO write matched string
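A typical way these names are used from the host C program is sketched below; it assumes a Lex specification (here called scanner.l) has already been translated into lex.yy.c and that its rules return a nonzero token code:

/* driver.c - compile together with the Lex output, for example:
       lex scanner.l && cc driver.c lex.yy.c -ll                                 */
#include <stdio.h>

extern int   yylex(void);   /* generated scanner: returns the next token code   */
extern char *yytext;        /* text of the lexeme just matched                  */
extern FILE *yyin;          /* input stream read by the scanner                 */

int main(void) {
    int tok;
    yyin = stdin;                        /* scan the standard input              */
    while ((tok = yylex()) != 0)         /* 0 conventionally signals end of input */
        printf("token %d, lexeme \"%s\"\n", tok, yytext);
    return 0;
}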
Finite Automata

A recognizer for a language is a program that takes as input a string x and answers ‘yes’ if x is a
sentence of the language and ‘no’ otherwise. We compile a regular expression into a recognizer by
constructing a transition diagram called a finite automaton. A finite automaton can be deterministic or
nondeterministic, where nondeterministic means that more than one transition out of a state may be
possible on the same input symbol.
DFAs are faster recognizers than NFAs, but can be much bigger than equivalent NFAs.

Non deterministic finite automata


A mathematical model consisting of:
1) a set of states S
2) an input alphabet Σ
3) a transition function that maps state-symbol pairs to sets of states
4) an initial (start) state s0
5) a set of final (accepting) states F

Deterministic finite automata
A special case of an NFA in which
1) no state has an epsilon-transition
2) for each state s and input symbol a, there is at most one edge labeled a leaving s

Conversion of nfa to dfa


Subset construction algorithm
input: nfa N
output: equivalent dfa D
Method:
Operations on NFA states:

operation          description
ε-closure(s)       set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)       set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)         set of NFA states to which there is a transition on input symbol a from some
                   NFA state s in T
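A compact sketch of these two operations in C, representing a state set as a bit mask over at most 32 NFA states (the transition tables eps[] and delta[][] are assumed to be filled in elsewhere):

#include <stdint.h>

#define MAX_STATES 32

extern uint32_t eps[MAX_STATES];         /* eps[s]: states reachable from s by one epsilon edge */
extern uint32_t delta[MAX_STATES][256];  /* delta[s][a]: states reachable from s on symbol a    */

/* epsilon-closure(T): all states reachable from states in T on epsilon edges alone */
uint32_t eps_closure(uint32_t T) {
    uint32_t closure = T, old;
    do {
        old = closure;
        for (int s = 0; s < MAX_STATES; s++)
            if (closure & (1u << s))
                closure |= eps[s];
    } while (closure != old);            /* iterate until no new state is added */
    return closure;
}

/* move(T, a): states reachable from some state in T on one edge labeled a */
uint32_t move(uint32_t T, unsigned char a) {
    uint32_t result = 0;
    for (int s = 0; s < MAX_STATES; s++)
        if (T & (1u << s))
            result |= delta[s][a];
    return result;
}

/* One step of the subset construction: the DFA state reached from T on input a. */
uint32_t dfa_step(uint32_t T, unsigned char a) {
    return eps_closure(move(T, a));
}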

CONSTRUCTION OF AN NFA FROM A REGULAR EXPRESSION


Thompson's Construction
To convert a regular expression r over an alphabet Σ into an NFA N accepting L(r):
Parse r into its constituent sub-expressions.
Construct NFAs for each of the basic symbols in r.

For ε, construct the NFA consisting of a single edge labeled ε from a state i to a state f.

Here i is a new start state and f is a new accepting state. This NFA recognizes {ε}.

For a in Σ, construct the NFA consisting of a single edge labeled a from a state i to a state f.

Again i is a new start state and f is a new accepting state. This NFA accepts {a}.

If a occurs several times in r, then a separate NFA is constructed for each occurrence.
Keeping the syntactic structure of the regular expression in mind, combine these NFAs
inductively until the NFA for the entire expression is obtained. Each intermediate NFA produced
during the course of construction corresponds to a sub-expression r and has several important
properties – it has exactly one final state, no edge enters the start state and no edge leaves the final
state.

Suppose N(s) and N(t) are NFAs for regular expressions s and t.

(a) For the regular expression s|t, construct the composite NFA N(s|t): a new start state has ε-edges
to the start states of N(s) and N(t), and the accepting states of N(s) and N(t) have ε-edges to a new
accepting state.

(b) For the regular expression st, construct the composite NFA N(st): the accepting state of N(s) is
merged with the start state of N(t), so N(s) is followed directly by N(t).

(c) For the regular expression s*, construct the composite NFA N(s*): a new start state has ε-edges
to the start state of N(s) and to a new accepting state; the accepting state of N(s) has ε-edges back
to the start state of N(s) and to the new accepting state.

(d) For the parenthesized regular expression (s), use N(s) itself as the NFA.
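A compact C sketch of these constructions (the data structures and the fixed-size edge array are invented for illustration; case (b) below adds an epsilon edge instead of literally merging the two states, which yields an equivalent NFA):

#define EPSILON (-1)                    /* label used for epsilon edges           */

struct edge { int from, to, symbol; };
struct nfa  { int start, accept; };     /* every fragment: one start, one accept  */

static struct edge edges[1024];
static int n_edges = 0, n_states = 0;

static int new_state(void) { return n_states++; }

static void add_edge(int from, int to, int symbol) {
    edges[n_edges].from = from;
    edges[n_edges].to = to;
    edges[n_edges].symbol = symbol;
    n_edges++;
}

/* Basis: NFA for a single symbol a (pass EPSILON to build the NFA for the empty string). */
struct nfa nfa_symbol(int a) {
    struct nfa n;
    n.start = new_state();
    n.accept = new_state();
    add_edge(n.start, n.accept, a);
    return n;
}

/* (a) N(s|t): new start and accepting states with epsilon edges around both parts. */
struct nfa nfa_union(struct nfa s, struct nfa t) {
    struct nfa n;
    n.start = new_state();
    n.accept = new_state();
    add_edge(n.start, s.start, EPSILON);   add_edge(n.start, t.start, EPSILON);
    add_edge(s.accept, n.accept, EPSILON); add_edge(t.accept, n.accept, EPSILON);
    return n;
}

/* (b) N(st): the accepting state of N(s) is connected to the start state of N(t). */
struct nfa nfa_concat(struct nfa s, struct nfa t) {
    struct nfa n = { s.start, t.accept };
    add_edge(s.accept, t.start, EPSILON);
    return n;
}

/* (c) N(s*): epsilon edges allow zero or more passes through N(s). */
struct nfa nfa_star(struct nfa s) {
    struct nfa n;
    n.start = new_state();
    n.accept = new_state();
    add_edge(n.start, s.start, EPSILON);  add_edge(n.start, n.accept, EPSILON);
    add_edge(s.accept, s.start, EPSILON); add_edge(s.accept, n.accept, EPSILON);
    return n;
}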

--- -----
