LT Unit 3 Notes 2017
Unit 3
Syllabus
Source Program Analysis: Compilers – Analysis of the Source Program – Phases of a Compiler
– Cousins of Compiler – Grouping of Phases – Compiler Construction Tools.
Lexical Analysis: Role of Lexical Analyzer – Input Buffering – Specification of Tokens –
Recognition of Tokens –A Language for Specifying Lexical Analyzer.
Text Book
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, "Compilers: Principles, Techniques and Tools",
Addison-Wesley, 1988.
Compiler
A compiler is a program that can read a program in one language (the source language) and
translate it into an equivalent program in another language (the target language). It is also
expected that a compiler should make the target code efficient and optimized in terms of time
and space. An important role of the compiler is to report any errors in the source program that it
detects during the translation process.
Compiler design principles provide an in-depth view of translation and optimization process.
Compiler design covers basic translation mechanism and error detection & recovery. It includes
lexical, syntax, and semantic analysis as front end, and code generation and optimization as
back-end.
Language Processing System (Cousins of Complier)
In addition to a compiler, several other programs may be required to create an executable target
program.
I) Preprocessor: A preprocessor is a program that processes its input data to produce output that is
used as input to another program. The preprocessor is executed before the actual compilation of code
begins. It may perform the following functions:
1. Macro processing  2. File inclusion  3. Rational preprocessors  4. Language extension
1. Macro processing: A macro is a rule or pattern that specifies how a certain input sequence (often a
sequence of characters) should be mapped to an output sequence (also often a sequence of characters)
according to a defined procedure.
A macro definition in C has the form
#define identifier replacement
When the preprocessor encounters this directive, it replaces any occurrence of identifier in the rest of
the code by replacement.
Example:
#define TABLE_SIZE 100
int table1[TABLE_SIZE];
After the preprocessor has replaced TABLE_SIZE, the code becomes equivalent to:
int table1[100];
2. File Inclusion
Preprocessor includes header files into the program text. When the preprocessor finds an #include
directive, it replaces it by the entire content of the specified file. There are two ways to specify a file
to be included: #include "file" and #include <file>.
The only difference between the two forms is the places (directories) where the compiler is going
to look for the file.
In the first case where the file name is specified between double-quotes, the file is searched first
in the same directory that includes the file containing the directive. In case that it is not there, the
compiler searches the file in the default directories where it is configured to look for the standard
header files.
If the file name is enclosed between angle-brackets <> the file is searched directly where the
compiler is configured to look for the standard header files. Therefore, standard header files are
usually included in angle-brackets, while other specific header files are included using quotes.
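As a concrete illustration, a small C program might include a standard header with angle brackets, while a project-local header (here the hypothetical myheader.h) would be included with quotes:

#include <stdio.h>           /* searched in the standard include directories */
/* #include "myheader.h" */  /* searched first in the directory of the including file;
                                shown commented out because myheader.h is hypothetical */

int main(void) {
    printf("file inclusion example\n");
    return 0;
}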
3."Rational Preprocessors:
These processors augment older languages with more modern flow of control and data structuring
facilities. For example, such a preprocessor might provide the user with built-in macros for
constructs like while-statements or if-statements,where none exist in the programming language
itself.
4. Language extension:
These processors attempt to add capabilities to the language by what amounts to built-in macros.
For example, the language Equel is a database query language embedded in C. Statements beginning
with ## are taken by the preprocessor to be database-access statements.
II) Assembler:
Typically, a modern assembler creates object code by translating assembly instruction mnemonics
into opcodes, and by resolving symbolic names for memory locations and other entities. There are
two types of assemblers, based on how many passes through the source are needed to produce the
executable program:
- One-pass assemblers
- Two-pass assemblers
One-pass assembler goes through the source code once and assumes that all symbols will be defined
before any instruction that references them. Two-pass assemblers create a table with all symbols and
their values in the first pass, and then use the table in a second pass to generate code.
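The following is a minimal sketch in C of the two-pass idea, not a real assembler: the instruction format, label syntax and opcodes are invented for illustration. Pass 1 assigns an address to every instruction and records each label in a symbol table; pass 2 uses the table to resolve jump targets.

#include <stdio.h>
#include <string.h>

struct sym { char name[16]; int addr; };

int main(void) {
    /* toy "source": a line ending in ':' defines a label */
    const char *src[] = { "start:", "LOAD x", "JMP end", "ADD y", "end:", "HALT" };
    int n = sizeof src / sizeof src[0];
    struct sym table[16];
    int nsyms = 0, addr;

    /* pass 1: assign addresses and collect label definitions */
    addr = 0;
    for (int i = 0; i < n; i++) {
        size_t len = strlen(src[i]);
        if (src[i][len - 1] == ':') {              /* label definition */
            strncpy(table[nsyms].name, src[i], len - 1);
            table[nsyms].name[len - 1] = '\0';
            table[nsyms].addr = addr;
            nsyms++;
        } else {
            addr++;                                /* ordinary instruction */
        }
    }

    /* pass 2: emit instructions, resolving "JMP label" through the table */
    addr = 0;
    for (int i = 0; i < n; i++) {
        if (src[i][strlen(src[i]) - 1] == ':')
            continue;                              /* labels generate no code */
        if (strncmp(src[i], "JMP ", 4) == 0) {
            const char *target = src[i] + 4;
            for (int j = 0; j < nsyms; j++)
                if (strcmp(table[j].name, target) == 0)
                    printf("%d: JMP %d\n", addr, table[j].addr);
        } else {
            printf("%d: %s\n", addr, src[i]);
        }
        addr++;
    }
    return 0;
}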
III) Loader and Link-Editor:
Compilers, assemblers and linkers usually produce code whose memory references are made relative
to an undetermined starting location that can be anywhere in memory (relocatable machine code). A
loader calculates appropriate absolute addresses for these memory locations and amends the code to
use these addresses. The process of loading consists of taking relocatable machine code, altering
the relocatable addresses and placing the altered instructions and data in memory at the
proper locations.
A linker combines object code (machine code that has not yet been linked) produced from compiling
and assembling many source programs, as well as standard library functions and resources supplied
by the operating system. This involves resolving references in each object file to external variables
and procedures declared in other files. A linker or link editor is a program that takes one or more
objects generated by a compiler and combines them into a single executable program.
Analysis of the Source Program
The analysis phase breaks up the source program into constituent pieces and creates an intermediate
representation of the source program. Analysis consists of three phases:
• Linear analysis
• Hierarchical analysis
• Semantic analysis
Linear (lexical) analysis: the lexical analysis phase reads the characters in the source program and
groups them into tokens, which are sequences of characters having a collective meaning.
Hierarchical analysis: this phase groups the tokens of the source program hierarchically into nested
collections that are used by the compiler to synthesize output.
Semantic analysis: this phase checks the source program for semantic errors and gathers type
information for the subsequent code generation phase. An important component of semantic analysis
is type checking.
Example: int to real conversion
The process of compilation has two parts, namely analysis and synthesis.
The analysis part is often called the front end of the compiler; the synthesis part is the back end
of the compiler.
Analysis :The analysis part breaks up the source program into constituent pieces and creates an
intermediate representation of the source program. The front end analyzes the source program,
determines its constituent parts, and constructs an intermediate representation of the program.
Typically the front end is independent of the target language.
Synthesis : The synthesis part constructs the desired target program from the intermediate
representation . The back end synthesizes the target program from the intermediate representation
produced by the front end. Typically the back end is independent of the source language.
Phases of a Compiler
A compiler operates in phases. A phase is a logically interrelated operation that takes the source
program in one representation and produces output in another representation. The different phases
are as follows:
Lexical Analysis
The first phase of a compiler is called lexical analysis, linear analysis, or scanning. The lexical
analyzer reads the stream of characters making up the source program and groups the characters
into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as
output a token of the form - (token-name, attribute-value) , that it passes on to the subsequent
phase, syntax analysis.
For example, suppose a source program contains the assignment statement
position := initial + rate * 60
The characters in this assignment could be grouped into the following lexemes and mapped into
tokens passed on to the syntax analyzer: position, :=, initial, +, rate, *, and 60.
The blanks separating the lexemes are eliminated during lexical analysis.
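A minimal sketch of how these (token-name, attribute-value) pairs might be represented in C; the enum names and the struct layout are illustrative assumptions, not taken from the text:

#include <stdio.h>

enum token_name { ID, ASSIGN_OP, PLUS, MULT, NUM };

struct token {
    enum token_name name;
    int attribute;        /* e.g. symbol-table index for ID, numeric value for NUM */
};

int main(void) {
    struct token stream[] = {
        { ID, 1 },         /* position -> id1 */
        { ASSIGN_OP, 0 },  /* :=             */
        { ID, 2 },         /* initial  -> id2 */
        { PLUS, 0 },       /* +              */
        { ID, 3 },         /* rate     -> id3 */
        { MULT, 0 },       /* *              */
        { NUM, 60 }        /* 60             */
    };
    for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("(%d, %d)\n", stream[i].name, stream[i].attribute);
    return 0;
}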
Syntax Analysis
The second phase of the compiler is syntax analysis or hierarchical analysis or parsing. In this
phase expressions, statements, declarations etc… are identified by using the results of lexical
analysis. The tokens from the lexical analyzer are grouped hierarchically into nested collections
with collective meaning. Syntax analysis is aided by using techniques based on formal grammar
of the programming language. This is represented using a parse tree.
The tokens from the lexical analyzer are grouped hierarchically into nested collections with
collective meaning, represented by a parse tree; a syntax tree is then produced as output.
A syntax tree is a compressed representation of the parse tree in which the operators appear as
interior nodes and the operands as child nodes.
         :=
       /    \
    id1      +
           /   \
        id2     *
              /   \
           id3     60
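A minimal sketch of building and walking this syntax tree in C; the node layout and helper names are illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    char label[12];                 /* operator or operand name */
    struct node *left, *right;
};

static struct node *mk(const char *label, struct node *l, struct node *r) {
    struct node *n = malloc(sizeof *n);
    strcpy(n->label, label);
    n->left = l;
    n->right = r;
    return n;
}

static void print_prefix(const struct node *n) {   /* prefix walk of the tree */
    if (!n) return;
    printf("%s ", n->label);
    print_prefix(n->left);
    print_prefix(n->right);
}

int main(void) {
    struct node *tree =
        mk(":=", mk("id1", NULL, NULL),
                 mk("+", mk("id2", NULL, NULL),
                         mk("*", mk("id3", NULL, NULL),
                                 mk("60", NULL, NULL))));
    print_prefix(tree);             /* prints: := id1 + id2 * id3 60 */
    printf("\n");
    return 0;
}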
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation. An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands. For example, a binary arithmetic
operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the
operator is applied to a floating-point number and an integer, the compiler may convert the
integer into a floating-point number. For the above syntax tree, applying type conversion (with all
the identifiers treated as real values) gives:
         :=
       /    \
    id1      +
           /   \
        id2     *
              /     \
           id3   inttoreal
                     |
                     60
Intermediate Code Generation
After syntax and semantic analysis, the compiler generates an explicit intermediate representation of
the source program: a program for an abstract machine that should be easy to produce and easy to
translate into the target program. A common form is three-address code, which consists of a sequence
of instructions, each of which has at most three operands (e.g. A = B + C, A = B, Sum = 10).
1. Each three-address assignment instruction has at most one operator on the right side.
2. The compiler must generate a temporary name to hold the value computed by a three-address
instruction.
3. Some three-address instructions may have fewer than three operands.
For the running example position := initial + rate * 60, the three-address code is:
temp1 = inttoreal(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
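One common way to store such three-address code is as quadruples: operator, two operands, and result. A minimal sketch with illustrative field names:

#include <stdio.h>

struct quad { const char *op, *arg1, *arg2, *result; };

int main(void) {
    struct quad code[] = {
        { "inttoreal", "60",    "",      "temp1" },
        { "*",         "id3",   "temp1", "temp2" },
        { "+",         "id2",   "temp2", "temp3" },
        { "=",         "temp3", "",      "id1"   }
    };
    for (unsigned i = 0; i < sizeof code / sizeof code[0]; i++)
        printf("%-9s %-6s %-6s %s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}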
Code Optimization
The code-optimization phase attempts to improve the intermediate code so that faster-running (or
shorter) target code results; there is a trade-off between compilation speed and execution speed.
Two classes of optimization techniques are:
- Local optimization, e.g. elimination of common subexpressions and copy propagation.
- Loop optimization, e.g. finding loop-invariant computations and moving them out of the loop so
they are not re-evaluated on every iteration.
The output of this phase is the optimized intermediate code.
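For the running example, such an optimizer might perform the int-to-real conversion of 60 once at compile time and eliminate the copy through temp3, reducing the four intermediate instructions to two:
temp1 = id3 * 60.0
id1 = id2 + temp1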
Code Generation
The code generator takes as input an intermediate representation of the source program and maps
it into the target language, which normally consists of relocatable machine code or assembly code.
If the target language is machine code, registers or memory locations are selected for each of the
variables used by the program; the intermediate instructions are then translated into sequences of
machine instructions that perform the same task. A crucial aspect is the assignment of variables to
registers. For the running example, the target code is:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Symbol-Table Management
An essential function of a compiler is to record the variable names used in the source
program and collect information about various attributes of each name.
These attributes may provide information about the storage allocated for a name, its type, its
scope (where in the program its value may be used), and in the case of procedure names,
such things as the number and types of its arguments, the method of passing each argument
(for example, by value or by reference), and the type returned.
The symbol table is a data structure containing a record for each variable name, with fields
for the attributes of the name. When an identifier in the source program is detected by the lexical
analyzer, the identifier is entered into the symbol table.
The data structure should be designed to allow the compiler to find the record for each name
quickly and to store or retrieve data from that record quickly.
Address   Symbol     Attribute    Memory Location
1         position   id1, real    1000
2         =          operator     1100
3         initial    id2, real
4         +          operator
5         rate       id3, real
6         *          operator
7         60         constant
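A minimal sketch of a symbol table as a linear list of records in C; a production compiler would normally use a hash table, and the field names and storage locations below are illustrative assumptions:

#include <stdio.h>
#include <string.h>

struct symbol {
    char name[32];
    char type[16];      /* e.g. "real", "int" */
    int  location;      /* storage address assigned to the name */
};

static struct symbol table[100];
static int count = 0;

/* return the index of name, inserting a new record if it is not yet present */
static int lookup_or_insert(const char *name, const char *type, int loc) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strcpy(table[count].name, name);
    strcpy(table[count].type, type);
    table[count].location = loc;
    return count++;
}

int main(void) {
    int id1 = lookup_or_insert("position", "real", 1000);
    int id2 = lookup_or_insert("initial",  "real", 1004);   /* locations are made up */
    int id3 = lookup_or_insert("rate",     "real", 1008);
    printf("id1=%d id2=%d id3=%d\n", id1, id2, id3);
    return 0;
}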
Each phase can encounter errors. An important feature of the compiler is to detect and report errors.
Lexical Analysis --- Characters may be misspelled
Syntax Analysis --- Structure of the statement violates the rules of the language
Semantic Analysis --- No meaning in the operation involved
Intermediate Code Generation --- Operands have incompatible data types
Code Optimizer --- Certain Statements may never be reached
Code Generation --- Constant is too long
Symbol Table --- Multiply-declared variables
The syntax and semantic analysis phases usually handle a large fraction of the errors
detectable by the compiler. The lexical phase can detect errors where the characters
remaining in the input do not form any token of the language. Errors when the token
stream violates the syntax of the language are determined by the syntax analysis phase.
During semantic analysis the compiler tries to detect constructs that have the right
syntactic structure but no meaning to the operation involved.
After detecting an error, a phase must be able to recover from the error so that
compilation can proceed and allow further errors to be detected.
A compiler which stops after detecting the first error is not useful. On detecting an error
the compiler must:
- report the error in a helpful way,
- correct the error if possible, and
- continue processing (if possible) after the error to look for further errors.
Grouping of Phases
Activities from more than one phase are often grouped together. The phases are collected into a front
end and a back end.
Front End:
The front end consists of those phases, or parts of phases, that depend primarily on the
source language and are largely independent of the target machine.
Lexical and syntactic analysis, the creation of the symbol table, semantic analysis, and the
generation of intermediate code are included.
A certain amount of code optimization can be done by the front end.
It also includes the error handling that goes along with each of these phases.
Back End:
The back end includes those portions of the compiler that depend on the target machine;
these portions do not depend on the source language.
It includes aspects of the code optimization phase and code generation, along with the necessary
error handling and symbol-table operations.
Reducing the number of passes: It is desirable to have relatively few passes, since it takes
time to read and write intermediate files.
On the other hand, reducing the number of passes means the entire information of a pass has to be
kept in memory, which increases the memory space needed to store it.
- Lexical analysis and syntax analysis are commonly grouped into one pass.
- Code generation cannot be done before intermediate code generation.
- Intermediate and target code generation can be merged by backpatching: the address of a branch
instruction is left blank and is filled in when the information becomes available (a small sketch
follows).
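A minimal sketch of backpatching in C: a jump whose target is not yet known is emitted with a placeholder, its index is remembered, and the target field is filled in once the destination is known. The instruction representation is invented for illustration.

#include <stdio.h>

struct instr { const char *op; int target; };

int main(void) {
    struct instr code[10];
    int n = 0;

    code[n++] = (struct instr){ "goto", -1 };   /* target not known yet */
    int hole = 0;                               /* remember which instruction to patch */

    code[n++] = (struct instr){ "nop", 0 };
    code[n++] = (struct instr){ "nop", 0 };

    code[hole].target = n;                      /* backpatch: jump past the nops */

    for (int i = 0; i < n; i++)
        printf("%d: %s %d\n", i, code[i].op, code[i].target);
    return 0;
}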
Compiler-Construction Tools
Compiler Construction tools are the tools that have been created for automatic design of
specific compiler components. Some commonly used compiler-construction tools include
1. Parser generator
2. Scanner generator
3. Syntax-directed translation engine
4. Automatic code generator
5. Data flow engine
Parser generators
- produce syntax analyzers from input that is based on a context-free grammar.
- Earlier, syntax analysis consumed a large fraction of the running time of a compiler and a
large fraction of the intellectual effort of writing a compiler.
- This phase is now considered one of the easiest to implement.
- Many parser generators utilize powerful parsing algorithms that are too complex to be
carried out by hand.
Scanner generators
- automatically generate lexical analyzers from a specification based on regular expressions.
- The basic organization of the resulting lexical analyzer is a finite automaton.
Syntax-directed translation engines
- produce collections of routines that walk a parse tree and generate intermediate code.
- The basic idea is that one or more "translations" are associated with each node of the
parse tree.
- Each translation is defined in terms of translations at its neighbor nodes in the tree.
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the input characters of the source
program, groups them into lexemes, and produces as output a sequence of tokens, one for each lexeme
in the source program. The lexical analyzer also removes any whitespace and comments in the source
code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works
closely with the syntax analyzer. It reads character streams from the source code, checks for
legal tokens, and passes the data to the syntax analyzer when it demands.
It produces as output a sequence of tokens that the parser uses for syntax analysis. Upon receiving
a "get next token" command from the parser, the lexical analyzer reads input characters until it can
identify the next token.
Other tasks of the lexical analyzer include:
1. stripping out comments and whitespace (blank, newline, and tab characters that are used to
separate tokens in the input);
2. correlating error messages generated by the compiler with the source program. For instance, the
lexical analyzer may keep track of the number of newline characters seen, so that it can associate a
line number with each error message (a small sketch follows).
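A minimal sketch of the line-counting idea in C, assuming the analyzer reads the source from standard input:

#include <stdio.h>

int main(void) {
    int newlines = 0, c;
    while ((c = getchar()) != EOF) {
        if (c == '\n')
            newlines++;             /* the current line number is newlines + 1 */
        /* ... characters would be grouped into lexemes here ... */
    }
    printf("line number at end of input: %d\n", newlines + 1);
    return 0;
}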
There are several reasons for separating lexical analysis from syntax analysis:
1. Simplicity – Techniques for lexical analysis are less complex than those required for syntax
analysis, so the lexical-analysis process can be simpler if it is separate. Also, removing the low-
level details of lexical analysis from the syntax analyzer makes the syntax analyzer both smaller
and cleaner.
2. Efficiency – Although it pays to optimize the lexical analyzer, because lexical analysis
requires a significant portion of total compilation time, it is not fruitful to optimize the syntax
analyzer. Separation facilitates this selective optimization.
3. Portability – Because the lexical analyzer reads input program files and often includes
buffering of that input, it is somewhat platform dependent. However, the syntax analyzer can be
platform independent. It is always a good practice to isolate machine dependent parts of any
software system.
Important terms
Token: a pair consisting of a token name and an optional attribute value; the token name is an
abstract symbol representing a kind of lexical unit, e.g. a keyword or an identifier.
Pattern: a description of the form that the lexemes of a token may take, e.g. a regular expression.
Lexeme: a sequence of characters in the source program that matches the pattern for a token.
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-
code error. For instance, in the C statement
fi ( a == f(x) ) ...
the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared
function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return
the token id to the parser and let some other phase of the compiler (probably the parser in this
case) handle an error due to transposition of the letters.
However, suppose a situation arises in which the lexical analyzer is unable to proceed because
none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery
strategy is "panic mode" recovery: we delete successive characters from the remaining input
until the lexical analyzer can find a well-formed token at the beginning of what input is left.
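A minimal sketch of panic-mode deletion in C; is_token_start() is an assumed helper, and the set of characters it accepts is purely illustrative:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static int is_token_start(int c) {
    /* assumption: lexemes begin with a letter or digit, whitespace is legal,
       and a few operator/punctuation characters may start a token */
    return isalnum(c) || isspace(c) || strchr("+-*/=(){};,", c) != NULL;
}

int main(void) {
    int c;
    while ((c = getchar()) != EOF) {
        if (!is_token_start(c)) {
            fprintf(stderr, "lexical error: discarding '%c'\n", c);
            continue;               /* panic mode: delete the offending character */
        }
        putchar(c);                 /* normal tokenization would continue from here */
    }
    return 0;
}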
Input Buffering
Input Buffering is done to optimize the working speed of the lexical analyser.
Eg. if (c==10)
The lexical analyzer scans the characters of the source program one at a time to discover tokens.
Often, however, many characters beyond the next token may have to be examined before the
next token itself can be determined. For this and other reasons, it is desirable for the lexical
analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of,
say, 100 characters each. One pointer marks the beginning of the token being discovered. A
lookahead pointer scans ahead of the beginning point until the token is discovered. We view the
position of each pointer as being between the character last read and the character next to be read.
In practice, each buffering scheme adopts one convention: a pointer is either at the symbol last read
or at the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual token may be large. For
example, in a PL/I program we may see
DECLARE (ARG1, ARG2, ..., ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character that
follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead
pointer travels beyond the buffer half in which it began, the other half must be loaded with the next
characters from the source file. Since the buffer shown in the figure above is of limited size, there is
an implied constraint on how much lookahead can be used before the next token is discovered.
In the above example, if the lookahead traveled to the left half and all the way through the left half
to the middle, we could not reload the right half, because we would lose characters that had not yet
been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering
scheme, we cannot ignore the fact that lookahead is limited.
For example, eof (end of file) can serve as the sentinel: if the scanning pointer encounters eof at the
end of one buffer half, the other half is refilled, and the two tests per character are reduced to one.
Sentinels
For each character read, we make two tests: one for the end of the buffer, and one to determine
what character is read. We can combine the buffer-end test with the test for the current character
if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character
that cannot be part of the source program, and a natural choice is the character eof. Note that eof
retains its use as a marker for the end of the entire input. Any eof that appears other than at the
end of a buffer means that the input is at an end.
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        forward := beginning of first half
    end
    else terminate lexical analysis   /* eof within a buffer marks the end of the input */
end
Specification of Tokens
Operations on Languages
Let L and M be languages.
Union: L ∪ M = { s | s is in L or s is in M }
Intersection: L ∩ M = { s | s is in L and s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Exponentiation: L0 = {ε} and Li = L Li-1
Kleene closure (zero or more concatenations): L* = ∪ Li for i ≥ 0
Positive closure (one or more concatenations): L+ = ∪ Li for i ≥ 1
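For example, if L = {a, b} and M = {0, 1}, then
L ∪ M = {a, b, 0, 1}
LM = {a0, a1, b0, b1}
L2 = LL = {aa, ab, ba, bb}
L* = {ε, a, b, aa, ab, ba, bb, aaa, ...}
L+ = L* − {ε}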
Terms for Parts of a String
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of
s. For example, ban, banana, and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s. For example, nana, banana, and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, banana,
nan, and ε are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and
substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive
positions of s. For example, baan is a subsequence of banana.
Regular Expressions
A regular expression over an alphabet Σ is built up by the following rules; each regular expression r
denotes a language L(r):
1. ε is a regular expression, and L(ε) = {ε}, the language containing only the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language
with one string, of length one, with a in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
   r | s is a regular expression denoting L(r) ∪ L(s)
   rs is a regular expression denoting L(r)L(s)
   r* is a regular expression denoting (L(r))*
   (r) is a regular expression denoting L(r)
If two regular expressions r and s denote the same regular set, we say they are equivalent and
write r = s. For instance, (a|b) = (b|a).
Axioms for RE
The operator | is
- commutative: r | s = s | r
- associative: r | (s | t) = (r | s) | t, so we may write r | s | t
The concatenation operator is
- associative: r(st) = (rs)t
- distributive over |: r(s | t) = rs | rt
ε is the identity for concatenation: εr = r = rε
(r*)* = r* and r* = ε | rr*
(r | s)* = (r*s*)* = (r*s)*r* = (r* | s*)*
rr* = r*r
(rs)*r = r(sr)*
Notational Shorthands
Character classes:
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a-z] denotes the regular expression a | b | c | ... | z.
- Identifiers can be described as the strings generated by the regular expression
[A-Za-z][A-Za-z0-9]*
Regular Set
- A language denoted by a regular expression is said to be a regular set.
Non-regular Set
- A language which cannot be described by any regular expression.
Eg. The set of all strings of balanced parentheses and repeating strings cannot be described by a
regular expression. This set can be specified by a context-free grammar.
Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into
stylized flowcharts, called "transition diagrams." Transition diagrams have a collection of nodes
or circles, called states. Each state represents a condition that could occur during the process of
scanning the input looking for a lexeme that matches one of several patterns. Edges are directed
from one state of the transition diagram to another. Each edge is labeled by a symbol or set of
symbols.
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been
found. We always indicate an accepting state by a double circle, and if there is an action to be
taken — typically returning a token and an attribute value to the parser — we shall attach that
action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does
not include the symbol that got us to the accepting state), then we shall additionally place a *
near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start,"
entering from nowhere.
4. The transition diagram always begins in the start state before any input symbols have been
read.
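A minimal sketch in C of simulating such a transition diagram for identifiers (a letter followed by letters or digits); the state numbering and helper name are illustrative, and state 2 plays the role of the accepting state marked with *, since the character that ended the identifier is not consumed:

#include <ctype.h>
#include <stdio.h>

/* returns the length of the identifier at the start of s, or 0 if there is none */
static int match_identifier(const char *s) {
    int state = 0, i = 0;
    for (;;) {
        int c = s[i];
        switch (state) {
        case 0:                                    /* start state */
            if (isalpha(c)) { state = 1; i++; }
            else return 0;                         /* no identifier here */
            break;
        case 1:                                    /* inside the identifier */
            if (isalnum(c)) i++;
            else state = 2;                        /* saw a character that ends it */
            break;
        case 2:                                    /* accepting state: retract one char */
            return i;                              /* the first i characters form the lexeme */
        }
    }
}

int main(void) {
    const char *input = "rate*60";
    int len = match_identifier(input);
    printf("matched lexeme of length %d: %.*s\n", len, len, input);   /* prints rate */
    return 0;
}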
A Language for Specifying Lexical Analyzers: Lex
Lex helps write programs whose control flow is directed by instances of regular expressions in
the input stream. It is well suited for editor-script type transformations and for segmenting input
in preparation for a parsing routine.
Lex source is a table of regular expressions and corresponding program fragments. The table is
translated to a program which reads an input stream, copying it to an output stream and
partitioning the input into strings which match the given expressions. As each such string is
recognized the corresponding program fragment is executed. The recognition of the expressions
is performed by a deterministic finite automaton generated by Lex. The program fragments
written by the user are executed in the order in which the corresponding regular expressions
occur in the input stream.
The lexical analysis programs written with Lex accept ambiguous specifications and choose the
longest match possible at each input point. If necessary, substantial look-ahead is performed on
the input, but the input stream will be backed up to the end of the current partition, so that the
user has general freedom to manipulate it.
Introduction.
Lex is a program generator designed for lexical processing of character input streams. It accepts
a high-level, problem oriented specification for character string matching, and produces a
program in a general purpose language which recognizes regular expressions. The regular
expressions are specified by the user in the source given to Lex. The Lex written code recognizes
these expressions in an input stream and partitions the input stream into strings matching the
expressions. At the boundaries between strings program sections provided by the user are
executed. The Lex source file associates the regular expressions and the program fragments. As
each expression appears in the input to the program written by Lex, the corresponding fragment
is executed.
Lex is not a complete language, but rather a generator representing a new language feature which
can be added to different programming languages, called ‘‘host languages.’’
Lex can write code in different host languages. The host language is used for the output code
generated by Lex and also for the program fragments added by the user. Compatible run-time
libraries for the different host languages are also provided. This makes Lex adaptable to different
environments and different users. Each application may be directed to the combination of
hardware and host language appropriate to the task, the user’s background, and properties of
local implementations.
Lex turns the user's expressions and actions (called source in this memo) into the host general-
purpose language; the generated program is named yylex. The yylex program will recognize
expressions in a stream (called input in this memo) and perform the specified actions for each
expression as it is detected.
Source → Lex → yylex
Input → yylex → Output
An overview of Lex
For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of
lines.
%%
[ \t]+$ ;
is all that is required. The program contains a %% delimiter to mark the beginning of the rules,
and one rule. This rule contains a regular expression which matches one or more instances of the
characters blank or tab (written \t for visibility, in accordance with the C language convention)
just prior to the end of a line. The brackets indicate a character class made of blank and tab; the +
indicates "one or more ..."; and the $ indicates "end of line," as in QED. No action is specified, so
the program generated by Lex (yylex) will ignore these characters. Everything else will be copied.
To change any remaining string of blanks or tabs to a single blank, add another rule:
%%
[ \t]+$ ;
[ \t]+ printf (" ");
The finite automaton generated for this source will scan for both rules at once, observing at the
termination of the string of blanks or tabs whether or not there is a newline character, and
executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of
lines, and the second rule all remaining strings of blanks or tabs. Lex can be used alone for
simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be
used with a parser generator to perform the lexical analysis phase; it is particularly easy to
interface Lex and Yacc .Lex programs recognize only regular expressions; Yacc writes parsers
that accept a large class of context free grammars, but require a lower level analyzer to recognize
input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a
preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser
generator assigns structure to the resulting pieces. The flow of control in such a case (which
might be the first half of a compiler, for example) is shown below. Additional programs, written
by other generators or by hand, can be added easily to programs written by Lex.
lexical rules    grammar rules
      ↓                ↓
     Lex              Yacc
      ↓                ↓
Input → yylex → yyparse → Parsed input
In the program written by Lex, the user's fragments (representing the actions to be performed as each
regular expression is found) are gathered as cases of a switch. The automaton interpreter directs
the control flow. Opportunity is provided for the user to insert either declarations or additional
statements in the routine containing the actions, or to add subroutines outside this action routine.
Lex is not limited to source which can be interpreted on the basis of one-character lookahead.
For example, if there are two rules, one looking for ab and another for abcdefg, and the input
stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. Such backup
is more costly than the processing of simpler languages.
Lex Source.
General format of Lex source is:
{definitions}
%%
{rules}
%%
{user subroutines}
where the definitions and the user subroutines are often omitted. The second %% is optional, but
the first is required to mark the beginning of the rules. The absolute minimum Lex program is
%% (no definitions, no rules), which translates into a program that copies the input to the
output unchanged. In the outline of Lex programs shown above, the rules represent the user's
control decisions; they are a table in which the left column contains regular expressions and the
right column contains actions to be executed when the expressions are recognized. Thus an
individual rule might appear as
integer printf("found keyword INT");
to look for the string integer in the input stream and print the message "found keyword INT"
whenever it appears. In this example the host procedural language is C and the C library function
printf is used to print the string. The end of the expression is indicated by the first blank or tab
character. If the action is merely a single C expression, it can just be given on the right side of the
line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly
more useful example, suppose it is desired to change a number of words from British to American
spelling.
Lex rules such as
colour printf ("color");
mechanise printf ("mechanize");
petrol printf("gas");
would be a start.
A regular expression specifies a set of strings to be matched. It contains text characters (which
match the corresponding characters in the strings being compared) and operator characters
(which specify repetitions, choices, and other features). The letters of the alphabet and the digits
are always text characters; thus the regular expression integer matches the string integer
wherever it appears, and the expression a57D looks for the string a57D.
Metacharacter Matches
. any character except newline
\n newline
* zero or more copies of preceding expression
+ one or more copies of preceding expression
? zero or one copy of preceding expression
^ beginning of line
$ end of line
a|b a or b
(ab)+ one or more copies of ab (grouping)
"a+b" literal “a+b” (C escapes still work)
[ ] character class
Expression Matches
abc abc
abc* ab, abc, abcc, abccc, …
abc+ abc, abcc, abccc, …
a(bc)+ abc, abcbc, abcbcbc, …
a(bc)? a, abc
[abc] a, b, c
[a-z] any letter, a through z
[a\-z] a, -, z
[-az] -, a, z
[A-Za-z0-9]+ one or more alphanumeric characters
[ \t\n]+ whitespace
[^ab] anything except: a, b
[a^b] a, ^, b
[a|b] a, |, b
a|b a or b
Deterministic Finite Automata
A deterministic finite automaton (DFA) is a special case of an NFA in which
1) no state has an ε-transition, and
2) for each state s and input symbol a, there is at most one edge labeled a leaving s.
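A minimal sketch in C of a DFA encoded as a transition table; the automaton chosen here, which accepts strings of a's and b's ending in abb, is a hypothetical example rather than one taken from the text. Note that every state has exactly one transition per input symbol and there are no ε-transitions.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* move[state][symbol], where symbol 0 stands for 'a' and symbol 1 for 'b' */
    int move[4][2] = {
        {1, 0},   /* state 0 */
        {1, 2},   /* state 1 */
        {1, 3},   /* state 2 */
        {1, 0}    /* state 3: accepting */
    };
    const char *input = "ababb";
    int state = 0;
    for (size_t i = 0; i < strlen(input); i++)
        state = move[state][input[i] == 'b'];
    printf("\"%s\" is %s\n", input, state == 3 ? "accepted" : "rejected");
    return 0;
}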
Construction of an NFA from a Regular Expression
For ε, construct an NFA with a new start state i and a new accepting state f and an ε-labeled edge
from i to f; this NFA recognizes {ε}.
For a symbol a in Σ, construct an NFA with a new start state i and a new accepting state f and an
edge labeled a from i to f; this NFA accepts {a}. If a occurs several times in r, then a separate NFA
is constructed for each occurrence.
Keeping the syntactic structure of the regular expression in mind, combine these NFAs
inductively until the NFA for the entire expression is obtained. Each intermediate NFA produced
during the course of the construction corresponds to a subexpression r and has several important
properties: it has exactly one final state, no edge enters the start state, and no edge leaves the final
state.
Suppose N(s) and N(t) are NFAs for regular expressions s and t.
(a) For regular expression s|t, construct the following composite NFA N(s|t) :
(b) For the regular expression st, construct the composite NFA N(st) :
(c) For the regular expression s* , construct the composite NFA N(s*) :
(d) For the parenthesized regular expression (s), use N(s) itself as the NFA