60% found this document useful (5 votes)

1K views

Compiler Design Chapter-2

The document discusses lexical analysis and how regular expressions are used to specify patterns for tokens in a programming language. It defines tokens, patterns, and lexemes. Regular expressions use operations like concatenation, union, and Kleene closure to represent sets of strings. Lexical analyzers use regular expressions to recognize patterns in the input and return tokens that are passed to the parser.

Uploaded by

Vuggam Venkatesh

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

60% found this document useful (5 votes)

1K views

Compiler Design Chapter-2

Uploaded by

Vuggam Venkatesh

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 105

Chapter – two

Lexical analysis

1
Outline
 Introduction
 Interaction of the Lexical Analyzer with the Parser
 Token, pattern, lexeme
 Specification of patterns using regular expressions
 Regular expressions
 Regular expressions for tokens

 NFA and DFA

 Conversion from RE to NFA to DFA…
 Lex Scanner Generator
 Creating a Lexical Analyzer with Lex
 Regular Expressions in Lex
 Lex specifications and examples

2
Introduction
 The role of the lexical analyzer is:
• to read a sequence of characters from the source
program
• group them into lexemes and
• produce as output a sequence of tokens for each
lexeme in the source program.
 The scanner can also perform the following
secondary tasks:
 stripping out blanks, tabs, new lines
 stripping out comments
 keep track of line numbers (for error reporting)

3
Interaction of the Lexical Analyzer
with the Parser

next char next token

lexical Syntax
analyzer analyzer
get next
char get next
token

Source
Program
symbol
table

(Contains a record
for each identifier)

token: smallest meaningful sequence of characters

of interest in source program
4
Token, pattern, lexeme
 A token is a sequence of characters from the source
program having a collective meaning.
 A token is a classification of lexical units.
- For example: id and num
 Lexemes are the specific character strings that make up
a token.
– For example: abc and 123A
 Patterns are rules describing the set of lexemes
belonging to a token.
– For example: “letter followed by letters and digits”
 Patterns are usually specified using regular expressions.
[a-zA-Z]*
Example: printf("Total = %d\n", score);

5
Token, pattern, lexeme…
 Example: The following table shows some tokens and
their lexemes in Pascal (a high level, case insensitive
programming language)
Token Some lexemes pattern
begin Begin, Begin, BEGIN, Begin in small or capital
beGin… letters
if If, IF, iF, If If in small or capital letters
ident Distance, F1, x, Dist1,… Letters followed by zero or
more letters and/or digits

• In general, in programming languages, the following are

tokens:
keywords, operators, identifiers, constants, literals,
punctuation symbols…
6
Attributes of tokens
 When more than one pattern matches a lexeme, the
scanner must provide additional information about the
particular lexeme to the subsequent phases of the
compiler.
 For example, both 0 and 1 match the pattern for the
token num.
 But the code generator needs to know which number is
recognized.
 The lexical analyzer collects information about tokens
into their associated attributes.
• Tokens influence parsing decisions;
• Attributes influence the translation of tokens after
parse
7
Attributes of tokens…
 Practically, a token has one attribute:
 a pointer to the symbol table entry in which the
information about the token is kept.
 The symbol table entry contains various
information about the token
 such as its lexeme, type, the line number in which
it was first seen …

Ex. y = 31 + 28 * x, The tokens and their

attributes are written as:

8
Attributes of tokens…

9
9
Errors
 Very few errors are detected by the lexical
analyzer.
 For example, if the programmer mistakes
ebgin for begin, the lexical analyzer cannot
detect the error since it will consider ebgin as
an identifier.
 Nonetheless, if a certain sequence of
characters follows none of the specified
patterns, the lexical analyzer can detect the
error.

10
Errors…
 When an error occurs, the lexical analyzer
recovers by:
 skipping (deleting) successive characters from the
remaining input until the lexical analyzer can find a
well-formed token (panic mode recover)
 deleting one character from the remaining input
 inserting missing characters in to the remaining input
 replacing an incorrect character by a correct
character
 transposing two adjacent characters

11
Specification of patterns using
regular expressions

 Regular expressions
 Regular expressions for tokens

12
A regular expression is a string r that denotes a language L(r)
over some alphabet ∑.

 The six kinds of regular expressions and the languages they denote are
as follows. First, there are three kinds of atomic regular expressions:
 1. Any symbol a ∑ is a regular expression with L(a) = {a}.
 2. The special symbol H is a regular expression with L(ε) = {ε}.
 3. The special symbol ø is a regular expression with L(ø ) = {}.
 There are also three kinds of compound regular expressions, which are
built from smaller regular expressions, here called r, r1 , and r2 :
 4. (r1 + r2 ) is a regular expression with L(r1 + r2 ) = L(r1 ) U L(r2 )
 5. (r1 r2 ) is a regular expression with L(r1 r2 ) = L(r1 )L(r2 )
 6. (r)* is a regular expression with L((r)*) = (L(r))*
 The parentheses in compound regular expressions may be omitted, in
which case * has highest precedence and + has lowest precedence.

3-13
Regular expression: Definitions

 Represents patterns of strings of characters.

 An alphabet Σ is a finite set of symbols
(characters)
 A string s is a finite sequence of symbols
from Σ
 |s| denotes the length of string s
 ε denotes the empty string, thus |ε| = 0
 A language L is a specific set of strings over
some fixed alphabet Σ

14
Regular expressions…
 A regular expression is one of the following:
Symbol: a basic regular expression consisting of a single
character a, where a is from:
 an alphabet Σ of legal characters;
 the metacharacter ε: or
 the metacharacter ø.
 In the first case, L(a)={a};
 in the second case, L(ε)= { ε};
 in the third case, L(ø)= { }.
 {} – contains no string at all.
 {ε} – contains the single string consists of no character

15
Regular expressions…
 Alternation: an expression of the form r|s, where r
and s are regular expressions.
 In this case , L(r|s) = L(r) U L(s) ={r,s}

 Concatenation: An expression of the form rs, where r

and s are regular expressions.
 In this case, L(rs) = L(r)L(s)={rs}

 Repetition: An expression of the form r*, where r is a

regular expression.
 In this case, L(r*) = L(r)* ={ε, r,…}

16
Regular expression: Language Operations
 Union of L and M
L ∪ M = {s |s ∈ L or s ∈ M}
 Concatenation of L and M
 LM = {xy | x ∈ L and y ∈ M}
 Exponentiation of L
 L0 = {ε}; Li = Li-1L The following shorthands
 Kleene closure of L are often used:
 L* = ∪i=0,…,∞ Li r+ =rr*
 Positive closure of L r* = r+| ε
r? =r|ε
 L+ = ∪i=1,…,∞ Li
17
Examples

L1={a,b,c,d} L2={1,2}
L1 ∪ L2={a,b,c,d,1,2}
L1L2={a1,a2,b1,b2,c1,c2,d1,d2}
L1*=all strings of letter a,b,c,d and empty string.
L1+= the set of all strings of one or more letter a,b,c,d,
empty string not included

18
Regular expressions…
 Examples (more):
1- a | b = {a,b}
2- (a|b)a = {aa,ba}
3- (ab) | ε ={ab, ε}
4- ((a|b)a)* = {ε, aa,ba,aaaa,baba,....}
 Reverse
1 – Even binary numbers (0|1)*0
2 – An alphabet consisting of just three alphabetic
characters: Σ = {a, b, c}. Consider the set of all strings
over this alphabet that contains exactly one b.
(a | c)*b(a|c)* {b, abc, abaca, baaaac, ccbaca, cccccb}

19
Regular expressions for tokens

 Regular expressions are used to specify the

patterns of tokens.
 Each pattern matches a set of strings. It falls into
different categories:
 Reserved (Key) words: They are represented by
their fixed sequence of characters,
 Ex. if, while and do....
 If we want to collect all the reserved words into
one definition, we could write it as follows:
Reserved = if | while | do |...

20
Regular expressions for tokens…
 Special symbols: including arithmetic operators,
assignment and equality such as =, :=, +, -, *
 Identifiers: which are defined to be a sequence of
letters and digits beginning with letter,
 we can express this in terms of regular definitions as
follows:
letter = A|B|…|Z|a|b|…|z
digit = 0|1|…|9
or
letter= [a-zA-Z]
digit = [0-9]
identifiers = letter(letter|digit)*
21
Regular expressions for tokens…
 Numbers: Numbers can be:
 sequence of digits (natural numbers), or
 decimal numbers, or
 numbers with exponent (indicated by an e or E).
 Example: 2.71E-2 represents the number 0.0271.
 We can write regular definitions for these numbers as
follows:
nat = [0-9]+
signedNat = (+|-)? Nat
number = signedNat(“.” nat)?(E signedNat)?
 Literals or constants: which can include:
 numeric constants such as 42, and
 string literals such as “ hello, world”.
22
Regular expressions for tokens…

 reop < | <= | = | <> | > | >=

 Comments: Ex. /* this is a C comment*/
 Delimiter newline | blank | tab | comment
 White space = (delimiter )+

23
Recognition of tokens
a grammar for branching Patterns for tokens using
statements and conditional regular expressions
expressions
digit  [0-9]
nat  digit+
stmt  if expr then stmt signednat  (+|-)?nat
| if expr then stmt else stmt numbersignednat(“.”nat)?(E signednat)?
|ε letter  [A-Z a-z]
expr  term relop term | term idletter(letter|digit)*
term  id | number ifif
then then
elseelse
relop  <|>|<=|>=|=|<>
 For this language, the lexical analyzer will recognize:
the keywords if, then, else
Lexemes that match the patterns for relop, id, number
ws  (blank | tab | newline)+
24
3-25
3-26
3-27
3-28
3-29
Recognition of tokens…
Tokens, their patterns, and attribute values

30
Transition diagram that recognizes the lexemes
matching the token relop and id.

3-31
Coding…
token nexttoken() case 9: c = nextchar();
{ while (1) { if (isletter(c)) state = 10;
switch (state) { else state = fail();
case 0: c = nextchar(); break;
if (c==blank || c==tab || c==newline) { case 10: c = nextchar();
state = 0; if (isletter(c)) state = 10;
lexeme beginning++; else if (isdigit(c)) state = 10;
}
else state = 11;
else if (c==‘<’) state = 1;
break;
else if (c==‘=’) state = 5;
else if (c==‘>’) state = 6;
else state = fail();
break;
case 1: c = nextchar();
…

32
Design of a Lexical Analyzer/Scanner
Finite Automata
 Lex – turns its input program into lexical analyzer.
 At the heart of the transition is the formalism known as finite
automata.
 Finite automata is graphs, like transition diagrams, with a few
differences:
1. Finite automata are recognizers; they simply say "yes" or "no"
about each possible input string.
2. Finite automata come in two flavors:
a) Nondeterministic finite automata (NFA) have no restrictions on
the labels of their edges.
ε, the empty string, is a possible label.
b) Deterministic finite automata (DFA) have, for each state, and
for each symbol of its input alphabet exactly one edge with that
symbol leaving that state.
33
The Whole Scanner Generator Process
Overview
 Direct construction of Nondeterministic finite
Automation (NFA) to recognize a given regular
expression.
 Easy to build in an algorithmic way
 Requires ε-transitions to combine regular sub expressions
 Construct a deterministic finite automation
(DFA) to simulate the NFA Optional
 Use a set-of-state construction
 Minimize the number of states in the DFA
 Generate the scanner code.

34
Design of a Lexical Analyzer …
 Token  Pattern
 Pattern  Regular Expression
 Regular Expression  NFA
 NFA  DFA
 DFA’s or NFA’s for all tokens  Lexical Analyzer

35
Non-Deterministic Finite Automata
(NFA)
Definition
 An NFA M consists of five tuples: ( Σ,S, T, s0, F)
 A set of input symbols Σ, the input alphabet
 a finite set of states S,
 a transition function T: S × (Σ U { ε}) -> S (next state),
 a start state s0 from S, and
 a set of accepting/final states F from S.
 The language accepted by M, written L(M), is defined as:
The set of strings of characters c 1c2...cn with each ci from
Σ U { ε} such that there exist states s 1 in T(s0,c1), s2 in
T(s1,c2), ... , sn in T(sn-1,cn) with sn an element of F.
36
NFA…
 It is a finite automata which has choice of
edges
• The same symbol can label edges from one state to
several different states.
 An edge may be labeled by ε, the empty
string
• We can have transitions without any input
character consumption.

37
Transition Graph
 The transition graph for an NFA recognizing the
language of regular expression (a|b)*abb
all strings of a's and b's ending in the
particular string abb
a

start a b b
0 1 2 3

b S={0,1,2,3}
Σ={a,b}
S0=0
F={3}
38
Transition Table
 The mapping T of an NFA can be represented
in a transition table
State Input Input Input
a b ε
0 {0,1} {0} ø

T(0,a) = {0,1} 1 ø {2} ø

T(0,b) = {0}
T(1,b) = {2} 2 ø {3} ø
T(2,b) = {3}
3 ø ø ø

The language defined by an NFA is the set of input

strings it accepts, such as (a|b)*abb for the example
NFA
39
Acceptance of input strings by NFA
 An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting states
 The string aabb is accepted by the NFA:

a a b b
0 0 1 2 3 YES

a a b b
0 0 0 0 0 NO
Exercise:
 babb is accepted by (a|b)*abb ?
 bbabb is NOT? 40
Another NFA Exercise:
 aaa is accepted by aa*|bb* ?
 bbb isa NOT?
a


start
b
b


An -transition is taken without consuming any character from

the input.
What does the NFA above accepts?

aa*|bb*
41
Deterministic Finite Automata (DFA)

 A deterministic finite automaton is a special

case of an NFA
 No state has an ε-transition
 For each state S and input symbol a there is at
most one edge labeled a leaving S
 Each entry in the transition table is a single state
 At most one path exists to accept a string
 Simulation algorithm is simple

42
DFA example
A DFA that accepts (a|b)*abb

43
Simulating a DFA: Algorithm
How to apply a DFA to a string.
INPUT:
 An input string x terminated by an end-of-file character eof.
 A DFA D with start state So, accepting states F, and
transition function move.
OUTPUT: Answer ''yes" if D accepts x; "no" otherwise
METHOD
 Apply the algorithm in (next slide) to the input string x.
 The function move(s, c) gives the state to which there is an
edge from state s on input c.
 The function nextChar() returns the next character of the
input string x.

44
Simulating a DFA
Exercise:
s = so;  bbababb is accepted by (a|b)*abb ?
c = nextchar();  bbabab is NOT?
while ( c != eof ) {
s = move(s, c);
c = nextchar();
}
if ( s is in F )
return "yes"; DFA accepting (a|b)*abb
else return "no";
Given the input string ababb, this DFA enters the
sequence of states 0,1,2,1,2,3 and returns "yes"
45
DFA: Exercise

 Draw DFAs for the string matched by the

following definition:
digit =[0-9]
nat=digit+
signednat=(+|-)?nat
number=signednat(“.”nat)?(E signedNat)?

46
Design of a Lexical Analyzer Generator

Regular Expression DFA

Two algorithms:
1- Translate a regular expression into an NFA
(Thompson’s construction)

2- Translate NFA into DFA

(Subset construction)
47
From regular expression to an NFA
 It is known as Thompson’s construction.

Rules:
1- For an ε, a regular expressions, construct:

start a

48
From regular expression to an NFA…
2- For a composition of regular expression:
 Case 1: Alternation: regular expression(s|r), assume
that NFAs equivalent to r and s have been
constructed.

49
49
From regular expression to an NFA…
 Case 2: Concatenation: regular expression sr

ε
…r …s

Case 3: Repetition r*

50
Rules

 3. Rules for complex regular expressions

1  2 1  1‘  2

| 1
1 2  2


*
1 2 1  1‘  2

51
3-51
e.g. Let us construct N( r) for the regular
expression r=(a|b)*(aa|bb)(a|b)*
(a|b)*(aa|bb)(a|b)*
x y

(a|b)* (aa|bb) (a|b)*

x 1 2 y
a|b a|b
 aa
x 5  1 2  6  y
bb
a a 3 a a
 5  1 2  6 
x b b y
b 4 b 52
3-52
a a 3 a a
e.g
 5  1 2  6 
x b b y
b 4 b

I a b
I0={x,5,1} I1={5,3,1} I2={5,4,1}
I1={5,3,1} I3={5,3,2,1,6,y} I2={5,4,1}
I2={5,4,1} I1={5,3,1} I4={5,4,1,2,6,y}
I3={5,3,2,1,6,y} I3={5,3,2,1,6,y} I5={5,1,4,6,y}
I4={5,4,1,2,6,y} I6={5,3,1,6,y} I4={5,4,1,2,6,y}
I5={5,1,4,6,y} I6={5,3,1,6,y} I4={5,4,1,2,6,y}
I6={5,3,1,6,y} I3={5,3,2,1,6,y} I5={5,1,4,6,y}
3-
I a b
I0 I1 I2
I1 I3 I2 DFA is
a
I2 I1 I4 a b
I1 I3 I5
I3 I3 I5 a a
b a a b
I4 I6 I4 I0
b b
I5 I6 I4 b a
I2 I4 I6
I6 I3 I5 b

54
3-54
So, the minimized DFA is :

1 a
a a
0 b a 3
b b
b
2

55
3-55
From RE to NFA:Exercises

 Construct NFA for token identifier.

letter(letter|digit)*
 Construct NFA for the following regular
expression:
I. (a|b)*abb
II. (ab/ba)a*

56
From an NFA to a DFA
(subset construction algorithm)

 Input NFA N Both accept the same

Output DFA D language usage (RE)

Rules:
 Start state of D is assumed to be unmarked.
 Start state of D is = ε-closer (S0),
where S0 -start state of N.

57
NFA to a DFA…
ε- closure
ε-closure (S’) – is a set of states with the following
characteristics:
1- S’ € ε-closure(S’) itself
2- if t € ε-closure (S’) and if there is an edge labeled
ε from t to v, then v € ε-closure (S’)
3- Repeat step 2 until no more states can be added
to ε-closure (S’).
E.g: for NFA of (a|b)*abb
ε-closure (0)= {0, 1, 2, 4, 7}
ε-closure (1)= {1, 2, 4}

58
NFA to a DFA…
Algorithm
While there is unmarked state
X = { s0, s1, s2,..., sn} of D do
Begin
Mark X
For each input symbol ‘a’ do
Begin
Let T be the set of states to which there is a transition ‘a’ from state s i in X.
Y= ε-Closer (T)
If Y has not been added to the set of states of D then {
Mark Y an “Unmarked” state of D add a transition from X to Y labeled a if
not already presented
}
End
End

59
NFA for identifier: letter(letter|digit)*
ε

letter
3 4
ε ε
start
letter ε ε
0 1 2 7 8
digit ε
ε 5 6

60
NFA to a DFA…
Example: Convert the following NFA into the corresponding
DFA. letter (letter|digit)*
A={0}
B={1,2,3,5,8}
start letter C={4,7,2,3,5,8}
A B
D={6,7,8,2,3,5}

letter digit
letter
digit D digit
C

letter

61
Exercise: convert NFA of (a|b)*abb in to DFA.

62
Other Algorithms

 How to minimize a DFA ? (see Dragon Book

3.9, pp.173)
 How to convert RE to DFA directly ? (see
Dragon Book 3.9.5 pp.179)

63
The Lexical- Analyzer Generator: Lex
 The first phase in a compiler is, it reads the
input source and converts strings in the source
to tokens.
 Lex: generates a scanner (lexical analyzer or
lexer) given a specification of the tokens using
REs.
 The input notation for the Lex tool is referred to as
the Lex language and
 The tool itself is the Lex compiler.
 The Lex compiler transforms the input patterns into a
transition diagram and generates code, in a file called
lex.yy.c, that simulates this transition diagram.

64
Lex…

 By using regular expressions, we can specify

patterns to lex that allow it to scan and match
strings in the input.
 Each pattern in lex has an associated action.
 Typically an action returns a token, representing the
matched string, for subsequent use by the parser.
 It uses patterns that match strings in the input and
converts the strings to tokens.

65
General Compiler Infra-structure
Parse tree
Program source Tokens Parser
Scanner Semantic
(tokenizer) Routines
(stream of
characters) Annotated/decorated
x c
Le
Y ac tree

Analysis/
Transformations/
Symbol and optimizations
literal Tables
IR: Intermediate
Representation

Code
Generator

Assembly code

66
Scanner, Parser, Lex and Yacc

6767
Generating a Lexical Analyzer using Lex
Lex is a scanner generator ----- it takes lexical specification as
input, and produces a lexical analyzer written in C.

Lex source
program Lex compiler lex.yy.c
lex.l

lex.yy.c
C compiler a.out

Sequence of
Input stream
a.out tokens

Lexical Analyzer
68
Lex specification
 Program structure C declarations in %
...declaration section... { %}

%%
P1 { action1 }
...rule section... P2 { action2 }
%%
...user defined functions...
 Rules section – regular expression <--> action.
• The actions are C program.
 Declaration section – variables, constants

69
Skeleton of a lex specification (.l file)
x.l *.c is generated after
running

%{
< C global variables, prototypes, This part is copied as–is to
comments > the top of the generated
C file
%}

Substitutions simplifies
[DEFINITION SECTION] pattern matching

%% Define how to scan and

what action to take for
[RULES SECTION] each token
%% Any user code. For
example, a main function
< C auxiliary subroutines>
to call the scanning
function yylex(). 70
The rules section
%%
[RULES SECTION]

<pattern> { <action to take when matched> }

<pattern> { <action to take when matched> }
…
%%

Patterns are specified by regular expressions.

For example:
%%
[A-Za-z] +
{ printf(“this is a word”); }
%%
71
Design of a Lexical Analyzer Generator:
RE to NFA to DFA

NFA sim. alg

Thompson’s
construction

DFA sim. alg

72
Simulating an NFA
 INPUT: An input string x terminated by an end-of-file
character eof. An NFA N with start state s0, accepting
states F, and transition function move.
 OUTPUT: Answer "yes " if M accepts x; "no" otherwise.
Algorithm
S = ε-closure(s0);
c = nextchar();
while ( c != eof ) {
S = ε- closure (move(S, c)) ;
c = nextchar();
}
if ( Sn F != Ф ) return “yes”;
else return "no";
73
Combining and simulation of NFAs of a Set of
Regular Expressions: Example 1
start a
a {action1} 1 2
start b
abb {action2} a b
3 4 5 6
a*b+ {action3}
start a
Must find the longest b
prefix match: 7 b 8
Continue until no further
moves are possible a Action 1
ε 1 2
start b
a a b a*b+ b
0 ε 3 a 4 5 6
0 2 7 8
1 4 a ε Action 2
b
3 7 7 8 b
7 None a
Action 3 Action 3
74
Simulating NFA…

ε-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
ε-closure({7}) = {7}
move({7},b) = {8}
ε-closure({8}) = {8}
move({8},a) = ∅

75
Combining and simulation of NFAs of a Set of
Regular Expressions: Example 2
start a
a {action1} 1 2
start b
abb {action2} a b
3 4 5 6
a*b+ {action3}
start a
When two or more b
accepting states are 7 b 8
reached, the action is
executed a Action 1
ε 1 2
start b
a b b b
0 ε 3 a 4 5 6
0 2 5 6
1 4 8 8 ε Action 2
a b
3 7 7 8 b
7 None a Action 3
Action 2
Action 3 76
DFA's for Lexical Analyzers
NFA DFA. Transition table for DFA

State a b Token
found
0137 247 8 None
247 7 58 a
8 - 8 a*b+
7 7 8 None
58 - 68 a*b+
68 - 8 abb

Example: simulate the above DFA for input abba

77
Lex Regular Expression Basics
. : matches everything except \n
* : matches 0 or more instances of the preceding regular expression
+ : matches 1 or more instances of the preceding regular expression
? : matches 0 or 1 of the preceding regular expression
| : matches the preceding or following regular expression
[xyz ] : match one character x,y,or z
[^xyz] : match any character except x,y, and z
() : groups enclosed regular expression into a new regular expression
“…” : matches everything within the “ “ literally
x :x, but only at beginning of line
x$ :x, but only at end of line
{d} : match the regular expression defined by d.

78
Pattern matching examples

79
Meta-characters

 meta-characters (do not match themselves, because

they are used as a special symbols in reg exps):
()[]{}<>+/,^*|.\"$?-%

 to match a meta-character, prefix with "\"

 to match a backslash, tab or newline, use \\, \t, or \n

80
Lex Regular Expression: Examples

• an integer: 12345

[1-9][0-9]*
• a word: cat
[a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
[-+]?[1-9][0-9]*
• a floating point number: 1.2345
[0-9]*”.”[0-9]+

1. lex will always match the longest (number of

characters) token possible.

2. If two or more possible tokens are of the same

length, then the token with the regular expression
that is defined first in the lex specification is
favored.

83
Lex variables
yyin - of the type FILE*. This points to the current file
being scanned by the lexer.
yyout - Of the type FILE*. This points to the location
where the output of the lexer will be written.
• By default, both yyin and yyout point to standard input
and output.
yytext – variable, a pointer to the matched strings (char
*)
yyleng - Gives the length of the matched pattern.
yylineno - Provides current line number information.

84
Lex functions

yylex() - The function that starts the analysis. It is

automatically generated by Lex.
yywrap() - This function is called when end of file (or
input) is encountered. If this function returns 1, the
parsing stops.
yymore() - This function tells the lexer to append the next
token to the current token.
input() – read next character from yyin. This is the function
invoked by yylex() to read the input.
output() – write yytext to yyout. This is the function
invoked by yylex() to write the output.

85
Lex predefined variables

86
Let us run a lex program

87
Lex : programs
 The first example is the shortest possible lex file:
%%
 Input is copied to output, one character at a time.
 The first %% is always required, as there must
always be a rules section.
 However, if we don’t specify any rules, then the
default action is to match everything and copy it to
output.
 Defaults for input and output are stdin and stdout,
respectively.
 Here is the same example, with defaults explicitly
coded:
88
Rule %%
section /* match everything except newline */
. ECHO;
/* match newline */
\n ECHO;
%%
int yywrap(void) { Invokes the
return 1; Lexical
analyzer
}
int main(void) {
User yylex();
definition return 0;
section
}

89
Developing Lexical analyzer using
Lex : Linux (Fedora)
 vi – used to edit lex and yacc source files.
 w – save
 q – quit
 w filename – save as
 wq – save and quit
 q! – exit overriding change

 To start , go to application  System tools 

terminal
90
Example 1:Compiling and running C
program
1- vi hello.c
2- i insert
3- #include<stdio.h>
int main() {
printf(“Hello World ”);
return 0;
}
4- escape
5- : wq
6- gcc hello.c
7- ./a.out
91
How to compile and run LEX programs
test.l  lex.yy.c gcc test (scanner)
1. lex test.l
2. gcc lex.yy.c -ll
3. ./a.out<hello.c
 Implementation example 1
1. vi lab1.l
2. i  insert mode
3. %%
. ECHO;
\n ECHO;
%%

92
How to compile and run LEX programs...

4. Press esc
5. Press :wq
6. lex lab1.l
7. gcc lex.yy.c -ll
8. ./a.out <hello.c

. Every character except new line

\n new line character
ECHO  display on screen

93
Examples (more) Regular
definitions
%% %{
/*Match every thing #include <stdio.h>
except new line*/ %}
digit [0-9]
. ECHO;
letter [A-Za-z]
/*Match new line*/ id {letter}({letter}|{digit})*
\n ECHO; %%
%% {digit}+ { printf(“number: %s\n”, yytext); }
int yywrap(void) { {id} { printf(“ident: %s\n”, yytext); }
. { printf(“other: %s\n”, yytext); }
return 1; %%
} main()
int main(void) { { yylex();
yylex(); }
Translation
retrun 0; rules
}

94
Example :Finding the number of identifier in a given
program
digit [0-9]
letter [A-Za-z]
%{
int count;
%}
%%
{letter}({letter}/{digit})* count++;
%%
int main(void) {
yylex();
printf(“The number of identifiers are=%4d\n”,count);
return 0; }

95
Example: Here is a scanner that counts the number of
characters, words, and lines in a file.
%{
int nchar, nword, nline;
%}
%%
\n { nline++;}
[^ \t\n]+ { nword++, nchar += yyleng; }
. { nchar++; }
%%
int main(void) {
yylex();
printf("%d\t%d\t%d\n", nchar, nword, nline);
return 0;
}

96
%{ /* definitions of manifest constants */
LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER,
Regular definitions
RELOP */
%}
delim [ \t\n]
ws {delim}+ Return token to parser
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws} {/*no action and no return*/ }
if {return IF;} Token attribute
then {return THEN;}
else {return ELSE;}
{id} {yylval = install_id(); return ID;}
{number} {yylval = install_num(); return NUMBER;}
“<“ {yylval = LT; return RELOP;}
“<=“ {yylval = LE; return RELOP;}
“=“ {yylval = EQ; return RELOP;} Install yytext as identifier
“<>“ {yylval = NE; return RELOP;} in symbol table
“>“ {yylval = GT; return RELOP;}
“>=“ {yylval = GE; return RELOP;}
%%
int install_id() {}
int install_num() {}
97
Assignment on Lexical Analyzer

98
1. Write a program in LEX to count the no of
consonants and vowels for a given C and C++ source
programs.
2. Write a program in LEX to count the no of:
(i) positive and negative integers
(ii) positive and negative fractions.
For C and C++ source programs
3. Write a LEX program to recognize a valid C and C++
programs.

99
The MINI Language Introduction
 Assumptions:
 Source code – MINI language
 Target code – Assembly language

 Specifications:
 There are no procedures and declarations.
 All variables are integer variables, and variables are
declared simply by assigning values to them.
 There are only two control statements:
 An if – statement and
 A repeat statement
 Both the control statements may themselves
contain statement sequences.
100
The MINI Language Introduction...
 An if – statement has an optional else part and must
be terminated by the key word end.
 There are also read and write statements that
perform input/output.
 Comments are allowed with curly brackets,
comments cannot be nested.
 Expression in MINI are also limited to Boolean and
integer arithmetic expressions.
 A Boolean expressions consists of a comparison of
two arithmetic expressions using either of the two
comparison operators < and =.

101
The TINY Language...
 An arithmetic expression may involve integer constants,
variables, parenthesis, and any of the four integer
operators +, -, *, and / (integer division).
 Boolean expressions may appear only as tests in
control statements – i.e. There are no Boolean
variables, assignment, or I/O.
 Here is a sample program in this language for factorial
function.

102
{ sample program
in MINI language – computes factorials
}
read x; { input an integer }
if x > 0 then { don’t compute if x<= 0}
fact:= 1;
repeat
fact := fact * x ;
X:= x-1
until x = 0;
write fact { output factorial of x}
end
103
The MINI Language...
 In addition to the tokens, MINI has the following
lexical conventions:
 Comments : are enclosed in curly brackets {...} and
cannot be nested.
 White space : consists of blanks, tabs, and
newlines.
 The principles of longest substring is followed in
recognizing tokens.

104
Design a scanner for MINI language

 In designing a scanner for this language:

1. Start with regular expressions
2. Identify Tokens...
3. Develop and simulate NFA
4. Construct and simulate DFA
5. Write a lex program, to recognize tokens in
MINI language:
• Input: Tiny language
• Output: Tokens..,

Submission10/07/20 date
105

Hfe Onkyo TX-sr607 Service Manual
60% (5)
Hfe Onkyo TX-sr607 Service Manual
128 pages
Exit Exam
100% (1)
Exit Exam
258 pages
Adama Science and Technology University: Department of Electrical Engineering Stream of Computer Engineering
100% (3)
Adama Science and Technology University: Department of Electrical Engineering Stream of Computer Engineering
29 pages
5 - VoIP Debug Commands
No ratings yet
5 - VoIP Debug Commands
14 pages
ECE 385 Digital Systems Laboratory: Syllabus
No ratings yet
ECE 385 Digital Systems Laboratory: Syllabus
2 pages
Addis Ababa University: Computer Organization and Architecture
100% (1)
Addis Ababa University: Computer Organization and Architecture
15 pages
Compiler Design Chapter-3
0% (1)
Compiler Design Chapter-3
177 pages
Mr. Dereje B (MSC) : Automata and Complexity Theory
100% (1)
Mr. Dereje B (MSC) : Automata and Complexity Theory
35 pages
Fundamentals of Programming II Course Content
100% (1)
Fundamentals of Programming II Course Content
41 pages
Ambo University
No ratings yet
Ambo University
12 pages
ASTU SIS Full Document
100% (4)
ASTU SIS Full Document
66 pages
Microprocessors Mid Exam
75% (4)
Microprocessors Mid Exam
3 pages
Exit Exam MODEL ANSWER
No ratings yet
Exit Exam MODEL ANSWER
29 pages
Admas University Department of Computer Science
100% (1)
Admas University Department of Computer Science
20 pages
Class and Exam Scheduilng Full Documentation
100% (3)
Class and Exam Scheduilng Full Documentation
18 pages
Clearance Management System
No ratings yet
Clearance Management System
141 pages
INDUSTRIAL PRACTICE EXPERIENCE REPORT - Ethio Telecom
100% (4)
INDUSTRIAL PRACTICE EXPERIENCE REPORT - Ethio Telecom
36 pages
Course Outline: Addis Ababa University Department of Computer Science
No ratings yet
Course Outline: Addis Ababa University Department of Computer Science
1 page
Exit Exam Questions Part 1
100% (1)
Exit Exam Questions Part 1
4 pages
Software Engineering Notes
100% (1)
Software Engineering Notes
9 pages
Compiler Design Chapter 2
No ratings yet
Compiler Design Chapter 2
14 pages
Hawassa University Institute of Technology Faculty of Electrical and Biomedical Engineering Department of Electrical and Computer Engineering
100% (3)
Hawassa University Institute of Technology Faculty of Electrical and Biomedical Engineering Department of Electrical and Computer Engineering
31 pages
Communication, Acting and Perceiving: Artificial Intelligence
No ratings yet
Communication, Acting and Perceiving: Artificial Intelligence
14 pages
Final Project 1
100% (1)
Final Project 1
68 pages
Degree Exit Exam Sample Questions
100% (1)
Degree Exit Exam Sample Questions
4 pages
Web Programming Exit Exam
No ratings yet
Web Programming Exit Exam
154 pages
Final Exam For Finite Automata
50% (2)
Final Exam For Finite Automata
4 pages
ETHIOPIAN HIGHER UNIVERSITY PLACEMENT SYSTEM Final Proposal
67% (3)
ETHIOPIAN HIGHER UNIVERSITY PLACEMENT SYSTEM Final Proposal
15 pages
Chapter 6 - Mobile Network Layer
No ratings yet
Chapter 6 - Mobile Network Layer
23 pages
Compiler Design Chapter-4
100% (2)
Compiler Design Chapter-4
77 pages
Advanced Database Systems Transactions Processing: What Is A Transaction?
No ratings yet
Advanced Database Systems Transactions Processing: What Is A Transaction?
102 pages
Data Communication and Computer Networks: Addis Ababa Science and Technology University
No ratings yet
Data Communication and Computer Networks: Addis Ababa Science and Technology University
191 pages
Microcomputer and Interfacing PDF
100% (4)
Microcomputer and Interfacing PDF
26 pages
SPM - Hotel Managemenet System (Updated)
No ratings yet
SPM - Hotel Managemenet System (Updated)
30 pages
Unit 4 Question Bank Solutions-1
No ratings yet
Unit 4 Question Bank Solutions-1
11 pages
ECE JU Curriculum Re Numbered
100% (4)
ECE JU Curriculum Re Numbered
322 pages
Hawassa University Department of Informatics Data Communication and Computer Networking Mid Exam
No ratings yet
Hawassa University Department of Informatics Data Communication and Computer Networking Mid Exam
5 pages
Automata Theory Questions and Answers - Regular Expression
100% (4)
Automata Theory Questions and Answers - Regular Expression
45 pages
Vacancy PDF
No ratings yet
Vacancy PDF
86 pages
Exit Exam Comptency & Course List
100% (1)
Exit Exam Comptency & Course List
18 pages
Chapter Five: Type Checking
100% (1)
Chapter Five: Type Checking
48 pages
Revised Exit Exam Blue Print For Computer Science and Engineering
100% (2)
Revised Exit Exam Blue Print For Computer Science and Engineering
16 pages
Industrial Practice Outline
No ratings yet
Industrial Practice Outline
6 pages
Course Tittle:-Project Title:-: Object Oriented Software Analysis and Design
100% (1)
Course Tittle:-Project Title:-: Object Oriented Software Analysis and Design
24 pages
Data Communication - 2 Marks and 16 Marks
92% (38)
Data Communication - 2 Marks and 16 Marks
36 pages
Fundamental of Programming II
67% (3)
Fundamental of Programming II
9 pages
Operating System Module Only For Exit Exam Preparation Dawit
No ratings yet
Operating System Module Only For Exit Exam Preparation Dawit
29 pages
Industrial Practice Report Format
100% (1)
Industrial Practice Report Format
2 pages
Project Proposal Final
100% (2)
Project Proposal Final
16 pages
Jimma University
100% (1)
Jimma University
29 pages
Arsi University Arsi University ICT Directorate
100% (2)
Arsi University Arsi University ICT Directorate
21 pages
MCQ's On Files and Streams: #Include #Include
No ratings yet
MCQ's On Files and Streams: #Include #Include
9 pages
Wolaita Sodo University School of Informatics Department of Information Systems
No ratings yet
Wolaita Sodo University School of Informatics Department of Information Systems
165 pages
Exit Exam CPP - 2
No ratings yet
Exit Exam CPP - 2
23 pages
Dilla University
100% (1)
Dilla University
34 pages
Graduate Verification System - Final Document
100% (4)
Graduate Verification System - Final Document
58 pages
Haramaya University: College of Computing and Informatics Department of Computer Science
No ratings yet
Haramaya University: College of Computing and Informatics Department of Computer Science
54 pages
Compiler Design: Syntactic Analysis Sample Exercises and Solutions
No ratings yet
Compiler Design: Syntactic Analysis Sample Exercises and Solutions
22 pages
Intern Ship Report by Yewlsew Mekonen Bahir Dar University IOT
89% (19)
Intern Ship Report by Yewlsew Mekonen Bahir Dar University IOT
27 pages
Chapter 2
No ratings yet
Chapter 2
56 pages
Chapter 2
No ratings yet
Chapter 2
91 pages
Chapter-2[1]
No ratings yet
Chapter-2[1]
77 pages
Chapter-2
No ratings yet
Chapter-2
99 pages
Computer Graphics (CG CHAP 1)
No ratings yet
Computer Graphics (CG CHAP 1)
12 pages
Computer Graphics (CG CHAP 2)
No ratings yet
Computer Graphics (CG CHAP 2)
32 pages
Computer Graphics: Geometry and Line Generation
No ratings yet
Computer Graphics: Geometry and Line Generation
5 pages
Computer Graphics (CG CHAP 3)
0% (1)
Computer Graphics (CG CHAP 3)
15 pages
CHAP IInew
No ratings yet
CHAP IInew
29 pages
AI CH 4
No ratings yet
AI CH 4
19 pages
CG Chap 1
No ratings yet
CG Chap 1
3 pages
Compiler Design Ch1
No ratings yet
Compiler Design Ch1
13 pages
CHAP - I Microprocessor
No ratings yet
CHAP - I Microprocessor
49 pages
Compiler Design Chapter-6
No ratings yet
Compiler Design Chapter-6
83 pages
Chapter Seven: Code Generation
No ratings yet
Chapter Seven: Code Generation
33 pages
GFT Annual Report 2019
No ratings yet
GFT Annual Report 2019
148 pages
Unit 5 - Week 2: Assignment 2
No ratings yet
Unit 5 - Week 2: Assignment 2
6 pages
Extended Response-1 - 240229 - 120756
No ratings yet
Extended Response-1 - 240229 - 120756
20 pages
TTMO Syllabus
No ratings yet
TTMO Syllabus
12 pages
Minidisc Recorder
No ratings yet
Minidisc Recorder
116 pages
MTN Irancell: Iran: Company Profile Report
No ratings yet
MTN Irancell: Iran: Company Profile Report
32 pages
CH 9 COORDINATE GEOMATRY
No ratings yet
CH 9 COORDINATE GEOMATRY
4 pages
R Hari Prasad 2024pgp297
No ratings yet
R Hari Prasad 2024pgp297
2 pages
Elements of Assembly Language Programming
No ratings yet
Elements of Assembly Language Programming
12 pages
Coinmarketcap Com Currencies Firo
No ratings yet
Coinmarketcap Com Currencies Firo
2 pages
SNMP FM-MIB OID Table 7.5.1-8.2.1-9.0.1
No ratings yet
SNMP FM-MIB OID Table 7.5.1-8.2.1-9.0.1
1 page
Differential Privacy Protection On Weighted Graph in Wirel - 2021 - Ad Hoc Netwo
No ratings yet
Differential Privacy Protection On Weighted Graph in Wirel - 2021 - Ad Hoc Netwo
10 pages
Rural Education in Malaysia
No ratings yet
Rural Education in Malaysia
3 pages
Endoflife DLP
No ratings yet
Endoflife DLP
3 pages
Ibn 'Arabi - Taj Al-Tarajim
No ratings yet
Ibn 'Arabi - Taj Al-Tarajim
62 pages
Diode Ba157 - BA159 (Data Sheet)
No ratings yet
Diode Ba157 - BA159 (Data Sheet)
3 pages
David Uberti Resume
No ratings yet
David Uberti Resume
2 pages
(eBook PDF) Modern Database Management 12th Global Edition instant download
100% (5)
(eBook PDF) Modern Database Management 12th Global Edition instant download
57 pages
Basic Gate Boolean Algebra
No ratings yet
Basic Gate Boolean Algebra
38 pages
csc-363 Final Project Rewards
No ratings yet
csc-363 Final Project Rewards
2 pages
Iec 60027letter Symbols To Be Used in Electrical Technology
No ratings yet
Iec 60027letter Symbols To Be Used in Electrical Technology
16 pages
ASME A17.1 Elevator Recall
No ratings yet
ASME A17.1 Elevator Recall
6 pages
Robot Welding
No ratings yet
Robot Welding
6 pages
Landstar Log
No ratings yet
Landstar Log
2 pages
Interfaces Ethernet PDF
No ratings yet
Interfaces Ethernet PDF
1,826 pages
Ws Sinamics s120 Regenerativ en
100% (2)
Ws Sinamics s120 Regenerativ en
8 pages
First Quarter Assessment With Answer
No ratings yet
First Quarter Assessment With Answer
2 pages

Compiler Design Chapter-2

Uploaded by

Compiler Design Chapter-2

Uploaded by

Chapter – two

 NFA and DFA

next char next token

token: smallest meaningful sequence of characters

• In general, in programming languages, the following are

Ex. y = 31 + 28 * x, The tokens and their

 Represents patterns of strings of characters.

 Concatenation: An expression of the form rs, where r

 Repetition: An expression of the form r*, where r is a

 Regular expressions are used to specify the

 reop < | <= | = | <> | > | >=

T(0,a) = {0,1} 1 ø {2} ø

The language defined by an NFA is the set of input

An -transition is taken without consuming any character from

 A deterministic finite automaton is a special

 Draw DFAs for the string matched by the

Regular Expression DFA

2- Translate NFA into DFA

 3. Rules for complex regular expressions

(a|b)* (aa|bb) (a|b)*

 Construct NFA for token identifier.

 Input NFA N Both accept the same

 How to minimize a DFA ? (see Dragon Book

 By using regular expressions, we can specify

%% Define how to scan and

<pattern> { <action to take when matched> }

Patterns are specified by regular expressions.

NFA sim. alg

DFA sim. alg

Example: simulate the above DFA for input abba

 meta-characters (do not match themselves, because

 to match a meta-character, prefix with "\"

 to match a backslash, tab or newline, use \\, \t, or \n

1. lex will always match the longest (number of

2. If two or more possible tokens are of the same

yylex() - The function that starts the analysis. It is

 To start , go to application  System tools 

. Every character except new line

 In designing a scanner for this language:

You might also like