Atcd Module 2 2021 Scheme
vtucode.in Page 1
Automata Theory & Compiler Design 21CS51 Module 2
If R is a regular expression denoting the language L_R and S is a regular expression denoting the
language L_S, then:
R + S is a regular expression corresponding to the language L_R ∪ L_S.
R.S is a regular expression corresponding to the language L_R.L_S.
R* is a regular expression corresponding to the language (L_R)*.
Thus the expressions obtained by applying any of these rules are regular expressions.
Examples of Regular expressions
Regular expression      Meaning
a*                      Strings consisting of any number of a’s (zero or more a’s)
a⁺                      Strings consisting of at least one a (one or more a’s)
(a + b)                 A string consisting of either a or b
(a+b)*                  Strings consisting of any number of a’s and b’s, including ε
(a+b)* ab               Strings of a’s and b’s ending with ab
ab(a+b)*                Strings of a’s and b’s starting with ab
(a + b)* ab (a+b)*      Strings of a’s and b’s with substring ab
Obtain regular expression to accept the language containing strings of a’s and b’s such that
L = { a^(2n+1) b^(2m+1) | n, m ≥ 0 }.
a^(2n+1) means an odd number of a’s; regular expression = a(aa)*
b^(2m+1) means an odd number of b’s; regular expression = b(bb)*
The regular expression for the given language = a(aa)*b(bb)*
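The expression a(aa)*b(bb)* can be checked by hand on small strings. The following is a minimal sketch in C, assuming a hypothetical helper name matches_odd_a_odd_b; it does not use a regex library, it simply counts the block of a’s and the block of b’s and tests both counts for oddness.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when s belongs to L(a(aa)*b(bb)*): an odd number of a's
   followed by an odd number of b's, and nothing else.
   The function name is hypothetical, chosen for this illustration. */
bool matches_odd_a_odd_b(const char *s)
{
    size_t a_count = 0, b_count = 0;

    while (*s == 'a') { a_count++; s++; }   /* leading block of a's */
    while (*s == 'b') { b_count++; s++; }   /* following block of b's */

    /* Any leftover character (for example an 'a' after a 'b') breaks the shape. */
    if (*s != '\0')
        return false;

    return (a_count % 2 == 1) && (b_count % 2 == 1);
}
```

For example, "ab" and "aaabbb" are accepted, while "aabb", "aba" and the empty string are rejected.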
Obtain regular expression to accept the language containing strings of 0’s and 1’s with exactly
one 1 and an even number of 0’s.
Regular expression for exactly one 1 = 1
Even number of 0’s = (00)*
Since the total number of 0’s must be even, the 1 is either preceded and followed by an even
number of 0’s, or preceded and followed by an odd number of 0’s.
The regular expression for the given language = (00)* 1 (00)* + 0(00)* 1 0(00)*
Obtain regular expression to accept the language containing strings of 0’s and 1’s having no two
consecutive 0’s. OR
Obtain regular expression to accept the language containing strings of 0’s and 1’s with no pair of
consecutive 0’s.
Whenever a 0 occurs it must be followed by a 1, but there is no restriction on the number of 1’s.
So the string is any combination of 1’s and 01’s, i.e. regular expression = (1+01)*
This expression misses the strings that end with 0, so it is modified by appending (0 + ε) at
the end.
Regular expression for the given language = (1+01)* (0 + ε )
Obtain regular expression to accept the language containing strings of 0’s and 1’s having no two
consecutive 1’s. OR
Obtain regular expression to accept the language containing strings of 0’s and 1’s with no pair of
consecutive 1’s.
Whenever a 1 occurs it must be followed by a 0, but there is no restriction on the number of 0’s.
So the string is any combination of 0’s and 10’s, i.e. regular expression = (0+10)*
This expression misses the strings that end with 1, so it is modified by appending (1 + ε) at
the end.
Regular expression for the given language = (0+10)* (1 + ε )
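The same language can be recognised by a small DFA that remembers whether the previous symbol was a 1. The following is a sketch only, assuming a three-state machine whose state names (LAST_NOT_1, LAST_1, DEAD) and function name are made up for this illustration.

```c
#include <stdbool.h>

/* States of an assumed three-state DFA for "no two consecutive 1's":
   LAST_NOT_1  start state; the previous symbol (if any) was 0
   LAST_1      the previous symbol was 1
   DEAD        the substring 11 has been seen (trap state) */
enum state { LAST_NOT_1, LAST_1, DEAD };

/* Returns true when s is a string over {0, 1} with no pair of consecutive 1's. */
bool no_consecutive_ones(const char *s)
{
    enum state st = LAST_NOT_1;

    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1')
            return false;               /* symbol outside the alphabet */
        switch (st) {
        case LAST_NOT_1: st = (*s == '1') ? LAST_1 : LAST_NOT_1; break;
        case LAST_1:     st = (*s == '1') ? DEAD   : LAST_NOT_1; break;
        case DEAD:       break;         /* stay in the trap state */
        }
    }
    return st != DEAD;                  /* both non-trap states are accepting */
}
```

Both LAST_NOT_1 and LAST_1 are accepting, which mirrors the optional trailing (1 + ε) in the regular expression.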
Obtain regular expression to accept the following languages over Σ = { a, b}.
i. Strings of a’s and b’s with substring aab.
So the regular expression for the given language = [(a+b) (a+b) (a+b)]* + [(a+b) (a+b)]*
ix. Obtain the regular expression to accept the language L = { a^n b^m | m+n is even }
Here n represents number of a’s and m represents number of b’s.
m+n is even results in two possible cases;
case i. when even number of a’s followed by even number of b’s.
regular expression : (aa)*(bb)*
case ii. Odd number of a’s followed by odd number of b’s.
regular expression = a(aa)* b(bb)*.
So the regular expression for the given language = (aa)*(bb)* + a(aa)* b(bb)*
x. Obtain the regular expression to accept the language L = { a^n b^m | n ≥ 4 and m ≤ 3 }.
Here n ≥ 4 means at least 4 a’s; the regular expression for this = aaaa(a)*
m ≤ 3 means at most 3 b’s; the regular expression for this = (ε+b) (ε+b) (ε+b).
So the regular expression for the given language = aaaa(a)* (ε+b) (ε+b) (ε+b).
xi. Obtain the regular expression to accept the language L = { a^n b^m c^p | n ≥ 4, m ≤ 3 and p ≤ 2 }.
Here n ≥ 4 means at least 4 a’s; the regular expression for this = aaaa(a)*
m ≤ 3 means at most 3 b’s; the regular expression for this = (ε+b) (ε+b) (ε+b).
p ≤ 2 means at most 2 c’s; the regular expression for this = (ε+c) (ε+c)
So the regular expression for the given language = aaaa(a)* (ε+b) (ε+b) (ε+b) (ε+c) (ε+c).
xii. All strings of a’s and b’s that do not end with ab.
Strings of length 2 other than ab are aa, ba and bb. Strings of length less than 2 (ε, a and b)
also do not end with ab.
So the regular expression = ε + a + b + (a+b)*(aa + ba + bb)
xiii. All strings of a’s, b’s and c’s with exactly one a.
The regular expression = (b+c)* a (b+c)*
xiv. All strings of a’s and b’s with at least one occurrence of each symbol in Σ = {a, b}.
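At least one occurrence of each symbol means the string contains an a followed somewhere
later by a b, or a b followed somewhere later by an a, with any number of a’s and b’s
before, between and after them.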
So the regular expression =(a+b)* a (a+b)* b(a+b)* +(a+b)* b(a+b)* a(a+b)*
Case ii. Since nm ≥ 3, if m = 1 then n should be ≥ 3. The equivalent regular expression is given
by: RE = aaa(a)* b
Case iii. Since nm ≥ 3, if m ≥ 2 and n ≥ 2 then the equivalent regular expression is given by:
RE = aa(a)* bb(b)*
So the final regular expression is obtained by taking the union of all the above regular expressions.
Regular expression = abbb(b)* + aaa(a)*b + aa(a)*bb(b)*
Application of Regular expression:
1. Regular expressions are used in UNIX utilities such as grep to describe search patterns.
2. Regular expressions are extensively used in the design of Lexical analyzer phase.
3. Regular expressions are used to search patterns in text.
FINITE AUTOMATA AND REGULAR EXPRESSIONS
1. Converting Regular Expressions to Automata:
Prove that every language defined by a regular expression is also defined by a finite automaton.
Proof:
Suppose L = L(R) for a regular expression R. We show that L = L(E) for some ε-NFA E with:
a. Exactly one accepting state.
b. No arcs into the initial state.
c. No arcs out of the accepting state.
The proof must be discussed with the following transition diagrams for the basis of the
construction of an automaton.
Starting at the new start state, we can go to the start state of either the automaton for R or the
automaton for S. We then reach the accepting state of one of these automata, from which we can
follow an ε-arc to the accepting state of the new automaton.
Automaton for R.S is given by:
The start state of the first (R) automaton becomes the start state of the whole, and the final state
of the second (S) automaton becomes the final state of the whole.
Automaton for R* is given by:
From the start state we can go directly to the final state along an arc labeled ε (for ε in R*), or
go to the start state of the automaton for R, through that automaton one or more times, and then
to the final state.
Finally the ε-NFA for the regular expression: (0+1)*1(0+1) is given by:
3. If the start state of FSM M is part of a loop (i.e. it has any transitions coming into it), then
create a new start state s and connect it to M’s start state via an ε-transition. This new
start state s will have no transitions into it.
4. If FSM M has more than one accepting state, or if there is just one but there are
transitions out of it, create a new accepting state and connect each of M’s accepting states
to it via an ε-transition. Remove the old accepting states from the set of accepting states.
Note that the new accepting state will have no transitions out from it.
5. At this point, if M has only one state, then that state is both the start state and the
accepting state and M has no transitions. So L(M) = {ε}. Halt and return the simple
regular expression ε.
6. Until only the start state and the accepting state remain do:
6.1 Select some state s of M other than the start state and the accepting state.
6.2 Remove that state s from M.
6.3 Modify the transitions among the remaining states so that M accepts the same
strings. The labels on the rewritten transitions may be any regular expression.
7. Return the regular expression that labels the one remaining transition from the start state
to the accepting state.
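Step 6.3 is usually carried out with one rewriting rule per pair of remaining states. The block below is a sketch of that rule in LaTeX; the symbols R_pq, R_ps, R_ss and R_sq are names assumed here for the labels on the arcs p→q, p→s, s→s and s→q, with a missing arc read as Ø and a missing self-loop on s contributing ε.

```latex
% Relabelling the arc p -> q when state s is eliminated.
% "+" denotes union, as elsewhere in these notes.
R'_{pq} \;=\; R_{pq} \;+\; R_{ps}\,\bigl(R_{ss}\bigr)^{*}\,R_{sq}
```

For instance, with no direct arc from p to q, an arc labeled a from p to s, a self-loop labeled b on s and an arc labeled a from s to q, the new label is ab*a.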
Consider the following FSM M: Show a regular expression for L(M).
OR
Obtain the regular expression for the following finite automata using state elimination method.
We can build an equivalent machine M' by eliminating state q2 and replacing it by a transition
from q1 to q3 labeled with the regular expression ab*a.
So M' is:
Obtain the regular expression for the following finite automata using state elimination method.
There is no incoming edge into the initial state and no outgoing edge from the final state. So
there are only two states to keep: the initial state and the final state.
After eliminating the state B:
Regular expression = ab
Obtain the regular expression for the following finite automata using state elimination method.
There is no incoming edge into the initial state as well as no outgoing edge from final state.
After eliminating the state B:
Obtain the regular expression for the following finite automata using state elimination method.
Since the initial state has an incoming edge and the final state has an outgoing edge, we have to
create a new initial state and a new final state, connecting the new initial state to the old initial
state through ε and the old final state to the new final state through ε. Make the old final state a
non-final state.
Since there are multiple final states, we have to create a new final state.
Obtain the regular expression for the following finite automata using state elimination method.
Obtain the regular expression for the following finite automata using state elimination method.
Since start state 1 has incoming transitions, we create a new start state and link that state to state
1 through ε.
Since accepting states 1 and 2 have outgoing transitions, we create a new accepting state and link
state 1 and state 2 to it through ε. Remove the old accepting states from the set of
accepting states (i.e. consider 1 and 2 as non-final states).
Finally we have only the start and final states, with one transition from the start state to the final
state. The label on this transition is the regular expression.
Regular Expression = (ab U aaa* b)* (a U ε )
The first goes from state i to state k without passing through k, the last piece goes from k to j
without passing through k, and all the pieces in the middle go from k to itself, without passing
through k. When we combine the expressions for the paths of the two types above, we have the
expression for the labels of all paths from state i to state j that go through no state higher than k.
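The two kinds of paths described above are commonly combined into a single recurrence. The block below writes it in LaTeX; the superscript notation R_{ij}^{(k)} (paths from i to j through no intermediate state higher than k) is assumed here, and + denotes union.

```latex
% Paths from i to j through states numbered at most k:
% either they avoid k entirely, or they reach k, loop on k
% zero or more times, and then continue from k to j.
R_{ij}^{(k)} \;=\; R_{ij}^{(k-1)} \;+\; R_{ik}^{(k-1)}\,\bigl(R_{kk}^{(k-1)}\bigr)^{*}\,R_{kj}^{(k-1)}
```

The first term covers paths that never pass through state k; the second covers paths that pass through k at least once.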
R_ij^(0) = Regular expressions for the paths that go through no intermediate states at all.
R_ij^(1) = Regular expressions for the paths that can go through intermediate state 1 only.
R_ij^(2) = Regular expressions for the paths that can go through intermediate states 1 and 2
only.
R_ij^(3) = Regular expressions for the paths that can go through intermediate states 1, 2 and
3 only.
Write the regular expression for the language accepted by the following DFA:
Answer:
When k =0; (passing through no intermediate state), the various regular expressions are:
When k = 1 (passing through state 1 as intermediate state), the various regular expressions are:
Therefore the regular expression corresponding to the language accepted by the DFA is given by
R_12^(2) (state 1 (i) is the start state and state 2 (j) is the final state). By using the formula:
Answer:
Number of states in DFA = 3; ie: k = 3
By renaming the states of DFA:
Regular expressions for paths that can go through a) no state, b) state 1 only and c) states 1 and 2
only.
Therefore the regular expression corresponding to the language accepted by the DFA is given by
R_13^(3) (state 1 (i) is the start state and state 3 (j) is the final state). By using the formula:
q2 q2 q3
*q3 q3 q2
Answer:
Number of states in DFA = 3; ie: k =3
By renaming the states of DFA as q1 = 1, q2 = 2, q3 = 3
Transition diagram of DFA:
Regular expressions for paths that can go through 3 intermediate states: state 1, state 2 and
state 3 only.
R_ij^(3)      Regular expression
R_11^(3)      Ø + ε = ε
R_12^(3)      b
R_13^(3)      (a + bb)b*
R_14^(3)      ab*a + bbb*a
R_21^(3)      Ø
R_22^(3)      Ø + ε = ε
R_23^(3)      bb*
R_24^(3)      bb*a
R_31^(3)      Ø
R_32^(3)      Ø
R_33^(3)      b*
R_34^(3)      b*a
R_41^(3)      Ø
R_42^(3)      Ø
R_43^(3)      Ø
R_44^(3)      Ø + ε = ε
The regular expression corresponding to the language accepted by the DFA is given by R_14^(4)
(state 1 (i) is the start state and state 4 (j) is the final state). By using the formula:
Let L be a regular language. Then there exists a constant ‘n’ (which depends on L) such that for
every string ‘w’ in L with |w| ≥ n, we can break w into three strings, w = xyz, such that:
1. y ≠ ε (i.e. |y| ≥ 1)
2. |xy| ≤ n
3. For all k ≥ 0, the string xy^k z is also in L.
each a_i is an input symbol. Since we have ‘m’ input symbols, the DFA passes through a sequence
of ‘m+1’ states q0, q1, q2, …, qm, where q0 is the start state and qm is a final state.
Since |w| ≥ n and there are only ‘n’ different states, by the pigeonhole principle the states
q0, q1, …, qn cannot all be distinct, so some state repeats and the path contains a loop. Thus we
can find two different integers i and j with 0 ≤ i < j ≤ n such that qi = qj. Now we can break the
string w = xyz as follows:
x = a1 a2 a3 … ai
y = ai+1 ai+2 … aj (the loop string, since qi = qj)
z = aj+1 aj+2 … am
The relationships among the strings and states are given in the figure below:
‘x’ may be empty in the case that i = 0. Also ‘z’ may be empty if j = n = m. However, y cannot be
empty, since ‘i’ is strictly less than ‘j’.
Thus for any k ≥ 0, xy^k z is also accepted by the DFA ‘A’; that is, if L is regular, then
xy^k z is in L for all k ≥ 0.
Applications of Pumping lemma:
1. It is useful to prove certain languages are non-regular.
2. It is possible to check whether a language accepted by FA is finite or infinite.
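A non-regularity argument has to rule out every split w = xyz with |xy| ≤ n and |y| ≥ 1, not only one convenient split. The following C sketch illustrates this for the language { a^n b^n | n ≥ 0 } and the string w = a^n b^n: it enumerates all admissible splits and checks that pumping y twice always leaves the language. The function names and the fixed buffer sizes are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Membership test for L = { a^n b^n | n >= 0 }. */
bool in_anbn(const char *s)
{
    size_t a = 0, b = 0;
    while (*s == 'a') { a++; s++; }
    while (*s == 'b') { b++; s++; }
    return *s == '\0' && a == b;
}

/* Writes x y^k z into out, where x = w[0..i), y = w[i..j), z = w[j..).
   The caller must provide a buffer large enough for the pumped string. */
void pump(const char *w, size_t i, size_t j, unsigned k, char *out)
{
    size_t len = strlen(w), pos = 0;

    memcpy(out + pos, w, i);
    pos += i;
    for (unsigned r = 0; r < k; r++) {
        memcpy(out + pos, w + i, j - i);
        pos += j - i;
    }
    memcpy(out + pos, w + j, len - j);
    pos += len - j;
    out[pos] = '\0';
}

/* For w = a^n b^n, returns true when every split with |xy| <= n and
   |y| >= 1 gives x y^2 z outside L. Values of n too large for the
   fixed buffers are rejected by returning false. */
bool every_split_fails(size_t n)
{
    char w[64], pumped[128];

    if (2 * n >= sizeof w)
        return false;
    memset(w, 'a', n);
    memset(w + n, 'b', n);
    w[2 * n] = '\0';

    for (size_t j = 1; j <= n; j++)          /* |xy| = j <= n  */
        for (size_t i = 0; i < j; i++) {     /* |y| = j - i >= 1 */
            pump(w, i, j, 2, pumped);
            if (in_anbn(pumped))
                return false;
        }
    return true;
}
```

Because |xy| ≤ n forces y to consist of a’s only, pumping always changes the number of a’s without changing the number of b’s, which is why no split survives.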
Show that L = { a^n b^n | n ≥ 0 } is not regular.
Assume L is a regular language and let ‘n’ be the number of states in the FA.
Consider the string w = a^n b^n.
Since |w| = n + n = 2n ≥ n, we can split ‘w’ into xyz such that |xy| ≤ n and |y| ≥ 1 as
Where |x| = n-1 and |y| = 1 so that |xy| = n-1 + 1 = n ≤ n, which is true.
Where |x| = n-1 and |y| = 1 so that |xy| = n-1 + 1 = n ≤ n, which is true.
According to the pumping lemma, xy^k z ∈ L for all k ≥ 0.
If k = 0, the string ‘y’ does not appear, so the resulting string has ‘n’ a’s followed by ‘n’
b’s, i.e. a^n b^n.
But every string in L must have more a’s than b’s (the original w has ‘n+1’ a’s followed by
‘n’ b’s), so a^n b^n is not in L, which contradicts the assumption that the language is regular.
So the language L = { a^i b^j | i > j } is not a regular language.
Show that L = { w | n_a(w) < n_b(w) } is not regular.
Assume L is a regular language and let ‘n’ be the number of states in the FA.
Consider the string w = a^(n-1) b^n.
Since |w| = n-1 + n = 2n-1 ≥ n, we can split ‘w’ into ‘xyz’ such that |xy| ≤ n and |y| ≥ 1 as
Where |x| = n-1 and |y| = 1 so that |xy| = n-1 + 1 = n ≤ n, which is true.
According to the pumping lemma, xy^k z ∈ L for all k ≥ 0.
If k = 0, the string ‘y’ does not appear, so the resulting string has ‘n-1’ a’s followed by
‘n-1’ b’s, i.e. a^(n-1) b^(n-1).
But every string in L must have fewer a’s than b’s (the original w has ‘n-1’ a’s followed by ‘n’
b’s), so a^(n-1) b^(n-1) is not in L, which contradicts the assumption that the language is regular.
So the language L = { w | n_a(w) < n_b(w) } is not regular.
Show that L = { w | n_a(w) = n_b(w) } is not regular.
We can prove that L is not regular by taking the string w = a^n b^n.
For the solution refer to problem 1.
Show that L = { a^i b^j | i ≠ j } is not regular.
i ≠ j means i > j or i < j, so we can take the string w = a^(n+1) b^n or w = a^(n-1) b^n.
The solution is similar to the previous problems.
Show that L = { a^n b^m c^(n+m) | n, m ≥ 0 } is not regular.
Assume L is a regular language and let ‘n’ be the number of states in the FA.
Regular languages are closed under homomorphism, so if L is regular then h(L) is also regular.
Take h(a) = a, h(b) = a and h(c) = c.
Now the language L is reduced to h(L) = { a^n a^m c^(n+m) | n, m ≥ 0 },
i.e. h(L) = { a^(n+m) c^(n+m) | n+m ≥ 0 }, which is of the form
{ a^i b^i | i ≥ 0 }.
Consider w = a^n b^n.
Since |w| = n + n = 2n ≥ n, we can split ‘w’ into xyz such that |xy| ≤ n and |y| ≥ 1 as
Where |x| = n-1 and |y| = 1 so that |xy| = n-1 + 1 = n ≤ n, which is true.
According to the pumping lemma, xy^k z ∈ L for all k ≥ 0.
If k = 0, the string ‘y’ does not appear, so the number of a’s will be less than the number of b’s,
i.e. the resulting string is a^(n-1) b^n.
This contradicts the assumption that the language is regular.
So { a^n b^n | n ≥ 0 } is not regular, and hence the given language
L = { a^n b^m c^(n+m) | n, m ≥ 0 } is not regular.
Where |x| = n-1 and |y| = 1 so that |xy| = n-1 + 1 = n ≤ n, which is true.
According to the pumping lemma, xy^k z ∈ L for all k ≥ 0.
If k = 0, the string ‘y’ does not appear, so the number of a’s to the left of the first b will be
less than the number of a’s after the first b,
i.e. the resulting string is a^(n-1) b^n a^n b^n, which is not of the form ww.
This contradicts the assumption that the language is regular.
So the language L = { ww | w ∈ (a+b)* } is not a regular language.
Show that L = { ww^R | w ∈ (a+b)* } is not regular.
Assume L is a regular language and let ‘n’ be the number of states in the FA.
Consider the string w = a^n b^n, therefore ww^R = a^n b^n b^n a^n.
Since |ww^R| = n+n+n+n = 4n ≥ n, we can split ‘ww^R’ into ‘xyz’ such that |xy| ≤ n and |y| ≥ 1 as
x = a^(n-1)
y = a
z = b^n b^n a^n
Where |x| = n-1 and |y| = 1 so that |xy| = n-1 + 1 = n ≤ n, which is true.
According to the pumping lemma, xy^k z ∈ L for all k ≥ 0.
If k = 0, the string ‘y’ does not appear, so the number of a’s to the left of the first b will be
less than the number of a’s after the last b,
i.e. the resulting string is a^(n-1) b^n b^n a^n, which is not of the form ww^R.
This contradicts the assumption that the language is regular.
So the language L = { ww^R | w ∈ (a+b)* } is not a regular language.
Show that L = { a^(n!) | n ≥ 0 } is not regular.
Assume L is a regular language and let ‘n’ be the number of states in the FA.
input.
Note: The speed of lexical analysis is a concern in compiler design, since this is the only phase
that reads the source program character by character.
Discuss the various issues of lexical analysis.
1. The lexical analyzer reads the source program character by character to produce tokens.
2. Normally a lexical analyzer does not return a list of tokens in one shot; it returns a token
when the parser asks for one.
3. Normally the lexical analyzer does not return a comment as a token. It skips the comment and
returns the next token (which is not a comment) to the parser.
4. Correlating error messages: it can associate a line number with each error message. In
some compilers it makes a copy of the source program with the error messages inserted
at the appropriate positions.
5. If the source program uses a macro-preprocessor, the expansion of macros may be
performed by the lexical analyzer.
Role of Lexical Analyzer
Explain the role of lexical analyzer with a block diagram.
• It reads the input characters of the source program, groups them into lexemes and produces
a sequence of tokens as output.
• It interacts with the symbol table.
• Initially the parser calls the lexical analyzer by means of the getNextToken command.
• In response to this command the lexical analyzer reads characters from its input until it can
identify the next lexeme and produce a token for that lexeme, which is returned to the parser.
• It eliminates comments and white space.
i. Token.
ii. Pattern.
iii. Lexeme.
Token: It describes the class or category of an input string. A token is a pair consisting of a token
name and an optional attribute value.
For example, identifiers, keywords and constants are tokens.
Pattern: A set of rules that describes a token. It is a description of the form that the lexemes of a
token may take.
Example: letter [A-Za-z].
Lexeme: A sequence of characters in the source program that matches the pattern of a
token.
Example: int, a, num, ans etc.
Token representation:
In many programming languages, the following classes cover most or all of the tokens:
i. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
ii. Tokens for the operators, either individually or in classes, such as the token comparison.
iii. One token representing all identifiers.
iv. One or more tokens representing constants, such as numbers and literal strings.
v. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
Attributes for tokens
A token has at most one attribute. For an identifier, this attribute is a pointer to the symbol-table
entry in which the information about the token is kept.
Example: The token names and associated attribute values for the statement E = M * C ** 2 are
written below as a sequence of pairs.
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
Lexical errors:
1. It is hard for a lexical analyzer to tell, without the aid of other components, that there is a
source-code error. For instance, if the string fi is encountered for the first time in a C
program in the context:
fi ( a == f ( x ) )
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared
function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return
the token id to the parser and let some other phase of the compiler, probably the parser in
this case, handle the error due to transposition of the letters.
2. Suppose a situation arises in which the lexical analyzer is unable to proceed because none of
the patterns for tokens matches any prefix of the remaining input. The simplest recovery
strategy is "panic mode" recovery. We delete successive characters from the remaining input,
until the lexical analyzer can find a well-formed token at the beginning of what input is left.
This recovery technique may confuse the parser, but in an interactive computing environment
it may be quite adequate.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. In the above example, as soon as the forward
pointer encounters a blank space, the lexeme is identified.
The forward pointer (fp) skips white space: when fp encounters white space it ignores it and
moves ahead. Then both fp and the lexeme-begin pointer (bp) are set at the next token.
1. One buffer
2. Two buffer
One buffer scheme:
Here only one buffer is used to store the input string. The problem with this scheme is that if
a lexeme is very long, it crosses the buffer boundary. To scan the remaining part of the lexeme
the buffer has to be refilled, which overwrites the first part of the lexeme. Sometimes this may
result in loss of data.
Two Buffer scheme:
Why is the two-buffer scheme used in lexical analysis? Explain.
Because of the amount of time taken to process characters and the large number of characters
that must be processed during the compilation of a large source program, specialized two-buffer
techniques have been developed to reduce the amount of overhead required to process
a single input character.
Here a buffer (array) is divided into two N-character halves, where N is the number of
characters in one disk block (e.g. 4096 bytes). If fewer than N characters remain in the
input file, then a special character, represented by eof, marks the end of the source file; it is
different from any input character.
One read command is used to read N characters. Two pointers are maintained: the beginning
of the lexeme pointer and the forward pointer.
Initially, both pointers point to the first character of the next lexeme.
Using this method we can overcome the problem faced by the one-buffer scheme: even
when the input is long, scanning continues in the next half while the contents of the
previous half are still available. Thus there is no scope for loss of
any data.
Sentinels:
In the two-buffer scheme we must check the forward pointer each time it is incremented. Thus
we make two tests: one for the end of the buffer, and one to determine what character was read.
We can combine these two tests if we use a sentinel character at the end of each buffer half.
Sentinel is a special character inserted at the end of buffer, that cannot be a part of source
program; eof is used as sentinel.
Look ahead code:
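A minimal sketch of lookahead with sentinels, assuming an in-memory source string instead of disk reads, a tiny half size N = 4 so that reloads are easy to follow, and '\0' standing in for eof. The struct and function names (scanner, load_half, scanner_init, next_char) are hypothetical, and the lexeme-begin pointer is left out to keep the example short.

```c
#include <stddef.h>
#include <string.h>

#define N 4                 /* characters per buffer half (tiny, for illustration) */
#define SENTINEL '\0'       /* stands in for eof; assumed absent from the source   */

struct scanner {
    char buf[2 * N + 2];    /* half 1: [0..N-1], sentinel slot [N],
                               half 2: [N+1..2N], sentinel slot [2N+1] */
    const char *src;        /* unread remainder of the in-memory source */
    char *forward;          /* next character to be read */
};

/* Copies up to N source characters into the half starting at `half`
   and places a sentinel right after the characters copied. */
static void load_half(struct scanner *sc, char *half)
{
    size_t len = strlen(sc->src);
    size_t n = len < N ? len : N;

    memcpy(half, sc->src, n);
    half[n] = SENTINEL;
    sc->src += n;
}

void scanner_init(struct scanner *sc, const char *source)
{
    sc->src = source;
    sc->forward = sc->buf;
    load_half(sc, sc->buf);
}

/* Returns the next character, or SENTINEL at the real end of input.
   The common case costs a single comparison; the buffer-end tests run
   only when a sentinel has actually been seen. */
char next_char(struct scanner *sc)
{
    char c = *sc->forward;

    if (c != SENTINEL) {
        sc->forward++;
        return c;
    }
    if (sc->forward == sc->buf + N) {            /* end of first half  */
        load_half(sc, sc->buf + N + 1);
        sc->forward = sc->buf + N + 1;
        return next_char(sc);
    }
    if (sc->forward == sc->buf + 2 * N + 1) {    /* end of second half */
        load_half(sc, sc->buf);
        sc->forward = sc->buf;
        return next_char(sc);
    }
    return SENTINEL;                             /* sentinel inside a half: end of input */
}
```

A sentinel found in one of the two fixed slots means "reload the other half"; a sentinel found anywhere else means the input is exhausted.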
Operations on Languages:
Give the formal definitions of operations on languages with notations.
In lexical analysis the most important operations on languages are:
i. Union
ii. Concatenation
iii. Star closure
iv. Positive closure.
These operations are formally defined as follows
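In the usual textbook notation, for two languages L and M, the four operations can be written as in the LaTeX block below. The symbols L, M, s, t and L^i (L concatenated with itself i times, with L^0 = {ε}) are the conventional ones and are assumed here.

```latex
\begin{aligned}
L \cup M &= \{\, s \mid s \in L \ \text{or}\ s \in M \,\}          && \text{(union)} \\
LM       &= \{\, st \mid s \in L \ \text{and}\ t \in M \,\}        && \text{(concatenation)} \\
L^{*}    &= \bigcup_{i=0}^{\infty} L^{i}                            && \text{(star closure)} \\
L^{+}    &= \bigcup_{i=1}^{\infty} L^{i}                            && \text{(positive closure)}
\end{aligned}
```

The only difference between the two closures is that L* always contains ε, while L⁺ contains ε only if L does.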
Regular Expressions
We use regular expressions to describe tokens of a programming language.
A regular expression is built up of simpler regular expressions (using defining rules)
Algebraic properties
r|s = s|r
r|(s|t)= (r|s)|t
(rs)t = r(st)
r(s|t) = rs|rt
(s|t)r = sr|tr
εr=r
rε=r
r* = (r|ε)*
r** = r*
OR
letter → [A-Za-z_]
digit → [0-9]
digits → ( digit )+
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
OR
digit → [0-9]
digits → digit+
number → digits ( . digits ) ? ( E [+-]? digits )?
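The definition number → digits ( . digits )? ( E [+-]? digits )? can be turned into a hand-written recognizer almost line by line. The following C sketch assumes hypothetical helper names (skip_digits, is_unsigned_number) and checks a whole string, whereas a real lexer would instead look for the longest matching prefix of the input.

```c
#include <ctype.h>
#include <stdbool.h>

/* digits: one or more decimal digits. Returns false if there is no digit
   at *p; otherwise advances *p past the run of digits. */
static bool skip_digits(const char **p)
{
    if (!isdigit((unsigned char)**p))
        return false;
    while (isdigit((unsigned char)**p))
        (*p)++;
    return true;
}

/* Returns true when the whole string s matches
   digits ( . digits )? ( E [+-]? digits )? */
bool is_unsigned_number(const char *s)
{
    const char *p = s;

    if (!skip_digits(&p))               /* digits */
        return false;

    if (*p == '.') {                    /* optional fraction */
        p++;
        if (!skip_digits(&p))
            return false;
    }
    if (*p == 'E') {                    /* optional exponent */
        p++;
        if (*p == '+' || *p == '-')
            p++;
        if (!skip_digits(&p))
            return false;
    }
    return *p == '\0';                  /* nothing may follow the number */
}
```

Strings such as "123", "3.14" and "6.02E+23" are accepted; "1.", ".5" and "1E" are rejected because digits requires at least one digit.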
Recognition of Tokens
Our current goal is to perform the lexical analysis needed for the following grammar.
Specification of Token
To specify tokens Regular Expressions are used.
Recognition of Token: To recognize tokens there are 2 steps
1. Design of Transition Diagram
2. Implementation of Transition Diagram
Transition Diagrams
A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each
possible token. It shows the decisions that must be made based on the input seen. The two main
components are circles representing states (think of them as decision points of the lexer) and arrows
representing edges (think of them as the decisions made).
It is fairly clear how to write code corresponding to this diagram. You look at the first character, if
it is <, you look at the next character. If that character is =, you return (relop, LE) to the parser. If
instead that character is >, you return (relop, NE). If it is another character, return (relop, LT) and
adjust the input buffer so that you will read this character again since you have not used it for the
current lexeme. If the first character was =, you return (relop, EQ).
Write the transition diagram to recognize the token given below:
i. relop (relational operator)
ii. Identifier and keyword
iii. Unsigned number
iv. Integer constant
v. Whitespace
i. Transition diagram for relop:
v. Whitespace:
Whitespace characters are represented by delim, where delim includes characters like
blank, tab, newline and other characters that are not considered by the language design to be part
of any token.
There are two ways we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially: When we find an identifier, a call
to installID( ) function places that identifier into the symbol table if it is not already there
and returns a pointer to the symbol table entry. The function getToken( ) examines the
symbol table for the lexeme found, and returns token name as either id or one of the
keyword token that was initially installed in the table.
2. Create separate transition diagrams for each keyword
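The first approach can be sketched in C as follows. This is an illustration under stated assumptions: the keywords if, then and else, the fixed-size array used as a symbol table, and the names install_id and get_token (simplified stand-ins for installID( ) and getToken( )) are all chosen for this example and are not taken from a particular compiler.

```c
#include <string.h>

enum token_name { TOK_ID, TOK_IF, TOK_THEN, TOK_ELSE };

struct entry {
    const char *lexeme;
    enum token_name token;
};

#define TABLE_CAP 64

/* Symbol table with the reserved words installed before scanning starts. */
static struct entry table[TABLE_CAP] = {
    { "if", TOK_IF }, { "then", TOK_THEN }, { "else", TOK_ELSE }
};
static int table_len = 3;

/* Returns the index of the entry for lexeme, adding it as an identifier
   if it is not already present. Returns -1 when the table is full.
   The caller must keep the lexeme string alive. */
int install_id(const char *lexeme)
{
    for (int i = 0; i < table_len; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0)
            return i;
    if (table_len == TABLE_CAP)
        return -1;
    table[table_len].lexeme = lexeme;
    table[table_len].token = TOK_ID;
    return table_len++;
}

/* Returns the token name stored for a table entry: a keyword token for
   the pre-installed reserved words, TOK_ID for everything else. */
enum token_name get_token(int index)
{
    return table[index].token;
}
```

Because the reserved words are already in the table, looking up "if" yields TOK_IF without any keyword-specific transition diagram, while a new lexeme such as "count" is installed and reported as TOK_ID.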
Architecture of a transition diagram based lexical analyzer
The idea is that we write a piece of code for each transition diagram. This piece of code contains a
case for each state, which typically reads a character and then goes to the next case depending on
the character read. nextChar() is used to read the next character from the input buffer. The numbers
in the circles are the names of the cases. Accepting states often need to take some action and return
to the parser. Many of these accepting states (the ones with stars) need to restore one character of
input. This is called retract() in the code.
What should the code for a particular diagram do if at one state the character read is not one of
those for which a next state has been defined? That is, what if the character read is not the label of
any of the outgoing arcs? This means that we have failed to find the token corresponding to this
diagram.
When the code calls fail(), this is not an error case. It simply means that the current input does
not match this particular token. So we need to go to the code section for another diagram after
restoring the input pointer, so that we start the next diagram at the point where this failing diagram
started. If we have tried all the diagrams, then we have a real failure and need to print an error
message.
Coding part:
TOKEN getRelop( )
{
    TOKEN retToken = new(RELOP);
    while (1)
    {   /* repeat character processing until a return or failure occurs */
        switch (state)
        {
        case 0: c = nextChar( );
            if ( c == '<' ) state = 1;
            else if ( c == '=' ) state = 5;
            else if ( c == '>' ) state = 6;
            else fail( );       /* lexeme is not a relational operator */
            break;
        case 1: c = nextChar( );
            if ( c == '=' ) state = 2;
            else if ( c == '>' ) state = 3;
            else state = 4;     /* any other character: handled in state 4 */
            break;
        /* …….. */
        /* …….. */
        case 8: retract( );     /* give back the character read after '>' */
            retToken.attribute = GT;
            return(retToken);
        }
    }
}