ATCD Material
Finite-state machines provide a simple computational model with many applications. Finite-state
machines are also called finite-state automata (singular: automaton) or just finite automata.
Definition: An FSM (Finite State Machine), also called a finite automaton, is a machine, or a mathematical
model of a machine, which can reach only a finite number of states and transitions. It is used in mathematical
problem analysis. Computation begins in the start state with an input string, and the machine changes to new
states depending on the transition function.
Examples of FSMs:
1. Counting to five
2. Getting up in the morning
3. A playing board
4. A traffic light
5. A vending machine
Alphabet: An alphabet is a non-empty finite set of symbols. We normally use the symbols a, b, c, … (with or
without subscripts) or 0, 1, 2, …, etc. for the elements of an alphabet.
A string over an alphabet is a finite sequence of symbols from that alphabet. Since the empty sequence is a
finite sequence, it is also a string; we use ε to denote the empty string. Σ* denotes the set of all strings over Σ.
For example, if Σ = {0, 1}, then Σ*= {ε, 0, 1, 00, 01, 10, 11, 000, 001, . . .}. Although the
set Σ* is infinite, it is a countable set. In fact, Σ∗ is countably infinite for any alphabet Σ.
Operations on Strings:
1. Concatenation: The concatenation of strings x and y, written xy, is the string consisting of the symbols of x followed by the symbols of y.
2. Substring: If w is a string, then v is a substring of w if there exist strings x and y such that w = xvy.
3. Reversal: The reversal of a string w, denoted w^R, is the string w spelled backwards.
4. Kleene closure: Let w be a string. w* is the set of strings obtained by concatenating w any number of times (zero or more): w* = {ε, w, ww, www, …}.
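As a minimal sketch, these operations map directly onto Python strings (the helper names here are illustrative):

def concat(x: str, y: str) -> str:
    """Concatenation: xy is x followed by y."""
    return x + y

def is_substring(v: str, w: str) -> bool:
    """v is a substring of w iff w = xvy for some strings x and y."""
    return v in w

def reverse(w: str) -> str:
    """Reversal w^R: w spelled backwards."""
    return w[::-1]

def kleene(w: str, max_copies: int) -> set:
    """A finite slice of w* = {ε, w, ww, ...}: up to max_copies repetitions."""
    return {w * n for n in range(max_copies + 1)}

print(concat("ab", "ba"))           # abba
print(is_substring("bb", "abba"))   # True
print(reverse("abc"))               # cba
print(kleene("01", 3))              # {'', '01', '0101', '010101'}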
Formal Languages:
A formal language is an abstraction of the general characteristics of programming languages; it can
be defined as a set of strings over an alphabet Σ.
A language L is a (possibly infinite) set of strings over a finite alphabet Σ.
L(M) is the notation for the language defined by a machine M: the machine M accepts a certain set
of strings, and that set is its language.
Operations on Languages:
1. Concatenation of Languages:
Given languages L1 and L2, we define their concatenation to be the language L1 ◦ L2 = {xy | x ∈ L1, y ∈ L2}.
Example: L1 = {hello} and L2 = {world}; then L1 ◦ L2 = {helloworld}.
2. Kleene Closure:
We write L^n to denote the language obtained by concatenating n copies of L. More formally,
L^0 = {ε} and L^n = L^(n−1) ◦ L, for n ≥ 1.
The Kleene closure of L is L* = L^0 ∪ L^1 ∪ L^2 ∪ … .
3. Union:
Example: L1 = {0, 11, 01, 011}, L2 = {1, 01, 110}; then L1 ∪ L2 = {0, 1, 11, 01, 011, 110}.
4. Relative Complement: Given some alphabet Σ, for any two languages S, T over Σ, the
difference S − T of S and T is the language S − T = {w ∈ Σ* | w ∈ S and w ∉ T}.
A special case of the difference is obtained when S = Σ*, in which case we define the
complement L̄ of a language L as L̄ = {w ∈ Σ* | w ∉ L}.
5. Reversal of a Language: Given a language L over Σ, we define the reverse of L as L^R = {w^R | w ∈ L}.
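For finite languages, the operations above can be modelled directly with Python sets; a minimal sketch, with illustrative function names:

def lang_concat(L1, L2):
    """L1 ◦ L2 = {xy | x ∈ L1, y ∈ L2}."""
    return {x + y for x in L1 for y in L2}

def lang_power(L, n):
    """L^n: n-fold concatenation, with L^0 = {ε}."""
    result = {""}
    for _ in range(n):
        result = lang_concat(result, L)
    return result

def lang_reverse(L):
    """L^R = {w^R | w ∈ L}."""
    return {w[::-1] for w in L}

print(lang_concat({"hello"}, {"world"}))   # {'helloworld'}
print({"0", "11", "01"} | {"1", "01"})     # union
print({"0", "11", "01"} - {"01"})          # relative complement (difference)
print(lang_power({"0", "1"}, 2))           # {'00', '01', '10', '11'}
print(lang_reverse({"01", "011"}))         # {'10', '110'}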
Regular Language
Regular Language: Given a finite alphabet Σ, the regular languages are characterized as follows:
1. ∅ is a regular language.
2. For any string x ∈ Σ*, {x} is a regular language.
3. A language L is regular if there exists an FSA M such that L(M) = L.
4. A language L is regular if there exists a regular expression r such that L(r) = L.
These languages are accepted by DFAs and NFAs.
Closure Properties:
A variety of operations preserve regularity; i.e., the universe of regular languages is closed
under these operations.
1. Regular languages are closed under union ∪, concatenation ◦ and closure *.
Two standard tools for proving that a language is not regular are:
1. Pumping Lemma
2. Myhill-Nerode Theorem
Regular Language
The languages accepted by FA are regular languages and these languages are easily described by
simple expressions called regular expressions.
For any regular expressions r and s over Σ, corresponding to the languages Lr and Ls respectively, each
of the following is a regular expression corresponding to the language indicated:
(rs) corresponding to the language LrLs
(r + s) corresponding to the language Lr ∪ Ls
r* corresponding to the language Lr*
Examples:
1. L(01) = {01}
2. L(01 + 0) = {01, 0}
3. L(0(1 + 0)) = {01, 00}
4. L(0*) = {ε, 0, 00, 000, …}
5. L((0 + 10)*(ε + 1)) = all strings of 0's and 1's without two consecutive 1's.
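Example 5 can be checked mechanically with Python's re module; a minimal sketch, where the pattern (0|10)*1? plays the role of (0 + 10)*(ε + 1):

import re
from itertools import product

# (0 + 10)*(ε + 1) as a Python pattern; the optional 1? plays the role of (ε + 1).
PATTERN = re.compile(r"(0|10)*1?")

def no_consecutive_ones(s: str) -> bool:
    return "11" not in s

# Compare the two descriptions on every string over {0, 1} up to length 6.
for n in range(7):
    for bits in product("01", repeat=n):
        s = "".join(bits)
        assert (PATTERN.fullmatch(s) is not None) == no_consecutive_ones(s)
print("(0|10)*1? matches exactly the strings with no two consecutive 1's")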
If L1 and L2 are regular languages in Σ*, then L1 ∪ L2, L1 ∩ L2, L1 − L2 and L̄1 (the complement of
L1) are all regular languages.
Pumping lemma is a useful tool to prove that a certain language is not regular.
Regular Expression
Regular expressions represent certain sets of strings in an algebraic fashion. A regular
expression over the alphabet Σ is defined as follows:
1. ∅, ε, and a (for each a ∈ Σ) are regular expressions.
2. If r and s are regular expressions, then (r + s), (rs), and r* are regular expressions.
3. Nothing else is a regular expression.
Regular Set
A set represented by a regular expression is called a regular set. E.g., if Σ = {a, b} is an alphabet, then
a*, (a + b)*, and ab(a + b)* all denote regular sets over Σ.
The following are some identities for regular expressions (R, P, Q denote regular expressions):
ϕ + R = R + ϕ = R
εR = Rε = R
R + R = R
(R*)* = R*
ϕR = Rϕ = ϕ
ε* = ε and ϕ* = ε
RR* = R*R = R+
R*R* = R*
(P + Q)* = (P*Q*)* = (P* + Q*)*
R(P + Q) = RP + RQ and (P + Q)R = PR + QR
P(QP)* = (PQ)*P
Regular sets are closed under the following operations:
1. Union
2. Concatenation
3. Kleene closure
4. Complementation
5. Transpose (reversal)
6. Intersection
Finite Automata, Deterministic Finite Automata(DFA)
Automata (singular: automaton) are a particularly simple, but useful, model of computation. They were
initially proposed as a simple model for the behavior of neurons.
States, Transitions and Finite-State Transition System:
Let us first give some intuitive idea about a state of a system and state transitions before describing finite
automata. Informally, a state of a system is an instantaneous description of that system which gives all
relevant information necessary to determine how the system can evolve from that point on.
Transitions are changes of states that can occur spontaneously or in response to inputs to the states.
Though transitions usually take time, we assume that state transitions are instantaneous (which is an
abstraction). Some examples of state transition systems are: digital systems, vending machines, etc. A
system containing only a finite number of states and transitions among them is called a finite-state
transition system. Finite-state transition systems can be modeled abstractly by a mathematical model
called a finite automaton.
Deterministic Finite (-state) Automata
Informally, a DFA (Deterministic Finite State Automaton) is a simple machine that reads an input string
-- one symbol at a time -- and then, after the input has been completely read, decides whether to accept or
reject the input. As the symbols are read from the tape, the automaton can change its state, to reflect how
it reacts to what it has seen so far. A machine for which a deterministic code can be formulated, and for
which there is only one unique way to formulate the code, is called a deterministic finite
automaton.
Thus, a DFA conceptually consists of 3 parts:
1. A tape to hold the input string. The tape is divided into a finite number of cells. Each cell
holds a symbol from Σ.
2. A tape head for reading symbols from the tape
3. A control, which itself consists of 3 things:
• a finite number of states that the machine is allowed to be in (zero or more states are designated as accept or final states),
• a current state, initially set to a start state,
• a state transition function for changing the current state.
Deterministic Finite State Automaton: A Deterministic Finite State Automaton (DFA) is
a 5-tuple M = (Q, Σ, δ, q0, F):
• Q is a finite set of states.
• Σ is a finite set of input symbols, the alphabet.
• δ : Q × Σ → Q is the "next state" transition function (which is total). Intuitively, δ is a function that tells
which state to move to in response to an input: if M is in state q and sees input a, it moves to state δ(q, a).
• q0 ∈ Q is the start state.
• F ⊆ Q is the set of accept or final states.
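As a concrete sketch, the 5-tuple can be written down directly in Python; the example DFA below (an assumed example, not one from the notes above) accepts binary strings with an even number of 1's:

# A DFA given as an explicit 5-tuple (Q, Σ, δ, q0, F); it accepts binary
# strings containing an even number of 1's.
Q = {"even", "odd"}
SIGMA = {"0", "1"}
DELTA = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd",   ("odd", "1"): "even"}
Q0 = "even"
F = {"even"}

def accepts(w: str) -> bool:
    """Start in q0, follow δ on each input symbol, accept iff the final state is in F."""
    state = Q0
    for a in w:
        state = DELTA[(state, a)]   # δ is total: defined for every (state, symbol)
    return state in F

print(accepts("1011"))   # False (three 1's)
print(accepts("1001"))   # True  (two 1's)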
Design of DFAs
The transition function δ extends naturally from symbols to strings: δ̂(q, ε) = q and δ̂(q, wa) = δ(δ̂(q, w), a).
That is, δ̂(q, w) is the state the automaton reaches when it starts from the state q and finishes processing
the string w.
ε-transitions:
In an ε-transition, the tape head does not do anything: it does not read and it does not move.
However, the state of the automaton can be changed; that is, the automaton can go to zero, one
or more states. This is written formally as δ(q, ε) = {p1, p2, …, pk}, implying that the next state
could be any one of p1, p2, …, pk without consuming the next input symbol.
Formal definition of NFA:
An NFA (Nondeterministic Finite Automaton) is a 5-tuple M = (Q, Σ, δ, q0, F), defined exactly as for a DFA
except that the transition function maps to sets of states: δ : Q × (Σ ∪ {ε}) → 2^Q.
It is worth noting that a DFA is a special type of NFA, and hence the class of languages accepted by DFAs
is a subset of the class of languages accepted by NFAs. Surprisingly, these two classes are in fact equal.
NFAs appeared to have more power than DFAs because of the generality enjoyed in terms of ε-transitions and
multiple next states, but they are no more powerful than DFAs in terms of the languages they accept.
ε-closure:
For any state q, ε-closure(q) is the set of states reachable from q using zero or more ε-transitions.
In the equivalent DFA, at every step, we need to modify the transition function to keep track of all
the states where the NFA can go on ε-transitions. This is done by replacing δ(q, a) with ε-closure(δ(q, a)).
Besides this, the initial state of the DFA D has to be modified to keep track of all the states that can be
reached from the initial state of the NFA on zero or more ε-transitions.
It is clear that, at every step in the processing of an input string by the DFA D, it enters a state that
corresponds to the subset of states that the NFA N could be in at that particular point. This has been
proved in the construction of an equivalent NFA for any ε-NFA.
If the number of states in the NFA is n, then there are 2^n states in the DFA; that is, each state in the
DFA is a subset of the states of the NFA. But it is important to note that most of these states are
inaccessible from the start state and hence can be removed from the DFA without changing the accepted
language. Thus, in fact, the number of states in the equivalent DFA would be much less than 2^n.
It is interesting to note that we can avoid encountering all those inaccessible or unnecessary states in
the equivalent DFA by performing the following two steps inductively:
1. If q0 is the start state of the NFA, then make ε-closure(q0) the start state of the equivalent
DFA. This is definitely the only accessible state initially.
2. For each DFA state S = {q1, …, qk} constructed so far and each input symbol a, add the DFA state
T = ε-closure(δ(q1, a) ∪ … ∪ δ(qk, a)), with a transition from S to T on a, if T is not already present.
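A minimal Python sketch of ε-closure and the accessible-subset construction, assuming an NFA encoded as a dict of transitions with "" standing for ε:

from collections import deque

def eps_closure(states, delta):
    """All states reachable from `states` by zero or more ε ("") moves."""
    closure, work = set(states), deque(states)
    while work:
        q = work.popleft()
        for p in delta.get((q, ""), ()):
            if p not in closure:
                closure.add(p)
                work.append(p)
    return frozenset(closure)

def subset_construction(delta, q0, alphabet):
    """Build only the accessible DFA states, starting from ε-closure({q0})."""
    start = eps_closure({q0}, delta)
    seen, work, dfa = {start}, deque([start]), {}
    while work:
        S = work.popleft()
        for a in alphabet:
            moved = set()
            for q in S:
                moved |= delta.get((q, a), set())
            T = eps_closure(moved, delta)
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    return start, dfa

# Example ε-NFA: from state 0 an ε-branch leads to a machine for "a" or one for "b".
delta = {(0, ""): {1, 3}, (1, "a"): {2}, (3, "b"): {4}}
start, dfa = subset_construction(delta, 0, {"a", "b"})
print(sorted(start))              # [0, 1, 3] -- the DFA start state
print(len({s for s, _ in dfa}))   # 4 accessible subset states (including the empty dead state)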
Arden's Theorem
In order to find a regular expression for a finite automaton, we use Arden's Theorem along with
the properties of regular expressions.
Statement −
If P does not contain the null string, then R = Q + RP has a unique solution, namely R = QP*.
Proof −
Substituting R into itself: R = Q + RP = Q + (Q + RP)P = Q + QP + RP^2.
When we put the value of R recursively again and again, we get the following equation −
R = Q + QP + QP^2 + QP^3 + …
R = Q(ε + P + P^2 + P^3 + …)
R = QP*
Hence, proved.
Example −
Find the regular expression for the automaton whose state equations are given below, where q1 is both
the start state and the final state:
q1 = q1a + q3a + ε
q2 = q1b + q2b + q3b
q3 = q2a
Solution
Substituting q3 = q2a into the equation for q2:
q2 = q1b + q2b + q2ab = q1b + q2(b + ab)
By Arden's theorem: q2 = q1b(b + ab)*.
Substituting q2 into the equation for q1:
q1 = q1a + q2aa + ε = q1a + q1b(b + ab)*aa + ε = q1(a + b(b + ab)*aa) + ε
By Arden's theorem:
q1 = ε(a + b(b + ab)*aa)*
= (a + b(b + ab)*aa)*
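The state equations encode a DFA: an equation such as q1 = q1a + q3a + ε says that q1 is reached from q1 or from q3 on input a, and that q1 is the start state. A minimal sketch for checking the derived expression against that DFA on all short strings:

import re
from itertools import product

# Transitions read off the state equations: a term q_i x in the equation for q_j
# means q_i --x--> q_j.
DELTA = {("q1", "a"): "q1", ("q3", "a"): "q1",
         ("q1", "b"): "q2", ("q2", "b"): "q2", ("q3", "b"): "q2",
         ("q2", "a"): "q3"}
START = FINAL = "q1"
REGEX = re.compile(r"(a|b(b|ab)*aa)*")   # the solution obtained above

def dfa_accepts(w: str) -> bool:
    state = START
    for ch in w:
        state = DELTA[(state, ch)]
    return state == FINAL

for n in range(9):
    for s in ("".join(p) for p in product("ab", repeat=n)):
        assert dfa_accepts(s) == (REGEX.fullmatch(s) is not None)
print("the regular expression agrees with the automaton on all strings up to length 8")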
In general, any regular expression X can be converted to an equivalent NFA, called NFAX,
containing a single start state and a single accepting state.
Case 1: base case
If X is a single symbol (or a sequence of symbols), NFAX is a chain of states whose transitions accept those symbols, connecting the start state to the accepting state.
Case 2: disjunction
If A and B are regular expressions whose equivalent NFAs are NFAA and NFAB, then we can
construct an NFA called NFAA|B that accepts the language generated by A|B as follows:
• use ε-transitions to connect the start state of NFAA|B to the start states of NFAA and NFAB;
• change the start states of NFAA and NFAB so that they are no longer start states;
• use ε-transitions to connect the accepting states of NFAA and NFAB to the accepting state of NFAA|B;
• change the accepting states of NFAA and NFAB so that they are no longer accepting states.
Case 3: Kleene closure
If A is a regular expression whose equivalent NFA is NFAA, then we can construct an NFA called
NFAA* which accepts the language generated by A* as follows:
• create ε-transitions from the start state of NFAA* to the start state of NFAA and to the accepting state of NFAA* (allowing zero repetitions); change the start state of NFAA so it is no longer a start state;
• create ε-transitions from the accepting state of NFAA to the accepting state of NFAA* and back to the start state of NFAA (allowing repetition); change the accepting state of NFAA so it is no longer an accepting state.
Case 4: concatenation
If A and B are regular expressions whose equivalent NFAs are NFAA and NFAB, then we can
construct an NFA called NFAAB that accepts the language generated by AB as follows:
• create an ε-transition from the start state of NFAAB to the start state of NFAA; change the start state of NFAA so it is not a start state;
• create an ε-transition from the accepting state of NFAA to the start state of NFAB; change the accepting state of NFAA so it is not an accepting state;
• create an ε-transition from the accepting state of NFAB to the accepting state of NFAAB; change the accepting state of NFAB so it is not an accepting state.
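These four cases are what Thompson's construction implements. A minimal Python sketch, where an NFA fragment is an assumed (start, accept, transitions) triple and "" stands for ε:

import itertools
_new_state = itertools.count()

def symbol(a):
    """Case 1: an NFA fragment for the single symbol a."""
    s, f = next(_new_state), next(_new_state)
    return (s, f, {(s, a): {f}})

def _merge(*tables):
    out = {}
    for t in tables:
        for key, targets in t.items():
            out.setdefault(key, set()).update(targets)
    return out

def union(A, B):
    """Case 2 (disjunction): ε-edges from a new start, and into a new accept."""
    s, f = next(_new_state), next(_new_state)
    return (s, f, _merge(A[2], B[2],
                         {(s, ""): {A[0], B[0]}, (A[1], ""): {f}, (B[1], ""): {f}}))

def star(A):
    """Case 3 (Kleene closure): ε-edges allow zero repetitions and looping back."""
    s, f = next(_new_state), next(_new_state)
    return (s, f, _merge(A[2], {(s, ""): {A[0], f}, (A[1], ""): {A[0], f}}))

def concat(A, B):
    """Case 4 (concatenation): ε-edge from A's accepting state to B's start."""
    return (A[0], B[1], _merge(A[2], B[2], {(A[1], ""): {B[0]}}))

# NFA for (a|b)*: one start state, one accepting state, ε-glued fragments.
start, accept, table = star(union(symbol("a"), symbol("b")))
print("start:", start, "accept:", accept, "transitions:", len(table))

Combined with the ε-closure and subset construction sketched earlier, this gives the classic RE → NFA → DFA pipeline mentioned below.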
Finite automata concepts are also used in various other fields. In the design of a compiler, they are used in the lexical
analyzer to produce tokens in the form of identifiers, keywords and constants from the input program. In pattern
recognition, they are used to search for keywords by means of string-matching algorithms.
Lex translates regular expressions into transition diagrams, then translates the transition diagrams into C code to
recognize tokens in the input stream. There are many possible algorithms; the simplest is
RE → NFA → DFA.
Unit – II
In the literary sense of the term, grammars denote syntactical rules for conversation in natural languages.
Linguists have attempted to define grammars since the inception of natural languages like English,
Sanskrit, Mandarin, etc. The theory of formal languages finds its applicability extensively in the fields of
Computer Science. Noam Chomsky gave a mathematical model of grammar in 1956 which is effective
for writing computer languages.
Grammar
A grammar G can be formally written as a 4-tuple (N, T, S, P), where N is a set of non-terminals (variables),
T is a set of terminal symbols, S is the start symbol (S ∈ N), and P is a set of production rules.
Example
Grammar G1 − ({S, A, B}, {a, b}, S, P)
Here,
Non-terminals: S, A, B; Terminals: a, b; Start symbol: S
Productions, P : S → AB, A → a, B → b
Strings may be derived from other strings using the productions in a grammar. If a grammar G has a
production α → β, we can say that xαy derives xβy in G. This derivation is written as −
xαy ⇒G xβy
Example
Using grammar G1: S ⇒ AB ⇒ aB ⇒ ab, so the string ab is derivable in G1.
According to Noam Chomsky, there are four types of grammars − Type 0, Type 1, Type 2, and Type 3.
The following table shows how they differ from each other −
Type 0   Unrestricted grammar         Recursively enumerable language    Turing machine
Type 1   Context-sensitive grammar    Context-sensitive language         Linear-bounded automaton
Type 2   Context-free grammar         Context-free language              Pushdown automaton
Type 3   Regular grammar              Regular language                   Finite automaton
The scope of each type of grammar is contained in the one above it: every Type 3 language is also Type 2, every Type 2 language is also Type 1, and every Type 1 language is also Type 0.
Type - 3 Grammar
Type-3 grammars generate regular languages. Type-3 grammars must have a single non-terminal on the
left-hand side and a right-hand side consisting of a single terminal or a single terminal followed by a single
non-terminal.
Example
X → ε
X → a | aY
Y → b
Type - 2 Grammar
Type-2 grammars generate context-free languages. The productions must be in the form A → γ, where
A ∈ N (non-terminal) and γ is a string of terminals and non-terminals. These languages are recognized by
a pushdown automaton.
Example
S→Xa
X → a
X → aX
X → abc
X→ε
Type - 1 Grammar
Type-1 grammars generate context-sensitive languages. The productions must be in the form
αAβ → αγβ
where A ∈ N (non-terminal) and α, β, γ are strings of terminals and non-terminals, with γ non-empty.
The rule S → ε is allowed if S does not appear on the right side of any rule. The languages generated
by these grammars are recognized by a linear bounded automaton.
Example
AB → AbBc
A → bcA
B→b
Type - 0 Grammar
Type-0 grammars generate recursively enumerable languages. The productions have no restrictions:
they form any phrase structure grammar, including all formal grammars.
The productions can be in the form α → β, where α is a string of terminals and non-terminals with at
least one non-terminal (α cannot be null), and β is a string of terminals and non-terminals.
Example
S → ACaB
Bc → acB
CB → DB
aD → Db
Derivation Tree
A derivation tree or parse tree is an ordered rooted tree that graphically represents the semantic information
of a string derived from a context-free grammar.
Representation Technique
Root vertex − Must be labeled by the start symbol.
Vertex − Labeled by a non-terminal symbol.
Leaves − Labeled by a terminal symbol or ε.
If S → x1x2 …… xn is a production rule in a CFG, then the parse tree / derivation tree has root S with n children labeled x1, x2, …, xn in left-to-right order.
Top-down Approach −
Starts from the root, which is the starting symbol S
Proceeds downward to the tree leaves
Bottom-up Approach −
Starts from the tree leaves
Proceeds upward to the root, which is the starting symbol S
We can remove the ambiguity by removing left recursion and by left factoring.
Left Recursion
Suppose the productions of a variable A are of the form
A → Aα1 | Aα2 | … | Aαn | β1 | β2 | … | βm
where β1, β2, …, βm do not begin with A. Then we replace the A-productions with
A → β1A1 | β2A1 | … | βmA1
A1 → α1A1 | α2A1 | α3A1 | … | αnA1 | ∧
Left Factoring
Two or more productions of a variable A of the grammar G = (VN, Σ, S, P) are said to have left
factoring if the productions are of the form
A → αβ1 | αβ2 | … | αβn | γ1 | γ2 | … | γm
where γ1, γ2, …, γm do not contain α as a prefix. Then we replace these productions with
A → αA1 | γ1 | γ2 | … | γm, where
A1 → β1 | β2 | … | βn
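A minimal Python sketch of immediate left-recursion removal, with an assumed grammar encoding (production bodies as lists of symbols, [] for ∧):

def remove_left_recursion(head, productions):
    """Rewrite A -> A a1 | ... | b1 | ... as A -> b1 A' | ...  and
    A' -> a1 A' | ... | ∧  (the empty body [] stands for ∧)."""
    alphas = [p[1:] for p in productions if p and p[0] == head]
    betas = [p for p in productions if not p or p[0] != head]
    if not alphas:
        return {head: productions}      # nothing to do
    new = head + "'"
    return {head: [beta + [new] for beta in betas],
            new: [alpha + [new] for alpha in alphas] + [[]]}

# E -> E + T | T   becomes   E -> T E'  and  E' -> + T E' | ∧
print(remove_left_recursion("E", [["E", "+", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}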
Eliminate the Useless Productions/Symbols
The symbols that cannot be used in any derivation, due to their unreachability from the start symbol or
their inability to derive terminal strings, are known as useless symbols.
Example: consider the grammar
S → aS | A | C
A → a
B → aa
C → aCb
Step 1: find the set W of variables that can derive strings of terminals:
W = {A, B, S}
Because C does not derive any terminal string, every production containing C is deleted. Now the modified
productions are
S → aS | A
A → a
B → aa
Step 2: find the variables reachable from S. In the dependency graph of these productions, the variable B
is not reachable from S, so it will be deleted also. Now the productions are
S → aS | A
A → a
Elimination of ε-Productions
Step 1: Find all the nullable variables (the variables that can derive ∧).
Step 2: Construct the productions that do not include the null productions.
E.g., consider the CFG with the following productions:
S → ABaC
A → BC
B → b | ∧
C → D | ∧
D → d
First find the nullable variables; initially the set is empty:
N = {}
N = {B, C}
N = {A, B, C}
Now form every production obtainable by dropping nullable variables in all possible combinations (except ∧ itself):
S → ABaC | BaC | AaC | ABa | aC | Ba | Aa | a
A → BC | B | C
B → b
C → D
D → d
The above grammar contains every possible combination except ∧. Now put this new grammar together with the
original grammar, without the null productions.
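The set N of nullable variables is computed as a fixpoint, exactly as the three lines above suggest; a minimal sketch under the same assumed grammar encoding:

def nullable_variables(grammar):
    """Iterate until no change: A is nullable if some production body
    is empty or consists only of nullable variables."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            for body in bodies:
                if all(sym in nullable for sym in body):   # [] (∧) is vacuously true
                    nullable.add(head)
                    changed = True
                    break
    return nullable

G = {"S": [["A", "B", "a", "C"]], "A": [["B", "C"]],
     "B": [["b"], []], "C": [["D"], []], "D": [["d"]]}
print(nullable_variables(G))   # {'A', 'B', 'C'}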
Elimination of Unit Productions
E.g., consider the grammar
S → Aa | B
A → a | bc | B
B → A | bb
Step 1: Find the unit pairs.
∵ S → B & B → A
∴ S → A
Step 2: Now write the productions without unit productions, adding the non-unit bodies reachable through
each unit pair:
S → Aa | bb | a | bc
A → a | bc | bb
B → bb | a | bc
Parsers
THE ROLE OF PARSER
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies that the
string can be generated by the grammar for the source language. It reports any syntax errors in the program.
It also recovers from commonly occurring errors so that it can continue processing its input.
PARSING
It is the process of analyzing a continuous stream of input in order to determine its grammatical structure
with respect to a given formal grammar.
Parse tree:
Graphical representation of a derivation or deduction is called a parse tree. Each interior node of the parse
tree is a non-terminal; the children of the node can be terminals or nonterminals.
Types of parsing:
1. Top-down parsing
2. Bottom-up parsing
Top-down parsing: A parser can start with the start symbol and try to transform it to the
input string. Example: LL parsers.
Bottom-up parsing: A parser can start with the input and attempt to rewrite it into the start
symbol. Example: LR parsers.
Top-down parsing:
Top-down parsing can be viewed as the problem of constructing a parse tree for the input string, starting from
the root and creating the nodes of the parse tree in preorder. Equivalently, top-down parsing can be viewed as
finding a leftmost derivation for an input string.
Example
The sequence of parse trees constructed during the parse corresponds to a leftmost derivation of the input.
At each step of a top-down parse, the key problem is that of determining the production to be
applied for a nonterminal, say A. Once an A-production is chosen, the rest of the parsing process
consists of "matching" the terminal symbols in the production body with the input string.
Recursive-Descent Parsing
A recursive-descent parsing program consists of a set of procedures, one for each
nonterminal. Execution begins with the procedure for the start symbol, which halts and announces success if its
procedure body scans the entire input string. Pseudocode for a typical nonterminal appears in Fig. 4.13.
Note that this pseudocode is nondeterministic, since it begins by choosing the A-production to apply in a
manner that is not specified.
General recursive-descent may require backtracking; that is, it may require repeated scans over the input.
However, backtracking is rarely needed to parse programming language constructs, so backtracking parsers
are not seen frequently.
Example for recursive-descent parsing:
A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop. Hence,
elimination of left recursion must be done before parsing. Consider the grammar for arithmetic expressions:
E → E + T | T
T → T * F | F
F → (E) | id
After eliminating the left recursion the grammar becomes:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
Now we can write the procedures for the grammar as follows:
Recursive procedures:
Procedure E( )
begin
T( );
EPRIME( );
end
Procedure EPRIME( )
begin
If input_symbol = '+' then
begin
ADVANCE( );
T( );
EPRIME( );
end
end
Procedure T( )
begin
F( );
TPRIME( );
end
Procedure TPRIME( )
begin
If input_symbol = '*' then
begin
ADVANCE( );
F( );
TPRIME( );
end
end
Procedure F( )
begin
If input_symbol = 'id' then ADVANCE( );
else if input_symbol = '(' then
begin
ADVANCE( );
E( );
If input_symbol = ')' then ADVANCE( );
else ERROR( );
end
else ERROR( );
end
Stack implementation:
To recognize input id+id*id (the marker ^ shows the current input position):
PROCEDURE        INPUT STRING
E( )             ^id+id*id
T( )             ^id+id*id
F( )             ^id+id*id
ADVANCE( )       id^+id*id
TPRIME( )        id^+id*id
EPRIME( )        id^+id*id
ADVANCE( )       id+^id*id
T( )             id+^id*id
F( )             id+^id*id
ADVANCE( )       id+id^*id
TPRIME( )        id+id^*id
ADVANCE( )       id+id*^id
F( )             id+id*^id
ADVANCE( )       id+id*id^
TPRIME( )        id+id*id^
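The same procedures translate almost line for line into runnable code; a minimal Python sketch (the token list and helper names are assumptions) that parses id + id * id:

class Parser:
    """Recursive descent for E -> T E', E' -> + T E' | ε,
    T -> F T', T' -> * F T' | ε, F -> ( E ) | id."""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0
    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "$"
    def advance(self):
        self.pos += 1
    def E(self):
        self.T(); self.EPRIME()
    def EPRIME(self):
        if self.peek() == "+":
            self.advance(); self.T(); self.EPRIME()
    def T(self):
        self.F(); self.TPRIME()
    def TPRIME(self):
        if self.peek() == "*":
            self.advance(); self.F(); self.TPRIME()
    def F(self):
        if self.peek() == "id":
            self.advance()
        elif self.peek() == "(":
            self.advance(); self.E()
            if self.peek() == ")":
                self.advance()
            else:
                raise SyntaxError("expected )")
        else:
            raise SyntaxError("expected id or (")

p = Parser(["id", "+", "id", "*", "id"])
p.E()
print("accepted" if p.peek() == "$" else "syntax error")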
LL(1)Parsing
The construction of a predictive parser is aided by two functions associated with a grammar G :
1. FIRST
2. FOLLOW
Rules for first( ):
1. If X is terminal, then FIRST(X) is {X}.
2. If X →ε is a production, then add ε to FIRST(X).
3. If X is non-terminal and X → aα is a production then add a to FIRST(X).
4. If X is a non-terminal and X → Y1Y2…Yk is a production, then place a in FIRST(X)
if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi−1); that is,
Y1…Yi−1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X).
Rules for follow( ):
1. If S is a start symbol, then FOLLOW(S) contains $.
2. If there is a production A→ αBβ, then everything in FIRST(β) except ε is placed in
follow(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then
everything in FOLLOW(A) is in FOLLOW(B).
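The FIRST rules can be run as a fixpoint computation; a minimal Python sketch under an assumed grammar encoding, using "" for ε:

def compute_first(grammar, terminals):
    """FIRST sets by the four rules, iterated to a fixpoint; "" stands for ε."""
    first = {t: {t} for t in terminals}
    first.update({head: set() for head in grammar})
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                if not body:
                    first[head].add("")               # rule 2: X -> ε
                for sym in body:
                    first[head] |= first[sym] - {""}  # rules 3 and 4
                    if "" not in first[sym]:
                        break
                else:
                    if body:
                        first[head].add("")           # every Yi can derive ε
                if len(first[head]) != before:
                    changed = True
    return first

G = {"E": [["T", "E'"]], "E'": [["+", "T", "E'"], []],
     "T": [["F", "T'"]], "T'": [["*", "F", "T'"], []],
     "F": [["(", "E", ")"], ["id"]]}
print(compute_first(G, {"+", "*", "(", ")", "id"})["E"])   # {'(', 'id'} (order may vary)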
LL(1) grammar:
If the parsing table entries are single entries, i.e., each location has no more than one entry, the
grammar is called an LL(1) grammar.
Consider the following grammar:
S → iEtS | iEtSeS | a
E → b
After eliminating left factoring, we have
S → iEtSS' | a
S' → eS | ε
E → b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non- terminals.
FIRST(S) = {i, a}
FIRST(S') = {e, ε}
FIRST(E) = {b}
FOLLOW(S) = {$, e}
FOLLOW(S') = {$, e}
FOLLOW(E) = {t}
Since the entry M[S', e] of the parsing table contains more than one production (both S' → eS and S' → ε), the grammar is not an LL(1) grammar.
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguous grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct predictive parsing table.
4. Parse the given input string using stack and parsing table.
Bottom-up parsers build parse trees from the leaves and work up to the root.
Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves (the
bottom) and working up towards the root (the top). At each reduction step a particular substring
matching the right side of a production is replaced by the symbol on the left of that production, and if the
substring is chosen correctly at each step, a rightmost derivation is traced out in reverse.
Example 2.7.1
Consider the grammar
S → aABe
A → Abc | b
B→d
The sentence abbcde can be reduced to S by the following steps.
abbcde
aAbcde
aAde
aABe
S
Handles:
A handle of a string is a substring that matches the right side of a production, and whose reduction
to the nonterminal on the left side of the production represents one step along the reverse of a rightmost
derivation.
Handle Pruning:
A rightmost derivation in reverse can be obtained by handle pruning; i.e., start with a string of
terminals w that is to be parsed. If w is a sentence of the grammar at hand, then w = γn, where γn is the nth right-
sentential form of some as yet unknown rightmost derivation
S = γ0 ⇒ γ1 ⇒ γ2 ⇒ … ⇒ γn−1 ⇒ γn = w.
Example of right-sentential forms and handles for the grammar
E → E + E
E → E * E
E → (E)
E → id
For instance, in the right-sentential form E + E * id3 the handle is id3 (reduced by E → id), and in
E + E * E the handle is E * E (reduced by E → E * E).
A shift-reduce parser uses a parse stack which (conceptually) contains grammar symbols. During the
operation of the parser, symbols from the input are shifted onto the stack. If a prefix of the symbols on top of the
stack matches the RHS of a grammar rule which is the correct rule to use within the current context, then the
parser reduces the RHS of the rule to its LHS, replacing the RHS symbols on top of the stack with the
nonterminal occurring on the LHS of the rule. This shift-reduce process continues until the parser terminates,
reporting either success or failure. It terminates with success when the input is legal and is accepted by the
parser. It terminates with failure if an error is detected in the input. The parser is nothing but a stack automaton
which may be in one of several discrete states. A state is usually represented simply as an integer. In reality, the
parse stack contains states, rather than grammar symbols. However, since each state corresponds to a unique
grammar symbol, the state stack can be mapped onto the grammar symbol stack mentioned earlier. The operation
of the parser is controlled by a couple of tables.
ACTION TABLE
The action table is a table with rows indexed by states and columns indexed by terminal symbols. When
the parser is in some state s and the current lookahead terminal is t, the action taken by the parser depends on
the contents of action[s][t], which can contain four different kinds of entries: shift, reduce, accept, and error.
GOTO TABLE
The goto table is a table with rows indexed by states and columns indexed by nonterminal symbols.
When the parser is in state s immediately after reducing by rule N, then the next state to enter is given by
goto[s][N].
The accepting configuration is:
Stack        Input
$S           $
After entering this configuration, the parser halts and announces successful completion of parsing.
There are four possible actions that a shift-reduce parser can make: 1) shift 2) reduce 3) accept 4) error.
In a shift action, the next symbol is shifted onto the top of the stack.
In a reduce action, the parser knows the right end of the handle is at the top of the stack. It must then
locate the left end of the handle within the stack and decide with what nonterminal to replace the handle.
In an accept action, the parser announces successful completion of parsing.
In an error action, the parser discovers that a syntax error has occurred and calls an error recovery
routine.
Note: an important fact that justifies the use of a stack in shift-reduce parsing: the handle will always
appear on top of the stack, and never inside.
Example 2.8.1
Consider the grammar
E→E+E
E→E*E
E→(E)
E → id
and the input string id1 + id2 * id3. Use the shift-reduce parser to check whether the input
string is accepted by the grammar.
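One possible sequence of moves for this input, resolving the shift-reduce conflict at E + E in favor of shifting (so that * binds tighter than +), is:
Stack          Input             Action
$              id1+id2*id3$      shift id1
$id1           +id2*id3$         reduce by E → id
$E             +id2*id3$         shift +
$E+            id2*id3$          shift id2
$E+id2         *id3$             reduce by E → id
$E+E           *id3$             shift * (a reduce by E → E + E is also possible here)
$E+E*          id3$              shift id3
$E+E*id3       $                 reduce by E → id
$E+E*E         $                 reduce by E → E * E
$E+E           $                 reduce by E → E + E
$E             $                 accept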
Conflicts during shift-reduce parsing:
1. Initialize the parse stack to contain the start state.
2. Use the state s on top of the parse stack and the current lookahead t to consult the action table entry
action[s][t]:
If the action table entry is shift s', then push state s' onto the stack and advance the input so that the
lookahead is set to the next token.
If the action table entry is reduce r and rule r has m symbols in its RHS, then pop m symbols off the parse
stack. Let s' be the state now revealed on top of the parse stack and N be the LHS nonterminal for rule r.
Then consult the goto table and push the state given by goto[s'][N] onto the stack. The lookahead token is
not changed by this step.
If the action table entry is accept, then terminate the parse with success.
If the action table entry is error, then signal an error.
3. Repeat step (2) until the parser terminates.
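The driver loop is the same for every LR variant. A minimal Python sketch with a hand-built table (the table encoding is an assumption) for the grammar 1) E → E + T, 2) E → T, 3) T → id:

# Table-driven LR driver for the tiny grammar 1) E -> E + T  2) E -> T  3) T -> id
ACTION = {
    (0, "id"): ("shift", 3),
    (1, "+"): ("shift", 4), (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "+"): ("reduce", 3), (3, "$"): ("reduce", 3),
    (4, "id"): ("shift", 3),
    (5, "+"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "T"): 5}
RULES = {1: ("E", 3), 2: ("E", 1), 3: ("T", 1)}   # rule -> (LHS, RHS length)

def lr_parse(tokens):
    stack = [0]                        # the parse stack holds states
    tokens = tokens + ["$"]
    i = 0
    while True:
        entry = ACTION.get((stack[-1], tokens[i]))
        if entry is None:
            return False               # blank entry: error
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)
            i += 1
        elif kind == "reduce":
            lhs, m = RULES[arg]
            del stack[-m:]             # pop m states
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True                # accept

print(lr_parse(["id", "+", "id"]))     # True
print(lr_parse(["id", "+"]))           # False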
Model of LR Parsers
There are three types of LR parsers: LR(k), simple LR(k), and lookahead LR(k) (abbreviated LR(k),
SLR(k), and LALR(k)). The k identifies the number of tokens of lookahead. We will usually only concern ourselves
with 0 or 1 tokens of lookahead, but the techniques do generalize to k > 1. The different classes of parsers all
operate the same way (as shown above, being driven by their action and goto tables), but they differ in how their
action and goto tables are constructed, and the size of those tables.
We will consider LR(0) parsing first, which is the simplest of all the LR parsing methods. It is also the
weakest and although of theoretical importance, it is not used much in practice because of its limitations. LR(0)
parses without using any lookahead at all.
Adding just one token of lookahead to get LR(1) vastly increases the parsing power. Very few grammars
can be parsed with LR(0), but most unambiguous CFGs can be parsed with LR(1). The drawback of adding the
lookahead is that the algorithm becomes somewhat more complex and the parsing table gets much, much bigger.
The full LR(1) parsing table for a typical programming language has many thousands of states compared to the
few hundred needed for LR (0). A compromise in the middle is found in the two variants SLR(1) and LALR(1)
which also use one token of lookahead but employ techniques to keep the table as small as LR(0). SLR(k) is an
improvement over LR(0) but much weaker than full LR(k) in terms of the number of grammars for which it is
applicable. LALR(k) parses a larger set of languages than SLR(k) but not quite as many as LR(k). LALR(1) is
the method used by the yacc parser generator.
LL                                               LR
Does a leftmost derivation.                      Does a rightmost derivation in reverse.
Starts with the root nonterminal on the stack.   Ends with the root nonterminal on the stack.
Ends when the stack is empty.                    Starts with an empty stack.
Uses the stack for designating what is still     Uses the stack for designating what is already
to be expected.                                  seen.
Builds the parse tree top-down.                  Builds the parse tree bottom-up.
Continuously pops a nonterminal off the stack,   Tries to recognize a right hand side on the
and pushes the corresponding right hand side.    stack, pops it, and pushes the corresponding
                                                 nonterminal.
Expands the non-terminals.                       Reduces the non-terminals.
Reads the terminals when it pops one off         Reads the terminals while it pushes them on
the stack.                                       the stack.
Pre-order traversal of the parse tree.           Post-order traversal of the parse tree.
The initial state of the DFA (state 0) is the closure of the item S ::= . a $, where S ::= a$ is the first rule.
In simple words, if there is an item X ::= a . s b in an item set,where s is a symbol (terminal or nonterminal), we
have a transition labelled by s to an item set that contains X ::= a s . b. But it's a little bit more complex
than that:
If we have more than one item with a dot before the same symbol s, say X ::= a . s b and Y ::= c . s d,
then the new item set contains both X ::= a s . b and Y ::= c s . d.
We need to get the closure of the new item set.
We have to check whether this item set has appeared before, so that we don't create it
again.
For example, the grammar
1) S ::= E $
2) E ::= E + T
3)      | T
4) T ::= id
5)      | ( E )
has the following item sets:
I0: S ::= . E $
    E ::= . E + T
    E ::= . T
    T ::= . id
    T ::= . ( E )
I1: S ::= E . $
    E ::= E . + T
I2: S ::= E $ .
I3: E ::= E + . T
    T ::= . id
    T ::= . ( E )
I4: E ::= E + T .
I5: T ::= id .
I6: T ::= ( . E )
    E ::= . E + T
    E ::= . T
    T ::= . id
    T ::= . ( E )
I7: T ::= ( E . )
    E ::= E . + T
I8: T ::= ( E ) .
I9: E ::= T .
The ACTION and GOTO tables that correspond to this DFA are filled in from these item sets. As a further
example, consider the grammar:
0) S' ::= S $
1) S ::= B B
2) B ::= a B
3) B ::= c
Recall that in the SLR method, state i calls for reduction by A → α if the set of items Ii contains
item [A → α·] and a is in FOLLOW(A). In some situations, however, when state Ii appears on top of the
stack, the viable prefix βα on the stack is such that βA cannot be followed by a in any right-sentential form.
Thus, the reduction by A → α would be invalid on input a.
Formally, we say LR(1) item [A → α·β, a] is valid for a viable prefix γ if there is a derivation
S ⇒* δAω ⇒ δαβω, where γ = δα and either a is the first symbol of ω, or ω is ε and a is $.
Example: consider the grammar
S → BB
B → aB | b
There is a rightmost derivation S ⇒* aaBab ⇒ aaaBab. We see that item [B → a·B, a] is valid for a viable
prefix γ = aaa by letting δ = aa, A = B, ω = ab, α = a, and β = B in the above definition. There is also a
rightmost derivation
S ⇒* BaB ⇒ BaaB.
From this derivation we see that item [B → a·B, $] is valid for viable prefix Baa.
Output: The sets of LR(1) items that are the sets of items valid for one or more viable prefixes of G'.
Example: Consider the following augmented grammar −
S' → S
S → CC
C → cC | d
The canonical collection of LR(1) items is as follows:
I0: S' → ·S, $
    S → ·CC, $
    C → ·cC, c/d
    C → ·d, c/d
I1: S' → S·, $
I2: S → C·C, $
    C → ·cC, $
    C → ·d, $
I3: C → c·C, c/d
    C → ·cC, c/d
    C → ·d, c/d
I4: C → d·, c/d
I5: S → CC·, $
I6: C → c·C, $
    C → ·cC, $
    C → ·d, $
I7: C → d·, $
I8: C → cC·, c/d
I9: C → cC·, $
Method:
• If [A → α·aβ, b] is in Ii and goto(Ii, a) = Ij, then set ACTION[i, a] = shift j.
• If [A → α·, a] is in Ii and A ≠ S', then set ACTION[i, a] = reduce by A → α.
• If [S' → S·, $] is in Ii, then set ACTION[i, $] = accept.
If a conflict results from the above rules, the grammar is said not to be LR(1), and the algorithm is said to fail.
States:
I0: S' → ·S, $
    S → ·CC, $
    C → ·cC, c/d
    C → ·d, c/d
I1: S' → S·, $
I2: S → C·C, $
    C → ·cC, $
    C → ·d, $
I3: C → c·C, c/d
    C → ·cC, c/d
    C → ·d, c/d
The remaining states I4 − I9 are as listed above.
State        ACTION                 GOTO
             c      d      $        S     C
0            s3     s4              1     2
1                          acc
2            s6     s7                    5
3            s3     s4                    8
4            r3     r3
5                          r1
6            s6     s7                    9
7                          r3
8            r2     r2
9                          r2
NOTE: For goto graph see the construction used in Canonical LR.
State        ACTION                 GOTO
             c      d      $        S     C
0            s36    s47             1     2
1                          acc
2            s36    s47                   5
36           s36    s47                   89
47           r3     r3     r3
5                          r1
89           r2     r2     r2
UNIT-3
Syntax-Directed Definitions
A syntax-directed definition uses a CFG to specify the syntactic structure of the input.
A syntax-directed definition associates a set of attributes with each grammar symbol.
A syntax-directed definition associates a set of semantic rules with each production rule.
Consider a production
X → Y Z
and let the nodes X, Y and Z have associated attributes X.a, Y.a and Z.a respectively.
Synthesized Attributes
An attribute is synthesized if its value at a parent node can be determined from the attributes of its children.
Since, in this example, the value of attribute a at node X can be determined from the a attributes of the Y
and Z nodes, a is a synthesized attribute.
Synthesized attributes can be evaluated by a single bottom-up traversal of the parse tree.
Example 2.6: Following figure shows the syntax-directed definition of an infix-to-postfix translator.
Attribute Grammar
Attribute grammar is a special form of context-free grammar where some additional
information (attributes) is appended to one or more of its non-terminals in order to provide
context-sensitive information. Each attribute has a well-defined domain of values, such as
integer, float, character, string, and expressions.
Example:
E → E + T {E.value=E.value+T.value}
The right part of the CFG contains the semantic rules that specify how the grammar should
be interpreted. Here, the values of non-terminals E and T are added together and the result
is copied to the non-terminal E.
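The rule E.value = E.value + T.value can be watched in action by evaluating a small parse tree bottom-up; a minimal Python sketch with an assumed tree encoding:

class Node:
    def __init__(self, kind, children=(), value=None):
        self.kind, self.children, self.value = kind, list(children), value

def evaluate(node):
    """One bottom-up traversal: evaluate the children, then apply the parent's rule."""
    for child in node.children:
        evaluate(child)
    if node.kind == "add":      # E -> E + T   { E.value = E.value + T.value }
        node.value = node.children[0].value + node.children[1].value
    elif node.kind == "mul":    # T -> T * F   { T.value = T.value * F.value }
        node.value = node.children[0].value * node.children[1].value
    return node.value           # leaves ("num") already carry their value

# Parse tree for 2 + 3 * 4
tree = Node("add", [Node("num", value=2),
                    Node("mul", [Node("num", value=3), Node("num", value=4)])])
print(evaluate(tree))   # 14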
Semantic attributes may be assigned values from their domain at the time of
parsing and evaluated at the time of assignment or conditions. Based on the way the
attributes get their values, they can be broadly divided into two categories : synthesized
attributes and inherited attributes.
Synthesized attributes
These attributes get values from the attribute values of their child nodes. To illustrate,
assume the following production:
S → ABC
If S is taking values from its child nodes (A,B,C), then it is said to be a synthesized attribute,
as the values of ABC are synthesized to S.
As in our previous example (E → E + T), the parent node E gets its value from its child node.
Synthesized attributes never take values from their parent nodes or any sibling nodes.
Inherited attributes
In contrast to synthesized attributes, inherited attributes can take values from parent and/or
siblings. As in the following production,
S → ABC
A can get values from S, B and C. B can take values from S, A, and C. Likewise, C can take
values from S, A, and B.
Semantic analysis uses syntax-directed translations to perform these tasks.
The semantic analyzer receives the AST (Abstract Syntax Tree) from its previous stage (syntax
analysis).
The semantic analyzer attaches attribute information to the AST, which is then called an attributed AST.
S-attributed SDT
If an SDT uses only synthesized attributes, it is called an S-attributed SDT. S-attributed SDTs
are evaluated in a bottom-up parsing manner, with the semantic actions written at the end of the
production (after the right-hand side).
L-attributed SDT
This form of SDT uses both synthesized and inherited attributes with restriction of not taking
values from right siblings.
In L-attributed SDTs, a non-terminal can get values from its parent, child, and sibling nodes.
As in the following production
S → ABC
S can take values from A, B, and C (synthesized). A can take values from S only. B can take
values from S and A. C can get values from S, A, and B. No non-terminal can get values from
the sibling to its right.
Attributes in L-attributed SDTs are evaluated in a depth-first, left-to-right traversal.
We may conclude that if a definition is S-attributed, then it is also L-attributed, as an
L-attributed definition encloses S-attributed definitions.
Intermediate code can be either language specific (e.g., Byte Code for Java) or language independent (three-address
code).
Three-Address Code
The intermediate code generator receives input from its predecessor phase, the semantic analyzer, in the form of an
annotated syntax tree. That syntax tree can then be converted into a linear representation, e.g., postfix notation.
Intermediate code tends to be machine-independent code; therefore, the code generator assumes an unlimited number
of memory locations (registers) to generate code.
For example:
a = b + c * d;
The intermediate code generator will try to divide this expression into sub-expressions and then generate the
corresponding code.
r1 = c * d;
r2 = b + r1;
a = r2
A three-address code has at most three address locations to calculate the expression. A three-address code can be
represented in two forms : quadruples and triples.
Quadruples
Each instruction in quadruples presentation is divided into four fields: operator, arg1, arg2, and result. The above
example is represented below in quadruples format:
Op    arg1    arg2    result
*     c       d       r1
+     b       r1      r2
=     r2              a
Triples
Each instruction in triples presentation has three fields: op, arg1, and arg2. The results of the respective
sub-expressions are denoted by the position of the expression. Triples bear similarity to DAGs and syntax trees;
they are equivalent to DAGs while representing expressions.
Op    arg1    arg2
*     c       d
+     b       (0)
=     a       (1)
Triples face the problem of code immovability during optimization, as the results are positional and changing the
order or position of an expression may cause problems.
Indirect Triples
This representation is an enhancement over triples representation. It uses pointers instead of position to store results.
This enables the optimizers to freely re-position the sub-expression to produce an optimized code.
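A minimal Python sketch of a generator that emits quadruples for an expression tree (the AST encoding is an assumption), reproducing the example above:

_temp = 0
def new_temp():
    global _temp
    _temp += 1
    return f"r{_temp}"

def gen(node, code):
    """Emit quadruples (op, arg1, arg2, result) for an expression tree.
    Leaves are variable names; internal nodes are (op, left, right)."""
    if isinstance(node, str):
        return node
    op, left, right = node
    a1, a2 = gen(left, code), gen(right, code)
    t = new_temp()
    code.append((op, a1, a2, t))
    return t

code = []
result = gen(("+", "b", ("*", "c", "d")), code)   # a = b + c * d
code.append(("=", result, "", "a"))
for quad in code:
    print(quad)
# ('*', 'c', 'd', 'r1')
# ('+', 'b', 'r1', 'r2')
# ('=', 'r2', '', 'a')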