Unit 4
A pushdown automaton (PDA) is used to design the syntax analysis phase.
The Grammar for a Language consists of Production rules.
Example: Suppose Production rules for the Grammar of a language are:
S -> cAd
A -> bc|a
And the input string is “cad”.
Now the parser attempts to construct a syntax tree from this grammar for the given input
string. It uses the given production rules and applies those as needed to generate the string.
To generate the string “cad” it applies the rules step by step:
(i) S
(ii) cAd [apply S -> cAd]
(iii) cbcd [apply A -> bc]
(iv) cad [backtrack, then apply A -> a]
In step (iii) above, the production rule A->bc was not a suitable one to apply (because the
string produced is “cbcd” not “cad”), here the parser needs to backtrack, and apply the
next production rule available with A which is shown in step (iv), and the string “cad” is
produced.
Thus, the given input can be produced by the given grammar, therefore the input is correct
in syntax. But backtrack was needed to get the correct syntax tree, which is really a
complex process to implement.
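The backtracking idea described above can be sketched in a few lines of Python. This is an illustrative toy, not a production parser; the grammar is hard-coded as the one from the example.

```python
# A tiny backtracking parser for the grammar S -> cAd, A -> bc | a.
# match() returns every input position reachable after deriving `symbol`;
# trying each alternative in turn and discarding dead ends is exactly the
# backtracking the text describes.

GRAMMAR = {
    "S": [["c", "A", "d"]],
    "A": [["b", "c"], ["a"]],
}

def match(symbol, s, i):
    """Return every index reachable after deriving `symbol` from s[i:]."""
    if symbol not in GRAMMAR:                    # terminal symbol
        return {i + 1} if i < len(s) and s[i] == symbol else set()
    positions = set()
    for production in GRAMMAR[symbol]:           # try each alternative...
        ends = {i}
        for sym in production:
            ends = {j for e in ends for j in match(sym, s, e)}
        positions |= ends                        # ...keeping any that survive
    return positions

def accepts(s):
    return len(s) in match("S", s, 0)
```

For example, `accepts("cad")` succeeds only after the A -> bc alternative fails and the parser falls back to A -> a.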
There can be an easier way to solve this, which we shall see in the next article “Concepts
of FIRST and FOLLOW sets in Compiler Design”.
Advantages
Advantages of using syntax analysis in compiler design include:
Structural validation: Syntax analysis allows the compiler to check if the source code
follows the grammatical rules of the programming language, which helps to detect
and report errors in the source code.
Improved code generation: Syntax analysis can generate a parse tree or abstract
syntax tree (AST) of the source code, which can be used in the code generation phase
of the compiler design to generate more efficient and optimized code.
Easier semantic analysis: Once the parse tree or AST is constructed, the compiler can
perform semantic analysis more easily, as it can rely on the structural information
provided by the parse tree or AST.
Disadvantages
Disadvantages of using syntax analysis in compiler design include:
Complexity: Parsing is a complex process, and the quality of the parser can greatly
impact the performance of the resulting code. Implementing a parser for a complex
programming language can be a challenging task, especially for languages with
ambiguous grammars.
Reduced performance: Syntax analysis can add overhead to the compilation process,
which can reduce the performance of the compiler.
Limited error recovery: Syntax analysis algorithms may not be able to recover from
errors in the source code, which can lead to incomplete or incorrect parse trees and
make it difficult for the compiler to continue the compilation process.
Inability to handle all languages: Not all languages have formal grammars, and some
languages may not be easily parseable.
Overall, syntax analysis is an important stage in the compiler design process, but its
cost in parser complexity and compilation time should be balanced against these
benefits.
Syntax analysis, also known as parsing, is a crucial stage in the process of compiling a
program. Its primary task is to analyze the structure of the input program and check
whether it conforms to the grammar rules of the programming language. This process
involves breaking down the input program into a series of tokens and then constructing a
parse tree or abstract syntax tree (AST) that represents the hierarchical structure of the
program.
Steps in Syntax Analysis Phase
Tokenization: The input program is divided into a sequence of tokens, which are
basic building blocks of the programming language, such as identifiers, keywords,
operators, and literals.
Parsing: The tokens are analyzed according to the grammar rules of the
programming language, and a parse tree or AST is constructed that represents the
hierarchical structure of the program.
Error handling: If the input program contains syntax errors, the syntax analyzer
detects and reports them to the user, along with an indication of where the error
occurred.
Symbol table creation: The syntax analyzer creates a symbol table, which is a data
structure that stores information about the identifiers used in the program, such as
their type, scope, and location.
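The tokenization step listed above can be sketched as follows. The token categories mirror the ones named in the text (keywords, identifiers, operators, literals), but the toy patterns and keyword list are illustrative assumptions, not any real language's lexical specification.

```python
import re

# A minimal tokenizer sketch: classify the input into (kind, text) pairs.
# Characters matching no rule are silently skipped in this sketch; a real
# lexer would report them as errors.

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),          # integer literals
    ("ID",     r"[A-Za-z_]\w*"), # identifiers (or keywords)
    ("OP",     r"[+\-*/=()]"),   # single-character operators
    ("SKIP",   r"\s+"),          # whitespace, discarded
]
KEYWORDS = {"if", "while", "return"}

def tokenize(code):
    tokens = []
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    for m in re.finditer(pattern, code):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "ID" and text in KEYWORDS:   # promote reserved words
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens
```

The resulting token stream is what the parsing step consumes.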
The syntax analysis phase is essential for the subsequent stages of the compiler, such
as semantic analysis, code generation, and optimization. If the syntax analysis is not
performed correctly, the compiler may generate incorrect code or fail to compile the
program altogether.
Why FOLLOW?
The parser faces one more problem. Let us consider below grammar to understand this
problem.
A -> aBb
B -> c | ε
And suppose the input string is “ab” to parse.
As the first character in the input is a, the parser applies the rule A->aBb.
A
/| \
a B b
Now the parser checks for the second character of the input string which is b, and the
Non-Terminal to derive is B, but the parser can’t get any string derivable from B that
contains b as first character. But the Grammar does contain a production rule B -> ε, if
that is applied then B will vanish, and the parser gets the input “ab”, as shown below.
But the parser can apply it only when it knows that the character that follows B in the
production rule is same as the current character in the input. In RHS of A -> aBb, b
follows Non-Terminal B, i.e. FOLLOW(B) = {b}, and the current input character read is
also b. Hence the parser applies this rule. And it is able to get the string “ab” from the
given grammar.
    A                A
   /|\      =>      / \
  a B b            a   b
    |
    ε
So FOLLOW can make a Non-terminal vanish when needed to generate the string from
the parse tree. The conclusion is that we need to find the FIRST and FOLLOW sets for a
given grammar so that the parser can properly apply the needed rule at the correct
position. In the next article, we will discuss formal definitions of FIRST and FOLLOW,
and some easy rules to compute these sets.
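The decision just described can be sketched directly in code: for the grammar A -> aBb, B -> c | ε, the parser applies B -> c when the lookahead is in FIRST(c) = {c} and B -> ε when it is in FOLLOW(B) = {b}, with no backtracking. A minimal sketch, with the two sets hard-coded for this grammar:

```python
# Predictive parsing of A -> aBb with B -> c | ε, driven by FIRST/FOLLOW.

FIRST_B = {"c"}    # FIRST of B's non-ε alternative
FOLLOW_B = {"b"}   # what may legally follow B

def parse_A(s):
    i = 0
    # A -> aBb: match 'a', expand B, match 'b'
    if i < len(s) and s[i] == "a":
        i += 1
    else:
        return False
    if i < len(s) and s[i] in FIRST_B:      # apply B -> c
        i += 1
    elif i < len(s) and s[i] in FOLLOW_B:   # apply B -> ε (B vanishes)
        pass
    else:
        return False
    if i < len(s) and s[i] == "b":
        i += 1
    else:
        return False
    return i == len(s)
```

On input "ab" the lookahead b is in FOLLOW(B), so B vanishes; on "acb" the lookahead c selects B -> c.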
Notes:
1. The grammar used above is Context-Free Grammar (CFG). Syntax of most
programming languages can be specified using CFG.
2. CFG is of the form A -> B, where A is a single Non-Terminal, and B can be a string of
grammar symbols (i.e. Terminals as well as Non-Terminals).
Definition: The FIRST set of a nonterminal symbol is the set of terminal symbols that
can appear as the first symbol in a string derived from that nonterminal. In other words,
it is the set of all possible starting symbols for a string derived from that nonterminal.
Calculation: The FIRST set for each nonterminal symbol is calculated by examining
the productions for that symbol and determining which terminal symbols can appear as
the first symbol in a string derived from that production.
Recursive Descent Parsing: The FIRST set is often used in recursive descent parsing,
which is a top-down parsing technique that uses the FIRST set to determine which
production to use at each step of the parsing process.
Ambiguity Resolution: The FIRST set can help resolve ambiguities in the grammar by
providing a way to determine which production to use based on the next input symbol.
Follow Set: The FOLLOW set is another concept used in syntax analysis that represents
the set of symbols that can appear immediately after a nonterminal symbol in a
derivation. The FOLLOW set is often used in conjunction with the FIRST set to resolve
parsing conflicts and ensure that the parser can correctly identify the structure of the
input code.
Example 1:
Production Rules:
E -> TE’
E’ -> +TE’ | ε
T -> F T’
T’ -> *F T’ | ε
F -> (E) | id
FIRST set
FIRST(E) = FIRST(T) = { ( , id }
FIRST(E’) = { +, ε }
FIRST(T) = FIRST(F) = { ( , id }
FIRST(T’) = { *, ε }
FIRST(F) = { ( , id }
FOLLOW Set
FOLLOW(E) = { $ , ) } // Note ')' is there because of the production F -> (E)
FOLLOW(E’) = FOLLOW(E) = { $, ) } // See the first production, E -> TE’
FOLLOW(T) = { FIRST(E’) – ε } U FOLLOW(E’) U FOLLOW(E) = { + , $ , ) }
FOLLOW(T’) = FOLLOW(T) = { +, $, ) }
FOLLOW(F) = { FIRST(T’) – ε } U FOLLOW(T’) U FOLLOW(T) = { *, +, $, ) }
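The sets above can be computed mechanically by a fixed-point iteration. The sketch below encodes Example 1's grammar (E' and T' are spelled Ep and Tp, and an empty right-hand side stands for ε); the algorithm is the standard one, while the encoding details are illustrative choices.

```python
# Fixed-point computation of FIRST and FOLLOW for Example 1's grammar.

GRAMMAR = {
    "E":  [["T", "Ep"]],
    "Ep": [["+", "T", "Ep"], []],   # E' -> +TE' | ε  ([] is ε)
    "T":  [["F", "Tp"]],
    "Tp": [["*", "F", "Tp"], []],   # T' -> *FT' | ε
    "F":  [["(", "E", ")"], ["id"]],
}
START = "E"
EPS = "ε"

def first_of_seq(seq, first):
    """FIRST of a string of grammar symbols."""
    out = set()
    for sym in seq:
        f = first.get(sym, {sym})    # a terminal's FIRST is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                     # every symbol in seq can vanish
    return out

def compute_first_follow():
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add("$")           # end-marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                # FIRST rule: FIRST(A) grows by FIRST of each RHS
                f = first_of_seq(prod, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                # FOLLOW rule: for A -> αBβ, FOLLOW(B) gains
                # FIRST(β) - {ε}, plus FOLLOW(A) if β is nullable
                for i, sym in enumerate(prod):
                    if sym not in GRAMMAR:
                        continue
                    tail = first_of_seq(prod[i + 1:], first)
                    add = (tail - {EPS}) | (follow[nt] if EPS in tail else set())
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return first, follow
```

Running it reproduces the sets listed above, e.g. FOLLOW(F) = { *, +, $, ) }.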
Example 2:
Production Rules:
S -> aBDh
B -> cC
C -> bC | ε
D -> EF
E -> g | ε
F -> f | ε
FIRST set
FIRST(S) = { a }
FIRST(B) = { c }
FIRST(C) = { b , ε }
FIRST(D) = FIRST(E) U FIRST(F) = { g, f, ε }
FIRST(E) = { g , ε }
FIRST(F) = { f , ε }
FOLLOW Set
FOLLOW(S) = { $ }
FOLLOW(B) = { FIRST(D) – ε } U { h } = { g , f , h }
FOLLOW(C) = FOLLOW(B) = { g , f , h }
FOLLOW(D) = { h }
FOLLOW(E) = { FIRST(F) – ε } U FOLLOW(D) = { f , h }
FOLLOW(F) = FOLLOW(D) = { h }
Example 3:
Production Rules:
S -> ACB | Cbb | Ba
A -> da | BC
B -> g | ε
C -> h | ε
FIRST set
FIRST(S) = FIRST(ACB) U FIRST(Cbb) U FIRST(Ba) = { d, g, h, ε, b, a }
FIRST(A) = { d } U { FIRST(B) – ε } U FIRST(C) = { d, g, h, ε }
FIRST(B) = { g, ε }
FIRST(C) = { h, ε }
FOLLOW Set
FOLLOW(S) = { $ }
FOLLOW(A) = { h, g, $ }
FOLLOW(B) = { a, $, h, g }
FOLLOW(C) = { b, g, $, h }
Note:
1. ε never appears in a FOLLOW set, since ε is the empty string and not an input
symbol.
2. $ is called end-marker, which represents the end of the input string, hence used while
parsing to indicate that the input string has been completely processed.
3. The grammar used above is Context-Free Grammar (CFG). The syntax of a
programming language can be specified using CFG.
4. CFG is of the form A -> B, where A is a single Non-Terminal, and B can be a string of
grammar symbols (i.e. Terminals as well as Non-Terminals).
Role of Parser
In the syntax analysis phase, a compiler verifies whether or not the tokens generated by
the lexical analyzer are grouped according to the syntactic rules of the language. This is
done by a parser. The parser obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar of the source language. It detects and reports
any syntax errors and produces a parse tree from which intermediate code can be
generated.
Types of Parsing
The parsing is divided into two types, which are as follows:
1. Top-down Parsing
2. Bottom-up Parsing
Top-Down Parsing
Top-down parsing attempts to build the parse tree from the root node down to the leaf
nodes. The top-down parser starts from the start symbol and derives the input string. It
follows the leftmost derivation: the leftmost non-terminal in each sentential form is
always expanded first.
1. Recursive parsing or predictive parsing are other names for top-down parsing.
2. A parse tree is built for an input string using top-down parsing.
3. In top-down parsing, the start symbol is transformed, step by step, into the input
string.
The top-down parsing is further categorized as follows:
1. With Backtracking:
Brute Force Technique
2. Without Backtracking:
Recursive Descent Parsing
Predictive Parsing, also called Non-Recursive Parsing, LL(1) Parsing, or Table-Driven Parsing
Bottom-up Parsing
Bottom-up parsing builds the parse tree from the leaf node to the root node. The bottom-
up parsing will reduce the input string to the start symbol. It traces the rightmost derivation
of the string in reverse. Bottom-up parsers are also known as shift-reduce parsers.
1. Shift-reduce parsing is another name for bottom-up parsing.
2. A parse tree is built for an input string using bottom-up parsing.
3. When parsing from the bottom up, the process begins with the input symbol and builds
the parse tree up to the start symbol by reversing the rightmost string derivations.
Generally, bottom-up parsing is categorized into the following types:
1. LR Parsing / Shift-Reduce Parsing: Shift-reduce parsing is the process of reducing an
input string step by step until only the start symbol of the grammar remains.
LR(0)
SLR(1)
LALR
CLR
2. Operator Precedence Parsing: Parsing driven by an operator grammar is known as
operator precedence parsing. An operator grammar has no null (ε) productions, and no
two non-terminals are adjacent in any right-hand side.
To construct the GOTO graph using LR(0) parsing, we rely on two essential
functions: Closure() and Goto().
Firstly, we introduce the concept of an augmented grammar. In the augmented grammar,
a new start symbol, S’, is added, along with a production S’ -> S. This addition helps the
parser determine when to stop parsing and signal the acceptance of input. For example,
if we have a grammar S -> AA and A -> aA | b, the augmented grammar will be S’ -> S
and S -> AA.
Next, we define LR(0) items. An LR(0) item of a grammar G is a production of G with a
dot (.) positioned at some point on the right-hand side. For instance, given the
production S -> ABC, we obtain four LR(0) items: S -> .ABC, S -> A.BC, S -> AB.C,
and S -> ABC. (with the dot at the very end). It is worth noting that an ε-production
A -> ε generates only one item: A -> .
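Enumerating the LR(0) items of a single production is straightforward to sketch; representing an item as a (lhs, rhs, dot) triple is an illustrative encoding choice:

```python
# A production with n right-hand-side symbols yields n + 1 dotted items;
# an ε-production (empty rhs) yields just one, with the dot alone.

def lr0_items(lhs, rhs):
    """rhs is a tuple of grammar symbols; () represents ε."""
    return [(lhs, rhs, dot) for dot in range(len(rhs) + 1)]

def show(item):
    lhs, rhs, dot = item
    return f"{lhs} -> {''.join(rhs[:dot])}.{''.join(rhs[dot:])}"
```

For S -> ABC this yields the four items listed above.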
By utilizing the Closure() function, we can calculate the closure of a set of LR(0) items.
The closure operation involves expanding the items by considering the productions that
have the dot right before the non-terminal symbol. This step helps us identify all the
possible items that can be derived from the current set.
The Goto() function is employed to construct the transitions between LR(0) items in the
GOTO graph. It determines the next set of items by shifting the dot one position to the
right. This process allows us to navigate through the graph and track the parsing
progress.
Closure Operation: If I is a set of items for a grammar G, then closure(I) is the set of
items constructed from I by two rules:
1. Initially, every item in I is added to closure(I).
2. If A -> α.Bβ is in closure(I) and B -> γ is a production, then add the item B -> .γ to
closure(I), if it is not already there. We apply this rule until no more items can be
added to closure(I).
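Closure() and the Goto() function described earlier can be sketched over such item triples. The grammar below is the S -> AA, A -> aA | b example from the augmented-grammar discussion:

```python
# Closure() and Goto() over LR(0) items, where an item is (lhs, rhs, dot).

GRAMMAR = {
    "S'": [("S",)],
    "S":  [("A", "A")],
    "A":  [("a", "A"), ("b",)],
}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(result):
            # rule 2: a dot right before a non-terminal pulls in
            # that non-terminal's productions with the dot at the start
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)

def goto(items, symbol):
    # shift the dot over `symbol`, then close the resulting set
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)
```

Closing the initial item S' -> .S yields the four items S' -> .S, S -> .AA, A -> .aA, and A -> .b.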
Eg: Consider the grammar S -> (L) | a, L -> L, S | S (the productions used in the
reductions below) and the input string (a,(a,a)). The shift-reduce parse proceeds as
follows (some intermediate shift/reduce steps are elided):
Stack       Input Buffer     Action
$           (a,(a,a))$       Shift
$(          a,(a,a))$        Shift
...
$(L,(L      ))$              Shift
$(L,(L)     )$               Reduce S -> (L)
$(L,S       )$               Reduce L -> L, S
$(L         )$               Shift
$(L)        $                Reduce S -> (L)
$S          $                Accept
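A naive version of this shift-reduce loop can be sketched as follows. It assumes the grammar S -> (L) | a, L -> L, S | S behind the trace above, and reduces greedily whenever the top of the stack matches a right-hand side, preferring the longest match; a real LR parser drives these decisions from a parsing table instead of this greedy rule.

```python
# Greedy shift-reduce sketch for S -> (L) | a, L -> L,S | S.

PRODUCTIONS = [              # (lhs, rhs), longest RHS first
    ("S", ["(", "L", ")"]),
    ("L", ["L", ",", "S"]),
    ("S", ["a"]),
    ("L", ["S"]),
]

def parse(tokens):
    stack, buf = [], list(tokens) + ["$"]
    while True:
        if stack == ["S"] and buf[0] == "$":
            return True                          # accept
        # reduce if some RHS matches the top of the stack
        for lhs, rhs in PRODUCTIONS:
            if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                break
        else:
            if buf[0] == "$":
                return False                     # nothing to shift or reduce
            stack.append(buf.pop(0))             # shift
```

`parse(list("(a,(a,a))"))` replays the trace above and accepts.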
Advantages:
It can parse a large variety of programming languages and is widely used in practice.
Disadvantages:
Shift-reduce parsing has a limited lookahead, which means that it may miss some
syntax errors that require a larger lookahead.
In some cases, the parse tree generated by shift-reduce parsing may be more complex
than other parsing techniques.
SLR Parser
SLR stands for Simple LR parser. It is the weakest of the three LR methods (SLR,
LALR, CLR) but the easiest to implement. A grammar for which an SLR parser can be
constructed is called an SLR grammar.
Steps for constructing the SLR parsing table
1. Writing the augmented grammar
2. Finding the LR(0) collection of items
3. Finding FOLLOW of the LHS of each production
4. Defining 2 functions: action[list of terminals] and goto[list of non-terminals] in the
parsing table
RULE –
If any item has ‘ . ‘ immediately preceding a non-terminal, we have to write all of that
non-terminal’s productions, with ‘ . ‘ preceding each right-hand side.
RULE –
From each state to the next state, the ‘ . ‘ shifts one place to the right.
Applications:
Compiler
Data Validation
Natural Language Processing (NLP)
Protocol Parsing
Advantages of LR parsing :
It recognizes virtually all programming language constructs for which CFG can be
written
It is able to detect syntactic errors
It is an efficient, non-backtracking, shift-reduce parsing method.
CLR Parser :
The CLR parser stands for canonical LR parser. It is a more powerful LR parser. It
makes use of lookahead symbols, and uses a large set of items called LR(1) items. The
main difference between LR(0) and LR(1) items is that LR(1) items carry more
information in a state, which rules out useless reduction states. This extra information
is incorporated into the state by the lookahead symbol. The general syntax becomes
[A -> α.β , a]
where A -> α.β is the production and a is a terminal or the right end-marker $.
LR(1) items = LR(0) items + lookahead
Steps for constructing CLR parsing table :
1. Writing augmented grammar
2. LR(1) collection of items to be found
3. Defining 2 functions: action[list of terminals] and goto[list of non-terminals] in the
CLR parsing table
EXAMPLE
Construct a CLR parsing table for the given context-free grammar
S-->AA
A-->aA|b
Solution :
STEP 1 – Find augmented grammar
The augmented grammar of the given grammar is S' --> S, S --> AA, A --> aA | b.
Writing its productions as LR(1) items with their lookaheads gives:
S' --> .S , $ [0th production]
S --> .AA , $ [1st production]
A --> .aA , a|b [2nd production]
A --> .b , a|b [3rd production]
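The closure computation that produces these LR(1) items can be sketched as follows. An item is a (lhs, rhs, dot, lookahead) tuple; the FIRST helper exploits the fact that no production of this particular grammar derives ε (a grammar with nullable symbols would need the full FIRST(βa) rule).

```python
# LR(1) closure for S' -> S, S -> AA, A -> aA | b.

GRAMMAR = {
    "S'": [("S",)],
    "S":  [("A", "A")],
    "A":  [("a", "A"), ("b",)],
}

def first_of_seq(seq):
    """FIRST of a symbol string; no production here derives ε."""
    if not seq:
        return set()
    sym = seq[0]
    if sym not in GRAMMAR:
        return {sym}
    return {t for prod in GRAMMAR[sym] for t in first_of_seq(prod)}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot, look in list(result):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                # lookaheads of the new items come from FIRST of what
                # follows the non-terminal; if nothing follows, the
                # parent item's lookahead carries over (CASE 2 below)
                lookaheads = first_of_seq(rhs[dot + 1:]) or {look}
                for prod in GRAMMAR[rhs[dot]]:
                    for b in lookaheads:
                        item = (rhs[dot], prod, 0, b)
                        if item not in result:
                            result.add(item)
                            changed = True
    return result
```

Closing [S' -> .S, $] reproduces the items of Step 1: the A-items appear with lookaheads a and b, since FIRST(A) = {a, b}.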
Top-Down Parsers without backtracking can further be divided into two parts:
1. Recursive Descent Parsing
2. Non-Recursive Descent Parsing
In this article, we are going to discuss Non-Recursive Descent, which is also known as
the LL(1) Parser.
LL(1) Parsing: Here the first L represents that the input is scanned from Left to Right,
the second L shows that this parsing technique uses the Leftmost Derivation, and the 1
represents the number of look-ahead symbols, i.e. how many input symbols the parser
examines before making a decision.
For a grammar to be LL(1) it must be unambiguous, free of left recursion, and
left-factored; these conditions are necessary but not sufficient for the grammar to have
an LL(1) parser.
*ε denotes epsilon
Production           FIRST        FOLLOW
E’ –> +TE’ | ε       { +, ε }     { $, ) }
T’ –> *FT’ | ε       { *, ε }     { +, $, ) }
As you can see that all the null productions are put under the Follow set of that symbol
and all the remaining productions lie under the First of that symbol.
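This placement rule can be sketched directly. The FIRST and FOLLOW sets below are the ones computed for Example 1 earlier (E' and T' are spelled Ep and Tp); looking only at the leading right-hand-side symbol when taking FIRST is sufficient for this particular grammar.

```python
# Build the LL(1) table: each production A -> α goes under the terminals
# in FIRST(α); if α can derive ε, it also goes under FOLLOW(A).

EPS = "ε"
PRODS = [
    ("E",  ["T", "Ep"]),
    ("Ep", ["+", "T", "Ep"]), ("Ep", [EPS]),
    ("T",  ["F", "Tp"]),
    ("Tp", ["*", "F", "Tp"]), ("Tp", [EPS]),
    ("F",  ["(", "E", ")"]),  ("F",  ["id"]),
]
FIRST = {"E": {"(", "id"}, "Ep": {"+", EPS}, "T": {"(", "id"},
         "Tp": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {"$", ")"}, "Ep": {"$", ")"}, "T": {"+", "$", ")"},
          "Tp": {"+", "$", ")"}, "F": {"*", "+", "$", ")"}}

def first_of_rhs(rhs):
    sym = rhs[0]
    return FIRST.get(sym, {sym})   # a terminal (or ε) is its own FIRST

def build_table():
    table = {}
    for lhs, rhs in PRODS:
        f = first_of_rhs(rhs)
        cols = (f - {EPS}) | (FOLLOW[lhs] if EPS in f else set())
        for t in cols:
            table.setdefault((lhs, t), []).append((lhs, rhs))
    # any cell holding two productions means the grammar is not LL(1)
    conflicts = {k for k, v in table.items() if len(v) > 1}
    return table, conflicts
```

For this grammar every cell holds at most one production, so there are no conflicts; the ε-productions of Ep and Tp land under their FOLLOW sets, exactly as described above.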
Note: Every grammar is not feasible for LL(1) Parsing table. It may be possible that one
cell may contain more than one production.
Step 1: Consider the grammar S –> A | a, A –> a (the productions that appear in the
table below). The grammar does not satisfy all properties in step 1, as the grammar is
ambiguous. Still, let’s try to make the parser table and see what happens.
Step 2: Calculating first() and follow():
FIRST(S) = { a }    FOLLOW(S) = { $ }
FIRST(A) = { a }    FOLLOW(A) = { $ }
In the parsing table, the cell [S, a] contains both S –> A and S –> a.
Here, we can see that there are two productions in the same cell. Hence, this grammar is
not feasible for LL(1) Parser.
Trick – Above grammar is ambiguous grammar. So the grammar does not satisfy the
essential conditions. So we can say that this grammar is not feasible for LL(1) Parser even
without making the parse table.
FIRST and FOLLOW sets:
S : FIRST = { ( , a }     FOLLOW = { $ , ) }
L : FIRST = { ( , a }     FOLLOW = { ) }
L’ : FIRST = { ) , ε }    FOLLOW = { ) }
In the parsing table, the cell [L’, )] contains both L’ –> )SL’ and L’ –> ε.
Here, we can see that there are two productions in the same cell. Hence, this
grammar is not feasible for LL(1) Parser. Although the grammar satisfies all the
essential conditions in step 1, it is still not feasible for an LL(1) Parser. We saw in
example 2 that we must have these essential conditions, and in example 3 we saw that
those conditions are insufficient for a grammar to be LL(1).
CASE 1 –
Suppose the 0th production is an item of the form A -> α.BC , a. Since ‘ . ‘ precedes B,
we have to write B’s productions as well.
B -> .D [1st production]
Suppose this is B’s production. The lookahead of this production is found by looking at
the previous, i.e. 0th, production: whatever comes after B, we find FIRST(of that
value); that is the lookahead of the 1st production. Here, in the 0th production, C comes
after B. Assume FIRST(C) = d; then the 1st production becomes:
B -> .D , d
CASE 2 –
Now if the 0th production was like this,
A -> α.B , a
Here, we can see there’s nothing after B. So the lookahead of the 0th production will be
the lookahead of the 1st production, i.e.:
B -> .D , a
CASE 3 –
Assume a production A -> a | b:
A -> a , $ [0th production]
A -> b , $ [1st production]
Here, the 1st production is an alternative of the same production as the 0th, so its
lookahead will be the same as that of the previous production.
STEP 3 –
Defining 2 functions: action[list of terminals] and goto[list of non-terminals] in the
parsing table. Below is the CLR parsing table.
Once we make a CLR parsing table, we can easily make a LALR parsing table from it.
In the Step 2 diagram, we can see that:
I3 and I6 are similar except for their lookaheads.
I4 and I7 are similar except for their lookaheads.
I8 and I9 are similar except for their lookaheads.
In LALR parsing table construction, we merge these similar states:
Wherever there is 3 or 6, make it 36 (combined form).
Wherever there is 4 or 7, make it 47 (combined form).
Wherever there is 8 or 9, make it 89 (combined form).
Below is the LALR parsing table.