CD Module2 16 03 23 PDF
Syntax Analysis
Soumya Majumdar
Parser
● Parser is a program that is usually part of a compiler.
● Receives input in the form of sequential source program instructions, interactive online
commands, markup tags or some other defined interface.
● Parsing happens during the analysis stage of compilation
● In parsing, code is taken from the preprocessor, broken into smaller pieces and analyzed
so other software can understand it. The parser does this by building a data structure out
of the pieces of input.
● Parser consists of three components, each of which handles a different stage of the
parsing process. The three stages are: Lexical analysis, Syntactic analysis, Semantic
analysis
Lexical Analysis
● Lexical analyzer or scanner takes code from preprocessor and breaks it into smaller
pieces.
● Groups the input code into sequences of characters called lexemes, each of which
corresponds to a token.
● Tokens are units of grammar in the programming language that the compiler understands.
● Lexical analyzers also strip white space characters and comments from the input and
report lexical errors such as illegal characters.
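The scanning step described above can be sketched in a few lines. This is a minimal illustration, not a production scanner: the token classes (NUMBER, ID, OP), the `#` comment syntax and the name `tokenize` are all assumptions chosen for the example.

```python
import re

# Hypothetical token classes for a tiny expression language (not from the slides).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("COMMENT", r"#[^\n]*"),   # assumed comment syntax
    ("SKIP", r"\s+"),          # white space
    ("ERROR", r"."),           # any other character is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme) pairs, dropping white space and comments."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind in ("SKIP", "COMMENT"):
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {lexeme!r} at offset {match.start()}")
        tokens.append((kind, lexeme))
    return tokens
```

Calling `tokenize("x = 42 + y  # sum")` yields five (token, lexeme) pairs; the white space and the comment never reach the syntax analyzer.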
Syntactic Analysis
● Checks the syntactical structure of the input using a data structure called a parse tree or
derivation tree.
● Syntax analyzer uses tokens to construct a parse tree that combines the predefined
grammar of the programming language with the tokens of the input string.
● Syntactic analyzer reports a syntax error if the syntax is incorrect.
Semantic Analysis
● Verifies the parse tree against a symbol table and determines whether it is semantically
consistent. This process is also known as context sensitive analysis.
● Includes data type checking, label checking and flow control checking.
Types of Parser
<sentence> ::= <subject> <verb> <object>
backup. It is a top-down parser that does not require backtracking. At each step, the choice of
● Earley parsers: These parse all context-free grammars, unlike LL and LR parsers. Most
● Shift-reduce parsers: These shift input symbols onto a stack and reduce them. At each step, they
reduce a matched sequence of symbols to the left-hand side of a grammar rule. This approach
reduces the string until it has been completely checked.
Top-down parser
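● When the parser starts constructing the parse tree from the start symbol and then tries to
transform the start symbol into the input, it is called top-down parsing.
● Recursive descent parsing is a common form of top-down parsing. It is called recursive because
it uses recursive procedures to process the input. Recursive descent parsing may suffer from
backtracking.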
● Backtracking: If one derivation of a production fails, the syntax analyzer restarts the process
using a different alternative of the same production. This technique may process the input string
more than once.
Recursive-descent parser
● Recursive descent is a top-down parsing technique that constructs the parse tree from the top
down, reading the input from left to right.
● This parsing technique recursively parses the input to make a parse tree, which may or may not
require back-tracking.
● A form of recursive-descent parsing that does not require any back-tracking is known as
predictive parsing.
● This parsing technique is regarded as recursive because it uses a context-free grammar, which is
recursive in nature.
Back-tracking
S → rXd | rZd
X → oa | ea
Z → ai
Input string: read
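A minimal sketch of how a backtracking recursive-descent parser handles this grammar and input. The function names (`parse`, `match_sequence`, `accepts`) and the trace format are assumptions made for the illustration; single uppercase letters are treated as non-terminals and lowercase letters as terminals.

```python
# Grammar from the slide: S -> rXd | rZd, X -> oa | ea, Z -> ai
GRAMMAR = {
    "S": ["rXd", "rZd"],
    "X": ["oa", "ea"],
    "Z": ["ai"],
}


def parse(symbol, text, pos, trace):
    """Try to match `symbol` at text[pos:]; yield every end position that works."""
    if symbol not in GRAMMAR:  # terminal: must equal the next input character
        if pos < len(text) and text[pos] == symbol:
            yield pos + 1
        return
    for body in GRAMMAR[symbol]:  # try each alternative in order
        trace.append(f"{symbol} -> {body} at {pos}")
        yield from match_sequence(body, text, pos, trace)


def match_sequence(body, text, pos, trace):
    """Match the symbols of `body` one after another, backtracking on failure."""
    if not body:
        yield pos
        return
    for mid in parse(body[0], text, pos, trace):
        yield from match_sequence(body[1:], text, mid, trace)


def accepts(text):
    """Return (accepted, trace of the alternatives that were tried)."""
    trace = []
    ok = any(end == len(text) for end in parse("S", text, 0, trace))
    return ok, trace
```

For the input `read`, the trace is `S -> rXd at 0`, `X -> oa at 1`, `X -> ea at 1`: the parser first tries X → oa, fails on `o`, backs up to position 1 and succeeds with X → ea.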
Predictive Parser
● Predictive parser is a recursive descent parser, which has the capability to predict which
production is to be used to replace the input string.
● Predictive parser does not suffer from backtracking.
● Predictive parser uses a look-ahead pointer, which points to the next input symbols.
● To make parser back-tracking free, predictive parser puts some constraints on grammar and
accepts only a class of grammar known as LL(k) grammar.
● Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree.
● Both the stack and the input contain an end symbol $ to denote that the stack is empty and the
input is consumed.
● Parser refers to the parsing table to take any decision on the input and stack element
combination.
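The stack-and-table mechanism above can be sketched as follows. The grammar, the hand-written table and the name `ll1_parse` are assumptions for illustration only; they do not come from the slides.

```python
# Hypothetical LL(1) grammar (an assumption, not from the slides):
#   E  -> T E'
#   E' -> + T E' | epsilon
#   T  -> id | ( E )
# Parsing table: (non-terminal on stack, lookahead token) -> production body.
TABLE = {
    ("E", "id"): ["T", "E'"],
    ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): [],          # E' -> epsilon
    ("E'", "$"): [],          # E' -> epsilon
    ("T", "id"): ["id"],
    ("T", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T"}


def ll1_parse(tokens):
    """Return the list of productions applied, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]   # end marker on the input
    stack = ["$", "E"]              # end marker below the start symbol
    pos = 0
    applied = []
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            body = TABLE.get((top, look))
            if body is None:
                raise SyntaxError(f"no table entry for ({top}, {look})")
            applied.append((top, tuple(body)))
            stack.extend(reversed(body))  # leftmost symbol ends up on top
        elif top == look:
            pos += 1                      # terminal (or $) matches: consume it
        else:
            raise SyntaxError(f"expected {top}, found {look}")
    return applied
```

Each step consults exactly one table entry, so the parser never backs up; an empty entry is reported as a syntax error.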
Recursive-descent vs Predictive Parser
● In recursive descent parsing, the parser may have more than one production to choose from for a
single instance of input.
● In predictive parser, each step has at most one production to choose.
● There might be instances where no production matches the input string, causing the
parsing procedure to fail.
LL Parser
● LL Parser accepts LL grammar.
● LL grammar is a subset of context-free grammar, with some restrictions that simplify it and
make implementation easier.
● An LL grammar can be parsed by either of two algorithms: recursive descent or
table-driven.
● LL parser is denoted as LL(k). The first L stands for scanning the input from left to right, the
second L stands for left-most derivation, and k is the number of lookahead symbols.
Generally k = 1, so the parser is usually written as LL(1).
Bottom-up Parser
● Bottom-up parsing starts with input symbols and tries to construct the parse tree up to the start
symbol
● Bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till it
reaches the root node.
Bottom-up Parser
Grammar:
1. S → S+S
2. S → S-S
3. S → (S)
4. S → a
Input string:
a1-(a2+a3)
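A minimal sketch of a shift-reduce recognizer for this grammar. It assumes a1, a2 and a3 are three occurrences of the terminal a, and it uses a simple greedy strategy (reduce whenever the top of the stack matches a right-hand side, otherwise shift) in place of a real parsing table. The name `shift_reduce` is hypothetical.

```python
# Productions from the slide: S -> S+S | S-S | (S) | a
RULES = [
    ("S", ["S", "+", "S"]),
    ("S", ["S", "-", "S"]),
    ("S", ["(", "S", ")"]),
    ("S", ["a"]),
]


def shift_reduce(tokens):
    """Greedy shift-reduce recognizer; returns (accepted, list of actions)."""
    stack, actions = [], []
    remaining = list(tokens)
    while True:
        for head, body in RULES:
            # Reduce if the top of the stack equals a production body.
            if stack[-len(body):] == body:
                del stack[-len(body):]
                stack.append(head)
                actions.append(f"reduce {head} -> {''.join(body)}")
                break
        else:
            # No reduction applies: shift the next token, or stop at end of input.
            if not remaining:
                break
            stack.append(remaining.pop(0))
            actions.append(f"shift {stack[-1]}")
    return stack == ["S"], actions
```

For the tokens `a - ( a + a )` the recognizer performs seven shifts and six reductions and ends with only S on the stack. Because the grammar is ambiguous, the greedy choice fixes one particular grouping; a table-driven parser would make that choice explicit.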
Bottom-up Parser
Parsing table:
Context free grammar
● A context free grammar (CFG) is a formal grammar which is used to generate all the possible
patterns of strings in a given formal language. It is defined as a four-tuple:
G=(V,T,P,S)
● G is a grammar, which consists of a set of production rules. It is used to generate the strings of a
language.
● T is the finite set of terminal symbols. It is denoted by lower case letters.
● V is the finite set of non-terminal symbols. It is denoted by capital letters.
● P is a set of production rules, which is used for replacing non-terminal symbols (on the left side
of production) in a string with other terminals or non-terminals (on the right side of production).
● S is the start symbol used to derive the string
Context free grammar
● Context free grammar consists of terminals, non-terminals, start symbol and production.
● Terminal: basic symbols from which strings are formed (can also be called "token name").
● Non-terminal: syntactic variables that denote sets of strings. The sets of strings denoted by
non-terminals help to define the language generated by the grammar.
● Start Symbol: one non-terminal is distinguished as start symbol.
● Production: specify the manner in which terminals and non-terminals can be combined to form
strings
● A production consists of (i) a non-terminal symbol (head/left side of production) (ii) -> symbol
or ::= symbol (iii) terminal/non-terminals (right side of production)
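One way to make the four-tuple G = (V, T, P, S) concrete is to store it as data and apply productions step by step. The sketch below is illustrative only: the `Grammar` type, the `derive` helper and the sample grammar S → 0S0 | 1S1 | 2 are assumptions chosen for the example.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    nonterminals: frozenset   # V
    terminals: frozenset      # T
    productions: dict         # P: head -> list of bodies (each a list of symbols)
    start: str                # S


# Illustrative grammar: S -> 0S0 | 1S1 | 2
G = Grammar(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"0", "1", "2"}),
    productions={"S": [["0", "S", "0"], ["1", "S", "1"], ["2"]]},
    start="S",
)


def derive(grammar, choices):
    """Leftmost derivation; choices[i] selects the production body used at step i."""
    form = [grammar.start]
    steps = ["".join(form)]
    for choice in choices:
        # Locate the leftmost non-terminal in the current sentential form.
        index = next(i for i, sym in enumerate(form) if sym in grammar.nonterminals)
        body = grammar.productions[form[index]][choice]
        form[index:index + 1] = body   # replace the head with the chosen body
        steps.append("".join(form))
    return steps
```

`derive(G, [0, 1, 2])` walks through S, 0S0, 01S10, 01210: each step replaces the non-terminal on the left side of a production with the symbols on its right side.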
LR Parser
● Bottom-up parser for context-free grammars that is widely used by programming language
compilers and other associated tools.
● An LR parser reads its input from left to right and produces a right-most derivation in reverse.
● It is called a bottom-up parser because it attempts to reduce the top-level grammar productions
by building up from the leaves.
● LR parsers are the most powerful of the deterministic parsers used in practice.
● LR(k) parser: the L refers to left-to-right scanning, and the R refers to a rightmost derivation
in reverse.
● k refers to the number of input symbols of lookahead that are used in making parsing decisions.
LR Parser advantages
● Can be constructed to recognise virtually all programming language constructs for which a
context-free grammar can be written.
● The LR parsing method is the most general non-backtracking shift-reduce parsing method.
● An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right
scan of the input.
Disadvantage: it is too much work to construct an LR parser by hand for a typical programming
language grammar.
LR(0) item
● An LR(0) item is a production of the grammar with exactly one dot on the right-hand side.
● For example, production T → T * F leads to four LR(0) items:
T→⋅T*F
T→T⋅*F
T→T*⋅F
T→T*F⋅
● What is to the left of the dot has just been read, and the parser is ready to read the remainder,
after the dot.
● Two LR(0) items that come from the same production but have the dot in different places are
considered different LR(0) items.
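Enumerating the LR(0) items of a production is mechanical: place the dot at every position from 0 to the length of the right-hand side. A minimal sketch, where the tuple representation `(head, body, dot)` and the function name `lr0_items` are assumptions made for the example:

```python
def lr0_items(head, body):
    """Return every LR(0) item of the production head -> body.

    Each item is a tuple (head, body, dot), where dot is the index in body
    that the dot sits in front of (dot == len(body) means the dot is at the end).
    """
    return [(head, tuple(body), dot) for dot in range(len(body) + 1)]


def show(item):
    """Format an item with a visible dot, e.g. 'T -> T . * F'."""
    head, body, dot = item
    symbols = list(body[:dot]) + ["."] + list(body[dot:])
    return f"{head} -> {' '.join(symbols)}"
```

A production with n symbols on its right-hand side therefore has n + 1 items, which gives the four items for T → T * F.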
Closure of LR(0) item
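S is a set of LR(0) items. To build closure(S), the closure of S, start with the items of S and
repeat the following rule until there are no more items to add: if the set contains an item
A → α ⋅ B β with the dot immediately before a nonterminal B, add the item B → ⋅ γ for every
production B → γ.
For example, start with the single item E → E + ⋅ T.
Since there is an item with a dot immediately before nonterminal T, we add T → ⋅ F and T → ⋅ T * F.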
The set now contains the following LR(0) items.
E→E+⋅T
T→⋅F
T→⋅T*F
Closure of LR(0) item
Now there is an item in the set with a dot immediately followed by F. So we add items F → ⋅ n and F →
⋅ ( E ). The set now contains the following items.
E→E+⋅T
T→⋅F
T→⋅T*F
F→⋅n
F→⋅(E)
● LR(0) item E → E + ⋅ T indicates that the parser has just finished reading an expression
followed by a + sign. In fact, E + are the top two symbols on the stack.
● Now, the parser is looking to see if there is a T next. (It does not predict that there is a T next. It
is just considering that as a possibility.)
● But that means it should be looking for something that is the right-hand side of a production for
T. So we add items for T with the dot at the beginning.
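The closure computation walked through above can be sketched with a worklist. The grammar below is reconstructed from the items shown on the slides; the production E → T and the representation of items as `(head, body, dot)` tuples are assumptions.

```python
# Expression grammar reconstructed from the slide's items; E -> T is an assumption.
GRAMMAR = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("n",), ("(", "E", ")")],
}


def closure(items):
    """Return the LR(0) closure of a set of items (head, body, dot)."""
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot = worklist.pop()
        # Only items with the dot immediately before a non-terminal add anything.
        if dot < len(body) and body[dot] in GRAMMAR:
            for production in GRAMMAR[body[dot]]:
                new_item = (body[dot], production, 0)   # dot at the beginning
                if new_item not in result:
                    result.add(new_item)
                    worklist.append(new_item)
    return result
```

Starting from the single item E → E + ⋅ T, written as `("E", ("E", "+", "T"), 2)`, the function returns the same five items listed above. The item T → ⋅ T * F has its dot before T again, but the T items are already in the set, so the loop terminates.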
Problem 1
Consider the following grammar-
E→E–E
E→ExE
E → id
S→(L)|a
L→L,S|S
S → 0S0 | 1S1 | 2