Formal Language and Applications notes
Length of Strings
Number of symbols a string contains.
If w is a string, its length is written as |w|.
Null String
Can we have strings of length 0? Yes: the null string.
We use λ or ε to refer to null strings.
Concatenation
∑ = {a, b, c, 0, 1}
w1 = abab
w2 = 0101
w3 = w1w2 = abab0101
∑ = {a, b, c, 0, 1}
w = ab
w^0 = λ (null string)
ww = w^2 = abab
www = w^3 = ababab
wwww = w^4 = abababab
Reversing a String
∑ = {a, b, c, 0, 1}
w = ab01
Reverse: w^R = 10ba
Prefix
x is a prefix of string y if a string z exists such that xz = y
x = ab, z = 01, y = ab01
Suffix of a String - Example
Suffix
x is a suffix of string y if a string z exists such that zx = y
x = 01, z = ab, y = ab01
Kleene Star
Let ∑ = {0, 1}
∑* contains all possible strings made by selecting and concatenating elements from ∑
∑+ = ∑* - {λ}
Basis
|uv| = |u| + |v| is true for all u of any length and all v of length 1
Inductive Assumption
|uv| = |u| + |v| is true for all u of any length and all v of length up to n
Inductive Step
Let |v| = n + 1 and write v = wa, where |w| = n and a is a single symbol. Then |uv| = |u(wa)| = |(uw)a| = |uw| + 1 = |u| + |w| + 1 = |u| + |v|.
Therefore,
|uv| = |u| + |v| for all strings u and v.
L = {0^n 1^n : n ≥ 0}
Output = Yes/No
Let ∑ = {0,1}
L1 = {0^n 1^n : n ≥ 0}
L2 = {1^n 0^n : n ≥ 0}
Union: L3 = L1 ∪ L2
Complement: L3 = ∑* − L1
Reversal: L3 = {w^R : w ∈ L1}
Using Concatenation
L1 = {00, 000}
L2 = {1, 11, 111}
L3 = L1L2 = {xy : x ∈ L1, y ∈ L2}
L3 = {001, 0011, 00111, 0001, 00011, 000111}
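A minimal Python sketch to verify this concatenation example (the sets are just the finite example languages above):

L1 = {"00", "000"}
L2 = {"1", "11", "111"}
L3 = {x + y for x in L1 for y in L2}  # L1L2 = {xy : x in L1, y in L2}
print(len(L3), sorted(L3))            # 6 strings, matching L3 above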
Rules / Productions
All three rules are given; each derivation starts with
<sentence>
Variables
<sentence>, <noun-phrase>, <predicate>,
<article>, <noun>, <verb>
Starting Symbol
<sentence>
Terminal Symbols
Any English word we substitute for Variables
Definition of a Grammar
Automata
Once the input is completely read and processed, the automaton produces an output: yes or no. In this case, our automaton works as an acceptor. It accepts the sentence if it is valid. At times, the automaton produces some other useful output or transforms the given input. In other words, it takes its input and then produces another output, not just yes or no. In such cases, the automaton works as a transducer. The emphasis on automata is relatively lesser in this course. The larger emphasis is on formal languages, their corresponding grammars, properties and applications.
Week - 2
. = concatenation
* = repetition (Kleene star)
+ = OR (union)
Example:
Language L = {0, 1}
RE is 0 + 1
Language L = {01}
RE is 0.1
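These formal operators map directly onto programming-language regex syntax, where | plays the role of the + used here. A minimal Python sketch; the test strings are illustrative:

import re

print(bool(re.fullmatch(r"0|1", "0")))        # formal RE 0 + 1 matches "0"
print(bool(re.fullmatch(r"01", "01")))        # formal RE 0.1 matches "01"
print(bool(re.fullmatch(r"(0|1)*", "0110")))  # (0 + 1)* matches any binary string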
Regular Expressions
Example :
We've learned that for every regular language L: one, we will be able to find a regular expression to
describe it; two, there is a DFA that accepts this language L.
Regular Languages
Regular languages are a very useful and important type of formal language. You will learn that regular expressions can be used to describe languages and, if a language is given, you will learn how to find the regular expression if possible. Great! The language associated with any regular expression is called a regular language. This means that whenever we have a regular expression, the set of strings it defines is a regular language. Furthermore, for every regular language, there is a corresponding regular expression. So given a regular language, you should be able to find the corresponding regular expression. In a sense, regular expressions and regular languages are two sides of the same coin. Regular languages are quite a useful class of formal languages. Imagine that I am handing over a long document to you, and I request you to find all the phone numbers in the document. You know for sure that valid phone numbers form a pattern, and you can certainly write a regular expression to search for all of them. How do you think text editors or IDEs highlight certain kinds of text, keywords, etc.? They use regular expressions to find all the matching patterns and highlight them. When you declare a variable or a constant in a programming language, when you write a numerical value or a string value, or when you use a keyword, the compiler understands them correctly by modeling these items as regular languages. We will see more of this shortly in this course. In network security, they help to detect patterns that might indicate a cyber attack. In data mining, they help find useful patterns in large data sets. Regular languages are a powerful class of formal languages with many applications in computer science. I am sure you will learn what regular expressions are and how regular expressions and regular languages are connected.
Deterministic Finite Automata
Example:
Regular Languages
Let M be a DFA.
The language accepted by M, L(M), is regular.
A language L is regular if and only if there exists some DFA M such that L = L(M).
Regular languages are exactly the languages that can be accepted by DFAs and represented by regular
expressions, making them formal languages with precise structural definitions.
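A minimal Python sketch of simulating a DFA; the example machine, accepting binary strings with an even number of 1s, is an assumed illustration, not one from the notes:

# Transition function as a dictionary: (state, symbol) -> next state
delta = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

def accepts(w):
    state = "even"                      # start state
    for symbol in w:
        state = delta[(state, symbol)]  # exactly one move per input symbol
    return state == "even"              # accepting states: {"even"}

print(accepts("1011"))  # False (three 1s)
print(accepts("0110"))  # True (two 1s)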
In a right-linear grammar (RLG), the non-terminal appears only as the last symbol on the
right hand side of the production.
Indistinguishable States
States p and q of a DFA are indistinguishable if every input string leads from p and from q to the same accept/reject outcome; such states are equivalent.
Indistinguishability has the properties of an equivalence relation (reflexive, symmetric and transitive).
Mark-Reduce Method for Simplifying DFA
Important
Draw the table to confirm whether the states belong in the derived answer.
Recognizing languages that can have a DFA and those that cannot
The Myhill-Nerode theorem states that a language is regular if and only if the number of its equivalence classes is finite.
Pumping Lemma
The Pumping Lemma shows that if a language is regular, strings longer than a certain length can be
"pumped," which leads to contradictions for non-regular languages.
Regular languages cannot represent languages requiring recursion or nested dependencies, as these
require more computational power than finite automata can provide.
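To see the pumping argument in action on the non-regular language L = {0^n 1^n}, a minimal Python sketch; the pumping length p and the split s = xyz are illustrative assumptions (the lemma forces y to lie within the first p symbols, i.e., among the 0s):

p = 5                                  # assumed pumping length
s = "0" * p + "1" * p                  # s in L, |s| >= p

def in_L(w):
    n = len(w) // 2
    return w == "0" * n + "1" * n

# |xy| <= p forces y to lie inside the leading 0s; pick y = "00" for illustration
x, y, z = "", "00", s[2:]
for i in range(4):
    print(i, in_L(x + y * i + z))      # only i == 1 stays in L: contradiction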
Key Concepts:
1. Grammar:
• A grammar is a formal system that defines the syntactic structure of a language.
• Formal Definition: A grammar G is defined as a 4-tuple G = (V, T, S, P), where:
◦ V: Set of non-terminal symbols.
◦ T: Set of terminal symbols (alphabet).
◦ S: Start symbol from which derivation begins.
◦ P: Set of production rules, with each rule of the form x → y, where x is a string
over (V ∪ T)* containing at least one non-terminal and y ∈ (V ∪ T)*.
2. Context-Free Grammar (CFG):
• A CFG is a specific type of grammar where each production rule has the form A → x, where
A is a single non-terminal, and x is any string of terminals and non-terminals.
• Context-free means that the left-hand side of a production consists of exactly one non-
terminal, with no dependency on surrounding symbols (context).
• CFGs are widely used in programming languages and mathematical expressions, as they
can represent recursive language structures.
Example:
Let us define a CFG for palindromic strings consisting only of the characters a and b.
• Non-terminals: V = {S} (Start Symbol)
• Terminals: T = {a, b}
• Production rules:
◦ S → aSa
◦ S → bSb
◦ S → a
◦ S → b
◦ S → λ
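A minimal Python sketch that brute-forces derivations from this grammar (breadth-first over sentential forms, an implementation assumption) and confirms that every generated string is a palindrome:

from collections import deque

rules = ["aSa", "bSb", "a", "b", ""]   # right-hand sides of S; "" stands for λ

def generate(max_len=5):
    seen, out = set(), set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if "S" not in form:            # all terminals: a derived string
            out.add(form)
            continue
        i = form.index("S")            # expand the (only) non-terminal
        for rhs in rules:
            new = form[:i] + rhs + form[i + 1:]
            if len(new.replace("S", "")) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return out

strings = generate()
print(all(w == w[::-1] for w in strings))  # True: every generated string is a palindrome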
1. Language of CFGs
• A context-free language (CFL) is the set of all strings that can be derived from the start symbol of a CFG.
• Starting from the start symbol, repeatedly apply production rules until only terminal symbols
remain; this sequence of steps is the derivation of the string.
• Parse Trees are graphical representations of the derivation of a string, showing how the string is
generated step-by-step.
Key Concepts:
This CFG generates strings with n a's followed by the same number of b's (the standard grammar for this language is S → aSb | λ).
1. String Derivation:
• Each step in the derivation involves substituting a non-terminal with the corresponding rule
until a string of terminals is produced. We can generate the string aabb using the grammar
in this example.
Key Concepts:
1. Pattern Recognition:
• This CFG demonstrates the power of context-free languages to manage complex
dependencies within strings, such as matched pairs.
Key Concepts:
1. Language L = {w ∈ {0,1}* | w contains an equal number of 0's and 1's}:
• This CFG defines a language where every valid string has an equal number of 0s and 1s.
• Production Rules: The rules ensure that each derivation leads to strings meeting this
balance condition:
◦ Rule 1 - S → 1S0S
◦ Rule 2 - S → 0S1S
◦ Rule 3 - S → λ
Key Concepts:
1. Language L = {a^(2n) b^m | n, m ≥ 0}:
2. This language represents strings with an even number of 'a's followed by zero or more 'b's.
3. Production Rules: The CFG uses rules to ensure that every string has the required number of 'a's
and 'b's in the correct order.
• Rule 1 - S → AB
• Rule 2 - A → aaA
• Rule 3 - A → λ
• Rule 4 - B → Bb
• Rule 5 - B → λ
4. Left-Most Derivation:
• In a left-most derivation, the left-most variable is replaced at each step.
• Example Derivation of "aab": Starting from S, apply production rules by always
substituting the left-most non-terminal until reaching "aab".
• Deriving aab using Left Most Derivation
◦ S → AB (Applying Rule 1)
◦ AB → aaAB (Applying Rule 2)
◦ aaAB → aaB (Applying Rule 3)
◦ aaB → aaBb (Applying Rule 4)
◦ aaBb → aab (Applying Rule 5)
5. Right-Most Derivation:
• In a right-most derivation, the right-most variable is replaced at each step.
• Example Derivation of "aab": Begin with SSS and apply production rules, replacing the
right-most variable at each stage to arrive at the same string.
• Deriving aab using Left Most Derivation
• S → AB (Applying Rule 1)
• AB → ABb (Applying Rule 4)
• ABb → Ab (Applying Rule 5)
• Ab → aaAb (Applying Rule 2)
• aaAb → aab (Applying Rule 3)
Key Concepts:
1. Language L = {ab^(2m) | m ≥ 1}:
• This language includes strings with a single 'a' followed by an even number of 'b's.
• Production Rules: The CFG ensures that every string adheres to the pattern, with one 'a'
followed by a multiple of two 'b's.
◦ S → aAB
◦ A → bBb
◦ B→A
◦ B→λ
2. Left-Most Derivation:
• Each derivation step replaces the left-most non-terminal, gradually producing the target
string.
• Example Derivation of "abbbb": Starting with SSS, the CFG applies left-most derivation
rules to arrive at the balanced structure.
◦ S → aAB (Applying Rule 1)
◦ aAB → abBbB (Applying Rule 2)
◦ abBbB → abAbB (Applying Rule 3)
◦ abAbB → abbBbbB (Applying Rule 2)
◦ abbBbbB → abbbbB (Applying Rule 4)
◦ abbbbB → abbbb (Applying Rule 4)
3. Right-Most Derivation:
• Each derivation step replaces the right-most non-terminal, producing the string in a di erent
order.
• Example Derivation of "abbbb": Starting with SSS, the right-most derivation method
follows a structured path to reach the nal string.
◦ S → aAB (Applying Rule 1)
◦ aAB → aA (Applying Rule 4)
◦ aA → abBb (Applying Rule 2)
◦ abBb → abAb (Applying Rule 3)
◦ abAb → abbBbb (Applying Rule 2)
◦ abbBbb → abbbb (Applying Rule 4)
Key Concepts:
1. Definition of Context-Free Languages (CFLs):
• A language L is context-free if there exists a CFG G such that L = L(G), where L(G) is the
language generated by the grammar G.
• CFLs are recognised by Pushdown Automata (PDAs), which use a stack to handle recursive
structures.
2. Examples of CFLs:
• Common examples of context-free languages include:
◦ Programming Language Syntax: Most programming languages, like Java, C, and
Python, are defined by CFGs.
◦ Arithmetic Expressions: CFGs can represent the structure of mathematical
expressions.
◦ Balanced Parentheses: Languages with balanced symbols, such as parentheses,
are CFLs.
◦ Palindromes: Languages consisting of strings that read the same forward and
backwards.
3. Applications of CFLs:
• Programming Language Design: CFGs define syntactic rules for compilers, ensuring valid
code structure in languages like Java and Python.
• Natural Language Processing (NLP): CFGs model the syntax of natural languages, helping
in parsing and understanding language structure.
• Formal Verification: CFLs ensure conformity to syntactic rules in verification tasks.
• XML/HTML Parsing: CFGs are used to parse structured documents like XML and HTML,
ensuring valid document structure.
Key Concepts:
1. Components of a Push-Down Automata (PDA):
• A PDA consists of the following elements:
◦ Q: A finite set of states
◦ Σ: The input alphabet (terminal symbols)
◦ Γ: The stack alphabet (set of symbols that can be pushed/popped onto/from the
stack)
◦ δ: The transition function (which defines the behaviour based on the current state,
input symbol, and stack symbol)
◦ q₀: The start state (an element of Q)
◦ Z: The start symbol of the stack
◦ F: The set of accepting states (a subset of Q)
2. Transitions in a PDA:
• Each transition in a PDA is based on the current input symbol, current state, and top of the
stack.
• The transition may:
◦ Pop the top of the stack.
◦ Push new symbols onto the stack.
PDAs are like Non-Deterministic Finite Automata but have an extra component known as a stack.
The unbounded stack provides additional memory beyond the finite amount available in the control, which
allows them to recognise some non-regular languages.
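A minimal Python sketch of a stack-based recogniser in the spirit of a PDA, for the non-regular language {0^n 1^n : n ≥ 0}; the flag-and-list representation is an implementation assumption:

def pda_accepts(w):
    """Stack-based recogniser for {0^n 1^n : n >= 0}."""
    stack = ["Z"]                  # Z marks the stack bottom
    reading_ones = False
    for c in w:
        if c == "0" and not reading_ones:
            stack.append("0")      # push a marker for each 0
        elif c == "1" and stack[-1] == "0":
            reading_ones = True
            stack.pop()            # pop a marker for each 1
        else:
            return False           # no transition available
    return stack == ["Z"]          # accept: input consumed, stack back to Z

print(pda_accepts("0011"))  # True
print(pda_accepts("0101"))  # False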
Formally:
The equivalence problem for two CFGs asks whether L1 = L2, i.e., whether every w ∈ L1 is also in L2 and vice versa.
• Challenges:
◦ Undecidability: The equivalence problem for CFGs is undecidable in general, meaning no
algorithm can solve all instances of this problem.
◦ Practical Heuristics: In some restricted cases, equivalence can be checked using heuristic
methods or approximations.
• Applications:
◦ Compiler Optimisation: Determining whether two different CFGs describe equivalent
subsets of a programming language.
◦ Grammar Simplification: Verifying that simplifying transformations (e.g., removing
redundant rules) preserve language equivalence.
◦ Formal Verification: Checking whether certain properties hold for systems (e.g., safety
properties, termination) by deciding whether behaviors of the system are part of a desired
language.
◦ Compiler Design: Membership and equivalence checks are crucial in lexical and syntax
analysis, ensuring programs are valid and equivalent optimizations are performed.
The membership problem asks if a given string w belongs to the language L generated by a CFG.
If w belongs to L, then w can be derived from the start symbol S of the CFG.
Derivation involves applying production rules to the start symbol to obtain a string.
This membership problem is decidable for CFGs, meaning there exists an algorithm to verify membership
for any given string w.
Membership can be checked using parsing techniques like top-down or bottom-up parsing.
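One standard bottom-up technique is the CYK algorithm, which decides membership for grammars in Chomsky Normal Form. A minimal Python sketch; the CNF grammar used (deriving {a^n b^n : n ≥ 1}) is an assumed illustration:

# CNF grammar for {a^n b^n : n >= 1}:
#   S -> A T | A B,  T -> S B,  A -> a,  B -> b
unit = {"a": {"A"}, "b": {"B"}}
binary = {("A", "T"): {"S"}, ("A", "B"): {"S"}, ("S", "B"): {"T"}}

def cyk(w, start="S"):
    n = len(w)
    if n == 0:
        return False
    # table[i][l] = set of variables deriving the substring w[i : i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, c in enumerate(w):
        table[i][0] = unit.get(c, set())
    for l in range(1, n):                 # span length minus 1
        for i in range(n - l):
            for k in range(l):            # split point inside the span
                for B in table[i][k]:
                    for C in table[i + k + 1][l - k - 1]:
                        table[i][l] |= binary.get((B, C), set())
    return start in table[0][n - 1]

print(cyk("aabb"))  # True
print(cyk("abab"))  # False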
Parse Trees
Ambiguity in CFGs
Ambiguous CFGs
A grammar is ambiguous if there exists at least one string in the language generated by the grammar that
has more than one distinct parse tree (or derivation).
For example, S → aSb | SS | λ is ambiguous: for the string aabb, there are 2 distinct parse trees.
Need for Unambiguous CFGs
Different derivations can imply different meanings for the same string.
Ambiguity can also lead to structural problems when designing compilers and parsers for the language.
Unambiguous grammars are critical in many applications where the structure of a string must be
interpreted in a consistent way.
Unambiguous CFG
A grammar is unambiguous if there exists exactly one parse tree (equivalently, exactly one leftmost derivation) for each string in the language described by that grammar.
Unambiguous grammars have a clear and consistent interpretation of the language structure, and are
easier to parse.
Simplifying CFGs
Normal Form refers to specific structured formats of a context-free grammar (CFG) where the productions
follow a standardised pattern.
A CFG is said to be in Normal Form when every production rule of the grammar has some specific form/
restrictions.
The Chomsky Normal Form(CNF) is one of the simplest and most useful normal forms for a Context-Free
Grammar.
In CNF, every production has the form A → BC or A → a, where a is a terminal symbol and A, B and C are arbitrary non-terminals. B and C may not be the start variable of the grammar.
We may also permit the rule S -> λ, in case our language derives the empty string.
CNF is particularly valuable because it standardizes the grammar structure, simplifying parsing and
derivations.
In CNF, every string derivation with n characters requires exactly 2n−1 steps. This regularity allows for
straightforward parsing and derivation checks, enabling easier validation of a string’s membership in a
language.
Every CFG has an equivalent form in CNF, making it a universal tool in grammar normalisation.
Normalization of CFGs simplifies the structure of a grammar, making it easier to analyse and process.
More efficient parsing algorithms can be designed for grammars in a normal form.
Normal forms help in proving properties of languages, such as membership in the context-free language
class.
• The Pumping Lemma for CFLs provides a necessary condition that must be satisfied for any
string within a context-free language. Unlike the pumping lemma for regular languages, the
lemma for CFLs addresses the recursive and nested nature of context-free structures.
• This lemma is particularly useful in demonstrating the limitations of CFLs by proving that
certain languages cannot be generated by a context-free grammar.
3. Pumping Cases:
• Case 1: vwx consists of only 0’s - Pumping the 0's would either increase or decrease their
count. When we decrease the count to nil, the string s will violate the structure of the
language L.
• Case 2: vwx consists of 0’s and 1’s - Pumping them would either increase or decrease their
count. When we increase the count of them both, the second half will not mirror these
changes, thus the string s will violate the structure of the language L.
• Case 3: vwx straddles the middle - Pumping 0’s and 1’s in the middle would either increase
or decrease their count. When we decrease the count to nil the string s will violate the
structure of the language L.
Conclusion: Since pumping in all cases results in a string that does not have two identical halves, L
cannot be a context-free language.
• Consider the language L = {w : n_a(w) = n_b(w) = n_c(w)}, which includes strings with equal
numbers of a's, b's, and c's.
• Non-CFL Proof Using Intersection: Suppose L is context-free. Then L ∩ L(a*b*c*) = {a^n b^n c^n : n
≥ 0} would also be context-free, but we know this is not true (proved earlier). Therefore, L is
not context-free.
• A property is decidable if there exists an algorithm that provides a yes/no answer in finite
time. In the context of CFLs, two critical decidable properties are:
◦ Is L(G) empty?
◦ Is L(G) infinite?
• A CFG G generates an empty language if no string can be derived from its start symbol.
◦ Remove all non-generating symbols (those that cannot produce terminal strings) and
non-reachable symbols.
◦ After simplifying the grammar, verify if the start symbol S can still derive any
terminal string.
◦ Example: For a grammar G with the production rules
▪ S → aSb | λ
▪ A → aA | b
• Step 1: Remove useless symbols. A is not reachable from S, so remove it.
• Step 2: Check if S can generate a string. After removing A, S can still derive the empty string
λ, which implies that L(G) ≠ ∅.
• A CFG generates an infinite language if it can produce infinitely many distinct strings.
What is Parsing?
Parsing is the process of analysing a string of symbols, either in natural or programming languages,
according to the rules of a formal grammar.
It checks if the input string belongs to the language and provides structure to the input.
Used in compilers, interpreters, natural language processing, and data processing systems.
Types of Parsing
Bottom Up Parsing
Attempts to reduce the input string into the start symbol by reversing the production rules.
LL vs LR Parsers
Applications of Parsing
Recursive descent parsing is a top-down parsing technique where each non-terminal symbol in the
grammar corresponds to a recursive function in the parser. The parser works as follows:
These functions call each other to match production rules against the input tokens
The parser consumes input tokens one by one and applies the corresponding grammar production
until all input tokens are processed.
Example Walkthrough
Parsing Step-By-Step
(Grammar reconstructed from the steps below: S → A B, A → a A | ϵ, B → b B | ϵ; input: aab)
1. Parse S: apply S → A B, so parse A and then B.
2. Parse A:
Input starts with a: match a and recursively parse A again
Input is a: match again and recursively call A
Input is b: does not match a, stop parsing A (using ϵ)
3. Parse B:
Input is b: match b, recursively parse B
No more input: stop parsing B (using ϵ)
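A minimal Python sketch of this walkthrough; the grammar S → A B, A → aA | ϵ, B → bB | ϵ is reconstructed from the steps above, and the function names are illustrative:

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_A():               # A -> a A | epsilon
        nonlocal pos
        if peek() == "a":
            pos += 1
            parse_A()            # recurse on the remaining a's

    def parse_B():               # B -> b B | epsilon
        nonlocal pos
        if peek() == "b":
            pos += 1
            parse_B()

    # S -> A B
    parse_A()
    parse_B()
    return pos == len(tokens)    # accepted only if all input is consumed

print(parse(list("aab")))  # True
print(parse(list("aba")))  # False (the trailing a cannot be matched)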
◦ Features: Recursive descent parsing directly follows the grammar structure, making it easy
to implement for small, simple grammars.
◦ Limitations: Cannot handle left-recursive or highly ambiguous grammars.
Code Structure
Non-Terminal Functions
Issues in the Expression Grammar Example
Key Concepts:
Example: The expression 3 + 4 * 5 can be parsed as either "3 + (4 * 5)", resulting in 23, or "(3 + 4) * 5", resulting in
35, with both parse trees being valid.
These issues (multiple parse trees, undefined operator order, and left recursion) complicate parsing in
expression grammars and highlight the need for grammar modifications to resolve ambiguity.
Left Recursion
Left recursion occurs when a non-terminal in a grammar has a production rule in which that same
non-terminal appears as the leftmost symbol of the right-hand side. This causes the parser to continuously
try expanding the non-terminal indefinitely, leading to infinite recursion.
The non-terminal E is on the left-hand side of its own production (e.g., E → E + T), making this left-recursive.
For recursive descent parsers, this causes a problem, since the parser may get stuck trying to expand
the left-recursive rule infinitely, leading to non-termination.
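For illustration, the standard elimination applied to the usual left-recursive expression grammar (the same E / E' shape that appears in the parsing rules later in these notes):

Left-recursive:       E → E + T | T
After elimination:    E → T E'
                      E' → + T E' | ϵ

The rewritten grammar generates the same strings, but a recursive descent parser can now consume a T first and then iterate on + T.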
Left factoring is a grammar transformation technique used to eliminate ambiguity when two or more
productions for a non-terminal begin with the same prefix. In such cases, the parser might struggle to
decide which production to apply. By factoring out the common prefixes, the grammar becomes more
suitable for parsers, particularly top-down parsers like recursive descent parsers.
◦ The left-factored grammar is now unambiguous and better suited for recursive descent
parsing.
By applying left factoring, common prefixes are removed, resulting in a clearer, more efficient grammar for
top-down parsing techniques like recursive descent parsers.
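An illustrative transformation on an assumed example grammar:

Before:  S → a B | a C
After:   S → a S'
         S' → B | C

The common prefix a is factored out, so the parser reads a first and defers the choice between B and C to the next symbol.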
Useless Symbols
◦ E Parsing:
▪ Rule: E → T E'
▪ Explanation: Begin by parsing T, which may represent a term or a parenthesised
expression. Then, parse E′ to check if there are further operators (e.g., "+" or "*").
◦ E' Parsing:
▪ Rule: E' -> Y E' | ε
▪ Explanation: Parse E′ if there is an operator, recursively handling sequences of
terms. If no operator follows, E′ is ϵ (empty), ending the expression.
◦ Y Parsing:
▪ Rule: Y → + T | * T
▪ Explanation: Match the "+" or "*" operator followed by another term T.
◦ T Parsing:
▪ Rule: T → ( E ) | NUMBER
▪ Explanation: Parse either a number or a full expression enclosed in parentheses.
• Parsing Example:
Left Recursion
Problem: Recursive descent parsers cannot handle left-recursive grammars, where a non-terminal refers
to itself as the first symbol in one of its productions (e.g., A → A α).
Impact: If left recursion is present, the parser will enter an infinite recursive loop, causing the program to
crash.
Solution: The grammar must be rewritten to eliminate left recursion, which can be cumbersome and may
not always be straightforward.
Problem:
Recursive descent parsers struggle with ambiguous grammars, where multiple parse trees are possible for
a given input. Ambiguities may cause the parser to backtrack extensively or fail altogether
Impact:
This can result in exponential time complexity due to backtracking, leading to very inefficient parsing.
Solution:
Ambiguous grammars need to be carefully refactored or avoided, which might not be possible in all
situations
Backtracking Overhead
Problem:
Recursive descent parsers often rely on backtracking when they encounter choices in the grammar. For
example, if multiple productions are possible, the parser may try one production and backtrack if it fails,
trying the next production
Impact:
Excessive backtracking can make parsing inefficient, especially for grammars that have many possible
paths. This can lead to significant performance issues.
Solution:
Lookahead techniques like LL(k) parsing, where the parser looks ahead at multiple tokens to decide which
production to apply, can reduce backtracking but add complexity to the parser
Limited to LL(1) Grammars
Problem: Recursive descent parsers are typically designed to work with LL(1) grammars, meaning they
require only one token of lookahead to make parsing decisions
Impact: Many useful programming languages and grammars are not LL(1), meaning they require more than
one token of lookahead or have more complex rules
Solution: The grammar may need to be modified (left-factoring, removing ambiguities), or a different
parsing strategy (e.g., LR parsers or LL(k) parsers) may be needed.
Error Handling
Problem: Error handling is often poor in recursive descent parsers. When a syntax error is encountered, it
is not always clear how to recover gracefully and continue parsing
Impact: The parser may stop prematurely after encountering an error or provide unclear error messages,
making debugging and diagnostics difficult for large inputs.
Solution: Implementing robust error recovery mechanisms (like panic mode or phrase-level recovery) is
challenging in recursive descent parsers and often requires complex logic.
Week - 8
• LL(1) Parsing is a deterministic, top-down parsing technique that reads the input from Left
to right and constructs the Leftmost derivation of the input string using 1 lookahead symbol
to predict which production to apply.
• LL(1) parsers are known as predictive parsers because they can decide the next action
based on the lookahead symbol without backtracking.
• Context-Free Grammar (CFG): LL(1) parsers rely on CFGs that are compatible with LL(1)
rules.
• Parsing Table: A predictive parsing table determines which production rule to apply based
on the input symbol and the top of the stack.
• Stack and Input Processing: The parser uses a stack to keep track of non-terminals and
terminals during the parsing process. The current input symbol and top of the stack guide
the choice of production.
• Cannot Handle Left-Recursive Grammars: LL(1) parsers enter infinite loops if the
grammar contains left-recursive rules, as these rules make it impossible to determine the
next step without backtracking.
• Limited to One Lookahead Symbol: Since LL(1) parsers use only one lookahead symbol,
they cannot handle grammars that require more context.
• Not Suitable for Ambiguous Grammars: LL(1) parsers work only with unambiguous
grammars, as they need a deterministic approach without multiple derivations.
LL(1) parsing is an essential parsing method for simple, deterministic grammars, often used in educational
and compiler design contexts to illustrate predictive parsing principles.
• Initialise the Stack: Start with the stack containing the start symbol and the end-of-input
symbol ($).
• Lookahead Symbol: Examine the first input symbol, which is used as the lookahead.
• Top of Stack Check: If the top of the stack is a non-terminal, use the parsing table to find
the appropriate rule based on the lookahead symbol.
• Apply Rule or Match: If the top of the stack matches the lookahead symbol, pop both from
the stack and move forward in the input. Otherwise, replace the non-terminal with the right-
hand side of the production.
2. Example of Parsing with a Predictive Table: Using a parsing table, the parser applies each rule
based on the current top of the stack and the lookahead symbol, expanding non-terminals and
matching terminals with input symbols.
1. Nullable Set:
2. First Set:
These sets are foundational for predictive parsing, helping to ensure that parsers operate efficiently
without backtracking.
1. Follow Set:
• Definition: The Follow set of a non-terminal A is the set of terminals that can appear
immediately after A in any valid string derived from the grammar.
• Rules for Calculating the Follow Set:
• Start Symbol: Always include the end-of-input symbol ($) in FOLLOW of the start
symbol.
• Production Rule A → αBβ: If there is a production A → αBβ, then add FIRST(β)
(excluding ϵ) to FOLLOW(B). If β can derive ϵ, add FOLLOW(A) to FOLLOW(B).
• Production Rule A → αB: If there is a production where B is at the end, add
FOLLOW(A) to FOLLOW(B).
2. Algorithm for Follow Set:
It contains symbols that can appear after the non-terminal in some derivation.
These sets (Nullable, First, and Follow) play a fundamental role in constructing predictive parsers, enabling
LL(1) parsers to function efficiently and without backtracking.
• Productions:
• S→ABC
• A→a∣ϵ
• B→b∣ϵ
• C→cD∣ϵ
• D→d∣ϵ
• Rules Applied:
• A→ϵ: A is nullable because it has an ϵ production.
• B→ϵ: B is nullable because it has an ϵ production.
• C→cD∣ϵ: C is nullable because it has an ϵ production.
• D→d∣ϵ: D is nullable because it has an ϵ production.
• S→ABC: Since A, B, and C are nullable, S is also nullable.
• Final Nullable Set: Nullable(S)={ϵ}, Nullable(A)={ϵ}, Nullable(B)={ϵ}, Nullable(C)={ϵ},
Nullable(D)={ϵ}.
• Rules Applied:
• A→a∣ϵ: FIRST(A) = {a} because of the production A→a. Since A is nullable, add ϵ.
• B→b∣ϵ: FIRST(B) = {b} because of the production B→b. Since B is nullable, add ϵ.
• C→cD∣ϵ: FIRST(C) = {c} because C→cD. Since C is nullable, add ϵ.
• D→d∣ϵ: FIRST(D) = {d} because of the production D→d. Since D is nullable, add ϵ.
• S→ABC: FIRST(S) depends on A, B, and C. Since A, B, and C are all nullable,
FIRST(S) = FIRST(A) ∪ FIRST(B) ∪ FIRST(C). FIRST(S) = {a, b, c, ϵ}.
• Final FIRST Set: FIRST(S)={a,b,c,ϵ}, FIRST(A)={a,ϵ}, FIRST(B)={b,ϵ}, FIRST(C)={c,ϵ},
FIRST(D)={d,ϵ}.
◦ Rules Applied:
◦ Add $ to FOLLOW(S) since S is the start symbol.
◦ For S → ABC, add FIRST(B) to FOLLOW(A) and FIRST(C) to FOLLOW(B).
◦ Since C is nullable, FOLLOW(S) is added to FOLLOW(C), resulting in FOLLOW(C) =
{ $ }.
• Productions:
◦ S→AB
◦ A→aA∣b
◦ B→cBd∣ϵ
• Rules Applied:
◦ A is not nullable since neither production leads to ϵ.
◦ B is nullable due to B→ϵ.
◦ Since A is not nullable, S is also not nullable.
• Final Nullable Set:
◦ Nullable(S)={},Nullable(A)={},Nullable(B)={ϵ}
• Rules Applied:
◦ A→aA∣b: FIRST(A) = {a, b} because of the two productions aA and b.
◦ B→cBd∣ϵ: FIRST(B) = {c, ϵ} because of the two productions cBd and ϵ.
◦ S→AB: FIRST(S) depends on FIRST(A). Since A is not nullable, FIRST(S) =
FIRST(A) = {a, b}.
◦ Final FIRST Set: FIRST(S)={a,b}, FIRST(A)={a,b}, FIRST(B)={c,ϵ}
• Rules Applied:
◦ Add $ to FOLLOW(S) as the end-of-input marker.
◦ For S → AB, add FIRST(B) to FOLLOW(A). Since B is nullable, add FOLLOW(S) to
FOLLOW(A), resulting in { c, $ } for FOLLOW(A).
◦ For B → cBd, add d to FOLLOW(B).
• Final Follow Set:
◦ FOLLOW(S)={$}, FOLLOW(A)={c,$}, FOLLOW(B)={d,$}
These sets allow for the construction of predictive parsing tables necessary for LL(1) parsing, providing the
basis for determining which productions to apply based on lookahead symbols.
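A minimal Python sketch that computes these sets by fixed-point iteration for the second example grammar above; here FIRST sets exclude ϵ, since nullability is tracked separately (a representation choice, not from the notes):

# Grammar: S -> A B, A -> a A | b, B -> c B d | eps
prods = {
    "S": [["A", "B"]],
    "A": [["a", "A"], ["b"]],
    "B": [["c", "B", "d"], []],        # [] is the eps production
}
terminals = {"a", "b", "c", "d"}

nullable = set()
first = {A: set() for A in prods}
follow = {A: set() for A in prods}
follow["S"].add("$")                   # end marker follows the start symbol

changed = True
while changed:
    changed = False
    for A, rhss in prods.items():
        for rhs in rhss:
            # Nullable: every symbol of the right-hand side must be nullable
            if all(x in nullable for x in rhs) and A not in nullable:
                nullable.add(A); changed = True
            # First: scan the rhs until a terminal or non-nullable symbol
            for x in rhs:
                new = {x} if x in terminals else first[x]
                if not new <= first[A]:
                    first[A] |= new; changed = True
                if x in terminals or x not in nullable:
                    break
            # Follow: for each non-terminal x in rhs, add FIRST of the tail;
            # add FOLLOW(A) if the whole tail can derive eps
            for i, x in enumerate(rhs):
                if x in terminals:
                    continue
                tail_nullable = True
                for y in rhs[i + 1:]:
                    new = {y} if y in terminals else first[y]
                    if not new <= follow[x]:
                        follow[x] |= new; changed = True
                    if y in terminals or y not in nullable:
                        tail_nullable = False
                        break
                if tail_nullable and not follow[A] <= follow[x]:
                    follow[x] |= follow[A]; changed = True

print("Nullable:", nullable)   # {'B'}
print("FIRST:", first)         # S: {a,b}, A: {a,b}, B: {c}
print("FOLLOW:", follow)       # S: {$}, A: {c,$}, B: {d,$}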
1. Algorithm Overview:
• Grammar:
◦ S→AB
◦ A→a∣ϵ
◦ B→b∣ϵ
• First and Follow Sets:
◦ FIRST(A) = { a, ϵ }
◦ FIRST(B) = { b, ϵ}
◦ FIRST(S) = { a, b, ϵ}
◦ FOLLOW(S) = { $ } (since S is the start symbol)
◦ FOLLOW(A) = { b, $ } (because B follows A in S→AB, and B is nullable, so $ is added from FOLLOW(S))
◦ FOLLOW(B) = { $ } (because B is at the end of the production)
• Filled Table:
◦ For S→AB: FIRST(AB) contains a and b, so we place S→AB in both [S,a] and [S,b].
Since both A and B are nullable (AB can derive ϵ), also place S→AB in [S,$] (from FOLLOW(S)).
◦ For A→a: FIRST(a) contains a, so place A→a in [A,a].
◦ For A→ϵ: Since ϵ∈FIRST(A), place A→ϵ in [A,b] and [A,$] (from FOLLOW(A)).
◦ For B→b: FIRST(b) contains b, so place B→b in [B,b].
◦ For B→ϵ: Since ϵ∈FIRST(B), place B→ϵ in [B,$] (from FOLLOW(B)).
Parsing a String Using Table
Example Parsing:
Step-by-Step Parsing
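The step-by-step trace was lost in extraction; as a substitute, here is a minimal table-driven LL(1) driver in Python using the table filled in above (the input strings are illustrative):

table = {
    ("S", "a"): ["A", "B"], ("S", "b"): ["A", "B"], ("S", "$"): ["A", "B"],
    ("A", "a"): ["a"],      ("A", "b"): [],         ("A", "$"): [],
    ("B", "b"): ["b"],      ("B", "$"): [],
}

def ll1_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "S"]                   # start symbol on top
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:             # terminal (or $) matches input
            i += 1
        elif (top, tokens[i]) in table:  # non-terminal: expand via the table
            stack.extend(reversed(table[(top, tokens[i])]))
        else:
            return False                 # no table entry: syntax error
    return i == len(tokens)

print(ll1_parse(list("ab")))  # True
print(ll1_parse(list("ba")))  # False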
LL(2) Parsers
• LL(2) parsing extends LL(1) parsing by allowing the parser to look ahead 2 input symbols instead of just
1. This gives the parser more context to decide.
Lookahead of 2 Symbols:
• In LL(2), the parser examines the current input symbol and the one that follows it, providing better insight
into which production to apply.
• This helps resolve ambiguities that LL(1) cannot handle, especially when two or more rules have
overlapping FIRST sets but differ on the second symbol.
By looking two symbols ahead, LL(2) parsers can differentiate between rules that share common prefixes
but differ afterward. This reduces FIRST set conflicts.
LL(k)
Increased Complexity:
• Although LL(k) parsers can handle more complex grammars, they are also more computationally
expensive. The parser's decision-making process becomes more complicated as it has to consider
multiple symbols simultaneously.
• The size of the parsing table grows exponentially with k, making the construction and storage of the
table more challenging for larger values of k.
• S -> a b X
• S -> a b Y
• S -> a c Z
• Choosing among these productions needs at least 2 symbols of lookahead, since all three rules begin with a.
Comparison
Week - 9
LL(k) parsers rely on a fixed number of lookahead tokens (k) to make parsing decisions
For some grammars, even with large k, the parser may not have enough information to resolve ambiguities
or determine which production to apply
This results in parsing failures for languages that require more than k tokens of lookahead.
LL(k) parsers cannot handle left-recursive grammars, where a non-terminal calls itself on the left side of
the production
The grammar needs to be transformed to remove left recursion, which can complicate grammar design.
LL(k) parsers can only parse a subset of context-free languages called LL(k) grammars
More powerful parsing techniques (such as LR parsers) are required for these languages, which reduces
the utility of LL(k) parsers for modern compiler implementations.
Shift-Reduce Parsing is a type of bottom-up parsing. The goal is to reduce the entire input to the start
symbol of the grammar
The parser works by either shifting symbols from the input to a stack or reducing the top symbols of the
stack into non-terminals based on the grammar rules
Reduce: Apply a production rule to replace symbols on the stack with a non-terminal
Accept: The parsing is complete when the stack contains the start symbol and the input is empty
Error: Occurs when no valid action (shift or reduce) can be applied, and the parser cannot proceed.
Shift-Reduce Parsing Example
Let us try to reduce input sequence id * id + id to the start symbol S using the grammar rules.
Let us look at how the input sequence and stack change during parsing.
Continue shifting and reducing until the stack contains the start symbol S and the input is empty
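The grammar itself did not survive extraction; assuming the usual expression grammar S → S + T | T, T → T * F | F, F → id, a plausible shift-reduce trace for id * id + id is:

Stack        | Input      | Action
$            | id*id+id$  | shift id
$ id         | *id+id$    | reduce F -> id
$ F          | *id+id$    | reduce T -> F
$ T          | *id+id$    | shift *
$ T *        | id+id$     | shift id
$ T * id     | +id$       | reduce F -> id
$ T * F      | +id$       | reduce T -> T * F
$ T          | +id$       | reduce S -> T
$ S          | +id$       | shift +
$ S +        | id$        | shift id
$ S + id     | $          | reduce F -> id
$ S + F      | $          | reduce T -> F
$ S + T      | $          | reduce S -> S + T
$ S          | $          | accept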
Can handle a wide range of grammars, including ambiguous ones with proper conflict resolution
Example: A shift/reduce parser can easily handle arithmetic expressions with proper operator precedence.
Requires careful handling of shift/reduce conflicts and reduce/reduce conflicts.
Parsing tables for larger grammars can be complex and difficult to debug.
What is LR Parsing?
LR Parsers are a type of bottom-up parsers that read input Left-to-right and construct a Rightmost
derivation in reverse
The term "LR" stands for Left-to-right scanning of input and Rightmost derivation in reverse
An LR Parsing Table is a precomputed structure that directs the parser's actions: whether to shift, reduce,
accept, or declare an error
Without the table, the parser would have no guidance for making decisions and would not be able to
parse efficiently
The parser uses the table to decide its next move based on the current state and the next input symbol.
Efficient Decision Making: The parsing table allows the parser to efficiently decide its actions based on a
lookup
Conflict Resolution: Helps manage shift-reduce and reduce-reduce conflicts
Predictive Parsing: The table precomputes all possible shifts and reductions, ensuring the parser operates
in a deterministic way.
Precomputation: Tables are computed once, making parsing faster during runtime
Deterministic Parsing
Allows us to handle Complex Grammars.
LR Parsers are a type of bottom-up parser that reads input Left-to-right and constructs a Rightmost
derivation in reverse
LR(0) is the simplest form of LR parsing, which uses no lookahead symbols (i.e., 0 lookahead)
It's one of the earliest and most basic algorithms for constructing shift-reduce parsers
This means the parser decides whether to shift or reduce without looking at the next input symbol
The parser processes input by moving through a set of states, making decisions based solely on the
current state and the content of the stack
LR(0) Items
An LR(0) item is a production rule with a dot (.) indicating the current parsing position
These items track how much of a production rule has been parsed.
Advantages
Simplest form of LR parsing
Can be implemented e ciently
Limitations
Cannot handle all context-free grammars
Prone to conflicts due to the lack of lookahead (especially reduce-reduce conflicts)
Only suitable for simple grammars with minimal ambiguity
Example
This grammar will be used to build the LR(0) parsing table and parse a string.
LR(0) items track the position of the "dot" in the production rules, showing progress in the parsing
process.
The Goto Table tells us how to transition between states based on non-terminals
Let us try to parse the string "aab" using the LR(0) Parsing Table constructed. This string is accepted by
the grammar
Transitions:
Action Table: For terminals, use shift if there's a move; use reduce if the dot is at the end of a production.
Goto Table:
For non-terminals, transition to the corresponding state based on the grammar.
Parsing
1. Grammar Used:
• S→CC
• C→cC
• C→d
2. Augmented Grammar:
• Add an augmented production: S′→ S.
4. State Transitions:
• Define shifts and gotos based on terminals (c, d) and non-terminals (C):
• I0 → c → I3 (Shift on c)
• I0 → d → I4 (Shift on d)
• I2 → c → I3 (Shift on c)
• I2 → d → I4 (Shift on d)
• I3 → c → I3 (Shift on c)
• I3 → d → I4 (Shift on d)
• I0 → C → I2 (Goto on C)
• I2 → C → I5 (Goto on C)
• I3 → C → I6 (Goto on C)
This process demonstrates how to construct an LR(0) table systematically, enabling deterministic parsing
for simple grammars.
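A minimal Python sketch of computing LR(0) item sets (closure and goto) for this grammar; the tuple representation (lhs, rhs, dot position) is an implementation assumption:

prods = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
nonterminals = {"S'", "S", "C"}

def closure(items):
    items = set(items)
    while True:
        new = set(items)
        for lhs, rhs, dot in items:
            if dot < len(rhs) and rhs[dot] in nonterminals:
                # add A -> . alpha for every production of the symbol after the dot
                for A, alpha in prods:
                    if A == rhs[dot]:
                        new.add((A, alpha, 0))
        if new == items:
            return frozenset(items)
        items = new

def goto(items, X):
    moved = {(l, r, d + 1) for (l, r, d) in items if d < len(r) and r[d] == X}
    return closure(moved)

I0 = closure({("S'", ("S",), 0)})
I3 = goto(I0, "c")
print(goto(I3, "c") == I3)                # True: I3 -> c -> I3, as listed above
print(("C", ("d",), 1) in goto(I0, "d"))  # True: I4 contains C -> d .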
What is LR(1) Parsing and LR(k) Parsing
Key Concepts:
2. Components of LR(1):
• States: Represent parsing configurations at each step.
• Lookahead: A single symbol that helps the parser decide between ambiguous production
rules.
• Action Table: Dictates whether to:
◦ Shift: Move to the next state by reading the input symbol.
◦ Reduce: Apply a production rule to reduce the stack.
◦ Accept: Successfully parse the input.
◦ Error: Detect and report a syntax error.
LR(1) parsing is widely used in practice and forms the basis for many parser generators like Yacc and
Bison, offering a balance between complexity and capability.
Key Concepts:
1. LL(k) Parsers:
• Definition: Top-down parsers that scan input left to right and produce a leftmost derivation
using k-lookahead symbols.
• Key Characteristics:
◦ Cannot handle left-recursive grammars.
◦ Simpler to implement but more restrictive.
• Class Inclusion: LL(1)⊆LL(2)⊆...⊆LL(k).
◦ As k increases, LL parsers can handle more complex grammars.
2. LR(0) Parsers:
• Definition: Bottom-up parsers that scan input left to right and construct a rightmost
derivation in reverse without using lookahead symbols.
• Key Characteristics:
◦ Can handle left recursion.
◦ More expressive than LL parsers but limited in resolving parsing conflicts.
• Class Inclusion: LL(1)⊆LR(0).
3. LR(1) Parsers:
• Definition: Bottom-up parsers that use 1 lookahead symbol to resolve conflicts.
• Key Characteristics:
◦ Can handle all deterministic context-free grammars.
◦ Resolves ambiguities that LR(0) cannot handle.
• Class Inclusion: LL(1)⊆LL(2)⊆…⊆LL(k)⊆LR(0)⊆LR(1).
4. Hierarchy Summary:
• The hierarchy shows that LR(k) parsers are the most powerful, able to handle a broader
class of grammars compared to LL(k).
• While LL(k) parsers are easier to implement, LR(k) parsers are better suited for complex
grammars, making them essential in real-world applications.
Understanding this hierarchy helps in selecting the appropriate parser type for specific grammar
complexities.
Week - 10
De nition: POS (Part-of-Speech) Tagging is a process in natural language processing (NLP) that involves
identifying and labeling the parts of speech (nouns, verbs, adjectives, etc.) in a given text.
• Rule-Based Approaches: Use predefined linguistic rules; effective for structured text but lacks
adaptability for complex scenarios.
• Neural Networks: Leverage word embeddings and deep learning for state-of-the-art performance.
• Transformers (e.g., BERT): Use contextualised embeddings for higher tagging accuracy.
• Syntax Analysis: Lays the groundwork for syntactic parsing and grammar checking.
• Named Entity Recognition (NER): Differentiates named entities from other nouns.
• Sentiment Analysis: Identifies opinions using adjectives and adverbs.
• Information Retrieval: Improves query understanding by analysing word roles.
• Machine Translation: Enhances translation quality by aligning syntactic structures across
languages.
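A minimal sketch using NLTK's built-in tagger; it assumes the nltk package is installed and its tokenizer and tagger models have been downloaded:

import nltk
# One-time downloads, if needed:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Colorless green ideas sleep furiously")
print(nltk.pos_tag(tokens))
# e.g. [('Colorless', 'JJ'), ('green', 'JJ'), ('ideas', 'NNS'),
#       ('sleep', 'VBP'), ('furiously', 'RB')]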
Developed by Noam Chomsky to explain the syntax of sentences without relying on context
Example:
"Colorless green ideas sleep furiously” (Chomsky, 1957) - grammatically correct but semantically
nonsensical
Focus on Syntax:
• CFGs operate at the syntactic level, analyzing structure, not meaning.
CFG for NLP:
• Helps build a rule-based approach to parse sentences
Understanding Ambiguity:
Types of Parsers:
• Top-Down Parsers: Start from the sentence level, breaking down into smaller constituents (e.g., NP,
VP)
• Bottom-Up Parsers: Start with individual words and build up to form the full sentence structure
Classic View:
Treats POS tagging as a parsing problem where parsing rules guide tagging
Modern Solutions:
• Use of machine learning and statistical models has largely replaced rule-based POS tagging
• A type of formal grammar in the Chomsky hierarchy, more powerful than Context-Free Grammars
(CFGs)
• Can capture language structures where the context around a symbol affects its production rules
• Key Feature:
• The production rules can depend on the surrounding context of symbols, making CSGs more flexible
for complex languages
• Context-Sensitive Grammars (CSGs) allow rules that depend on the context around symbols
• Unlike simpler grammars, CSGs can capture relationships between elements that rely on position or
neighboring symbols
• CSGs are more powerful than context-free grammars and describe languages that require context for
certain rules to apply
Context-Sensitive Grammars (CSGs) and POS Tagging
Computational Complexity:
◦ Parsing context-sensitive languages is generally more complex than context-free languages, often
requiring non-deterministic linear space
Practical Limitations:
◦ Few efficient parsers for CSGs exist, making them impractical for real-time applications
• Due to their complexity, many languages modeled by CSGs are approximated by simpler grammar types
where possible
2. Grammar Rules:
• <Program> → <Statement> <Program> | ε
• <Statement> → <Var> = <Expr> : <Type>
• <Expr>: int → <Expr> + <Expr> : int
• <Expr>: int → 0 | 1 | 2 | ... | 9
• <Expr>: str → "..." (any string literal)
• <Var>: int → x | y | z (integer variables)
• <Var>: str → s | t | u (string variables)
• <Type> → int | str
4. Example Parse:
• For x = 3 + 5 : int:
◦ <Statement> → x = <Expr> + <Expr> : int
◦ <Expr> + <Expr> : int → 3 + 5 : int
• For s = "hello" : str:
◦ <Statement> → s = <Expr> : str
◦ <Expr> : str → "hello"
2. Grammar Rules:
• <Classification> → <Family> : <Genus> <Species> <Classification> | ε
• <Family>: Felidae → <Genus>: Felis
• <Genus>: Felis → <Species>: catus | silvestris
• <Family>: Canidae → <Genus>: Canis
• <Genus>: Canis → <Species>: lupus | familiaris
3. Context-Sensitive Rules:
• Latin Naming Conventions: Genus names are capitalised; species names are lowercase
and italicised.
• Species-Specific Context: A species must belong to its defined genus and family.
4. Example Parse:
• For Felidae: Felis catus:
◦ <Classification> → <Family>: <Genus> <Species>
◦ <Family>: Felidae → <Genus>: Felis
◦ <Genus>: Felis → <Species>: catus
• For Canidae: Canis lupus:
◦ <Classification> → <Family>: <Genus> <Species>
◦ <Family>: Canidae → <Genus>: Canis
◦ <Genus>: Canis → <Species>: lupus
This grammar showcases how CSGs handle structured, context-sensitive tasks like taxonomic
classification with precision.
1. Applications of CSGs:
Proposed by linguist Noam Chomsky, this classifies types of formal grammars based on their complexity
and power.
It's a theoretical framework used to understand different language classes in computer science,
linguistics, and automata theory.
Definition:
Type 0 grammars (Unrestricted Grammars) have no restrictions
on their production rules, making them the most powerful.
Formal Definition:
Productions are of the form α → β, where α and β are strings of variables and terminals, with α containing at least one non-terminal.
Languages Generated:
Type 0 grammars can generate any language that is recursively
enumerable
Computational Model:
These grammars correspond to Turing machines, making them
as powerful as any algorithmically describable process.
De nition: Type 1 grammars (CSGs) have rules where each production must have at least as many
symbols on the right side as on the left, preserving context
Formal Definition: Productions are of the form αAβ → αγβ, where A is a non-terminal, α and β are strings of
terminals and non-terminals, and γ is a non-empty string.
Languages Generated: These grammars generate context-sensitive languages, which include certain
complex patterns not possible at lower levels
Computational Model: CSGs correspond to linear-bounded automata, which are Turing machines with
limited memory.
Definition: Type 2 grammars (CFGs) have rules where a single non-terminal symbol is transformed
independently of surrounding symbols.
Formal Definition: Productions are of the form A → γ, where A is a non-terminal and γ is a string of terminals
and/or non-terminals.
Definition:
Type 3 grammars (Regular Grammars) have very strict rules, where productions involve a non-terminal
transforming into a terminal, optionally followed by a non-terminal.
Formal Definition:
Productions are of the form A → aB or A → a, where A and B are non-terminals, and a is a terminal.
Languages Generated: Regular grammars generate regular languages, the simplest language class, which
includes patterns like sequences and repetitions.
Computational Model: Regular languages are recognised by finite automata (both deterministic and
non-deterministic).