Formal Language and Applications notes

The document covers fundamental concepts in formal languages, including alphabets, strings, concatenation, and null strings, as well as the definitions and properties of languages and grammars. It introduces regular expressions, deterministic finite automata (DFA), and the limitations of regular languages, emphasizing the need for context-free grammars (CFGs) for more complex structures. Key topics include operations on languages, the Myhill-Nerode theorem, and the Pumping Lemma, which are essential for understanding the capabilities and limitations of different types of formal languages.


Week - 1

What are Alphabets?


An alphabet, in the formal sense, is any finite, nonempty set ∑ of symbols.
Example:
∑= {0,1}
∑ = { a,b,c }
∑ = { $, #, & }

What are Strings?

Strings are finite sequences of symbols from a given alphabet.


Consider ∑ = {0,1}
Possible Strings?
01010
00001111
0
1

Length of Strings
Number of symbols a string contains.
If w is a string, its length is written as |w|.

Null String
Can we have strings of length 0?
Null String.
We use λ or ε to refer to Null strings.

Concatenation

∑ = {a, b, c, 0, 1}

W1 = abab
W2 =0101
W3 = W1W2 = abab0101

Concatenating a String With Itself!

∑ = {a, b, c, 0, 1}

W = ab
w⁰ = λ (null string)
w² = ww = abab
w³ = www = ababab
w⁴ = wwww = abababab

Reversing a String

∑ = {a, b, c, 0, 1}
W = ab01
Wᴿ (the reverse of W) = 10ba

Prefix of a String - Example

Prefix
x is a prefix of string y if a string z exists such that xz = y
x=ab, z=01, y=ab01
Suffix of a String - Example

Suffix
x is a suffix of string y if a string z exists such that zx = y
x= 01, z =ab, y= ab01
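To make these operations concrete, here is a minimal Python sketch (my own illustration; the helper names is_prefix and is_suffix are not from the notes):

# String operations over an alphabet such as {a, b, c, 0, 1}.
def is_prefix(x: str, y: str) -> bool:
    # x is a prefix of y if some z exists with x + z == y
    return y.startswith(x)

def is_suffix(x: str, y: str) -> bool:
    # x is a suffix of y if some z exists with z + x == y
    return y.endswith(x)

w = "ab01"
print(len(w))        # |w| = 4
print(w[::-1])       # reverse of w: "10ba"
print("ab" * 3)      # w^3 for w = ab: "ababab"
print(is_prefix("ab", w), is_suffix("01", w))   # True True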

Kleene Star

Let ∑ = {0,1}
∑* contains all possible strings made by selecting & concatenating elements of ∑

Selecting all individual symbols..


∑* = {λ ,0,1}

Selecting all combinations that make length 2..


∑* = {λ, 0, 1, 00, 01, 10, 11}

Selecting all combinations that make length 3..


∑* = {λ, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111}

Selecting all combinations that make length 0…..


∑* = {λ}

∑+ = ∑* - {λ}
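∑* is infinite, but it can be enumerated up to any chosen length. A small Python sketch (the cutoff length 3 is an arbitrary choice for illustration):

from itertools import product

sigma = ["0", "1"]

# All strings over sigma of length 0..3; length 0 contributes the null string.
sigma_star = ["".join(p) for n in range(4) for p in product(sigma, repeat=n)]
print(sigma_star)    # ['', '0', '1', '00', '01', '10', '11', '000', ...]

# Sigma-plus excludes the null string.
sigma_plus = [s for s in sigma_star if s != ""]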

Properties on String Operations - A Proof by induction

Proof of |uv| = |u| + |v|

Basis

|uv| = |u| + |v| is true for all u of any length and all v of length 1 (appending a single symbol increases the length by exactly one).

Inductive Assumption

|uv| = |u| + |v| is true for all u of any length and all v of length up to n.

Inductive Step

Let v have length n + 1 and write v = xa, where a is a single symbol and |x| = n. Then

|uv| = |uxa| = |ux| + 1

By the inductive hypothesis, we know

|ux| = |u| + |x|

Therefore,
|uv| = |u| + |x| + 1 = |u| + |v|

What are Languages?

Any set of strings over an alphabet ∑ is called a language.


Let ∑ = {0,1}

L = {0,1, 00, 01, 10, 11}


L = {1, 0, 00, 000, 0000}
L = {1, 0, 00, 01, 10, 11, 000, ….}
A sentence is any string in a language.

L = {0ⁿ1ⁿ : n ≥ 0}

Given a language L, how do we verify whether a sentence s belongs to L?


Language Acceptors / Automata

Output = Yes/No
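As a toy language acceptor, here is a hedged Python sketch that answers yes/no for membership in L = {0ⁿ1ⁿ : n ≥ 0} (my own illustration, not an automaton from the lectures):

def in_L(s: str) -> bool:
    # Accept exactly the strings 0^n 1^n for n >= 0.
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n

for s in ["", "01", "0011", "0101", "10"]:
    print(repr(s), "Yes" if in_L(s) else "No")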

Set Operations on Languages

Let ∑ = {0,1}
L1 = {0ⁿ1ⁿ : n ≥ 0}
L2 = {1ⁿ0ⁿ : n ≥ 0}
L3 = L1 ∪ L2
L3 = ∑* - L1 (complement of L1)
L3 = {wᴿ : w ∈ L1} (reversal of L1)
Using Concatenation
L1 = {00, 000}
L2 = {1, 11, 111}
L3 = L1L2 = {xy : x ∈ L1, y ∈ L2}
L3 = { 001, 0011, 00111, 0001, 00011, 000111}
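For finite languages, concatenation can be computed directly. A short Python sketch reproducing the example above:

L1 = {"00", "000"}
L2 = {"1", "11", "111"}

# L1L2 = {xy : x in L1, y in L2}
L3 = {x + y for x in L1 for y in L2}
print(sorted(L3))    # the six strings listed above, in sorted order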

Kleene Closure / Star Closure on Languages

The Kleene closure (star closure) of L is defined as L* = L⁰ ∪ L¹ ∪ L² ∪ …


Languages & Grammar

The grammar of a language tells us whether a sentence is well phrased or not!

<sentence> → <noun-phrase> <predicate>


<noun-phrase> → <article> <noun>
<predicate> → <verb>

Rules / Productions
All three rules given above, starting from
<sentence>

Variables
<sentence>, <noun-phrase>, <predicate>,
<article>, <noun>, <verb>

Starting Symbol
<sentence>

Terminal Symbols
Any English word we substitute for Variables

Definition of a Grammar

A Grammar G is defined as a quadruple


G = (V,T,S,P) in which:
V - Variables or Non-Terminals
T - Terminal Symbols
S - Starting Symbol
P - Productions / Rules
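The quadruple can be written down directly as data. The sketch below encodes the toy English grammar from above in Python; the terminal words ("a", "dog", "runs") are placeholder choices of mine, not part of the notes:

# G = (V, T, S, P) for the toy English grammar.
V = {"<sentence>", "<noun-phrase>", "<predicate>", "<article>", "<noun>", "<verb>"}
T = {"a", "dog", "runs"}                       # assumed terminal words
S = "<sentence>"
P = {
    "<sentence>":    [["<noun-phrase>", "<predicate>"]],
    "<noun-phrase>": [["<article>", "<noun>"]],
    "<predicate>":   [["<verb>"]],
    "<article>":     [["a"]],
    "<noun>":        [["dog"]],
    "<verb>":        [["runs"]],
}

def derive(symbol):
    # Expand non-terminals using the first production for each one.
    if symbol in T:
        return [symbol]
    return [tok for part in P[symbol][0] for tok in derive(part)]

print(" ".join(derive(S)))   # a dog runs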

Automata

Once the input is completely read and processed, the automaton produces an output: yes or no. In this
case, our automaton works as an acceptor. It accepts the sentence if it is valid. At times, the automaton
produces some other useful output or transforms the given input. In other words, it takes its input and then
produces another output, not just yes or no. In such cases, we say the automaton works as a transducer.
The emphasis on automata is relatively lesser in this course. The larger emphasis is on formal languages,
their corresponding grammars, properties and applications.
Week - 2

Regular Expressions (Informally)

. = concatenation
* = repetition (zero or more times)
+ = OR (union)

Example:
Language L = {0,1}

RE is 0+1

Language L = {01}
RE is 0.1

Language L = {00, 01}


RE = 0.0 + 0.1
= 0.(0+1)

Language L = {λ, 01, 0101, 010101, …}


RE = (0.1)*

A Note on Operator Precedence in RE

Precedence, from highest to lowest: star (*), then concatenation (.), then union (+). For example, 0+10* is read as 0 + (1(0*)), not as (0+1)0*.

Languages Associated with REs

Example:
We've learned that for every regular language L: one, we will be able to find a regular expression to
describe it; two, there is a DFA that accepts this language L.

Regular Languages

The language associated with any regular expression is a regular language.

For every regular language, we have the corresponding regular expression.

Regular languages are a very useful and important type of formal language. You will learn that regular
expressions can be used to describe languages and, if a language is given, you will learn how to find the
regular expression if possible. The language associated with any regular expression is called a
regular language. This means that whenever we have a regular expression, the set of strings it defines is a
regular language. Furthermore, for every regular language, there is a corresponding regular expression. So
given a regular language, you should be able to find the corresponding regular expression. In a sense,
regular expressions and regular languages are two sides of the same coin. Regular languages are quite a
useful class of formal languages. Imagine that I am handing over a long document to you,
and I request you to find all the phone numbers in the document. You know for sure that valid phone
numbers form a pattern, and you can certainly write a regular expression to search for all of them. How do you
think text editors or IDEs highlight certain kinds of text, keywords, etc.? They use regular
expressions to find all the matching patterns and they highlight them. When you declare a variable or a
constant in a programming language, when you write a numerical value or a string value, or when you use
a keyword in a programming language, the compiler understands them correctly by modelling these items
as regular languages. We will see more of this shortly in this course. In network security, they help to
detect patterns that might indicate a cyber attack. In data mining, they help in finding useful patterns in large
data sets. Regular languages are a powerful class of formal languages which has got so many
applications in computer science. I am sure you will learn what regular expressions are and how regular
expressions and regular languages are connected.
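For instance, a hedged sketch using Python's re module; the assumed phone-number format (10 digits, optionally grouped as 3-3-4 with hyphens or spaces) is only an illustration, not a universal standard:

import re

text = "Call 555-010-1234 or 5550105678 for support."

# Assumed pattern: 10 digits, optionally split into 3-3-4 groups.
phone = re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{4}\b")
print(phone.findall(text))   # ['555-010-1234', '5550105678']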
Deterministic Finite Automata

Example:
Regular Languages

Let M be DFA.
Language accepted by M, L(M), is regular

A Language L is regular if and only if there exists some DFA such that L = L(M)

Regular languages are exactly the languages that can be accepted by DFA and represented by regular
expressions, making them formal languages with precise structural definitions.
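A minimal DFA simulator sketch in Python (my own example machine): it accepts binary strings containing an even number of 0s, one concrete regular language.

# DFA = (states, alphabet, transition function delta, start state, accepting states).
delta = {
    ("even", "0"): "odd",  ("even", "1"): "even",
    ("odd",  "0"): "even", ("odd",  "1"): "odd",
}
start, accepting = "even", {"even"}

def accepts(w: str) -> bool:
    state = start
    for symbol in w:
        state = delta[(state, symbol)]
    return state in accepting

print(accepts("1001"))   # True  (two 0s)
print(accepts("10"))     # False (one 0)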
In a right-linear grammar (RLG), the non-terminal appears at the right end of the right-hand side of each production.

Only grammars that are entirely left-linear or entirely right-linear are guaranteed to generate regular languages.

A grammar that mixes left-linear, right-linear and middle placements of the non-terminal need not generate a regular language.
Week - 3
The longest match principle means that the analyser keeps matching characters until the longest valid token is identified.

After that, any unrecognised element is reported as an error.

The next aspect that we want to consider in the design of a lexical analyser is how to handle the cases
that lead to errors. It is always good to backtrack to the most recent point where something valid has been
recognised, emit the token recognised up to that point, and then report that the rest of the string is
not expected. How this is handled depends on how the language is being designed, and as a designer it is
important that you know these things.
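A rough Python sketch of the longest-match idea; the token classes (NUMBER, ID, OP) and their patterns are assumptions chosen only for illustration:

import re

# Candidate token patterns; at each position keep the longest match.
token_specs = [("NUMBER", r"\d+"), ("ID", r"[A-Za-z_]\w*"), ("OP", r"[+*()=]"), ("SKIP", r"\s+")]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in token_specs:
            m = re.match(pattern, text[pos:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"Unexpected character at position {pos}: {text[pos]!r}")
        if best[0] != "SKIP":
            tokens.append(best)
        pos += len(best[1])
    return tokens

print(tokenize("count1 = 42 + 7"))
# [('ID', 'count1'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NUMBER', '7')]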
Week - 4

Indistinguishable States

States p and q of a DFA are indistinguishable if, for every input string, processing it from p and from q either leads to accepting states in both cases or to non-accepting states in both cases; indistinguishable states are equivalent.

Indistinguishability has the properties of an equivalence relation (reflexive, symmetric and transitive).
Mark-Reduce Method for Simplifying DFA

Two subsets in the set:


{ {q1,q2}, {q0,q3} }

Important
Draw the table to confirm that the states do belong to the derived partition.
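A hedged Python sketch of the marking (table-filling) idea; the small DFA below is my own example, not the machine from the lecture:

from itertools import combinations

# Example DFA over {0,1}; q0~q1 and q2~q3 turn out to be indistinguishable.
states = ["q0", "q1", "q2", "q3"]
accepting = {"q2", "q3"}
delta = {
    ("q0", "0"): "q1", ("q0", "1"): "q2",
    ("q1", "0"): "q0", ("q1", "1"): "q3",
    ("q2", "0"): "q2", ("q2", "1"): "q2",
    ("q3", "0"): "q3", ("q3", "1"): "q3",
}

# Step 1: mark every pair in which exactly one state is accepting.
marked = {frozenset(p) for p in combinations(states, 2)
          if (p[0] in accepting) != (p[1] in accepting)}

# Step 2: keep marking pairs that reach an already-marked pair on some symbol.
changed = True
while changed:
    changed = False
    for p, q in combinations(states, 2):
        if frozenset((p, q)) in marked:
            continue
        if any(frozenset((delta[(p, a)], delta[(q, a)])) in marked for a in "01"):
            marked.add(frozenset((p, q)))
            changed = True

equivalent = [set(p) for p in combinations(states, 2) if frozenset(p) not in marked]
print(equivalent)   # [{'q0', 'q1'}, {'q2', 'q3'}] (these pairs can be merged)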
Recognising languages that can be accepted by a DFA and those that cannot

The Myhill-Nerode Theorem


The Myhill-Nerode theorem is fundamental in DFA minimisation as it provides a systematic method to
identify and merge equivalent states, effectively reducing the number of states in a DFA. Additionally,
the theorem helps determine if a language is regular by showing whether a finite number of
distinguishable states exist for the language.

The Myhill-Nerode theorem states that a language is regular if and only if its indistinguishability relation has a finite number of equivalence classes.

Pumping Lemma

The Pumping Lemma states that if a language is regular, every sufficiently long string in it can be
"pumped" (a short substring repeated any number of times) while staying in the language; assuming this
and deriving a contradiction is how we show that particular languages are not regular.

Closure Under Set Operations (closure properties)

•Closure under Union (L1 ∪ L2):


◦If both L1 and L2 are regular languages, their union is also a
regular language.
◦Example: If L1 = {a,aa} and L2 = {b,bb}, then L1 ∪ L2 = {a,aa,b,bb} is regular.

•Closure under Intersection (L1 ∩ L2):


◦The intersection of two regular languages is also regular.
◦Example: If L1 = (a+b)* and L2 = a*, then L1 ∩ L2 = a*, the language of strings
consisting only of a's, which is regular.

•Closure under Concatenation (L1L2):


◦If L1 and L2 are regular, then their concatenation is also
regular.
◦Example: If L1={a} and L2={b}, then L1L2={ab}.

• Closure under Complementation (L1):


◦ The complement of a regular language is regular.
◦ Example: If L1 is the set of all strings of even length, its complement will contain all strings of odd
length, which is also regular.

• Closure under Star Operation (L1*):


◦ If L1 is regular, the star-closure of the language (zero or more repetitions) is also regular.
◦ Example: If L1 = {a}, then L1* = {ϵ,a,aa,aaa,… }.
Week - 5

1. Limitations of Regular Expressions and Finite Automata:


• Lack of Memory: Regular languages have no memory beyond finite states. Finite automata
cannot remember sequences of input, which might be necessary to identify certain patterns.
Regular expressions and finite automata are powerful for recognising simple patterns and
languages but fail with more complex structures.
• For instance, regular expressions cannot model the language L = {aⁿbⁿ : n ≥ 0}, where the
number of 'a's must match the number of 'b's, due to the counting limitation of finite
automata. Regular languages cannot handle counting problems, such as matching the
number of opening and closing parentheses.

2. Example of Unachievable Language Patterns:


• Language aⁿbⁿ: This language requires matching counts of 'a's followed by 'b's. Finite
automata cannot handle this pattern since they lack the memory to count and match
symbols over arbitrary lengths.
• Such languages are non-regular and require context-free grammars for representation.

3. Why Study These Limitations?


• Recognising the limitations of finite automata allows us to understand when to employ more
advanced models like CFGs, which can handle nested structures and dependencies.
• The limitations highlight the boundaries of regular languages and motivate the need for
context-free languages (CFLs) to solve more complex computational problems.

4. Need for Context-Free Grammars (CFGs):


• CFGs are a more powerful formalism that can represent languages with nested structures,
such as balanced parentheses or matched symbols in programming languages and natural
languages.

Regular languages cannot represent languages requiring recursion or nested dependencies, as these
require more computational power than finite automata can provide.

Key Concepts:

1. Grammar:
• A grammar is a formal system that defines the syntactic structure of a language.
• Formal Definition: A grammar G is defined as a 4-tuple G=(V,T,S,P), where:
◦ V: Set of non-terminal symbols.
◦ T: Set of terminal symbols (alphabet).
◦ S: Start symbol from which derivation begins.
◦ P: Set of production rules, with each rule in the form A → x, where A ∈ V and x is a
string from (V ∪ T)*.
2. Context-Free Grammar (CFG):
• A CFG is a specific type of grammar where each production rule has the form A → x, where
A is a single non-terminal, and x is any string of terminals and non-terminals.
• Context-free means that the left-hand side of a production consists of exactly one non-
terminal, with no dependency on surrounding symbols (context).
• CFGs are widely used in programming languages and mathematical expressions, as they
can represent recursive language structures.

Example:
Let us define a CFG for palindromic strings consisting only of the characters a and b.
• Non-terminals: V = {S} (Start Symbol)
• Terminals: T = {a, b}
• Production rules:
◦ S → aSa
◦ S → bSb
◦ S → a
◦ S → b
◦ S → λ
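To sanity-check the grammar, here is a small Python sketch (my own) that enumerates the strings derivable from S → aSa | bSb | a | b | λ up to a given length; every output is a palindrome over {a, b}:

# Enumerate strings generated by the palindrome grammar, up to length max_len.
def derive(max_len):
    strings = {"", "a", "b"}          # S -> lambda, S -> a, S -> b
    frontier = set(strings)
    while frontier:
        nxt = set()
        for s in frontier:
            for c in "ab":
                t = c + s + c          # S -> aSa and S -> bSb
                if len(t) <= max_len and t not in strings:
                    nxt.add(t)
        strings |= nxt
        frontier = nxt
    return sorted(strings, key=len)

print(derive(4))   # '', 'a', 'b', 'aa', 'bb', 'aaa', 'aba', 'bab', 'bbb', 'aaaa', ...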
1. Language of CFGs
• A language (CFL) is the set of all strings that can be derived from the start symbol of a CFG.
• Starting from the start symbol, repeatedly apply production rules until only terminal symbols of a
string remain. This is the Derivation Process for that string.
• Parse Trees are graphical representations of the derivation of a string, showing how the string is
generated step-by-step.

Key Concepts:

1. Context-Free Grammar Components:


• A CFG consists of terminal symbols, non-terminal symbols, production rules, and a start
symbol.
• These elements work together to define all possible strings in the language.

2. Example Language L = {aⁿbⁿ | n ≥ 1}:


• Production Rules: The CFG generates strings with equal numbers of 'a's followed by 'b's.
• Derivation Example: Starting from the start symbol S:

Step 1: Apply production S→ aSb.


Step 2: Continue applying production rules to reach strings like “aabb”.

This CFG generates strings with n number of a’s followed by the same number of b’s.
1. String Derivation:
• Each step in the derivation involves substituting a non-terminal with the corresponding rule
until a string of terminals is produced. We can generate the string aabb using the grammar
in this example.

We always start with the start symbol S


1. S
2. S → aSb (Applying Rule 1 - S → aSb )
3. aSb → a(ab)b (Applying Rule 2 - S → ab)
We get the final string aabb

Key Concepts:

1. Language L = {aⁿbᵐcⁿ | n, m ≥ 1}:


• Production Rules: This language contains strings with equal numbers of 'a's and 'c's, with
any number of 'b's in between.
• Rule Structure: The CFG allows for flexible generation of strings with different numbers of
'b's.
2. Derivation of Strings:
• Example Derivation: Starting from the start symbol S, the derivation of "aabcc" involves
applying rules that ensure equal numbers of 'a's and 'c's, while allowing variation in the
number of 'b's.
• Step-by-Step Substitution: The production rules are applied until the terminals form the
target string.

We always start with the start symbol S


• S
• S → aSc (Applying Rule 1 - S → aSc )
• aSc → a(aBc)c (Applying Rule 2)
• a(aBc)c → aa(b)cc (Applying Rule 4)
• We get the final string aabcc

1. Pattern Recognition:
• This CFG demonstrates the power of context-free languages to manage complex
dependencies within strings, such as matched pairs.
Key Concepts:

1. Language (L ={w ∈ {0,1}* | String w contains an equal number of 0’s and 1’s}:
• This CFG defines a language where every valid string has an equal number of 0s and 1s.
• Production Rules: The rules ensure that each derivation leads to strings meeting this
balance condition
◦ Rule 1 - S → 1S0S
◦ Rule 2 - S → 0S1S
◦ Rule 3 - S → λ

2. String Derivation Example:


• Target String: "0110"
• Step-by-Step Derivation: Starting from the start symbol S, each application of the
production rules produces a balanced structure until reaching "0110".
◦ S → 0S₁1S₂ (Applying Rule 2 - S → 0S1S)
◦ 0S₁1S₂ → 01S₂ (Applying Rule 3 on S₁)
◦ 01S₂ → 011S₃0S₄ (Applying Rule 1 on S₂)
◦ 011S₃0S₄ → 0110S₄ (Applying Rule 3 on S₃)
◦ 0110S₄ → 0110 (Applying Rule 3 on S₄)
◦ We get the final string 0110

3. Balancing Elements in CFGs:


• The CFG guarantees that every derivation results in balanced numbers of 0s and 1s,
demonstrating the CFG’s power in defining patterns that maintain internal symmetry.

Key Concepts:
1. Language L = {a²ⁿbᵐ | n, m ≥ 0}:
2. This language represents strings with an even number of 'a's followed by zero or more 'b's.
3. Production Rules: The CFG uses rules to ensure that every string has the required number of 'a's
and 'b's in the correct order.
• Rule 1 - S → AB
• Rule 2 - A → aaA
• Rule 3 - A → λ
• Rule 4 - B → Bb
• Rule 5 - B → λ

4. Left-Most Derivation:
• In a left-most derivation, the left-most variable is replaced at each step.
• Example Derivation of "aab": Starting from S, apply production rules by always
substituting the left-most non-terminal until reaching "aab".
• Deriving aab using Left Most Derivation
◦ S → AB (Applying Rule 1)
◦ AB → aaAB (Applying Rule 2)
◦ aaAB → aaB (Applying Rule 3)
◦ aaB → aaBb (Applying Rule 4)
◦ aaBb → aab (Applying Rule 5)

5. Right-Most Derivation:
• In a right-most derivation, the right-most variable is replaced at each step.
• Example Derivation of "aab": Begin with SSS and apply production rules, replacing the
right-most variable at each stage to arrive at the same string.
• Deriving aab using Left Most Derivation
• S → AB (Applying Rule 1)
• AB → ABb (Applying Rule 4)
• ABb → Ab (Applying Rule 5)
• Ab → aaAb (Applying Rule 2)
• aaAb → aab (Applying Rule 3)
Key Concepts:
1. Language L = {ab²ᵐ | m ≥ 1}:
• This language includes strings with a single 'a' followed by an even number of 'b's.
• Production Rules: The CFG ensures that every string adheres to the pattern, with one 'a'
followed by a multiple of two 'b's.
◦ S → aAB
◦ A → bBb
◦ B→A
◦ B→λ

2. Left-Most Derivation:
• Each derivation step replaces the left-most non-terminal, gradually producing the target
string.
• Example Derivation of "abbbb": Starting with SSS, the CFG applies left-most derivation
rules to arrive at the balanced structure.
◦ S → aAB (Applying Rule 1)
◦ aAB → abBbB (Applying Rule 2)
◦ abBbB → abAbB (Applying Rule 3)
◦ abAbB → abbBbbB (Applying Rule 2)
◦ abbBbbB → abbbbB (Applying Rule 4)
◦ abbbbB → abbbb(Applying Rule 4)

3. Right-Most Derivation:
• Each derivation step replaces the right-most non-terminal, producing the string in a di erent
order.
• Example Derivation of "abbbb": Starting with SSS, the right-most derivation method
follows a structured path to reach the nal string.
◦ S → aAB (Applying Rule 1)
◦ aAB → aA (Applying Rule 4)
◦ aA → abBb (Applying Rule 2)
◦ abBb → abAb (Applying Rule 3)
◦ abAb → abbBbb (Applying Rule 2)
◦ abbBbb → abbbb (Applying Rule 4)

Key Concepts:
1. Definition of Context-Free Languages (CFLs):
• A language L is context-free if there exists a CFG G such that L = L(G), where L(G) is the
language generated by the grammar G.
• CFLs are recognised by Pushdown Automata (PDAs), which use a stack to handle recursive
structures.

2. Examples of CFLs:
• Common examples of context-free languages include:
◦ Programming Language Syntax: Most programming languages, like Java, C, and
Python, are defined by CFGs.
◦ Arithmetic Expressions: CFGs can represent the structure of mathematical
expressions.
◦ Balanced Parentheses: Languages with balanced symbols, such as parentheses,
are CFLs.
◦ Palindromes: Languages consisting of strings that read the same forward and
backwards.

3. Applications of CFLs:
• Programming Language Design: CFGs define syntactic rules for compilers, ensuring valid
code structure in languages like Java and Python.
• Natural Language Processing (NLP): CFGs model the syntax of natural languages, helping
in parsing and understanding language structure.
• Formal Verification: CFLs ensure conformity to syntactic rules in verification tasks.
• XML/HTML Parsing: CFGs are used to parse structured documents like XML and HTML,
ensuring valid document structure.

Key Concepts:
1. Components of a Push-Down Automata (PDA):
• A PDA consists of the following elements:
◦ Q: A finite set of states
◦ Σ: The input alphabet (terminal symbols)
◦ Γ: The stack alphabet (set of symbols that can be pushed/popped onto/from the
stack)
◦ δ: The transition function (which defines the behaviour based on the current state,
input symbol, and stack symbol)
◦ q₀: The start state (an element of Q)
◦ Z: The start symbol of the stack
◦ qf: The set of accepting states

2. Transitions in a PDA:
• Each transition in a PDA is based on the current input symbol, current state, and top of the
stack.
• The transition may:
◦ Pop the top of the stack.
◦ Push new symbols onto the stack.

3. Formal Definition: The transition function δ takes three inputs:


• Current state (q)
• Current input symbol (a)
• Current stack symbol (X)

A transition δ(q, a, X) → (p, γ) is where:


• q is the current state
• a is the input symbol being processed
• X is the symbol at the top of the stack
• p is the next state
• γ represents the new stack contents (may include pushing or popping symbols)

PDAs and Context-Free Grammars (CFGs):


• Every context-free language can be recognised by some PDA, establishing an equivalence between
CFGs and PDAs.
• PDAs can recognise languages with patterns that finite automata cannot handle, such as balanced
parentheses and palindromic structures.

Push-Down Automata (PDA)

PDAs are a computational model used to recognise context-free languages (CFLs).

PDAs are like Non-Deterministic Finite Automata but have an extra component known as stack

The unbounded stack provides additional memory beyond the finite amount available in the control, which
allows them to recognise some non-regular languages.

PDAs and Context-Free Grammars

1. Every context-free language (CFL) can be recognised by some PDA


2. For every CFG, there is an equivalent PDA that can accept the same language, and vice versa. This
establishes the equivalence of CFGs and PDAs
3. Example languages that a PDA can recognise:
1. Balanced parentheses
2. Palindrome languages
3. L = {aⁿbⁿ | n ≥ 0}
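A hedged Python sketch of the PDA idea for L = {aⁿbⁿ | n ≥ 0}: push a stack symbol for every 'a' and pop one for every 'b'. This is a simplified deterministic simulation, not a full nondeterministic PDA:

def accepts_anbn(w: str) -> bool:
    stack = ["Z"]                       # Z marks the bottom of the stack
    i = 0
    # Phase 1: push an X for every leading 'a'.
    while i < len(w) and w[i] == "a":
        stack.append("X")
        i += 1
    # Phase 2: pop one X for every 'b'.
    while i < len(w) and w[i] == "b" and stack[-1] == "X":
        stack.pop()
        i += 1
    # Accept if the whole input is consumed and only Z remains.
    return i == len(w) and stack == ["Z"]

for s in ["", "ab", "aabb", "aab", "ba"]:
    print(repr(s), accepts_anbn(s))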
Key Concepts:
1. Membership Problem:
• Definition: The membership problem asks whether a specific string w belongs to a
language L defined by a context-free grammar (CFG).
• Problem Statement: For a given CFG G and string w, determine if w ∈ L(G).
• Relevance: Membership problems are essential in applications like:
◦ Syntax Analysis: Checking whether a program conforms to a programming
language grammar.
◦ Validation Tasks: Ensuring structured data (e.g., XML, JSON) adheres to specified
formats.
• Algorithmic Solution:
◦ Use a pushdown automaton (PDA) or parsing algorithm (e.g., CYK algorithm,
Earley's algorithm) to decide membership.
Example:
• Grammar G:
◦ S → aSb | ε
• String w = aabb:
◦ The parse tree confirms that w ∈ L(G), as it follows the rule S → aSb.
2. Equivalence Problem:
• Definition: The equivalence problem determines whether two context-free grammars G1
and G2 generate the same language, i.e., L(G1) = L(G2).

Formally:
is L(G1) = L(G2), i.e., ∀w, w ∈ L(G1) ⟺ w ∈ L(G2)?
• Challenges:
◦ Undecidability: The equivalence problem for CFGs is undecidable in general, meaning no
algorithm can solve all instances of this problem.
◦ Practical Heuristics: In some restricted cases, equivalence can be checked using heuristic
methods or approximations.
• Applications:
◦ Compiler Optimisation: Determining whether two different CFGs describe equivalent
subsets of a programming language.
◦ Grammar Simplification: Verifying that simplifying transformations (e.g., removing
redundant rules) preserve language equivalence.
◦ Formal Verification: Checking whether certain properties hold for systems (e.g., safety
properties, termination) by deciding whether behaviors of the system are part of a desired
language.
◦ Compiler Design: Membership and equivalence checks are crucial in lexical and syntax
analysis, ensuring programs are valid and equivalent optimizations are performed

Importance of These Problems:


• Membership Problem:
◦ Solvable for context-free languages using parsing algorithms.
◦ Critical in practical applications like validating program syntax or structured data.
• Equivalence Problem:
◦ Intractable for general CFGs due to undecidability.
◦ Requires approximate or domain-specific solutions in practice.
Week - 6

Membership Problems in CFGs

The membership problem asks if a given string w belongs to the language L generated by a CFG.

If w belongs to L, then w can be derived from the start symbol S of the CFG.

Derivation involves applying production rules to the start symbol to obtain a string.

Parsing describes finding a sequence of productions by which a string w ∈ L is derived.

Parsing is essentially the reverse process of derivation.

This membership problem is decidable for CFGs, meaning there exists an algorithm to verify membership
for any given string w.

Membership can be checked using parsing techniques like top-down or bottom-up parsing.
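As one concrete membership check, here is a small sketch of the CYK algorithm in Python (my own illustration). It assumes the grammar is already in Chomsky Normal Form; the example CNF grammar below generates {aⁿbⁿ : n ≥ 1}:

# CYK membership test for a grammar in Chomsky Normal Form.
# CNF grammar for {a^n b^n : n >= 1}:  S -> AB | AC,  C -> SB,  A -> a,  B -> b
binary_rules = {"S": [("A", "B"), ("A", "C")], "C": [("S", "B")]}
terminal_rules = {"A": "a", "B": "b"}

def cyk(w: str, start="S") -> bool:
    n = len(w)
    if n == 0:
        return False                    # this grammar has no S -> lambda rule
    # table[i][j] = non-terminals deriving the substring of length j+1 starting at i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][0] = {A for A, t in terminal_rules.items() if t == ch}
    for length in range(2, n + 1):             # substring length
        for i in range(n - length + 1):        # start position
            for split in range(1, length):     # split point
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for A, rules in binary_rules.items():
                    if any(B in left and C in right for B, C in rules):
                        table[i][length - 1].add(A)
    return start in table[0][n - 1]

for s in ["ab", "aabb", "aab", "abab"]:
    print(s, cyk(s))    # True, True, False, False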

Parse Trees

A Parse tree is a graphical representation of the derivation of a string from a CFG

The structure of a Parse tree involves:


The root of the tree, which is the start symbol S of the CFG.
The internal nodes represent the non-terminal symbols.
The leaf nodes represent the terminal symbols that form the string w.

Each level of the tree corresponds to an application of a production rule.

Ambiguity in CFGs

A CFG is ambiguous if a single string w can be derived in more than one way, i.e. using different sequences of production rules.

This also results in the string having multiple parse trees.

However, it is still possible to decide the membership problem.

Ambiguous CFGs

A grammar is ambiguous if there exists at least one string in the language generated by the grammar that
has more than one distinct parse tree (or derivation).

For example, S → aSb | SS | λ is ambiguous; for the string aabb, there are 2 distinct parse trees.
Need for Unambiguous CFGs

Ambiguity can lead to confusion in understanding or processing the structure of a language.

Different derivations can imply different meanings for the same string.

Ambiguity can also lead to structural problems when designing compilers and parsers for the language.

Unambiguous grammars are critical in many applications where the structure of a string must be
interpreted in a consistent way.

Unambiguous CFG

A grammar is unambiguous if there exists exactly one derivation for each string in the language described
by that grammar.

For example, the grammar S → aS | a is unambiguous.

Unambiguous grammars have a clear and consistent interpretation of the language structure, and are
easier to parse.

Simplifying CFGs

Simplification of CFGs refers to the process of reducing a context-free grammar to an equivalent, but simpler, form.

We should remove unnecessary rules and symbols while maintaining the grammar's ability to generate the same language.

It works as a method to optimise the grammar and ensure parsing efficiency.

Simplifying CFGs Lemma 1, Lemma 2, Lemma 3, Lemma 4


What is Normal Form in CFG

Normal Form refers to specific structured formats of a context-free grammar (CFG) where the productions
follow a standardised pattern.

A CFG is said to be in Normal Form when every production rule of the grammar has some specific form/
restrictions.

The Chomsky Normal Form (CNF) is one of the simplest and most useful normal forms for a Context-Free
Grammar.

Chomsky Normal Form

A context-free grammar is in CNF if every rule is of the form:


1. A -> BC
2. A -> a

Where a is a terminal symbol, while A, B and C are arbitrary non-terminals. B and C may not be the start
variable of the grammar.

We may also permit the rule S -> λ, in case our language derives the empty string.

CNF is particularly valuable because it standardizes the grammar structure, simplifying parsing and
derivations.

Why Use Chomsky Normal Form?

In CNF, every string derivation with n characters requires exactly 2n−1 steps. This regularity allows for
straightforward parsing and derivation checks, enabling easier validation of a string’s membership in a
language.

Every CFG has an equivalent form in CNF, making it a universal tool in grammar normalisation.

Normalization of CFGs simplifies the structure of a grammar, making it easier to analyse and process.

More efficient parsing algorithms can be designed for grammars in a normal form.

Normal forms help in proving properties of languages, such as membership in the context- free language
class.

Advantages of Using Normal Forms:


• Simplified Parsing and Analysis: Normal Forms reduce CFG complexity, facilitating easier
analysis and efficient parsing algorithm design.
• Applications in PDAs and Compilers: Many computational applications, such as
pushdown automata (PDAs) and compilers, require CFGs in standardised formats for
processing and error checking.
• Proof of Language Properties: Normal Forms help in proving properties like a language’s
membership in the context-free class.

Limitations of Normal Forms:


• Conversion to Normal Form may lead to a large increase in production rules, potentially
making the grammar more complex.
• Multiple algorithms exist for converting CFGs to Normal Form, so the resulting Normal
Forms are not always unique.
• The process of conversion can add to the structural complexity of the CFG, though it
simplifies parsing.
Key Concepts:

1. Purpose of the Pumping Lemma for CFLs:

• The Pumping Lemma for CFLs provides a necessary condition that must be satisfied for any
string within a context-free language. Unlike the pumping lemma for regular languages, the
lemma for CFLs addresses the recursive and nested nature of context-free structures.
• This lemma is particularly useful in demonstrating the limitations of CFLs by proving that
certain languages cannot be generated by a context-free grammar.

2. Statement of the Pumping Lemma for CFLs:


• For any CFL L (generated by a grammar in Chomsky Normal Form), there exists an integer p > 0 (the pumping
length) such that any string s ∈ L of length ≥ p can be partitioned as s = uvwxy,
such that
◦ |vwx| ≤ p
◦ |vx| ≥ 1
◦ ∀ i ≥ 0, the string uvⁱwxⁱy ∈ L.

Key Concepts Example:

1. Applying the Pumping Lemma Assumption:


• Consider the language L = {ww : w ∈ {0, 1}*}. (This language consists of strings that are made
of two consecutive copies of the same binary string. We will use the Pumping Lemma for
CFLs to prove that L is not context-free.)
• Assume L is context-free and select a string s = 0ᵖ1ᵖ0ᵖ1ᵖ, which fits the form ww, with each
half identical.

2. Partitioning the String:


• Per the Pumping Lemma, s can be divided into uvwxy such that |vwx| ≤ p and |vx| ≥ 1.
• This restriction implies that vwx can occur within the first block of 0's, within the 0's and 1's of the
first half, or it might straddle the midpoint of s.

3. Pumping Cases:
• Case 1: vwx consists of only 0’s - Pumping the 0's would either increase or decrease their
count. When we decrease the count to nil, the string s will violate the structure of the
language L.
• Case 2: vwx consists of 0’s and 1’s - Pumping them would either increase or decrease their
count. When we increase the count of them both, the second half will not mirror these
changes, thus the string s will violate the structure of the language L.
• Case 3: vwx straddles the middle - Pumping 0’s and 1’s in the middle would either increase
or decrease their count. When we decrease the count to nil the string s will violate the
structure of the language L.
Conclusion: Since pumping in all cases results in a string that does not have two identical halves, L
cannot be a context-free language.

What are Closure Properties?

A language class is closed under an operation if applying that operation to languages within the class results in another language within the same class.

Closure properties help define the stability of CFLs under operations, and help us prove whether or not certain languages belong to the class of context-free languages.

CFLs are closed under the following operations:

1. Union (L1 ∪ L2)
2. Concatenation (L1L2)
3. Kleene Star (L1*)

Closure Under Union

Closure Under Concatenation

Closure Under Kleene Star


Non-Closure Under Intersection

Non-Closure Under Complementation

1. Example 1 – Non-CFL Language Using Intersection:

• Consider the language L = {w : na(w) = nb(w) = nc(w)}, which includes strings with equal
numbers of a’s, b’s, and c’s.
• Non-CFL Proof Using Intersection: Suppose L is context-free. Then L ∩ L(a*b*c*) = {aⁿbⁿcⁿ : n
≥ 0} would also be context-free, but we know this is not true (proved before). Therefore, L is
not context-free.

1. Example 2 – CFL Language Using Intersection with Regular Language:

• Consider L = {aⁿbⁿ : n ≥ 0, n ≠ 100}, which can be shown to be context-free.


• Proof: Let L1 = {a¹⁰⁰b¹⁰⁰}, a finite (and thus regular) language. Since L = {aⁿbⁿ : n ≥ 0} ∩ (the
complement of L1), the closure of regular languages under complementation and of CFLs under
intersection with regular languages proves that L is a CFL.

1. Decidable Properties in CFLs:

• A property is decidable if there exists an algorithm that provides a yes/no answer in finite
time. In the context of CFLs, two critical decidable properties are:
◦ Is L(G) empty?
◦ Is L(G) infinite?

2. Deciding If L(G) is Empty:

• A CFG G generates an empty language if no string can be derived from its start symbol.

• Algorithm: To check if L(G)=∅

◦ Remove all non-generating symbols (those that cannot produce terminal strings) and
non-reachable symbols.
◦ After simplifying the grammar, verify if the start symbol S can still derive any
terminal string.
◦ Example: For a grammar G with the production rules
▪ S → aSb | λ
▪ A → aA | b

• Step 1: Identify non-generating or non-reachable symbols. A is not reachable, so remove it.

• Step 2: Check if S can generate a string. After removing A, S can still derive the empty string
λ, which implies that L(G) ≠ ∅.

3. Deciding If L(G) is Infinite:

• A CFG generates an infinite language if it can produce infinitely many distinct strings.

• Algorithm: To determine if L(G) is infinite:


◦ Convert the CFG into Chomsky Normal Form (CNF) if needed.
◦ If there is a cycle in the derivation (i.e., a non-terminal can derive a string that
includes itself), L(G) is infinite.
◦ If a non-terminal can generate arbitrarily long strings through recursion or repetition,
L(G) is infinite.

• Example: For a grammar G with rule S → aSb | λ:


◦ Since S appears on both sides of the rule S → aSb, it can generate strings of
increasing length (e.g., "ab", "aabb", "aaabbb"), indicating that L(G) is infinite.

These decidable properties—whether a CFG generates an empty or infinite language—have practical


implications in language design, parsing, and formal language analysis, ensuring CFGs are efficient and
useful for intended applications.
Week - 7

What is Parsing?

Parsing is the process of analysing a string of symbols, either in natural or programming languages,
according to the rules of a formal grammar.

It checks if the input string belongs to the language and provides structure to the input.

Used in compilers, interpreters, natural language processing, and data processing systems.

Types of Parsing

Top Down Parsing


Works by expanding non-terminals into terminals based on grammar rules.

Key Method: Recursive Descent Parsing.

A set of recursive procedures that process the input string.

Backtracking can be used, but it can be inefficient.

Bottom Up Parsing

Attempts to reduce the input string into the start symbol by reversing the production rules.

Key Method: Shift-Reduce Parsing.

"Shift" moves input to a stack


"Reduce" replaces a series of symbols in the stack with a non- terminal according to a rule.

LL vs LR Parsers

LL Parsers (Left-to-right, Leftmost derivation):


• Top-down
• Easier to implement
• Handles a subset of context-free grammars

LR Parsers (Left-to-right, Rightmost derivation):


• Bottom-up
• More powerful, handles a larger class of grammars
• Example: LR(0), SLR, LALR, LR(1)

Applications of Parsing

Compilers/Interpreters: Source code parsing for syntax checking and translation.

Natural Language Processing: Parsing sentences in human languages.

Data Validation: Parsing structured data (e.g., XML, JSON).

Recursive Descent Parsing

Recursive descent parsing is a top-down parsing technique where each non-terminal symbol in the
grammar corresponds to a recursive function in the parser. The parser works as follows:

Each non-terminal in the grammar is implemented as a separate recursive function

These functions call each other to match production rules against the input tokens
The parser consumes input tokens one by one and applies the corresponding grammar production
until all input tokens are processed.

Example Walkthrough

The example grammar includes rules:


▪ S→ AB
▪ A→ aA∣ϵ
▪ B→ bB∣ϵ

Parsing Flow Overview

Parsing Step-By-Step

Input String: "aab"


1. Start with S: Call the rule for S:
1. S->AB

2. Parse A:
Input starts with a: match a and recursively parse A again
Input is a: match again and recursively call A
Input is b: does not match a, stop parsing A (using ϵ)

3. Parse B:
Input is b: match b, recursively parse B
No more input: stop parsing B (using ϵ)

Key Features and Limitations:

◦ Features: Recursive descent parsing directly follows the grammar structure, making it easy
to implement for small, simple grammars.
◦ Limitations: Cannot handle left-recursive or highly ambiguous grammars.
Code Structure

Variables and String Handling

Non-Terminal Functions
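The code from the slides is not reproduced in these notes. The following is a minimal sketch (my own) of what the non-terminal functions might look like in Python for the grammar S → AB, A → aA | ϵ, B → bB | ϵ:

# Recursive descent parser for: S -> A B,  A -> a A | epsilon,  B -> b B | epsilon
def parse(s: str) -> bool:
    pos = 0

    def A():
        nonlocal pos
        if pos < len(s) and s[pos] == "a":   # A -> a A
            pos += 1
            A()
        # otherwise A -> epsilon: consume nothing

    def B():
        nonlocal pos
        if pos < len(s) and s[pos] == "b":   # B -> b B
            pos += 1
            B()
        # otherwise B -> epsilon

    A(); B()                     # S -> A B
    return pos == len(s)         # accept only if the whole input was consumed

for w in ["aab", "ab", "b", "aba"]:
    print(w, parse(w))           # True, True, True, False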
Issues in the Expression Grammar Example

Key Concepts:

• Operator Precedence Ambiguity:


◦ The example grammar E -> E + E
▪ |E*E
▪ | (E)
▪ | NUMBER

Does not enforce operator precedence, leading to ambiguity in parsing expressions like "3 + 4 * 5.”

Example: This expression can be parsed as either "3 + (4 * 5)" resulting in 23, or "(3 + 4) * 5" resulting in
35, with both parse trees being valid.

• Left Recursion Issue:


◦ The grammar contains left recursion in rules like E → E+E and E → E∗E, causing recursive
descent parsers to fall into infinite recursion when attempting to parse expressions.

• Lack of Associativity Definition:


◦ Without associativity rules, expressions like "3 + 4 + 5" can be grouped as either "(3 + 4) +
5" or "3 + (4 + 5)," leading to ambiguity even when the results may sometimes be the same.
◦ Associativity is essential for handling expressions consistently in recursive descent parsers.

These issues—multiple parse trees, undefined operator order, and left recursion—complicate parsing in
expression grammars and highlight the need for grammar modifications to resolve ambiguity.

Left Recursion

Left recursion occurs when a non-terminal in a grammar has a production rule where the non-terminal
appears as the leftmost symbol of its own definition. This causes the parser to continuously try expanding
the non-terminal indefinitely, leading to infinite recursion.

Example of Left Recursion:

In the rule: E -> E + T | T

The non-terminal E is on the left- hand side of its own production, making this left-recursive

Problem with Left Recursion:

For recursive descent parsers, this causes a problem since the parser may get stuck trying to expand
the left-recursive rule infinitely, leading to non-termination.

Steps to Remove Left Recursion

Identify left-recursive productions:


• Look for rules of the form A → Aα, where α is a string of terminals and/or non-terminals

Introduce a new non-terminal:


• Define a new non-terminal A' to remove the recursion

Rewrite the grammar:


• Replace the left-recursive rules with two new rules
A → βA' (start with the non-recursive alternatives β, followed by the new non-terminal)
A' → αA' | ε (this handles the recursive part without left recursion, using the new non-
terminal)
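As a worked example (a standard transformation, shown here for concreteness), applying these steps to the left-recursive rule from above:

E → E + T | T

becomes

E → T E'
E' → + T E' | ε

Here T plays the role of the non-recursive part β, "+ T" plays the role of α, and the new non-terminal E' absorbs the repetition without ever appearing first in its own production.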
Left Factoring

Left factoring is a grammar transformation technique used to eliminate ambiguity when two or more
productions for a non-terminal begin with the same prefix. In such cases, the parser might struggle to
decide which production to apply. By factoring out the common prefixes, the grammar becomes more
suitable for parsers, particularly top-down parsers like recursive descent parsers.

Steps to Remove Left Factoring

• For productions A → αβ₁ | αβ₂ | γ that share a common prefix α, factor the prefix out: rewrite them as A → αA' | γ and A' → β₁ | β₂, where A' is a new non-terminal.
• Example of Left Factoring:

◦ For the grammar E′→+TE′∣∗TE′∣ϵ:


▪ Both "+" and "*" can be factored as operators followed by TE′.
▪ The factored grammar becomes:
• E′→YE′∣ϵ
• Y→+T∣∗T

◦ The left-factored grammar is now unambiguous and better suited for recursive descent
parsing.

By applying left factoring, common prefixes are removed, resulting in a clearer, more efficient grammar for
top-down parsing techniques like recursive descent parsers.

Useless Symbols

• Recursive Descent Parsing Strategy:

◦ Function for Each Non-Terminal: Each non-terminal is handled by a specific recursive


function in the parser, which attempts to match tokens in the input string based on grammar
rules.
◦ Example Functions: The parser uses functions like E(),E′(),Y(), and T() to handle the non-
terminals of the grammar.

• Parsing Strategy for Each Non-Terminal:

◦ E Parsing:
▪ Rule: E → T E'
▪ Explanation: Begin by parsing T, which may represent a term or a parenthesised
expression. Then, parse E′ to check if there are further operators (e.g., "+" or "*").
◦ E' Parsing:
▪ Rule: E' -> Y E' | ε
▪ Explanation: Parse E′ if there is an operator, recursively handling sequences of
terms. If no operator follows, E′ is ϵ (empty), ending the expression.
◦ Y Parsing:
▪ Rule: Y→+T∣∗ T
▪ Explanation: Match the "+" or "*" operator followed by another term T.
◦ T Parsing:
▪ Rule: T→(E)∣ NUMBER
▪ Explanation: Parse either a number or a full expression enclosed in parentheses.
• Parsing Example:

◦ For the expression "(3 + 4) * 5":


▪ Begin with parsing T
▪ The next token is (, so:
▪ Match the opening (.
▪ Recursively call E to parse inside the parentheses.
▪ Parse T
▪ Now parse E′ to see if the expression continues.
▪ The next token is +, so:
1. Parse Y (match + and parse T)
2. Parse E′ again (but now it's ϵ as there's no more continuation).
▪ Match the closing ).
▪ After parsing the parenthesised term (3 + 4), we now parse E′.
▪ The next token is *, so:
▪ Parse Y (match * and parse T for the number 5).
▪ Parse E′ again (it's ϵ, as there's no more continuation).

Drawbacks of Recursive Parsing

Left Recursion

Problem: Recursive descent parsers cannot handle left-recursive grammars, where a non-terminal refers
to itself as the first symbol in one of its productions (e.g., A → Aα).

Impact: If left recursion is present, the parser will enter an infinite recursive loop, causing the program to
crash.

Solution: The grammar must be rewritten to eliminate left recursion, which can be cumbersome and may
not always be straightforward.

Ine ciency with Ambiguous Grammars

Problem:
Recursive descent parsers struggle with ambiguous grammars, where multiple parse trees are possible for
a given input. Ambiguities may cause the parser to backtrack extensively or fail altogether

Impact:
This can result in exponential time complexity due to backtracking, leading to very inefficient parsing.

Solution:
Ambiguous grammars need to be carefully refactored or avoided, which might not be possible in all
situations

Backtracking Overhead

Problem:
Recursive descent parsers often rely on backtracking when they encounter choices in the grammar. For
example, if multiple productions are possible, the parser may try one production and backtrack if it fails,
trying the next production

Impact:
Excessive backtracking can make parsing inefficient, especially for grammars that have many possible
paths. This can lead to significant performance issues.

Solution:
Lookahead techniques like LL(k) parsing, where the parser looks ahead at multiple tokens to decide which
production to apply, can reduce backtracking but add complexity to the parser
Limited to LL(1) Grammars

Problem: Recursive descent parsers are typically designed to work with LL(1) grammars, meaning they
require only one token of lookahead to make parsing decisions

Impact: Many useful programming languages and grammars are not LL(1), meaning they require more than
one token of lookahead or have more complex rules

Solution: The grammar may need to be modified (left-factoring, removing ambiguities), or a different
parsing strategy (e.g., LR parsers or LL(k) parsers) may be needed.

Error Handling

Problem: Error handling is often poor in recursive descent parsers. When a syntax error is encountered, it
is not always clear how to recover gracefully and continue parsing

Impact: The parser may stop prematurely after encountering an error or provide unclear error messages,
making debugging and diagnostics difficult for large inputs.

Solution: Implementing robust error recovery mechanisms (like panic mode or phrase-level recovery) is
challenging in recursive descent parsers and often requires complex logic.
Week - 8

1. Definition of LL(1) Parsing:

• LL(1) Parsing is a deterministic, top-down parsing technique that reads the input from Left
to right and constructs the Leftmost derivation of the input string using 1 lookahead symbol
to predict which production to apply.
• LL(1) parsers are known as predictive parsers because they can decide the next action
based on the lookahead symbol without backtracking.

2. How LL(1) Parsing Works:

• Context-Free Grammar (CFG): LL(1) parsers rely on CFGs that are compatible with LL(1)
rules.
• Parsing Table: A predictive parsing table determines which production rule to apply based
on the input symbol and the top of the stack.
• Stack and Input Processing: The parser uses a stack to keep track of non-terminals and
terminals during the parsing process. The current input symbol and top of the stack guide
the choice of production.

3. Limitations of LL(1) Parsing:

• Cannot Handle Left-Recursive Grammars: LL(1) parsers enter infinite loops if the
grammar contains left-recursive rules, as these rules make it impossible to determine the
next step without backtracking.
• Limited to One Lookahead Symbol: Since LL(1) parsers use only one lookahead symbol,
they cannot handle grammars that require more context.
• Not Suitable for Ambiguous Grammars: LL(1) parsers work only with unambiguous
grammars, as they need a deterministic approach without multiple derivations.

LL(1) parsing is an essential parsing method for simple, deterministic grammars, often used in educational
and compiler design contexts to illustrate predictive parsing principles.

Introducing (Predictive) Parsing Table and Using it to Parse


1. Using the Predictive Parsing Table to Parse:

• Initialise the Stack: Start with the stack containing the start symbol and the end-of-input
symbol ($).
• Lookahead Symbol: Examine the first input symbol, which is used as the lookahead.
• Top of Stack Check: If the top of the stack is a non-terminal, use the parsing table to find
the appropriate rule based on the lookahead symbol.
• Apply Rule or Match: If the top of the stack matches the lookahead symbol, pop both from
the stack and move forward in the input. Otherwise, replace the non-terminal with the right-
hand side of the production.

2. Example of Parsing with a Predictive Table: Using a parsing table, the parser applies each rule
based on the current top of the stack and the lookahead symbol, expanding non-terminals and
matching terminals with input symbols.

Construction of Nullable, First and Follow Sets

1. Nullable Set:

• Definition: A non-terminal X is nullable if it can derive the empty string ϵ.


• Purpose: Nullable sets help determine if other productions need to be considered when a
symbol can derive ϵ.
• Algorithm:
• If X → ϵ, mark X as nullable.
• If X → Y1Y2…Yn and all Yi are nullable, then mark X as nullable.

2. First Set:

• Definition: The First set of a symbol X (which could be a terminal, non-terminal, or a


sequence of symbols) includes the terminals that can appear as the first symbol in any string
derived from X.
• Rules for Calculating the First Set:
• Terminals: If X is a terminal, then FIRST(X) = {X}
• Non-terminals: If X is a non-terminal and X → Y1Y2…Yn, then:
• Add FIRST(Y₁) (excluding ϵ) to FIRST(X).
• If Y₁ can derive ϵ, check Y₂, and so on.
• Epsilon (Empty String): If X can derive ϵ (directly or indirectly), include ϵ in FIRST(X).

3. Algorithm for First Set:

• For a terminal a: FIRST(a)={a}.


• For a non-terminal X with X → Y1Y2…Yn, add FIRST(Y₁) to FIRST(X).
• If Y₁ is nullable, add FIRST(Y₂), and so on.
• If X → ϵ, add ϵ to FIRST(X).

These sets are foundational for predictive parsing, helping to ensure that parsers operate efficiently
without backtracking.

1. Follow Set:

• Definition: The Follow set of a non-terminal A is the set of terminals that can appear
immediately after A in any valid string derived from the grammar.
• Rules for Calculating the Follow Set:
• Start Symbol: Always include the end-of-input symbol ($) in FOLLOW of the start
symbol.
• Production Rule A → αBβ: If there is a production A → αBβ, then add FIRST(β)
(excluding ϵ) to FOLLOW(B). If β can derive ϵ, add FOLLOW(A) to FOLLOW(B).
• Production Rule A → αB: If there is a production where B is at the end, add
FOLLOW(A) to FOLLOW(B).
2. Algorithm for Follow Set:

• Start by adding $ to the Follow set of the start symbol.


• For each production A→αBβ:
• Add FIRST(β) to FOLLOW(B) (excluding ϵ).
• If β is nullable, add FOLLOW(A) to FOLLOW(B).
• For productions where B is at the end, directly add FOLLOW(A) to FOLLOW(B).

It contains symbols that can appear after the non-terminal in some derivation.

These sets (Nullable, First, and Follow) play a fundamental role in constructing predictive parsers, enabling
LL(1) parsers to function efficiently and without backtracking.
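A hedged Python sketch of computing these sets with a simple fixpoint loop (my own code, not the course's); the grammar encoded below is the one used in the first example that follows (S → ABC, A → a | ϵ, B → b | ϵ, C → cD | ϵ, D → d | ϵ):

productions = {
    "S": [["A", "B", "C"]],
    "A": [["a"], []],
    "B": [["b"], []],
    "C": [["c", "D"], []],
    "D": [["d"], []],
}
nonterminals = set(productions)

# Nullable: X is nullable if some production body consists only of nullable symbols.
nullable = set()
changed = True
while changed:
    changed = False
    for X, bodies in productions.items():
        if X not in nullable and any(all(s in nullable for s in body) for body in bodies):
            nullable.add(X)
            changed = True

def first_of_seq(seq, first):
    # FIRST of a sequence, using the current FIRST sets (epsilon tracked via `nullable`).
    out = set()
    for s in seq:
        out |= first[s] if s in nonterminals else {s}
        if s not in nullable:
            break
    return out

first = {X: set() for X in nonterminals}
changed = True
while changed:
    changed = False
    for X, bodies in productions.items():
        for body in bodies:
            new = first_of_seq(body, first) - first[X]
            if new:
                first[X] |= new
                changed = True

follow = {X: set() for X in nonterminals}
follow["S"].add("$")                       # end-of-input follows the start symbol
changed = True
while changed:
    changed = False
    for X, bodies in productions.items():
        for body in bodies:
            for i, s in enumerate(body):
                if s in nonterminals:
                    rest = body[i + 1:]
                    new = first_of_seq(rest, first)
                    if all(r in nullable for r in rest):    # rest can vanish
                        new |= follow[X]
                    new -= follow[s]
                    if new:
                        follow[s] |= new
                        changed = True

print("Nullable:", sorted(nullable))
print("FIRST:  ", {X: sorted(v) for X, v in first.items()})
print("FOLLOW: ", {X: sorted(v) for X, v in follow.items()})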

1. Grammar Used for Example:

• Productions:
• S→ABC
• A→a∣ϵ
• B→b∣ϵ
• C→cD∣ϵ
• D→d∣ϵ

2. Nullable Set Construction:

• Rules Applied:
• A→ϵ: A is nullable because it has an ϵ production.
• B→ϵ: B is nullable because it has an ϵ production.
• C→cD∣ϵ C is nullable because it has an ϵ production.
• D→d∣ϵ: D is nullable because it has an ϵ production.
• S→ABC: Since A, B, and C are nullable, S is also nullable.
• Final Nullable Set: Nullable(S)={ϵ}, Nullable(A)={ϵ}, Nullable(B)={ϵ}, Nullable(C)={ϵ},
Nullable(D)={ϵ}.

3. First Set Construction:

• Rules Applied:
• A→a∣ϵ: FIRST(A) = {a} because of the production A→a. Since A is nullable, add ϵ.
• B→b∣ϵ: FIRST(B) = {b} because of the production B→b. Since B is nullable, add ϵ.
• C→cD∣ϵ: FIRST(C) = {c} because C→cD. Since C is nullable, add ϵ.
• D→d∣ϵ: FIRST(D) = {d} because of the production D→d. Since D is nullable, add ϵ.
• S→ABC: FIRST(S) depends on A, B, and C. Since A, B, and C are all nullable,
FIRST(S) = FIRST(A) ∪ FIRST(B) ∪FIRST(C).FIRST(S) = {a,b,c,ϵ}.
• Final FIRST Set: FIRST(S)={a,b,c,ϵ}, FIRST(A)={a,ϵ}, FIRST(B)={b,ϵ}, FIRST(C)={c,ϵ},
FIRST(D)={d,ϵ}.

• Follow Set Construction:

◦ Rules Applied:
◦ Add $ to FOLLOW(S) since S is the start symbol.
◦ For S → ABC, add FIRST(B) to FOLLOW(A) and FIRST(C) to FOLLOW(B).
◦ Since C is nullable, FOLLOW(S) is added to FOLLOW(C), resulting in FOLLOW(C) =
{ $ }.

◦ Final Follow Set: FOLLOW(S)={$}, FOLLOW(A)={b,c,$}, FOLLOW(B)={c,$}, FOLLOW(C)={$},


FOLLOW(D)={$}.
1. Grammar Used for Example:

• Productions:
◦ S→AB
◦ A→aA∣b
◦ B→cBd∣ϵ

2. Nullable Set Construction:

• Rules Applied:
◦ A is not nullable since neither production leads to ϵ.
◦ B is nullable due to B→ϵ.
◦ Since A is not nullable, S is also not nullable.
• Final Nullable Set:
◦ Nullable(S)={},Nullable(A)={},Nullable(B)={ϵ}

3. First Set Construction:

• Rules Applied:
◦ A→aA∣b: FIRST(A) = {a,b} because of the two productions aA and b.
◦ B→cBd∣ϵ: FIRST(B) = {c,ϵ} because of the two productions cBd and ϵ.
◦ S→AB: FIRST(S) depends on FIRST(A). Since A is not nullable, FIRST(S) = FIRST(A) = {a,b}.
◦ Final FIRST Set: FIRST(S)={a,b}, FIRST(A)={a,b}, FIRST(B)={c,ϵ}

4. Follow Set Construction:

• Rules Applied:
◦ Add $ to FOLLOW(S) as the end-of-input marker.
◦ For S → AB, add FIRST(B) to FOLLOW(A). Since B is nullable, add FOLLOW(S) to
FOLLOW(A), resulting in { c, $ } for FOLLOW(A).
◦ For B → cBd, add d to FOLLOW(B).
• Final Follow Set:
◦ FOLLOW(S)={$}, FOLLOW(A)={c,$}, FOLLOW(B)={d,$}

These sets allow for the construction of predictive parsing tables necessary for LL(1) parsing, providing the
basis for determining which productions to apply based on lookahead symbols.

LL(1) Table Construction

1. Algorithm Overview:

• Given Grammar: Start with a context-free grammar G=(N,T,P,S):


◦ N: Set of non-terminals
◦ T: Set of terminals
◦ P: Set of productions
◦ S: Start symbol
• The goal is to construct a table with rows for non-terminals and columns for terminals and $
(end-of-input marker).

2. Steps to Construct the LL(1) Parsing Table:

• Step 1: Compute First and Follow Sets:


◦ Calculate the Follow set for each non-terminal to identify symbols that can follow it in
derivations.
◦ Calculate the First set for all non-terminals to identify the initial terminals in
derivations.
• Step 2: Create an Empty Table: The table is structured with rows for non-terminals and
columns for terminals and $.
• Step 3: Fill the Table: For each production A→α in the grammar, follow these rules:
◦ For each terminal a∈FIRST(α): Add the production A→α to the table entry [A,a].
This means that if the current input symbol is a, and the top of the stack is A, we
apply the production A→α.
◦ If ϵ∈FIRST(α): For each terminal b∈FOLLOW(A), add A→α to the table entry [A,b]. If
$∈FOLLOW(A), add the production A→α to [A,$]. This allows the parser to accept ϵ
at the end of the input.
• Step 4: Handle Conflicts: If any table entry [A, a] contains more than one production, the
grammar is not LL(1).

3. Example Table Construction:

• Grammar:
◦ S→AB
◦ A→a∣ϵ
◦ B→b∣ϵ
• First and Follow Sets:
◦ FIRST(A) = { a, ϵ }
◦ FIRST(B) = { b, ϵ}
◦ FIRST(S) = { a, b, ϵ}
◦ FOLLOW(S) = { $ } (since S is the start symbol)
◦ FOLLOW(A) = { b, $ } (because B follows A in S→AB)
◦ FOLLOW(B) = { $ } (because B is at the end of the production)

• Filled Table:
◦ For S→AB: FIRST(AB) contains a and b, so we place S→AB in both [S,a] and
[S,b]. Since both A and B are nullable (so AB can derive ϵ), also place S→AB in [S,$] (from FOLLOW(S)).
◦ For A→a: FIRST(a) contains a, so place A→a in [A,a].
◦ For A→ϵ: Since ϵ ∈ FIRST(A), place A→ϵ in [A,b] and [A,$] (from FOLLOW(A)).
◦ For B→b: FIRST(b) contains b, so place B→b in [B,b].
◦ For B→ϵ: Since ϵ ∈ FIRST(B), place B→ϵ in [B,$] (from FOLLOW(B)).
Parsing a String Using Table

Example Parsing:

• Grammar: S→ AB, A→a∣ϵ, B→b∣ϵ.


• Input: “ab$"

Step-by-Step Parsing

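The step-by-step table itself is not reproduced in these notes; a plausible reconstruction of the trace for input "ab$" with this grammar and the table above is (stack written with its top on the left):

Stack: S $      Input: ab$    Action: table[S, a] = S → AB, expand S
Stack: A B $    Input: ab$    Action: table[A, a] = A → a, expand A
Stack: a B $    Input: ab$    Action: match a, advance input
Stack: B $      Input: b$     Action: table[B, b] = B → b, expand B
Stack: b $      Input: b$     Action: match b, advance input
Stack: $        Input: $      Action: accept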
LL(2) Parsers

• LL(2) parsing extends LL(1) parsing by allowing the parser to look ahead 2 input symbols instead of just
1. This gives the parser more context to decide.

Lookahead of 2 Symbols:

• In LL(2), the parser examines the current input symbol and the one that follows it, providing better insight
into which production to apply.

• This helps resolve ambiguities that LL(1) cannot handle, especially when two or more rules have
overlapping FIRST sets but differ on the second symbol.

Conflict Resolution:

By looking two symbols ahead, LL(2) parsers can differentiate between rules that share common prefixes
but differ afterward. This reduces FIRST set conflicts.

LL(k)

Extends LL(2) to more symbols.

Increased Complexity:

• Although LL(k) parsers can handle more complex grammars, they are also more computationally
expensive. The parser's decision-making process becomes more complicated as it has to consider
multiple symbols simultaneously.

• The size of the parsing table grows exponentially with k, making the construction and storage of the
table more challenging for larger values of k.
• S -> a b X
• S -> a b Y
• S -> a c Z
• One lookahead symbol (a) cannot distinguish these productions; at least 2 lookahead symbols are needed to separate the third production (ab vs. ac), and distinguishing the first two requires looking even further, into what X and Y can derive.

Comparison
Week - 9

Weakness 1 - Limited Lookahead

LL(k) parsers rely on a fixed number of lookahead tokens (k) to make parsing decisions

For some grammars, even with large k, the parser may not have enough information to resolve ambiguities
or determine which production to apply

This results in parsing failures for languages that require more than k tokens of lookahead.

Weakness 2 - Handling of Left Recursion

LL(k) parsers cannot handle left-recursive grammars, where a non-terminal appears as the leftmost symbol on the right-hand side of one of its own productions

The grammar needs to be transformed to remove left recursion, which can complicate grammar design.

Weakness 3 - Language Coverage

LL(k) parsers can only parse the subset of context-free languages that have LL(k) grammars

More powerful parsing techniques (such as LR parsers) are required for languages outside this subset, which reduces
the utility of LL(k) parsers for modern compiler implementations.

What is a Shift-Reduce Parser?

Shift-Reduce Parsing is a type of bottom-up parsing. The goal is to reduce the entire input to the start
symbol of the grammar

The parser works by either shifting symbols from the input to a stack or reducing the top symbols of the
stack into non-terminals based on the grammar rules

Shift-Reduce Parsing Actions

There are four key actions in Shift-Reduce Parsers

Shift: Move the next input symbol onto the stack

Reduce: Apply a production rule to replace symbols on the stack with a non-terminal

Accept: The parsing is complete when the stack contains the start symbol and the input is empty

Error: Occurs when no valid action (shift or reduce) can be applied, and the parser cannot proceed.
Shift-Reduce Parsing Example

Consider the grammar:


• S -> E
• E -> E + T | T
• T -> T * F | F
• F -> ( E ) | id

Let us try to reduce input sequence id * id + id to the start symbol S using the grammar rules.

Let us look at how the input sequence and stack change during parsing.

Continue shifting and reducing until the stack contains the start symbol S and the input is empty

Parsing is successful; the input id * id + id is valid according to the grammar.
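A possible sequence of moves for id * id + id (not reproduced from the original slides; shift/reduce choices are resolved with the usual operator precedence, and the stack is written with its top at the right):

Stack: $              Input: id * id + id $    Action: shift id
Stack: $ id           Input: * id + id $       Action: reduce F -> id
Stack: $ F            Input: * id + id $       Action: reduce T -> F
Stack: $ T            Input: * id + id $       Action: shift *
Stack: $ T *          Input: id + id $         Action: shift id
Stack: $ T * id       Input: + id $            Action: reduce F -> id
Stack: $ T * F        Input: + id $            Action: reduce T -> T * F
Stack: $ T            Input: + id $            Action: reduce E -> T
Stack: $ E            Input: + id $            Action: shift +
Stack: $ E +          Input: id $              Action: shift id
Stack: $ E + id       Input: $                 Action: reduce F -> id
Stack: $ E + F        Input: $                 Action: reduce T -> F
Stack: $ E + T        Input: $                 Action: reduce E -> E + T
Stack: $ E            Input: $                 Action: reduce S -> E
Stack: $ S            Input: $                 Action: accept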

Advantages of Shift-Reduce Parsers

Efficient for many programming languages

Can handle a wide range of grammars, including ambiguous ones with proper conflict resolution

Example: A shift/reduce parser can easily handle arithmetic expressions with proper operator precedence.

Disadvantages of Shift-Reduce Parsers

Requires careful handling of shift/reduce conflicts and reduce/reduce conflicts.

Parsing tables for larger grammars can be complex and difficult to debug.

What is LR Parsing?

LR Parsers are a type of bottom-up parsers that read input Left-to-right and construct a Rightmost
derivation in reverse

The term "LR" stands for Left-to-right scanning of input and Rightmost derivation in reverse

LR Parsers are stronger models than LL(k) Parsers.

The Role of the Parsing Table

A LR Parsing Table is a precomputed structure that directs the parser's actions whether to shift, reduce,
accept, or declare an error

Without the table, the parser would have no guidance for making decisions and would not be able to
parse efficiently
The parser uses the table to decide its next move based on the current state and the next input symbol.

A LR Parsing Table helps us in:

Efficient Decision Making: The parsing table allows the parser to efficiently decide its actions based on a
lookup

Conflict Resolution: Helps manage shift-reduce and reduce-reduce conflicts

Predictive Parsing: The table precomputes all possible shifts and reductions, ensuring the parser operates
in a deterministic way.

Advantages of LR Parsing Table

Using LR Parsing Tables has many advantages:

Precomputation: Tables are computed once, making parsing faster during runtime
Deterministic Parsing
Allows us to handle Complex Grammars.

What is LR(0) Parsing?

LR Parsers are a type of bottom-up parsers that read input Left-to-right and construct a Rightmost
derivation in reverse

LR(0) is the simplest form of LR parsing, which uses no lookahead symbols (i.e., 0 lookahead)

It's one of the earliest and most basic algorithms for constructing shift-reduce parsers

LR(0) Parsers use a DFA to guide their shifts and reductions

This means the parser decides whether to shift or reduce without looking at the next input symbol

The parser processes input by moving through a set of states, making decisions based solely on the
current state and the content of the stack

LR(0) Items

An LR(0) item is a production rule with a dot (.) indicating the current parsing position

For a rule E -> E + T, we have the items:


E -> . E + T (Before parsing any symbols)
E -> E . + T (After recognizing E)
E -> E + . T (After recognizing E +)
E -> E + T . (After recognizing the full right-hand side)

These items track how much of a production rule has been parsed.

Advantages and Limitations of LR(0) Parsing

Advantages
Simplest form of LR parsing
Can be implemented efficiently

Limitations
Cannot handle all context-free grammars
Prone to conflicts due to the lack of lookahead (especially reduce-reduce conflicts)
Only suitable for simple grammars with minimal ambiguity
Example

We will use the following simple grammar:


S -> A
A -> aA
A -> b

This grammar will be used to build the LR(0) parsing table and parse a string.

LR(0) Items for the Grammar

The LR(0) Items for our example grammar would be:


S -> . A
S -> A .
A -> . aA
A -> a . A
A -> aA .
A -> . b
A -> b .

LR(0) items track the position of the "dot" in the production rules, showing progress in the parsing
process.

LR(0) Automaton for the Grammar

LR(0) Parsing Table

The LR(0) parsing table consists of:


Action Table: Guides shift, reduce, and accept actions
Goto Table: Guides transitions to new states after reductions

The Action Table tells us when to:


Shift on terminal a or b
Reduce based on specific LR(0) items

The Goto Table tells us how to transition between states based on non-terminals

Let us try to parse the string "aab" using the LR(0) Parsing Table constructed. This string is accepted by
the grammar
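The table itself is not reproduced here, but the sequence of moves it produces on "aab" can be sketched as follows (stack top at the right, state numbers omitted):

Stack: $            Input: aab$    Action: shift a
Stack: $ a          Input: ab$     Action: shift a
Stack: $ a a        Input: b$      Action: shift b
Stack: $ a a b      Input: $       Action: reduce A -> b
Stack: $ a a A      Input: $       Action: reduce A -> aA
Stack: $ a A        Input: $       Action: reduce A -> aA
Stack: $ A          Input: $       Action: reduce S -> A
Stack: $ S          Input: $       Action: accept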
Transitions:

• Move the dot based on the input symbol (terminal or non-terminal).
• Example: E→E+.T transitions to a new state on input +.

Filling the Parsing Table:

Action Table:
• For terminals, use shift if there is a move on that terminal; use reduce if the dot is at the end of a production.
• Use accept when S′→S. is reached.

Goto Table:
• For non-terminals, transition to the corresponding state based on the grammar.

Parsing
1. Grammar Used:
• S→CC
• C→cC
• C→d

2. Augmented Grammar:
• Add an augmented production: S′→ S.

3. Canonical Collection of Items:


• I0: Closure of S' → . S
• S' → . S
• S→.CC
• C→.cC
• C→.d
• I1: Closure of S' → S .
• S' → S .
• I2: Closure of S → C . C
• S→C.C
• C→.cC
• C→.d
• I3: Closure of C → c . C
• C→c.C
• C→.cC
• C→.d
• I4: Closure of C → d .
• C→d.
• I5: Closure of S → C C .
• S→CC.
• I6: Closure of C → c C .
• C→cC.

4. State Transitions:
• Define shifts and gotos based on terminals (c, d) and non-terminals (C):
• I0 → c → I3 (Shift on c)
• I0 → d → I4 (Shift on d)
• I2 → c → I3 (Shift on c)
• I2 → d → I4 (Shift on d)
• I3 → c → I3 (Shift on c)
• I3 → d → I4 (Shift on d)
• I0 → C → I2 (Goto on C)
• I2 → C → I5 (Goto on C)
• I3 → C → I6 (Goto on C)

5. LR(0) Parsing Table Construction:


• Action Table: Includes shift/reduce actions for terminals based on transitions.
• Goto Table: Specifies state transitions for non-terminals.

6. Example Parsing Table:


• For S→CC, C→cC, C→d, the table maps:
• [0,c]→Shift 3
• [0,d]→Shift 4
• [3,C]→Goto 6

This process demonstrates how to construct an LR(0) table systematically, enabling deterministic parsing
for simple grammars.
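The closure and goto operations used to build these item sets can also be written compactly. A minimal sketch in Python (the item representation and function names are illustrative, not taken from the notes):

# A minimal sketch of LR(0) closure and goto for the grammar
# S' -> S, S -> CC, C -> cC | d. An item is (head, body, dot position).
grammar = {
    "S'": [("S",)],
    "S":  [("C", "C")],
    "C":  [("c", "C"), ("d",)],
}
nonterminals = set(grammar)

def closure(items):
    # Add items X -> .gamma for every non-terminal X right after a dot.
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in nonterminals:
                for prod in grammar[body[dot]]:
                    item = (body[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    # Move the dot over `symbol` in every item that allows it, then close.
    moved = {(h, b, d + 1) for (h, b, d) in items if d < len(b) and b[d] == symbol}
    return closure(moved) if moved else frozenset()

I0 = closure({("S'", ("S",), 0)})
I3 = goto(I0, "c")        # should contain C -> c . C plus the C-productions
print(sorted(I0))
print(sorted(I3))

goto(I0, "d") and goto(I2, "C") can be computed in the same way, reproducing the transitions listed above.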
What is LR(1) Parsing and LR(k) Parsing

Key Concepts:

1. Definition of LR(1) Parsing:


• L: Parses the input from left to right.
• R: Constructs a rightmost derivation in reverse.
• 1: Uses 1 lookahead symbol to make parsing decisions, allowing the parser to consider the
current and next input tokens.

2. Components of LR(1):
• States: Represent parsing configurations at each step.
• Lookahead: A single symbol that helps the parser decide between ambiguous production
rules.
• Action Table: Dictates whether to:
◦ Shift: Move to the next state by reading the input symbol.
◦ Reduce: Apply a production rule to reduce the stack.
◦ Accept: Successfully parse the input.
◦ Error: Detect and report a syntax error.

3. Example of an LR(1) Item:


• [A→α.β,a]
◦ A→α.β is the production rule with a dot indicating the current parsing position.
◦ a is the lookahead symbol, helping resolve ambiguities.
◦ The lookahead helps distinguish between situations where the parser might
otherwise be confused about which production to apply, especially when there are
ambiguities in the grammar
◦ LR(k) differs from LR(1) in that it has k>1 lookahead symbols for parsing instead of
LR(1)'s single lookahead symbol

4. Comparison with LR(0):


• LR(0): No lookahead symbols, making it less capable of handling ambiguous grammars.
• LR(1): Uses one lookahead symbol to handle more complex constructs.
• LR(k): Extends LR(1) by using k-lookahead symbols (k>1) for even greater disambiguation,
though at increased computational complexity.

LR(1) parsing is widely used in practice and forms the basis for many parser generators like Yacc and
Bison, offering a balance between complexity and capability.

Key Concepts:

1. LL(k) Parsers:
• Definition: Top-down parsers that scan input left to right and produce a leftmost derivation
using k-lookahead symbols.
• Key Characteristics:
◦ Cannot handle left-recursive grammars.
◦ Simpler to implement but more restrictive.
• Class Inclusion: LL(1)⊆LL(2)⊆...⊆LL(k).
◦ As k increases, LL parsers can handle more complex grammars.

2. LR(0) Parsers:
• Definition: Bottom-up parsers that scan input left to right and construct a rightmost
derivation in reverse without using lookahead symbols.
• Key Characteristics:
◦ Can handle left recursion.
◦ More expressive than LL parsers but limited in resolving parsing conflicts.
• Class Inclusion: every LL(1) grammar is LR(1); LL(1) and LR(0) are incomparable classes (neither contains the other).
3. LR(1) Parsers:
• Definition: Bottom-up parsers that use 1 lookahead symbol to resolve conflicts.
• Key Characteristics:
◦ Can handle all deterministic context-free languages.
◦ Resolves ambiguities that LR(0) cannot handle.
• Class Inclusion: LL(1)⊆LL(2)⊆…⊆LL(k)⊆LR(k), and LR(0)⊆LR(1)⊆…⊆LR(k).

4. Hierarchy Summary:
• The hierarchy shows that LR(k) parsers are the most powerful and are able to handle a broader
class of grammars than LL(k) parsers.
• While LL(k) parsers are easier to implement, LR(k) parsers are better suited for complex
grammars, making them essential in real-world applications.

Understanding this hierarchy helps in selecting the appropriate parser type for specific grammar
complexities.
Week - 10

What is POS Tagging

Definition: POS (Part-of-Speech) Tagging is a process in natural language processing (NLP) that involves
identifying and labeling the parts of speech (nouns, verbs, adjectives, etc.) in a given text.

• Example: For the sentence "The quick brown fox jumps":


◦ Tags: "The" → Determiner, "quick" → Adjective, "brown" → Adjective, "fox" → Noun,
"jumps" → Verb.
• Importance: Provides foundational syntactic information essential for advanced NLP tasks.

Techniques for POS Tagging:

• Rule-Based Approaches: Use predefined linguistic rules; effective for structured text but lacks
adaptability for complex scenarios.

• Statistical and Machine Learning Approaches:


◦ Hidden Markov Models (HMMs): Probability-based tagging, considering word sequence
likelihood.
◦ Conditional Random Fields (CRFs): Context-aware statistical models improving accuracy.

• Neural Networks: Leverage word embeddings and deep learning for state-of-the-art performance.
• Transformers (e.g., BERT): Use contextualised embeddings for higher tagging accuracy.
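For a quick practical illustration of the statistical taggers listed above, a minimal sketch assuming the NLTK library (with its tokenizer and tagger models downloaded; the exact tags depend on the model used):

import nltk
# one-time setup (assumed): nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The quick brown fox jumps")
print(nltk.pos_tag(tokens))
# roughly: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ')]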

Applications of POS Tagging:

• Syntax Analysis: Lays the groundwork for syntactic parsing and grammar checking.
• Named Entity Recognition (NER): Differentiates named entities from other nouns.
• Sentiment Analysis: Identifies opinions using adjectives and adverbs.
• Information Retrieval: Improves query understanding by analysing word roles.
• Machine Translation: Enhances translation quality by aligning syntactic structures across
languages.

What is Context-Free Grammars (CFG)?

A grammar that defines a set of rules for sentence structure in a language

Developed by Noam Chomsky to explain the syntax of sentences without relying on context

Example:
"Colorless green ideas sleep furiously” (Chomsky, 1957) - grammatically correct but semantically
nonsensical

Importance in Computer Science:


Essential for designing parsers that analyze code or natural language
Used in NLP to interpret sentence structures syntactically

Relevance to POS Tagging:


• Helps in deciding parts of speech based on sentence structure, e.g., whether "run" is a noun or a verb in
different contexts
Basic Structure of CFGs

Defining Rules in CFG:


• A CFG consists of rules defining how words and phrases combine to form sentences
• Example: A sentence (S) = Noun Phrase (NP) + Verb Phrase (VP)

Focus on Syntax:
• CFGs operate at the syntactic level, analyzing structure, not meaning

CFG for NLP:
• Helps build a rule-based approach to parse sentences

Parse Trees and Sentence Structure

What is a Parse Tree?

A hierarchical structure representing the grammatical structure of a sentence


Each branch represents a rule from the CFG

Example: Groucho Marx's joke "I shot an elephant in my pajamas"


Multiple interpretations show di erent parse trees for the same sentence

Ambiguity in Parse Trees

Understanding Ambiguity:

Example: "I shot an elephant in my pajamas"


• Interpretation 1: I was in my pajamas when I shot the elephant
• Interpretation 2: The elephant was in my pajamas

How Parse Trees Handle Ambiguity:

Different parse trees represent the distinct syntactic interpretations


Shows the challenges in building accurate parsers

Role of POS Tagging and CFGs in NLP

Using POS Tagging in CFGs:


POS tagging identifies parts of speech, helping in disambiguating meanings
CFG defines valid combinations for structures (e.g., VP, NP)

Creating a Simple CFG:


Example Grammar Rules:
◦ S→NP VP
◦ NP→Det N ∣ Det N PP ∣ ′I′
◦ VP→V NP ∣ VP PP
◦ Det→′an′ ∣ ′my′
◦ N→′elephant′ ∣ ′pajamas′
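A minimal sketch of this grammar in code, assuming the NLTK library; the V, P, and PP rules are added here because the example sentence needs them, even though they are not listed above:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
PP -> P NP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

parser = nltk.ChartParser(groucho_grammar)
sentence = "I shot an elephant in my pajamas".split()
for tree in parser.parse(sentence):
    print(tree)    # two trees: PP attached to the NP vs. to the VP

The two trees printed correspond to the two interpretations discussed in the ambiguity example above.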

Example Sentence and CFG Rules

Example Sentence: "The cat sat on the mat”

CFG rules for POS tagging:


• S -> NP VP (Sentence consists of a Noun Phrase + Verb Phrase)
• NP -> Det N (Noun Phrase consists of Determiner + Noun)
• VP -> V PP (Verb Phrase consists of Verb + Prepositional Phrase)
• PP -> P NP (Prepositional Phrase consists of Preposition + Noun Phrase)
• Tags:
The = Det, cat = N, sat = V, on = P, the = Det, mat = N
The Role of Parsers in POS Tagging

Using Parsers with CFGs:


Parsers analyse sentence structure using CFG rules to generate POS tags

Types of Parsers:
• Top-Down Parsers: Start from the sentence level, breaking down into smaller constituents (e.g., NP,
VP)
• Bottom-Up Parsers: Start with individual words and build up to form the full sentence structure

Classic View:
Treats POS tagging as a parsing problem where parsing rules guide tagging

Parsing and POS Tagging Example

Limitations of POS Tagging as Parsing

Challenges with the Classic Approach:


• Rigidity: CFG rules may not cover all syntactic structures in natural language
• Ambiguity: Certain words (e.g., "run") can be both nouns and verbs; CFG alone may not resolve these
• Complexity: Real-world sentences often involve nuanced structure, making manual CFG creation
labor-intensive

Modern Solutions:
• Use of machine learning and statistical models has largely replaced rule-based POS tagging

Introduction to Context-Sensitive Grammars (CSGs)

What are Context-Sensitive Grammars?

• A type of formal grammar in the Chomsky hierarchy, more powerful than Context-Free Grammars
(CFGs)
• Can capture language structures where the context around a symbol affects its production rules
• Key Feature:
• The production rules can depend on the surrounding context of symbols, making CSGs more flexible
for complex languages
• Context-Sensitive Grammars (CSGs) allow rules that depend on the context around symbols
• Unlike simpler grammars, CSGs can capture relationships between elements that rely on position or
neighboring symbols
• CSGs are more powerful than context-free grammars and describe languages that require context for
certain rules to apply
Context-Sensitive Grammars (CSGs) and POS Tagging

Challenges and Complexity

Computational Complexity:
• Parsing context-sensitive languages is generally more complex than parsing context-free languages, often
requiring non-deterministic linear space

Practical Limitations:
• Few efficient parsers for CSGs exist, making them impractical for real-time applications
• Due to their complexity, many languages modeled by CSGs are approximated by simpler grammar types
where possible

CSGs in Theoretical Studies:


• Often studied in theoretical computer science to understand the limits of computational language
models
1. Grammar for Type-Checked Arithmetic Expressions (example):
• Non-Terminals:
◦ <Program>: Holds all statements
◦ <Statement>: Assigns a variable to a value
◦ <Expr>: Represents expressions with variables and value
◦ <Var>: Variable names
◦ <Type>: Specifies type annotations (int or str)
• Terminals:
◦ Variable names: x, y, etc
◦ Literal values: Integers (1, 2) and strings ("hello")
◦ Types: int, str
◦ Operators: +
◦ Assignment: =

2. Grammar Rules:
• <Program> → <Statement> <Program> | ε
• <Statement> → <Var> = <Expr> : <Type>
• <Expr>: int → <Expr> + <Expr> : int
• <Expr>: int → 0 | 1 | 2 | ... | 9
• <Expr>: str → "..." (any string literal)
• <Var>: int → x | y | z (integer variables)
• <Var>: str → s | t | u (string variables)
• <Type> → int | str

3. Why is this Grammar Context-Sensitive?


• Rules depend on the type (<Type>) declared for <Var> and <Expr>
• Enforces type compatibility in assignments and operations

4. Example Parse:
• For x = 3 + 5 : int:
◦ <Statement> → x = <Expr> + <Expr> : int
◦ <Expr> + <Expr> : int → 3 + 5 : int
• For s = "hello" : str:
◦ <Statement> → s = <Expr> : str
◦ <Expr> : str → "hello"

5. Comparison with Context-Free Grammars (CFGs):


• CFGs are independent of surrounding symbols, while CSGs apply rules based on the
context.
• CSGs are more powerful but computationally expensive.
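The context-sensitivity here amounts to a type-compatibility check, which can be sketched directly in code (the statement format and helper names below are illustrative, not part of the grammar above):

def expr_type(token):
    # Type of a single literal: digits are int, quoted text is str.
    if token.isdigit():
        return "int"
    if token.startswith('"') and token.endswith('"'):
        return "str"
    raise ValueError(f"unknown literal {token!r}")

def check(statement):
    # Check statements of the form  var = expr : type  where expr uses '+'.
    lhs, rest = statement.split("=", 1)
    expr, declared = rest.rsplit(":", 1)
    declared = declared.strip()
    operand_types = {expr_type(p.strip()) for p in expr.split("+")}
    return operand_types == {declared}   # every operand must match the declared type

print(check('x = 3 + 5 : int'))          # True
print(check('s = "hello" : str'))        # True
print(check('x = 3 + "hello" : int'))    # False: mixed int and str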

1. Grammar for Taxonomic Classification (Example):


• Non-Terminals:
◦ <Classification>: Represents the taxonomy structure
◦ <Family>: A family name
◦ <Genus>: A genus name under a family
◦ <Species>: A species name under a genus
• Terminals:
◦ Family names: Felidae, Canidae
◦ Genus names: Felis, Canis
◦ Species names: catus, lupus

2. Grammar Rules:
• <Classification> → <Family> : <Genus> <Species> <Classification> | ε
• <Family>: Felidae → <Genus>: Felis
• <Genus>: Felis → <Species>: catus | silvestris
• <Family>: Canidae → <Genus>: Canis
• <Genus>: Canis → <Species>: lupus | familiaris
3. Context-Sensitive Rules:
• Latin Naming Conventions: Genus names are capitalised; species names are lowercase
and italicised.
• Species-Specific Context: A species must belong to its defined genus and family.

4. Example Parse:
• For Felidae: Felis catus:
◦ <Classification> → <Family>: <Genus> <Species>
◦ <Family>: Felidae → <Genus>: Felis
◦ <Genus>: Felis → <Species>: catus
• For Canidae: Canis lupus:
◦ <Classification> → <Family>: <Genus> <Species>
◦ <Family>: Canidae → <Genus>: Canis
◦ <Genus>: Canis → <Species>: lupus

5. Comparison with Context-Free Grammars (CFGs):


• CSGs allow context-dependent rules, critical for maintaining hierarchical relationships.
• CFGs are simpler but cannot enforce contextual constraints like species-genus-family
relationships.

This grammar showcases how CSGs handle structured, context-sensitive tasks like taxonomic
classification with precision.

1. Applications of CSGs:

• Natural Language Processing (NLP):


◦ Number Agreement: Ensures subject-verb agreement, such as "The dog
runs" (singular) vs. "The dogs run" (plural)
◦ Gender Agreement: Used in languages like French to match adjectives with nouns
based on gender and number

• Biological Sequence Analysis:


◦ Protein Folding: Models hierarchical rules in amino acid sequences, predicting 3D
structures based on specific context-dependent interactions
◦ Benefit: Facilitates advancements in drug discovery and genetic research

• Markup Language and Formatting:


◦ HTML/XML Validation: Ensures correct nesting of elements, like <div> inside
<body>
◦ Benefit: Guarantees consistency in document structure for automated validation and
generation

• Compilers and Programming Languages:


◦ Variable Scope and Type Checking: Enforces rules for variable declaration and
usage based on context
◦ Benefit: Maintains syntactic and logical correctness in programming, reducing errors

• Hierarchical Data Modeling:


◦ Taxonomies and Classification Systems: Ensures species match the correct genus
and family in biological taxonomies
◦ Benefit: Provides consistent data organisation for research and cataloguing

2. Significance of CSGs:


• Essential for systems requiring precise, context-based rule enforcement
• Widely used in fields like NLP, computational biology, web development, and compiler
design
Chomsky Hierarchy

Proposed by linguist Noam Chomsky, this hierarchy classifies types of formal grammars based on their complexity
and power

It's a theoretical framework used to understand different language classes in computer science,
linguistics, and automata theory.

The hierarchy includes four main types of grammars:

• Type 0: Unrestricted Grammars


• Type 1: Context-Sensitive Grammars (CSGs)
• Type 2: Context-Free Grammars (CFGs)
• Type 3: Regular Grammars

Type 0 - Unrestricted Grammars

Definition:
Type 0 grammars (Unrestricted Grammars) have no restrictions
on their production rules, making them the most powerful.

Formal Definition:
Productions are of the form α → β, where α and β are strings of variables and terminals, with α containing at least one non-terminal.

Languages Generated:
Type 0 grammars can generate any language that is recursively
enumerable

Computational Model:
These grammars correspond to Turing machines, making them
as powerful as any algorithmically describable process.

Type 1 - Context-Sensitive Grammars (CSGs)

Definition: Type 1 grammars (CSGs) have rules where each production must have at least as many
symbols on the right side as on the left, preserving context

Formal Definition: Productions are of the form αAβ → αγβ, where A is a non-terminal, α and β are strings of terminals and non-terminals, and γ is a non-empty string.

Languages Generated: These grammars generate context-sensitive languages, which include certain
complex patterns not possible in lower levels

Computational Model: CSGs correspond to linear-bounded automata, which are Turing machines with
limited memory.

Type 2 - Context-Free Grammars (CFG)

Definition: Type 2 grammars (CFGs) have rules where a single non-terminal symbol is transformed
independently of surrounding symbols.

Formal Definition: Productions are of the form A → γ, where A is a non-terminal and γ is a string of terminals and/or non-terminals.

Languages Generated: CFGs generate context-free languages, commonly used in programming languages and for the syntactic structure of natural languages.
Computational Model: CFGs are recognised by pushdown automata, which use a stack-based memory structure.

Type 3 - Regular Grammars

Definition:
Type 3 grammars (Regular Grammars) have very strict rules, where productions involve a non-terminal
transforming into a terminal, optionally followed by a non-terminal

Formal Definition:
Productions are of the form A → aB or A → a, where A and B are non-terminals, and a is a terminal.

Languages Generated: Regular grammars generate regular languages, the simplest language class, which
includes patterns like sequences and repetitions.

Computational Model: Regular languages are recognised by finite automata (both deterministic and non-
deterministic).

Visualisation of the Hierarchy:

• Regular Grammars ⊆ CFGs ⊆ CSGs ⊆ Unrestricted Grammars
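Putting the four levels side by side (summarising the definitions above):

• Type 0 (Unrestricted): α → β; recursively enumerable languages; recognised by Turing machines
• Type 1 (Context-Sensitive): αAβ → αγβ with γ non-empty; context-sensitive languages; recognised by linear-bounded automata
• Type 2 (Context-Free): A → γ; context-free languages; recognised by pushdown automata
• Type 3 (Regular): A → aB or A → a; regular languages; recognised by finite automata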


