Unit 4 Context-Free Language
Formal Definition
A CFG is a 4-tuple, G = (VN , Σ, S, P), where
VN and Σ : Disjoint finite sets of non-terminal symbols (or variables)
and terminal symbols (the alphabet), respectively.
S : Start symbol, S ∈ VN
P : Grammar rules or productions, a finite set of formulas of the
form A → α, where A ∈ VN and α ∈ (VN ∪ Σ)∗ .
L(G) = {x ∈ Σ∗ |S ⇒∗G x}
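The 4-tuple definition above can be mirrored directly in code. The following is a minimal Python sketch, not a standard library type: the class name CFG, its field names, and the example grammar are assumptions made for illustration. It only checks the conditions stated in the definition (disjoint symbol sets, start symbol among the variables, productions of the form A → α).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    variables: frozenset    # V_N
    terminals: frozenset    # Sigma
    start: str              # S
    productions: tuple      # P: pairs (A, alpha), alpha a tuple of symbols

    def __post_init__(self):
        # V_N and Sigma must be disjoint.
        if self.variables & self.terminals:
            raise ValueError("V_N and Sigma must be disjoint")
        if self.start not in self.variables:
            raise ValueError("start symbol must be a variable")
        symbols = self.variables | self.terminals
        for head, body in self.productions:
            # Each production is A -> alpha with A in V_N, alpha in (V_N u Sigma)*.
            if head not in self.variables:
                raise ValueError(f"left side {head!r} is not a variable")
            if any(sym not in symbols for sym in body):
                raise ValueError(f"right side {body!r} uses an unknown symbol")


# Hypothetical example grammar: S -> SS | (S) | epsilon (empty tuple = epsilon).
balanced = CFG(
    variables=frozenset({"S"}),
    terminals=frozenset({"(", ")"}),
    start="S",
    productions=(("S", ("S", "S")), ("S", ("(", "S", ")")), ("S", ())),
)
```

Representing ϵ as the empty tuple keeps every right side a plain sequence of symbols, so no special ϵ symbol is needed.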
L1 ∪ L2 ⊆ L(Gu ) (1)
L(Gu ) ⊆ L1 ∪ L2 (2)
Vc = V1 ∪ V2 ∪ {Sc }
Pc = P1 ∪ P2 ∪ {Sc → S1 .S2 }
V = V1 ∪ {S}, where S ∉ V1
P = P1 ∪ {S → S1 S | ϵ}
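The concatenation and Kleene-star constructions above translate almost literally into set operations. This is a small Python sketch under stated assumptions: a grammar is a hypothetical triple (variables, productions, start), productions are (head, body) pairs with the body a tuple of symbols, and the fresh start-symbol names are arbitrary defaults.

```python
def concat_grammar(g1, g2, new_start="Sc"):
    """Vc = V1 u V2 u {Sc},  Pc = P1 u P2 u {Sc -> S1 S2}."""
    v1, p1, s1 = g1
    v2, p2, s2 = g2
    if v1 & v2:
        raise ValueError("rename variables so that V1 and V2 are disjoint")
    if new_start in v1 | v2:
        raise ValueError("new start symbol must be fresh")
    return (v1 | v2 | {new_start},
            p1 | p2 | {(new_start, (s1, s2))},
            new_start)


def star_grammar(g1, new_start="S"):
    """V = V1 u {S} with S not in V1,  P = P1 u {S -> S1 S | epsilon}."""
    v1, p1, s1 = g1
    if new_start in v1:
        raise ValueError("new start symbol must be fresh")
    return (v1 | {new_start},
            p1 | {(new_start, (s1, new_start)), (new_start, ())},
            new_start)


# Hypothetical one-production grammars for {a} and {b}.
g_a = ({"S1"}, {("S1", ("a",))}, "S1")
g_b = ({"S2"}, {("S2", ("b",))}, "S2")
g_ab = concat_grammar(g_a, g_b)
g_a_star = star_grammar(g_a)
```

The disjointness checks correspond to the side conditions of the construction; without them a shared variable name could let derivations mix productions of the two grammars.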
Corollary
Every regular language is a CFL.
Exercise
Construct a CFG for L = {0^i 1^j 0^k : j > i + k}.
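[Figure: parse tree with root S. Reading the levels from top to bottom: S; S S; ( S ) ( S ); ( S ) ϵ; ϵ. Its yield is the string (())().]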
Theorem
Let G = (V , Σ, S, P) be a context-free grammar and let A ∈ V and
w ∈ Σ∗ . Then the following statements are equivalent:
1 A ⇒∗ w
2 There exists a parse tree with root A and yield w.
3 There is a leftmost derivation A ⇒∗L w.
4 There is a rightmost derivation A ⇒∗R w.
Ambiguous CFG
A CFG G is called ambiguous if there is at least one string in L(G) having
two or more distinct derivation trees (or, equivalently, two or more
distinct leftmost derivations).
For example:
Consider a + a ∗ a in the grammar of algebraic expressions.
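Ambiguity of a particular string can be checked mechanically by counting its parse trees. The sketch below is one way to do this in Python. It assumes that the "grammar of algebraic expressions" is S → S + S | S ∗ S | (S) | a, that the grammar has no ϵ-productions and no cyclic unit productions (otherwise the recursion would not terminate), and that ∗ is written as the ASCII character *. The function name and grammar encoding are illustrative choices.

```python
from functools import lru_cache


def count_parse_trees(productions, start, w):
    """Count parse trees with root `start` and yield `w`.

    productions: dict mapping each variable to a list of right sides,
    each right side a tuple of symbols. Symbols that are not keys of
    the dict are treated as terminals.
    """
    @lru_cache(maxsize=None)
    def count(sym, i, j):
        # Number of parse trees with root sym and yield w[i:j].
        if sym not in productions:                      # terminal symbol
            return 1 if j == i + 1 and w[i] == sym else 0
        return sum(split(rhs, i, j) for rhs in productions[sym])

    @lru_cache(maxsize=None)
    def split(rhs, i, j):
        # Number of ways the symbols of rhs derive w[i:j], with every
        # symbol covering at least one character (no epsilon-productions).
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        total = 0
        for m in range(i + 1, j - len(rhs) + 2):
            left = count(rhs[0], i, m)
            if left:
                total += left * split(rhs[1:], m, j)
        return total

    return count(start, 0, len(w))


# Assumed grammar of algebraic expressions.
G_AMBIG = {"S": [("S", "+", "S"), ("S", "*", "S"), ("(", "S", ")"), ("a",)]}
trees = count_parse_trees(G_AMBIG, "S", "a+a*a")   # two trees: (a+a)*a and a+(a*a)
```

A count greater than 1 for some string witnesses that the grammar is ambiguous; a count of 1 for the strings tried says nothing about the grammar as a whole.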
Consider
L = {a^n b^n c^m d^m | n ≥ 1, m ≥ 1} ∪ {a^n b^m c^m d^n | n ≥ 1, m ≥ 1}.
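[Figure: two parse trees for the string aabbccdd. In the first, the root S has children A and B, where A yields aabb and B yields ccdd. In the second, the root S has the single child C, with C → aCd, then C → aDd, and D yields bbcc.]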
Since there are two distinct leftmost derivations for the same string, this
grammar is ambiguous. In fact, it can be shown that every grammar for L
must be ambiguous. Intuitively, any string with
equal numbers of a's, b's, c's and d's can be generated in two different ways:
one in which the a's and b's are generated to be equal and the c's
and d's are generated to be equal, and a second in which the a's
and d's are generated to be equal and likewise the b's and c's.
Dr. Lavanya Selvaganesh CSO322 24 / 60
An unambiguous cfg for algebraic expressions
Theorem:
For each grammar G = (V , Σ, S, P) and string w ∈ Σ∗ , w has two
distinct parse trees if and only if w has two distinct leftmost derivations from S.
Consider:
G1 : S1 → S1 + T |T
T → T ∗ F | F
F → (S1 ) | a.
Theorem:
Let G be a context-free grammar with productions
S → S + S | S ∗ S | (S) | a and let G1 be the grammar as above. Then
L(G) = L(G1 ).
The above result claims that the two grammars are equivalent.
To prove this, we show both inclusions: L (G1 ) ⊆ L(G) and L(G) ⊆ L (G1 ).
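Before working through the proof, the claim L(G) = L(G1) can be sanity-checked on short strings. The following Python sketch enumerates all terminal strings up to a length bound by breadth-first search over leftmost derivations. It is a bounded check, not a proof, and it assumes the grammar has no ϵ-productions, so that sentential forms never shrink and forms longer than the bound can be discarded. The function name is hypothetical and ∗ is written as *.

```python
from collections import deque


def strings_up_to(productions, start, max_len):
    """All terminal strings of length <= max_len derivable from start."""
    seen = {(start,)}
    queue = deque(seen)
    words = set()
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, if any.
        pos = next((k for k, s in enumerate(form) if s in productions), None)
        if pos is None:
            words.add("".join(form))        # no variables left: a terminal string
            continue
        for rhs in productions[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Forms cannot shrink, so anything longer than max_len is useless.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words


G = {"S": [("S", "+", "S"), ("S", "*", "S"), ("(", "S", ")"), ("a",)]}
G1 = {
    "S1": [("S1", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "S1", ")"), ("a",)],
}
same_up_to_5 = strings_up_to(G, "S", 5) == strings_up_to(G1, "S1", 5)
```

Restricting the search to leftmost derivations loses nothing, because every derivable string also has a leftmost derivation.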
Proof I
Claim 1: L (G1 ) ⊆ L(G).
The proof is by induction on the length of a string in L (G1 ). The basis step
is to show that a ∈ L(G), which is true.
In the induction step, we assume that k ≥ 1 and that every y ∈ L (G1 )
satisfying |y | ≤ k is in L(G).
To show: if x ∈ L (G1 ) and |x| = k + 1, then x ∈ L(G).
Since x ≠ a, any derivation of x in G1 must begin in one of the three
ways:
S1 ⇒ S1 + T
S1 ⇒ T ⇒ T ∗ F
S1 ⇒ T ⇒ F ⇒ (S1 )
If x has a derivation beginning S1 ⇒ S1 + T , then x = y + z, where
S1 ⇒∗G1 y and T ⇒∗G1 z. Since S1 ⇒∗G1 T , it follows that S1 ⇒∗G1 z.
Therefore, since |y | and |z| must be ≤ k , the induction hypothesis
implies that y and z are both in L(G). Since G contains the production
S → S + S, the string y + z is derivable from S in G, and hence
x ∈ L(G). The other two cases are similar, using the productions
S → S ∗ S and S → (S).
Proof II
Claim 2: L(G) ⊆ L(G1 ).
Induction on |x|: Let k ≥ 1 and assume that for every y ∈ L(G) with |y | ≤ k ,
y ∈ L(G1 ).
To show: if x ∈ L(G) and |x| = k + 1, then x ∈ L(G1 ).
The simplest case is when x has a derivation in G beginning S ⇒ (S).
In this case, x = (y ), for some y in L(G), and it follows from the
inductive hypothesis that y ∈ L (G1 ).
Therefore, we can derive x in G1 by starting with the derivation
S1 ⇒ T ⇒ F ⇒ (S1 ) and then deriving y from S1 .
1 Suppose x has a derivation in G that begins S ⇒ S + S. Then, just
as before, the induction hypothesis tells us that x = y + z, where y
and z are both in L(G1 ).
Now to conclude x ∈ L(G1 ), we need z to be derivable from T ;
in other words, we would like z to be a single term, the last of the
terms whose sum is x. Let x = x1 + x2 + · · · + xn , where each
xi ∈ L(G1 ) and n ≥ 2 is as large as possible.
Proof III
By the maximality of n, none of the xi's can have a derivation in G1 that begins
S1 ⇒ S1 + T ; therefore, every xi can be derived from T in G1 . Let
y = x1 + x2 + · · · + xn−1 , z = xn .
Then y can be derived from S1 , since S1 ⇒∗G1 T + · · · + T (n − 1 terms), and z can
be derived from T .
It follows that x ∈ L (G1 ), since we can start with the production
S1 → S1 + T and then derive y from S1 and z from T .
2 Finally, suppose that every derivation of x in G begins with
S ⇒ S ∗ S. Then for some y and z in L(G), x = y ∗ z. This time
we let x = x1 ∗ x2 ∗ · · · ∗ xn , where each xi ∈ L(G) and n is as large
as possible. Then by the inductive hypothesis, each xi ∈ L (G1 ).
Suppose some xi has a derivation in G1 that begins S1 ⇒ T ⇒ T ∗ F . Then
xi would be of the form yi ∗ zi for some yi , zi ∈ L (G1 ). Since we know
L (G1 ) ⊆ L(G), this would contradict the maximality of n.
Suppose that some xi had a derivation in G1 beginning S1 ⇒ S1 + T .
Then xi = yi + zi , for some yi , zi ∈ L (G1 ) ⊆ L(G). In this case,
x = x1 ∗ x2 ∗ · · · ∗ xi−1 ∗ yi + zi ∗ xi+1 ∗ · · · ∗ xn
Theorem
The CFG G1 with productions
S1 → S1 + T | T
T → T ∗ F | F
F → (S1 ) | a
is unambiguous.
Proof: Claim: Every string in L (G1 ) has only one leftmost derivation from
S1 .
The proof is by induction on |x|, and it is actually easier to
prove something apparently stronger:
For any x derivable from one of the variables S1 , T or F , x has only
one leftmost derivation from that variable.
Basis step: the string a can be derived from each of the three variables, and
in each case there is only one leftmost derivation.
Nullable Variable
A nullable variable in a CFG G = (V , Σ, S, P) is defined as
Any variable A for which P contains the production A → ϵ is
nullable.
If P contains a production of the form A → B1 B2 . . . Bn , where
each of B1 , B2 , . . . , Bn is a nullable variable, then A is nullable.
No other variable is nullable.
A → CD | C | D, B → Cb | b, C → a, D → bD | b.
Algorithm: FindNULL
Finding the nullable variables in a CFG G = (V , Σ, S, P) .
N0 = {A ∈ V : P contains the production A → ϵ}.
i = 0;
do
{ i = i + 1;
  Ni = Ni−1 ∪ {A : P contains A → α, for some α ∈ (Ni−1 )∗ }
}
while Ni ≠ Ni−1 ;
Ni is the set of nullable variables.
Applying the algorithm to the previous example, we get
N0 = {C, D}
N1 = {A, C, D}
Since no other productions have right sides in {A, C, D}∗ , these three
are the only nullable variables.
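The FindNULL iteration can be written as a short fixed-point loop. The following Python sketch follows the pseudocode step by step. The example grammar A → CD, B → Cb, C → a | ϵ, D → bD | ϵ is an assumption, chosen only because it is consistent with the sets N0 = {C, D} and N1 = {A, C, D} reported above. The function name and the (head, body) encoding are also illustrative.

```python
def find_nullable(productions):
    """Return the set of nullable variables.

    productions: iterable of (head, body) pairs, where body is a tuple
    of symbols and the empty tuple stands for epsilon.
    """
    productions = list(productions)
    # N0: variables with a production A -> epsilon.
    nullable = {head for head, body in productions if not body}
    while True:
        # N_i = N_{i-1} u {A : A -> alpha with alpha in (N_{i-1})*}.
        bigger = nullable | {
            head for head, body in productions
            if all(sym in nullable for sym in body)
        }
        if bigger == nullable:          # N_i == N_{i-1}: fixed point reached
            return nullable
        nullable = bigger


# Assumed example grammar (see the lead-in).
example = [
    ("A", ("C", "D")),
    ("B", ("C", "b")),
    ("C", ("a",)), ("C", ()),
    ("D", ("b", "D")), ("D", ()),
]
nullable_vars = find_nullable(example)   # {"A", "C", "D"}
```

The loop must terminate because each pass either adds a variable or stops, and there are only finitely many variables.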
Finding an equivalent CFG with no ϵ-productions
Xi ⇒∗G1 xi .
Therefore, A ⇒∗G1 x.
Next, we show the converse: for any n ≥ 1, if A ⇒^n_G1 x, then
A ⇒∗G x, by induction on n.
Basis step: If A ⇒G1 x, then A → x is a production in P1 . This
means that A → α is a production in P, where x is obtained from
α by deleting zero or more nullable variables.
It follows that A ⇒∗G x, because we can begin a derivation with the
production A → α and proceed by deriving ϵ from each nullable
variable that was deleted.
In the induction step, we assume that k ≥ 1 and that any string
other than ϵ that is derivable from A in k or fewer steps in G1 is also
derivable from A in G.
Proof - III
We next show that if x ≠ ϵ and A ⇒^(k+1)_G1 x, then A ⇒∗G x.
Suppose that the first step in the (k + 1)-step derivation of x in G1 is
A → X1 X2 · · · Xm , where each Xi is either a variable or a terminal.
Let x = x1 x2 · · · xm , where each xi is equal to Xi or is derivable
from Xi in k or fewer steps in G1 .
By the induction hypothesis, Xi ⇒∗G xi , for each i.
By the definition of G1 , there is a production A → α in P so that
X1 X2 · · · Xm can be obtained from α by deleting certain nullable
variables.
Since
A ⇒∗G X1 X2 · · · Xm ,
we can derive x from A in G, by first deriving X1 X2 · · · Xm and then
deriving each xi from the corresponding Xi .
Hence the proof.
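The proof relies on how P1 is built from P: every production A → α contributes all the right sides obtained from α by deleting some (possibly none) of its nullable variables, and ϵ-productions are dropped. A minimal Python sketch of that construction follows. The function name and the example grammar are assumptions, and the nullable set is passed in explicitly. Unlike some textbook versions, this sketch does not remove productions of the form A → A.

```python
from itertools import product


def remove_epsilon_productions(productions, nullable):
    """Build P1 from P, given the set of nullable variables.

    productions: iterable of (head, body) pairs, body a tuple of symbols
    (empty tuple = epsilon).
    """
    new_productions = set()
    for head, body in productions:
        # Each nullable symbol may be kept or deleted; others must be kept.
        choices = [((sym,), ()) if sym in nullable else ((sym,),)
                   for sym in body]
        for pick in product(*choices):
            new_body = tuple(s for part in pick for s in part)
            if new_body:                    # never emit an epsilon-production
                new_productions.add((head, new_body))
    return new_productions


# Assumed example grammar: A -> CD, B -> Cb, C -> a | eps, D -> bD | eps.
P = [
    ("A", ("C", "D")),
    ("B", ("C", "b")),
    ("C", ("a",)), ("C", ()),
    ("D", ("b", "D")), ("D", ()),
]
P1 = remove_epsilon_productions(P, nullable={"A", "C", "D"})
```

For this assumed input the result is A → CD | C | D, B → Cb | b, C → a, D → bD | b.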
S → S + T | T
T → T ∗ F | F
F → (S) | a.
S → S + T | T ∗ F | (S) | a,
T → T ∗ F | (S) | a,
F → (S) | a.
Lemma
Let G be a CFG such that L(G) ≠ ∅. Let G′ be the context-free
grammar as constructed above from the context-free grammar G by
eliminating the useless productions. Then L(G′ ) = L(G).
Exercise
What happens if we first eliminate the non-reachable symbols and then
non-generating symbols? Will the resulting CFG be without useless
symbols?
[Hint: Carry out the steps on the CFG with productions S → a, A → b.]
Elimination of useless symbols is not only a matter of elegance;
it also saves time when converting a CFG to its normal
forms.
The normal forms discussed in the next section not only
provide convenient progress indicators in a derivation,
but also help in proving certain results in a relatively shorter
way.
A → BC, A → a
A → a, A → α, α ∈ (V ∪ Σ)∗ , |α| ≥ 2
So consider, A → α:
For every a ∈ Σ appearing in α, introduce a new variable Xa and a
production Xa → a.
Replace a by Xa in every production in which it appears (except those of the form
A → a).
For example: A → aAb and B → ab
Replacing: A → Xa AXb , B → Xa Xb , and adding the productions
Xa → a and Xb → b.
Since the only production for Xa is Xa → a, the resulting grammar G2 is equivalent to G1 .
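The terminal-replacement step can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that the generated names "X" + a (for example Xa, Xb) do not clash with existing variables.

```python
def lift_terminals(productions, terminals):
    """Replace terminals by new variables in right sides of length >= 2.

    productions: iterable of (head, body) pairs, body a tuple of symbols.
    Productions of the form A -> a are left unchanged.
    """
    new_productions = set()
    for head, body in productions:
        if len(body) >= 2:
            for sym in body:
                if sym in terminals:
                    new_productions.add(("X" + sym, (sym,)))   # X_a -> a
            body = tuple("X" + sym if sym in terminals else sym
                         for sym in body)
        new_productions.add((head, body))
    return new_productions


# The example from the text: A -> aAb and B -> ab.
lifted = lift_terminals(
    [("A", ("a", "A", "b")), ("B", ("a", "b"))],
    terminals={"a", "b"},
)
```

After this step every right side is either a single terminal or a string of variables, which is the shape the next step expects.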
Consider, A → BCDBCE
would be replaced by
A → BY1
Y1 → CY2
Y2 → DY3
Y3 → BY4
Y4 → CE
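The splitting of a long right side into a chain of binary productions can be sketched as follows. This is an illustrative Python sketch: the function name binarize and the Y1, Y2, ... naming scheme for fresh variables are assumptions, and the right side is assumed to consist of variables only.

```python
from itertools import count


def binarize(head, body, fresh):
    """Split head -> body into productions with right sides of length 2.

    body: tuple of variables; fresh: iterator yielding unused variable names.
    """
    result = []
    while len(body) > 2:
        new_var = next(fresh)
        result.append((head, (body[0], new_var)))   # head -> B Y_i
        head, body = new_var, body[1:]              # continue with Y_i -> rest
    result.append((head, body))
    return result


fresh_names = (f"Y{i}" for i in count(1))
chain = binarize("A", ("B", "C", "D", "B", "C", "E"), fresh_names)
```

Sharing one `fresh` iterator across all calls keeps the new variables distinct from production to production.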
Theorem
For any CFG G = (V , Σ, S, P), there is a CFG G′ = (V ′ , Σ, S, P ′ ) in
Chomsky Normal Form so that L(G′ ) = L(G) \ {ϵ}.
Example
Consider
A → ax,
Example
The grammar S → aS | bSS | c is an s-grammar.
The grammar S → aS | bSS | aSS | c is not an s-grammar because
the pair (S, a) occurs in the two productions S → aS and S → aSS.
Definition
A context-free grammar is said to be in Greibach normal form if all
productions have the form A → ax, where a ∈ T and x ∈ V ∗ .
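Both conditions can be checked mechanically. The Python sketch below tests the Greibach form A → ax and, on top of it, the s-grammar condition illustrated in the example above. It assumes the usual definition of an s-grammar: every production has the form A → ax and each pair (A, a) occurs in at most one production. The function names are hypothetical, and any symbol that is not a variable is treated as a terminal.

```python
def is_gnf(productions, variables):
    """True if every production is A -> a x with a a terminal and x in V*."""
    return all(
        len(body) >= 1
        and body[0] not in variables
        and all(sym in variables for sym in body[1:])
        for _, body in productions
    )


def is_s_grammar(productions, variables):
    """True if the grammar is in the form above and no pair (A, a) repeats."""
    pairs = [(head, body[0]) for head, body in productions if body]
    return is_gnf(productions, variables) and len(pairs) == len(set(pairs))


# S -> aS | bSS | c
good = [("S", ("a", "S")), ("S", ("b", "S", "S")), ("S", ("c",))]
# S -> aS | bSS | aSS | c: the pair (S, a) occurs twice
bad = good + [("S", ("a", "S", "S"))]

good_is_s = is_s_grammar(good, {"S"})
bad_is_s = is_s_grammar(bad, {"S"})
```

The second grammar still passes `is_gnf`; only the repeated pair (S, a) stops it from being an s-grammar.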