Module 2
Processing carried out at the word level, including methods for characterizing word sequences, identifying morphological variants, detecting and correcting misspelled words, and identifying the correct part of speech of a word.
Some simple regular expressions (the first instance of each match is shown in Table 1):
RE      Example text
/book/  Reporters, who do not read the stylebook, should not criticize their editors.
/face/  Not everything that is faced can be changed. But nothing can be changed until it is faced.
/a/     Reason, Observation, and Experience - the Holy Trinity of Science.
Characters are grouped by square brackets, matching one character from the class. For
example, /[abcd]/ matches a, b, c, or d, and /[0123456789]/ matches any digit. A dash specifies a
range, like /[5-9]/ or /[m-p]/. The caret at the start of a class negates the match, as in /[^x]/, which
matches any character except x. The caret is interpreted literally elsewhere.
• Regular expressions are case-sensitive (e.g., /s/ matches 's', not 'S').
• Use square brackets to handle case differences, like /[sS]/.
o /[sS]ana/ matches 'sana' or 'Sana'.
• The question mark (?) makes the previous character optional (e.g., /supernovas?/).
• The * allows zero or more occurrences (e.g., /b*/).
• /[ab]*/ matches zero or more occurrences of 'a' or 'b'.
• The + specifies one or more occurrences (e.g., /a+/).
• /[0-9]+/ matches a sequence of one or more digits.
• The caret (^) anchors the match at the start, and $ at the end of a line.
o /^The nature\.$/ matches a line consisting of exactly the text 'The nature.'
• The dot (.) is a wildcard matching any single character (e.g., /./).
o The expression /.at/ matches any of the strings cat, bat, rat, gat, kat, mat, etc.
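A minimal Python sketch of these operators using the standard re module; the example strings are made up for illustration.

import re

# ? makes the previous character optional
print(re.search(r"supernovas?", "a supernova was seen"))     # matches "supernova"
# [sS] handles case differences
print(re.findall(r"[sS]ana", "Sana met sana"))                # ['Sana', 'sana']
# + requires one or more digits
print(re.findall(r"[0-9]+", "room 42, floor 7"))              # ['42', '7']
# ^ and $ anchor the match to the start and end of the string
print(bool(re.match(r"^The nature\.$", "The nature.")))       # True
# . matches any single character
print(re.findall(r".at", "the cat sat on a mat"))             # ['cat', 'sat', 'mat']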
Special characters
RE   Description
.    The dot matches any single character.
\n   Matches a newline character (or CR+LF combination).
\t   Matches a tab (ASCII 9).
\d   Matches a digit [0-9].
\D   Matches a non-digit.
\w   Matches an alphanumeric character.
\W   Matches a non-alphanumeric character.
\s   Matches a whitespace character.
\S   Matches a non-whitespace character.
\    Use \ to escape special characters. For example, \. matches a dot, \* matches a *, and \\ matches a backslash.
• The wildcard symbol can count characters, e.g., /.....berry/ matches ten-letter strings
ending in "berry".
• This matches "strawberry", "sugarberry", but not "blueberry" or "hackberry".
• To search for "Tanveer" or "Siddiqui", use the disjunction operator (|), e.g.,
"Tanveer|Siddiqui".
• The pipe symbol matches either of the two patterns.
• Sequences take precedence over disjunction, so parentheses are needed to group patterns.
• Enclosing patterns in parentheses allows disjunction to apply correctly.
Pattern                       Description
^[A-Za-z0-9_\.-]+             Match one or more acceptable characters at the start of the string.
@                             Match the @ sign.
[A-Za-z0-9_\.-]+              Match the domain name, including dots.
[A-Za-z0-9_][A-Za-z0-9_]$     Match two acceptable characters (but not a dot) at the end. This ensures that the email address ends with .xx, .xxx, .xxxx, etc.
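A small Python sketch assembling the four components above into one pattern and testing it; the sample addresses are hypothetical.

import re

# Assembled from the four components in the table above
email_re = re.compile(r"^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$")

for addr in ["first.last@example.com", "no-at-sign.example.com", "user@site."]:
    print(addr, "->", bool(email_re.match(addr)))
# first.last@example.com -> True; the other two -> False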
3. Finite-State Automata
• Game Description: The game involves a board with pieces, dice or a wheel to generate random
numbers, and players rearranging pieces based on the number. There’s no skill or choice; the game
is entirely based on random numbers.
• States: The game progresses through various states, starting from the initial state (beginning
positions of pieces) to the final state (winning positions).
• Machine Analogy: A machine with input, memory, processor, and output follows a similar
process: it starts in an initial state, changes to the next state based on the input, and eventually
reaches a final state or gets stuck if the next state is undefined.
• Finite Automaton: This model, with finite states and input symbols, describes a machine that
automatically changes states based on the input, and it’s deterministic, meaning the next state is
fully determined by the current state and input.
Let Σ = {a, b, c} and let the set of states be {q0, q1, q2, q3, q4}, with q0 the start state and q4 the final state. We have the following transition rules:
1. From state q0 and with input a, go to state q1.
2. From state q1 and with input b, go to state q2.
3. From state q1 and with input c, go to state q3.
4. From state q2 and with input b, go to state q4.
5. From state q3 and with input b, go to state q4.
• The nodes in this diagram correspond to the states, and the arcs to transitions.
• A deterministic finite-state automaton (DFA) is defined as a 5-tuple (Q, Σ, δ, S, F), where Q is a set of states, Σ is an alphabet, S is the start state, F ⊆ Q is a set of final states, and δ is a transition function.
• The transition function δ defines a mapping from Q × Σ to Q. That is, for each state q and symbol a, there is at most one transition possible.
Non-Deterministic Automata:
• For each state, there can be more than one transition on a given symbol, each leading to a different
state.
• This is shown in Figure, where there are two possible transitions from state q0 on input symbol a.
• The transition function of a non-deterministic finite-state automaton (NFA) maps Q × (Σ ∪ {ε}) to the power set of Q, so a state-input pair may lead to a set of possible next states.
Example:
1. Consider the deterministic automaton described in the above example and the input "ac".
• We start in state q0 with input symbol a and go to state q1.
• The next input symbol is c, so we go to state q3.
• No more input is left and we have not reached the final state.
• Hence, the string ac is not recognized by the automaton.
2. Now, consider the input "acb".
• We start in state q0 and, on input a, go to state q1.
• The next input symbol is c, so we go to state q3.
• The next input symbol is b, which leads to state q4.
• No more input is left and we have reached the final state.
• The string acb is a word of the language defined by the automaton.
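A minimal Python sketch of this automaton, storing the five transition rules in a dictionary and replaying the two examples.

# Transition rules of the example automaton: (state, symbol) -> next state
delta = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q1", "c"): "q3",
    ("q2", "b"): "q4",
    ("q3", "b"): "q4",
}
START, FINAL = "q0", {"q4"}

def accepts(word):
    state = START
    for symbol in word:
        state = delta.get((state, symbol))
        if state is None:          # missing transition: the automaton gets stuck
            return False
    return state in FINAL

print(accepts("ac"))    # False: ends in q3, which is not final
print(accepts("acb"))   # True:  q0 -a-> q1 -c-> q3 -b-> q4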
State-transition table
• The rows in this table represent states and the columns correspond to input.
• The entries in the table represent the transition corresponding to a given state-input pair.
• A ɸ entry indicates a missing transition.
• This table contains all the information needed by the FSA.
Input
State a b c
Start: q0 q1 ɸ ɸ
q1 ɸ q2 q3
q2 ɸ q4 ɸ
q3 ɸ q4 ɸ
Final: q4 ɸ ɸ ɸ
Deterministic finite-state automaton (DFA) and its state-transition table
Example:
• Consider a language consisting of all strings containing only a’s and b’s and ending with baa.
• We can specify this language by the regular expression /(a|b)*baa$/.
• The NFA implementing this regular expression and its state-transition table are shown below.
Input
State      a      b
Start: q0  {q0}   {q0, q1}
q1         {q2}   ɸ
q2         {q3}   ɸ
Final: q3  ɸ      ɸ
NFA for /(a|b)*baa$/ and its state-transition table
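A sketch of how such an NFA can be simulated in Python by tracking the set of states reachable after each input symbol, assuming the transitions in the table above.

# NFA for /(a|b)*baa/: (state, symbol) -> set of possible next states
delta = {
    ("q0", "a"): {"q0"},
    ("q0", "b"): {"q0", "q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q3"},
}
START, FINAL = "q0", {"q3"}

def accepts(word):
    states = {START}
    for symbol in word:
        # union of all transitions available from the current set of states
        states = set().union(*(delta.get((s, symbol), set()) for s in states))
    return bool(states & FINAL)

print(accepts("abbaa"))   # True: the string ends with baa
print(accepts("abab"))    # False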
4. Morphological Parsing
• It is a sub-discipline of linguistics
• It studies word structure and the formation of words from smaller units (morphemes).
• The goal of morphological parsing is to discover the morphemes that build a given word.
• A morphological parser should be able to tell us that the word 'eggs' is the plural form of the noun
stem 'egg'.
Example:
The word 'bread' consists of a single morpheme.
'eggs' consists of two morphemes: egg and -s.
4.1 Two Broad classes of Morphemes:
1. Stems – Main morpheme, contains the central meaning.
2. Affixes – modify the meaning given by the stem.
o Affixes are divided into prefix, suffix, infix, and circumfix.
1. Prefix - morphemes which appear before a stem. (un-happy, be-waqt)
Compounding: The process of merging two or more words to form a new word.
Morphological analysis and generation deal with the inflection, derivation, and compounding processes in word formation and are essential to many NLP applications:
1. Applications ranging from spelling correction to machine translation.
2. In information retrieval, to identify the presence of a query word in a document in spite of its different morphological variants.
4.3 Morphological parsing:
It converts inflected words into their canonical form (lemma) along with syntactic and morphological tags (e.g., tense, gender, number).
Morphological generation reverses this process, and both parsing and generation rely on a dictionary
of valid lemmas and inflection paradigms for correct word forms.
A morphological parser uses following information sources:
1. Lexicon: A lexicon lists stems and affixes together with basic information about them.
2. Morphotactics: The ordering among the morphemes that constitute a word; it describes the way morphemes are arranged or attach to each other. E.g., rest-less-ness is a valid word, but rest-ness-less is not.
3. Orthographic rules: Spelling rules that specify the changes that occur when two given morphemes combine. E.g., 'easy' becomes 'easier' and not 'easyer' (the y → ier spelling rule).
Morphological analysis can be avoided if an exhaustive lexicon is available that lists features for all the
word-forms of all the roots.
4.4 Stemmers:
• The simplest morphological systems
• Collapse morphological variations of a given word (word-forms) to one lemma or stem.
• Stemmers do not use a lexicon; instead, they make use of rewrite rules of the form:
o ier → y (e.g., earlier → early)
o ing → ε (e.g., playing → play)
• Two widely used stemming algorithms have been developed by Lovins (1968) and Porter (1980).
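A toy suffix-stripping stemmer in the spirit of such rewrite rules (not the actual Lovins or Porter algorithm); the rule list is a small illustrative assumption.

# Illustrative rewrite rules: suffix -> replacement (applied in order, first match wins)
RULES = [("ies", "y"), ("ier", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in RULES:
        # require a reasonably long remaining stem before stripping
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

for w in ["playing", "earlier", "ponies", "played", "eggs"]:
    print(w, "->", stem(w))
# playing -> play, earlier -> early, ponies -> pony, played -> play, eggs -> egg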
For example, the Porter stemmer transforms a word such as 'rotational' into 'rotate'.
Limitations:
• It is difficult to use stemming with morphologically rich languages.
• E.g. Transformation of the word 'organization' into 'organ'
• It reduces only suffixes and prefixes; compound words are not reduced. E.g., 'toothbrush' or 'snowball' cannot be broken into their components.
A more efficient two-level morphological model – Koskenniemi (1983)
• Morphological parsing is viewed as a mapping from the surface level into morpheme and feature
sequences on the lexical level.
• The surface level represents the actual spelling of the word.
• The lexical level represents the concatenation of its constituent morphemes.
E.g., 'playing' is represented at the two levels as follows:
Lexical level:  p l a y +V +PP   (the stem 'play' followed by the morphological information +V +PP)
Surface level:  p l a y i n g
Similarly, 'books' is represented in the lexical form as 'book + N + PL'.
This model is usually implemented with a kind of finite-state automaton called a finite-state transducer (FST).
Finite-state transducer (FST)
• FST maps an input word to its morphological components (root, affixes, etc.) and can also
generate the possible forms of a word based on its root and morphological rules.
• An FST can be thought of as a two-tape automaton, which recognizes or generates a pair of strings.
E.g. Walking
Analysis (Decomposition):
The analyzer uses a transducer that:
• Identifies the base form ("walk") from the surface form ("walking").
• Recognizes the suffix ("-ing") and removes it.
Generation (Synthesis):
The generator uses another transducer that:
• Identifies the base form ("walk") and applies the appropriate suffix to generate different surface
forms, like "walked" or "walking".
A finite-state transducer is a 6-tuple (Σ1, Σ2, Q, δ, S, F), where Q is the set of states, S is the initial state, F ⊆ Q is the set of final states, Σ1 is the input alphabet, Σ2 is the output alphabet, and δ is a function mapping Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) to the power set of Q:
δ: Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) → 2^Q
Thus, an FST is similar to an NFA except that transitions are made on strings rather than on symbols
and, in addition, they have outputs. FSTs encode regular relations between regular languages, with the
upper language on the top and the lower language on the bottom. For a transducer T and string s, T(s)
represents the set of strings in the relation. FSTs are closed under union, concatenation, composition, and
Kleene closure, but not under intersection or complementation.
Two-level morphology using FSTs involves analyzing surface forms in two steps.
Step1: Words are split into morphemes, considering spelling rules and possible splits (e.g., "boxe + s" or
"box + s").
Step2: The output is a concatenation of stems and affixes, with multiple representations possible for each
word.
We need to build two transducers: one that maps the surface form to the intermediate form and another
that maps the intermediate form to the lexical form.
A transducer maps the surface form "lesser" to its lexical form (the base 'less' plus the comparative marker), where ε represents the empty string.
This bi-directional FST can be used for both analysis (surface to base) and generation (base to surface).
E.g. Lesser
• The plural form of regular nouns usually ends with -s or -es (though a word ending in -s is not necessarily a plural form, e.g., class, miss, bus).
• One of the required transformations is the deletion of the 'e' when introducing a morpheme boundary.
o E.g., boxes. This deletion is usually required for words ending in xes, ses, zes.
• This is done by the transducer below, which maps English nouns to the intermediate form:
Bird+s
Box+e+s
Quiz+e+s
• The next step is to develop a transducer that does the mapping from the intermediate level to the lexical level. The input to this transducer has one of the following forms:
• Regular noun stem, e.g., bird, cat
• Regular noun stem + s, e.g., bird + s
• Singular irregular noun stem, e.g., goose
• Plural irregular noun stem, e.g., geese
• In the first case, the transducer has to map all symbols of the stem to themselves and then output N and sg.
• In the second case, it has to map all symbols of the stem to themselves, then output N and replace the s with PL.
• In the third case, it has to do the same as in the first case.
• Finally, in the fourth case, the transducer has to map the irregular plural noun stem to the corresponding singular stem (e.g., geese to goose) and then add N and PL.
The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer encoding a lexicon.
The transducer implementing the lexicon maps the individual regular and irregular noun stems to their
correct noun stem, replacing labels like regular noun form, etc.
This lexicon maps the surface form geese, which is an irregular noun, to its correct stem goose in the
following way:
g:g e:o e:o s:s e:e
Mapping for the regular surface form of bird is b:b i:i r:r d:d. Representing pairs like a:a with a single
letter, these two representations are reduced to g e:o e:o s e and b i r d respectively.
Composing this transducer with the previous one, we get a single two-level transducer with one input tape
and one output tape. This maps plural nouns into the stem plus the morphological marker + pl and singular
nouns into the stem plus the morpheme + sg. Thus a surface word form birds will be mapped to bird + N
+ pl as follows.
b:b i:i r:r d:d ε:+N s:+pl
Each letter maps to itself, while ε maps to the morphological feature +N, and s maps to the morphological feature +pl. The figure shows the resulting composed transducer.
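A small sketch of the symbol-pair view of this mapping in Python; the pair list for 'birds' is an illustrative assumption, not the full composed transducer.

# surface:lexical symbol pairs for "birds" -> "bird +N +pl"
pairs = [("b", "b"), ("i", "i"), ("r", "r"), ("d", "d"), ("", "+N"), ("s", "+pl")]

surface = "".join(s for s, _ in pairs)    # read off the surface (lower) tape
lexical = " ".join(l for _, l in pairs)   # read off the lexical (upper) tape
print(surface)   # birds
print(lexical)   # b i r d +N +pl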
Typing mistakes: single character omission, insertion, substitution, and reversal are the most common
typing mistakes.
OCR errors: Usually grouped into five classes: substitution, multi-substitution (or framing), space deletion, space insertion, and failures.
Substitution errors: Caused by visual similarity (single character), such as c→e, 1→l, r→n. The same is true for multi-substitution (two or more characters), e.g., m→rn.
Failure occurs when the OCR algorithm fails to select a letter with sufficient accuracy.
Solution: These errors can be corrected using 'context' or by using linguistic structures.
Phonetic errors:
• Non-word error:
o Word that does not appear in a given lexicon or is not a valid orthographic word form.
o The two main techniques to find non-word errors: n-gram analysis and dictionary lookup.
• Real-word error:
o It occurs due to typographical mistakes or spelling errors.
o E.g. piece for peace or meat for meet.
o May cause local syntactic errors, global syntactic errors, semantic errors, or errors at
discourse or pragmatic levels.
o Impossible to decide that a word is wrong without some contextual information
Spelling correction: consists of detecting and correcting errors. Error detection is the process of finding
misspelled words and error correction is the process of suggesting correct words to a misspelled one.
These sub-problems are addressed in two ways:
Isolated-error detection and correction: Each word is checked separately, independent of its context.
Context-dependent error detection and correction methods: Utilize the context of a word. This requires grammatical analysis and is thus more complex and language dependent. The list of candidate words must first be obtained using an isolated-word method before making a selection depending on the context.
Minimum edit distance The minimum edit distance between two strings is the minimum number of
operations (insertions, deletions, or substitutions) required to transform one string into another.
Similarity key techniques The basic idea in a similarity key technique is to change a given string into a
key such that similar strings will change into the same key.
n-gram based techniques n-gram techniques usually require a large corpus or dictionary as training data,
so that an n-gram table of possible combinations of letters can be compiled. In case of real-word error
detection, we calculate the likelihood of one character following another and use this information to find
possible correct word candidates.
Neural nets These have the ability to do associative recall based on incomplete and noisy data. They can
be trained to adapt to specific spelling error patterns. Note: They are computationally expensive.
Rule-based techniques In a rule-based technique, a set of rules (heuristics) derived from knowledge of a
common spelling error pattern is used to transform misspelled words into valid words.
The minimum edit distance is the number of insertions, deletions, and substitutions required to change
one string into another.
For example, the minimum edit distance between 'tutor' and 'tumour' is 2: We substitute 'm' for 't' and
insert 'u' before 'r'.
Edit distance can be viewed as a string alignment problem. By aligning two strings, we can measure the
degree to which they match. There may be more than one possible alignment between two strings.
Alignment 1:
t u t o - r
t u m o u r
The best possible alignment corresponds to the minimum edit distance between the strings. The alignment
shown here, between tutor and tumour, has a distance of 2.
A dash in the upper string indicates insertion. A substitution occurs when the two alignment symbols do
not match (shown in bold).
The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation; therefore, the distance here is 2.
Alignment 2:
Another possible alignment for these sequences is
t u t - o - r
t u - m o u r
• This matrix has one row for each symbol in the source string and one column for each symbol in the target string.
• The (i, j)th cell in this matrix represents the distance between the first i characters of the source and the first j characters of the target string.
• Each cell can be computed as a simple function of its surrounding cells. Thus, by starting at the
beginning of the matrix, it is possible to fill each entry iteratively.
• The value in each cell is computed in terms of three possible paths.
• The substitution cost will be 0 if the ith character in the source matches the jth character in the target.
• The minimum edit distance algorithm is shown below.
• How the algorithm computes the minimum edit distance between tutor and tumour is shown in
table.
Input: Two strings, X and Y
Output: The minimum edit distance between X and Y

m ← length(X)
n ← length(Y)
for i = 0 to m do dist[i, 0] ← i
for j = 0 to n do dist[0, j] ← j
for i = 1 to m do
    for j = 1 to n do
        dist[i, j] = min{ dist[i-1, j] + insert_cost,
                          dist[i-1, j-1] + subst_cost(Xi, Yj),
                          dist[i, j-1] + delete_cost }
# t u m o u r
# 0 1 2 3 4 5 6
t 1 0 1 2 3 4 5
u 2 1 0 1 2 3 4
t 3 2 1 1 2 3 4
o 4 3 2 2 1 2 3
r 5 4 3 3 2 2 2
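A runnable Python version of the algorithm above with unit costs; it reproduces the value 2 for tutor and tumour shown in the table.

def min_edit_distance(source, target):
    m, n = len(source), len(target)
    # dist[i][j] = distance between the first i chars of source and first j chars of target
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + subst)  # substitution (0 if match)
    return dist[m][n]

print(min_edit_distance("tutor", "tumour"))   # 2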
Minimum edit distance algorithms are also useful for determining accuracy in speech recognition systems.
Table shows some of the word classes in English. Lexical categories and their properties vary from
language to language.
Word classes are further categorized as open and closed word classes.
• Open word classes constantly acquire new members while closed word classes do not (or only
infrequently do so).
• Nouns, verbs (except auxiliary verbs), adjectives, adverbs, and interjections are open word
classes.
e.g. computer, happiness, dog, run, think, discover, beautiful, large, happy, quickly, very, easily
• Prepositions, auxiliary verbs, determiners, conjunctions, and interjections are closed word classes.
e.g. in, on, under, between, he, she, it, they, the, a, some, this, and, but, or, because, oh, wow, ouch
3. Part-of-Speech Tagging
• The process of assigning a part-of-speech (such as a noun, verb, pronoun, preposition, adverb,
and adjective), to each word in a sentence.
• Input to a tagging algorithm: Sequence of words of a natural language sentence and specified tag
sets.
• Output: single best part-of-speech tag for each word.
• Many words may belong to more than one lexical category:
o I am reading a good book. (book: noun)
o The police booked the snatcher. (booked: verb)
o 'sona' may mean 'gold' (noun) or 'sleep' (verb).
Tag set:
Consider the following tag sets:
• Penn Treebank tag set contains 45 tags & C7 uses 164
• TOSCA-ICE for the International Corpus of English with 270 tags (Garside 1997).
• TESS with 200 tags.
• For English, which is not morphologically rich, the C7 tagset is too big and yields too many mistagged words.
Figures: Tags from the Penn Treebank tag set; possible tags for the word 'to eat'.
Rule-based taggers use hand-coded rules to assign tags to words. These rules use a lexicon to obtain a
list of candidate tags and then use rules to discard incorrect tags.
Hybrid taggers combine features of both these approaches. Like rule- based systems, they use rules to
specify tags. Like stochastic systems, they use machine-learning to induce rules from a tagged training
corpus automatically. E.g. Brill tagger.
• A two-stage architecture.
• The first stage: A dictionary look-up procedure, which returns a set of potential tags (parts-of-
speech) and appropriate syntactic features for each word.
• The second stage: A set of hand-coded rules to discard contextually illegitimate tags to get a
single part-of-speech for each word.
IF word ends in -ing and preceding word is a verb THEN label it a verb (VB).
Rule-based taggers use capitalization to identify unknown nouns and typically require supervised training.
Rules can be induced by running untagged text through a tagger, manually correcting it, and feeding it
back for learning.
TAGGIT (1971) tagged 77% of the Brown corpus using 3,300 rules. ENGTWOL (1995) is another rule-
based tagger known for speed and determinism.
While rule-based systems are fast and deterministic, they require significant effort to write rules and need
a complete rewrite for other languages. Stochastic taggers are more flexible, adapting to new languages
with minimal changes and retraining. Thus, rule-based systems are precise but labor-intensive, while
stochastic systems are more adaptable but probabilistic.
The unigram model requires a tagged training corpus to gather statistics for tagging data. It assigns tags
based solely on the word itself. For example, the tag JJ (Adjective) is frequently assigned to "fast"
because it is more commonly used as an adjective than as a noun, verb, or adverb. However, this can lead
to incorrect tagging, as seen in the following examples:
3. Those who were injured in the accident need to be helped fast — Here, "fast" is an adverb.
In these cases, a more accurate prediction could be made by considering additional context. A bi-gram
tagger improves accuracy by incorporating both the current word and the tag of the previous word. For
instance, in sentence (1), the sequence "DT NN" (determiner, noun) is more likely than "DT JJ"
(determiner, adjective), so the bi-gram tagger would correctly tag "fast" as a noun. Similarly, in sentence
(3), a verb is more likely to be followed by an adverb, so the bi-gram tagger assigns "fast" the tag RB
(adverb).
In general, n-gram models consider both the current word and the tags of the previous n-1 words. A tri-
gram model, for example, uses the previous two tags, providing even richer context for more accurate
tagging. The context considered by a tri-gram model is shown in Figure, where the shaded area represents
How the HMM tagger assigns the most likely tag sequence to a given sentence:
We refer to this model as a Hidden Markov Model (HMM) because it has two layers of states: the observed words and the hidden tags.
While tagging input data, we can observe the words, but the tags (states) are hidden. The states are visible during training but not during the tagging process.
As mentioned earlier, the HMM uses lexical and bi-gram probabilities estimated from a tagged training
corpus to compute the most likely tag sequence for a given sentence. One way to store these probabilities
is by constructing a probability matrix. This matrix includes:
• The n-gram analysis (for example, in a bi-gram model, the probability that a word of class X
follows a word of class Y).
During tagging, this matrix is used to guide the HMM tagger in predicting the tags for an unknown
sentence. The goal is to determine the most probable tag sequence for a given sequence of words.
T' = argmax_T P(T|W)
Applying Bayes' rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) P(T) / P(W)
Since P(W) remains the same for each tag sequence, we can drop it. The expression for the most likely tag sequence becomes:
T' = argmax_T P(W|T) P(T)
Example: Consider the sentence "The bird can fly" and the tag sequence DT NNP MD VB.
Using the bi-gram approximation, the probability can be computed as:
P(DT) × P(NNP|DT) × P(MD|NNP) × P(VB|MD) × P(the|DT) × P(bird|NNP) × P(can|MD) × P(fly|VB)
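A small Python sketch of this computation; the probability values below are made-up placeholders used only to show how the product is formed.

# Hypothetical transition (tag-bigram) and lexical probabilities
trans = {("<s>", "DT"): 0.4, ("DT", "NNP"): 0.3, ("NNP", "MD"): 0.2, ("MD", "VB"): 0.6}
lex   = {("the", "DT"): 0.5, ("bird", "NNP"): 0.01, ("can", "MD"): 0.3, ("fly", "VB"): 0.1}

words = ["the", "bird", "can", "fly"]
tags  = ["DT", "NNP", "MD", "VB"]

p = 1.0
prev = "<s>"   # sentence-start marker; P(DT) is approximated by P(DT | <s>)
for word, tag in zip(words, tags):
    p *= trans[(prev, tag)] * lex[(word, tag)]
    prev = tag
print(p)   # P(T) * P(W|T) under the bi-gram approximation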
• These approaches use rules to assign tags to words, while leveraging machine learning techniques to automatically generate the rules from data.
Figure illustrates the TBL process, which is a supervised learning technique. The algorithm starts by
assigning the most likely tag to each word using a lexicon. Transformation rules are then applied
iteratively, with the rule that improves tagging accuracy most being selected each time. The process
continues until no significant improvements are made.
The output is a ranked list of transformations, which are applied to new text by first assigning the most
frequent tag and then applying the transformations.
Change NN to VB if the previous tag is TO. As the contextual condition is satisfied, this rule will change fish/NN to fish/VB:
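A minimal Python sketch of applying one such Brill-style transformation to a tagged sequence; the tagged sentence is a made-up example.

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag whenever the previous token's tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("He", "PRP"), ("likes", "VBZ"), ("to", "TO"), ("fish", "NN")]
print(apply_rule(sentence, "NN", "VB", "TO"))
# [('He', 'PRP'), ('likes', 'VBZ'), ('to', 'TO'), ('fish', 'VB')]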
Most part-of-speech tagging research focuses on English and European languages, but the lack of
annotated corpora limits progress for other languages, including Indian languages. Some systems, like
Bengali (Sandipan et al., 2004) and Hindi (Smriti et al., 2006), combine morphological analysis with
tagged corpora.
Tagging Urdu is more complex due to its right-to-left script and grammar influenced by Arabic and
Persian. Before Hardie (2003), little work was done on Urdu tag sets, with his research part of the
EMILLE project for South Asian languages.
Unknown words, which do not appear in a dictionary or training corpus, pose challenges during tagging.
Solutions include:
• Assigning the most frequent tag from the training corpus or initializing unknown words with an
open class tag and disambiguating them using tag probabilities.
• Another approach involves using morphological information, such as affixes, to predict the tag
based on common suffixes or prefixes in the training data, similar to Brill's tagger.
Syntactic Analysis
1. Introduction:
• Syntactic parsing deals with the syntactic structure of a sentence.
• 'Syntax' refers to the grammatical arrangement of words in a sentence and their relationship with
each other.
• The objective of syntactic analysis is to find the syntactic structure of the sentence.
• This structure is usually depicted as a tree, as shown in Figure.
o Nodes in the tree represent the phrases and leaves correspond to the words.
o The root of the tree is the whole sentence.
• Identifying the syntactic structure is useful in determining the
meaning of the sentence.
• Syntactic parsing can be considered as the process of assigning
'phrase markers' to a sentence.
• Two important ideas in natural language are those of constituency
and word order.
o Constituency is about how words are grouped together.
o Word order is about how, within a constituent, words are
ordered and also how constituents are ordered with respect
to one another.
• A widely used mathematical system for modelling constituent structure in natural language is
context-free grammar (CFG) also known as phrase structure grammar.
2. Context-free Grammar:
• Context-free grammar (CFG) was first defined for natural language by Chomsky (1957).
• Consists of four components:
1. A set of non-terminal symbols, N
2. A set of terminal symbols, T
3. A designated start symbol, S, that is one of the symbols from N.
4. A set of productions, P, of the form: A → α
o where A ∈ N and α is a string consisting of terminal and non-terminal symbols.
o The rule A → α says that constituent A can be rewritten as α. This is also called the phrase
structure rule. It specifies which elements (or constituents) can occur in a phrase and in
what order.
o For example, the rule S → NP VP states that S consists of NP followed by VP, i.e., a
sentence consists of a noun phrase followed by a verb phrase.
CFG as a generator:
• The sentence 'Hena reads a book' can be derived from S. The representation of this derivation is shown in the figure.
• Sometimes, a more compact bracketed notation is used to represent a parse tree. The parse tree in the figure can be represented using this notation as follows:
[S [NP [N Hena]] [VP [V reads] [NP [Det a] [N book]]]]
3. Constituency:
• Words in a sentence are not tied together merely as a sequence of parts of speech.
• Language puts constraints on word order.
• Words group together to form constituents (often termed phrases), each of which acts as a single
unit. They combine with other constituents to form larger constituents, and eventually, a sentence.
• Constituents combine with others to form a sentence constituent.
• For example: the noun phrase, The bird, can combine with the verb phrase, flies, to form the
sentence, The bird flies.
• Different types of phrases have different internal structures.
3.1 Phrase Level Constructions
Noun Phrase, Verb phrase, Prepositional Phrase, Adjective Phrase, Adverb Phrase
Noun Phrase:
• A noun phrase is a phrase whose head is a noun or a pronoun, optionally accompanied by a set of
modifiers. It can function as subject, object, or complement.
• The modifiers of a noun phrase can be determiners or adjective phrases.
• Phrase structure rules are of the form: A → B C
NP → Pronoun
NP → Det Noun
NP → Noun
NP → Adj Noun
NP → Det Adj Noun
• We can combine all these rules into a single phrase structure rule as follows:
NP → (Det) (Adj) Noun | Pronoun
• A noun phrase may include post-modifiers and more than one adjective. NP → (Det) (AP) Noun
(PP)
A few examples of noun phrases:
They                  (Pronoun)
The foggy morning     (Det Adj Noun)
Example:
The foggy damp weather disturbed the match.      (noun phrase acts as subject)
I would like a nice cold banana shake.           (noun phrase acts as object)
Kula botanical garden is a beautiful location.   (noun phrase acts as predicate)
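A small Python sketch that extracts noun phrases matching the rule NP → (Det) (Adj) Noun from a POS-tagged sentence; the tag names and the tagged example are assumptions.

def noun_phrases(tagged):
    """Collect maximal (Det)? (Adj)* Noun+ sequences from a (word, tag) list."""
    phrases, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "Det":
            j += 1
        while j < len(tagged) and tagged[j][1] == "Adj":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "Noun":
            k += 1
        if k > j:                                   # at least one noun: it is an NP
            phrases.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases

tagged = [("The", "Det"), ("foggy", "Adj"), ("morning", "Noun"),
          ("disturbed", "Verb"), ("the", "Det"), ("match", "Noun")]
print(noun_phrases(tagged))   # ['The foggy morning', 'the match']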
Verb Phrase:
• Headed by a verb
• The verb phrase organizes various elements of the sentence that depend syntactically on the verb.
Examples of verb phrases:
Khushbu slept. VP → Verb
The boy kicked the ball VP →Verb NP
Khushbu slept in the garden. VP → Verb PP
The boy gave the girl a book. VP → Verb NP NP
The boy gave the girl a book with a blue cover. VP → Verb NP NP PP
In general, the number of NPs in a VP is limited to two, whereas it is possible to add more than two
PPs. VP → Verb (NP) (NP) (PP)*
Things are further complicated by the fact that objects may also be entire clauses as in the sentence, I
know that Taj is one of the seven wonders. Hence, we must also allow for an alternative phrase structure rule, in which NP is replaced by S.
VP → Verb S
Prepositional Phrase:
Prepositional phrases are headed by a preposition. They consist of a preposition, possibly followed by
some other constituent, usually a noun phrase.
PP → Prep (NP)
Adjective Phrase:
The head of an adjective phrase (AP) is an adjective. APs consist of an adjective, which may be preceded
by an adverb and followed by a PP.
The four commonly known structures are declarative structure, imperative structure, yes-no question structure, and wh-question structure.
1. Declarative structure:
Grammar rule: S → NP VP
2. Imperative structure: usually begins with a verb phrase and lacks a subject.
Example: Please pass the salt. Look at the door. Show me the latest design.
Grammar rule: S → VP
3. Yes-no question structure: usually begins with an auxiliary verb, followed by a subject NP, followed by a VP.
Grammar rule: S → Aux NP VP
4. Wh-question structure: Asks for specific information using words like who, what, or where.
Example: Where are you going?
S→ NP VP
S→ VP
S→ Aux NP VP
S→ Wh-NP VP
S→ Wh-NP Aux NP VP
NP → (Det) (AP) Nom (PP)
VP → Verb (NP) (NP) (PP)*
VP → Verb S
AP → (Adv) Adj (PP)
PP → Prep (NP)
Nom → Noun | Noun Nom
Note:
Coordination:
Refers to conjoining phrases with conjunctions like 'and', 'or', and 'but'.
For example,
A coordinate noun phrase can consist of two other noun phrases separated by a conjunction.
I ate [NP [NP an apple] and [NP a banana]].
Similarly, verb phrases and prepositional phrases can be conjoined as follows:
It is [VP [VP drizzling] and [VP raining]].
Not only that, even a sentence can be conjoined.
[S [S I am reading the book] and [S I am also watching the movie]]
NP → NP and NP
VP → VP and VP
S → S and S
Agreement:
Most verbs use two different forms in the present tense: one for third person singular subjects, and the other for all other kinds of subjects. Subject and verb must agree.
Examples demonstrating how the subject NP affects the form of the verb:
Does [NP Priya] sing?
Do [NP they] eat?
The first sentence has a third person singular subject, so the -es form of 'do', i.e., 'does', is used. The second sentence has a plural NP subject, hence the form 'do' is used. Sentences in which subject and verb do not agree are ungrammatical.
For example, the number property of a noun phrase can be represented by NUMBER feature. The value
that a NUMBER feature can take is SG (for singular) and PL (for plural).
Feature structures are represented by a matrix-like diagram called attribute value matrix (AVM).
The feature structure can be used to encode the grammatical category of a constituent and the features
associated with it. For example, the following structure represents the third person singular noun phrase.
The CAT and PERSON feature values remain the same in both structures, illustrating how feature
structures support generalization while maintaining necessary distinctions. Feature values can also be
other feature structures, not just atomic symbols. For instance, combining NUMBER and PERSON into
a single AGREEMENT feature makes sense, as subjects must agree with predicates in both properties.
This allows a more streamlined representation.
4. Parsing
• A phrase structure tree constructed from a sentence is called a parse.
• The syntactic parser is thus responsible for recognizing a sentence and assigning a syntactic
structure to it.
• The task that uses the rewrite rules of a grammar to either generate a particular sequence of words
or reconstruct its derivation (or phrase structure tree) is termed parsing.
• It is possible for many different phrase structure trees to derive the same sequence of words.
• A sentence can have multiple parses; this phenomenon is called syntactic ambiguity.
• Processes input data (usually in the form of text) and converts it into a format that can be
easily understood and manipulated by a computer.
o Input: The first constraint comes from the words in the input sentence. A valid parse is
one that covers all the words in a sentence. Hence, these words must constitute the leaves
of the final parse tree.
o Grammar: The second kind of constraint comes from the grammar. The root of the final
parse tree must be the start symbol of the grammar.
4.1 Top-Down Parsing
• A top-down parser starts its search from the root node S and works downwards towards the leaves.
• Find all sub-trees which can start with S: Expand the root node using all the grammar rules with
S on their left-hand side.
• Likewise, each non-terminal symbol in the resulting sub-trees is expanded next using the grammar
rules having a matching non-terminal symbol on their left-hand side.
• The right-hand side of the grammar rules provide the nodes to be generated, which are then
expanded recursively.
• The tree grows downward and eventually reaches a state where the bottom of the tree consists
only of part-of-speech categories.
• A successful parse corresponds to a tree which matches exactly with the words in the input
sentence.
Example: Consider the grammar shown in Table and the sentence “Paint the door”.
S → NP VP
S → VP
NP → Det Nominal
NP → Noun
NP → Det Noun PP
Nominal → Noun
Nominal → Noun Nominal
VP → Verb NP
VP → Verb
PP → Preposition NP
Det → this | that | a | the
Verb → sleeps | sings | open | saw | paint
Noun → paint | door | bird | hole
Preposition → from | with | on | to
Pronoun → she | he | they
1. The first level (ply) of the search tree consists of a single node labelled S.
2. The grammar in the table has two rules with S on their left-hand side: S → NP VP and S → VP.
3. These rules are used to expand the tree, giving us two partial trees at the second level of the search.
4. The third level is generated by expanding the non-terminals at the bottom of the search tree in the previous level.
4.2 Bottom-Up Parsing
A bottom-up parser starts with the words in the input sentence and attempts to construct a parse tree
in an upward direction towards the root.
• Start with the input words – Begin with the words in the sentence as the leaves of the parse tree.
• Look for matching grammar rules – Search for rules where the right-hand side matches parts
of the input.
• Apply reduction using the left-hand side – Replace matched portions with non-terminal
symbols from the left-hand side of the rule.
• Construct the parse tree upwards – Build the parse tree by moving upward toward the root.
• Repeat until the start symbol is reached – Continue reducing until the entire sentence is reduced
to the start symbol.
• Successful parse – The parsing is successful if the input is fully reduced to the start
symbol, completing the parse tree.
• Top-Down Parsing: Starts from the start symbol and generates trees, avoiding paths that lead to a
different root, but it may waste time exploring inconsistent trees before seeing the input.
• Bottom-Up Parsing: Starts with the input and ensures only matching trees are explored, but may
waste time generating trees that won't lead to a valid parse tree (e.g., incorrect assumptions about
word types).
• Top-Down Drawback: It can explore incorrect trees that eventually do not match the input,
resulting in wasted computation.
Basic Search Strategy: Combines top-down tree generation with bottom-up constraints to filter out
bad parses, aiming to optimize the parsing process.
• Start with Depth-First Search (DFS): Use a depth-first approach to explore the search tree
incrementally.
• Left-to-Right Search: Expand nodes from left to right in the tree.
• Incremental Expansion: Expand the search space one state at a time.
• Select Left-most Node for Expansion: Always select the left-most unexpanded node for
expansion.
• Expand Using Grammar Rules: Expand nodes based on the relevant grammar rules.
• Handle Inconsistent State: If a state is inconsistent with the input, it is flagged.
• Return to Recent Tree: The search then returns to the most recently unexplored tree to continue.
1. Initialize the agenda.
2. Pick a state, call it curr_state, from the agenda.
3. If curr_state represents a successful parse, then return the parse tree
   else if curr_state is a POS then
       if the category of curr_state is a subset of the POS associated with curr_word then
           apply lexical rules to the current state
       else reject
   else generate new states by applying grammar rules and push them onto the agenda.
4. If the agenda is empty then return failure, else select a node from the agenda for expansion and go to step 3.
Figure shows the trace of the algorithm on the sentence, Open the door.
• The algorithm begins with the node S and input word "Open."
• It first expands S using the rule S → NP VP, then expands NP with NP → Det Nominal.
• Since "Open" cannot be derived from Det, the parser discards this rule and tries NP → noun,
which also fails.
• The next agenda item corresponds to S → VP.
• Expanding VP using VP → Verb NP matches the first input word successfully.
• The algorithm then continues in a depth-first, left-to-right manner to match the remaining words.
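A compact recursive-descent sketch of this top-down, depth-first, left-to-right strategy in Python, using a trimmed subset of the grammar in the table; the lexicon dictionary is an assumption and only acceptance is reported, not the parse tree.

GRAMMAR = {
    "S":       [["NP", "VP"], ["VP"]],
    "NP":      [["Det", "Nominal"], ["Noun"], ["Pronoun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"]],
    "VP":      [["Verb", "NP"], ["Verb"]],
}
LEXICON = {"Det": {"the", "a", "this", "that"},
           "Noun": {"door", "paint"},
           "Verb": {"open", "paint", "saw"},
           "Pronoun": {"she", "he", "they"}}

def parse(symbols, words):
    """Try to expand the symbol list so that it derives exactly the word list."""
    if not symbols:
        return not words                       # success only if all words are consumed
    first, rest = symbols[0], symbols[1:]
    if first in LEXICON:                       # part-of-speech: must match the next word
        return bool(words) and words[0] in LEXICON[first] and parse(rest, words[1:])
    return any(parse(expansion + rest, words)  # try every grammar rule for this symbol
               for expansion in GRAMMAR[first])

print(parse(["S"], ["open", "the", "door"]))   # True, via S -> VP -> Verb NP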
Table: Left corners for each grammar category
Drawbacks of the basic top-down parser:
1. Inefficiency: It may explore many unnecessary branches of the parse tree, especially if the input
does not match the grammar well, leading to high computational overhead.
2. Backtracking: If a rule fails, the parser often needs to backtrack to a previous state and try
alternative expansions, which can significantly slow down parsing.
3. Left Recursion Issues: Top-down parsers struggle with left-recursive grammars because they can
lead to infinite recursion.
4. Lack of Lookahead: Basic top-down parsers generally lack lookahead capabilities, meaning they
might make incorrect decisions early on without enough information, leading to errors.
5. Ambiguity Handling: They may have difficulty handling ambiguities in the grammar, often
exploring all possible alternatives without any way of pruning inefficient branches.
6. Limited Error Recovery: Basic top-down parsers typically have poor error recovery and can fail
immediately when encountering an unexpected input.
Dynamic programming algorithms can solve these problems. These algorithms construct a table
containing solutions to sub-problems, which, if solved, will solve the whole problem.
There are three widely known dynamic programming parsers: the Cocke-Younger-Kasami (CYK) algorithm, the Graham-Harrison-Ruzzo (GHR) algorithm, and the Earley algorithm.
o The states in each entry provide the following information.
▪ A sub-tree corresponding to a grammar rule.
▪ Information about the progress made in completing the
sub-tree.
▪ Position of the sub-tree with respect to input.
Earley Parsing
Input: Sentence and the grammar
Output: Chart

chart[0] ← S' → • S, [0, 0]
n ← length(sentence)    // number of words in the sentence
for i = 0 to n do
    for each state in chart[i] do
        if (incomplete(state) and next category is not a part of speech) then
            predictor(state)
        else if (incomplete(state) and next category is a part of speech) then
            scanner(state)
        else
            completer(state)
        end-if
    end for
end for
return chart

Procedure Predictor (A → X1 ... • B ... Xm, [i, j])
    for each rule (B → α) in G do
        insert the state B → • α, [j, j] into chart[j]
End

Procedure Scanner (A → X1 ... • B ... Xm, [i, j])    // B is a part-of-speech category
    if B is among the parts of speech associated with word[j] then
        insert the state B → word[j] •, [j, j+1] into chart[j+1]
End

Procedure Completer (A → X1 ... •, [j, k])
    for each state B → X1 ... • A ..., [i, j] in chart[j] do
        insert the state B → X1 ... A • ..., [i, k] into chart[k]
End
Steps:
1. Prediction
If the dot (•) is before a non-terminal in a rule, add all rules expanding that non-terminal
to the state set.
The predictor generates new states representing potential expansion of the non-terminal
in the left-most derivation.
A predictor is applied to every state that has a non-terminal to the right of the dot.
Results in the creation of as many new states as there are grammar rules for the non-
terminal
Their start and end positions are at the point where the generating state ends. If the generating state is
A → X1 ... • B ... Xm, [i, j]
then for every rule of the form B → α, the operation adds to chart[j] the state B → • α, [j, j].
For example, when the generating state is S → . NP VP, [0,0], the predictor adds the following states
to chart [0]:
NP →· Det Nominal, [0,0]
NP →· Noun, [0,0]
NP →· Pronoun, [0,0]
NP →· Det Noun PP, [0,0]
2. Scanning
A scanner is used when a state has a part-of-speech category to the right of the dot.
The scanner examines the input to see if the part-of-speech appearing to the right of the dot
matches one of the part-of-speech associated with the current input.
If yes, then it creates a new state using the rule that allows generation of the input word with
this part-of-speech.
If the dot (•) is before a terminal that matches the current input symbol, move the dot to the
right.
Example:
When the state NP → . Det Nominal, [0,0] is processed, the parser finds a part-of-speech category next
to the dot.
It checks if the category of the current word (curr_word) matches the expectation in the current state. If yes, then it adds the new state Det → curr_word •, [0, 1] to the next chart entry.
3. Completion
• If the dot reaches the end of a rule, find and update previous rules that were waiting for this rule
to complete.
• The completer identifies all previously generated states that expect this grammatical category at
this position in the input and creates new states by advancing the dots over the expected category.
Example:
Since John is a valid NP, we scan it. The next word is "sees", which matches V.
The sequence of states for “Paint the door” created by the parser is shown in Figure
CYK Parsing
The table contains entries after a complete scan of the algorithm. The entry in the [1, n]th cell contains the start symbol, which indicates that S ⇒* w1 ... wn, i.e., the parse is successful.
• Columns represent substrings of increasing length.
• Fill Base Case (Single Words): Find matching grammar rules for each word
• Fill Table for Larger Substrings: Now, we combine smaller segments.
• Check for Start Symbol (S): Since S appears in T[1,5], the sentence is valid under this grammar.
Algorithm:
Let w = w1 w2 w3 ... wi ... wj ... wn
and wij = wi ... wi+j-1

// Initialization step
for i := 1 to n do
    chart[i, 1] := {A | A → wi is a production}

// Recursive step
for j := 2 to n do
    for i := 1 to n - j + 1 do
    begin
        chart[i, j] := ∅
        for k := 1 to j - 1 do
            chart[i, j] := chart[i, j] ∪ {A | A → BC is a production and
                                          B ∈ chart[i, k] and C ∈ chart[i+k, j-k]}
    end
if S ∈ chart[1, n] then accept else reject
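A Python sketch of this CYK recognizer for a grammar in Chomsky normal form; the toy grammar and lexicon are assumptions.

from itertools import product

# Toy CNF grammar: binary rules and lexical (unary) rules
BINARY  = {("NP", "VP"): {"S"}, ("Det", "Noun"): {"NP"}, ("Verb", "NP"): {"VP"}}
LEXICAL = {"the": {"Det"}, "boy": {"Noun"}, "door": {"Noun"}, "opened": {"Verb"}}

def cyk(words):
    n = len(words)
    # chart[i][j] = set of non-terminals deriving words[i:i+j] (j = span length)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):                 # initialization: spans of length 1
        chart[i][1] = set(LEXICAL.get(w, set()))
    for j in range(2, n + 1):                     # recursive step: longer spans
        for i in range(n - j + 1):
            for k in range(1, j):
                for B, C in product(chart[i][k], chart[i + k][j - k]):
                    chart[i][j] |= BINARY.get((B, C), set())
    return "S" in chart[0][n]

print(cyk("the boy opened the door".split()))    # True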
5. Probabilistic Parsing
• Statistical parser, requires a corpus of hand-parsed text.
• The Penn Treebank is a large corpus annotated with Penn Treebank tags, parsed based on a simple set of phrase structure rules in the style of Chomsky's government and binding syntax.
• The parsed sentences are represented in the form of properly bracketed trees.
Given a grammar G, a sentence s, and the set of possible parse trees of s, denoted τ(s), a probabilistic parser finds the most likely parse φ of s as follows:
φ = argmax_{φ ∈ τ(s)} P(φ | s)   % the parse maximizing the conditional probability of the tree given the sentence
  = argmax_{φ ∈ τ(s)} P(φ, s)    % equivalently, the parse maximizing the joint probability P(φ, s)
  = argmax_{φ ∈ τ(s)} P(φ)       % equivalently, the parse maximizing P(φ), since the tree determines the sentence
We have to first find all possible parses of a sentence, then assign probabilities to them, and finally return the most probable parse. This is done using probabilistic context-free grammars (PCFGs).
• A probabilistic parser helps resolve parsing ambiguity (multiple parse trees) by assigning
probabilities to different parse trees, allowing selection of the most likely structure.
• It improves efficiency by narrowing the search space, reducing the time required to determine the
final parse tree.
Probabilistic context-free grammar (PCFG):
Example: A PCFG is shown in the table; for each non-terminal, the sum of the rule probabilities is 1.
Det→this 0.2 Preposition→on 0.2
Det→that 0.2 Preposition→to 0.25
Det→a 0.25 Pronoun→she 0.35
Det→the 0.35 Pronoun→he 0.35
Noun→paint 0.25 Pronoun→they 0.25
f(S → NP VP) + f(S → VP) = 1
f(NP → Det Noun) + f(NP → Noun) + f(NP → Pronoun) + f(NP → Det Noun PP) = 1
f(VP → Verb NP) + f(VP → Verb) + f(VP → VP PP) = 1
f(Det → this) + f(Det → that) + f(Det → a) + f(Det → the) = 1
f(Noun → paint) + f(Noun → door) + f(Noun → bird) + f(Noun → hole) = 1
If our training corpus consists of two parse trees (as shown in Figure), we will get the estimates as shown
in Table for the rules.
Figure: Two Parse trees Table: MLE for grammar rules considering two parse trees
NP→Pronoun 0.2 Verb→sings 0.2
NP→Det Noun PP 0.2 Verb→open 0.2
VP→Verb NP 0.5 Verb→saw 0.2
VP→Verb 0.3 Verb→paint 0.2
VP→VP PP 0.2 Preposition→from 0.3
PP→Preposition NP 1.0 Preposition→with 0.25
Det→this 0.2 Preposition→on 0.2
Det→that 0.2 Preposition→to 0.25
Det→a 0.25 Pronoun→she 0.35
Det→the 0.35 Pronoun→he 0.35
Noun→paint 0.25 Pronoun→they 0.25
The probability of a parse tree φ is computed as the product of the probabilities of the rules used to expand each node: P(φ) = ∏ P(r(n)), where n is a node in the parse tree φ and r(n) is the rule used to expand n.
The probability of the two parse trees of the sentence Paint the door with the hole (shown in Figure)
using PCFG table can be computed as follows:
P(t1) = 0.2 × 0.5 × 0.2 × 0.2 × 0.35 × 0.25 × 1.0 × 0.25 × 0.4 × 0.35 × 0.25 = 0.0000030625
P(t2) = 0.2 × 0.2 × 0.5 × 0.2 × 0.4 × 0.35 × 0.25 × 1.0 × 0.25 × 0.4 × 0.35 × 0.25 = 0.000001225
The first tree has a higher probability, leading to the correct interpretation.
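A tiny Python sketch of this computation: the probability of each parse is the product of the probabilities of the rules used in it; the lists of rule probabilities are taken from the figures above and should be treated as given.

from math import prod   # Python 3.8+

# Rule probabilities used in the two parse trees of "Paint the door with the hole"
t1_rule_probs = [0.2, 0.5, 0.2, 0.2, 0.35, 0.25, 1.0, 0.25, 0.4, 0.35, 0.25]
t2_rule_probs = [0.2, 0.2, 0.5, 0.2, 0.4, 0.35, 0.25, 1.0, 0.25, 0.4, 0.35, 0.25]

p1, p2 = prod(t1_rule_probs), prod(t2_rule_probs)
print(p1)          # about 3.06e-06
print(p2)          # about 1.23e-06
print(p1 > p2)     # True: the parser prefers the first tree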
We can assign a probability to a sentence s by summing up the probabilities of all of its possible parses.
Given a PCFG, a probabilistic parsing algorithm assigns the most likely parse φ to a sentence s.
• The rest of the steps follow those of basic CYK parsing algorithm.
Limitations of PCFGs:
• Lack of sensitivity to structural dependencies.
o Whether an NP is expanded as a pronoun or as a lexical NP depends on whether the NP appears as a subject or as an object.
o Example: Pronouns occur more frequently as subjects than as objects.
o These dependencies are not captured by a PCFG.
• Lack of sensitivity to lexical information.
o Two structurally different parses that use the same rules will have the same probability under a PCFG.
Solution: Lexicalization. This, however, requires a model which captures lexical dependency statistics for different words.
6. Indian Languages
• Some characteristics of Indian languages make CFG unsuitable for modelling them.
• Paninian grammar can be used to model Indian languages.
1. Indian languages have free word order. The CFG we used for parsing English is basically positional, so it fails to model free word order languages.
2. Complex predicates (CPs) are another property that most Indian languages have in common.
• A complex predicate combines a light verb with a verb, noun, or adjective to produce a new verb.
• For example:
(b) सबा आ गयी। (Saba a gayi.) Saba come went. Saba arrived.
(c) सबा आ पडी। Saba a pari. Saba come fell. Saba came (suddenly).
The use of post-position case markers and the auxiliary verbs in this sequence provide information about
tense, aspect, and modality.
Paninian grammar provides a framework to model Indian languages. It focuses on the extraction of karaka relations from a sentence.
Bharti and Sangal (1990) described an approach for parsing of Indian languages based on Paninian
grammar formalism. Their parser works in two stages.
1st stage: For the example sentence ladkiyan maidaan mein khel rahi hein ('The girls are playing in the field'):
• The word ladkiyan forms one unit, the words maidaan and mein are grouped together to form a noun group, and the word sequence khel rahi hein forms a verb group.
2nd stage:
• The parser takes the word groups formed during first stage and identifies (i) Karaka relations
among them, and (ii) senses of words.
• Karaka chart is created to store additional information like Karaka-Vibhakti mapping.
• Constraint graph for sentence: The Karaka relation between a verb group and a noun group can
be depicted using a constraint graph.
• A parse of the sentence:
Each sub-graph of the constraint graph that satisfies the following constraints yields a parse of the
sentence.
1. It contains all the nodes of the graph.
2. It contains exactly one outgoing edge from a verb group for each of its mandatory Karakas. These
edges are labelled by the corresponding Karaka.
3. For each of the optional Karaka in Karaka chart, the sub-graph can have at most one outgoing
edge labelled by the Karaka from the verb group.
4. For each noun group, the sub-graph should have exactly one incoming edge.
Question Bank
[Table: a URL split into numbered parts 1 to 5.]
In this table, 1 is the protocol, 2 is the name of a server, 3 is the directory, and 4 is the name of a document. Suppose you have to write a program that takes a URL and returns the protocol used, the DNS name of the server, the directory, and the document name. Develop a regular expression that will help you in writing this program.
6. How can unknown words be handled in the tagging process?
7. Give two possible parse trees for the sentence, Stolen painting found by tree.
8. Identify the noun and verb phrases in the sentence, My soul answers in music.
10. Discuss the disadvantages of the basic top-down parser with the help of an appropriate
example.
11. Tabulate the sequence of states created by CYK algorithm while parsing, The sun rises in
the east. Augment the grammar in section 4.4.5 with appropriate rules of lexicon.
13. What does lexicalized grammar mean? How can lexicalization be achieved? Explain with
the help of suitable examples.
14. List the characteristics of a garden path sentence. Give an example of a garden path
sentence and show its correct parse.
S → NP VP
S → VP
NP → Det Noun
NP → Noun
NP → NP PP
VP → VP PP
VP → Verb
VP → VP NP
PP → Preposition NP
Give two possible parses of the sentence 'Pluck the flower with the stick'. Introduce lexicon rules for the words appearing in the sentence. Using these parse trees, obtain maximum likelihood estimates for the grammar rules used in the trees. Calculate the probability of any one parse tree using these estimates.
Lab Exercises
1. Write a program to find minimum edit distance between two input strings.
2. Use any tagger available in your lab to tag a text file. Now write a program to find the
most likely tag in the tagged text.
3. Write a program to find the probability of a tag given the previous two tags, i.e., P(t3 | t2 t1).
4. Write a program to extract all the noun phrases from a text file. Use the phrase structure
rule given in this chapter.
5. Write a program to check whether a given grammar is a context-free grammar or not.