NLP StudyMaterial

This document provides an overview of natural language processing (NLP) techniques, including rule-based NLP, statistical model-based NLP using the Penn Treebank and Conditional Random Fields (CRFs), and modern approaches such as Word2Vec, sequence-to-sequence models, and Transformers. It discusses tasks such as named entity recognition, part-of-speech tagging, and parsing, and compares different NLP methods.

Natural Language Processing
CSA4006

Dr. Anirban Bhowmick
Assistant Professor, VIT Bhopal

Lecture 1

Syllabus

Module 1:
Introduction: Knowledge in Speech and Language Processing - Ambiguity - Models and Algorithms - Language, Thought, and Understanding - The State of the Art and the Near-Term Future - Regular Expressions - Basic Regular Expression Patterns - Disjunction, Grouping, and Precedence - Using an FSA to Recognize Sheeptalk - Formal Languages.

Text Book:
Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition", Prentice Hall, 2nd edition, 2008.

Reference Books:
1. Roland R. Hausser, "Foundations of Computational Linguistics: Human-Computer Communication in Natural Language", Paperback, MIT Press, 2011.
2. Christopher D. Manning and Hinrich Schuetze, "Foundations of Statistical Natural Language Processing", MIT Press.

Module 1:
Introduction

Topic: Introduction
NLP
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction
between computers and humans through natural language. NLP enables computers to understand,
interpret, and generate human language in a way that is both meaningful and valuable.

Why Study NLP?
Ubiquity of Language: Language is a fundamental medium of communication among humans. NLP allows machines to understand and process this communication, enabling a wide range of applications.

Real-World Applications: NLP is used in various real-world applications such as virtual assistants, sentiment analysis, language translation, information retrieval, and more.

Data Explosion: The digital age has led to an explosion of textual data. NLP provides the tools to extract insights and information from this data.

Brief History of NLP
Early Foundations (1950s-1970s)

1950s: The field of AI is born, with early attempts at machine translation (MT) using rule-based systems.

1960s: ELIZA, a computer program capable of simulating human conversation, is developed by Joseph Weizenbaum.

1970s: Rule-based approaches dominate NLP, but they struggle with the complexity and ambiguity of language.

Note: ELIZA simulated conversation using a pattern-matching and substitution methodology that gave users an illusion of understanding on the part of the program.

Brief History of NLP
Statistical NLP (1980s-2000s)

1980s: Introduction of statistical methods, Hidden Markov Models (HMMs), and probabilistic context-free grammars.

1990s: The use of large corpora and the development of the Penn Treebank revolutionize NLP. Introduction of part-of-speech tagging and syntactic parsing.

2000s: More sophisticated statistical models like Conditional Random Fields (CRFs) emerge, and the field shifts decisively towards data-driven approaches; word embeddings (Word2Vec, GloVe) follow in the early 2010s.

Brief History of NLP
Deep Learning and Modern NLP (2010s-Present)

2010s: Deep learning redefines NLP with neural network architectures like Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).

2013: Introduction of Word2Vec by Mikolov et al., which learns word embeddings from large text corpora.

2014: "Sequence to Sequence" models enable breakthroughs in machine translation.

2018: Transformers, exemplified by the BERT model, revolutionize NLP tasks by learning contextualized word representations.

Present: State-of-the-art models like GPT-3.5 achieve remarkable performance across a wide range of NLP tasks using massive amounts of data and computation.

NLP - Rule Based
Rule-based Natural Language Processing (NLP) is an approach to language processing that relies on a set of predefined rules and patterns to analyze and extract information from text data. It contrasts with machine learning-based NLP, which uses algorithms and models to learn patterns and make predictions from data.

Rule: If a text contains a date in the format "dd/mm/yyyy" or "dd-mm-yyyy", extract it.

Example Text: "The project deadline is 25/09/2023, and the meeting is scheduled for 30-09-2023."

Rule-Based NLP Output:
Extracted Date: "25/09/2023"
Extracted Date: "30-09-2023"

NLP - Statistical Model Based
Statistical model-based Natural Language Processing (NLP) relies on statistical techniques and machine learning algorithms to analyze and understand text data. Unlike rule-based NLP, which relies on predefined rules and patterns, statistical model-based NLP learns patterns and relationships from data.

Task: Text Classification
Statistical Model: Support Vector Machine (SVM)
Example: Sentiment Analysis
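
A minimal scikit-learn sketch of this setup (the toy texts and labels are invented for illustration; a real sentiment analyzer would be trained on a large labeled corpus):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny labeled corpus: positive and negative movie comments
texts = ["I loved this movie", "Great acting and plot",
         "Terrible film", "I hated every minute"]
labels = ["pos", "pos", "neg", "neg"]

# TF-IDF features feeding a linear SVM: a classic statistical text classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["what a great film"]))  # expected: ['pos']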

NLP - Penn Treebank Based
The Penn Treebank is a widely used dataset in Natural Language Processing (NLP) that provides annotated syntactic and structural information for English text. It uses a tree structure to represent the grammatical and syntactic relationships within sentences. One common application of Penn Treebank-based NLP is parsing sentences to analyze their grammatical structure.

Task: Sentence Parsing
"The quick brown fox jumps over the lazy dog."

Tokenization: The sentence is first tokenized into individual words and punctuation marks:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

Part-of-Speech (POS) Tagging: Each token is assigned a POS tag that represents its grammatical category (e.g., noun, verb, adjective). Here is the sentence with POS tags:
[("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", ".")]

NLP-Penn Treebank based
Parsing: The Penn Treebank-based NLP system uses syntactic rules and information to parse the
sentence into a tree structure that represents its grammatical and syntactic relationships. The resulting
parse tree for the example sentence might look like this:

(S
  (NP (DT The) (JJ quick) (JJ brown) (NN fox))
  (VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))))
  (. .))

In this parse tree, "S" represents the sentence, "NP" represents a noun phrase, "VP" represents a verb
phrase, "DT" represents a determiner, "JJ" represents an adjective, "NN" represents a noun, "VBZ"
represents a verb, and "IN" represents a preposition. The tree structure captures the hierarchical
relationships between the words in the sentence.
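
The bracketed notation above can be loaded directly with NLTK's Tree class (a sketch using nltk, not part of the original slide):

from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) "
    "(VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))) (. .))"
)
parse.pretty_print()                          # draws the tree as ASCII art
print(parse.leaves())                         # the original tokens
print([t.label() for t in parse.subtrees()])  # S, NP, VP, PP, ...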

NLP-CRFs
Conditional Random Fields (CRFs) are a popular machine learning model used in Natural Language
Processing (NLP) for sequence labeling tasks, such as named entity recognition (NER), part-of-speech
tagging (POS), and chunking. CRFs are particularly effective at capturing dependencies between adjacent
labels in a sequence.
Example Sentence:
"Apple Inc. is headquartered in Cupertino, California."

Tokens (punctuation omitted): ["Apple", "Inc.", "is", "headquartered", "in", "Cupertino", "California"]
Label Sequence (NER Tags): ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC"]

In this example, the labels indicate the following:
"B-ORG": Beginning of an organization name.
"I-ORG": Inside an organization name.
"B-LOC": Beginning of a location name.
"I-LOC": Inside a location name.
"O": A word that is not part of any named entity.

NLP - State of the Art
Word2Vec, Sequence-to-Sequence (Seq2Seq), and Transformers are all important techniques in Natural Language Processing (NLP), but they serve different purposes and have different characteristics. Let's compare them on several key aspects:

Parameter          | Word2Vec                   | Seq2Seq                           | Transformers
-------------------+----------------------------+-----------------------------------+--------------------------------------
Objective          | Used for word embedding    | Designed for sequence-to-sequence | Initially designed for seq2seq, but
                   |                            | tasks such as machine translation | have become fundamental to NLP
                   |                            | and text summarization            |
Model architecture | Shallow neural network     | Encoder and decoder, built from   | Self-attention mechanism plus
                   | (CBOW or skip-gram)        | RNNs or LSTMs                     | feed-forward neural networks
Training           | Large corpus               | Parallel input/target sequence    | Massive corpora, self-supervised
                   |                            | pairs                             | pretraining, then fine-tuning
Parallelism        | Inherently parallelizable  | Less parallelizable               | Highly parallelizable
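
As a concrete illustration of the first column, here is a minimal Word2Vec sketch using the gensim library (assumes gensim >= 4.0; the toy corpus is invented, and real embeddings require millions of sentences):

from gensim.models import Word2Vec

corpus = [["the", "king", "rules", "the", "kingdom"],
          ["the", "queen", "rules", "the", "kingdom"],
          ["dogs", "chase", "cats"]]

# sg=1 selects the skip-gram variant; sg=0 would select CBOW
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["king"][:5])           # first few dimensions of the embedding
print(model.wv.most_similar("king"))  # nearest neighbours in vector space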



Natural Language Processing
CSA4006

Dr. Anirban Bhowmick
Assistant Professor, VIT Bhopal

Lecture 2

Module 1:
Introduction

Topic: Introduction
Applications of NLP
Communication With Machines

Applications of NLP
Conversational Agents
Building AI systems that can engage in natural-sounding conversations with users. Used in customer support, virtual companions, and mental health apps. Conversational agents contain:
● Speech recognition
● Language analysis
● Dialogue processing
● Information retrieval
● Text to speech

Question Answering
Developing systems that can understand and answer questions posed in natural language. Used in chatbots, virtual assistants, and information retrieval.

Text Generation
Creating human-like text using models like OpenAI's GPT-3. Applications range from creative writing to chatbots.


Applications of NLP
Machine Translation
Automatically translating text from one language to another. Google Translate and other translation services heavily rely on NLP techniques.

Sentiment Analysis
Analyzing text to determine the sentiment (positive, negative, neutral) expressed by the author. Applications include brand monitoring, customer feedback analysis, and social media sentiment tracking.

Information Retrieval
Improving search engines by understanding user queries and retrieving relevant information from a large dataset.

Named Entity Recognition (NER)
Identifying entities like names, dates, locations, and more within a text. Used in information extraction, chatbots, and language translation.


Levels of Linguistic Knowledge
1. Phonetics and Phonology
At this level, NLP systems consider the sounds of speech. It involves understanding the phonemes (distinct speech sounds) and the rules governing their pronunciation, as well as the intonation patterns and stress in spoken language.

2. Morphology
Morphology deals with the internal structure of words and how they are formed from smaller units called morphemes. Morphological analysis helps in tasks like stemming (reducing words to their base form) and lemmatization (reducing words to their dictionary form).

3. Syntax
Syntax involves the rules governing the structure of sentences. It includes understanding how words combine to form phrases and sentences, and the relationships between different parts of speech. Parsing techniques are used to analyze sentence structure.


Levels of Linguistic Knowledge
4. Semantics
Semantics is the study of meaning in language. NLP systems at this level aim to understand the meaning of individual words, phrases, and sentences. This can involve tasks like word sense disambiguation (determining the correct meaning of a word based on context) and semantic role labeling (identifying the roles of words in a sentence, e.g., subject, object).

5. Pragmatics
Pragmatics refers to the use of language in context. It involves understanding implied meaning, indirect speech acts, and the intentions behind statements. This level is crucial for understanding sarcasm, irony, and other forms of figurative language.

6. Discourse
Discourse refers to the structure and organization of connected text or speech. NLP systems at this level consider how sentences relate to each other and form coherent paragraphs or dialogues. Coreference resolution (identifying which words refer to the same entity) is an important task in discourse analysis.


Why is NLP Hard?

1. Ambiguity
2. Scale
3. Sparsity
4. Variation
5. Expressivity
6. Unmodeled Variables
7. Unknown Representations


Ambiguity
Ambiguity exists at multiple levels:

Word senses: bank (finance or river?)
Part of speech: chair (noun or verb?)
Syntactic structure: I can see a man with a telescope
Multiple: I made her duck
Semantic: Time flies like an arrow; fruit flies like a banana
Phonological: "I scream, you scream, we all scream for ice cream." (the phrases "I scream" and "ice cream" sound alike)


Ambiguity
These different meanings are caused by a number of ambiguities. First, the words duck and her are morphologically or syntactically ambiguous in their part-of-speech: duck can be a verb or a noun, while her can be a dative pronoun or a possessive pronoun. Second, the word make is semantically ambiguous; it can mean create or cook. Third, the verb make is syntactically ambiguous in a different way: make can be transitive, taking a single direct object, or it can be ditransitive, taking two objects, meaning that the first object (her) was made into the second object (duck). Make can also take a direct object and a verb, meaning that the object (her) was caused to perform the verbal action (duck). Finally, in a spoken sentence there is an even deeper kind of ambiguity: the first word could have been eye or the second word maid.
Ambiguity
We often introduce the models and algorithms we present throughout the book as ways to resolve or disambiguate these ambiguities. For example, deciding whether duck is a verb or a noun can be solved by part-of-speech tagging. Deciding whether make means "create" or "cook" can be solved by word sense disambiguation. Resolution of part-of-speech and word sense ambiguities are two important kinds of lexical disambiguation.

Note: Word Sense Disambiguation (WSD) is a natural language processing (NLP) task that focuses on determining the correct meaning or sense of a word in a given context.


Scale
Scale in NLP refers to the challenges and opportunities posed by the vast amounts of linguistic data available for analysis. The scale of data in NLP presents both technical and computational challenges, but it also enables the development of more sophisticated models and applications.

Challenges of Scale
Data Collection: Gathering and annotating large-scale linguistic data is resource-intensive and time-consuming.

Computational Resources: Processing and analyzing massive datasets require significant computational power and memory.

Model Complexity: More data often leads to larger and more complex models, which may require specialized hardware and efficient training techniques.

Noise and Quality: As datasets grow, ensuring data quality becomes crucial, as noise can negatively impact model performance.


Scale
Opportunities of Scale
Improved Models: Large datasets enable the training of more accurate and robust NLP models that can capture subtle linguistic nuances.

Generalization: Models trained on extensive data have the potential to generalize better across various domains and languages.

Transfer Learning: Models pretrained on massive datasets can be fine-tuned for specific tasks, reducing the need for extensive task-specific data.

Multilingualism: Large-scale data allows models to learn from multiple languages, enabling multilingual applications.


Sparsity
Sparsity is a common challenge in Natural Language Processing (NLP) that arises due to the vast and diverse nature of human language. In NLP, sparsity refers to the phenomenon where the data space is extremely large, but the actual data available for any specific point in that space is very limited. This can have significant implications for various NLP tasks and models.

Causes of Sparsity in NLP
Vocabulary Size: Natural languages have extensive vocabularies with numerous words, many of which are rare or domain-specific. The majority of words appear infrequently in any given text corpus.

Long-Tail Distribution: The frequency distribution of words follows a "long tail" pattern, where a few common words appear frequently, while the majority of words occur rarely.

Named Entities: Entities like names, locations, dates, and specialized terms are sparse in most text data.

Word Combinations: The number of possible word combinations is astronomically large, but most of these combinations are never observed in real-world text.


Natural Language Processing
CSA4006

Dr. Anirban Bhowmick
Assistant Professor, VIT Bhopal

Lecture 3

Module 1:
Introduction

Topic: Regular Expression




Variation
Suppose we train a part-of-speech tagger or a parser on the Wall Street Journal.

What will happen if we try to use this tagger/parser for social media?

"ikr smh he asked fir yo last name so he can add u on fb lololol"




Expressivity
Not only can one form have different meanings (ambiguity), but the same meaning can be expressed with different forms. For example, "She gave the book to Tom" and "She gave Tom the book" express the same event in two ways.


Unmodeled Variables

World knowledge:
I dropped the glass on the floor and it broke.
I dropped the hammer on the glass and it broke.


Unmodeled Representations
Unmodeled representations in NLP refer to aspects of language and meaning that are not fully captured by existing language models, resulting in situations where models struggle to understand the nuances and complexities of human communication. Here are some examples:

Example: "She's as busy as a bee."
In this metaphor, the phrase "busy as a bee" implies that she is very industrious, but this meaning is not directly related to bees being busy insects.

Example: "He's the Einstein of our group."
This expression assumes knowledge about who Einstein was and what he symbolizes. A model lacking this cultural context might miss the intended comparison.

Example: "Oh great, another flat tire!"
This statement might be used in a situation where someone is frustrated about a recurring problem; the words imply sarcasm despite the literal words expressing annoyance.


Factors Changing the NLP Landscape

1. Increases in computing power
2. The rise of the web, then the social web
3. Advances in machine learning
4. Advances in understanding of language in social context


Regular Expressions
Regular expressions (regex) are powerful tools used in Natural Language Processing (NLP) to match and manipulate text patterns. They provide a concise and flexible way to search, extract, and manipulate textual data.

Imagine you needed to search a string for a term, such as "phone":

"phone" in "Is the phone here?"
>>> True

Imagine you needed to search for a phone number, "91-98765-43210"; we can do the same:

"91-98765-43210" in "Her phone number is 91-98765-43210"
>>> True


Regular Expression
But what if you don't know the exact number, or you need to find all the phone numbers in a text? We need a regular expression to search through the document for this pattern. Regular expressions allow for pattern searching in a text document:

r'\d{2}-\d{5}-\d{5}'

\d is the placeholder pattern code for a digit.
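
Putting the pattern to work (a short illustrative sketch; the second number is invented to show multiple matches):

import re

text = "Her phone number is 91-98765-43210; the old number 91-12345-67890 is dead."

# \d{2}-\d{5}-\d{5}: two digits, hyphen, five digits, hyphen, five digits
print(re.findall(r"\d{2}-\d{5}-\d{5}", text))
# ['91-98765-43210', '91-12345-67890']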

Regular Expressions: Disjunctions
Letters inside square brackets [] match any one of the listed characters: [wW]oodchuck matches "woodchuck" or "Woodchuck".

Ranges: [A-Z] matches any uppercase letter, [0-9] any digit.


Regular Expressions: Negation in Disjunction
Negations: [^Ss] matches any character except "S" or "s". The caret means negation only when it is first inside [].


Regular Expression

Pattern  | Meaning                             | Matches
---------+-------------------------------------+-------------------------------
colou?r  | optional previous char              | color, colour
oo*h!    | 0 or more of previous char          | oh!, ooh!, oooh!, ooooh!
o+h!     | 1 or more of previous char          | oh!, ooh!, oooh!, ooooh!
baa+     | 1 or more of previous char          | baa, baaa, baaaa, baaaaa
beg.n    | any character between "beg" and "n" | begin, begun, beg3n

Regular Expressions: Anchors ^ $
^ anchors a match to the start of a line and $ to the end; for example, /^The/ matches "The" only at the beginning of a line, and /\.$/ matches a period only at the end of a line.


Advanced Operators
A range of numbers can also be specified: /{n,m}/ specifies from n to m occurrences of the previous char or expression, while /{n,}/ means at least n occurrences of the previous expression.


Errors
Find me all instances of the word "the" in a text.

the                         -- misses capitalized "The"
[tT]he                      -- incorrectly also matches "other" or "theology"
[^a-zA-Z][tT]he[^a-zA-Z]    -- matches only the correct ones

The process we just went through was based on fixing two kinds of errors:
False positives (Type I): matching strings that we should not have matched (there, then, other).
False negatives (Type II): not matching things that we should have matched (The).

In NLP we are always dealing with these kinds of errors. Reducing the error rate for an application often involves two antagonistic efforts:
Increasing accuracy or precision (minimizing false positives).
Increasing coverage or recall (minimizing false negatives).
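
A small sketch that makes the precision/recall trade-off visible (the test sentence is invented for illustration):

import re

text = "The other day the theologian sat there."

for pattern in [r"the", r"[tT]he", r"[^a-zA-Z][tT]he[^a-zA-Z]"]:
    print(pattern, "->", re.findall(pattern, text))

# r"the"                      misses "The" (false negative) and also hits the "the"
#                             inside "other", "theologian", "there" (false positives)
# r"[tT]he"                   fixes the capitalization miss, keeps the false positives
# r"[^a-zA-Z][tT]he[^a-zA-Z]" keeps only standalone "the", but still misses a
#                             sentence-initial "The" with no preceding character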



Regular Expressions
Lecture 11b
Larry Ruzzo
Outline

• Some string tidbits


• Regular expressions and pattern matching
Strings Again

'abc'
"abc"          a b c
'''abc'''
r'abc'

Strings Again

'abc\n'
"abc\n"        a b c newline
'''abc
'''
r'abc\n'       a b c \ n

Why so many?
' vs " lets you put the other kind inside
''' lets you run across many lines
all 3 let you show "invisible" characters (via \n, \t, etc.)
r'...' (raw strings) can't do invisible stuff, but avoid problems with backslash

open('C:\new\text.dat') vs
open('C:\\new\\text.dat') vs
open(r'C:\new\text.dat')
RegExprs are Widespread
• shell file name patterns (limited)
• unix utility "grep" and relatives
  • try "man grep" in terminal window
• perl
• TextWrangler
• Python
Patterns in Text
• Pattern-matching is frequently useful
• Identifier: A letter followed by >= 0 letters or digits.

count1 number2go, not 4runner


• TATA box: TATxyT where x or y is A

TATAAT TATAgT TATcAT, not TATCCT


• Number: >=1 digit, optional decimal point, exponent.
3.14 6.02E+23, not 127.0.0.1
Regular Expressions
• A language for simple patterns, based on 4 simple
primitives
• match single letters
• this OR that
• this FOLLOWED BY that
• this REPEATED 0 or more times
• A specific syntax (fussy, and varies among pgms...)
• A library of utilities to deal with them
• Key features: Search, replace, dissect
Regular Expressions
• Do you absolutely need them in Python?
• No, everything they do, you could do yourself
• BUT pattern-matching is widely needed, tedious and error-prone. RegExprs give you a flexible, systematic, compact, automatic way to do it. A common language for specifications.
• In truth, it's still somewhat error-prone, but in a different way.
Examples
(details later)

• Identifier: letter followed by ≥0 letters or digits.
  [a-z][a-z0-9]*            i  count1  number2go
• TATA box: TATxyT where x or y is A
  TAT(A.|.A)T               TATAAT  TATAgT  TATcAT
• Number: one or more digits with optional decimal point, exponent.
  \d+\.?\d*(E[+-]?\d+)?     3.14  6.02E+23
Another Example
Repressor binding sites in regular Python

# assume we have a genome sequence in string variable myDNA
for index in range(0, len(myDNA) - 20):
    if ((myDNA[index] == "A" or myDNA[index] == "G") and
        (myDNA[index+1] == "A" or myDNA[index+1] == "G") and
        (myDNA[index+2] == "A" or myDNA[index+2] == "G") and
        (myDNA[index+3] == "C") and
        (myDNA[index+4] == "C") and
        # ... and on and on! ...
        (myDNA[index+19] == "C" or myDNA[index+19] == "T")):
        print("Match found at", index)
        break

Example

re.findall(r"[AG]{3,3}CATG[TC]{4,4}[AG]{2,2}C[AT]TG[CT][CG][TC]", myDNA)
RegExprs in Python

http://docs.python.org/library/re.html
Simple RegExpr Testing
>>> import re
>>> str1 = 'what foot or hand fell fastest'
>>> re.findall(r'f[a-z]*', str1)
['foot', 'fell', 'fastest']
>>> str2 = "I lack e's successor"
>>> re.findall(r'f[a-z]*', str2)
[]

Returns list of all matching substrings. (Definitely recommend trying this with the examples to follow, & more.)

Exercise: change it to find strings starting with f and ending with t

Exercise: In honor of the winter Olympics, "-ski-ing"
• download & save war_and_peace.txt
• write a py program to read it line-by-line, use re.findall to see whether the current line contains one or more proper names ending in "...ski"; print each.
• mine begins:
['Bolkonski']
['Bolkonski']
['Bolkonski']
['Bolkonski']
['Bolkonski']
['Razumovski']
['Razumovski']
['Bolkonski']
['Spasski']
...
['Nesvitski', 'Nesvitski']

RegExpr Syntax

They're strings.
Most punctuation is special; it needs to be escaped by backslash (e.g., "\." instead of ".") to get non-special behavior.
So, "raw" string literals (r'C:\new\.txt') are generally recommended for regexprs, unless you double your backslashes judiciously.

Patterns “Match” Text

Pattern: TAT(A.|.A)T [a-z][a-z0-9]*

Text: RATATaAT TAT! count1


RegExpr Semantics, 1: Characters

RegExprs are patterns; they "match" sequences of characters.
Letters, digits (& escaped punctuation like '\.') match only themselves, just once.

r'TATAAT'            'ACGTTATAATGGTATAAT'

RegExpr Semantics, 2: Character Groups
Character groups [abc], [a-zA-Z], [^0-9] also match single characters, any of the characters in the group.
Shortcuts (2 of many):
.  -- (just a dot) matches any character (except newline)
\s ≡ [ \n\t\r\f\v]  ("s" for "space")

r'T[AG]T[^GC].T'     'ACGTTGTAATGGTATnCT'

Matching one of several alternatives

• Square brackets mean that any of the listed characters will do
• [ab] means either "a" or "b"
• You can also give a range:
• [a-d] means "a", "b", "c" or "d"
• Negation: caret means "not"

[^a-d]  # anything but a, b, c or d

RegExpr Semantics, 3: Concatenation, Or, Grouping
You can group subexpressions with parens.
If R, S are RegExprs, then
RS matches the concatenation of strings matched by R, S individually
R | S matches the union -- either R or S

r'TAT(A.|.A)T'       'TATCATGTATACTCCTATCCT'

RegExpr Semantics, 4: Repetition
If R is a RegExpr, then
R*      matches 0 or more consecutive strings (independently) matching R
R+      1 or more
R{n}    exactly n
R{m,n}  any number between m and n, inclusive
R?      0 or 1
Beware precedence (* > concat > |)

r'TAT(A.|.A)*T'      'TATCATGTATACTATCACTATT'

RegExprs in Python

By default:
Case sensitive, line-oriented (\n treated specially)
Matching is generally "greedy":
Finds longest version of earliest starting match
Next findall() match will not overlap

r".+\.py"     "Two files: hw3.py and upper.py."
r"\w+\.py"    "Two files: hw3.py and UPPER.py."

Exercise 3

Suppose “filenames” are upper or lower case


letters or digits, starting with a letter, followed
by a period (“.”) followed by a 3 character
extension (again alphanumeric). Scan a list of
lines or a file, and print all “filenames” in it,
without their extensions. Hint: use paren
groups.
Solution 3

import sys
import re

filename = sys.argv[1]
filehandle = open(filename, "r")
filecontents = filehandle.read()
myrule = re.compile(r"([a-zA-Z][a-zA-Z0-9]*)\.[a-zA-Z0-9]{3}")
# Finds skidoo.bar amidst 23skidoo.barber; ok?
match = myrule.findall(filecontents)
print(match)

Basics of regexp construction

• Letters and numbers match themselves
• Normally case sensitive
• Watch out for punctuation -- most of it has special meanings!

Wild cards

• "." means "any character"
• If you really mean "." you must use a backslash
• WARNING:
  - backslash is special in Python strings
  - It's special again in regexps
  - This means you need too many backslashes
  - We will use "raw strings" instead
  - Raw strings look like r"ATCGGC"

Using . and backslash

• To match file names like "hw3.pdf" and "hw5.txt":

hw.\....

Zero or more copies

• The asterisk repeats the previous character 0 or more times
• "ca*t" matches "ct", "cat", "caat", "caaat" etc.
• The plus sign repeats the previous character 1 or more times
• "ca+t" matches "cat", "caat" etc. but not "ct"

Repeats

• Braces are a more detailed way to indicate repeats
• A{1,3} means at least one and no more than three A's
• A{4,4} means exactly four A's

simple testing

>>> import re
>>> string = 'what foot or hand fell fastest'
>>> re.findall(r'f[a-z]*', string)
['foot', 'fell', 'fastest']
Practice problem 1

• Write a regexp that will match any string that starts with "hum" and ends with "001" with any number of characters, including none, in between
• (Hint: consider both "." and "*")

Practice problem 2

• Write a regexp that will match any Python (.py) file name.
• There must be at least one character before the "."
• ".py" is not a legal Python file name
• (Imagine the problems if you imported it!)

Using the regexp

First, compile it:

import re
myrule = re.compile(r".+\.py")
print(myrule)
<_sre.SRE_Pattern object at 0xb7e3e5c0>

The result of compile is a Pattern object which represents your regexp.

Using the regexp

Next, use it:

mymatch = myrule.search(myDNA)
print(mymatch)
None
mymatch = myrule.search(someotherDNA)
print(mymatch)
<_sre.SRE_Match object at 0xb7df9170>

The result of search is a Match object which represents the result.

All of these objects! What can they do?

Functions offered by a Pattern object:

• match() -- does it match the beginning of my string? Returns None or a Match object
• search() -- does it match anywhere in my string? Returns None or a Match object
• findall() -- does it match anywhere in my string? Returns a list of strings (or an empty list)
• Note that findall() does NOT return a Match object!

All of these objects! What can they do?

Functions offered by a Match object:

• group() -- return the string that matched
  group()    -- the whole string
  group(1)   -- the substring matching the 1st parenthesized sub-pattern
  group(1,3) -- tuple of substrings matching the 1st and 3rd parenthesized sub-patterns
• start() -- return the starting position of the match
• end() -- return the ending position of the match
• span() -- return (start, end) as a tuple

A practical example

Does this string contain a legal Python filename?

import re
myrule = re.compile(r".+\.py")
mystring = "This contains two files, hw3.py and uppercase.py."
mymatch = myrule.search(mystring)
print(mymatch.group())
This contains two files, hw3.py and uppercase.py
# not what I expected! Why?

Matching is greedy

• My regexp matches "hw3.py"
• Unfortunately it also matches "This contains two files, hw3.py"
• And it even matches "This contains two files, hw3.py and uppercase.py"
• Python will choose the longest match
• I could break my file into words first
• Or I could specify that no spaces are allowed in my match

A practical example

Does this string contain a legal Python filename?

import re
myrule = re.compile(r"[^ ]+\.py")
mystring = "This contains two files, hw3.py and uppercase.py."
mymatch = myrule.search(mystring)
print(mymatch.group())
hw3.py
allmymatches = myrule.findall(mystring)
print(allmymatches)
['hw3.py', 'uppercase.py']

Practice problem 3

• Create a regexp which detects legal Microsoft Word file names
• The file name must end with ".doc" or ".DOC"
• There must be at least one character before the dot.
• We will assume there are no spaces in the names
• Print out a list of all the legal file names you find
• Test it on testre.txt (on the web site)

Practice problem 4

• Create a regexp which detects legal Microsoft Word file names that do not contain any numerals (0 through 9)
• Print out the start location of the first such filename you encounter
• Test it on testre.txt

Practice problem

• Create a regexp which detects legal Microsoft Word file names that do not contain any numerals (0 through 9)
• Print out the "base name", i.e., the file name after stripping off the .doc extension, of each such filename you encounter. Hint: use parenthesized sub-patterns.
• Test it on testre.txt

Practice problem 1 solution

Write a regexp that will match any string that starts with "hum" and ends with "001" with any number of characters, including none, in between

myrule = re.compile(r"hum.*001")

Practice problem 2 solution

Write a regexp that will match any Python (.py) file name.

myrule = re.compile(r".+\.py")

# if you want to find filenames embedded in a bigger
# string, better is:
myrule = re.compile(r"[^ ]+\.py")
# this version does not allow whitespace in file names

Practice problem 3 solution

Create a regexp which detects legal Microsoft Word file names, and use it to make a list of them

import sys
import re

filename = sys.argv[1]
filehandle = open(filename, "r")
filecontents = filehandle.read()
myrule = re.compile(r"[^ ]+\.[dD][oO][cC]")
matchlist = myrule.findall(filecontents)
print(matchlist)

Practice problem 4 solution

Create a regexp which detects legal Microsoft Word file names which do not contain any numerals, and print the location of the first such filename you encounter

import sys
import re

filename = sys.argv[1]
filehandle = open(filename, "r")
filecontents = filehandle.read()
myrule = re.compile(r"[^ 0-9]+\.[dD][oO][cC]")
match = myrule.search(filecontents)
print(match.start())

Regular expressions summary

• The re module lets us use regular expressions
• These are fast ways to search for complicated strings
• They are not essential to using Python, but are very useful
• File format conversion uses them a lot
• Compiling a regexp produces a Pattern object which can then be used to search
• Searching produces a Match object which can then be asked for information about the match

Natural Language Processing
CSA4006

Dr. Anirban Bhowmick
Assistant Professor, VIT Bhopal

Lecture 4



Regular Expression

Split the string at every white-space character:

import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
['The', 'rain', 'in', 'Spain']

Split the string at the first white-space character only (maxsplit=1):

x = re.split(r"\s", txt, 1)
print(x)
['The', 'rain in Spain']

Replace all white-space characters with the digit "9":

x = re.sub(r"\s", "9", txt)
print(x)
The9rain9in9Spain


html_text = """
Regular Expression <!DOCTYPE html>
<html>
Write a Python program that removes all HTML tags <head>
from an HTML document. Create a function that takes <title>Sample HTML
an HTML string as input and returns the text content Document</title>
without any HTML tags. Use regular expressions to </head>
accomplish this, taking into account different tag <body>
attributes and formats. <h1>Welcome to my website!</h1>
<p>This is a sample HTML
10
document.</p>
</body>
</html>
"""
html_tag_pattern = r'<[^>]*>'
clean_text =
re.sub(html_tag_pattern, '',
html_text)
print(clean_text)

CSA4006-Dr. Anirban Bhowmick


Regular Expression

Write a Python program that validates a list of email addresses. Create a function that takes a list of email addresses as input and returns a list of valid email addresses. Use regular expressions to validate each email address according to common email address patterns.

import re

def validate_email_addresses(email_list):
    # Regular expression pattern for a valid email address
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    # List to store valid email addresses
    valid_emails = []
    for email in email_list:
        if re.match(email_pattern, email):
            valid_emails.append(email)
    return valid_emails

# Example usage (addresses are illustrative):
email_list = [
    "john.doe@example.com",
    "jane_smith@university.edu",
    "invalid-email",
    "another@example",
]

valid_emails = validate_email_addresses(email_list)
print("Valid Email Addresses:")
for email in valid_emails:
    print(email)

Finite-State Automata
The regular expression is more than just a convenient metalanguage for text searching. First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of computational work, and any regular expression can be implemented as a finite-state automaton. Symmetrically, any finite-state automaton can be described with a regular expression. Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. A third equivalent method of characterizing the regular languages is the regular grammar.


Finite-State Automata
Finite automata are simple abstract machines used to recognize patterns. Finite automata are also known as finite-state machines. A finite automaton is a mathematical model of a system with discrete inputs, outputs, states, and a set of transitions from state to state that occur on input alphabet symbols. In simple words, it has a set of states and rules for moving from one state to the next, depending on the input symbol. Formally, an FSA consists of:

Q: a finite set of states, represented by vertices.
Σ: a set of input symbols.
q0: the initial state, represented by an empty incoming arc.
F: the set of final states, represented by double circles.
δ: the transition function, represented by arcs.



Natural Language Processing
CSA4006

Dr. Anirban Bhowmick
Assistant Professor, VIT Bhopal

Lecture 5



Determinism and Non-Determinism
Deterministic: A Deterministic Finite Automaton (DFA) is a mathematical model and computational device used to recognize and accept a set of strings over a finite alphabet. It is a type of finite-state machine characterized by its deterministic nature, meaning that for each state and input symbol there is exactly one defined transition to another state.

Non-deterministic: There is a choice of several transitions that can be taken given a current state and input symbol. (The machine doesn't specify how to make the choice.)

Potential solutions:
• Save backup states at each choice point
• Look ahead in the input before making a choice
• Pursue alternatives in parallel
• Determinize our NFSAs (and then minimize)


Using an FSA to Recognize Sheeptalk
Let's begin with the "sheep language": the sheep language is any string from the following (infinite) set:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa!
...

The automaton is a directed graph with labeled nodes and arc transitions: five states (q0 the start state, q4 the final state) and five transitions.


Formally

State transition table for sheeptalk:

State      | b  | a  | !
-----------|----|----|----
q0         | q1 | -  | -
q1         | -  | q2 | -
q2         | -  | q3 | -
q3         | -  | q3 | q4
q4 (final) | -  | -  | -


Recognition and Rejection
The machine starts in the start state (q0) and iterates the following process: check the next letter of the input. If it matches the symbol on an arc leaving the current state, then cross that arc, move to the next state, and advance one symbol in the input. If we are in the accepting state (q4) when we run out of input, the machine has successfully recognized an instance of sheeptalk. If the machine never gets to the final state, either because it runs out of input, or it gets some input that doesn't match an arc, or it just happens to get stuck in some non-final state, we say the machine rejects or fails to accept the input.

(Figure: tape metaphor, showing a rejected input.)


D-Recognize
The algorithm is called D-RECOGNIZE for "deterministic recognizer". D-RECOGNIZE begins by setting the variable index to the beginning of the tape, and current-state to the machine's initial state. D-RECOGNIZE then enters a loop that drives the rest of the algorithm. It first checks whether it has reached the end of its input. If so, it either accepts the input (if the current state is an accept state) or rejects the input (if not).
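
A minimal Python sketch of D-RECOGNIZE for the sheeptalk automaton (the dictionary encoding of the transition table is my own; the textbook's pseudocode is the authoritative version):

# Transition table: states 0-4 over the alphabet {b, a, !}
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop: any number of extra a's
    (3, "!"): 4,
}
ACCEPTING = {4}

def d_recognize(tape):
    state = 0
    for symbol in tape:
        if (state, symbol) not in TRANSITIONS:
            return False          # no legal transition: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING     # end of input: accept only in a final state

print(d_recognize("baaa!"))  # True
print(d_recognize("ba!"))    # False: needs at least two a's
print(d_recognize("abc"))    # False: no transition out of q0 on 'a'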

D-Recognize
Before examining the beginning of the tape, the machine is in state q0. Finding a b on the input tape, it changes to state q1 as indicated by the contents of transition-table[q0,b]. It then finds an a and switches to state q2; another a puts it in state q3; a third a leaves it in state q3, where it reads the "!" and switches to state q4. Since there is no more input, the end-of-input condition at the beginning of the loop is satisfied for the first time and the machine halts in q4. State q4 is an accepting state, so the machine has accepted the string baaa! as a sentence in the sheep language. The algorithm will fail whenever there is no legal transition for a given combination of state and input. The input abc will fail to be recognized, since there is no legal transition out of state q0 on the input a. Even if the automaton had allowed an initial a, it would certainly have failed on c, since c isn't even in the sheeptalk alphabet! We can think of these "empty" elements in the table as if they all pointed at one "empty" state, which we might call the fail state or sink state.


Formal Language
A formal language is a set of strings, each string composed of symbols from a finite symbol set called an alphabet (the same alphabet used above for defining an automaton!). The alphabet for the sheep language is the set Σ = {a, b, !}. Given a model m (such as a particular FSA), we can use L(m) to mean "the formal language characterized by m". So the formal language defined by our sheeptalk automaton m is the infinite set:

L(m) = {baa!, baaa!, baaaa!, baaaaa!, ...}

The usefulness of an automaton for defining a language is that it can express an infinite set (such as the one above) in a closed form. Formal languages are not the same as natural languages, which are the kind of languages that real people speak. In fact, a formal language may bear no resemblance at all to a real language (e.g., a formal language can be used to model the different states of a soda machine). But we often use a formal language to model part of a natural language, such as parts of the phonology, morphology, or syntax. The term generative grammar is sometimes used in linguistics to mean a grammar of a formal language; the origin of the term is this use of an automaton to define a language by generating all possible strings.

Another Example
We can also have a higher-level alphabet consisting of words. In this way we can write finite-state automata that model facts about word combinations. For example, suppose we wanted to build an FSA that modeled the subpart of English dealing with amounts of money. Such a formal language would model the subset of English consisting of phrases like ten cents, three dollars, one dollar thirty-five cents, and so on.


Example

Fifty one dollars twenty two cents

q0 → q1 → q2 → q4 → q5 → q6 → q7


Module 2:
Morphology and Finite-State Transducers:
Inflectional Morphology - Derivational Morphology - Finite-State Morphological Parsing - The Lexicon and Morphotactics - Morphological Parsing with Finite-State Transducers - Combining FST Lexicon and Rules - Lexicon-free FSTs: The Porter Stemmer - Human Morphological Processing - Speech Sounds and Phonetic Transcription - The Phoneme and Phonological Rules

Introduction
Morphological parsing is a linguistic process that involves breaking down words into their constituent morphemes. Morphemes are the smallest units of meaning in a language and can be individual words or meaningful parts of words, such as prefixes, suffixes, and roots. Morphological parsing is an essential aspect of linguistic analysis, especially in languages with complex inflectional and derivational morphology, like many Indo-European languages.

A morphological parser must be able to distinguish between orthographic rules and morphological rules. Orthographic rules are general spelling rules used when breaking a word into its stem and modifiers. An example would be: singular English words ending with -y, when pluralized, end with -ies. Contrast this with morphological rules, which contain corner cases to these general rules. Both of these types of rules are used to construct systems that can do morphological parsing. Morphological rules tell us, for example, that the plural of goose is formed by changing the vowel.
Morphemes
Morphemes are the smallest units of meaning in a language. For example, the word fox consists of a single morpheme (the morpheme fox), while the word cats consists of two: the morpheme cat and the morpheme -s.

Number of morphemes in a word:
 One morpheme - nation
 Two morphemes - national (nation, -al)
 Three morphemes - nationalize (nation, -al, -ize)
 Four morphemes - denationalize (de-, nation, -al, -ize)


Morphemes
Morphemes can be classified into two main categories:

Free Morphemes (stems): complete words that can stand alone and carry meaning on their own (e.g., "book", "run"). Other examples: girl, cat, dog, little, bag.

Bound Morphemes (affixes): meaningful units that cannot stand alone and must be attached to a free morpheme to convey meaning. Bound morphemes include prefixes (e.g., "un-" in "undo"), suffixes (e.g., "-ed" in "walked"), and infixes (inserted inside a word, as in some Tagalog verb forms). A bound morpheme adds meaning when attached to a stem: -s in walks, re- in replay, -er in cheaper, im- in impossible, en- in enlighten, un- in unable.

Prefixes: impossible, reply, unhappy, confirm, compress
Suffixes: passion, ambition, unity, walking
Circumfixes: enlighten, embolden


Concatenative Morphology & Non Concatenative
Morphology
Prefixes and suffixes are often called concatenative morphology since a word is composed of a
number of morphemes concatenated together
 Circumfixes (Not in English)
 Eg: In German, for example
 The past participle of some verbs formed by adding ge to the beginning of the stem and t to the
end
 so the past participle of the verb sagen (to say) is gesagt (said). 24

A number of languages have extensive non concatenative morphology, in which morphemes are
combined in more complex ways
 Another kind of non concatenative morphology is called templatic morphology or root and pattern
morphology This is very common in Arabic, Hebrew, and other Semitic languages

CSA4006-Dr. Anirban Bhowmick


Non-Concatenative Morphology
In Hebrew, for example, a verb is constructed using two components: a root, usually consisting of three consonants and carrying the basic meaning, and a template, which gives the ordering of consonants and vowels and specifies more semantic information about the resulting verb, such as the semantic voice (e.g., active, passive, middle).

The Hebrew tri-consonantal root lmd, meaning 'learn' or 'study', can be combined with:
 the active-voice CaCaC template to produce the word lamad, 'he studied';
 the intensive CiCeC template to produce the word limed, 'he taught';
 the intensive passive CuCaC template to produce the word lumad, 'he was taught'.




Natural Language Processing
CSA4006

Dr. Anirban Bhowmick
Assistant Professor, VIT Bhopal

Lecture 6

Syllabus

Syllabus
3
Module 2:
Morphology And Finite-State Transducers:
Inflectional Morphology -Derivational
Morphology- Finite-State Morphological Parsing-
The Lexicon and Morphotactics - Morphological 4

Parsing with Finite-State Transducers-


Combining FST Lexicon and Rules- Lexicon-free
FSTs: The Porter Stemmer- Human
Morphological Processing- Speech Sounds
and Phonetic Transcription- The Phoneme and
Phonological Rules
Text Books:
Daniel Jurafsky and James H. Martin "Speech and
Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics and
Speech recognition", Prentice Hall, 2nd edition, 2008. 5
Reference Books:
1. Roland R. Hausser "Foundations of Computational
Linguistics: Human- Computer Communication in
Natural Language", Paperback, MIT Press, 2011.
2. Christopher D. Manning and Hinrich Schuetze,
6
"Foundations of Statistical Natural Language
Processing" by MIT Press.
Module 1:
Introduction

Topic: Regular Expression


Review

CSA4006-Dr. Anirban Bhowmick


Module 2:
Morphology And Finite-State Transducers:
Inflectional Morphology -Derivational
Morphology- Finite-State Morphological Parsing-
The Lexicon and Morphotactics - Morphological 9

Parsing with Finite-State Transducers-


Combining FST Lexicon and Rules- Lexicon-free
FSTs: The Porter Stemmer- Human
Morphological Processing- Speech Sounds
and Phonetic Transcription- The Phoneme and
Phonological Rules
Introduction
Morphological parsing is a linguistic process that involves breaking down words into their constituent
morphemes. Morphemes are the smallest units of meaning in a language and can be individual words or
meaningful parts of words, such as prefixes, suffixes, and roots. Morphological parsing is an essential
aspect of linguistic analysis, especially in languages with complex inflectional and derivational
morphology, like many Indo-European languages.
It must be able to distinguish between orthographic rules and morphological rules.

10
Orthographic rules are general rules used when breaking a word into its stem and modifiers. An
example would be: singular English words ending with -y, when pluralized, end with -ies. Contrast this
to morphological rules which contain corner cases to these general rules. Both of these types of rules
are used to construct systems that can do morphological parsing

Morphological rules tell us the plural of goose is formed by changing the vowel.

CSA4006-Dr. Anirban Bhowmick


Morphemes
Morphemes: Morphemes are the smallest units of meaning in a language.

For example the word fox consists of a single morpheme (the morpheme fox) while the word cats
consists of two the morpheme cat and the morpheme s

Types of Morpheme:
11

One Morpheme –Nation


 Two Morpheme –National (nation, al)
 Three Morpheme –Nationalize (nation, al, ize)
 Four Morpheme –Denationalize (de, nation, al, ize)

CSA4006-Dr. Anirban Bhowmick


Morphemes
Morphemes can be classified into two main categories:

Free Morphemes (stem): These are complete words that can stand alone and carry meaning on their
own (e.g., "book," "run").
Bound Morphemes (affixes): These are meaningful units that cannot stand alone and must be attached
to a free morpheme to convey meaning. Bound morphemes include prefixes (e.g., "un-" in "undo"),
suffixes (e.g., "-ed" in "walked"), and infixes (inserted inside a word, like in some Tagalog verb forms). 12

Bound morphemes give meaning only when added to another morpheme: -s in walks, re- in replay, -er in cheaper, im- in impossible, en- in enlighten, un- in unable. Free morphemes stand alone as words: girl, cat, dog, little, book, bag.

Prefixes: impossible, reply, unhappy, confirm, compress
Suffixes: Passion, Ambition, Unity, Walking
Circumfixes: enlighten, embolden

Concatenative Morphology & Non-Concatenative Morphology
Prefixes and suffixes are often called concatenative morphology, since a word is composed of a number of morphemes concatenated together.
- Circumfixes (not in English). In German, for example, the past participle of some verbs is formed by adding ge- to the beginning of the stem and -t to the end, so the past participle of the verb sagen (to say) is gesagt (said).

A number of languages have extensive non-concatenative morphology, in which morphemes are combined in more complex ways.
- One kind of non-concatenative morphology is called templatic morphology, or root-and-pattern morphology. This is very common in Arabic, Hebrew, and other Semitic languages.
Non-Concatenative Morphology
In Hebrew, for example, a verb is constructed using two components: a root, consisting usually of three consonants and carrying the basic meaning, and a template, which gives the ordering of consonants and vowels and specifies more semantic information about the resulting verb, such as the semantic voice (e.g., active, passive, middle).

The Hebrew tri-consonantal root lmd, meaning 'learn' or 'study', can be combined with:
- the active voice CaCaC template to produce the word lamad, 'he studied'
- the intensive CiCeC template to produce the word limed, 'he taught'
- the intensive passive template CuCaC to produce the word lumad, 'he was taught'


Morphemes
Two broad classes of ways to form words from morphemes:
– Inflection: the combination of a word stem with a grammatical morpheme, usually resulting in a
word of the same class as the original stem, and usually filling some syntactic function like agreement

For example, English has the inflectional morpheme -s for marking the plural on nouns, and the
inflectional morpheme -ed for marking the past tense on verbs.

The meaning of the resulting word is easily predictable.

– Derivation: the combination of a word stem with a grammatical morpheme, usually resulting in a
word of a different class, often with a meaning hard to predict exactly.

For example the verb computerize can take the derivational suffix -ation to produce the noun
computerization.

Inflection
In English, only nouns, verbs, and sometimes adjectives can be inflected, and the number of affixes
is quite small.

English nouns have only two kinds of inflection: an affix that marks plural and an affix that marks
possessive. For example, many (but not all) English nouns can either appear in the bare stem or
singular form, or take a plural suffix. Here are examples of the regular plural suffix -s (also spelled -es),
and irregular plurals (e.g., regular cat/cats, thrush/thrushes; irregular mouse/mice, ox/oxen).

Inflection
(Table of regular and irregular English verb inflections omitted.) The irregular verbs are those that have some more or less idiosyncratic forms of inflection.
Inflection
An irregular verb can inflect in the past form (also called the preterite) by changing its vowel (eat/ate), or
its vowel and some consonants (catch/caught), or with no ending at all (cut/cut).

- The -s form is used in the 'habitual present' form to distinguish the 3rd person singular ending (She jogs every Tuesday) from the other choices of person and number (I/you/we/they jog every Tuesday).
- The stem form is used in the infinitive form, and also after certain other verbs (I'd rather walk home, I want to walk home).
- The -ing participle is used when the verb is treated as a noun, called a gerund use. E.g.: Fishing is fine if you live near water.
- The -ed participle is used in the perfect construction (He's eaten lunch already) or the passive construction (The verdict was overturned yesterday).
Inflection
- A single consonant letter is doubled before adding the -ing and -ed suffixes (beg/begging/begged).
- If the final letter is c, the doubling is spelled ck (picnic/picnicking/picnicked).
- If the base ends in a silent e, it is deleted before adding -ing and -ed (merge/merging/merged).
- Just as for nouns, the -s ending is spelled -es after verb stems ending in -s (toss/tosses), -z (waltz/waltzes), -sh (wash/washes), -ch (catch/catches), and sometimes -x (tax/taxes).
- Also like nouns, verbs ending in y preceded by a consonant change the y to i (try/tries).
Derivation Morphology
- Derivation in English is quite complex. It is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly.
- A very common kind of derivation in English is the formation of new nouns, often from verbs or adjectives. This process is called nominalization.
- For example, the suffix -ation produces nouns from verbs ending often in the suffix -ize (computerize/computerization).

Adjectives can also be derived from nouns and verbs.
Derivation Morphology
Derivation in English is more complex than inflection because:
– It is generally less productive: a nominalizing affix like -ation cannot be added to absolutely every verb (*eatation).
– There are subtle and complex meaning differences among nominalizing suffixes. For example, sincerity has a subtle difference in meaning from sincereness.
Morphological parsing
Breaking down words into components and building a structured representation.
– English:
● cats → cat +N +Pl
● caught → catch +V +Past
– Spanish:
● vino (came) → venir +V +Perf +3P +Sg
● vino (wine) → vino +N +Masc +Sg

Importance:
● Information retrieval – normalize verb tenses, plurals, grammar cases
● Machine translation – translation based on the stem

Finite States Morphological Parsing
(Figure omitted: parsing English morphology.)
Finite States Morphological Parsing
We need at least the following to build a morphological parser:
1. Lexicon: the list of stems and affixes, together with basic information about them (Noun stem or
Verb stem, etc.)
2. Morphotactics: the model of morpheme ordering that explains which classes of morphemes can
follow other classes of morphemes inside a word. E.g., the rule that English plural morpheme follows
the noun rather than preceding it.
3. Orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (e.g., the y→ie spelling rule changes city + -s to cities).
The Lexicon and Morphotactic
A lexicon is a repository for words.
– The simplest one would consist of an explicit list of every word of the language. Inconvenient or
impossible!
– Computational lexicons are usually structured with
• a list of each of the stems and
• affixes of the language, together with a representation of morphotactics telling us how they can fit together.
– The most common way of modeling morphotactics is the finite-state automaton.
The Lexicon and Morphotactic
(Figure omitted: an FSA modeling English nominal inflection.)
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 7

Module 2:
Morphology

Topic: FSA and FST

Review
The Lexicon and Morphotactic
English derivational morphology is more complex than English inflectional morphology, and so automata for modeling English derivation tend to be quite complex.
– Some models are even based on CFGs.
• Below is a small part of the morphosyntactics of English adjectives.
In a Finite State Automaton (FSA), epsilon (ε) transitions are used to represent transitions between states without consuming any input symbol. These transitions are also known as "null" or "empty" transitions. When you have two states with an epsilon transition between them, it means that you can move from one state to the other without reading any input symbol.

The Lexicon and Morphotactic
The FSA #1 recognizes all the listed adjectives, but also ungrammatical forms like unbig, redly, and realest.
• Thus #1 is revised to become #2.
• This complexity is expected from English derivational morphology.
The Lexicon and Morphotactic
We can now use these FSAs to solve the problem of morphological recognition: determining whether an input string of letters makes up a legitimate English word or not.
– We do this by taking the morphotactic FSAs and plugging each "sub-lexicon" into the FSA.
– The resulting FSA can then be defined at the level of the individual letter.
Finite-state transducers (FST)
An FST is a type of FSA which maps between two sets of symbols.
● It is a two-tape automaton that recognizes or generates pairs of strings, one from each tape.
● An FST defines relations between sets of strings.

Given the input, for example, cats, we would like to produce cat +N +PL.
• Two-level morphology, by Koskenniemi (1983):
– representing a word as a correspondence between a lexical level, representing a simple concatenation of the morphemes making up the word, and
– the surface level, representing the actual spelling of the final word.
• Morphological parsing is implemented by building mapping rules that map letter sequences like cats on the surface level into morpheme and feature sequences like cat +N +PL on the lexical level.
Finite-state transducers (FST)

The automaton we use for performing the mapping between these two levels is the finite-state transducer or FST.
– A transducer maps between one set of symbols and another;
– An FST does this via a finite automaton.
• Thus an FST can be seen as a two-tape automaton which recognizes or generates pairs of
strings.
• The FST has a more general function than an FSA:
– An FSA defines a formal language
– An FST defines a relation between sets of strings.
• Another view of an FST:
– A machine reads one string and generates another.
FST
FST as recognizer:
– a transducer that takes a pair of strings as input and outputs accept if the string-pair is in the string-pair language, and reject if it is not.
FST as generator:
– a machine that outputs pairs of strings of the language. Thus the output is
a yes or no, and a pair of output strings.
15
FST as transducer:
– A machine that reads a string and outputs another string.
FST as set relater:
– A machine that computes relation between sets.

FST

A formal definition of an FST (based on the Mealy machine extension to a simple FSA):
– Q: a finite set of N states q0, q1, …, qN
– Σ: a finite alphabet of complex symbols. Each complex symbol is composed of an input-output pair i:o, with one symbol i from an input alphabet I and one symbol o from an output alphabet O; thus Σ ⊆ I×O. I and O may each also include the epsilon symbol ε.
– q0: the start state
– F: the set of final states, F ⊆ Q
– δ(q, i:o): the transition function or transition matrix between states. Given a state q ∈ Q and a complex symbol i:o ∈ Σ, δ(q, i:o) returns a new state q′ ∈ Q. δ is thus a relation from Q × Σ to Q.

FST
• FSAs are isomorphic to regular languages; FSTs are isomorphic to regular relations.
• Regular relations are sets of pairs of strings, a natural extension of regular languages, which are sets of strings.
• FSTs are closed under union, but generally they are not closed under difference, complementation, and intersection.
• Two useful closure properties of FSTs:
– Inversion: if T maps from I to O, then the inverse of T, T⁻¹, maps from O to I.
– Composition: if T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1 ∙ T2 maps from I1 to O2.
• Inversion is useful because it makes it easy to convert an FST-as-parser into an FST-as-generator.
• Composition is useful because it allows us to take two transducers that run in series and replace them with one complex transducer.
– T1 ∙ T2 (S) = T2(T1(S))
FST
(Figure: the composition of [a:b]+ with [b:c]+ to produce [a:c]+.)
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 8

Module 2:
Morphology

Topic: FSA and FST
FST
(Figure: a transducer for English nominal number inflection, T_num.)
(Figure: the transducer T_stems, which maps roots to their root class.)
(Figure: a fleshed-out English nominal inflection FST, T_lex = T_num ∙ T_stems.)
Orthographic Rules and FSTs

These spelling changes can be thought of as taking as input a simple concatenation of morphemes and producing as output a slightly modified concatenation of morphemes.
Orthographic Rules and FSTs
We note that concatenating the morphemes works to parse words like "dog", "cat", and "fox", but this simple method does not work when there is a spelling change: "foxes" is to be parsed into the lexical form "fox +N +PL", "cats" into "cat +N +PL", etc. This requires the introduction of spelling rules (also called orthographic rules). To account for the spelling rules, we introduce another tape, called the intermediate tape, which carries the slightly modified output, thus going from 2-level to 3-level morphology. Such a rule maps from the intermediate tape to the surface tape. For plural nouns, the rule states: "insert e on the surface tape just when the lexical tape has a morpheme ending in x, z, or s and the next morpheme is -s". Examples are box to boxes, and fox to foxes. The rule is stated as:

ε → e / {x, s, z} ^ __ s#

The above notation is called Chomsky and Halle notation. A rule of the form a → b / c __ d means "rewrite a as b when it occurs between c and d". Since the symbol ε is null, replacing it means inserting something. The symbol ^ indicates a morpheme boundary. These boundaries are deleted by including the symbol ^:ε in the default pairs for the transducer.

Orthographic Rules and FSTs

● Lexical: fox +N +Pl
● Intermediate: fox^s#
● Surface: foxes

(Figure: the transducer for the E-insertion rule.)
Combining FST Lexicon and Rules
By running these multi-level FSTs in sequence between different tapes, together with parallel transducers for the spelling rules, we are able to parse those words whose morphological analysis is simple.

However, consider the sentence "The police books the right culprit". Here it is not clear from the above rules whether the lexical parser's output should be "book +N +PL" or "book +V +3SG". To a human, however, it is not difficult to infer that it is the second. This is due to the ambiguity in the word, which may be a noun or a verb depending on its position in the sentence. This type of ambiguity is called part-of-speech ambiguity, and resolving it requires looking at the surrounding context rather than the word alone.
Combining FST Lexicon and Rules
(Figure omitted: the lexicon FST cascaded with the parallel orthographic-rule FSTs between the lexical, intermediate, and surface tapes.)
Lexicon-Free FSTs: the Porter Stemmer
• Used in information retrieval.
• One of the most widely used stemming algorithms is the simple and efficient Porter (1980) algorithm, which is based on a series of simple cascaded rewrite rules, e.g.:
– ATIONAL → ATE (e.g., relational → relate)
– ING → ε if the stem contains a vowel (e.g., motoring → motor)
• Problem:
– Not perfect: error of commission, omission
• Experiments have been made
– Some improvement with smaller documents
– Any improvement is quite small

Human Morphological Processing
Psychological studies have been used to learn how multi-morphemic words are represented in the minds of speakers of English.
For example, consider the word walk and its inflected forms walks, and walked. Are all three in the
human lexicon? Or merely walk along with -ed and -s?

How about the word happy and its derived forms happily and happiness?
The full listing hypothesis proposes that all words of a language are listed in the mental lexicon
without any internal morphological structure

• Morphological structure is simply an epiphenomenon, and walk, walks, walked, happy, and happily
are all separately listed in the lexicon
The minimum redundancy hypothesis suggests that only the constituent morphemes are
represented in the lexicon, and when processing walks (whether for reading, listening, or talking) we
must access both morphemes (walk and s) and combine them

Human Morphological Processing
Some of the earliest evidence that the human lexicon represents at least some morphological
structure comes from speech errors
easy enoughly (for “easily enough”)

More recent experimental evidence suggests that neither the full listing nor the minimum redundancy
hypotheses may be completely true. Instead, it’s possible that some but not all morphological
relationships are mentally represented

For example, it has been found that derived forms (such as happily) are stored separately from their stem (happy), but that regularly inflected forms (such as walks) are not distinct in the lexicon from their stems.
Marslen-Wilson et al. (1994) found that spoken derived words can prime their stems, but only if the meaning of the derived form is closely related to the stem.
• For example, government primes govern, but department does not prime depart.

SPEECH SOUNDS AND PHONETIC TRANSCRIPTION
• The fundamental insights and algorithms necessary to understand modern speech recognition and
speech synthesis technology, and the related branch of linguistics called computational phonology

• Core tasks:
– Speech recognition: acoustic waveform → output a string of words
– Text-to-speech synthesis: sequence of text words → output an acoustic waveform

A speech recognition system needs to have a pronunciation for every word it can recognize, and a
text-to-speech system needs to have a pronunciation for every word it can say

Contd.
• The science of phonetics aims to describe all the sounds of all the world’s languages

– Acoustic phonetics: focuses on the physical properties of the sounds of language

– Auditory phonetics: focuses on how listeners perceive the sounds of language

– Articulatory phonetics: focuses on how the vocal tract produces the sounds of language

- Phonetic alphabets: how pronunciation is transcribed, part of the field of phonetics
- Articulatory phonetics: how sounds are produced by the articulators in the mouth
- Phonological rules: the systematic ways in which sounds are differently realized in different contexts
- Computational phonology: the study of computational mechanisms for modeling phonological rules
- Phonological learning: how phonological rules can be automatically induced by machine learning algorithms

IPA and ARPABET-vowel
The International Phonetic Alphabet (IPA) and the
ARPABET are two systems used to represent the
sounds of spoken language. They provide a
standardized way to transcribe the sounds of speech,
which can be useful for linguists, phoneticians, and
language learners. Here are examples of both IPA and
ARPABET transcriptions for English words:

(Table of IPA and ARPABET vowel symbols omitted.)

IPA and ARPABET-consonant
(Table of IPA and ARPABET consonant symbols omitted.)
The Vocal Organs
- Articulatory phonetics is the study of how phones are produced, as the various organs in the mouth, throat, and nose modify the airflow from the lungs.
- Sound is produced by the rapid movement of air.
- Most sounds in human languages are produced by expelling air from the lungs through the windpipe (technically the trachea) and then out the mouth or nose.
- As it passes through the trachea, the air passes through the larynx, commonly known as the Adam's apple or voicebox.
- The larynx contains two small folds of muscle, the vocal folds (often referred to non-technically as the vocal cords), which can be moved together or apart.
- The space between these two folds is called the glottis.
Vocal Organ
Most speech sounds are produced by pushing air through the vocal cords.
– Glottis = the opening between the vocal cords
– Larynx = ‘voice box’

– Pharynx = tubular part of the throat above the larynx

– Oral cavity = mouth

– Nasal cavity = nose and the passages connecting it to the throat and sinuses
Phones are divided into two main classes:
– Consonants are made by restricting or blocking the airflow in some way, and may be voiced or unvoiced
– Vowels have less obstruction, are usually voiced, and are generally louder and longer lasting than consonants
• Both kinds of sounds are formed by the motion of air through the mouth, throat, or nose

CSA4006-Dr. Anirban Bhowmick


30

EEE1001-Dr. Anirban Bhowmick


Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 9
Module 2:
Morphology

Topic: Speech Sounds and Phonetic Transcription
Consonants: Place of Articulation
Consonants are sounds produced with some restriction or closure in the vocal tract
• Consonants are classified based in part on where in the vocal tract the airflow is being restricted (the
place of articulation)
• The major places of articulation are bilabial, labiodental, interdental, alveolar, palatal, velar, uvular,
and glottal

Consonants: Place of Articulation
1.Bilabial: The airflow is obstructed by bringing both lips together.
Example: /p/ in "pat," /b/ in "bat," /m/ in "mat."
2.Labiodental: The airflow is obstructed by placing the upper teeth against the lower lip.
Example: /f/ in "fan," /v/ in "van."
3.Interdental: The airflow is obstructed by placing the tip of the tongue between the teeth.
Example: /θ/ in "think," /ð/ in "this."
4.Alveolar: The airflow is obstructed by raising the front part of the tongue to the alveolar ridge, which is the bony
ridge just behind the upper front teeth.
Example: /t/ in "top," /d/ in "dog," /s/ in "sock."
5.Alveopalatal (or Palatoalveolar): The airflow is obstructed by raising the front part of the tongue to the area just
behind the alveolar ridge.
Example: /ʃ/ in "shoe," /ʒ/ in "measure," /tʃ/ in "cheese," /dʒ/ in "judge."
6.Palatal: The airflow is obstructed by raising the middle part of the tongue to the hard palate, which is the roof of the
mouth right behind the alveolar ridge.
Example: /j/ in "yes," /ʎ/ in some dialects of Spanish.
7.Velar: The airflow is obstructed by raising the back part of the tongue to the soft part of the palate (the velum).
Example: /k/ in "cat," /g/ in "go," /ŋ/ in "sing."
8.Glottal: The airflow is obstructed by closing or nearly closing the space between the vocal cords in the larynx.
Example: /h/ in "hat," the glottal stop /ʔ/ in some dialects, as in "uh-oh."

Consonants: Manner of Articulation
Consonants can also be classified by their manner of articulation, which describes how the airflow is
obstructed or modified as they are produced. Here are some common manners of articulation for
consonants with examples:

Plosive (or Stop): These consonants are produced by a complete closure of the vocal tract, causing a
momentary halt in the airflow before releasing it.

Example: /p/ in "pat," /b/ in "bat," /t/ in "top," /d/ in "dog," /k/ in "cat," /g/ in "go.“

Fricative: Fricatives are produced by narrowing the vocal tract, creating turbulent airflow and a continuous,
hissing sound.

Example: /f/ in "fan," /v/ in "van," /s/ in "sock," /z/ in "zebra," /ʃ/ in "shoe," /ʒ/ in "measure."

Affricate: Affricates begin with a stop-like closure and then transition into a fricative sound.

Example: /tʃ/ in "cheese," /dʒ/ in "judge."

Contd.
Nasal: Nasal consonants are produced by lowering the velum (soft part of the roof of the mouth),
allowing air to flow through the nasal cavity.

Example: /m/ in "mat," /n/ in "net," /ŋ/ in "sing."


Liquid: Liquids involve a relatively free airflow, with slight constriction in the vocal tract.

Lateral Liquid: /l/ in "let."


Retroflex Liquid: /ɹ/ in "red" (Note: The pronunciation of this sound can vary regionally.)
Glide (Semivowel): Glides are produced with a slight constriction in the vocal tract but are more
vowel-like in nature.

Example: /j/ in "yes," /w/ in "we."


Approximant: Approximants have a less constricted airflow than fricatives but more than glides.

Example: /ɹ/ in "red" (in some dialects), /ʋ/ in some languages.


These are the main manners of articulation for consonants.

Vowel

Vowels are classified by how high or low the tongue is, whether the tongue is in the front or back of the mouth, and whether or not the lips are rounded.
High vowels: [i] [ɪ] [u] [ʊ]
Mid vowels: [e] [ɛ] [o] [ə] [ʌ] [ɔ]
Low vowels: [æ] [a]
Front vowels: [i] [ɪ] [e] [ɛ] [æ]
Central vowels: [ə] [ʌ]
Back vowels: [u] [ʊ] [o] [ɔ] [a]

Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 10

Module 3:
Syntax Parsing: Tagsets for English - Part of Speech Tagging - Rule-based Part-of-Speech Tagging - Stochastic Part-of-Speech Tagging - Transformation-Based Tagging - Context-Free Grammars for English - Context-Free Rules and Trees - The Noun Phrase - The Verb Phrase and Subcategorization - Grammar Equivalence & Normal Form - Finite State & Context-Free Grammars

Topic: Introduction
Tagsets for English
There are a small number of popular tagsets for English.
- The choice of tagset depends on the nature of the application:
  - small tagsets (more general)
  - large tagsets (finer tags)
- Some of the widely used part-of-speech tagsets are:
  - the 45-tag Penn Treebank tagset
  - the 87-tag tagset used for the Brown corpus
  - the medium-sized 61-tag C5 tagset
  - the 146-tag C7 tagset
- Each word is associated with a tag from the tagset used.
Some Common Tagsets (English)
Penn Treebank Tagset:
This is one of the most widely used tagsets in natural language processing and linguistics. It
was developed for the Penn Treebank project and includes tags like NN (Noun), VB (Verb),
JJ (Adjective), RB (Adverb), and more. It's known for its detailed granularity.
Universal POS Tagset:
The Universal POS Tagset is designed to be more cross-linguistic and universal, making it
easier to work with multilingual data. It includes tags like NOUN, VERB, ADJ, ADV, and
others, providing a simpler and more consistent set of labels compared to the Penn Treebank
Tagset.
Brown Corpus Tagset:
The Brown Corpus is a well-known linguistic corpus, and it has its own tagset. It includes tags
like N (Noun), V (Verb), ADJ (Adjective), and others. It's used primarily for linguistic research.
CLAWS Tagset:
The CLAWS (Constituent Likelihood Automatic Word-tagging System) Tagset is designed to
be a detailed and linguistically motivated tagset. It includes a wide range of tags to capture
grammatical and syntactic information.
Some Common Tagsets (English)
Lancaster-Oslo/Bergen (LOB) Tagset:
The LOB Corpus is another linguistic corpus, and it has its own tagset. It's used primarily in
corpus linguistics and includes tags like NN (Noun), VB (Verb), JJ (Adjective), and more.
Medical Subject Headings (MeSH) Tagset:
This tagset is specific to the medical domain and is used for indexing and categorizing
medical texts. It includes tags like A1.4.1 (Anatomy), D1.1.1 (Diseases), and others.
OntoNotes Tagset:
The OntoNotes project developed a tagset for annotating a wide range of linguistic
information, including part of speech, named entities, and syntactic structures. It's used in
various natural language processing tasks.
Google Universal Dependencies Tagset:
Google's Universal Dependencies project aims to provide universal grammatical relations
and dependency labels for multiple languages, including English. It includes tags like NOUN,
VERB, ADJ, and more, similar to the Universal POS Tagset.

POS tagging
- Part-of-speech tagging (or just tagging for short) is the process of assigning a part of speech or other lexical class marker to each word in a corpus.
- Tags are also usually applied to punctuation markers; thus tagging for natural language is the same process as tokenization for computer languages, although tags for natural languages are much more ambiguous.
- Even in simple examples, automatically assigning a tag to each word is not trivial.
- For example, book is ambiguous: it has more than one possible usage and part of speech. It can be a verb (as in book that flight or to book the suspect) or a noun (as in hand me that book, or a book of matches).
- Similarly, that can be a determiner (as in Does that flight serve dinner), or a complementizer (as in I thought that your flight was earlier).
Multiple POS
- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Olga's not looking forward to going back to school in September. = RB
  - Promised to back the bill = VB
- The POS tagging problem is to determine the POS tag for a particular instance of a word.
- The problem of POS tagging is to resolve these ambiguities, choosing the proper tag for the context.
- Part-of-speech tagging is thus one of the many disambiguation tasks.
How Hard is POS Tagging? Measuring Ambiguity
(Table omitted: the distribution of tag ambiguity over word types in the Brown corpus.)
Methods for POS Tagging
- Rule-based tagging uses hand-written rules, e.g. ENGTWOL (ENGlish TWO Level analysis).
- Stochastic tagging uses probabilistic sequence models: HMM (Hidden Markov Model) tagging, MEMMs (Maximum Entropy Markov Models).
- Transformation-based tagging uses rules learned automatically.
Rule based POS tagging
- The first stage uses a dictionary to assign each word a list of potential parts of speech.
- The second stage uses large lists of hand-written disambiguation rules to winnow down this list to a single part of speech for each word.
- These taggers are knowledge-driven taggers.
- The rules in rule-based POS tagging are built manually.
- The information is coded in the form of rules.
- There is a limited number of rules, approximately around 1000.
- Smoothing and language modeling are defined explicitly in rule-based taggers.
Rule based POS tagging
Rule-based POS taggers can be relatively simple to implement and are often used as a starting point for
more complex machine learning-based taggers. However, they can be less accurate and less efficient than
machine learning-based taggers, especially for tasks with large or complex datasets

Here is an example of how a rule-based POS tagger might work:


Define a set of rules for assigning POS tags to words. For example:

If the word ends in "-tion," assign the tag "noun."

If the word ends in “-ment,” assign the tag “noun.”


If the word is all uppercase, assign the tag “proper noun.”
If the word is a verb ending in “-ing,” assign the tag “verb.”

Iterate through the words in the text and apply the rules to each word in turn. For example:
“Nation” would be tagged as “noun” based on the first rule.
“Investment” would be tagged as “noun” based on the second rule.
“UNITED” would be tagged as “proper noun” based on the third rule.
“Running” would be tagged as “verb” based on the fourth rule.

ENGTWOL –rule based tagger
- Uses a two-level lexicon transducer
- Uses hand-crafted rules (about 1,100 rules)
- Process: start with a dictionary
(Figure omitted: sample ENGTWOL dictionary entries.)
ENGTWOL –rule based tagger
Example rule: eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP".
Stochastic Probabilistic sequence models
HMM (Hidden Markov Model) is a Stochastic technique for POS tagging. Hidden Markov
models are known for their applications to reinforcement learning and temporal pattern
recognition such as speech, handwriting, gesture recognition, musical score following, partial
discharges, and bioinformatics.
Let us consider an example proposed by Dr. Luis Serrano and find out how an HMM selects an appropriate tag sequence for a sentence.

Training data:
Mary Jane can see Will
Spot will see Mary
Will Jane spot Mary?
Mary will pat Spot
HMM-POS Tagging
Words | Noun | Modal | Verb
Mary  | 4    | 0     | 0
Jane  | 2    | 0     | 0
Will  | 1    | 3     | 0
Spot  | 2    | 0     | 1
Can   | 0    | 1     | 0
See   | 0    | 0     | 2
pat   | 0    | 0     | 1
HMM-POS Tagging
Now let us divide each column by the total number of appearances of that tag. For example, 'noun' appears nine times in the above sentences, so divide each term in the noun column by 9. We get the following table after this operation:

Words | Noun | Modal | Verb
Mary  | 4/9  | 0     | 0
Jane  | 2/9  | 0     | 0
Will  | 1/9  | 3/4   | 0
Spot  | 2/9  | 0     | 1/4
Can   | 0    | 1/4   | 0
See   | 0    | 0     | 2/4
pat   | 0    | 0     | 1/4

These are the emission probabilities.
HMM-POS Tagging
Next, we have to calculate the transition probabilities, so define two more tags, <S> and <E>. <S> is placed at the beginning of each sentence and <E> at the end. Counting how often each tag follows each other tag gives:

    | N | M | V | <E>
<S> | 3 | 1 | 0 | 0
N   | 1 | 3 | 1 | 4
M   | 1 | 0 | 3 | 0
V   | 4 | 0 | 0 | 0

In the above table, we can see that the <S> tag is followed by the N tag three times, thus the first entry is 3. The Modal tag follows <S> just once, thus the second entry is 1. In a similar manner, the rest of the table is filled.
Next, we divide each term in a row of the table by the total number of co-occurrences of the tag in consideration. For example, the Modal tag is followed by another tag four times, thus we divide each element in the third row by four.

HMM-POS Tagging

The resulting transition probabilities:

    | N   | M   | V   | <E>
<S> | 3/4 | 1/4 | 0   | 0
N   | 1/9 | 3/9 | 1/9 | 4/9
M   | 1/4 | 0   | 3/4 | 0
V   | 4/4 | 0   | 0   | 0
HMM-POS Tagging
Take a new sentence and tag it with the wrong tags. Let the sentence 'Will can spot Mary' be tagged as:
Will as a modal
Can as a verb
Spot as a noun
Mary as a noun

Now calculate the probability of this sequence being correct in the following manner. The probability that the tag Modal (M) comes after the tag <S> is 1/4, as seen in the table. Also, the probability that the word Will is a Modal is 3/4. In the same manner, we calculate each and every probability in the graph. The product of these probabilities is the likelihood that this sequence is right. Since the tags are not correct, the product is zero:

1/4 * 3/4 * 3/4 * 0 * 1 * 2/9 * 1/9 * 4/9 * 4/9 = 0
HMM-POS Tagging
When these words are correctly tagged (Will/N, can/M, spot/V, Mary/N), we get a probability greater than zero. Calculating the product of these terms we get:

3/4 * 1/9 * 3/9 * 1/4 * 3/4 * 1/4 * 1 * 4/9 * 4/9 = 0.00025720164
HMM-POS Tagging
For our example, considering just the three POS tags we have mentioned, 81 different combinations of tags can be formed (3 tags for each of 4 words, 3^4 = 81). In this case, calculating the probabilities of all 81 combinations seems achievable. But when the task is to tag a larger sentence and all the POS tags of the Penn Treebank project are taken into consideration, the number of possible combinations grows exponentially, and this task seems impossible to achieve. Now let us visualize these 81 combinations as paths and, using the transition and emission probabilities, mark each vertex and edge as shown below.
HMM-POS Tagging
The next step is to delete all the vertices and edges with probability zero; also, the vertices which do not lead to the endpoint are removed.
Now there are only two paths that lead to the end, let us calculate the probability associated with each path.

<S>→N→M→N→N→<E> = 3/4 * 1/9 * 3/9 * 1/4 * 1/4 * 2/9 * 1/9 * 4/9 * 4/9 = 0.00000846754

<S>→N→M→V→N→<E> = 3/4 * 1/9 * 3/9 * 1/4 * 3/4 * 1/4 * 1 * 4/9 * 4/9 = 0.00025720164

Clearly, the probability of the second sequence is much higher and hence the HMM is going to tag each word in the sentence
according to this sequence.

Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 11

Module 3: Syntax Parsing

Topic: Introduction
Optimizing HMM with Viterbi Algorithm
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden
states—called the Viterbi path—that results in a sequence of observed events, especially in the context of
Markov information sources and hidden Markov models (HMM).

In the previous section, we optimized the HMM and brought our calculations down from 81 paths to just two. Now we are going to further optimize the HMM by using the Viterbi algorithm. Let us use the same example as before and apply the Viterbi algorithm to it.
Optimizing HMM with Viterbi Algorithm
Consider the vertex encircled in the above example. There are two paths leading to this vertex, shown below along with the probabilities of the two mini-paths.

We keep only the mini-path having the highest probability and discard the other. The same procedure is done for all the states in the graph, as shown in the figure below.
Optimizing HMM with Viterbi Algorithm
As we can see in the figure, the probabilities of all paths leading to a node are calculated, and we remove the edges or paths which have the lower probability cost. Also, you may notice some nodes having a probability of zero; such nodes have no edges attached to them, as all the paths to them have zero probability. The graph obtained after computing the probabilities of all paths leading to a node is shown below.
Optimizing HMM with Viterbi Algorithm
To get an optimal path, we start from the end and trace backward; since each state has only one incoming edge, this gives us a single path, as shown below.

As you may have noticed, this algorithm returns only one path, as compared to the previous method which suggested two paths. Thus, by using this algorithm, we save a lot of computation.

After applying the Viterbi algorithm, the model tags the sentence as follows:
Will as a noun
Can as a modal
Spot as a verb
Mary as a noun
These are the right tags, so we conclude that the model can successfully tag the words with their appropriate POS tags.
Bi-gram statistical tagger

(Figure omitted.)
Transformation Based (Brill) Tagging
A hybrid approach:
- Like rule-based taggers, this tagging is based on rules.
- Like (most) stochastic taggers, the rules are automatically induced from hand-tagged data.

Basic idea: do a quick and dirty job first, and then use learned rules to patch things up. This overcomes the pure rule-based approach's problems of being too expensive, too slow, too tedious, etc. It is an instance of Transformation-Based Learning.

Combine rules and statistics: start with a dumb statistical system and patch up the typical mistakes it makes. How dumb? Assign the most frequent tag (unigram) to each word in the input.
Process
1. Choose a Baseline Tagger:
To start, you need a baseline POS tagger that assigns initial tags to words in a sentence. Common
baseline taggers include Hidden Markov Models (HMMs) or rule-based taggers.

2. Collect Training Data:


You need labeled training data, which consists of sentences with the correct POS tags for each word. This
data is used to learn transformation rules.
3. Initialize Tag Assignments:


Apply the baseline tagger to a sentence and assign initial POS tags to each word.

4. Generate Transformation Rules:


The core of the Brill tagging process involves learning transformation rules from the training data. These
rules are typically in the form of "if-then" statements that specify how to modify or correct POS tags. Rules
are learned based on observed tagging errors in the training data.
Example transformation rule: "If a noun is followed by 'to,' change the tag of 'to' to 'TO'."

Process
5. Apply Transformation Rules:
Iterate through the sentence and apply transformation rules to modify the POS tags generated by the
baseline tagger.

6. Evaluate the Updated Tags:


After applying a set of transformation rules to a sentence, evaluate the updated POS tags. If the tagging
accuracy improves, keep the updated tags; otherwise, revert to the previous tagging.
7. Repeat:
Continue applying transformation rules and evaluating the tagging accuracy until a stopping criterion is
met, such as reaching a maximum number of iterations or achieving a desired level of accuracy.

8. Finalize Tags:
Once the iterative process is complete, the final POS tags are used as the output for the sentence.

Syntax
By syntax, we mean various aspects of how words are strung together to form components of sentences and how those components are strung together to form sentences. Syntax comes from the Greek sýntaxis, meaning "setting out together or arrangement".
• that and after year last
• I saw you yesterday
• colorless green ideas sleep furiously

Syntax is the kind of implicit knowledge of your native language that you had mastered by the time you were 3 or 4 years old without explicit instruction, not necessarily the type of rules you were later taught in school.

Why should you care?
• Grammar checkers
• Question answering
• Information extraction
• Machine translation
Constituency
The idea: Groups of words may behave as a single unit or phrase, called a constituent.

E.g., Noun Phrase:
• Kermit the frog
• they
• December twenty-sixth
• the reason he is running for president

- Sentences have parts, some of which appear to have subparts. These groupings of words that go together we will call constituents.
- These units form coherent classes that behave in similar ways.
- For example, we can say that noun phrases can come before verbs.
Constituent Phrases
For constituents, we usually name them as phrases based on the word that
heads the constituent:

the man from Amherst is a Noun Phrase (NP) because the head man is a noun
extremely clever is an Adjective Phrase (AP) because the head clever is an adjective
down the river is a Prepositional Phrase (PP) because the head down is a preposition
killed the rabbit is a Verb Phrase (VP) because the head killed is a verb

 Note that a word is a constituent (a little one). Sometimes words also act as phrases. In:
Joe grew potatoes.

Joe and potatoes are both nouns and noun phrases.

Evidence constituency exists
1. They appear in similar environments (before a verb)
Kermit the frog comes on stage
They come to Massachusetts every summer
December twenty-sixth comes after Christmas
The reason he is running for president comes out only now.

But not each individual word in the constituent:
*The comes out... *is comes out... *for comes out...

2. The constituent can be placed in a number of different locations


Constituent = Prepositional phrase: On December twenty-sixth
On December twenty-sixth I’d like to fly to Florida.
I’d like to fly on December twenty-sixth to Florida.
I’d like to fly to Florida on December twenty-sixth.
But not split apart
*On December I’d like to fly twenty-sixth to Florida.
*On I’d like to fly December twenty-sixth to Florida.
Natural Language
Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 12

Module 3: CFG

Topic: Introduction
Context-free grammar
The most common way of modeling constituency.

CFG = Context-Free Grammar = Phrase Structure Grammar = BNF = Backus-Naur Form

The idea of basing a grammar on constituent structure dates back to Wilhelm Wundt (1890), but it was not formalized until Chomsky (1956) and, independently, Backus (1959).
A CFG consists of:
• Terminals: we'll take these to be words.
• Non-terminals: the constituents in a language, like noun phrase, verb phrase, and sentence.
• Rules: equations that consist of a single non-terminal on the left and any number of terminals and non-terminals on the right.
CFG 4 Tuple
G = {T, N, S, R}

- T is the set of terminals (lexicon)
- N is the set of non-terminals
- S is the start symbol (one of the non-terminals)
- R is the set of rules/productions of the form X → γ, where X is a non-terminal and γ is a sequence of terminals and non-terminals (may be empty)

A grammar G generates a language L.
CFG
G = {T, N, S, R}
T = {that, this, a, the, man, book, flight, meal, include, read, does}
N = {S, NP, NOM, VP, Det, Noun, Verb, Aux}
S=S
R = {
S → NP VP
S → Aux NP VP
S → VP
NP → Det NOM
NOM → Noun
NOM → Noun NOM
VP → Verb
VP → Verb NP
Det → that | this | a | the
Noun → book | flight | meal | man
Verb → book | include | read
Aux → does
}

CFG Example: deriving "The man read this book"

S → NP VP
→ Det NOM VP
→ The NOM VP
→ The Noun VP
→ The man VP 15

→ The man Verb NP


→ The man read NP
→ The man read Det NOM
→ The man read this NOM
→ The man read this Noun
→ The man read this book

Parse tree

(Figures omitted: parse trees for "The man read this book" and "I prefer a morning flight".)
CFGs can capture recursion
Example of seemingly endless recursion of embedded
prepositional phrases:

PP → Prep NP
NP → Noun PP

17
[S The mailman ate his [NP lunch [PP with his friend [PP from the cleaning staff [PP of the building
[PP at the intersection [PP on the north end [PP of town]]]]]]].

CSA4006-Dr. Anirban Bhowmick


Grammaticality
A CFG defines a formal language = the set of all sentences (strings of words) that can be derived by
the grammar.

 Sentences in this set said to be grammatical.

 Sentences outside this set said to be ungrammatical.

18

CSA4006-Dr. Anirban Bhowmick


Parsing
 Parsing is the process of taking a string and a grammar and returning a (or multiple) parse tree(s)
for that string
 It is completely analogous to running a finite state transducer with a tape
 It’s just more powerful: there are languages we can capture with CFGs that we can’t capture with
finite state machines
 A recognizer is a program that, given a grammar and a sentence, returns YES if the
sentence is accepted by the grammar (i.e. the sentence is in the language), and NO otherwise.
 A parser, in addition to doing the work of a recognizer, also returns the set of parse trees for the
string.

Top-Down Parsing and Bottom-Up Parsing are used for parsing a tree to reach the starting node of
the tree. Both the parsing techniques are different from each other. The most basic difference between
the two is that top-down parsing starts from top of the parse tree, while bottom-up parsing starts from
the lowest level of the parse tree.

CSA4006-Dr. Anirban Bhowmick


Top-down parsing
Top-down parsing is goal-directed

A top-down parser starts with a list of constituents to be built.


• It rewrites the goals in the goal list by matching one against the LHS of the grammar rules, and
expanding it with the RHS,
• attempting to match the sentence to be derived

If a goal can be rewritten in several ways, then there is a choice of which rule to apply (search
problem)
Can use depth-first or breadth-first search, and goal ordering.

CSA4006-Dr. Anirban Bhowmick
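A depth-first, top-down recognizer for the toy grammar can be sketched in a few lines of Python (illustrative only; it assumes the grammar has no left recursion, a problem discussed on the next slide):

GRAMMAR = {
    "S":    [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP":   [["Det", "NOM"]],
    "NOM":  [["Noun"], ["Noun", "NOM"]],
    "VP":   [["Verb"], ["Verb", "NP"]],
    "Det":  [["the"], ["this"], ["that"], ["a"]],
    "Noun": [["man"], ["book"], ["flight"], ["meal"]],
    "Verb": [["read"], ["book"], ["include"]],
    "Aux":  [["does"]],
}

def expand(goals, words):
    # Success iff the goal list and the input are exhausted together.
    if not goals:
        return not words
    head, rest = goals[0], goals[1:]
    if head not in GRAMMAR:                     # terminal: must match the next word
        return bool(words) and words[0] == head and expand(rest, words[1:])
    # Non-terminal: try each production in turn (the search / choice point).
    return any(expand(list(rhs) + rest, words) for rhs in GRAMMAR[head])

print(expand(["S"], "the man read this book".split()))   # True
print(expand(["S"], "book the flight does".split()))     # False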


Top-down parsing example (Breadth-first)

[Figures: breadth-first top-down search, expanding S step by step toward the input sentence]
CSA4006-Dr. Anirban Bhowmick


Problems with top-down parsing
Left recursive rules... e.g. NP → NP PP... lead to infinite recursion

 Will do badly if there are many different rules for the same LHS. Consider if there are 600 rules
for S, 599 of which start with NP, but one of which starts with a V, and the sentence starts with
a V.
 Useless work: expands things that are possible top-down but not there (no bottom-up evidence
for them).
 Top-down parsers do well if there is useful grammar-driven control: search is directed by the
grammar.
 Top-down is hopeless for rewriting parts of speech (pre-terminals) with words (terminals). In
practice that is always done bottom-up as lexical lookup.
 Repeated work: anywhere there is common substructure

CSA4006-Dr. Anirban Bhowmick


Bottom-up parsing
Bottom-up parsing is data-directed.

 The initial goal list of a bottom-up parser is the string to be parsed.


 If a sequence in the goal list matches the RHS of a rule, then this sequence may be replaced by the
LHS of the rule.
 Parsing is finished when the goal list contains just the start symbol. If the RHS of several rules match
the goal list, then there is a choice of which rule to apply (search problem)
24

Can use depth-first or breadth-first search, and goal ordering.

The standard presentation is as shift-reduce parsing

CSA4006-Dr. Anirban Bhowmick


Bottom-up parsing example

[Figure: bottom-up parsing example, building constituents from the words upward]

CSA4006-Dr. Anirban Bhowmick


Shift-reduce parsing

[Figure: shift-reduce parsing trace]

CSA4006-Dr. Anirban Bhowmick


Shift-reduce parsing (contd.)

Start with the sentence to be parsed in an input buffer.

• a ”shift” action corresponds to pushing the next input symbol from the buffer onto the stack
• a ”reduce” action occurs when we have a rule’s RHS on top of the stack. To perform the reduction,
we pop the rule’s RHS off the stack and replace it with the non-terminal on the LHS of the corresponding
rule.

(When either ”shift” or ”reduce” is possible, choose one arbitrarily.)

If you end up with only the Start symbol on the stack, then success!

If you don’t, and no ”shift” or ”reduce” actions are possible,
backtrack.

CSA4006-Dr. Anirban Bhowmick
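The shift-reduce loop can be sketched in Python as follows (a toy: it reduces greedily and omits the backtracking just described, so rule order matters; the rules and lexicon below are illustrative):

RULES = [
    (("Det", "NOM"), "NP"),
    (("Noun",), "NOM"),
    (("Noun", "NOM"), "NOM"),
    (("Verb", "NP"), "VP"),
    (("NP", "VP"), "S"),
]
LEXICON = {"the": "Det", "this": "Det", "man": "Noun",
           "book": "Noun", "read": "Verb"}

def shift_reduce(words):
    stack, buf = [], [LEXICON[w] for w in words]
    while buf or len(stack) > 1:
        reduced = False
        for rhs, lhs in RULES:                # try to reduce the top of the stack
            n = len(rhs)
            if tuple(stack[-n:]) == rhs:
                stack[-n:] = [lhs]
                print("reduce:", stack)
                reduced = True
                break
        if not reduced:
            if not buf:
                return False                  # stuck: a real parser would backtrack
            stack.append(buf.pop(0))          # shift the next input symbol
            print("shift: ", stack)
    return stack == ["S"]

print(shift_reduce("the man read this book".split()))   # True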


Contd.
In a top-down parser, the main decision was which production rule to pick.
In a bottom-up shift-reduce parser there are two decisions:

1. Should we shift another symbol, or reduce by some rule?


2. If reduce, then reduce by which rule?

both of which can lead to the need to backtrack 28

Problem:
• Unable to deal with empty categories: termination problem, unless rewriting empties as
constituents is somehow restricted (but then it’s generally incomplete)
• Useless work: locally possible, but globally impossible
• Inefficient when there is great lexical ambiguity (grammar-driven control might help here).
Conversely, it is data-directed: it attempts to parse the words that are there
• Repeated work: anywhere there is common substructure.

CSA4006-Dr. Anirban Bhowmick


Noun Phrase
The noun phrase can be viewed as revolving around a head, the central noun in the noun
phrase. The syntax of English allows for both
• Prenominal prehead modifiers
• Post nominal (post head) modifiers

Prenominal prehead modifiers are words or phrases that appear before the noun and modify it.
These modifiers provide additional information about the noun. Here's an example: 29
The big, red apple: In this noun phrase, "big" and "red" are prenominal prehead modifiers that
provide more details about the noun "apple.“

Postnominal (post head) modifiers are words or phrases that appear after the noun and
modify it. These modifiers also offer additional information about the noun. Here's an example:
The car with the broken windshield: In this noun phrase, "with the broken windshield" is a
postnominal modifier that provides more information about the noun "car."

CSA4006-Dr. Anirban Bhowmick


Noun Phrase
Noun phrases can begin with a determiner, as follows:

a stop, the flights, that fare, this flight, those flights, any flights, some flight
Word classes that appear in the NP before the determiner are called
predeterminers .
A number of different kinds of word classes can appear in the NP between the 30
determiner and the head noun.
• Cardinal numbers Eg two friends, one stop
• Ordinal numbers include first, second, third etc but also words like next, last,
past, other, and another Eg the first one, the next day, the second leg, the last
flight, the other American flight, any other fares.
• Quantifiers many, few, several occur only with plural count nouns Eg many
fares
• The quantifiers much and a little occur only with noncount nouns

CSA4006-Dr. Anirban Bhowmick


Noun Phrase
Noun phrases can start with determiners...
Determiners can be

Simple lexical items: the, this, a, an, etc.

A car 31
Or simple possessives

John’s car

Or complex recursive versions of that


John’s sister’s husband’s son’s car

CSA4006-Dr. Anirban Bhowmick


Noun Phrase
Adjectives occur after quantifiers but before nouns.

A first class fare


A nonstop flight
The longest layover
The earliest lunch flight
32
Adjectives can also be grouped into a phrase called an adjective phrase AP.

APs can have an adverb before the adjective

Eg. the least expensive fare

All the options for prenominal modifiers are combined in one rule as follows:
NP → (Det) (Card) (Ord) (Quant) (AP) Nominal
the use of parentheses () to mark optional constituents.

CSA4006-Dr. Anirban Bhowmick


Noun Phrase
A head noun can be followed by post modifiers. Three kinds:
Prepositional phrases
• Flights from Seattle
Non-finite clauses
• Flights arriving before noon
Relative clauses
• Flights that serve breakfast

• any stopovers [for Delta seven fifty one]


• all flights [from Cleveland] [to Newark]
• arrival [in San Jose] [before seven p.m]
• a reservation [on flight six oh six] [from Tampa]
[to Montreal]
Here’s a new NP rule to account for one to three PP postmodifiers:
Nominal → Nominal PP (PP) (PP)
CSA4006-Dr. Anirban Bhowmick
Noun Phrase
• The three most common kinds of non-finite postmodifiers are the gerundive (-ing), -ed, and
infinitive forms
• Gerundive postmodifiers are so called because they consist of a verb phrase that begins with
the gerundive (-ing) form of the verb

In the following examples, the verb phrases happen to all have only prepositional phrases after the
verb.
• any of those (leaving on Thursday)
• any flights (arriving after eleven a.m)
• flights (arriving within thirty minutes of each other)

The use of a new nonterminal GerundVP:


Nominal → Nominal GerundVP

CSA4006-Dr. Anirban Bhowmick


Noun Phrase
Rules for GerundVP constituents are made by duplicating all of our VP productions, substituting GerundV
for V:
• GerundVP → GerundV NP
• GerundVP → GerundV PP
• GerundVP → GerundV
• GerundVP → GerundV NP PP

GerundV can then be defined as:
GerundV → being | preferring | arriving | leaving | …

A postnominal relative clause (more correctly a restrictive relative clause), is a clause that often
begins with a relative pronoun (that and who are the most common)

CSA4006-Dr. Anirban Bhowmick


Agreement
Constraints that hold among various constituents
For example, in English, determiners and the head nouns in NPs have to agree in their
number.
Which of the following cannot be parsed by the rule NP → Det Nominal?

(O) This flight      (X) This flights
(O) Those flights    (X) Those flight

This rule does not handle agreement! (The rule does not detect whether the agreement is
correct or not.)

CSA4006-Dr. Anirban Bhowmick


Problem
Our earlier NP rules are clearly deficient since they don’t capture the agreement constraint

NP → Det Nominal

Accepts, and assigns correct structures, to grammatical examples (this flight)


But it’s also happy with incorrect examples (*these flight)
Such a rule is said to overgenerate

CSA4006-Dr. Anirban Bhowmick


THE VERB PHRASE AND SUBCATEGORIZATION
The verb phrase consists of the verb and a number of other constituents (its arguments)

38

But, even though there are many valid VP rules in English, not all verbs are
allowed to participate in all those VP rules
We can subcategorize the verbs in a language according to the sets of VP rules
that they participate in
This is a modern take on the traditional notion of transitive/intransitive
Modern grammars may have 100s of such classes

CSA4006-Dr. Anirban Bhowmick


Contd.
Sneeze: John sneezed
Find: Please find [a flight to NY] NP
Give: Give [me] NP [a cheaper fare] NP
Help: Can you help [me] NP [with a flight] PP
Prefer: I prefer [to leave earlier] TO VP
Told: I was told [United has a flight] S
39

• *John sneezed the book


• *I prefer United has a flight
• *Give with a flight
As with agreement phenomena, we need a way to formally express the constraints!

CSA4006-Dr. Anirban Bhowmick


Contd.
The various rules for VPs overgenerate.
They permit the presence of strings containing verbs and arguments that don’t go together.
For example: VP → V NP
therefore *Sneezed the book is a VP, since “sneeze” is a verb and “the book” is a valid NP

40

CSA4006-Dr. Anirban Bhowmick


Grammar Equivalence & Normal Form
A formal language is defined as a (possibly infinite) set of strings of words. There are two kinds of
grammar equivalence:
Weak equivalence
Strong equivalence
Two grammars are strongly equivalent if they generate the same set of strings and assign the same phrase
structure to each sentence (allowing merely for renaming of the non-terminal symbols).
Two grammars are weakly equivalent if they generate the same set of strings but do not assign the same
phrase structure to each sentence.
It is sometimes useful to have a normal form for grammars, in which each of the productions takes a
particular form.
For example, a context-free grammar is in Chomsky normal form (CNF) if every rule is of the form
A → B C or A → a.
Any grammar can be converted into a weakly equivalent Chomsky normal form grammar.
For example, a rule of the form
A → B C D
can be converted into the following two CNF rules:
A → B X
X → C D
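This binarization step is mechanical; a minimal Python sketch (illustrative only: it handles just the "rule longer than two symbols" case, and the fresh non-terminal names are arbitrary):

def binarize(lhs, rhs):
    """Split a long rule like A -> B C D into binary rules via fresh symbols."""
    rules, n = [], 0
    while len(rhs) > 2:
        n += 1
        fresh = "X%d" % n                     # fresh non-terminal (name arbitrary)
        rules.append((lhs, [rhs[0], fresh]))  # e.g. A -> B X1
        lhs, rhs = fresh, rhs[1:]             # continue with X1 -> C D ...
    rules.append((lhs, list(rhs)))
    return rules

for l, r in binarize("A", ["B", "C", "D"]):
    print(l, "->", " ".join(r))
# A -> B X1
# X1 -> C D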
CSA4006-Dr. Anirban Bhowmick


Natural Language Processing
CSA4006
Dr. Anirban Bhowmick
Assistant Professor
VIT Bhopal
Lecture : 13

Module 4: Semantics
Computational Desiderata for Representations - Meaning Structure of Language - First Order Predicate
Calculus - Elements of FOPC - The Semantics of FOPC - Syntax-Driven Semantic Analysis - Attachments
for a Fragment of English

Topic: Introduction
Semantic Analysis
Semantic analysis in natural language processing (NLP) refers to the process of understanding the
meaning of words, phrases, sentences, or even entire documents. It goes beyond syntactic analysis,
which focuses on the grammatical structure of language, to extract the underlying meaning and
context.
Here are some key aspects of semantic analysis in NLP:

Word Sense Disambiguation (WSD): Words often have multiple meanings depending on the context
in which they are used. WSD is the task of determining the correct sense of a word in a given
context. For example, the word "bank" could refer to a financial institution or the side of a river.

Named Entity Recognition (NER): NER involves identifying and classifying entities such as names of
people, organizations, locations, dates, and other specific terms in a text. This helps in
understanding the key entities and their relationships within a document.

Semantic Role Labeling (SRL): SRL aims to identify the roles of different components of a sentence,
such as the subject, object, and predicate. It helps in understanding the relationships between
entities and their actions in a given context.

CSA4006-Dr. Anirban Bhowmick


Semantic Analysis
Coreference Resolution: This involves determining when two or more expressions in a text refer to the
same entity. For example, in the sentence "John went to the store. He bought some groceries," resolving
the pronoun "He" to refer to "John" requires coreference resolution.

Sentiment Analysis: While often associated more with the emotional aspect of language, sentiment
analysis also involves understanding the underlying meaning of text. It helps determine whether a piece
of text expresses a positive, negative, or neutral sentiment.
9
Semantic Similarity: This involves measuring the degree of similarity between words, phrases, or
sentences in terms of meaning. It is useful in tasks like information retrieval, document clustering, and
question answering.

Word Embeddings and Vector Representations: Techniques like word embeddings (e.g., Word2Vec,
GloVe, and BERT) represent words in a continuous vector space where semantically similar words are
closer in the vector space. This allows algorithms to capture semantic relationships between words.

Frame Semantics and Ontologies: Understanding the frames or scenarios in which words and phrases
are used can contribute to a deeper understanding of meaning.
CSA4006-Dr. Anirban Bhowmick
Meaning Representation Language
In natural language processing (NLP), meaning representation languages are formal languages or
frameworks used to represent the meaning of linguistic expressions in a structured and interpretable
way. These representations are essential for tasks such as semantic analysis, machine translation,
question answering, and other applications where understanding the meaning of natural language is
crucial.

But unlike parse trees, these representations aren’t primarily descriptions of the structure of the inputs

Consider the following everyday language tasks that require some form of semantic processing

 Answering an essay question on an exam


 Deciding what to order at a restaurant by reading a menu
 Learning to use a new piece of software by reading the manual
 Realizing that you’ve been insulted
 Following a recipe

CSA4006-Dr. Anirban Bhowmick


Contd.
For example, some of the knowledge of the world needed to perform the above tasks includes:

Answering and grading essay questions requires background knowledge about


 The topic of the question
 The desired knowledge level of the students
 How such questions are normally answered
11

Learning to use a piece of software by reading a manual


 Giving advice about how to do the same
 Requires deep knowledge about current computers
 The specific software in question
 Similar software applications
 Knowledge about users in general

CSA4006-Dr. Anirban Bhowmick


Computational Desiderata for
Representation
Computational desiderata refer to the desired properties or characteristics that representations should
possess to effectively capture and model the meaning of language.

To focus this discussion, we will consider in more detail the task of giving advice about restaurants to
tourists. In this discussion, we will assume that we have a computer system that accepts spoken
language queries from tourists and constructs appropriate responses by using a knowledge base of
relevant domain knowledge.
12

 Verifiability
 Unambiguous Representations
 Canonical Form
 Inference and Variables
 Expressiveness

CSA4006-Dr. Anirban Bhowmick


Verifiability
Verifiability: The system’s ability to compare representations to facts in memory

The most straightforward way to implement this notion is make it possible for a system to compare,
or match the representation of the meaning of an input against the representations in its knowledge
base its store of information about its world.

Does Maharani serve vegetarian food?


Serves(Maharani; Vegetarian Food)
 Input matched against the knowledge base of facts about a set of restaurants
 Matching the input proposition in its knowledge base, it can return an affirmative answer
 Otherwise, it must either say No if its knowledge of local restaurants is complete, or say that it does
not know

CSA4006-Dr. Anirban Bhowmick


Unambiguous Representations
 The domain of semantics is subject to ambiguity
 Single linguistic inputs can legitimately have different meaning representations assigned to them
based on the circumstances in which they occur.

The cat is on the mat


Ambiguity:
The phrase "on the mat" might have multiple interpretations, as it could refer to a physical location or
imply a scolding or disciplinary action.

Unambiguous representations are crucial for NLP tasks to enhance the accuracy and reliability of
natural language understanding systems.

CSA4006-Dr. Anirban Bhowmick


Vagueness
A concept closely related to ambiguity is vagueness
 Like ambiguity, vagueness can make it difficult to determine what to do with a particular input
based on its meaning representation
 Vagueness, however, does not give rise to multiple representations
 Consider the following request as an example
I want to eat Italian food
Use of the phrase Italian food may provide enough information for a restaurant advisor to provide
reasonable recommendations
 It is nevertheless quite vague as to what the user really wants to eat
 A vague representation of the meaning of this phrase may be appropriate for some purposes,
while a more specific representation may be needed for other purposes

CSA4006-Dr. Anirban Bhowmick


Canonical Form
The notion that single sentences can be assigned multiple meanings leads to the related phenomenon of
distinct inputs that should be assigned the same meaning representation

 Does Maharani have vegetarian dishes?


 Do they have vegetarian food at Maharani?
 Are vegetarian dishes served at Maharani?
 Does Maharani serve vegetarian fare?

CSA4006-Dr. Anirban Bhowmick


Inference and Variables
Can vegetarians eat at Maharani?

 The term inference to refer generically to a system’s ability to draw valid conclusions based on the
meaning representation of inputs and its store of background knowledge
 It must be possible for the system to draw conclusions about the truth of propositions that are not
explicitly represented in the knowledge base, but are nevertheless logically derivable from the
propositions that are present
 I’d like to find a restaurant where I can get vegetarian food.
 In this examples, this request does not make reference to any particular restaurant
 The user is stating that they would like information about an unknown and unnamed entity that is a
restaurant that serves vegetarian food
 Answering this request requires a more complex kind of matching that involves the use of variables
 A representation containing such variables as follows

Serves(x; Vegetarian Food)

CSA4006-Dr. Anirban Bhowmick


Expressiveness
Expressiveness in meaning representation in NLP refers to the ability of a representation system to
capture the richness and diversity of meanings present in natural language. An expressive
representation should be able to convey nuanced relationships, distinctions, and semantic intricacies
inherent in human language. Here's an example to illustrate expressiveness
The conference room echoed with the enthusiastic applause of the audience.

This representation captures the expressiveness of the sentence by not only representing the basic

actions and entities but also incorporating additional details about the manner of applause and the
specific location of the event. It goes beyond a simple surface-level representation and delves into the
nuanced aspects of the sentence's meaning.

CSA4006-Dr. Anirban Bhowmick


Meaning Structure of Language
These include a variety of conventional form
 Meaning associations
 Word order regularities
 Tense systems
 Conjunctions and quantifiers
 A fundamental predicate argument structure
19
A predicate is a statement about a subject that either is true or false. It expresses a property or a
relation. Predicates often use verbs to convey actions or states.
Examples:
The cat is on the mat.
Predicate: "is on the mat"
Subject: "The cat"

 Predicates: Primarily Verbs , VPs , Sentences, sometimes Nouns and NPs


 Arguments: Primarily Nouns, Nominals, NPs, PPs.

CSA4006-Dr. Anirban Bhowmick


Meaning Structure of Language
Argument:
An argument is a value that is applied to a function or, in logic, a subject that satisfies a predicate. In
simpler terms, it is what the predicate is about.

Examples:
1.In "The cat is on the mat," "The cat" is the argument of the predicate "is on the mat."
2.In "She likes to read books," "She" is the argument of the predicate "likes to read books." 20
3.In "The sun sets in the west," "The sun" is the argument of the predicate "sets in the west."

 Predicates: Primarily Verbs , VPs , Sentences, sometimes Nouns and NPs


 Arguments: Primarily Nouns, Nominals, NPs, PPs.

CSA4006-Dr. Anirban Bhowmick


Contd.
These examples can be classified as having one of the three syntactic argument frames
I want Italian food  NP want NP
I want to spend less than five dollars  NP want Inf VP
I want it to be close by here  NP want NP Inf VP
 These syntactic frames specify the number, position and syntactic category of the arguments that are
expected.
 The frame for the variety of want that appears in Example 1 specifies the following facts
 There are two arguments to this predicate.
 Both arguments must be NPs.
 The first argument is pre verbal and plays the role of the subject.
 The second argument is post verbal and plays the role of the direct object.

CSA4006-Dr. Anirban Bhowmick


Contd.
 Semantic roles and Semantic restrictions on these roles
 The notion of a semantic role can be understood by looking at the similarities among the arguments in
the examples above.
 The study of roles associated with specific verbs and across classes of verbs is usually referred to as
thematic role or case role
 The notion of semantic restrictions arises directly from these semantic roles

 Consider the following phrases from the BERP corpus

An Italian restaurant under fifteen dollars


 In this example, the meaning representation associated with the preposition under can be seen as
having
 something like the following structure
Under(Italian Restaurant ; $15)
 Prepositions can be characterized as two argument predicates where the first argument is an object that
is being placed in some relation to the second argument

CSA4006-Dr. Anirban Bhowmick


Contd.
Another non verb based predicate argument structure example

Make a reservation for this evening for a table for two persons at 8

 The predicate argument structure is based on the concept underlying the noun reservation, rather
than make, the main verb in the phrase
 This example gives rise to a four argument predicate structure like the following
23
Reservation(Today; 8PM ; 2)
 Any useful meaning representation language must be organized supports the specification of
semantic predicate argument structures
 This support must include support for the kind of semantic information that languages present
 Variable arity predicate argument structures
 The semantic labeling of arguments to predicates
 The statement of semantic constraints on the fillers of argument roles

CSA4006-Dr. Anirban Bhowmick




Natural Language Processing
CSA4006
Dr. Anirban Bhowmick
Lecture : 14

Module 4: Semantics
Review

CSA4006-Dr. Anirban Bhowmick


(Recap of the predicate-argument structure slides from Lecture 13.)


Propositional Logic
The simplest, and most abstract logic we can study is called propositional logic.
• Definition: A proposition is a statement that can be either true or false; it must be one or
the other, and it cannot be both.
Examples (propositions): "The fan is on", "2 + 3 = 5"
Whereas "1 + 2" and "Where is John?" are not propositions.

There are two types of Propositions:

Atomic Propositions
Compound propositions

CSA4006-Dr. Anirban Bhowmick


Propositional Logic
Atomic Propositions:
Definition: An atomic proposition is one whose truth or falsity does not depend on the truth or
falsity of any other proposition
Example:
"The Sun is cold“
2+2 is 4
15

Compound Propositions:
Compound propositions are constructed by combining simpler or atomic propositions, using
parenthesis and logical connectives.
Example:
"It is raining today, and street is wet."
"Ankit is a doctor, and his clinic is in Mumbai."

CSA4006-Dr. Anirban Bhowmick


Propositional Logic
Logical Connectives:

Implication: In propositional logic, we have a connective that combines two propositions into a new
proposition called the conditional.

If it is raining, then the street is wet.


Let P= It is raining, and Q= Street is wet, so it is
represented as P → Q

CSA4006-Dr. Anirban Bhowmick


Propositional Logic
Biconditional: A sentence such as P ⇔ Q is a biconditional sentence, e.g. "I am
breathing if and only if I am alive". It is written as P iff Q.
P = I am breathing, Q = I am alive; it can be represented as P ⇔ Q.

Definition: If p and q are arbitrary propositions, then the biconditional of p and q is written: p ⇔
q and will be true iff either:
1. p and q are both true; or 17

2. p and q are both false.

CSA4006-Dr. Anirban Bhowmick


Propositional Logic
We can nest complex formulae as deeply as we want.
• We can use parentheses i.e., ),(, to disambiguate formulae.
• EXAMPLES. If p, q, r, s and t are atomic propositions, then all of the following are
formulae:
p∧q⇒r
p ∧ (q ⇒ r)
(p ∧ (q ⇒ r)) ∨ s
((p ∧ (q ⇒ r)) ∨ s) ∧ t
EXAMPLE. Suppose we have a valuation 𝜐 such that:
𝜐(p) = F
𝜐(q) = T
𝜐(r) = F
Then the truth value of (p ∨ q) ⇒ r is evaluated as:
(F ∨ T) ⇒ F = T ⇒ F = F

CSA4006-Dr. Anirban Bhowmick
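The same evaluation can be checked mechanically; a minimal Python sketch (not from the slides), writing material implication as (not a) or b:

def implies(a, b):
    """Material implication: a => b is equivalent to (not a) or b."""
    return (not a) or b

# The valuation from the example above.
p, q, r = False, True, False
print(implies(p or q, r))   # False, matching the hand evaluation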


First Order Predicate Calculus
First-Order Predicate Calculus (FOPC) plays a crucial role in representing and reasoning
about linguistic structures and meanings. It serves as a foundation for semantic analysis
and knowledge representation in NLP systems. Let's delve into a detailed explanation with
an example relevant to NLP.

Allows us to break sentences into predicates, subjects and objects, while also allowing us to
use quantifiers like “all”, “each”, “some” etc. 19

Blackburn & Bos make a strong argument for using first-order logic as the meaning
representation.

Powerful, flexible, general.

CSA4006-Dr. Anirban Bhowmick


First Order Predicate Calculus
FOL symbols

○ Constants: john, mary


○ Predicates & relations: man, walks, loves
○ Variables: x, y
○ Logical connectives: ∧ ∨ ¬ →
○ Quantifiers: ∀ ∃
○ Other punctuation: parens, commas

FOL formulae

○ Atomic formulae: loves(john, mary)


○ Connective applications: man(john) ∧ loves(john, mary)
○ Quantified formulae: ∃x (man(x))

CSA4006-Dr. Anirban Bhowmick


Predicates categories
One place: Intransitive verbs, common nouns, adjectives

Dog(x), Happy (x)

Two Place: Transitive verbs, prepositions

Likes(x,y), In(x,y)

Three Place: Ditransitive verbs

Gives(x,y,z)

CSA4006-Dr. Anirban Bhowmick


Quantifier
Quantifiers generate quantification and specify the number of specimen in the universe.
Quantifiers allow us to determine or identify the range and scope of the variable in a logical expression.
There are two types of quantifiers:
Universal quantifier: for all, everyone, everything.
Existential quantifier: for some, at least one.
1. Universal quantifiers
Universal quantifiers specify that the statement within the range is true for everything or every
instance of a particular thing.
Universal quantifiers are denoted by a symbol (∀) that looks like an inverted A. In a universal quantifier, we
use →.
If x is a variable, then ∀x can read as:
For all x
For every x
For each x

Example
Every kid likes football: ∀x (kid(x) → likes(x, football))

CSA4006-Dr. Anirban Bhowmick


Quantifier
2. Existential quantifiers
Existential quantifiers are used to express that the statement within their scope is true for at least one
instance of something.

∃, which looks like an inverted E, is used to represent them. We always use AND or conjunction symbols.

If x is a variable, the existential quantifier will be ∃x:


For some x 23
There exists an x
For at least one x

Example
Some people like football: ∃x (people(x) ∧ likes(x, football))

CSA4006-Dr. Anirban Bhowmick


Scope and Free & Bound Variables
∀x[Person(x)] ∧ Happy(x)

(Every x is a person) and x is happy

Everyone is a person and he is happy

∀x[Person(x) ∧ Happy(x)] 24

(Every x is a person and every x is happy)

Everyone is happy

CSA4006-Dr. Anirban Bhowmick


Examples
1. Some boys hate football

∃x: boys(x) ∧ hate(x, Football)

2. Every person who buys a policy is smart

∀x ∀y: (Person(x) ∧ Policy(y) ∧ buys(x,y)) → Smart(x)

3. No person buys an expensive policy

∀x ∀y: (Person(x) ∧ Policy(y) ∧ expensive(y)) → ¬buys(x,y)

4. Mary loves everyone

∀x: (person(x) → loves(Mary, x))

5. Everyone loves everyone except himself

∀x ∀y: (x ≠ y → loves(x, y))

CSA4006-Dr. Anirban Bhowmick
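Formulas like these can also be built and inspected programmatically; a small sketch using NLTK's logic package (assuming nltk is installed; 'all'/'exists', '->', '&' and '-' are its ASCII syntax):

from nltk.sem.logic import Expression

f1 = Expression.fromstring('exists x.(boys(x) & hate(x, football))')
f2 = Expression.fromstring('all x.(person(x) -> loves(mary, x))')

print(f1)          # exists x.(boys(x) & hate(x,football))
print(f2)          # all x.(person(x) -> loves(mary,x))
print(f1.free())   # set(): no free (unbound) variables in either formula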


Scope Ambiguity
Every student loves some teacher

(Every student)x loves (some teacher)y

One way: (Every student)x (some teacher)y x loves y

∀x [student(x) → ∃y [teacher(y) ∧ loves(x,y)]]

Another way: (some teacher)y (Every student)x x loves y

∃y [teacher(y) ∧ ∀x [student(x) → loves(x,y)]]

CSA4006-Dr. Anirban Bhowmick


Variables and Quantifiers
Consider the following example.

A restaurant that serves Mexican food near ICSI.

The following would be a reasonable representation of the meaning of such a phrase:

Restaurant(x) ∧ Serves(x; Mexican Food) ∧ Near(Location of(x); Location of(ICSI))

CSA4006-Dr. Anirban Bhowmick


Contd.
For example, if AyCaramba is a Mexican restaurant near ICSI, then substituting
AyCaramba for x results in the following logical formula

Restaurant(AyCaramba) ∧ Serves(AyCaramba; Mexican Food) ∧ Near(Location of(AyCaramba); Location of(ICSI))

 Based on the semantics of the operator ∧, this sentence will be true if all of its three
component atomic formulas are true

CSA4006-Dr. Anirban Bhowmick


Syntax
I only have five dollars and I don’t have a lot of time

Have(Speaker; Five Dollars) ∧ ¬Have(Speaker; Lot Of Time)

The semantic representation for this example is built up in a straightforward way from the
semantics of the individual clauses through the use of the ∧ and ¬ operators

CSA4006-Dr. Anirban Bhowmick




Natural Language Processing
CSA4006
Dr. Anirban Bhowmick
Lecture : 15

Module 4: Semantics
Review

CSA4006-Dr. Anirban Bhowmick


FOPL more examples
Theresa is the mother of John and Mary
mother(Theresa, John) ∧ mother(Theresa, Mary)

John likes oranges but he doesn’t like apples
likes(John, oranges) ∧ ¬likes(John, apples)

Mary is studying pharmacy or medicine
studies(Mary, pharmacy) ∨ studies(Mary, medicine)

CSA4006-Dr. Anirban Bhowmick


FOPL more examples
Everyone likes Venice
∀x likes(x, Venice)

Horses are mammals which are animals
∀x (horse(x) → mammal(x) ∧ animal(x))

All that John inherited was a book
∀x (inherited(John, x) → book(x))

John inherited all of the books
∀x (book(x) → inherited(John, x))

CSA4006-Dr. Anirban Bhowmick


FOPL more examples
Existential quantifier: ∃x p(x) is read as "there exists one x such that p(x)" or "there is at least one x
such that p(x)"

There is at least one bird in the forest
∃x (bird(x) ∧ in(x, forest))

John and Mary are siblings
siblings(John, Mary)

There is one person who likes salad
∃x (person(x) ∧ likes(x, salad))

Everyone likes someone and no one likes everyone
∀x ∃y likes(x, y) ∧ ¬∃x ∀y likes(x, y)

CSA4006-Dr. Anirban Bhowmick


FOPL more examples
The negation connectives and the quantifiers have the highest priority. Then come the connectives of
conjunction and disjunction. After that, implication, and finally the biconditional has the lowest priority.

Similar formulae:

∀x ¬P ≡ ¬∃x P
Example:
Nobody likes John: ∀x ¬like(x, John) ≡ ¬∃x like(x, John)

¬∀x P ≡ ∃x ¬P
Example:
There is at least one person who does not like John: ¬∀x like(x, John) ≡ ∃x ¬like(x, John)

CSA4006-Dr. Anirban Bhowmick


FOPL more examples
Similar formulae:

∀x P ≡ ¬∃x ¬P
Example:
Everyone likes John: ∀x like(x, John) ≡ ¬∃x ¬like(x, John)

∃x P ≡ ¬∀x ¬P
Example:
There is at least one person who likes John: ∃x like(x, John) ≡ ¬∀x ¬like(x, John)

CSA4006-Dr. Anirban Bhowmick


Syntax Driven Semantic Analysis
• How meaning representations are created
• Syntax-driven semantic analysis is a computational approach to semantic analysis that
uses static knowledge from the lexicon and the grammar.
• Based on the Principle of Compositionality: the key idea is that the meaning of a
sentence can be composed from the meanings of its parts
• The meaning of a sentence is not based solely on the words that make it up
• It is based on the ordering, grouping, and relations among the words in the sentence
• The syntactic analysis is then passed as input to a semantic analyzer to produce a meaning
representation

CSA4006-Dr. Anirban Bhowmick


Syntax Driven Semantic Analysis
Franco likes Frasca.

[Figure: parse tree for "Franco likes Frasca" annotated with semantic attachments]

CSA4006-Dr. Anirban Bhowmick


Steps in semantic representation
1. Find the meaning representation corresponding to the verb (e.g., nominates)
- it is the verb whose meaning defines the meaning of the whole sentence
- The meaning representation of the verb acts as the template for meaning
representation of the whole sentence
- The NPs are arguments to the verb and are filled in the template based on their roles

2. Find meaning representation for the two NPs 16

3. Bind the meaning representation of the NPs to the variables in the meaning
representation of the verb to get the meaning representation of the whole sentence

CSA4006-Dr. Anirban Bhowmick


Parse tree to Meaning Representation
How is the mapping from parse tree to meaning representation done?

Augment the lexicon and grammar rules with semantic attachment – devise a mapping between
rules of the grammar and rules of semantic representation (rule to rule hypothesis)

An augmented rule can take the form:

A → α1 … αn { f(α1.sem, …, αn.sem) }

The text appearing within brackets specifies the meaning representation assigned to A as a function of
the semantic attachments of A’s constituents

CSA4006-Dr. Anirban Bhowmick


Contd.
President nominates speaker

Noun → President {President}
Noun → Speaker {Speaker}

{President} and {Speaker} are the meanings associated with the augmented rules

NP → Noun {Noun.sem}

Verb → nominates {∃e,x,y nomination(e) ∧ nominator(e,x) ∧ nominee(e,y)}

VP → Verb NP {Verb.sem(NP.sem)}

To combine NP.sem and Verb.sem, y has to be replaced with speaker, which is not specified in Verb.sem.
We need to revise the semantic attachment for the verb.
CSA4006-Dr. Anirban Bhowmick
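The revision introduced in the next lecture uses λ-abstraction, so the nominee and nominator slots can be filled one at a time; a preview sketch with NLTK (assuming nltk is installed; predicate names follow the slide):

from nltk.sem.logic import Expression

# Verb.sem with lambda-abstraction: bind the nominee y first, then the subject x.
verb_sem = r'\y.\x.exists e.(nomination(e) & nominator(e,x) & nominee(e,y))'

# VP -> Verb NP  { Verb.sem(NP.sem) }
vp_sem = Expression.fromstring(r'(%s)(speaker)' % verb_sem).simplify()
print(vp_sem)   # \x.exists e.(nomination(e) & nominator(e,x) & nominee(e,speaker))

# S -> NP VP  { VP.sem(NP.sem) }
s_sem = Expression.fromstring(r'(%s)(president)' % vp_sem).simplify()
print(s_sem)    # exists e.(nomination(e) & nominator(e,president) & nominee(e,speaker))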
Example:

[Slides: worked derivation combining the semantic attachments over the parse tree]
CSA4006-Dr. Anirban Bhowmick


Compositionality

How do we know how to construct the VP?
love(?, mary) OR love(mary, ?)

How can we specify in which way the bits & pieces combine?

The meaning of the sentence is constructed from:
● the meaning of the words (i.e., the lexicon)
● paralleling the syntactic construction (i.e., the semantic rules)

CSA4006-Dr. Anirban Bhowmick




Natural Language Processing
CSA4006
Dr. Anirban Bhowmick
Lecture : 16

Module 4: Semantics

Topic: Introduction
Review

CSA4006-Dr. Anirban Bhowmick


(Recap of the semantic-attachment example and compositionality slides from Lecture 15.)

CSA4006-Dr. Anirban Bhowmick


Lambda Calculus
Loves(?, Mary)
Add a new operator λ to bind free variables:
λx.love(x, mary)   ("loves Mary")

Gluing together formulae/terms with function application:

(λx.love(x, mary)) @ john
(λx.love(x, mary))(john)

CSA4006-Dr. Anirban Bhowmick


Lambda Calculus
Lambda Calculus is used to combine semantic representations systematically
 Lambda Calculus is an extension of FOPC
Three rules define how to build all syntactically valid lambda terms: every variable is a term;
if φ is a term and x is a variable then λx.φ is a term (abstraction); and if φ and ψ are terms
then φ(ψ) is a term (application)

Eg: (λx.P(x))(Taj) ⇒ P(Taj)

 β-reduction replaces the variable x with Taj and removes the λ

 With λ calculus, the VP semantics problem can be solved

CSA4006-Dr. Anirban Bhowmick


Beta reduction
(λx.love(x, mary))(john)

1. Strip off the λ prefix:
(love(x, mary))(john)

2. Remove the argument:
love(x, mary)

3. Replace all occurrences of the λ-bound variable by the argument:
love(john, mary)

CSA4006-Dr. Anirban Bhowmick
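The same β-reduction can be done mechanically with NLTK's logic package (assuming nltk is installed; '\x.' is its ASCII spelling of λx.):

from nltk.sem.logic import Expression

expr = Expression.fromstring(r'(\x.love(x, mary))(john)')
print(expr)              # (\x.love(x,mary))(john)
print(expr.simplify())   # love(john,mary), the beta-reduced form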


Rules
Rule 1: If α is a terminal node, then [|α|] is specified in the lexicon

Rule 2: If α is a non-branching node and β is its daughter node, then [|α|] = [|β|]

Rule 3: If α is a branching node, {β, γ} is the set of its daughters, and [|β|] is a function whose domain
contains [|γ|], then [|α|] = [|β|]([|γ|])

Lexical Entries:

(i) Proper Names: as it is
(ii) Intransitive Verbs: [|dies|] = λx. x dies
(iii) Transitive Verbs: [|loves|] = λy λx. x loves y

CSA4006-Dr. Anirban Bhowmick


Types
Types are an important differentiator for semanticists

e : individuals (proper nouns)
t : truth values (0, 1)

If σ and τ are types, then <σ, τ> is a type as well; it is the type of functions from things of type σ
to things of type τ. We can write it as f : D_σ → D_τ (e.g., <e,t> is f : D_e → D_t)

Types of the different parts:

S = t
N = e
VP = {takes a subject, outputs a truth value} : <e,t>
V (transitive) = {takes an object, then a subject, outputs truth values} : <e,<e,t>>

CSA4006-Dr. Anirban Bhowmick


Semantic Construction with Lambdas

[Slides: worked lambda-calculus constructions for adjectives, prepositions, and negation]

CSA4006-Dr. Anirban Bhowmick




Natural Language Processing
CSA4006
Dr. Anirban Bhowmick
Lecture : 17

Module 4: Semantics
Review

CSA4006-Dr. Anirban Bhowmick


Attachments for a Fragment of English
Sentences can be declaratives, imperatives, yes/no questions, and wh questions. Let’s start by
considering the following examples:

Flight 487 serves lunch (conveys factual information to a hearer)

Serve lunch (a request for an action)

Does Flight 207 serve lunch? (a request for information)

Which flights serve lunch? (a request for information)

 The meaning representations of these examples all contain propositions concerning the
serving of lunch on flights

 They differ with respect to the role that these propositions are intended to serve

CSA4006-Dr. Anirban Bhowmick


Contd.
 To capture these differences a set of operators is applied to FOPC sentences

Specifically, the following operators will be applied to the FOPC representations


 DCL declaratives
 IMP imperatives
 YNQ yes no questions
 WHQ wh question 10

• The normal interpretation for a representation headed by the DCL operator would be as a factual
statement to be added to the current knowledge base.

• Imperative sentences begin with a verb phrase and lack an overt subject. Because of the missing subject,
the meaning representation for the main verb phrase will consist of a λ expression with an unbound λ
variable representing this missing subject

CSA4006-Dr. Anirban Bhowmick


Contd.
 Simply supply a subject to the λ-expression by applying a final λ-reduction to a dummy constant.

 The IMP operator can then be applied to this representation, as in the following semantic
attachment:

S → VP { IMP(VP.sem(DummyYou)) }

 Applying this rule wraps the proposition for the requested action in the IMP operator

 Imperatives can be viewed as a kind of speech act

CSA4006-Dr. Anirban Bhowmick


Contd.
 yes-no-questions consist of a sentence initial auxiliary verb, followed by a subject noun phrase and
then a verb phrase.

 The following semantic attachment simply ignores the auxiliary and, with the exception of the YNQ
operator, builds the same representation as the corresponding declarative:

S → Aux NP VP { YNQ(VP.sem(NP.sem)) }

 Yes/no questions should be thought of as asking whether the propositional part of the meaning
is true or false given the knowledge currently contained in the knowledge base.

CSA4006-Dr. Anirban Bhowmick


Contd.
 wh-subject-questions ask for specific information about the subject of the sentence rather than
the sentence as a whole.

 The following attachment produces a representation that consists of the operator WHQ, the
variable corresponding to the subject of the sentence, and the body of the proposition.
S → WhWord NP VP { WHQ(NP.sem.var, VP.sem(NP.sem)) }

CSA4006-Dr. Anirban Bhowmick


Contd.
 Such questions can be answered by returning a set of assignments for the subject variable that
make the resulting proposition true with respect to the current knowledge base.
 Finally, consider the following wh non subject question.

How can I go from Minneapolis to Long Beach?

 The question is not about the subject of the sentence but rather some other argument, or some
aspect of the proposition as a whole.


 In this case, the representation needs to provide an indication as to what the question is about.
 The following attachment provides this information by providing the semantics of the auxiliary as an
argument to the WHQ operator.

CSA4006-Dr. Anirban Bhowmick




Natural Language Processing
CSA4006
Dr. Anirban Bhowmick
Lecture : 17

Module 5: Machine Translation and Applications
Basic Issues in Machine Translation - Statistical Translation - Word Alignment - Phrase-based
Translation - Synchronous Grammars - Applications of Natural Language Processing: Spell Check -
Summarization - Language Translation

Module 5: MT
What is Machine Translation?

Automatic conversion of text/speech from one natural language to another

Be the change you want to see in the world

वह परिवर्तन बनो जो संसार में देखना चाहते हो

CSA4006-Dr. Anirban Bhowmick


Use cases
Government:
● Administrative requirements
● Education
● Security

Enterprise:
● Product manuals
● Customer support

Social:
● Travel (signboards, food)
● Entertainment (books, movies, videos)

Translation under the hood:
● Cross-lingual Search
● Cross-lingual Summarization
● Building multilingual dictionaries

Any multilingual NLP system will involve some kind of machine translation at some level.

CSA4006-Dr. Anirban Bhowmick


History of MT

[Figure: timeline of machine translation history]

CSA4006-Dr. Anirban Bhowmick


History of MT
 Georges Artsrouni and Petr Troyanskii received the first-ever patents for MT-like tools in 1933. These
tools were quite rudimentary, especially in comparison to what we think of when we hear the term
“MT” today. They worked by comparing dictionaries in the source and target language

 first general purpose electronic computers were not far off on the horizon — in the mid-1940s,
developers like Warren Weaver began to theorize about ways they could use computers to automate
the translation process.
11

 Early RBMT systems include the Institute Textile de France’s TITUS and Canada’s METEO system,
among others. And while US-based research certainly slowed down after the ALPAC report, it didn’t
come to a complete stop — SYSTRAN, founded in 1968, utilized RBMT as well, working closely with
the US Air Force for Russian-English translation in the 1970s.

 In the 1990s, researchers at IBM developed a renewed interest in MT technology, publishing


research on some of the first SMT systems in 1991. Unlike RBMT, SMT doesn’t require developers
to manually input the rules of each language — instead, SMT engines utilize a bilingual corpus of
text to identify patterns in the languages that could be converted into statistical data.

CSA4006-Dr. Anirban Bhowmick


History of MT
 And as electronic computers slowly became more of a household item, so too did MT systems.
SYSTRAN launched the first web-based MT tool in 1997, providing lay people — not just
researchers and language service providers — access to an MT tool. Nearly a decade later, in 2006,
Google launched Google Translate, which was powered by SMT from 2007 until 2016.
 In 2003, researchers at the University of Montreal developed a language model based on neural
networks, but it wasn’t until 2014, with the development of the sequence-to-sequence (Seq2Seq)
model, that NMT became a formidable rival for SMT.
 After that, NMT quickly became the state-of-the-art MT tool — Google Translate adopted it in 2016.

NMT engines use larger corpora than SMT and are more reliable when it comes to translating long
strings of text with complex sentence structures.
 Although large language models (LLMs) perform a lot of other functions besides translation, some
thought leaders have presented tools like ChatGPT as the future of localization and, by extension,
MT.

CSA4006-Dr. Anirban Bhowmick


Why should you study Machine Translation?
 One of the most challenging problems in Natural Language Processing
 Pushes the boundaries of NLP
 Involves analysis as well as synthesis
 Involves all layers of NLP: morphology, syntax, semantics, pragmatics,
discourse
13
 Theory and techniques in MT are applicable to a wide range of other
problems like transliteration, speech recognition and synthesis

CSA4006-Dr. Anirban Bhowmick


Why is Machine Translation interesting?

Language Divergence ⇒ the great diversity among languages of the world

The central problem of MT is to bridge this language divergence

CSA4006-Dr. Anirban Bhowmick


Language Divergence
Word order: SOV (Hindi), SVO (English), VSO, OSV

E: Argentina won the last World Cup

H: अर्जेंटीना ने पिछला विश्व कप जीता था

Free (Hindi) vs rigid (English) word order

पिछला विश्व कप अर्जेंटीना ने जीता था (correct)

The last World Cup Argentina won (grammatically incorrect)


The last World Cup won Argentina (meaning changes)

CSA4006-Dr. Anirban Bhowmick


Language Divergence.

Different ways of expressing the same concept:

water → पानी, जल, नीर

Language registers:
Formal: आप बैठिये    Informal: तू बैठ
Standard: मुझे डोसा चाहिए    Dakhini: मेरे को डोसा होना

CSA4006-Dr. Anirban Bhowmick


Why is Machine Translation difficult?
● Ambiguity
○ Same word, multiple meanings: मंत्री (minister or chess piece)
○ Same meaning, multiple words: जल, पानी, नीर (water)

● Word Order
○ Underlying deeper syntactic structure 17

○ Phrase structure grammar?


○ Computationally intensive

● Morphological Richness
○ Identifying basic units of words

CSA4006-Dr. Anirban Bhowmick


Approaches to build MT systems

[Figure: overview of approaches to building MT systems]

CSA4006-Dr. Anirban Bhowmick


Rule-based MT
 Rules are written by linguistic experts to analyze the source, generate an intermediate
representation, and generate the target sentence
 Depending on the depth of analysis: interlingua or transfer-based MT

19

CSA4006-Dr. Anirban Bhowmick


Vauquois Triangle
 Translation approaches can be classified by the depth of linguistic analysis they perform

[Figure: the Vauquois triangle]

CSA4006-Dr. Anirban Bhowmick


Problems with rule-based MT
 Required linguistic expertise to develop systems
 Maintenance of system is difficult
 Difficult to handle ambiguity
 Scaling to a large number of language pairs is not easy
21

CSA4006-Dr. Anirban Bhowmick


Example-based MT
Translation by analogy ⇒ match parts of sentences to known translations and then combine

Input: He buys a book on international politics

1. Phrase fragment matching (data-driven):
he buys
a book
international politics

2. Translation of segments (data-driven):
वह खरीदता है
एक किताब
अंतरराष्ट्रीय राजनीति

3. Recombination (human-crafted rules/templates):
वह अंतरराष्ट्रीय राजनीति पर एक किताब खरीदता है

● Partly rule-based, partly data-driven.
● Good methods for matching and large corpora did not exist when proposed.
CSA4006-Dr. Anirban Bhowmick


Natural Language
Processing
CSA4006

Dr. Anirban Bhowmick


Assistant Professor
VIT Bhopal
Lecture : 18
Topic: Statistical Machine Translation
Syllabus

Module 5:
Machine Translation And Applications: Basic Issues in Machine Translation- Statistical Translation- Word Alignment- Phrase based Translation- Synchronous Grammars- Applications of Natural Language Processing: Spell Check- Summarization- Language Translation.
Module 5: MT
Review



SMT
 Parallel corpora are available in several language pairs.

 Basic idea: use a parallel corpus as a training set of translation examples

 Classic example: IBM work on French-English translation, using the Canadian Hansards (1.7 million sentences of 30 words or less in length)

 Idea goes back to Warren Weaver (1949), who suggested applying statistical and cryptanalytic techniques to translation:

“…one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
(Warren Weaver, 1949, in a letter to Norbert Wiener)



The Noisy Channel Model
 Goal: a translation system from French to English

 Have a model p(e|f) which estimates the conditional probability of any English sentence e given the French sentence f. Use the training corpus to set the parameters.

 A Noisy Channel Model has two components:

p(e)    the language model
p(f|e)  the translation model

Giving:

p(e|f) = p(e, f) / p(f) = p(e) p(f|e) / p(f)

and, since p(f) does not depend on e,

argmax_e p(e|f) = argmax_e p(e) p(f|e)
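
As a toy illustration of the decoding rule, the sketch below scores candidate English sentences by p(e) · p(f|e) in log space and returns the argmax. Both probability tables are invented for illustration; in a real system p(e) comes from an n-gram language model and p(f|e) from an alignment-based translation model.

import math

LM = {"the dog": 0.2, "dog the": 0.01}            # toy p(e)
TM = {("le chien", "the dog"): 0.5,               # toy p(f|e)
      ("le chien", "dog the"): 0.5}

def decode(f, candidates):
    """Noisy-channel decoding: argmax_e p(e) * p(f|e), computed in log space."""
    return max(candidates, key=lambda e: math.log(LM[e]) + math.log(TM[(f, e)]))

print(decode("le chien", ["the dog", "dog the"]))  # -> "the dog"

Here the translation model alone cannot choose between the two candidates; the language model breaks the tie in favor of the fluent one, which is exactly the division of labor the noisy channel intends.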



NCM (Noisy Channel Model)

(Figure: the noisy channel pipeline: an English source e passes through a noisy channel p(f|e) to yield the observed French f; translation decodes e back from f.)


SMT
Let’s formalize the translation process

We will model translation using a probabilistic model. Why?


- We would like to have a measure of confidence for the translations we learn
- We would like to model uncertainty in translation

Model: a simplified and idealized understanding of a physical process



SMT


Why use this counter-intuitive way of explaining translation?

● Makes it easier to mathematically represent translation and learn probabilities


● Fidelity and Fluency can be modelled separately



SMT
We have already seen how to learn n-gram language models

Let’s see how to learn the translation model → P(f|e)

To learn sentence translation probabilities, we first need to learn word-level translation probabilities

That is the task of word alignment



Word Alignment
A common use of aligned texts is the derivation of bilingual dictionaries and terminology databases.

This is usually done in two steps. First the text alignment is extended to a word alignment (unless we are dealing with an approach in which word and text alignment are induced simultaneously). Then some criterion such as frequency is used to select the aligned word pairs.

Task: given a parallel sentence pair, find word-level correspondences.


(Figures: worked examples of word-level alignment between parallel sentence pairs; not reproduced.)


Contd.
If we knew the alignments, we could compute P(f|e)




Natural Language
Processing
CSA4006

Dr. Anirban Bhowmick


Assistant Professor
VIT Bhopal
Lecture : 19
Topic: Statistical Machine Translation
Module 5: MT
Review





IBM Model
 IBM Model 1 is a statistical machine translation model that aims to align words between a source language and a target language. The model is designed to learn the probabilities of word alignments from observed parallel sentences in bilingual corpora. The primary goal is to understand how words in the source language correspond to words in the target language.

 Alignments: an alignment a maps each position j in the foreign sentence to the position a_j of the English word it translates (or to a special NULL word).





Alignments in the IBM Models
In IBM Model 1 all alignments a are equally likely:

p(a | e, m) = 1 / (l + 1)^m

where l is the length of the English sentence e, m is the length of the foreign sentence f, and the +1 accounts for the NULL word.

Next step: come up with an estimate for p(f | a, e). In Model 1, this is:

p(f | a, e) = ∏_{j=1..m} t(f_j | e_{a_j})

where t(f | e) is a word-level translation probability.
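
The translation probabilities t(f|e) are learned with the EM algorithm. Below is a compact sketch of IBM Model 1 EM training on a two-sentence toy corpus (NULL alignment is omitted for brevity):

from collections import defaultdict

# Toy parallel corpus: (foreign sentence, English sentence) pairs.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
]

f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization of t(f|e)

for _ in range(20):                           # EM iterations
    count = defaultdict(float)                # expected counts c(f, e)
    total = defaultdict(float)                # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            # E-step: distribute one count for f over all English words,
            # proportional to the current t(f|e) (uniform alignment prior).
            z = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))   # rises toward 1.0 over the iterations

Because "das" co-occurs with both English sentences while "haus" and "buch" each occur in only one, EM gradually concentrates t(haus|house) and t(buch|book): the classic pigeonhole effect that makes Model 1 work.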



Phrase-Based Translation
Phrase-based machine translation is an approach that translates smaller units of text, typically phrases or short sequences of words, rather than translating word by word. This allows more flexibility in capturing linguistic variation and improves overall translation quality.

• Word-Based Models translate words as atomic units

• Phrase-Based Models translate phrases as atomic units

• Foreign input is segmented into phrases
• Each phrase is translated into English
• Phrases are reordered



Phrase Translation Table
Main knowledge source: table with phrase translations and their probabilities

Example: phrase translations for natuerlich (table of candidate translations with probabilities; not reproduced)


Phrase Translation Table
Phrase translations for den Vorschlag learned from the Europarl corpus (table not reproduced); the learned entries show:

– lexical variation (proposal vs suggestions)


– morphological variation (proposal vs proposals)
– included function words (the, a, ...)
– noise (it)
Linguistic Phrases?
Model is not limited to linguistic phrases
(noun phrases, verb phrases, prepositional phrases, ...)

• Example non-linguistic phrase pair


spass am → fun with the

• Prior noun often helps with translation of preposition

• Experiments show that limitation to linguistic phrases hurts quality



Probabilistic Model
In the standard phrase-based formulation (Koehn), the best translation maximizes the product of phrase translation probabilities, a reordering (distortion) score, and the language model:

e_best = argmax_e ∏_{i=1..I} φ(f̄_i | ē_i) · d(start_i - end_{i-1} - 1) · p_LM(e)

Distance-Based Reordering
With x = start_i - end_{i-1} - 1, i.e. how far the start of the i-th source phrase is from the end of the previous one, a common choice is d(x) = α^|x| for some α < 1: monotone translation (x = 0) costs nothing, and long jumps are penalized exponentially.


Learning a Phrase Translation Table
• Three stages:
– word alignment: using IBM models or other method
– extraction of phrase pairs
– scoring phrase pairs



Learning a Phrase Translation Table


All words of the phrase pair have to align to each other
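
This consistency criterion (every alignment point that touches the pair must lie fully inside it, and the pair must cover at least one point) can be checked directly. A sketch, assuming the word alignment is given as a set of (source index, target index) pairs:

def is_consistent(alignment, f_span, e_span):
    """Phrase-extraction consistency check.

    alignment: set of (f_index, e_index) alignment points
    f_span, e_span: (start, end) inclusive index ranges of the candidate pair
    """
    f_lo, f_hi = f_span
    e_lo, e_hi = e_span
    covered = any(f_lo <= f <= f_hi and e_lo <= e <= e_hi for f, e in alignment)
    if not covered:                   # must contain at least one alignment point
        return False
    for f, e in alignment:            # no point may cross the pair's boundary
        if (f_lo <= f <= f_hi) != (e_lo <= e <= e_hi):
            return False
    return True

A = {(0, 0), (1, 2), (2, 1)}                  # toy alignment
print(is_consistent(A, (1, 2), (1, 2)))       # True: both points enclosed
print(is_consistent(A, (1, 1), (1, 1)))       # False: f1 aligns to e2, outside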





Scoring Phrase Translations
• Phrase pair extraction: collect all phrase pairs from the data
• Phrase pair scoring: assign probabilities to phrase translations
• Score by relative frequency:

φ(f̄ | ē) = count(ē, f̄) / Σ_{f̄′} count(ē, f̄′)
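
A sketch of the relative-frequency estimate over a list of extracted phrase pairs (the pairs below are invented for illustration):

from collections import Counter

pairs = [("the proposal", "den Vorschlag"),    # hypothetical extracted
         ("the proposal", "den Vorschlag"),    # (english, foreign) phrase pairs
         ("the proposal", "der Vorschlag"),
         ("fun with the", "spass am")]

pair_count = Counter(pairs)
e_count = Counter(e for e, _ in pairs)

def phi(f, e):
    """phi(f|e) = count(e, f) / count(e), the relative-frequency estimate."""
    return pair_count[(e, f)] / e_count[e]

print(phi("den Vorschlag", "the proposal"))    # -> 0.666...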





Natural Language
Processing
CSA4006

Dr. Anirban Bhowmick


Assistant Professor
VIT Bhopal
Lecture : 20
Topic: Neural Machine Translation
Module 5: MT
Review





Encoder Decoder Model

Encoder: Takes in the input sentence and produces a fixed-size context vector.
Decoder: Takes the context vector and generates the output sentence in the target language.





Contd.
The encoder
Layers of recurrent units where, at each time step, an input token is received, relevant information is collected, and a hidden state is produced. The details depend on the type of RNN; in our example, an LSTM, the unit mixes the current hidden state with the input and returns an output (which is discarded) and a new hidden state.

The encoder vector
The encoder vector is the last hidden state of the encoder, and it tries to contain as much of the useful input information as possible to help the decoder get the best results. It is the only information from the input that the decoder will get.

The decoder
Layers of recurrent units (e.g., LSTMs) where each unit produces an output at a time step t. The hidden state of the first unit is the encoder vector, and the rest of the units accept the hidden state from the previous unit. The output is calculated using a softmax function to obtain a probability for every token in the output vocabulary.
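
A minimal PyTorch sketch of this encoder-decoder setup. The dimensions, vocabulary sizes, and random inputs are arbitrary; a real NMT system adds training, masking, and beam-search decoding.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder: the final (h, c) state is the fixed-size context vector.
        _, context = self.encoder(self.src_emb(src))
        # Decoder: initialized with the context, consumes the shifted target.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)               # logits over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1000, (2, 5))   # batch of 2 target prefixes, length 5
print(model(src, tgt).shape)           # -> torch.Size([2, 5, 1000])

Note how everything the decoder knows about the source must squeeze through context, which is exactly the bottleneck discussed on the next slide.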



Problem and Solution
Why? Longer sentences illustrate the limitations of a single-directional encoder-decoder architecture.

Because language consists of tokens and grammar, the problem with this model is that it does not entirely address the complexity of the grammar.

Specifically, when translating the nth word in the source language, the RNN considers only the first n words of the source sentence; but grammatically, the meaning of a word depends on the sequence of words both before and after it in a sentence.

A solution: the bi-directional LSTM model. A bi-directional model allows us to input the context of both past and future words to create an accurate encoder output vector.



Bi-LSTM

But then the challenge becomes: which word do we need to focus on in a sequence?


Attention Mechanism
Attention Mechanism Overview:
The attention mechanism enhances the traditional encoder-decoder architecture by allowing the decoder to "pay attention" to different parts of the source sentence when generating each word in the target sequence.
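
To make this concrete, here is a sketch of simple dot-product attention over the encoder states. Shapes are illustrative; Bahdanau-style attention would use a small learned feed-forward scorer instead of the raw dot product.

import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """Dot-product attention.

    decoder_state:  (hidden,)          current decoder hidden state
    encoder_states: (src_len, hidden)  one vector per source token
    Returns the context vector and the attention weights.
    """
    scores = encoder_states @ decoder_state    # (src_len,) similarity scores
    weights = F.softmax(scores, dim=0)         # normalize to a distribution
    context = weights @ encoder_states         # weighted sum of source states
    return context, weights

enc = torch.randn(7, 128)          # encoder states for 7 source tokens
dec = torch.randn(128)
ctx, w = attend(dec, enc)
print(ctx.shape, float(w.sum()))   # -> torch.Size([128]) and weights summing to 1.0

The weights answer exactly the question raised on the previous slide: at each decoding step they say which source words to focus on, and the context vector is recomputed per step instead of being fixed once.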







