Unit 4

CS525PE: Natural Language Processing (Professional Elective – II)

Prerequisites:
1. Data structures and compiler design

Course Objectives:
Introduction to some of the problems and solutions of NLP and their relation to linguistics and statistics.

Course Outcomes:
1. Show sensitivity to linguistic phenomena and an ability to model them with formal grammars.
2. Understand and carry out proper experimental methodology for training and evaluating empirical NLP systems.
3. Manipulate probabilities, construct statistical models over strings and trees, and estimate parameters using supervised and unsupervised training methods.
4. Design, implement, and analyze NLP algorithms, and design different language modelling techniques.
UNIT - I
Finding the Structure of Words: Words and Their
Components, Issues and Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods,
Complexity of the Approaches, Performances of the
Approaches, Features
UNIT - II
For a sentence such as "The cat chased the mouse," the PAS would be: chased(cat, mouse)
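A predicate-argument structure like chased(cat, mouse) can be held in a small data structure, as in the sketch below; the class and field names are only illustrative assumptions, not a standard format.

# Minimal sketch of a predicate-argument structure (PAS) as a data type.
# The class name and fields are illustrative, not a standard representation.

from dataclasses import dataclass

@dataclass
class PAS:
    predicate: str        # the action or state, e.g. "chased"
    arguments: list       # its participants, e.g. ["cat", "mouse"]

    def __str__(self):
        # Render in the functional notation used above, e.g. chased(cat, mouse)
        return f"{self.predicate}({', '.join(self.arguments)})"

print(PAS("chased", ["cat", "mouse"]))  # chased(cat, mouse)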
1.1 Resources:
These resources help computers understand the meaning of sentences by identifying the action and who is involved. This is important for things like translating languages, answering questions, and even helping virtual assistants understand commands better.
i) FrameNet
ii) PropBank

1.1.1 FRAMENET:
FrameNet looks at how words are used in different situations (frames) and identifies the roles that other words play in these situations. It is based on the theory of frame semantics, which suggests that the meaning of a word can be understood in terms of the situations it describes.

Key Elements:
Frames: A frame is a type of situation or scenario. Each frame involves certain participants, which are called frame elements.
Frame Elements: These are the roles played by the different participants in a frame.
Lexical Units (LUs): These are pairs of words and their meanings (frames). Each lexical unit is a specific meaning of a word in a given frame.

Think of the word "break" in two different frames:
Frame 1: "Break" as in breaking a rule
Roles:
Breaker (the person who breaks the rule)
Rule (the rule being broken)
Frame 2: "Break" as in breaking an object
Roles:
Breaker (the person who breaks the object)
Object (the thing being broken)

Working:
1. Identify frames: Researchers identify common situations (frames).
2. Assign frame elements: Each frame has specific roles.
3. Label sentences: Sentences are tagged with these frames and frame elements to show how words are used in context.

Example:
Frame: COMMERCE_BUY
Sentence: "John bought a car from Mary for $20,000."
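As a rough illustration of the annotation described above, the COMMERCE_BUY example can be written down as a small frame-style record. The frame-element names (Buyer, Seller, Goods, Money) and the Python layout below are illustrative assumptions, not output of the actual FrameNet toolkit.

# A minimal sketch of a FrameNet-style annotation for the COMMERCE_BUY
# example above. Role names and data layout are illustrative assumptions.

sentence = "John bought a car from Mary for $20,000."

annotation = {
    "frame": "COMMERCE_BUY",
    "lexical_unit": "buy.v",      # the word sense that evokes the frame
    "target": "bought",           # the word in the sentence evoking the frame
    "frame_elements": {           # role -> text span playing that role
        "Buyer": "John",
        "Goods": "a car",
        "Seller": "Mary",
        "Money": "$20,000",
    },
}

# Print the labelled sentence in a readable form.
for role, span in annotation["frame_elements"].items():
    print(f"[{role}: {span}]", end=" ")
print(f"-- frame: {annotation['frame']}")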
1.1.2 PROPBANK:
PropBank annotates the verbs in a corpus with their predicate-argument structure, using numbered core arguments and ARGM modifiers. Its key elements are:

Predicate: Usually a verb, it represents an action or state.

Core Arguments: These are essential participants directly involved with the predicate:
ARG0: Typically the agent or doer of the action.
ARG1: Typically the patient or theme (the entity undergoing the action).
ARG2, ARG3, ARG4: Other roles that vary depending on the verb's meaning.

Adjunctive Arguments: These provide additional information about the action and are labelled ARGM-XYZ, where XYZ indicates the type of information:
ARGM-LOC: Location (e.g., "in the hotel")
ARGM-TMP: Time (e.g., "yesterday")
ARGM-MNR: Manner (e.g., "quickly")
ARGM-CAU: Cause (e.g., "because he was hungry")
ARGM-DIR: Direction (e.g., "to the store")
ARGM-PRP: Purpose (e.g., "to buy groceries")
ARGM-NEG: Negation (e.g., "not")
ARGM-MOD: Modality (e.g., "can," "might")

Example:
Sentence: "John gave Mary a book"
[ARG0 John] [gave] [ARG2 Mary] [ARG1 a book].

Example of a complex annotation:
Sentence: "The company operates stores mostly in Iowa and Nebraska."
Predicate: operates
Arguments:
ARG0 (Agent): The company
ARG1 (Theme): stores
ARGM-LOC (Location): mostly in Iowa and Nebraska
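The bracketed labelling above can be sketched in code as a list of (label, span) pairs. This layout is chosen only for illustration; it is not the official PropBank annotation format.

# A small sketch of PropBank-style labelling for the two examples above.
# The (label, span) tuple layout is an illustrative choice, not the official
# PropBank file format.

def render(labelled_spans):
    """Render (label, span) pairs in the bracketed style used in the notes;
    an empty label marks the bare predicate."""
    return " ".join(f"[{label} {span}]" if label else f"[{span}]"
                    for label, span in labelled_spans)

simple = [("ARG0", "John"), ("", "gave"), ("ARG2", "Mary"), ("ARG1", "a book")]
print(render(simple))
# [ARG0 John] [gave] [ARG2 Mary] [ARG1 a book]

complex_example = [
    ("ARG0", "The company"),
    ("", "operates"),
    ("ARG1", "stores"),
    ("ARGM-LOC", "mostly in Iowa and Nebraska"),
]
print(render(complex_example))
# [ARG0 The company] [operates] [ARG1 stores] [ARGM-LOC mostly in Iowa and Nebraska]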
1.2 Other Resources:
1. NomBank
2. VerbNet

1.3 Software:
Following is a list of software packages available for semantic role labelling.
1. ASSERT (Automatic Statistical Semantic Role Tagger):
A semantic role labeller trained on the English PropBank data.
2. C-ASSERT:
An extension of ASSERT for the Chinese language.
3. SwiRL:
Another semantic role labeller trained on PropBank data.
4. Shalmaneser (A Shallow Semantic Parser):
A toolchain for shallow semantic parsing based on the FrameNet data.

2. Meaning Representation Systems:
Meaning representation is a deeper level of semantic interpretation aimed at converting natural language into a format that machines can understand and act on. This process is similar to how programming languages are compiled into machine code that computers execute. Unlike artificial languages, natural language is flexible and relies on context and general world knowledge for understanding, which poses a challenge for machines. Researchers have been working for decades to develop methods to interpret and encode this context and knowledge for machines. However, current techniques are limited to specific domains and problems and do not scale well to arbitrary domains.

2.1 Resources:
1. ATIS (Air Travel Information System):
The ATIS project was one of the first major efforts to develop systems that convert natural language into a form usable by applications for decision-making. Specifically, it focused on transforming user queries about flight information into SQL queries to extract answers from a flight database.
Here is how it worked:
1. A user would ask a question in natural speech using a restricted vocabulary.
2. The system would convert this query into a hierarchical frame representation, encoding the essential semantic information.
3. This representation was then compiled into an SQL query to retrieve the required data from the database.
The ATIS training corpus included over 7,300 spoken utterances from 137 subjects, with 2,900 of them categorized and annotated and around 600 treebanked for detailed syntactic analysis. This resource helped promote experimentation in transforming natural language into machine-readable formats.

2. COMMUNICATOR:
The Communicator program was the next step after the ATIS project. While ATIS focused on user-initiated dialogues, where users ask questions and machines provide answers, Communicator introduced a mixed-initiative dialog system. This means both the user and the machine could actively participate in the conversation.
3. GeoQuery:
GeoQuery is a Natural Language Interface (NLI) designed to interact with a geographic database called Geobase. Geobase contains about 800 Prolog facts, which store geographic information such as populations, neighbouring states, major rivers, and major cities in a relational database.

4. RoboCup: CLang
RoboCup is an international competition where teams of robots play soccer, organized by the artificial intelligence community. The goal is to advance AI and robotics research through this challenging and fun domain. CLang is the formal coach language used to give instructions to the players in this domain.

2.2 Software:
WASP
KRISPER
CHILL
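The GeoQuery idea above, a natural-language question answered against a fact base, can be sketched very roughly as follows. The facts and the single supported question pattern are invented for illustration; the real Geobase holds about 800 Prolog facts, and GeoQuery handles far more varied questions.

# Toy sketch of a GeoQuery-style natural language interface: a restricted
# question is mapped onto a tiny fact base. The facts and the single
# supported pattern below are invented for illustration only.

import re

CAPITALS = {          # capital(state, city) facts, hand-picked here
    "texas": "Austin",
    "iowa": "Des Moines",
    "nebraska": "Lincoln",
}

def answer(question):
    """Answer questions of the form 'What is the capital of <state>?'."""
    match = re.search(r"capital of (\w+)", question.lower())
    if match and match.group(1) in CAPITALS:
        return CAPITALS[match.group(1)]
    return "unknown"

print(answer("What is the capital of Texas?"))  # Austin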
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Bayesian Parameter Estimation, Language Model Adaptation, Language Models (class based, variable length, Bayesian topic based), Multilingual and Cross Lingual Language Modeling

Language Modeling:
5.1 Introduction:
What is language modeling?
Language modeling, or LM, is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions.
Language modeling is used in artificial intelligence (AI), natural language processing (NLP), natural language understanding, and natural language generation systems, particularly ones that perform text generation, machine translation, and question answering.
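The definition above says a language model assigns a probability to a sequence of words. A minimal sketch of this idea is an unsmoothed bigram model; the toy corpus below is invented, and real models are trained on large text collections and use smoothing.

# Minimal sketch of a bigram language model: estimate P(word | previous word)
# from counts and score a sentence as the product of bigram probabilities.
# The toy corpus is invented; real models need much more data and smoothing.

from collections import Counter, defaultdict

corpus = [
    "<s> the cat chased the mouse </s>",
    "<s> the dog chased the cat </s>",
    "<s> the mouse ran </s>",
]

bigram_counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev), without smoothing."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

def sentence_prob(sentence):
    """Probability of a sentence as a product of its bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, curr)
    return prob

print(sentence_prob("the cat chased the mouse"))  # 0.04 on this toy corpus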