NLP Unit Test 2

1. Compare top-down and bottom-up parsing.

2. What do you mean by Synset? Give an example.


A "Synset" is a term used in linguistics and natural language processing to refer to a set
of words or phrases that are synonymous or semantically related. In other words, a
Synset groups together words that have similar meanings.
For example, consider the words "car," "automobile," and "vehicle." These words are
related in meaning because they all refer to a type of transportation. In a Synset, these
words would be grouped together. In the context of computational linguistics and
natural language processing, Synsets are often used in lexical databases like WordNet.
WordNet, for instance, is a large lexical database of English that groups words into sets
of synonyms called synsets. Each synset is linked to other synsets by means of semantic
relationships.
So, for the words "car," "automobile," and "vehicle," there would be a synset that
includes all of them, indicating their semantic relationship.
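For illustration, synsets can be inspected programmatically through NLTK's WordNet interface. This is only a minimal sketch, assuming the nltk package and its WordNet corpus are installed:

    # Minimal sketch using NLTK's WordNet interface
    # (assumes nltk is installed and nltk.download('wordnet') has been run).
    from nltk.corpus import wordnet as wn

    for syn in wn.synsets('car'):
        # Each synset groups synonymous lemmas and carries a gloss (definition).
        print(syn.name(), syn.lemma_names(), '-', syn.definition())

    # One of the noun synsets groups 'car', 'auto', 'automobile', 'machine', 'motorcar'.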

3. How is the thesaurus-based approach used to find word similarity?


The thesaurus-based approach to word similarity relies on the idea that words with similar
meanings are listed close together in a thesaurus, a lexical resource that groups words by
their semantic relationships.
In this approach, words are represented as nodes in a graph, and edges between nodes
represent the strength of their semantic relationship. The distance or path length
between two nodes in this graph can be used to measure the similarity between the
corresponding words. Shorter paths indicate higher similarity.
For instance, let's consider the words "happy" and "joyful". In a thesaurus, they would
likely be listed as synonyms or near-synonyms. Therefore, the distance between their
nodes in the graph would be short, indicating a high level of similarity.
This approach is useful in various natural language processing tasks such as information
retrieval, text summarization, and sentiment analysis. By quantifying word similarity,
it allows algorithms to better understand the context and meaning of text.
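A rough sketch of this idea is shown below, using WordNet path lengths through NLTK as a stand-in for the thesaurus graph (an assumption: noun synsets are used because path similarity is defined over the hypernym hierarchy):

    # Path-based similarity over WordNet's hypernym graph: shorter paths give
    # higher scores (path_similarity = 1 / (shortest_path_length + 1)).
    from nltk.corpus import wordnet as wn

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')
    car = wn.synset('car.n.01')

    print(dog.path_similarity(cat))   # relatively high: the synsets are close in the hierarchy
    print(dog.path_similarity(car))   # lower: the connecting path is much longer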
4. For the given concept graph, find Sim(coinage, money) and Sim(coinage, Budget) using the
Resnik, Lin, and Jiang-Conrath (JC) methods.

5. Explain, with examples, the relationships between word senses.

6. Explain with suitable examples the following relationships between word meanings:
homonymy, polysemy, synonymy, antonymy.

• Homonymy: Words with the same spelling or form but entirely different, unconnected meanings, e.g., "bat" (animal / sports equipment), "bank" (riverbank / financial institution).
• Hyponymy: The relationship between a generic term and its specific instances; the generic term is the hypernym, and the instances are its hyponyms, e.g., fruit (hypernym) and apple, banana, orange (hyponyms).
• Polysemy: A single word or phrase with several different but related meanings, i.e., the same spelling covering various related senses, e.g., "mouse" (referring to both a small rodent and a computer input device).
• Synonymy: The relationship between two lexical items that have different forms but express the same or a very similar meaning, e.g., "happy" and "joyful".
• Antonymy: The relationship between two lexical items whose meanings are opposed, symmetric with respect to some semantic axis, e.g., "hot" and "cold" (opposites on the temperature scale).
• Meronymy: The part-whole relationship, in which one term denotes a component or member of another, e.g., wheel (part of) car.
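Several of these relations are encoded directly in WordNet and can be queried with NLTK; a small sketch, assuming nltk and its WordNet corpus are available:

    from nltk.corpus import wordnet as wn

    # Hyponymy / hypernymy: 'apple' is a kind of (edible) fruit.
    print(wn.synset('apple.n.01').hypernyms())

    # Antonymy is stored on lemmas rather than on synsets: hot <-> cold.
    print(wn.synset('hot.a.01').lemmas()[0].antonyms())

    # Meronymy (part-whole): parts of a car.
    print(wn.synset('car.n.01').part_meronyms())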

7. List and explain the steps in text processing for Information Retrieval.


Text processing in Information Retrieval involves several key steps to prepare and analyze text
data for effective retrieval. Here are the main steps along with brief explanations:
• Tokenization: Breaking a text into individual units, or tokens, which are typically words or
punctuation marks. Example: The sentence "Chatbots are fascinating!" would be tokenized into
["Chatbots", "are", "fascinating", "!"].
• Lowercasing: Convert all tokens to lowercase to ensure case insensitivity. This prevents
"Chatbots" and "chatbots" from being treated as different terms. Example: "Chatbots" becomes
"chatbots".
• Stopword Removal: Remove common and less informative words like "the", "is", and "and", which
occur frequently but contribute little to the meaning. Example: "The quick brown fox jumps over
the lazy dog" might become "quick brown fox jumps lazy dog".
• Stemming or Lemmatization: Reduce words to their base or root form (stem) to capture the core
meaning; lemmatization is a more controlled approach that considers the context and part of
speech. Example: "running" is stemmed to "run", and lemmatization maps "ran" to "run".
• Normalization: Additional cleaning such as removing special characters, numbers, or specific
symbols. Example: "Let's meet at 3:30 pm!" might become "Lets meet at pm".
• Term Frequency (TF) Calculation: Calculate how often each term occurs in a document. This
helps in understanding the importance of terms within a document. Example: In the sentence
"Chatbots are fascinating, and chatbots are useful.", the term "chatbots" has a TF of 2.
• Inverse Document Frequency (IDF) Calculation: Assess the rarity of a term across all
documents in a corpus. Rare terms are often more informative. Example: "Chatbots" might
have a low IDF if it appears frequently in many documents.
• Vectorization: Represent each document as a numerical vector in a high-dimensional space.
This can be done using techniques like Bag-of-Words or Word Embeddings. Example: The
sentence "Chatbots are fascinating!" might be represented as [1, 1, 1, 0, ...] in a Bag-of-Words
model.
• Indexing: Create an index or data structure that allows for efficient querying and retrieval of
documents based on their vector representations. Example: Using an inverted index to map
terms to the documents they appear in.
• Query Processing: Apply similar preprocessing steps to user queries to prepare them for
comparison with indexed documents. Example: If a user enters "Tell me about chatbots", this
would go through tokenization, stopword removal, etc.
These steps collectively form the preprocessing pipeline for text data in Information Retrieval
systems, enabling efficient storage, retrieval, and ranking of relevant documents based on user
queries.
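Several of these steps (tokenization, lowercasing, stopword removal, TF and IDF weighting, vectorization) can be prototyped together with scikit-learn's TfidfVectorizer; a minimal sketch, assuming a recent scikit-learn is installed:

    # TfidfVectorizer bundles tokenization, lowercasing, stopword removal and
    # TF-IDF weighting; the resulting matrix is the vectorized document collection.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "Chatbots are fascinating!",
        "Chatbots are fascinating, and chatbots are useful.",
        "The quick brown fox jumps over the lazy dog.",
    ]

    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    doc_vectors = vectorizer.fit_transform(docs)      # one TF-IDF vector per document

    print(vectorizer.get_feature_names_out())         # the indexed vocabulary
    print(doc_vectors.toarray().round(2))             # document-term weights

    # A query goes through the same pipeline before being compared with documents:
    query_vector = vectorizer.transform(["Tell me about chatbots"])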

8. Explain the Yarowsky bootstrapping approach to semi-supervised learning.


The Yarowsky Bootstrapping Algorithm is a semi-supervised learning method used in
natural language processing. It iteratively refines a model's understanding of word
senses using a small initial labeled dataset and a larger, unlabeled dataset. The process
starts with a seed set of labeled examples, where each word is tagged with its sense.
The model then uses this initial information to predict senses for the unlabeled data.
These predictions are incorporated into the training set, expanding the labeled dataset.
This augmented dataset is then used to retrain the model, which produces more
accurate predictions. This cycle repeats until a stopping criterion is met. The strength
of the Yarowsky approach lies in its ability to leverage a small amount of labeled data
to make predictions on a larger pool of unlabeled data, gradually improving accuracy
through iterations.
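A toy sketch of the bootstrapping loop is shown below; the seed examples, the bag-of-words features, and the 0.9 confidence threshold are illustrative assumptions, and a Naive Bayes classifier stands in for whichever supervised learner is used:

    # Illustrative bootstrapping loop: train on seeds, label confident unlabeled
    # examples, add them to the training set, and retrain.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    labeled_X = ["the river bank flooded after the rain",
                 "the bank approved the loan application"]       # seed contexts
    labeled_y = ["bank/river", "bank/finance"]
    unlabeled = ["deposit money at the bank",
                 "fishing from the muddy river bank",
                 "the bank raised interest rates on the loan"]

    for _ in range(3):                                  # a few bootstrapping rounds
        vec = CountVectorizer()
        clf = MultinomialNB().fit(vec.fit_transform(labeled_X), labeled_y)
        if not unlabeled:
            break
        probs = clf.predict_proba(vec.transform(unlabeled))
        still_unlabeled = []
        for sent, row in zip(unlabeled, probs):
            if row.max() >= 0.9:                        # confident prediction: promote to training set
                labeled_X.append(sent)
                labeled_y.append(clf.classes_[row.argmax()])
            else:
                still_unlabeled.append(sent)            # stays unlabeled for the next round
        unlabeled = still_unlabeled

    print(clf.predict(vec.transform(["she opened a savings account at the bank"])))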

9. For the given corpus:

<s> Martin Justin can watch Will </s>
<s> Martin Justin can watch Will </s>
<s> Spot will watch Martin </s>
<s> Will Justin spot Martin </s>
<s> Martin will pat Spot </s>

N: Noun [Martin, Justin, Will, Spot, Pat]
M: Modal verb [can, will]
V: Verb [watch, spot, pat]

Create the transition matrix and the emission probability matrix. Then, for the statement
"Justin will spot Will", apply the Hidden Markov Model and do POS tagging.
Transition Matrix:
Given the states:
• N (Noun)
• M (Modal verb)
• V (Verb)
• E (End of sentence)
And the transition counts from the corpus:

      N   M   V   E
  N   0   1   2   2
  M   1   0   2   2
  V   1   1   1   2
  E   0   0   0   0

Emission Probability Matrix:

Given the words:
• Martin, Justin, can, watch, Will, Spot, will, pat, Pat
The emission counts are obtained from the corpus by counting how often each word occurs under
each tag, and normalizing per tag to get emission probabilities.

Applying the Hidden Markov Model for POS Tagging:
Given the statement: "Justin will spot Will"
1. Initialization:
• Start with initial probabilities based on the corpus frequencies.
2. Forward Algorithm:
• Calculate forward probabilities for each state at each word position.
3. Backward Algorithm:
• Calculate backward probabilities for each state at each word position.
4. Combine Probabilities:
• Combine forward and backward probabilities to get the state probabilities at
each word position.
5. Decoding:
• For each word, select the state with the highest probability as the POS tag.
The POS tagging for the statement "Justin will spot Will" would be:
• Justin (Noun), will (Modal verb), spot (Verb), Will (Noun)
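The same decoding can also be done with the Viterbi algorithm, which finds the single most likely tag sequence. Below is a generic sketch; the probability tables are placeholder values for illustration, not probabilities derived from the count matrices above:

    # Generic Viterbi decoder for an HMM tagger; `start_p`, `trans_p`, `emit_p`
    # below are placeholder probabilities, not values estimated from the corpus.
    def viterbi(words, states, start_p, trans_p, emit_p):
        # V[t][s] = (best probability of reaching state s at position t, best previous state)
        V = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
        for t in range(1, len(words)):
            V.append({})
            for s in states:
                prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
                V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s]
                           * emit_p[s].get(words[t], 0.0), prev)
        # Backtrack from the most probable final state.
        state = max(states, key=lambda s: V[-1][s][0])
        tags = [state]
        for t in range(len(words) - 1, 0, -1):
            state = V[t][state][1]
            tags.insert(0, state)
        return tags

    states = ["N", "M", "V"]
    start_p = {"N": 0.8, "M": 0.1, "V": 0.1}                      # placeholder values
    trans_p = {"N": {"N": 0.1, "M": 0.4, "V": 0.5},
               "M": {"N": 0.2, "M": 0.1, "V": 0.7},
               "V": {"N": 0.8, "M": 0.1, "V": 0.1}}
    emit_p = {"N": {"justin": 0.2, "will": 0.2},                  # placeholder values
              "M": {"will": 0.8},
              "V": {"spot": 0.7}}
    print(viterbi("justin will spot will".split(), states, start_p, trans_p, emit_p))
    # -> ['N', 'M', 'V', 'N'], i.e. Noun, Modal verb, Verb, Noun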

10. For the given grammar, parse the following statement using the CYK (CKY) algorithm:
a. "Book the meal flight"
Rules:

10. Explain Text summarization in detail.


Text summarization is the process of condensing a longer piece of text while
preserving its core meaning and information. It's a crucial task in natural language
processing with applications in information retrieval, document categorization, and
more. There are two main approaches to text summarization:

Extractive Summarization:

In extractive summarization, sentences or phrases from the original text are selected
and combined to form the summary. This approach directly uses parts of the original
text. It's similar to how a person might extract key sentences when creating a
summary. Techniques for extractive summarization include:
TF-IDF (Term Frequency-Inverse Document Frequency): Ranks sentences based on
their importance in the context of the entire document.
Graph-Based Methods: Represent the text as a graph and use algorithms like
PageRank to find the most important sentences.
Machine Learning Approaches: Train models to predict the importance of sentences.
Abstractive Summarization:

Abstractive summarization involves generating new sentences that capture the main
points of the original text, potentially using different words and structures. It requires
a deeper understanding of the text and is more similar to how humans create
summaries. Techniques for abstractive summarization include:
Sequence-to-Sequence Models (e.g., using LSTM or Transformer architectures):
Train models to generate summaries by learning to map input sequences to output
sequences.
Pre-trained Transformer Models (e.g., GPT-style decoders, BART, T5): Fine-tune or prompt
large-scale pre-trained language models to generate abstractive summaries.
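A minimal sketch of the extractive approach is given below: sentences are scored by their average TF-IDF weight and the top k are kept. The scoring rule and k are illustrative choices, not a specific published method, and scikit-learn is assumed to be installed:

    # Extractive summarization sketch: rank sentences by average TF-IDF weight.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def extractive_summary(sentences, k=2):
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        scores = tfidf.mean(axis=1).A1                  # average TF-IDF weight per sentence
        top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
        return " ".join(sentences[i] for i in top)      # keep the original sentence order

    sentences = [
        "Text summarization condenses a document while preserving its core ideas.",
        "Extractive methods select important sentences from the original text.",
        "Abstractive methods generate new sentences that paraphrase the content.",
    ]
    print(extractive_summary(sentences, k=2))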
11. Explain Maximum Entropy Model for POS Tagging

The Maximum Entropy Model is a probabilistic model used in natural language processing for
tasks like part-of-speech (POS) tagging. It is based on the principle of maximum entropy,
which states that, among all models consistent with a given set of feature constraints, the
best model is the one with the highest entropy, i.e., the one that makes the fewest
additional assumptions.

Here's how the Maximum Entropy Model works for POS tagging:

Feature Selection:

Identify relevant features that can help predict the correct part-of-speech tag for a
given word in context. Features may include the word itself, surrounding words,
capitalization, suffixes, etc.
Define Feature Functions:

Associate each feature with a function that maps an input (e.g., a word and its context)
to a binary value indicating whether the feature is present or not.
Collect Training Data:

Gather a labeled dataset where each word is tagged with its corresponding part-of-
speech.
Training:

Use an optimization algorithm (like Generalized Iterative Scaling) to find the model
parameters that maximize the likelihood of the training data, subject to the feature
function constraints.
Prediction:

Given a new sentence, calculate the probability distribution over possible tags for
each word using the trained model. The tag with the highest probability is assigned to
the word.
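A toy sketch follows; scikit-learn's multinomial logistic regression stands in for the MaxEnt model (it optimizes the same objective, though with L-BFGS rather than Generalized Iterative Scaling), and the features and the tiny training set are illustrative assumptions:

    # Toy MaxEnt (multinomial logistic regression) POS tagger sketch.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(words, i):
        # Feature functions: the word, capitalization, suffix, previous word.
        return {
            "word": words[i].lower(),
            "is_capitalized": words[i][0].isupper(),
            "suffix2": words[i][-2:].lower(),
            "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        }

    train_sents = [(["John", "reads", "books"], ["NOUN", "VERB", "NOUN"]),
                   (["Mary", "likes", "music"], ["NOUN", "VERB", "NOUN"])]

    X, y = [], []
    for words, tags in train_sents:
        for i, tag in enumerate(tags):
            X.append(features(words, i))
            y.append(tag)

    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

    test = ["Sneha", "writes", "code"]
    print(clf.predict(vec.transform([features(test, i) for i in range(len(test))])))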
12. Explain the Hobbs algorithm for pronoun resolution.

13. Write a note on WordNet.


WordNet is a lexical database of the English language, organized in a hierarchical
structure of synsets, which are sets of synonyms representing distinct concepts. It links
words based on their meanings, providing a rich resource for natural language
understanding. Each synset contains words that can be interchangeably used in
certain contexts, aiding tasks like semantic analysis, information retrieval, and
machine learning. WordNet's extensive coverage and detailed relationships between
words make it a valuable tool in various fields, from linguistics to artificial intelligence,
enabling deeper insights into the complexities of language. For instance, in WordNet,
the synset {car, automobile, motorcar} indicates that these words are synonymous,
and "car" is a hyponym of "vehicle".

14. Explain how HMM is used for sequence labelling


Hidden Markov Models (HMMs) are used for sequence labeling tasks in natural
language processing (NLP) and other fields where sequences of data need to be
classified or annotated. Here's an explanation of how HMMs are used for sequence
labeling:

Defining States and Observations:

In the context of sequence labeling, the states represent the different labels or tags,
while the observations correspond to the elements in the sequence that we want to
label (e.g., words in a sentence).
Transition Probabilities:
For each pair of adjacent states, define the transition probabilities. These probabilities
represent the likelihood of transitioning from one state to another. In sequence
labeling, they capture the likelihood of transitioning from one label to another.
Emission Probabilities:

Define the probabilities of emitting each observation from each state. These
probabilities represent the likelihood of observing a particular element given a
specific label. In NLP, this is often related to word-tag probabilities.
Initialization Probabilities:

Specify the initial probabilities of starting in each state. This represents the likelihood
of starting the sequence with a particular label.
The Forward Algorithm:

Given a sequence of observations (e.g., a sentence), use the forward algorithm to compute the
likelihood of the sequence occurring under the model. This involves recursively calculating
probabilities while considering all possible state sequences.
The Viterbi Algorithm:

Use the Viterbi algorithm to find the most likely sequence of states given the
observations. This algorithm efficiently finds the best sequence by considering both
the transition probabilities and the emission probabilities.
Decoding:

Once the Viterbi algorithm has been applied, the HMM produces a sequence of labels
that best matches the input sequence.
Output:

The output of the HMM is the sequence of labels that correspond to the input
sequence. These labels provide the desired annotation or classification for each
element in the input sequence.
In NLP, HMMs have been used for various sequence labeling tasks, including part-of-
speech tagging, named entity recognition, and chunking. They are particularly
effective when there are strong dependencies between adjacent elements in a
sequence, making them a valuable tool for understanding and processing natural
language text.
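A compact sketch of the forward step described above is given below; the probability tables would be supplied in the same dictionary format as in the Viterbi sketch under question 9, and are placeholders rather than values from any real tagger:

    # Forward algorithm: total probability of an observation sequence under the
    # HMM, summing over all possible hidden state sequences.
    def forward(observations, states, start_p, trans_p, emit_p):
        alpha = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
        for t in range(1, len(observations)):
            alpha.append({
                s: emit_p[s].get(observations[t], 0.0)
                   * sum(alpha[t - 1][p] * trans_p[p][s] for p in states)
                for s in states
            })
        return sum(alpha[-1][s] for s in states)   # likelihood of the whole sequence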
15. Construct a parse tree for the following CFG using the given rules:
a. "The man read the book"
16. Explain Discourse reference resolution
Discourse Reference Resolution, also known as anaphora resolution, is a crucial task
in natural language processing (NLP) that involves identifying the entities or
expressions to which pronouns, definite noun phrases, or other referring expressions
refer within a text or conversation.
Consider this example: John went to the store. He bought a book.
In this example, "He" refers to "John", and "a book" refers to a book that John bought.
Discourse Reference Resolution involves identifying reference phrases, determining
potential antecedents, scoring antecedents, selecting the best antecedent, forming
coreference chains, and handling ambiguity.
Discourse Reference Resolution is essential for understanding the flow of information
in a text, especially in more complex documents or dialogues. It is applied in a wide
range of NLP applications, including machine translation, text summarization,
question answering, and more. Accurate resolution of references greatly enhances the
ability of machines to understand and generate coherent and contextually appropriate
text.

17. What do you mean by word sense disambiguation (WSD)? Explain machine learning-based
methods.

Word Sense Disambiguation (WSD) is a natural language processing task that aims to determine the
correct meaning or sense of a word within a given context. Many words in natural language have
multiple meanings (polysemy), and identifying the correct sense is crucial for tasks like machine
translation, information retrieval, and sentiment analysis.
Machine learning-based methods for WSD leverage algorithms that learn patterns from annotated
data to make predictions about word senses. Here's an explanation of how machine learning is applied
to WSD:
1. Feature Extraction: Start by representing words and their context in a way that can be used
for machine learning. This often involves creating feature vectors where each dimension
corresponds to a specific linguistic feature (e.g., surrounding words, part-of-speech tags,
syntactic structures).
2. Annotated Data: Gather a dataset of annotated examples where each instance includes a
word, its context, and the correct sense label. For instance, in the sentence "I saw a bat",
the word "bat" could be labeled with the sense referring either to a flying mammal or to a
piece of sports equipment, depending on the surrounding context.
3. Training Phase: Use the annotated data to train a machine learning model. Common
algorithms used for WSD include Support Vector Machines (SVMs), Decision Trees, Random
Forests, and Neural Networks.
4. Feature Selection: Identify the most relevant features that contribute to distinguishing
between different word senses. This can involve techniques like Information Gain or L1
Regularization.
5. Model Training: The machine learning algorithm learns the relationships between the
features and the correct word senses in the training data.
6. Prediction: Given a new instance with a word and its context, the trained model predicts the
most likely sense for that word in that context.
7. Evaluation: The performance of the WSD system is assessed using evaluation metrics like
accuracy, precision, recall, and F1-score on a separate test set.
8. Fine-tuning and Optimization: Depending on the results, the model may be fine-tuned, or
different algorithms or features may be explored to improve performance.
9. Application: The trained model can be used for disambiguating words in new, unseen texts.
This is particularly useful in various NLP applications, such as machine translation, information
retrieval, and sentiment analysis.
Machine learning-based approaches for WSD can be effective, especially when they are trained on
large and diverse datasets. However, they often require substantial amounts of annotated data for
training, and the quality of the features used can greatly impact the performance of the model.
Additionally, pre-trained word embeddings or contextual embeddings from models like BERT have
shown promise in improving WSD performance.
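A toy sketch of the supervised pipeline above for the word "bat" follows; the four training sentences, the TF-IDF context features, and the linear SVM are illustrative assumptions (scikit-learn is assumed to be installed):

    # Tiny supervised WSD example: context sentences as features, senses as labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    contexts = [
        "the bat flew out of the cave at night",           # annotated training data
        "a vampire bat feeds on insects and blood",
        "he swung the bat and hit a home run",
        "the cricket bat is made of willow",
    ]
    senses = ["bat/animal", "bat/animal", "bat/sports", "bat/sports"]

    model = make_pipeline(TfidfVectorizer(), LinearSVC())   # feature extraction + classifier
    model.fit(contexts, senses)                             # training phase

    # Prediction on an unseen context; likely 'bat/sports', since "hit" only
    # occurs in sports contexts in this tiny training set.
    print(model.predict(["the bat hit the ball over the fence"]))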
