NLP ANONYMOUS QB Ans
UNIT 3
1. Discuss the following NLP techniques.
a. TFIDF
b. Word2Vec
c. Doc2Vec
2. What is smoothing? Explain in detail.
3. Write a short note on the Hidden Markov Model.
4. Explain generative models of language.
5. What is the N-gram model? Explain the unigram, bigram, and trigram models.
6. Write a short note on:
a. Word Embeddings/Vector Semantics
b. BERT
c. Graph based model
UNIT 4
1. Define Information Retrieval. Explain IR process with suitable diagram.
2. Explain Term-Weighting with suitable example.
3. What is Relation extraction?
4. Explain classes of algorithms for relation extraction.
5. What are referring expressions? Explain each with a suitable example.
6. What are the approaches to cross-lingual information retrieval?
UNIT 5
1. What is the significance of the TextBlob library? Explain with an example.
2. Write a note on each one of the NLP libraries: NLTK, SpaCy, TextBlob, Gensim
3. Which types of tasks are performed by the Gensim library? Give one example.
4. What are the different lexical knowledge networks? Explain Indo-WordNet and
VerbNet with examples.
5. Explain Lesk Algorithm with example.
UNIT 6
1. Explain statistical machine translation.
2. Explain stages of Natural Language Generation.
3. List and discuss various commercial and open-source NLG tools.
4. How does discourse processing uncover linguistic structures from texts at several levels?
5. Explain Cross Lingual machine translation (XLM) model.
1. Discuss the following NLP techniques.
a. TFIDF:
TFIDF stands for Term Frequency-Inverse Document Frequency. It is a technique used to
quantify the importance of a word in a document or a corpus. TFIDF takes into account both
the frequency of a term in a document and its rarity in the entire corpus.
The formula to calculate TFIDF is:
TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
TF(t, d) represents the term frequency of term 't' in document 'd', while IDF(t, D)
represents the inverse document frequency of term 't' in the entire corpus 'D'.
For example, suppose we have a corpus of three documents: Document 1, Document 2, and Document 3, and we want the TFIDF score of the term 'apple' in Document 2. If 'apple' appears 5 times in Document 2, its term frequency is high; but 'apple' also appears in Document 1 (2 times) and Document 3 (3 times), so it occurs in every document of the corpus. With the common definition IDF(t, D) = log(N / df(t)), its IDF is log(3/3) = 0, so the resulting TFIDF score is low, reflecting that a term found everywhere carries little discriminative weight.
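The following is a minimal Python sketch of the TFIDF formula above; the toy three-document corpus and the log-based IDF variant are illustrative assumptions.

import math

docs = {
    "doc1": "apple banana apple".split(),
    "doc2": "apple apple apple apple apple mango".split(),
    "doc3": "apple grape banana".split(),
}

def tf(term, tokens):
    # term frequency: how often the term occurs in this document
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # inverse document frequency: rare terms get higher weight
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / df)

def tfidf(term, doc_id, corpus):
    return tf(term, corpus[doc_id]) * idf(term, corpus)

print(tfidf("apple", "doc2", docs))  # 0.0: 'apple' occurs in every document
print(tfidf("mango", "doc2", docs))  # higher: 'mango' is rare in the corpus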
b. Word2Vec:
Word2Vec is a technique used to generate distributed representations (word embeddings)
of words based on their context in a large corpus of text. It uses a neural network model to
learn the associations between words. Word2Vec represents each distinct word with a
vector of numbers, where words with similar meanings are located closer to each other in
the vector space. This allows us to capture the semantic and syntactic qualities of words.
For example, if we have a Word2Vec model trained on a large corpus of news articles, we
can use it to find similar words or complete partial sentences. If we input the word "king"
into the model, it might output words like "queen," "royal," and "throne" as words that are
semantically similar to "king."
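As a minimal sketch, the Gensim library (assumed installed) can train a small Word2Vec model; note that meaningful nearest neighbours require a much larger corpus than this toy one.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "king", "and", "queen", "sit", "on", "the", "throne"],
]

# sg=1 selects the skip-gram architecture; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

print(model.wv["king"][:5])                    # first few dimensions of the 'king' vector
print(model.wv.most_similar("king", topn=3))   # nearest words in the vector space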
c. Doc2Vec:
Doc2Vec is an extension of Word2Vec that generates distributed representations of
variable-length pieces of text, such as sentences, paragraphs, or entire documents. It allows
us to estimate the semantic meanings for different sections of text and capture the
relationships between them. Doc2Vec models can be trained on a corpus of documents and
used to infer document embeddings for new, unseen documents.
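A minimal sketch of Doc2Vec with Gensim (assumed installed); the tagged toy documents are illustrative only.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document is wrapped in a TaggedDocument with a unique tag.
corpus = [
    TaggedDocument(words=["machine", "translation", "converts", "text"], tags=["d0"]),
    TaggedDocument(words=["word", "embeddings", "capture", "meaning"], tags=["d1"]),
    TaggedDocument(words=["neural", "networks", "learn", "representations"], tags=["d2"]),
]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer an embedding for a new, unseen document.
new_vector = model.infer_vector(["translation", "with", "neural", "networks"])
print(new_vector[:5])
print(model.dv.most_similar([new_vector], topn=2))  # closest training documents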
2. What is smoothing? Explain in detail.
6. Write a short note on:
b. BERT:
BERT (Bidirectional Encoder Representations from Transformers) is a language model
introduced by Google in 2018. It has had a significant impact on various natural language
processing tasks, including sentiment analysis, text classification, and question answering.
BERT is based on the transformer architecture and is designed to capture bidirectional
contextual information from text data.
Unlike traditional language models that process text in a left-to-right or right-to-left manner,
BERT uses a masked language model (MLM) and next sentence prediction (NSP) tasks during
pre-training to learn contextualized word representations. The MLM randomly masks some
words in a sentence, and BERT learns to predict those masked words based on the
surrounding context. The NSP task involves predicting whether two sentences appear
consecutively in a document.
After pre-training, BERT can be fine-tuned on specific downstream tasks by adding task-
specific layers on top of the pre-trained model. This allows BERT to adapt to specific NLP
tasks with fewer training examples.
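As a minimal sketch, the Hugging Face transformers library (assumed installed, with the bert-base-uncased weights downloadable) exposes the masked-language-model behaviour described above.

from transformers import pipeline

# BERT predicts the masked token from both its left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The doctor prescribed some [MASK] for the infection."):
    print(prediction["token_str"], round(prediction["score"], 3))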
c. Graph-based Model:
Graph-based models in natural language processing leverage graph theory to represent and
analyze textual data. They represent words or entities as nodes and the relationships
between them as edges in a graph structure. Graph-based models are particularly useful for
tasks that involve understanding semantic relationships and capturing the global structure of
text
One popular graph-based model is the Graph Convolutional Network (GCN), which applies
convolutional operations on graph-structured data. GCNs can capture the local and global
dependencies between nodes in a graph and learn node representations based on their
neighbors. These representations can be used for various tasks, such as document
classification, entity recognition, and relation extraction.
Graph-based models can also incorporate external knowledge graphs, such as WordNet or
ConceptNet, to enhance semantic understanding and enable reasoning about concepts and
relationships between entities.
Overall, graph-based models provide a flexible framework for analyzing text data, capturing
relationships, and leveraging external knowledge sources to enhance the performance of
various natural language processing tasks.
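As a simple illustration of the graph representation (not a GCN), the sketch below uses the networkx library (assumed installed) to build a word co-occurrence graph, with words as nodes and shared-sentence occurrence as edges.

import itertools
import networkx as nx

sentences = [
    ["graph", "models", "represent", "words", "as", "nodes"],
    ["edges", "represent", "relationships", "between", "words"],
]

# Connect every pair of distinct words that co-occur in a sentence.
G = nx.Graph()
for sentence in sentences:
    for w1, w2 in itertools.combinations(set(sentence), 2):
        G.add_edge(w1, w2)

# Degree centrality gives a rough importance score for each word node.
ranking = sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])
print(ranking[:5])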
1. Define Information Retrieval. Explain IR process with a suitable diagram.
Information Retrieval (IR) is the process of retrieving relevant information from a large
collection of unstructured or semi-structured data, typically textual documents. It involves
searching and retrieving documents that are most relevant to a user's query or information
need.
The IR process typically involves the following steps:
1. Document Collection: A collection of documents is gathered, which can include web
pages, articles, books, or any other textual sources.
2. Indexing: The documents are processed to create an index, which is a data structure that
enables efficient and quick retrieval of documents based on their content. The index stores
the terms found in the documents and their corresponding locations.
3. Query Processing: When a user submits a query, it is analyzed and processed to
understand the user's information need. This involves tasks like query expansion, where
additional terms are added to the query to improve retrieval accuracy.
4. Ranking: The indexed documents are ranked based on their relevance to the query.
Various ranking algorithms, such as TF-IDF (Term Frequency-Inverse Document Frequency)
or BM25 (Best Matching 25), are used to calculate the relevance scores.
5. Retrieval: The top-ranked documents are retrieved and presented to the user as search
results. The user can then browse through the results to find the desired information.
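A minimal sketch of the indexing, query processing, and ranking steps using scikit-learn's TF-IDF vectorizer and cosine similarity; the documents and query are toy examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Information retrieval finds relevant documents for a user query.",
    "Apples and oranges are popular fruits.",
    "An inverted index enables fast retrieval of documents.",
]
query = "document retrieval"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)   # indexing
query_vector = vectorizer.transform([query])        # query processing
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Ranking and retrieval: present documents in order of relevance.
for idx in scores.argsort()[::-1]:
    print(round(float(scores[idx]), 3), documents[idx])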
4. Explain classes of algorithms for relation extraction.
There are different classes of algorithms used for relation extraction. Here are three commonly used classes:
a. Rule-based Approaches: Rule-based algorithms rely on handcrafted patterns or rules to identify relationships between entities. These rules are designed based on linguistic patterns, syntactic structures, or domain-specific knowledge. For example, a rule could be defined to extract the relationship "is the CEO of" by looking for patterns like "X is the CEO of Y" (a minimal code sketch of this idea appears after this answer). Rule-based approaches require manual rule creation, which can be time-consuming and may not cover all possible variations in the text.
b. Supervised Machine Learning Approaches: Supervised machine learning algorithms learn
from labeled training data to automatically classify the relationship between entities. The
training data consists of annotated sentences where the relationships are pre-defined.
Features such as lexical, syntactic, or contextual information are extracted from the text, and
a machine learning model, such as a support vector machine (SVM) or a neural network, is
trained to predict the relationship. These approaches require a large amount of labeled data
and may suffer from limitations if the training data does not cover the full range of
relationships.
c. Distant Supervision and Bootstrapping Approaches: Distant supervision methods leverage
existing knowledge bases or ontologies to automatically generate labeled training data. The
idea is to use the relationships present in the knowledge bases as supervision signals and
align them with the text. These methods use heuristics to identify sentences that mention
the entities of interest and assume that if a sentence mentions the entities, it implies a
relationship between them. Bootstrapping methods iteratively improve the initial set of
extracted relations by using the initially extracted relations to find additional instances.
These approaches help in reducing the dependency on manually labeled training data but
can still be sensitive to noise in the knowledge bases.
These classes of algorithms can be used individually or in combination depending on the
specific requirements and available resources for relation extraction tasks. Each approach
has its strengths and limitations, and the choice of algorithm depends on factors such as the
complexity of relationships, availability of training data, and domain-specific knowledge.
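As noted above, here is a minimal sketch of the rule-based approach from (a), using an illustrative regular expression and toy sentences.

import re

# Illustrative pattern for "X is the CEO of Y" style relations.
pattern = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) is the CEO of (?P<org>[A-Z][A-Za-z]+)"
)

sentences = [
    "Sundar Pichai is the CEO of Google.",
    "The weather in Pune is pleasant today.",
]

for sentence in sentences:
    match = pattern.search(sentence)
    if match:
        # Emit a (subject, relation, object) triple.
        print((match.group("person"), "is_CEO_of", match.group("org")))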
5. Referring Expressions:
Referring expressions are linguistic expressions used in language to refer to specific entities
or objects. They are used to identify and distinguish entities within a given context. Referring
expressions play a crucial role in communication, as they allow speakers and listeners to
understand which entities are being referred to. There are different types of referring
expressions:
a. Definite Noun Phrases: Definite noun phrases refer to specific entities that are already
known or mentioned in the discourse. They are typically preceded by definite articles such
as "the." For example: "The cat is sleeping on the mat."
b. Indefinite Noun Phrases: Indefinite noun phrases refer to unspecified entities or objects.
They are typically preceded by indefinite articles such as "a" or "an." For example: "I saw a
bird in the tree."
c. Pronouns: Pronouns are referring expressions that replace noun phrases and refer back to
previously mentioned entities. They help avoid repetition and maintain coherence in the
discourse. For example: "John is hungry. He wants to eat."
d. Demonstratives: Demonstratives are words like "this," "that," "these," and "those" that
indicate the proximity or distance of an entity from the speaker or the listener. They point to
specific entities in the context. For example: "This book is interesting."
e. Proper Nouns: Proper nouns are names of specific individuals, places, or organizations.
They are used as referring expressions to refer to those specific entities. For example: "John
went to Paris."
The choice of referring expression depends on the context, discourse, and the level of
familiarity of the referred entity to the speaker and listener.
1. What is the significance of the TextBlob library? Explain with an example.
The TextBlob library is a powerful tool for natural language processing (NLP) tasks. It is built on top of the popular NLTK (Natural Language Toolkit) library and provides a simplified API for common NLP tasks. Here are some significant uses of the TextBlob library:
1. TextBlob simplifies text processing: TextBlob makes it easier to perform common text
processing operations, such as tokenization, part-of-speech tagging, noun phrase extraction,
and sentiment analysis. It abstracts away the complexities of these tasks and provides a
simple interface to work with.
2. Sentiment analysis: TextBlob offers built-in sentiment analysis capabilities, allowing you to
determine the sentiment polarity (positive, negative, or neutral) and subjectivity
(opinionated vs. objective) of a given text. This is useful for tasks like sentiment analysis of
customer reviews, social media sentiment analysis, and opinion mining.
3. Noun phrase extraction: TextBlob can extract noun phrases from text, which are useful for
identifying key entities or topics within a document. This feature enables tasks such as
keyword extraction, topic modeling, and information retrieval.
4. Language translation: TextBlob has historically supported language translation through integration with the Google Translate API, allowing text to be translated between different languages; note that this translation feature has been deprecated in recent TextBlob releases.
5. Spelling correction: TextBlob provides functionality for spelling correction, allowing you to
automatically correct misspelled words in text. This is beneficial for tasks like text
normalization, spell-checking, and data cleaning.
6. Part-of-speech tagging: TextBlob performs part-of-speech tagging, which assigns
grammatical tags (noun, verb, adjective, etc.) to words in a given text. This information is
valuable for tasks such as syntactic analysis, named entity recognition, and text
understanding.
7. Easy integration with other libraries: TextBlob seamlessly integrates with other Python
libraries, including NLTK, allowing you to leverage additional NLP functionalities. It provides a
convenient and efficient way to combine various NLP tools and techniques in your projects.
The significance of the TextBlob library lies in its simplicity, ease of use, and the wide range
of NLP tasks it supports. It is particularly useful for beginners or users who need to quickly
perform common NLP operations without diving into the intricacies of implementing them
from scratch.
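A minimal sketch demonstrating several of the features listed above (assumes TextBlob is installed and its corpora have been fetched with python -m textblob.download_corpora; the sample sentences are illustrative).

from textblob import TextBlob

blob = TextBlob("TextBlob makes common NLP tasks surprisingly simple and pleasant.")

print(blob.words)                    # tokenization
print(blob.tags)                     # part-of-speech tags
print(blob.noun_phrases)             # noun phrase extraction
print(blob.sentiment)                # polarity in [-1, 1], subjectivity in [0, 1]
print(TextBlob("I havv goood speling").correct())   # spelling correction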
2. Write a note on each one of the NLP libraries: NLTK, SpaCy, TextBlob, Gensim
NLP Libraries:
a. NLTK (Natural Language Toolkit): NLTK is one of the most popular libraries for NLP tasks in
Python. It provides a wide range of tools and resources for tasks such as tokenization,
stemming, lemmatization, part-of-speech tagging, parsing, and more. NLTK is widely used for
research, education, and development of NLP applications.
b. SpaCy: SpaCy is a modern and efficient NLP library that focuses on performance and
production use cases. It provides high-performance, pre-trained models for tasks such as
tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and
more. SpaCy is known for its speed and memory efficiency, making it suitable for large-scale
NLP applications.
c. TextBlob: As discussed earlier, TextBlob is a simple and beginner-friendly NLP library built
on top of NLTK. It provides an easy-to-use API for common NLP tasks such as tokenization,
part-of-speech tagging, sentiment analysis, noun phrase extraction, and more. TextBlob is
often preferred for quick prototyping and smaller NLP tasks due to its simplicity.
d. Gensim: Gensim is a library for topic modeling and document similarity analysis. It
provides efficient implementations of algorithms such as Latent Semantic Analysis (LSA),
Latent Dirichlet Allocation (LDA), and word2vec. Gensim allows you to train topic models on
large corpora and perform document similarity analysis, making it suitable for tasks such as
document clustering and information retrieval.
Each of these libraries has its own strengths and focuses on different aspects of NLP. The
choice of library depends on the specific requirements of the NLP task, the available
resources, and the expertise of the user.
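As an illustration of SpaCy's pipeline, here is a minimal sketch assuming spacy and its small English model en_core_web_sm are installed (python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

print([(token.text, token.pos_) for token in doc])                    # tokens with POS tags
print([(ent.text, ent.label_) for ent in doc.ents])                   # named entities
print([(token.text, token.dep_, token.head.text) for token in doc])   # dependency arcs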
3. Which types of tasks are performed by the Gensim library? Give one example.
The Gensim library supports the following types of tasks:
- Topic Modeling: Gensim allows you to build topic models from a collection of documents. It implements popular algorithms like Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to discover latent topics in the text data.
- Document Similarity: Gensim provides methods to calculate the similarity between
documents based on their content. It uses techniques like cosine similarity and word
embeddings to measure the similarity between text documents.
- Word Embeddings: Gensim supports the training and use of word embeddings, such as
Word2Vec and FastText. Word embeddings are dense vector representations of words that
capture semantic relationships and can be used for tasks like word similarity, word analogy,
and word clustering.
- Text Preprocessing: Gensim offers utilities for text preprocessing, including tokenization,
stop word removal, and stemming. These preprocessing steps are essential for preparing
text data before applying topic modeling or similarity analysis.
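As one example, here is a minimal topic-modeling sketch with Gensim's LdaModel; the tiny pre-tokenized corpus is illustrative only, since real topic models need far more text.

from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["bank", "loan", "interest", "money"],
    ["river", "bank", "water", "fishing"],
    ["loan", "credit", "bank", "finance"],
]

dictionary = corpora.Dictionary(texts)                  # map words to integer ids
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words representation

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=42)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)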
4. What are the different lexical knowledge networks? Explain Indo-WordNet and
VerbNet with examples.
Lexical knowledge networks are resources that organize lexical information and relationships
between words. They capture semantic connections and provide a structured representation
of word meanings. Two popular lexical knowledge networks are Indo-WordNet and VerbNet.
- Indo-WordNet: Indo-WordNet is a multilingual lexical knowledge network that links the WordNets of Indian languages such as Hindi, Marathi, and Bengali. Like the English WordNet, it organizes words into synsets (sets of synonymous words) and connects synsets through semantic relations such as hypernymy, hyponymy, and meronymy.
Example:
In Indo-WordNet, the word "पक्षी" (pakshi) belongs to the synset "पंछी" (panchi), which represents the concept of "bird." This synset includes words like "गरुड़" (garud) and "चिड़िया" (chidiya), which are synonyms of "पक्षी." Additionally, the synset "पंछी" has hypernyms like "जीव" (jiv) meaning "living being" and "जन्तु" (jantu) meaning "animal."
- VerbNet: VerbNet is a lexical knowledge network that focuses specifically on verbs in the
English language. It provides information about verb classes, syntactic patterns, and
semantic roles associated with verbs. VerbNet captures the hierarchical structure of verb
classes, which helps in understanding the behavior and usage of different verbs.
Example:
In VerbNet, the verb "run" belongs to the class "Motion," which represents verbs related to
movement. The class "Motion" has subclasses like "Walk," "Jump," and "Sprint." Each
subclass defines specific syntactic patterns and semantic roles associated with the verbs in
that class. For instance, the verb "run" in the "Motion" class is associated with roles like
"Agent," "Destination," and "Path," indicating the entity performing the action, the target
destination, and the path of movement.
Lexical knowledge networks like Indo-WordNet and VerbNet provide valuable resources for
natural language processing tasks, enabling semantic analysis, word sense disambiguation,
and knowledge representation.
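As a minimal sketch, the same synset/hypernym structure can be explored for English through NLTK's WordNet interface (assumes nltk is installed and the wordnet corpus has been fetched with nltk.download("wordnet")); Indo-WordNet organizes Indian-language words in the same way.

from nltk.corpus import wordnet as wn

bird = wn.synsets("bird")[0]                 # first sense of "bird"
print(bird.name(), "-", bird.definition())
print("Synonyms:", [lemma.name() for lemma in bird.lemmas()])
print("Hypernyms:", [h.name() for h in bird.hypernyms()])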
5. Explain Lesk Algorithm with example.
The Lesk Algorithm is a classic word sense disambiguation technique: it selects the sense of an ambiguous word whose dictionary definition shares the most words with the word's surrounding context. It proceeds as follows:
1. Gather the context: Obtain the target word and its surrounding words in the given sentence or text. This context is crucial in understanding the intended sense of the word.
2. Retrieve word senses: Retrieve the definitions or senses of the target word from a lexical
resource such as a dictionary or WordNet. Each sense represents a different meaning or
usage of the word.
3. Calculate overlap: For each sense of the target word, calculate the overlap between the
context words and the words in the sense definition. The overlap is determined by counting
the number of common words between the context and the sense definition.
4. Select the most appropriate sense: Identify the sense with the highest overlap score. This
sense is considered the most contextually relevant to the target word in the given context.
Here's an example illustrating the Lesk Algorithm for the ambiguous word "bank" in a sentence such as "The bank refused to grant the loan":
1. Gather the context: Consider the words surrounding the target word "bank": "refused," "grant," "loan."
2. Retrieve word senses: Retrieve the senses of the word "bank" from a lexical resource like
WordNet. Let's say there are two senses:
a. Sense 1: "Financial institution"
b. Sense 2: "Riverside or lakeside area"
3. Calculate overlap: Compare each sense definition with the context words:
a. Overlap score for Sense 1: "Financial institution" has an overlap of 1 with "loan."
b. Overlap score for Sense 2: "Riverside or lakeside area" has no overlap with the context
words.
4. Select the most appropriate sense: Since Sense 1 has a higher overlap score (1), it is
considered the most suitable sense in this context. Therefore, the intended meaning of
"bank" in the given sentence is a "financial institution."
The Lesk Algorithm helps disambiguate words with multiple senses by analyzing the context
and selecting the sense that best aligns with the surrounding words. It is a useful technique
in natural language processing tasks where word sense disambiguation is required, such as
machine translation, information retrieval, and text summarization.
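A minimal sketch of simplified Lesk using WordNet glosses via NLTK (assumes the wordnet corpus is downloaded; NLTK also ships a ready-made implementation in nltk.wsd.lesk). Because overlaps are computed against short dictionary glosses, the selected sense depends heavily on the gloss wording.

from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_words):
    # Step 1: the context is the set of surrounding words.
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    # Step 2: retrieve every WordNet sense of the target word.
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        # Step 3: count words shared by the gloss and the context.
        overlap = len(gloss & context)
        # Step 4: keep the sense with the largest overlap.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("bank", ["refused", "to", "grant", "the", "loan"])
print(sense.name(), "-", sense.definition())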
1. Explain statistical machine translation.
Statistical Machine Translation (SMT) is a type of machine translation approach that relies on
statistical models and algorithms to translate text from one language to another. It uses a
large amount of parallel or aligned bilingual corpora to learn the translation patterns and
probabilities. Here's how SMT works:
a. Corpus alignment: A parallel corpus containing source language and target language
sentences is aligned at the sentence or phrase level. Each sentence or phrase in the source
language is paired with its translation in the target language.
b. Training phase: The aligned corpus is used to extract statistical information such as word
frequencies, translation probabilities, and language models. Various techniques like word
alignment models (e.g., IBM Models), phrase-based models, or more advanced neural
network models are used to estimate these probabilities.
c. Translation process: During the translation phase, the source language text is analyzed
and broken down into units such as words or phrases. The statistical models are then used
to generate the most probable translations based on the learned probabilities. The
translation output is a target language text that conveys the meaning of the source text.
d. Decoding and reordering: In the decoding stage, the system selects the best translation
option based on the calculated probabilities. It also handles word reordering, as the word
order may differ between languages.
e. Evaluation and refinement: The translated output is evaluated using metrics such as
BLEU (Bilingual Evaluation Understudy) to measure its quality and compare it to reference
translations. The system can be further refined by iteratively training on larger and more
diverse parallel corpora or by incorporating additional linguistic and contextual information.
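A toy sketch of the training phase described above: translation probabilities estimated by relative frequency from hand-aligned word pairs (the pairs themselves are illustrative assumptions, not a real parallel corpus).

from collections import Counter, defaultdict

# (source word, target word) pairs taken from a hand-aligned toy parallel corpus.
aligned_pairs = [
    ("the", "la"), ("the", "le"), ("house", "maison"),
    ("house", "maison"), ("blue", "bleu"), ("blue", "bleue"),
]

pair_counts = Counter(aligned_pairs)
source_counts = Counter(src for src, _ in aligned_pairs)

# t(target | source) = count(source, target) / count(source)
translation_prob = defaultdict(dict)
for (src, tgt), count in pair_counts.items():
    translation_prob[src][tgt] = count / source_counts[src]

print(translation_prob["the"])    # {'la': 0.5, 'le': 0.5}
print(translation_prob["house"])  # {'maison': 1.0}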
2. Explain stages of Natural Language Generation.
Natural Language Generation (NLG) typically proceeds through the following stages:
a. Content determination: The system decides which pieces of information from the input data should be included in the generated text.
b. Document structuring: The selected content is structured into a coherent format. This includes deciding on the appropriate sections, paragraphs, headings, and subheadings to present the information effectively.
c. Sentence planning: In this stage, the system generates the high-level structure of
sentences. It involves determining the order and type of sentences, as well as specifying the
main ideas and relationships between them.
d. Lexicalization: The system selects the appropriate words and phrases to express the
intended meaning. It takes into account factors such as style, tone, and the target audience.
This stage involves applying grammar rules, vocabulary selection, and potentially
incorporating pre-defined templates or language patterns.
f. Aggregation and realization: The generated sentences and phrases are combined and
realized into a coherent and fluent piece of text. This stage involves ensuring grammatical
correctness, appropriate punctuation, and overall readability of the generated output.
g. Post-processing: Additional steps may be taken to enhance the quality of the generated
text. This can include proofreading, grammar checking, and applying stylistic modifications
to make the text more engaging and natural.
The stages of NLG are designed to transform structured data or information into human-
readable language, enabling the automated generation of reports, summaries, product
descriptions, and other types of textual content.
3. List and discuss various commercial and open-source NLG tools.
Various commercial and open-source Natural Language Generation (NLG) tools are available
for automating the generation of human-like text. Here are some examples:
4. How does discourse processing uncover linguistic structures from texts at several levels?
Discourse processing aims to uncover linguistic structures and relationships within texts at
multiple levels to understand the discourse or conversation flow. It involves analyzing the
organization, coherence, and cohesion of sentences and paragraphs to capture the intended
meaning. Here's how discourse processing uncovers linguistic structures:
a. Discourse segmentation: The text is segmented into coherent and meaningful discourse
units, such as paragraphs or sections, based on breaks in topic or theme.
b. Discourse parsing: The parsed discourse structure represents the relationships between
sentences or clauses within a discourse. It captures connections like elaboration, contrast,
causality, or temporal ordering.
e. Cohesion analysis: Cohesion analysis deals with the linguistic devices that establish
connections within a text, such as lexical repetition, reference, conjunctions, and discourse
markers. It ensures smooth transitions and clarifies relationships between sentences.
5. Explain Cross Lingual machine translation (XLM) model.
The Cross-lingual Language Model (XLM) approach trains a single model on text from many languages so that knowledge learned in one language can transfer to others. Its main components in a translation setting are:
a. Multilingual training: The XLM model is trained on a large corpus of parallel text data from multiple languages. The training includes both the source language and target language sentences aligned at the sentence or phrase level.
b. Shared encoding: The model employs shared encoders that can encode and represent
the input text in a language-agnostic way. This allows the model to capture the common
linguistic features and relationships across multiple languages.
c. Language-specific decoders: The XLM model also includes language-specific decoders
that generate the translated output in the target language. Each decoder is trained to handle
the specific characteristics and nuances of a particular language.
d. Cross-lingual transfer: Through the shared encoding, the model can transfer knowledge
and learnings from one language to another, improving the translation quality for low-
resource languages or language pairs with limited training data.
e. Fine-tuning and adaptation: The XLM model can be further fine-tuned on specific
language pairs or domains to improve translation accuracy and handle domain-specific
terminology.
The XLM model has shown promising results in cross-lingual machine translation by
leveraging shared representations and knowledge across multiple languages, enabling more
effective and accurate translation between diverse language pairs.
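As a minimal sketch of the shared, language-agnostic encoding idea, the Hugging Face transformers library (assumed installed) can load the related cross-lingual model xlm-roberta-base, chosen here purely for illustration; it uses <mask> as its mask token.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# The same model handles masked-word prediction in different languages,
# illustrating a shared multilingual representation.
print(fill_mask("The capital of France is <mask>.")[0]["token_str"])
print(fill_mask("La capitale de la France est <mask>.")[0]["token_str"])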