Natural Language Processing: All Questions
Module 1
2. With examples, explain the different types of NER attributes. (10 Marks)
Person (PER): Refers to the identification of people's names, such as "Barack Obama" or
"Albert Einstein."
Organization (ORG): Identifies names of companies, institutions, and other
organizations. For example, "Google," "Apple," and "United Nations."
Location (LOC): Refers to places such as countries, cities, or landmarks. Example:
"New York," "India," "Mount Everest."
Date/Time (DATE/TIME): Identifies expressions of time, such as "Monday," "April
5th," or "2022."
Money (MONEY): Refers to monetary amounts, like "$500," "₹1000," or "£200."
Percentage (PERCENT): Identifies percentages, e.g., "5%" or "20 percent."
Miscellaneous (MISC): Other entities that do not fit into the above categories, like event
names or product names. Example: "World Cup," "iPhone."
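A minimal sketch of NER in Python, assuming spaCy and its en_core_web_sm model are installed (note that spaCy's label names differ slightly from the list above, e.g., GPE for geopolitical locations):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama visited Google in New York on April 5th and spent $500.")

# Each detected entity exposes its text span and a label such as PERSON, ORG, GPE, DATE, MONEY.
for ent in doc.ents:
    print(ent.text, ent.label_)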
3. What do you understand about Natural Language Processing? (2 Marks)
Stop words are commonly occurring words such as "the," "is," "in," and "and," which are
usually filtered out in text processing because they do not carry significant meaning for
analysis.
Machine Translation: Translating text from one language to another, like Google
Translate.
Sentiment Analysis: Analyzing social media posts, reviews, or customer feedback to
determine sentiment (positive, negative, neutral).
Precision: Measures the accuracy of the retrieved results. It is the ratio of relevant
documents retrieved to the total number of documents retrieved. Formula:
Precision = True Positives / (True Positives + False Positives)
Recall: Measures how many relevant documents were retrieved out of all the relevant
documents available. Formula:
Recall = True Positives / (True Positives + False Negatives)
Difference: Precision focuses on minimizing false positives, while recall focuses on
minimizing false negatives. An ideal system balances both.
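A small illustrative computation of the two measures in Python, using hypothetical counts:
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Suppose a system retrieves 8 documents, 6 of which are relevant (tp=6, fp=2),
# while 4 relevant documents were not retrieved (fn=4).
tp, fp, fn = 6, 2, 4
print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.6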
Multi Word Tokenization is the process of breaking a text into meaningful multi-word
units (e.g., "New York," "United States") rather than just individual words.
Stems are the base or root forms of words obtained by removing prefixes and suffixes.
For example, "running" becomes "run," and "studies" becomes "studi" (mapping "better" to "good" is lemmatization, not stemming).
Affixes are prefixes, suffixes, infixes, or circumfixes attached to a root word to modify its
meaning. Examples: "un-" (prefix) in "undo," "-ing" (suffix) in "running."
Morphology in NLP refers to the study of the structure of words, including the study of
prefixes, suffixes, and root forms of words.
Corpus Examples:
o Brown Corpus: A collection of texts that covers different genres, such as news
articles and fiction.
o Reuters Corpus: A large collection of news stories used for text classification
tasks.
19. State the difference between word and sentence tokenization. (2 Marks)
Word tokenization is the process of splitting a sentence or text into individual words or tokens. This is a
fundamental step in Natural Language Processing (NLP) that enables further text analysis.
Example: For the text "I love NLP. It is fun.", word tokenization gives ["I", "love", "NLP", ".",
"It", "is", "fun", "."], whereas sentence tokenization gives ["I love NLP.", "It is fun."].
Importance:
Tokenization breaks down text into manageable units, which is essential for tasks like text
classification, Named Entity Recognition (NER), and machine learning models.
Named Entity Recognition (NER) is a subtask of information extraction that identifies and classifies
named entities in text into predefined categories like persons, organizations, locations, dates, etc.
How NER Works:
1. Preprocessing: The text is first tokenized and processed to identify words, punctuation, and
other structural components.
2. Feature Extraction: Features such as part-of-speech (POS) tags, word shapes (e.g., capitalized
words), and surrounding words are extracted.
3. NER Model: Machine learning models (like CRFs, HMMs, or deep learning models) are used to
classify each token in the sentence into a named entity or non-entity.
4. Classification: The tokens are classified into categories such as:
o Person: "Barack Obama"
o Location: "New York"
o Organization: "Google"
o Date: "January 1, 2020"
Example:
23. What are the benefits of eliminating stop words? Give some examples where
stop word elimination may be harmful. (5 Marks)
Stop words are common words (such as "the", "is", "at", etc.) that are usually filtered out during text
processing because they don’t carry significant meaning for certain tasks like information retrieval or
classification.
1. Improved Model Efficiency: Reduces the size of the text data, which can speed up the analysis
and improve processing efficiency.
2. Better Focus on Important Terms: Helps focus on more meaningful words, which are crucial for
understanding the main topics or context.
3. Reduced Noise: Removes common words that may introduce noise into machine learning
models.
1. Context Loss: In some cases, removing stop words may alter the meaning. For example, “I am
reading a book” becomes “reading book,” which can change the intended meaning.
2. For Sentiment Analysis: Words like "not" or "very" can significantly alter sentiment (e.g., "not
good" vs. "good"). Removing such words can lead to misinterpretation.
24. What do you mean by RegEx? Explain with example. (5 Marks)
Regular Expressions (RegEx) are sequences of characters that form search patterns. They are used for
pattern matching within strings, allowing tasks like search, replace, or validate data formats.
1. Pattern Matching: Defines a search pattern to find specific character sequences in strings.
2. Special Characters:
o ".": Matches any single character.
o "*": Matches zero or more occurrences of the preceding character.
o "^": Matches the beginning of a string.
o "$": Matches the end of a string.
Example:
Regex: \d{3}-\d{2}-\d{4}
o Explanation: This matches a pattern like "123-45-6789" (social security number format).
o Input: "My SSN is 123-45-6789."
o Matches: "123-45-6789"
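The same pattern checked with Python's re module (a minimal sketch):
import re

pattern = r"\d{3}-\d{2}-\d{4}"
text = "My SSN is 123-45-6789."

match = re.search(pattern, text)
if match:
    print(match.group())  # 123-45-6789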
Dependency Parsing is a technique in NLP that establishes relationships between words in a sentence,
where the words are connected based on grammatical structure. Each word depends on another word,
forming a tree-like structure.
Understanding Sentence Structure: It helps understand syntactic structures, which is useful for
tasks like machine translation and question answering.
26. Write a regular expression to represent a set of all strings over {a, b} of even
length. (5 Marks)
To create a regular expression for all strings over the alphabet {a, b} of even length, we need to ensure
that every string consists of pairs of characters.
Regular Expression:
Regex: ^(aa|bb|ab|ba)*$
Explanation: Each alternative (aa, ab, ba, bb) consumes exactly two characters from {a, b}, so
repeating the group zero or more times yields only even-length strings, including the empty string.
Example: Matches "", "ab", and "abba"; does not match "a" or "aba".
27. Write a regular expression to represent a set of all strings over {a, b} of
length 4 starting with an a. (5 Marks)
Regular Expression:
Regex: ^a(b|a){3}$
Explanation: The string must start with 'a', followed by exactly three more characters, each of
which is 'a' or 'b', giving a total length of 4. Example: matches "abab"; does not match "babb".
28. Write a regular expression to represent a set of all strings over {a, b}
containing at least one 'a'. (5 Marks)
Regular Expression:
Regex: ^(a|b)*a(a|b)*$
Explanation: (a|b)* allows any string over {a, b} before and after one mandatory 'a', so every
match contains at least one 'a' and no characters outside {a, b}.
Example: Matches "a", "ba", and "bab"; does not match "b" or "bbb". (The sketch after this
answer checks the patterns from questions 26-28.)
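A short sketch that checks the patterns from questions 26-28 with Python's re module (re.fullmatch anchors at both ends, so the explicit ^ and $ can be dropped):
import re

even_length    = re.compile(r"(aa|ab|ba|bb)*")   # Q26: even-length strings over {a, b}
length4_with_a = re.compile(r"a(a|b){3}")        # Q27: length-4 strings starting with 'a'
at_least_one_a = re.compile(r"(a|b)*a(a|b)*")    # Q28: strings over {a, b} with at least one 'a'

print(bool(even_length.fullmatch("abba")))     # True  (length 4)
print(bool(even_length.fullmatch("aba")))      # False (length 3)
print(bool(length4_with_a.fullmatch("abab")))  # True
print(bool(length4_with_a.fullmatch("babb")))  # False (does not start with 'a')
print(bool(at_least_one_a.fullmatch("bba")))   # True
print(bool(at_least_one_a.fullmatch("bbb")))   # False (contains no 'a')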
29. Compare and contrast NLTK and Spacy, highlighting their differences. (5
Marks)
1. Ease of Use:
o NLTK: More beginner-friendly, with a wide range of tools and libraries.
o Spacy: Designed for real-world applications, with a focus on speed and efficiency.
2. Speed:
o NLTK: Slower, as it is more focused on research and educational use.
o Spacy: Optimized for performance and production-level tasks.
3. Pre-trained Models:
o NLTK: Limited pre-trained models for tasks like NER.
o Spacy: Offers state-of-the-art pre-trained models for tasks like POS tagging, dependency
parsing, and NER.
4. Use Cases:
o NLTK: Ideal for educational purposes, research, and exploration.
o Spacy: Better suited for production systems and applications requiring speed and
scalability.
30. What is a Bag of Words? Explain with examples. (5 Marks)
Bag of Words (BoW) is a simple representation of text data where each unique word is treated as a
feature, ignoring grammar and word order but keeping track of word frequencies.
How it works: Build a vocabulary of the unique words in the corpus, then represent each
document as a vector of counts of those vocabulary words.
Example:
Documents:
o Doc 1: "I love NLP"
o Doc 2: "NLP is fun"
Vocabulary: ["I", "love", "NLP", "is", "fun"]
BoW Representation:
o Doc 1: [1, 1, 1, 0, 0] ("I"=1, "love"=1, "NLP"=1, "is"=0, "fun"=0)
o Doc 2: [0, 0, 1, 1, 1] ("I"=0, "love"=0, "NLP"=1, "is"=1, "fun"=1)
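A minimal pure-Python sketch that reproduces the representation above (kept library-free so the single-character token "I" is not dropped, as some vectorizers would do):
docs = ["I love NLP", "NLP is fun"]

# Build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

print(vocab)  # ['I', 'love', 'NLP', 'is', 'fun']

# Represent each document as a vector of word counts over the vocabulary.
for doc in docs:
    words = doc.split()
    print([words.count(w) for w in vocab])
# Doc 1: [1, 1, 1, 0, 0]
# Doc 2: [0, 0, 1, 1, 1]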
1. Definition:
o Regular Grammar: A formal grammar that can be used to describe regular languages,
typically used in computational theory.
o Regular Expression (RegEx): A sequence of characters that forms a search pattern, used
for matching strings in text.
2. Use:
o Regular Grammar: Defines rules to generate strings in a language.
o Regular Expression: Defines patterns to search or match strings.
3. Application:
o Regular Grammar: Often used in theoretical computer science to describe languages.
o Regular Expression: Used in programming for pattern matching, text searching, and
validation.
4. Syntax:
o Regular Grammar: Involves production rules like S → aS | bS | ε.
o Regular Expression: Uses symbols like *, +, ?, etc., for matching text.
32. Describe the word and sentence tokenization steps with the help of an
example. (10 Marks)
Word Tokenization:
Sentence Tokenization:
Example of Both:
33. How can the common challenges faced in morphological analysis in natural
language processing be overcome? (10 Marks)
1. Ambiguity: Words may have multiple forms or meanings, e.g., "run" can be a verb or noun.
2. Rich Morphology: Languages like Turkish or Finnish have many forms of a word based on
inflections.
3. Out-of-Vocabulary Words: New or rare words that don’t appear in dictionaries.
Solutions:
1. Use of Morphological Analyzers: Tools like the Porter Stemmer or Snowball Stemmer can
handle many common inflections.
2. Contextual Analysis: Implementing context-based analysis to disambiguate word meanings,
such as using POS tagging.
3. Machine Learning Models: Train models on large datasets to predict rare word forms and
handle out-of-vocabulary words.
4. Lexical Resources: Use comprehensive lexicons or dictionaries for handling rich morphology in
specific languages.
34. Derive Minimum Edit Distance Algorithm and compute the minimum edit
distance between the words "MAM" and "MADAM". (10 Marks)
The Minimum Edit Distance algorithm computes the minimum number of operations (insertions,
deletions, or substitutions) required to convert one string into another.
1. Initialization: Create a matrix where the cell (i, j) represents the edit distance between the
first i characters of the first word and the first j characters of the second word.
2. Recurrence Relation:
o If the characters are equal: cost(i, j) = cost(i-1, j-1)
o If the characters are different: cost(i, j) = 1 + min(cost(i-1, j), cost(i,
j-1), cost(i-1, j-1))
3. Result: The bottom-right cell of the matrix will contain the minimum edit distance.
Example: "MAM" → "MADAM"

         ""  M  A  D  A  M
    ""    0  1  2  3  4  5
    M     1  0  1  2  3  4
    A     2  1  0  1  2  3
    M     3  2  1  1  2  2

The bottom-right cell gives the minimum edit distance of 2 (insert 'D' and 'A' between the second
and third characters of "MAM" to obtain "MADAM").
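A standard dynamic-programming sketch in Python (it also verifies the other edit-distance examples asked later in this set):
def min_edit_distance(source, target):
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                              # delete all i characters of source
    for j in range(n + 1):
        d[0][j] = j                              # insert all j characters of target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                d[i][j] = d[i - 1][j - 1]        # characters match: no extra cost
            else:
                d[i][j] = 1 + min(d[i - 1][j],       # deletion
                                  d[i][j - 1],       # insertion
                                  d[i - 1][j - 1])   # substitution
    return d[m][n]

print(min_edit_distance("MAM", "MADAM"))          # 2
print(min_edit_distance("ELEPHANT", "RELEVANT"))  # 3
print(min_edit_distance("SUNDAY", "SATURDAY"))    # 3
print(min_edit_distance("kitten", "sitting"))     # 3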
36. How to solve any application of NLP? Justify with an example. (10 Marks)
37. What is Corpora? Define the steps of creating a corpus for a specific task. (10
Marks)
Corpora:
Definition: A corpus is a large and structured set of texts that is used for statistical analysis,
natural language processing, and machine learning applications. It provides a foundation for
training NLP models.
Information Extraction (IE) is the process of automatically extracting structured information from
unstructured text. It aims to identify specific entities, relationships, and events within the text to convert
it into a structured format.
Key Tasks:
1. Named Entity Recognition (NER): Identifying proper nouns like names of people, places,
organizations, dates, etc.
2. Relationship Extraction: Identifying relationships between entities (e.g., "Alice works at XYZ
Corp").
3. Event Extraction: Identifying events and their participants (e.g., "Bob met Alice on January 1st").
Example:
Text: "Apple announced its new iPhone in San Francisco on September 10, 2020."
Extracted Information:
o Entity: "Apple" (organization)
o Event: "announced"
o Location: "San Francisco"
o Date: "September 10, 2020"
39. State the different applications of Sentiment analysis and Opinion mining
with examples. Write down the variations as well. (10 Marks)
Variations:
1. Fine-Grained Sentiment Analysis: Detects sentiments at a more granular level (e.g., determining
if a customer is happy with specific features).
2. Aspect-Based Sentiment Analysis: Focuses on identifying sentiments about specific aspects
(e.g., "The battery life of the phone is great" vs "The camera quality is poor").
40. State a few applications of Information Retrieval. (5 Marks)
Text Normalization:
Text normalization is the process of transforming text into a standardized format to reduce variations
and make it easier to process. It involves tasks like:
1. Tokenization:
o Definition: The process of splitting text into smaller units, such as words, phrases, or
sentences.
o Purpose: Helps break down text into manageable chunks for further analysis.
o Example:
Input: "I love NLP!"
Tokens: ["I", "love", "NLP", "!"]
2. Normalization:
o Definition: The process of standardizing text by converting it into a uniform format.
o Purpose: Helps reduce variations by making the text consistent for analysis.
o Example:
Input: "I'm loving NLP!"
Normalized: "I am loving NLP"
Justification:
Tokenization deals with breaking text into units, while Normalization ensures uniformity by
addressing case, punctuation, and other inconsistencies.
43. What makes part-of-speech (POS) tagging crucial in NLP, in your opinion?
Give an example to back up your response. (5 Marks)
45. Do you believe there are any distinctions between prediction and
classification? Illustrate with an example. (5 Marks)
1. Prediction:
o Definition: Prediction refers to estimating a continuous value based on input features. It
typically involves regression models where the output is a real number.
o Example: Predicting the temperature for tomorrow based on historical data.
2. Classification:
o Definition: Classification is the process of categorizing input data into predefined classes
or categories. It involves assigning labels to data points.
o Example: Classifying emails as either "spam" or "not spam."
Key Distinction:
Prediction deals with continuous values, while classification deals with discrete labels or
categories.
46. Explain the connection between word tokenization and phrase tokenization
using examples. How do both tokenization methods contribute to the
development of NLP applications? (10 Marks)
1. Word Tokenization:
o Definition: The process of breaking down a text into individual words.
o Purpose: It’s often the first step in NLP preprocessing, allowing the model to understand
and process each word independently.
o Example:
Sentence: "I love NLP."
Tokens: ["I", "love", "NLP"]
2. Phrase Tokenization:
o Definition: The process of grouping multiple words into meaningful units or phrases,
often to preserve context and meaning.
o Purpose: Useful for recognizing multi-word expressions like named entities, common
phrases, or keywords.
o Example:
Sentence: "Natural Language Processing is fascinating."
Phrases: ["Natural Language Processing", "fascinating"]
Word Tokenization focuses on breaking down the sentence into words, while Phrase
Tokenization focuses on understanding and grouping important multi-word units (such as
entities, technical terms, etc.).
Both tokenization techniques contribute to NLP applications such as Named Entity Recognition
(NER), machine translation, sentiment analysis, and information retrieval by ensuring
meaningful segmentation of the text, preserving context, and improving accuracy.
47. “Natural Language Processing (NLP) has many real-life applications across
various industries.”- List any two real-life applications of Natural Language
Processing. (5 Marks)
Solution:
Solutions:
1. Definition:
o Rule-based POS tagging uses a set of handcrafted rules to assign POS tags to words in a
sentence. These rules are based on the word's context, surrounding words, and
syntactic patterns.
2. Process:
o A word is assigned a tag based on predefined patterns, such as if a word follows a
determiner (DT), it is tagged as a noun (NN).
o Example: In the sentence "The cat is on the mat," "The" would be tagged as a
determiner (DT), "cat" as a noun (NN), and "is" as a verb (VB).
3. Advantages:
o Simple and interpretable.
o Can be highly accurate with a well-defined set of rules.
4. Disadvantages:
o Rules must be manually created, which can be time-consuming.
o Limited flexibility to handle ambiguities in language.
Regular Grammar:
Definition: A formal grammar that generates a regular language. It consists of production rules
where the left-hand side is a non-terminal and the right-hand side is either a terminal or a
combination of terminal and non-terminal symbols.
Example:
o Production: S → aS | b
Usage: Used for defining the structure of languages in automata theory.
Regular Expression:
Definition: A sequence of characters that defines a search pattern, primarily for string matching
within texts. It is more compact and concise for pattern matching.
Example:
o Regular expression: a(b|a)*
Usage: Used in text searching, pattern matching, and string manipulation tasks.
Key Differences:
Concept: Regular grammar is used to generate strings, while regular expressions are used to
match patterns in strings.
Flexibility: Regular expressions provide more succinct and flexible ways to represent patterns
than regular grammar.
Definition: Multi-Word Tokenization refers to the process of identifying and grouping multiple
words that together represent a single unit or meaning, such as named entities (e.g., "New
York") or multi-word expressions (e.g., "high school").
Example:
o Tokenizing "New York City" as a single token rather than three separate words "New,"
"York," and "City."
Definition: Sentence segmentation is the process of dividing a text into individual sentences. It
involves identifying sentence boundaries, typically by recognizing punctuation marks such as
periods, question marks, and exclamation points.
Example:
o Given the text: "Hello! How are you? I am fine.", the segmentation would be: ["Hello!",
"How are you?", "I am fine."]
Definition: Morphology is the study of the structure and form of words. In NLP, it refers to the
process of analyzing and understanding the internal structure of words, including their prefixes,
suffixes, and root forms.
Example: The word "running" can be broken down into the root word "run" and the suffix "-
ing."
Examples:
1. Brown Corpus: A collection of texts representing a wide variety of written genres.
2. WordNet: A lexical database of English, where words are grouped into sets of
synonyms.
3. Reuters-21578: A collection of news documents, commonly used for text classification
tasks.
Definition: Word tokenization is the process of splitting a text into individual words or tokens.
This is often the first step in text preprocessing for NLP tasks.
Example:
o Sentence: "I love NLP."
o Tokens: ["I", "love", "NLP"]
58. Find the minimum edit distance between two strings ELEPHANT and
RELEVANT? (10 Marks)
Solution:
The minimum edit distance between two strings is calculated using the Levenshtein Distance algorithm,
which finds the minimum number of operations (insertions, deletions, and substitutions) required to
transform one string into another.
1. Strings:
o ELEPHANT
o RELEVANT
2. Operations:
o Insert, delete, or substitute characters to transform one string into the other.
3. Calculation:
o Using dynamic programming, the minimum edit distance between "ELEPHANT" and
"RELEVANT" is calculated as 3 (insert 'R' at the beginning, delete 'P', and substitute 'H' with 'V').
Solution:
1. Strings:
o SUNDAY
o SATURDAY
2. Operations:
o Insertions, deletions, or substitutions are required to convert one string into the other.
3. Minimum Edit Distance:
o The minimum edit distance between "SUNDAY" and "SATURDAY" is 3 (insert 'A', insert 'T',
and substitute 'N' with 'R').
Types of Morphology:
1. Inflectional Morphology:
o Deals with the modification of words to express different grammatical categories such
as tense, case, or number.
o Example: "running" (verb) → "runs" (third-person singular).
2. Derivational Morphology:
o Involves the creation of new words by adding prefixes or suffixes to existing words.
o Example: "happy" → "unhappy" (prefix), "teach" → "teacher" (suffix).
3. Compounding:
o The process of combining two or more words to create a new word.
o Example: "toothbrush" (tooth + brush).
4. Clitics:
o Words or morphemes that cannot stand alone but attach to other words (e.g.,
contractions).
o Example: "I'll" (I + will), "he's" (he + is).
Definition: Stemming is the process of reducing a word to its root form by removing affixes
(prefixes or suffixes).
Example: "running" → "run", "happier" → "happi".
Definition: A corpus is a large and structured set of texts used for linguistic analysis or as
training data for NLP models. It contains a collection of written or spoken texts, often annotated
with additional linguistic information.
Example: The Brown Corpus, containing a diverse collection of text from different genres.
63. State with example the difference between stemming and lemmatization. (5
Marks)
Stemming:
Definition: The process of removing affixes from words to get the root form, which might not
always be a valid word.
Example: "better" → "bett", "running" → "run".
Lemmatization:
Definition: The process of reducing a word to its base or dictionary form, called the lemma. It
considers the word's context and part of speech.
Example: "better" → "good", "running" → "run".
Key Differences:
Stemming may produce non-dictionary words, while lemmatization always results in a valid
word.
Lemmatization considers the word's meaning and context, whereas stemming only focuses on
removing affixes.
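A small NLTK sketch contrasting the two, assuming NLTK is installed and its 'wordnet' data package has been downloaded:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # run
print(stemmer.stem("happiness"))                 # happi  (not a dictionary word)

# Lemmatization uses the part of speech: 'v' = verb, 'a' = adjective.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good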
64. Write down the different stages of NLP pipeline. (10 Marks)
1. Text Acquisition:
o Collecting text data from various sources such as websites, documents, or databases.
2. Text Preprocessing:
o Includes tasks like tokenization, removing stop words, punctuation, and special
characters.
o Example: "Hello, World!" → ["Hello", "World"]
3. Part-of-Speech Tagging:
o Assigning POS tags (like noun, verb, adjective) to each token.
4. Named Entity Recognition (NER):
o Identifying entities such as names, dates, or locations in the text.
5. Syntactic Parsing:
o Analyzing the syntactic structure of the sentence to understand grammar relationships.
6. Semantic Analysis:
o Understanding the meaning of the text using techniques like sentiment analysis or word
embeddings.
7. Text Generation:
o Generating new text, such as summarization or text completion, based on the processed
data.
8. Output:
o The final output such as translated text, summarized text, or chatbot responses.
65. What is your understanding about Chatbot in the context of NLP? (10
Marks)
Types of Chatbots:
66. Write short note on text pre-processing in the context of NLP. Discuss
outliers and how to handle them (10 Marks)
Text Preprocessing: It refers to the series of steps taken to clean and prepare raw text data for
further analysis or modeling. Preprocessing ensures that the data is in a usable form for machine
learning algorithms.
Definition: Outliers are unusual or inconsistent data points that deviate significantly from the
rest of the dataset.
Handling Outliers:
o Removing Outliers: Remove text entries that are too short or too long.
o Normalizing Outliers: Transforming outliers into a more standardized format, such as
lowercasing or fixing spelling mistakes.
67. Explain with example the challenges with sentence tokenization. (5 Marks)
Sentence Tokenization: The process of dividing a block of text into individual sentences.
Challenges:
1. Punctuation Ambiguities:
o Example: "I saw the movie. It was great." can be easily tokenized into two sentences.
But, "Dr. Smith is a doctor." could be incorrectly split at the period.
2. Abbreviations:
o Abbreviations like "U.S." or "e.g." could cause incorrect sentence boundary detection.
They don’t indicate the end of a sentence.
3. Complex Sentence Structures:
o Sentences with quotes, parentheses, or embedded clauses can confuse tokenizers.
Example: "He said, 'I will help you later.'" might be incorrectly tokenized.
4. Multilingual Issues:
o Some languages like Chinese do not have explicit sentence-ending punctuation, making
sentence segmentation difficult.
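A brief sketch using NLTK's trained Punkt sentence tokenizer (assumes the 'punkt' data package is downloaded); it handles most of the abbreviation cases above, though not every edge case:
from nltk.tokenize import sent_tokenize

text = "Dr. Smith is a doctor. He said, 'I will help you later.' Visit the U.S. office."
for sentence in sent_tokenize(text):
    print(sentence)
# A naive split on '.' would wrongly break after "Dr." and "U.S.".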
1. Tokenization:
o Splitting text into smaller units like words, sentences, or sub-words.
o Example: "I love NLP" → ["I", "love", "NLP"]
2. Part-of-Speech (POS) Tagging:
o Assigning grammatical categories (noun, verb, adjective) to words in a sentence.
o Example: "The cat sleeps." → [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
3. Named Entity Recognition (NER):
o Identifying and classifying named entities (e.g., person names, locations, dates).
o Example: "Barack Obama was born in Hawaii." → [("Barack Obama", "PERSON"),
("Hawaii", "LOCATION")]
4. Sentiment Analysis:
o Determining the sentiment (positive, negative, neutral) of a given text.
o Example: "I love this product!" → Positive sentiment.
5. Machine Translation:
o Translating text from one language to another.
o Example: "Bonjour" → "Hello" (French to English).
69. What do you mean by text extraction and cleanup? Discuss with examples.
(10 Marks)
Text Extraction: The process of extracting relevant pieces of text from a larger corpus or
document. It involves identifying and retrieving specific information.
Original: "Hello!! How are you? <br> Have a good day! <br> "
Cleaned: "Hello How are you Have a good day"
70. What is word sense ambiguity in NLP? Explain with examples. (5 Marks)
Word Sense Ambiguity: Occurs when a word has multiple meanings or senses, and determining
the correct sense depends on context.
Example 1:
Bag of Words (BOW): A popular model used for representing text data where each word in a
document is treated as a unique token without considering the order of words. It focuses on the
frequency of words to build a vector representation.
Example:
Document 1: [1, 1, 1, 0, 0]
Document 2: [0, 0, 1, 1, 1]
Homonymy refers to the phenomenon where two or more words share the same spelling
or pronunciation but have different meanings.
Example:
o "Bank" (a financial institution) vs. "Bank" (the side of a river).
Homonyms can create ambiguity in text, which requires disambiguation in NLP
tasks.
WordNet is a large lexical database of the English language, which groups words into
sets of synonyms called synsets.
Words in WordNet are interlinked through semantic relationships like synonymy,
antonymy, hyponymy, and meronymy, helping machines understand word meanings.
Example: The synset for the word "dog" includes words like "canine", "pooch", and
"pup".
74. Consider a document containing 100 words wherein the word apple appears 5 times
and assume we have 10 million documents and the word apple appears in one thousandth
of these. Then, calculate the term frequency and inverse document frequency. (CO4 BL5)
10 Marks
Term Frequency (TF): The number of times a term appears in a document divided by
the total number of terms in the document.
o TF = (5 / 100) = 0.05
Inverse Document Frequency (IDF): Measures how important a term is in the entire
corpus.
o IDF = log(Total Documents / Documents containing the word)
o IDF = log(10,000,000 / 10,000) = log(1000) = 3 (using a base-10 logarithm)
TF-IDF: TF * IDF = 0.05 * 3 = 0.15
The term "apple" has a relatively low TF but a moderate IDF, indicating its relevance in
the context of the larger corpus.
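The same arithmetic as a quick Python check (base-10 logarithm, counts as given in the question):
import math

tf = 5 / 100                               # term count / total terms in the document
idf = math.log10(10_000_000 / 10_000)      # log(N / documents containing "apple")
tf_idf = tf * idf

print(tf)      # 0.05
print(idf)     # 3.0
print(tf_idf)  # 0.15 (up to floating-point rounding)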
75. Explain the relationship between Singular Value Decomposition, Matrix Completion,
and Matrix Factorization. (CO1 BL3) 5 Marks
76. Give two examples that illustrate the significance of regular expressions in NLP. (CO1
BL1) 5 Marks
1. Text Cleaning: Regular expressions are used to remove unwanted characters (e.g.,
punctuation, special symbols) from raw text data to clean the input before processing it.
Example: Removing digits or punctuation from text: re.sub(r"[^a-zA-Z\s]", "", text)
(see the sketch after this list).
2. Tokenization: Regular expressions can help in splitting a text into words or phrases
based on specific patterns.
Example: Tokenizing a sentence into words using a regular expression pattern like
r"\b\w+\b".
77. Why is multiword tokenization preferable over single word tokenization in NLP? Give
examples. (CO1 BL1) 5 Marks
78. Differentiate between formal language and natural language. (CO3 BL1) 10 Marks
Formal Language:
1. A formal language is a set of strings that follow a specific syntactic structure
defined by a set of rules or grammar (e.g., programming languages, mathematical
expressions).
2. It is unambiguous and precise, allowing machines to process it without errors.
3. Examples include languages like C, Python, and SQL.
Natural Language:
1. A natural language is a language that has evolved naturally among humans for
communication (e.g., English, Spanish, Hindi).
2. It is often ambiguous and subject to nuances, slang, idioms, and cultural context,
which makes it harder for machines to process.
3. Examples include English, Bengali, Chinese.
Key Differences:
1. Formal languages are syntactically rigid, while natural languages are flexible.
2. Natural languages can be ambiguous, while formal languages aim to eliminate
ambiguity.
79. Explain lexicon, lexeme and the different types of relations that hold between lexemes.
(CO1 BL1) 10 Marks
Lexicon:
o A lexicon is a collection or dictionary of all the words in a language, along with
information about their meanings, pronunciations, and grammatical properties.
o In NLP, a lexicon helps machines understand and process words more effectively.
Lexeme:
o A lexeme is a fundamental unit of meaning in the lexicon, representing a set of
forms related to a single word.
o It is an abstract unit, like the word "run" which includes variations like "runs,"
"ran," "running."
Relations Between Lexemes:
1. Synonymy: Lexemes that have similar meanings (e.g., "big" and "large").
2. Antonymy: Lexemes that have opposite meanings (e.g., "hot" and "cold").
3. Hyponymy: A lexeme that represents a specific instance of a broader category
(e.g., "dog" is a hyponym of "animal").
4. Meronymy: Lexemes that represent a part-whole relationship (e.g., "wheel" is a
part of "car").
80. State the advantages of bottom-up chart parser compared to top-down parsing. (CO1
BL1) 10 Marks
82. Describe the Skip-gram model and its intuition in word embeddings. (CO1 BL2) 10
Marks
Skip-gram Model:
1. The Skip-gram model is a neural network-based technique used in word
embeddings (such as Word2Vec) to learn vector representations of words.
2. Given a target word, the Skip-gram model predicts the surrounding words
(context words) within a defined window size in a sentence.
3. For example, in the sentence "The cat sat on the mat," if "sat" is the target word,
the model tries to predict context words like "cat," "on," "the," "mat."
Intuition:
1. The core idea is that words appearing in similar contexts tend to have similar
meanings, which is captured through vector representations.
2. By training the model to predict surrounding words, the Skip-gram model learns
to assign words with similar meanings to similar vector positions in the
embedding space.
3. It is particularly useful for learning word vectors when there is a large corpus, as
it focuses on predicting the context for each target word in the text.
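A minimal Skip-gram sketch using gensim's Word2Vec (assumes gensim 4.x; sg=1 selects the Skip-gram architecture, and a toy corpus this small only illustrates the API, not useful embeddings):
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)         # (50,) -- the learned vector for "cat"
print(model.wv.most_similar("cat"))  # neighbours are only meaningful with a large corpus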
83. Explain the concept of Term Frequency-Inverse Document Frequency (TF-IDF) based
ranking in information retrieval. (CO1 BL2) 10 Marks
IDF = log(N / DF)
where N is the total number of documents and DF is the number of documents containing the
word.
Ranking:
1. Documents with higher TF-IDF scores for a query term are considered more
relevant because they contain terms that are not overly common (low IDF) and
are more significant to the specific document (high TF).
2. This helps in ranking documents by relevance to the user's query, improving the
quality of search results in information retrieval systems.
84. Tokenize and tag the following sentence: (CO1 BL1) 2 Marks
85. What different pronunciations and parts-of-speech are involved? (CO1 BL1) 2 Marks
Pronunciations:
1. Words like "lead" can be pronounced differently depending on their meaning
(present tense verb vs. noun referring to the element).
Parts-of-speech:
1. Homographs, such as "lead," can be both a noun and a verb.
2. The parts-of-speech involved may include nouns (NN), verbs (VB), and
adjectives (JJ).
86. Compute the edit distance (using insertion cost 1, deletion cost 1, substitution cost 1) of
“intention” and “execution”. Show your work using the edit distance grid. (CO1 BL4) 10
Marks
87. What is the purpose of constructing corpora in Natural Language Processing (NLP)
research? (CO1 BL2) 5 Marks
88. What role do regular expressions play in searching and manipulating text data? (CO1
BL3) 5 Marks
89. Explain the purpose of WordNet in Natural Language Processing (NLP). (CO1 BL4) 10
Marks
Purpose of WordNet:
1. Lexical Database: WordNet is a large lexical database of English, organized into
sets of synonyms called synsets. It groups words based on their meanings and
relationships.
2. Semantic Relationships: It provides semantic relationships between words, such
as synonymy, antonymy, hypernymy, and hyponymy.
3. Word Sense Disambiguation: WordNet is commonly used in tasks like word
sense disambiguation, where the correct meaning of a word is determined based
on context.
4. Information Retrieval: It is used in information retrieval systems to improve
search results by understanding word meanings more deeply and identifying
relevant documents.
5. NLP Applications: WordNet plays a crucial role in machine translation, text
summarization, and sentiment analysis by providing a richer semantic
understanding of words.
91. Describe the class of strings matched by the following regular expressions: a. [a-zA-Z]+ b.
[A-Z][a-z]* (CO1 BL4) 10 Marks
a. [a-zA-Z]+:
1. This regular expression matches strings that consist of one or more alphabetical
characters (both lowercase and uppercase).
2. It can match words such as "hello", "HELLO", or "Hello" but will not match
numbers, spaces, or any non-alphabetic characters.
b. [A-Z][a-z]*:
1. This regular expression matches strings where the first character is an uppercase
letter (A-Z), followed by zero or more lowercase letters (a-z).
2. It can match strings like "Hello", "World", "Apple", but not strings like "hello",
"world", or "123".
92. Extract all email addresses from the following: “Contact us at [email protected] or
[email protected].” (CO1 BL4) 10 Marks
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
2. Matches:
[email protected]
[email protected]
93. This regex is intended to match one or more uppercase letters followed by zero or more
digits: [A-Z]+[0-9]*. However, it has a problem. What is it, and how can it be fixed? (CO1 BL4)
10 Marks
Problem:
1. The regular expression [A-Z]+[0-9]* can match any string that starts with one or
more uppercase letters and is optionally followed by digits. However, it will also
match strings where there are no digits, which might not be intended.
2. For example, the regex would match "HELLO" (which has no digits).
Fix:
1. To ensure the regex only matches strings that have at least one uppercase letter
followed by at least one digit, you can modify the regular expression:
[A-Z]+[0-9]+
2. This ensures that after the uppercase letters, there must be one or more digits.
94. Write a regex to find all dates in a text. The date formats should include: DD-MM-
YYYY, MM-DD-YYYY, YYYY-MM-DD. (CO1 BL4) 10 Marks
(\d{2})-(\d{2})-(\d{4})|(\d{4})-(\d{2})-(\d{2})
2. This matches:
12-05-2023 (DD-MM-YYYY)
05-12-2023 (MM-DD-YYYY)
2023-12-05 (YYYY-MM-DD)
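Applying the pattern with re.findall (capture groups dropped so whole matches are returned; note that DD-MM-YYYY and MM-DD-YYYY share one pattern and cannot be distinguished by the regex alone):
import re

pattern = r"\d{2}-\d{2}-\d{4}|\d{4}-\d{2}-\d{2}"
text = "Deadlines: 12-05-2023, 05-12-2023 and 2023-12-05."

print(re.findall(pattern, text))
# ['12-05-2023', '05-12-2023', '2023-12-05']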
95. Compute the minimum edit distance between the words MAMA and MADAAM. (CO1
BL5) 10 Marks
96. Evaluate the minimum edit distance in transforming the word ‘kitten’ to ‘sitting’ using
insertion, deletion, and substitution cost as 1. (CO1 BL5) 10 Marks
Module 2
1. What are language models? (2 Marks)
Answer:
A language model is a statistical model that is used to predict the next word or sequence of
words in a sentence. It assigns a probability to a sequence of words, based on the prior
occurrences of the sequence in a corpus. Language models are fundamental in many NLP tasks,
such as speech recognition, text generation, and machine translation.
Answer:
The n-gram model is a type of probabilistic language model where the probability of a word
depends on the previous n-1 words. For example, in a bigram model (n=2), the probability of the
next word depends only on the previous word.
Example: "I am learning NLP."
Answer:
Answer:
In the bigram model, we make the following approximations:
Answer:
The Markov assumption states that the future state of a process depends only on the current state,
and not on the sequence of events that preceded it. In the context of language models, it means
that the probability of a word depends only on the immediately preceding word (or a fixed
number of previous words in the case of n-grams).
Answer:
Maximum Likelihood Estimation (MLE) is used to estimate the parameters of a probability
distribution or model that maximizes the likelihood of observing the given data. In language
modeling, MLE is used to estimate the probabilities of word sequences in n-gram models based
on the frequency of occurrences in the training data.
8. Given a word w_n and the previous word w_{n-1}, how do we normalize the count of
bigrams? State the formula for the same. (10 Marks)
Answer:
To normalize the count of bigrams, we divide the bigram count by the count of the preceding
word, giving the relative frequency of the bigram (w_{n-1}, w_n):
P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
Where:
C(w_{n-1}, w_n) is the number of times the bigram appears in the corpus, and C(w_{n-1}) is the
number of times the word w_{n-1} appears.
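A small sketch of this maximum likelihood estimate on a toy corpus:
from collections import Counter

corpus = "i am learning nlp i am enjoying nlp".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = C(prev, word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("i", "am"))         # 1.0 -> C(i, am) / C(i) = 2 / 2
print(bigram_prob("am", "learning"))  # 0.5 -> C(am, learning) / C(am) = 1 / 2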
Answer:
Relative frequency in the n-gram model refers to the ratio of the occurrence of a particular n-
gram to the total number of occurrences of the (n-1)-gram that precedes it. It is used to estimate
the probability of a sequence of words.
Formula:
P(w_n | w_1, w_2, ..., w_{n-1}) = C(w_1, w_2, ..., w_n) / C(w_1, w_2, ..., w_{n-1})
Answer:
The building blocks of a semantic system include:
1. Lexical Semantics: It involves understanding the meaning of individual words and their
relationships, including synonyms, antonyms, and polysemy.
2. Compositional Semantics: It is concerned with how words combine to form phrases and
sentences with specific meanings.
3. Pragmatics: This refers to how context influences the interpretation of meaning in
communication.
4. World Knowledge: It involves the real-world knowledge and common-sense reasoning
needed to understand certain expressions or sentences.
5. Reference and Inference: Understanding how terms refer to objects or concepts in the
world and how inferences are drawn from text.
11. Discuss lexical ambiguity. (5 Marks)
Answer:
Lexical ambiguity occurs when a word has multiple meanings or senses. This can happen when
the word belongs to different categories or has several interpretations based on context.
Example:
"Bank":
1. The side of a river (noun).
2. A financial institution (noun).
The correct meaning depends on the surrounding context, making it lexically
ambiguous.
Answer:
Semantic ambiguity arises when a phrase or sentence can be interpreted in multiple ways due to
the meanings of words or the structure of the sentence itself. This is often because of the
vagueness or multiple interpretations of meanings.
Example:
Answer:
Syntactic ambiguity occurs when a sentence or phrase can be interpreted in multiple ways
because of its syntactic structure. Different parse trees or sentence structures lead to multiple
meanings.
Example:
1. Disambiguation: Helps resolve ambiguities and provides clarity about word meanings in
context.
2. Information Retrieval: Improves the accuracy of search results by better understanding
the meaning of queries.
3. Natural Language Understanding: Supports tasks like machine translation, question
answering, and summarization.
15. What is the major difference between lexical analysis and semantic analysis
in NLP? (5 Marks)
Answer:
Lexical Analysis: It involves analyzing the structure of words and breaking them into
tokens. This is the process of converting raw input into meaningful units such as words,
symbols, and punctuation.
Semantic Analysis: It is concerned with extracting meaning from the text. Semantic
analysis involves resolving ambiguity and interpreting the relationships between words
and phrases in a way that machines can understand.
Difference: Lexical analysis focuses on word-level processing, while semantic analysis
deals with understanding the meaning of words and sentences.
Answer:
17. With examples, explain the different types of parts of speech attributes. (5
Marks)
Answer:
Parts of speech (POS) attributes define the role of words in sentences. Common types include:
1. Noun: Denotes a person, place, thing, or idea.
Example: "dog", "city"
2. Verb: Represents an action or state.
Example: "run", "is"
3. Adjective: Describes a noun.
Example: "beautiful", "large"
4. Adverb: Modifies a verb, adjective, or another adverb.
Example: "quickly", "very"
5. Pronoun: Replaces a noun.
Example: "he", "it"
6. Preposition: Shows relationships between nouns and other words.
Example: "in", "on"
7. Conjunction: Connects words or phrases.
Example: "and", "but"
18. Explain extrinsic evaluation of the N-gram model and the difficulties related
to it. (10 Marks)
Answer:
Extrinsic evaluation refers to assessing the N-gram model by evaluating its performance on an
external task, such as speech recognition, text generation, or machine translation, where the
model is applied to real-world problems.
Difficulties:
19. With an example, explain the path-based similarity check for two words. (5
Marks)
Answer:
Path-based similarity checks compare two words by examining the shortest path between them in
a lexical database like WordNet. The similarity score depends on how closely related the words
are in the network.
Example:
For the words "cat" and "dog," a path-based similarity measure might calculate the
shortest path from "cat" to "dog" in WordNet. If they share a common parent (like
"animal"), their similarity score would be higher.
Cat -> Animal -> Dog
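The same idea with NLTK's WordNet interface (assumes the 'wordnet' data package is downloaded; exact scores depend on the WordNet version):
from nltk.corpus import wordnet as wn

cat = wn.synset("cat.n.01")
dog = wn.synset("dog.n.01")
car = wn.synset("car.n.01")

# path_similarity is based on the shortest path between synsets in the
# hypernym/hyponym hierarchy (1.0 means the same concept).
print(cat.path_similarity(dog))  # typically 0.2 -- close via shared ancestors
print(cat.path_similarity(car))  # a smaller value -- far less related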
Answer:
Homonymy: Words that have the same form but different meanings.
Example: "bat" (flying mammal) vs. "bat" (sports equipment).
Polysemy: A single word has multiple related meanings.
Example: "bank" (financial institution) vs. "bank" (side of a river).
Synonymy: Words that have similar or identical meanings.
Example: "happy" and "joyful."
21. How does WordNet assist in extracting semantic information from a corpus?
(5 Marks)
Answer:
WordNet is a lexical database that organizes words into sets of synonyms (synsets), with
relationships between them. It aids in extracting semantic information in several ways:
1. Synonymy: Helps in identifying words with similar meanings, useful in tasks like word
sense disambiguation and information retrieval.
2. Hyponymy/Hypernymy: Identifies hierarchical relationships between words (e.g., "dog"
is a hyponym of "animal").
3. Meronymy: Identifies part-whole relationships (e.g., "wheel" is a meronym of "car").
4. Path-based similarity: Helps compute semantic similarity between words based on the
paths connecting them in the hierarchy.
22. How does NLP employ computational lexical semantics? Explain. (5 Marks)
Answer:
In computational lexical semantics, NLP systems represent and analyze the meaning of words
and their relationships computationally. This involves:
1. Word Sense Disambiguation (WSD): Resolving ambiguity in words that have multiple
meanings (e.g., "bat" as a flying mammal or as sports equipment).
2. Semantic Role Labeling (SRL): Assigning roles to words in a sentence, helping the
system understand the relationships (e.g., identifying agents, actions, and objects).
3. Vector Representations: Using techniques like Word2Vec or GloVe, words are
represented in vector space, where semantic similarity can be captured by vector
proximity.
4. Corpus-based Analysis: Using large corpora to model semantic relationships, improve
understanding, and extract meaning from context.
23. What are the problems with basic path-based similarity measures, and how
are they reformed through information content similarity metrics? (10 Marks)
Answer:
Problems with Path-based Similarity Measures:
1. Superficial Relationships: Path-based measures rely on the structural path in the lexical
network, which may not always capture deep semantic similarity.
2. Lack of Context: Path-based methods do not consider the frequency or context of word
usage, potentially leading to misleading similarity scores.
3. Limited Coverage: Some words may not have direct paths between them, making it hard
to compute a meaningful similarity.
Answer:
The Lesk algorithm is a popular method for word sense disambiguation. It compares the
dictionary definitions (or glosses) of word senses to select the most appropriate meaning based
on overlap with surrounding words.
Extended Lesk Algorithm improves this by including contextual words not just in the
surrounding window but throughout the entire document.
Example:
For the word "bank", the basic Lesk algorithm compares the definitions of "bank" (a
financial institution) with surrounding words like "money" and "investment".
In the extended version, it also considers broader context from the text like "river" or
"water" to choose "bank" (side of a river) as the correct sense.
25. State the difference in properties of Rule-based POS tagging and Stochastic
POS tagging. (5 Marks)
Answer:
26. What is stochastic POS tagging? What are the properties of stochastic POS
tagging? (10 Marks)
Answer:
Stochastic POS tagging involves using statistical models (such as Hidden Markov Models or
Conditional Random Fields) to assign POS tags based on observed frequencies and probabilities
in a given corpus. It predicts the most likely POS tags by considering the context of surrounding
words.
Properties:
1. Context-dependent: Unlike rule-based tagging, stochastic tagging takes into account the
context in which a word appears.
2. Probabilistic: It uses probabilities to make decisions, often yielding higher accuracy on
ambiguous words.
3. Data-driven: Stochastic models learn from annotated corpora, improving accuracy as
more data is provided.
4. Handles Ambiguity: Stochastic POS tagging is particularly effective in dealing with
ambiguous words, as it can consider both the word's form and its context.
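For a quick look at a data-driven tagger in practice, NLTK's default tagger (a trained averaged perceptron rather than an HMM, but likewise learned from an annotated corpus) can be used, assuming the 'punkt' and 'averaged_perceptron_tagger' data packages are downloaded:
import nltk

tokens = nltk.word_tokenize("The cat sleeps on the mat")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]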
27. What is rule-based POS tagging and what are the properties of the same? (10
Marks)
Answer:
Rule-based POS tagging relies on manually crafted rules based on linguistic knowledge to
assign POS tags to words. It uses patterns, lexical information, and syntactic rules to tag words in
sentences.
Properties:
28. Give examples to illustrate how the n-gram approach is utilized in word
prediction. (10 Marks)
Answer:
The n-gram approach is widely used in word prediction tasks, where the goal is to predict the
next word in a sequence based on the previous ones.
Example (Bi-gram model): Given the history "I am", a bigram model predicts the next word as
the w that maximizes P(w | "am"); if "am learning" is the most frequent bigram beginning with
"am" in the training corpus, the model predicts "learning".
29. Highlight transformation-based tagging and the working of the same. (10
Marks)
Answer:
Transformation-based tagging is a hybrid POS tagging approach that combines rule-based
methods with machine learning techniques.
1. Initial Tagging: The process starts with an initial tagging using a simple rule-based
approach or a stochastic model.
2. Transformation Rules: Then, a set of transformation rules are applied to correct the
initial tagging. These rules specify conditions under which certain tags can be changed
(e.g., if a noun follows a determiner, it should be tagged as a noun).
3. Iterative Refinement: These transformation rules are learned from the training data and
applied iteratively, improving the accuracy of the POS tags.
30. State the difference between structured data and unstructured data. (10
Marks)
Answer:
Structured Data:
1. Organized in tables or databases with predefined models (e.g., relational
databases).
2. Easily searchable and processed using SQL.
3. Example: Employee records, transaction logs, or sensor data.
Unstructured Data:
1. No predefined structure and cannot be stored in traditional relational databases.
2. Often text-heavy and requires specialized techniques (e.g., NLP) for analysis.
3. Example: Emails, social media posts, images, and videos.
Answer:
Semi-structured data has some form of organization but does not conform to a rigid structure
like structured data. It typically contains tags or markers to separate elements but does not follow
a strict schema.
Example:
XML or JSON files: Data is stored in a flexible format with key-value pairs or tags but
does not fit into a fixed table structure.
o XML example:
<person>
<name>John</name>
<age>30</age>
<city>New York</city>
</person>
Answer:
In text classification, a supervised machine learning algorithm uses labeled data (texts with
predefined categories) to learn patterns and relationships between the input features and output
labels.
1. Training Phase: The algorithm is trained using a labeled dataset, where each document
or text is associated with a category (e.g., spam or not spam).
2. Feature Extraction: The algorithm extracts features (e.g., word frequency, n-grams, or
term frequency) to represent the text data.
3. Prediction: Once trained, the model can classify new, unseen text into predefined
categories based on learned patterns.
Example: A Naive Bayes classifier for spam email detection, where emails labeled as
"spam" or "not spam" are used to train the model.
Answer:
Emotion analytics involves analyzing textual, vocal, or visual data to detect emotions. Some
common uses are:
34. Say you are an employee of a renowned food delivery company and your
superior has asked you to do a market survey to search for potential competition
and zero down to areas where your company needs to improve to be the top
company in the market. How will you approach this task and accomplish the
goal? (10 Marks)
Answer:
1. Identify Competitors:
o Research local and national competitors offering food delivery services. Focus on
market leaders and emerging competitors.
o Examine customer reviews, ratings, and media coverage to identify popular
competitors.
2. Analyze Competitor Services:
o Compare delivery times, pricing models, menu options, payment methods, app
features, and customer support channels.
o Identify unique selling propositions (USPs) that set competitors apart.
3. Customer Sentiment Analysis:
o Conduct surveys or analyze social media and review platforms to understand
customers' opinions and complaints about competitors.
o Focus on areas like food quality, delivery speed, customer service, and app
usability.
4. Spot Weaknesses and Opportunities:
o Highlight gaps in competitors’ offerings or services where your company can
improve (e.g., more menu variety, loyalty programs, faster delivery times).
o Research potential market opportunities in untapped areas, such as healthier food
options or a focus on sustainability.
5. Recommendations for Improvement:
o Based on research, propose strategies like enhancing delivery times, introducing
better customer support, and upgrading the mobile app to make it more user-
friendly.
o Suggest marketing strategies to increase brand awareness and attract more
customers.
Answer:
A classic search model refers to the way search algorithms work, such as the Breadth-First
Search (BFS) or Depth-First Search (DFS). These models explore search spaces or trees to
find a goal node or solution.
Breadth-First Search (BFS): Explores all neighbors of a node before moving to the next
level.
Depth-First Search (DFS): Explores as far down a branch as possible before
backtracking.
        A
       / \
      B   C
     / \   \
    D   E   F
1. Disambiguation: It helps in distinguishing between words that can serve different parts
of speech (e.g., "bank" as a financial institution or the side of a river).
2. Syntax Structure: It aids in understanding the sentence structure, helping the system
determine subject-object relationships.
3. Text Analysis: POS tagging is used for information extraction, sentiment analysis, and
machine translation by classifying words as nouns, verbs, adjectives, etc.
4. Improves Accuracy: POS tagging increases the accuracy of downstream NLP tasks,
such as parsing and named entity recognition (NER).
Answer:
Vocabulary in NLP refers to the collection of unique words or tokens that are used in a given
text or corpus. The vocabulary is typically built from the set of words found in a dataset and is
crucial for tasks like text classification, word embeddings, and language modeling.
Answer:
Information Extraction (IE) refers to the process of automatically extracting structured
information from unstructured text. It typically involves identifying entities (e.g., names, dates,
locations), relationships, and events within the text to convert it into a more structured form, like
databases or knowledge graphs.
Answer:
Morphological Parsing is the process of analyzing the structure of words to determine their root
forms, prefixes, suffixes, and other morphological features.
Steps of Morphological Parsing:
Answer:
The Bag of Words (BOW) model is a simple representation of text where each document is
represented as an unordered set of words, disregarding grammar and word order but keeping
multiplicity.
41. State the difference between formal language and natural language. (5
Marks)
Answer:
Formal Language:
o Rigorous syntax and semantics (e.g., programming languages, mathematical
expressions).
o Has well-defined rules that do not change.
o Can be processed by computers without ambiguity.
Natural Language:
o Used by humans for everyday communication (e.g., English, Spanish).
o Contains ambiguities, irregularities, and flexible grammar rules.
o Difficult for computers to process due to complexity and variations in meaning.
42. Assume there are 4 topics namely, Cricket, Movies, Politics and Geography
and 4 documents D1, D2, D3, and D4, each containing equal number of words.
These words are taken from a pool of 4 distinct words namely, {Shah Rukh,
Wicket, Mountains, Parliament} and there can be repetitions of these 4 words in
each document. Assume you want to recreate document D3. Explain the process
you would follow to achieve this and reason as how recreating document D3 can
help us understand the topic of D3 (10 Marks)
Answer:
1. Term Frequency (TF):
o First, calculate the term frequency of each word in all documents, including
document D3.
2. Topic Modeling:
o Use techniques like Latent Dirichlet Allocation (LDA) or TF-IDF to associate
words with topics.
o Based on the frequencies of the words, reconstruct the topic for D3. For example,
if "Mountains" and "Parliament" appear frequently in D3, it likely relates to
Geography and Politics.
3. Recreate Document D3:
o After identifying the most frequent terms associated with the topics, recreate D3
by choosing words based on the identified topics.
o The process of recreating D3 helps in understanding the content and topic of
the document, as the word distribution reflects the topic it represents.
Answer:
Text Parsing refers to the process of analyzing a sequence of words or sentences to extract
meaningful structures or components. This typically involves breaking down text into parts like
sentences, phrases, or words, and analyzing their grammatical or syntactic structure. Parsing
helps in identifying parts of speech, sentence components (subjects, predicates), and
relationships between them.
Answer:
Sentiment Analysis in market research refers to the process of analyzing customer opinions,
feedback, or reviews to determine their emotional tone (positive, negative, or neutral). By
understanding sentiment, companies can gain insights into customer perceptions of products,
services, or brands. This helps in improving product offerings, targeting marketing campaigns,
and enhancing customer experiences.
Answer:
Hidden Markov Models (HMMs) are statistical models used to represent systems that follow a
Markov process with unobservable (hidden) states. HMMs consist of:
1. States: The system’s internal state (hidden and not directly observable).
2. Observations: The observed data or output influenced by the hidden states.
3. Transitions: Probabilities of transitioning between states.
4. Emission Probabilities: Likelihood of an observation being generated from a state.
HMMs are widely used in speech recognition, part-of-speech tagging, and time-series analysis.
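To make the four components concrete, here is a small pure-Python sketch of an HMM with Viterbi decoding for POS tagging; the states, words, and probabilities are made-up toy values, not trained parameters:

# Toy HMM: hidden states are POS tags, observations are words.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}                      # initial state probabilities
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},            # transition probabilities
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1, "run": 0.4}, # emission probabilities
          "VERB": {"dogs": 0.1, "bark": 0.5, "run": 0.4}}

def viterbi(obs):
    # Return the most likely hidden state sequence for the observed words.
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), [s]) for s in states}]
    for word in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(word, 1e-6),
                 V[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

print(viterbi(["dogs", "bark"]))   # expected: ['NOUN', 'VERB']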
46. State and explain in details the main advantage of Latent Dirichlet Allocation
methodology over Probabilistic Latent Semantic Analysis for building a
Recommender system? (10 Marks)
Answer:
Latent Dirichlet Allocation (LDA) vs. Probabilistic Latent Semantic Analysis (PLSA):
o LDA is a fully generative model: it places Dirichlet priors on the document-topic and
topic-word distributions, so topic mixtures for previously unseen documents (or new
users/items in a recommender system) can be inferred in a principled way.
o PLSA learns a separate topic mixture for every training document, so its number of
parameters grows linearly with the corpus and it has no natural mechanism for
handling new documents, which makes it prone to overfitting.
o Main advantage for recommender systems: new users and items arrive continually, so
LDA's ability to generalize to unseen data and its better control of overfitting make it
the more suitable choice.
47. Explain in details how the Matrix Factorization technique used for building
Recommender Systems effectively boils down to solving a Regression problem. (5
Marks)
Answer:
Matrix Factorization in recommender systems involves decomposing a large, sparse matrix
(user-item ratings matrix) into two smaller matrices: one representing user features and the
other representing item features.
1. Goal: The goal is to approximate the original matrix by multiplying the user and item
matrices. The factorization aims to minimize the difference between the predicted ratings
and the actual ratings (i.e., minimize the error).
2. Regression Problem: The matrix factorization can be viewed as a regression problem,
where the predicted rating for a user-item pair is a function of the dot product of the
user's and item's feature vectors. The error in prediction can be minimized using
techniques like Stochastic Gradient Descent (SGD), which adjusts the feature vectors
(user and item) based on the error, similar to how regression models adjust coefficients.
3. Optimization: The optimization process, similar to regression, tries to minimize the sum
of squared errors between the actual ratings and the predicted ratings.
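A minimal sketch of this idea, assuming NumPy; the ratings, matrix sizes, learning rate, and regularization strength are illustrative placeholders, and each SGD step is exactly a squared-error regression update on the dot product of user and item vectors:

# Matrix factorization via SGD: predicted rating = dot(user_vec, item_vec).
import numpy as np

ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0}  # (user, item) -> rating
n_users, n_items, k = 3, 2, 2                                   # k latent factors

rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, k))   # user feature vectors
Q = 0.1 * rng.standard_normal((n_items, k))   # item feature vectors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for (u, i), r in ratings.items():
        err = r - P[u] @ Q[i]                 # prediction error, as in regression
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(round(P[0] @ Q[0], 2))                  # typically close to the true rating 5.0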
48. What are the two main approaches used in computational linguistics for Part
of Speech (POS) tagging? (5 Marks)
Answer:
The two main approaches for Part of Speech (POS) tagging are:
1. Rule-based tagging: uses hand-written linguistic rules and a lexicon to assign tags, e.g.,
"a word following a determiner is likely a noun".
2. Stochastic (statistical) tagging: uses probabilities learned from annotated corpora, e.g.,
Hidden Markov Model or maximum-entropy taggers that choose the most likely tag
sequence for a sentence.
Answer:
WordNet is a lexical database of the English language. It organizes words into sets of synonyms
called synsets, each representing a distinct concept. Words in WordNet are related to each other
through various semantic relations, such as hypernyms (generalizations), hyponyms (specific
instances), meronyms (part-whole relationships), and antonyms.
Answer:
WordNet organizes words into a network of interrelated synsets (sets of synonyms) and defines
several semantic relationships between them:
o Synonymy: words grouped into the same synset (e.g., "car" and "automobile").
o Hypernymy/Hyponymy: generalization and specialization (e.g., "vehicle" – "car").
o Meronymy/Holonymy: part–whole relations (e.g., "wheel" is a meronym of "car").
o Antonymy: opposite meanings (e.g., "hot" and "cold").
These relationships form a hierarchical structure in WordNet that helps in semantic analysis.
Answer:
Morphological operations in NLP involve analyzing and processing words into their
components, such as stems, roots, prefixes, and suffixes. These operations are applied to
understand word meanings, word variations, and structure.
These operations are essential for standardizing words and improving text analysis tasks like
classification and information retrieval.
Answer:
1. Hypernyms:
o A hypernym is a word that denotes a broader category or class. In WordNet,
hypernyms represent generalizations of specific concepts.
o Example: "Vehicle" is a hypernym of "Car".
2. Hyponyms:
o A hyponym is a more specific word that falls under a broader category. In
WordNet, hyponyms are more specific instances of a general concept.
o Example: "Apple" is a hyponym of "Fruit".
3. Heteronyms:
o Heteronyms are words that have the same spelling but different meanings and
pronunciations depending on the context.
o Example: "Lead" (to guide) vs. "Lead" (the metal).
These relationships help in word sense disambiguation and improving semantic understanding in
NLP.
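These relations can be explored programmatically; a small sketch assuming NLTK and its WordNet data are installed (nltk.download('wordnet')):

# Exploring WordNet relations with NLTK.
from nltk.corpus import wordnet as wn

car = wn.synsets("car")[0]                 # first synset for "car"
print(car.definition())                    # gloss of the concept
print([l.name() for l in car.lemmas()])    # synonyms within the synset
print(car.hypernyms())                     # broader concepts (hypernyms)
print(car.hyponyms()[:3])                  # more specific concepts (hyponyms)
print(wn.synsets("happy")[0].lemmas()[0].antonyms())  # antonym relation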
53. Discuss the advantages and disadvantages of CBOW and Skip-gram models.
(10 Marks)
Answer:
Continuous Bag of Words (CBOW) and Skip-gram are two models used for word embeddings
in Word2Vec:
1. CBOW Model:
o Advantages:
Faster training: CBOW averages the context words and makes a single
prediction per context window, so it trains several times faster than Skip-gram.
Good for frequent words: it tends to produce slightly better representations for
frequently occurring words.
o Disadvantages:
Less effective for rare words: it struggles to capture the meanings of rare
words, because their contexts are averaged together with those of more
common words.
2. Skip-gram Model:
o Advantages:
Good for larger datasets: Skip-gram works well with large corpora and
can capture richer relationships between words.
Better for rare words: It excels at learning embeddings for rare words.
o Disadvantages:
Slower training: It is computationally expensive and slower compared to
CBOW.
54. Explain the process of text classification, focusing on Naïve Bayes' Text
Classification algorithm. (10 Marks)
Answer:
Text classification is the process of categorizing text into predefined labels or classes. The
Naïve Bayes algorithm is a simple and effective approach for text classification, based on Bayes'
theorem.
55. How do you use naïve bayes model for collaborative filtering? (5 Marks)
Answer:
Naïve Bayes can be used for Collaborative Filtering by treating users’ interactions with items
as categorical data. The idea is to predict a user’s preferences based on the probabilistic
relationships between items and user behaviors.
Answer:
Lexical Analysis and Semantic Analysis are two distinct stages in Natural Language Processing
(NLP), each serving different purposes:
1. Lexical Analysis:
o Definition: Lexical analysis refers to the process of breaking down the input text
into tokens, which are the smallest units of meaningful data (words, punctuation,
etc.).
o Purpose: The primary aim is to identify the structure and parts of speech (nouns,
verbs, etc.), normalize them (e.g., stemming or lemmatization), and convert them
into a machine-readable format.
o Example: Converting the sentence "Cats are playing" into tokens: ["Cats", "are",
"playing"].
2. Semantic Analysis:
o Definition: Semantic analysis deals with understanding the meaning of words,
phrases, and sentences in context. It involves deriving the meaning or "sense" of
words and how they relate to each other.
o Purpose: The goal is to extract the deeper meaning or context behind words,
resolve ambiguities, and link words to concepts.
o Example: The word “bat” could refer to a flying mammal or a piece of sports
equipment. Semantic analysis disambiguates this based on context (e.g., "He
swung the bat" vs. "A bat flew across the sky").
Key Difference: Lexical analysis deals with the form and structure of the text (tokens, parts of
speech), whereas semantic analysis deals with the meaning of the text in context.
57. Define what N-grams are in the context of Natural Language Processing
(NLP). (5 Marks)
Answer:
N-grams in NLP are contiguous sequences of n items (words, characters, etc.) from a given text
or speech. They are used for various tasks like language modeling, text prediction, and feature
extraction. The value of n determines the length of the sequence: n = 1 gives unigrams, n = 2
bigrams, n = 3 trigrams, and so on.
N-grams are used to model the probability of the occurrence of words or sequences, often
applied in text classification and predictive text tasks.
58. What are word embeddings in the context of Natural Language Processing
(NLP)? (10 Marks)
Answer:
Word Embeddings are a type of word representation that allows words to be represented as
vectors in a continuous vector space. Word embeddings capture the semantic meaning of words
by placing similar words closer together in the vector space.
1. Definition: Word embeddings map words to dense vectors (compared to sparse one-hot
encoding), where words with similar meanings have similar vector representations.
2. How Word Embeddings Work:
o They are learned from large text corpora using models like Word2Vec, GloVe, or
FastText.
o These models learn embeddings by capturing contextual information about words
based on their co-occurrence patterns in large datasets.
3. Example:
o In Word2Vec, there are two main models:
Skip-gram: Given a word, predict the surrounding words.
CBOW (Continuous Bag of Words): Given surrounding words, predict
the target word.
4. Applications of Word Embeddings:
o Semantic Similarity: Words with similar meanings will have similar embeddings
(e.g., "king" and "queen").
o Word Analogies: Embeddings allow for operations like "King - Man + Woman =
Queen".
o Text Classification: Word embeddings can be used as features in machine
learning models for text classification, sentiment analysis, etc.
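A minimal training sketch, assuming the gensim library (version 4.x parameter names); the corpus is a toy placeholder, and real embeddings require far larger text collections:

# Training toy Word2Vec embeddings with gensim (GloVe/fastText follow the same idea).
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["the", "dog", "chases", "the", "cat"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-gram
print(model.wv["king"][:5])                    # first 5 dimensions of the learned vector
print(model.wv.similarity("king", "queen"))    # cosine similarity between two words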
59. What is "vector semantics" in NLP, and why is it useful for understanding
word meanings? (10 Marks)
Answer:
Vector Semantics in NLP refers to the representation of words as vectors in a high-dimensional
space, where the geometric relationships between vectors capture their semantic relationships.
1. Definition: Words are represented as vectors, and their meanings are interpreted based on
their position in this vector space. Similar words have similar vector representations (i.e.,
they are closer in the vector space).
2. How It Works:
o Words that share similar contexts or occur in similar contexts (in large corpora)
are placed closer together in the vector space.
o Word2Vec and GloVe are popular techniques used to generate vector
embeddings for words.
3. Applications:
o Word Similarity: Words that are semantically similar (e.g., "dog" and "puppy")
will have similar vector representations.
o Word Analogies: Vector semantics can be used for tasks like solving analogies
(e.g., "king" - "man" + "woman" = "queen").
o Text Classification: Word vectors are used to capture word meaning for
classification tasks, helping machines understand context and intent.
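The geometric core of vector semantics is cosine similarity between word vectors; a short sketch assuming NumPy, with made-up 3-dimensional vectors rather than real embeddings:

# Cosine similarity between (toy) word vectors.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog   = np.array([0.9, 0.8, 0.1])
puppy = np.array([0.85, 0.75, 0.2])
car   = np.array([0.1, 0.2, 0.95])

print(cosine(dog, puppy))   # high similarity: semantically close words
print(cosine(dog, car))     # lower similarity: semantically distant words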
Answer:
A significant limitation of TF-IDF (Term Frequency-Inverse Document Frequency) is that it
does not capture semantic meaning or context between words. TF-IDF is purely based on the
frequency of terms in documents and does not account for the relationships or meanings of
words beyond their occurrence.
1. Example: The words "dog" and "hound" may have similar meanings but will be treated
as completely independent terms in TF-IDF, ignoring their semantic relationship.
2. Impact: This limitation can affect tasks like information retrieval and text classification,
where understanding word meaning and context is crucial.
Answer:
Regular expressions (regex) are used extensively in Natural Language Processing (NLP) for
text processing tasks. They provide a powerful mechanism to identify, match, and extract
specific patterns from text.
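A few typical patterns, sketched with Python's built-in re module (the regular expressions here are deliberately simple and the text is a toy example):

# Common regex patterns for NLP text processing.
import re

text = "Contact us at support@example.com on 25/12/2024. Tickets cost $25.99!"

emails = re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", text)   # e-mail addresses
dates  = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)          # dd/mm/yyyy dates
prices = re.findall(r"\$\d+(?:\.\d{2})?", text)              # dollar amounts
tokens = re.findall(r"[A-Za-z]+", text)                      # crude word tokenizer

print(emails, dates, prices, tokens[:4])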
62. Explain the concept of N-grams in NLP and with examples discuss their
importance in language modelling to demonstrate how N-grams capture
sequential patterns in text data. (10 Marks)
Answer:
N-grams are contiguous sequences of n words in a given text, widely used in language
modeling for capturing sequential relationships in text.
1. Definition:
o An N-gram is a sequence of n words that appear together in a text corpus.
Unigram: Single word (e.g., "I", "am", "hungry").
Bigram: Two consecutive words (e.g., "I am", "am hungry").
Trigram: Three consecutive words (e.g., "I am hungry").
2. Importance in Language Modeling:
o N-grams help capture local dependencies between words, such as common word
pairs or triplets.
o They are used in statistical language models (e.g., n-gram models) to predict
the next word in a sequence based on the previous words.
3. Example:
o In the sentence "I am happy today", the bigrams would be: "I am", "am happy",
"happy today".
o These help the model learn context and predict the next word.
4. Sequential Pattern Capture:
o N-grams help in capturing the sequential relationships in language. For example,
"I am" is a common phrase, whereas "am I" might not be as frequent,
demonstrating how N-grams capture word order.
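A small pure-Python helper illustrating the extraction of unigrams, bigrams, and trigrams from the example sentence above:

# Extract n-grams from a list of tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am happy today".split()
print(ngrams(tokens, 1))   # unigrams: ('I',), ('am',), ('happy',), ('today',)
print(ngrams(tokens, 2))   # bigrams:  ('I', 'am'), ('am', 'happy'), ('happy', 'today')
print(ngrams(tokens, 3))   # trigrams: ('I', 'am', 'happy'), ('am', 'happy', 'today')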
63. Explain the significance of n-grams in the design of any text classification
system using examples. (5 Marks)
Answer:
N-grams play a significant role in text classification systems by providing a way to represent
text data that accounts for word sequences and context, which is essential for tasks like sentiment
analysis, spam detection, etc.
1. N-grams in Text Classification:
o N-grams help capture the local context of words, which improves the classifier’s
ability to understand sequences and relationships between terms in a document.
o They are often used as features for classification algorithms like Naive Bayes or
SVM.
2. Example:
o In sentiment analysis, bigrams like "good movie" or "not good" could provide
important clues about sentiment.
"good movie" might be classified as positive, while "not good" would
indicate negative sentiment.
3. Why N-grams Are Useful:
o They preserve the word order and help capture contextual dependencies that
unigrams (single words) cannot.
o N-grams allow the classifier to learn which word combinations are relevant for
distinguishing categories.
Answer:
The unigram model treats each word as independent, ignoring word order and context, which
can lead to several disadvantages in information extraction tasks.
1. Disadvantages:
o Lack of Context: Unigrams cannot capture the relationship between words. For
example, "New York" as a location would be treated as two separate words
("New" and "York"), missing the context.
o Ambiguity: Unigrams do not resolve word ambiguities. For example, "bank"
could mean a financial institution or the side of a river, but unigrams cannot
distinguish between the two without context.
o Inaccurate Feature Representation: Information extraction often relies on
understanding the order and proximity of words, which unigrams fail to capture.
2. Example:
o In the sentence "I went to the bank", unigram models might fail to understand the
meaning because "bank" could refer to a financial institution or the side of a river.
Answer:
Homographs are words that are spelled the same but have different meanings and sometimes
different pronunciations, depending on the context.
1. Example:
o Lead (to guide) vs. Lead (a heavy metal).
o In "She will lead the team," "lead" refers to guiding. In "The lead pipe," "lead"
refers to the metal.
66. How is the Levenshtein distance algorithm used to find similar words to a
given word? (10 Marks)
Answer:
The Levenshtein distance algorithm calculates the minimum number of single-character
edits (insertions, deletions, or substitutions) required to convert one string into another. It is
useful for finding similar words by measuring how closely a word matches another.
1. Steps:
o The algorithm computes a matrix where each cell represents the minimum
number of edits needed to transform one word into another.
o It then uses dynamic programming to fill the matrix based on the choices of
insertion, deletion, or substitution.
2. Use in Finding Similar Words:
o Words with lower Levenshtein distances from a given word are considered more
similar.
o For example, given the word "kitten," similar words like "sitting" or "mitten"
would have a small Levenshtein distance.
3. Example:
o Transform "kitten" to "sitting":
Insert "s" at the beginning: "skitten" (1 edit)
Substitute "k" with "s": "sitten" (1 edit)
Substitute "e" with "i": "sittin" (1 edit)
Insert "g" at the end: "sitting" (1 edit)
Total distance = 3 edits.
Answer:
Heteronyms are words that are spelled the same but have different meanings and are
pronounced differently depending on the context.
1. Example:
o Lead (to guide) vs. Lead (a type of metal).
"She will lead the team" (pronounced /liːd/).
"The pipe is made of lead" (pronounced /lɛd/).
68. Explain the concept of polysemy and provide an example. (2 Marks)
Answer:
Polysemy refers to a single word that has multiple meanings, which are related by extension or
metaphor.
1. Example:
o Bank can mean:
A financial institution.
The side of a river (e.g., "river bank").
A place to store or accumulate (e.g., "blood bank").
69. Define synonyms and antonyms and provide examples of each. (2 Marks)
Answer:
Synonyms: Words that have the same or nearly the same meaning.
o Example: "Happy" and "Joyful".
Antonyms: Words that have opposite meanings.
o Example: "Happy" and "Sad".
1. Bigram Count:
o Count of the bigram (am, Sam): 3 (since "am Sam" occurs 3 times in the corpus).
o Total number of bigrams: 11 (since the corpus has 11 bigrams in total,
considering the <s> and </s> tokens).
2. Add-One Smoothing:
o To apply add-one smoothing, we add 1 to each count, including the total bigram
count.
3. Formula:
o P(Sam | am) = (count(am, Sam) + 1) / (count(am) + V)
where V is the vocabulary size (the number of unique word types).
4. Calculation:
o Count of (am, Sam) = 3.
o Count of "am" = 4 (appears 4 times in the corpus).
o Vocabulary size (V) = 6 (unique words: "I", "am", "Sam", "do", "not", "like").
o P(Sam | am) = (3 + 1) / (4 + 6) = 4/10 = 0.4.
This is incorrect. Rule-based taggers are typically deterministic, as they use predefined
linguistic rules to tag words.
This is incorrect. Stochastic taggers, such as Hidden Markov Models (HMMs), depend
on probabilistic models that are trained on specific languages and thus are not language-
independent.
This is correct. Brill’s tagger is a rule-based tagger that applies transformation rules to
the output of an initial tagger to improve accuracy.
Module 3
1. In the context of natural language processing, how can we leverage the concepts of TF-
IDF, training set, validation set, test set, and stop words to improve the accuracy and
effectiveness of machine learning models and algorithms? Additionally, what are some
potential challenges and considerations when working with these concepts, and how can we
address them? (5 Marks)
TF-IDF (Term Frequency-Inverse Document Frequency): It is a statistical measure
used to evaluate the importance of a word in a document relative to a corpus. By using
TF-IDF, we can reduce the weight of commonly occurring words (like stop words) and
increase the weight of rare but meaningful words. This improves the feature
representation for models, enhancing the accuracy.
Training Set: This is used to train the machine learning model. A good training set
ensures that the model learns the underlying patterns.
Validation Set: This set helps in tuning hyperparameters, avoiding overfitting, and
evaluating the model’s performance during training. It's crucial for model selection.
Test Set: This set is used after training and validation to evaluate the final model
performance.
Stop Words: Common words like "the", "and", "is", etc., can often be removed to avoid
unnecessary noise and reduce dimensionality in NLP tasks.
Overfitting: Model may perform well on the training set but fail on unseen data. This can
be avoided using cross-validation and regularization.
Imbalanced Data: If the training set contains more examples from one class than the
other, the model can be biased. This can be addressed using resampling techniques or
class weights.
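A minimal sketch tying these pieces together, assuming scikit-learn; the texts, labels, and split proportions are illustrative placeholders:

# TF-IDF features with stop-word removal, plus train/validation/test splits.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts  = ["great product, loved it", "terrible, waste of money",
          "really good value", "awful quality, do not buy"] * 10
labels = [1, 0, 1, 0] * 10

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(texts, labels, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

vec = TfidfVectorizer(stop_words="english")   # drops common English stop words
X_train_tfidf = vec.fit_transform(X_train)    # fit the vocabulary on the training set only
X_val_tfidf   = vec.transform(X_val)          # reuse the same vocabulary for validation
print(X_train_tfidf.shape)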
Text Classification: It is the task of categorizing text into predefined labels or categories.
For example, classifying emails as spam or non-spam.
Named Entity Recognition (NER): Identifying entities like names, dates, and
organizations.
Part-of-Speech Tagging (POS): Assigning parts of speech to each word.
Chunking: Grouping words into meaningful chunks like noun phrases.
Relation Extraction: Identifying relationships between entities in a text.
Text Segmentation: Dividing text into smaller, manageable pieces.
Ad-hoc Retrieval Problems: These are search problems where the query is not pre-
defined, and the goal is to retrieve relevant documents or information based on an
individual query. It requires efficient retrieval mechanisms to match documents to a
query.
Hand-Coded Rules: These are manually created rules that specify patterns or conditions
for classifying text. They typically involve looking for certain keywords, phrases, or
structures in the text and assigning it to a specific class based on these predefined
patterns.
Example: In spam detection, a rule might look for words like "free", "buy now", or "win"
to classify an email as spam. Hand-coded rules can be effective when the data is simple
or well-defined.
Challenges: Hand-coded rules are labor-intensive to create and maintain, and they may
not generalize well to unseen data. They are also limited in handling complex language
patterns.
9. What are the machine learning approaches used for text classification? (5 Marks)
Supervised Learning: This is the most common approach, where a labeled training
dataset is used to train a model. Examples include:
o Naive Bayes Classifier: A probabilistic classifier based on Bayes' theorem, often
used for text classification.
o Support Vector Machines (SVM): SVMs use hyperplanes to classify texts into
different categories.
o Decision Trees: These models use tree-like structures to make decisions based on
feature values.
o Deep Learning: Neural networks, especially recurrent neural networks (RNNs)
or convolutional neural networks (CNNs), have been highly effective for text
classification tasks.
10. What is/are the drawback/s of the Naive Bayes classifier? (5 Marks)
Assumption of Independence: Naive Bayes assumes that the features (words) are
independent of each other, which often does not hold in real-world text data.
Limited to Simple Models: It does not capture complex relationships between words or
higher-order dependencies in the text.
Sensitivity to Imbalanced Data: Naive Bayes can be biased towards the majority class
in imbalanced datasets, leading to poor performance on the minority class.
11. Explain the result of Multinomial Naïve Bayes Independence Assumptions. (5 Marks)
Multinomial Naïve Bayes Assumptions: This variation of Naive Bayes assumes that the
features (words) are generated from a multinomial distribution and that each word in a
document is generated independently.
Result: The model calculates the probability of a document belonging to a class based on
the frequency of words in the document. This assumption simplifies the computation, but
it can lead to poor performance when word dependencies exist in the text.
12. Write two NLP applications where we can use the bag-of-words technique. (5 Marks)
Spam Detection: Emails are represented by their word counts, and a classifier (e.g., Naive
Bayes) uses these counts to label messages as spam or not spam.
Sentiment Analysis: Reviews or social media posts are represented as bags of words, and the
presence and frequency of opinion words are used to classify the sentiment as positive,
negative, or neutral.
13. What is the problem with the maximum likelihood for the Multinomial Naive Bayes
classifier? How to resolve? (10 Marks)
Problem with Maximum Likelihood: In the Multinomial Naive Bayes classifier, the
maximum likelihood estimation (MLE) of probabilities can assign zero probability to
words that do not appear in the training set. This leads to issues when such words appear
in the test set, as it results in an overall zero probability for the document.
Solution: This can be addressed by applying Laplace Smoothing (or add-one
smoothing), which adds a small constant (usually 1) to all word counts, ensuring that no
word has a zero probability. This adjustment ensures that words not seen in the training
data still have a non-zero probability.
14. Explain the confusion matrix that can be generated in terms of a spam detector. (5
Marks)
For a spam detector, the confusion matrix has four cells:
o True Positive (TP): a spam email correctly classified as spam.
o False Positive (FP): a legitimate email incorrectly classified as spam.
o False Negative (FN): a spam email incorrectly classified as legitimate.
o True Negative (TN): a legitimate email correctly classified as legitimate.
This matrix helps in calculating performance metrics like accuracy, precision, recall, and F1-
score.
15. How k-fold cross validation is used for evaluating a text classifier? (5 Marks)
16. Explain practical issues of a text classifier and how to solve them. (5 Marks)
Practical Issues:
o Data Imbalance: If one class is much more frequent than the other, the model
may bias towards the majority class. This can be addressed using techniques like
SMOTE (Synthetic Minority Over-sampling Technique) or class weights.
o Noise and Irrelevant Features: Unimportant or noisy features can degrade
performance. This can be mitigated through feature selection and removing stop
words.
o Overfitting: If a model performs well on the training data but poorly on the test
set, it may be overfitting. This can be reduced by using regularization techniques
or increasing training data.
Rule-Based Methods: Involves manually coded rules to classify text. These methods are
easy to interpret but can be labor-intensive.
Statistical Methods: These include machine learning algorithms like Naive Bayes, SVM,
and decision trees.
Deep Learning Methods: These use neural networks like CNNs, RNNs, or transformers
for text classification tasks. They are effective for handling large datasets and complex
language features.
18. Give any 3 different evaluation metrics available for text classification. Explain with
examples. (10 Marks)
19. What are the evaluation measures to be undertaken to judge the performance of a
model? (2 Marks)
Evaluation Measures:
o Accuracy: Correct predictions over total predictions.
o Precision, Recall, and F1-Score: Evaluates the classifier’s performance in
identifying each class, especially in imbalanced datasets.
20. With a schematic diagram explain Word2vec type of word embedding. (5 Marks)
21. Explain the working of Doc2Vec type of word embedding with labelled diagram. (5
Marks)
22. With example explain the following word to sequence analysis:- (5 Marks)
a) Vector Semantic:
Opinion Mining: Opinion mining, also known as sentiment analysis, is the process of
determining the sentiment expressed in a piece of text (positive, negative, or neutral). It
typically involves analyzing social media posts, product reviews, or customer feedback to
understand people's attitudes toward a subject.
Example: A review that states, "This phone is amazing!" would be classified as positive
sentiment, while "This phone is terrible!" would be negative sentiment.
24. What are the aspects taken into account while collecting feedback of brands for
sentiment analysis? (5 Marks)
Aspects Considered:
o Text Source: Feedback sources such as social media, product reviews, surveys, or
customer service interactions are crucial for analyzing sentiment.
o Tone: The tone of the feedback (positive, negative, neutral) helps in
understanding the overall sentiment about a brand.
o Keywords/Phrases: Keywords related to product features, customer service, or
brand perception are important in sentiment analysis.
o Context: The context in which feedback is provided (e.g., after a product launch
or service experience) can influence sentiment interpretation.
o Entity Recognition: Identifying the specific product or service mentioned in the
feedback, such as brand name, product types, or features.
Intent Analysis: Intent analysis is the process of determining the purpose or goal behind
a piece of text. It helps in understanding why a user is interacting with a system or a
brand.
Example: In customer support, intent analysis can determine whether the customer is
seeking help, making a purchase, or giving feedback.
Emotion Analysis: Emotion analysis involves detecting the emotional tone behind words
in a text to understand the feelings expressed by the writer, such as happiness, sadness,
anger, or surprise.
Example: A review like "I am so happy with this product!" would be identified as
expressing joy.
27. How does emotional analytics work? (5 Marks)
Explanation: The term "naive" in Naïve Bayes classifier refers to the assumption of
feature independence, which is often unrealistic in real-world data. However, despite this
assumption, the classifier performs well in many practical applications, such as spam
detection and sentiment analysis, because it simplifies the problem significantly.
Reason for success: Even with the naive assumption, the algorithm often yields
surprisingly good results by using probabilistic reasoning, especially when combined
with techniques like Laplace smoothing.
29. With detailed steps explain the working of Multinomial Naive Bayes learning. (5
Marks)
1. Data Preprocessing:
o Convert all text data into numerical format using techniques like TF-IDF or bag-
of-words.
2. Calculate Prior Probabilities:
o Calculate the prior probability of each class (e.g., spam or not spam) based on the
frequency of each class in the training data.
3. Calculate Conditional Probabilities:
o For each class, calculate the likelihood of each word appearing in the class using
the training data.
4. Bayes’ Theorem Application:
o Use Bayes' Theorem to combine prior and conditional probabilities to make
predictions for new data.
5. Classification:
o Choose the class with the highest posterior probability for classification.
30. What is micro averaging and macro averaging? Explain with an example. (10 Marks)
Micro Averaging:
o In micro-averaging, the individual class predictions are aggregated first (i.e., the
true positives, false positives, and false negatives across all classes), and then the
metrics (precision, recall, F1 score) are computed based on the aggregated values.
Macro Averaging:
o In macro-averaging, the metrics for each class are calculated independently, and
then the average of the metrics for each class is taken.
Example:
o If we have two classes, A and B, and for each class, we calculate precision and
recall:
Micro Average: We sum up all true positives, false positives, and false
negatives across both classes, and then compute precision and recall.
Macro Average: We calculate precision and recall for each class
separately, and then compute the average of the precision and recall
values.
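A minimal sketch of the two averaging modes, assuming scikit-learn; the labels below are toy values:

# Micro vs. macro averaging of precision for two classes A and B.
from sklearn.metrics import precision_score

y_true = ["A", "A", "A", "B", "B", "B", "B", "B"]
y_pred = ["A", "A", "B", "B", "B", "B", "A", "B"]

print(precision_score(y_true, y_pred, average="micro"))  # aggregate TP/FP first: 6/8 = 0.75
print(precision_score(y_true, y_pred, average="macro"))  # mean of per-class precisions: (2/3 + 4/5) / 2 ~= 0.73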
31. State 3 opinion mining techniques with proper explanation. (10 Marks)
1. Lexicon-based Approach:
o Uses predefined lists of positive and negative words to determine sentiment. Each
word is assigned a sentiment score, and the overall sentiment of a text is
calculated based on the words present in the text.
2. Machine Learning-based Approach:
o Utilizes classification algorithms like Naive Bayes, SVM, or deep learning
models to classify text into positive, negative, or neutral sentiments. The model is
trained on labeled datasets to learn sentiment patterns.
3. Hybrid Approach:
o Combines both lexicon-based and machine learning-based methods. It uses
lexicon for sentiment scoring and machine learning for classification, improving
accuracy by leveraging both techniques.
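As a toy illustration of the lexicon-based approach (technique 1 above), a tiny pure-Python scorer; real systems use much larger sentiment lexicons with graded word scores:

# Minimal lexicon-based sentiment scorer.
POSITIVE = {"good", "great", "amazing", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "awful", "poor", "disappointing"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("This phone is amazing and the battery is great"))  # positive
print(sentiment("Terrible screen and awful support"))               # negative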
32. What issue crops up for Information Retrieval based on keyword search in case of a
huge size document? (5 Marks)
Issue:
o As the size of the document increases, the retrieval process becomes slower and
more resource-intensive.
o Keyword search becomes inefficient due to the sheer volume of data, leading to
longer response times and increased computational costs.
o Handling complex queries and providing relevant results in large documents can
be challenging, especially when the terms are ambiguous or have multiple
meanings.
Solution:
o Indexing: Using efficient indexing techniques like inverted indexing to quickly
locate keywords and their occurrences in the document.
o Document Partitioning: Breaking down large documents into smaller,
manageable chunks to improve retrieval performance.
o Vector-based Search: Employing vector space models, such as TF-IDF, for
better ranking and relevance in large documents.
33. What are the initial stages of text processing? (10 Marks)
1. Text Collection:
o Gather raw text data from various sources like websites, books, articles, or social
media platforms.
2. Text Normalization:
o Standardizing the text by converting it to lowercase, removing punctuation,
special characters, and unnecessary white spaces.
3. Tokenization:
o Breaking the text into smaller units such as words, sentences, or subword tokens.
4. Stop Word Removal:
o Removing common words such as "the", "is", "in", which do not carry significant
meaning in the context of the analysis.
5. Stemming and Lemmatization:
o Reducing words to their base or root form to consolidate similar terms. For
example, "running" becomes "run".
6. POS Tagging:
o Assigning part-of-speech tags to words (e.g., noun, verb, adjective) to understand
their role in the sentence.
7. Named Entity Recognition (NER):
o Identifying and classifying named entities (e.g., persons, organizations, locations)
in the text.
8. Vectorization:
o Converting the processed text into a numerical form, such as using bag-of-words
or TF-IDF techniques, for further analysis or modeling.
35. What are the different ways to use Bag-of-words representation for text classification?
(10 Marks)
1. Simple Count Vectorization:
o Count the occurrences of each word in the text and use these counts as feature
vectors for classification.
2. TF-IDF (Term Frequency-Inverse Document Frequency):
o Calculate the term frequency (TF) and multiply it by the inverse document
frequency (IDF) to give weight to less common, more informative words.
3. N-Grams Representation:
o Extend the bag-of-words model by using n-grams (unigrams, bigrams, trigrams)
to capture local word order and contextual information.
4. Sparse Matrix Representation:
o Represent the text data as a sparse matrix, where most of the elements are zero, to
save memory and computational resources.
5. Dimensionality Reduction:
o Apply techniques like Principal Component Analysis (PCA) or Latent Semantic
Analysis (LSA) to reduce the dimensionality of the feature space and focus on the
most important features.
36. State the difference between sentiment analysis, intent analysis, and emotion analysis.
(10 Marks)
Sentiment Analysis:
o Focuses on determining the overall sentiment or opinion expressed in the text
(e.g., positive, negative, neutral).
o Example: Analyzing customer reviews to determine whether the feedback is
positive or negative about a product.
Intent Analysis:
o Aims to understand the underlying purpose or goal behind the text (e.g., asking
for help, making a purchase, giving feedback).
o Example: Identifying whether a user's query to a chatbot is about troubleshooting
or product inquiry.
Emotion Analysis:
o Detects and classifies the emotional tone in the text (e.g., joy, anger, sadness,
surprise).
o Example: Analyzing social media posts to understand the public's emotional
reaction to an event or news.
37. How is sentiment analysis used by different brands to assess the status of the market
after launching a product? (10 Marks)
38. Mention a few practical applications of emotion analysis by emotion recognition. (10
Marks)
1. Customer Experience:
o Emotion analysis can be applied in customer support to analyze emotions in
interactions and improve service quality based on emotional cues.
2. Market Research:
o Brands use emotion analysis to understand consumer emotions towards their
products or advertisements, influencing marketing strategies.
3. Healthcare:
o Emotion recognition can help in mental health diagnosis and therapy,
understanding emotional states of patients and offering appropriate interventions.
4. Human-Computer Interaction:
o Emotion analysis enhances user experience by making systems more responsive
to the user’s emotional state, such as in virtual assistants or gaming.
5. Education:
o Emotion recognition can help educators understand students' emotional
engagement and adapt teaching strategies accordingly.
39. Step by step explain how Naive Bayes classifier can be used for text classification. (10
Marks)
1. Preprocessing:
o Collect the dataset, clean the text data, and convert it into a numerical form using
techniques like bag-of-words or TF-IDF.
2. Feature Extraction:
o Extract features (e.g., word frequencies) from the text for use in the classifier.
3. Model Training:
o For each class (e.g., spam, not spam), calculate the prior probability and
likelihood of each word in the class using the training data.
4. Apply Bayes' Theorem:
o Use Bayes' theorem to compute the posterior probability for each class given the
word frequencies in the text.
5. Classification:
o Choose the class with the highest posterior probability as the predicted class for
the new text.
6. Model Evaluation:
o Evaluate the model’s performance using metrics like accuracy, precision, recall,
and F1 score on the test data.
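An end-to-end sketch of these steps, assuming scikit-learn; the training and test texts are toy placeholders:

# Vectorize text, train Multinomial Naive Bayes, predict, and evaluate.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_texts  = ["win money now", "free offer win", "meeting at noon", "project report due"]
train_labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)          # steps 1-2: preprocessing + feature extraction

clf = MultinomialNB(alpha=1.0)                    # step 3: priors/likelihoods with add-one smoothing
clf.fit(X_train, train_labels)

test_texts = ["free money offer", "noon project meeting"]
pred = clf.predict(vec.transform(test_texts))     # steps 4-5: Bayes' theorem + highest posterior
print(pred)                                       # expected: ['spam' 'ham']
print(accuracy_score(["spam", "ham"], pred))      # step 6: evaluation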
1. Lowercasing:
o Convert all text to lowercase to ensure uniformity and avoid duplicate word
entries (e.g., "Apple" and "apple").
2. Removing Punctuation:
o Remove all punctuation marks (e.g., periods, commas, exclamation marks) as
they don't contribute to the text analysis.
3. Removing Stop Words:
o Remove common words like "and", "the", "is" that don't carry significant meaning
in the analysis.
4. Stemming/Lemmatization:
o Reduce words to their root form (e.g., "running" becomes "run") to consolidate
similar terms.
Email Filtering:
o Classifying emails as spam or non-spam using text classification algorithms. This
helps in organizing incoming emails and preventing spam from cluttering the
inbox.
Sentiment Analysis:
o Text classification can be used to analyze customer feedback, product reviews, or
social media posts to determine whether the sentiment is positive, negative, or
neutral.
Document Categorization:
o Categorizing large collections of documents into predefined categories (e.g., news
articles classified into topics like sports, politics, technology) for easier retrieval.
Content Recommendation Systems:
o Based on user preferences or reading history, text classification can recommend
similar articles, books, or movies by categorizing content based on the user's
interest.
Chatbot Responses:
o Text classification helps chatbots understand user queries and respond
accordingly by classifying user input into predefined categories of intents (e.g.,
booking a ticket, checking weather).
Search Engines:
o NER helps in improving search engine results by identifying and indexing named
entities, enabling better content retrieval related to specific entities like people or
places.
Question Answering Systems:
o In question-answering applications, NER aids in extracting relevant named
entities from a user's question to provide accurate answers.
Information Extraction:
o NER assists in automatically extracting important information, such as people,
places, or events, from large unstructured datasets or documents.
Sentiment Analysis:
o By recognizing named entities, sentiment analysis can be more accurate in
determining the sentiment expressed towards specific individuals or
organizations.
44. How k-fold cross validation is used for evaluating a text classifier? (10 Marks)
1. Data Splitting:
o Divide the entire dataset into k equal-sized "folds". Typically, k is chosen as 5 or
10, but it can be adjusted based on dataset size.
2. Training and Testing:
o For each fold, use k-1 folds for training the model and the remaining fold for
testing the model.
3. Model Evaluation:
o Repeat this process for each fold and calculate the evaluation metric (e.g.,
accuracy, precision, recall) for each iteration.
4. Result Averaging:
o Average the evaluation scores obtained from all k iterations to get a more robust
estimate of the model's performance.
5. Model Selection:
o The final averaged score can be used to select the best model configuration or
tune hyperparameters.
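A short sketch of 5-fold cross-validation for a text classifier, assuming scikit-learn; the corpus and labels are toy placeholders:

# 5-fold cross-validation of a TF-IDF + Naive Bayes text classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

texts  = ["great movie", "awful film", "loved it", "hated it", "fantastic plot",
          "boring and bad", "really enjoyable", "terrible acting", "superb", "dreadful"] * 3
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 3

model  = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")  # one score per fold
print(scores, scores.mean())   # averaged estimate of classifier performance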
45. Explain the fundamental concepts of Natural Language Processing (NLP) and discuss
its significance in today's digital era, providing examples of real-world applications and
potential future advancements. (5 Marks)
Fundamental Concepts:
o Text Preprocessing: Cleaning and preparing text data by tokenizing, stemming,
and removing stop words.
o Text Representation: Converting text into numerical representations like bag-of-
words, TF-IDF, or word embeddings (e.g., Word2Vec).
o Language Models: Building models that understand and generate human
language, such as n-grams or deep learning-based models like GPT.
Significance:
o NLP enables machines to interact with humans in natural language, which is
crucial for applications such as virtual assistants (e.g., Siri, Alexa), translation
services (e.g., Google Translate), and content recommendations.
o Real-World Applications:
Sentiment analysis in social media monitoring.
Chatbots for customer service in e-commerce.
Text summarization for news articles.
o Future Advancements:
More sophisticated conversational AI with deeper understanding and
emotional intelligence.
Real-time multilingual translation and more accurate voice recognition.
Ambiguity in NLP refers to the phenomenon where a single word or sentence can have
multiple meanings depending on the context in which it is used.
Types of Ambiguity:
o Lexical Ambiguity:
Occurs when a word has multiple meanings. For example, "bank" can
refer to a financial institution or the side of a river.
o Syntactic Ambiguity:
Arises when the structure of a sentence allows for multiple interpretations.
For example, "I saw the man with the telescope" could mean either "I saw
a man who had a telescope" or "I used a telescope to see the man."
o Semantic Ambiguity:
Occurs when a sentence can have different meanings due to the
interpretation of words. For example, "The chicken is ready to eat" could
mean the chicken is cooked and ready to be eaten, or the chicken is ready
to eat something else.
o Pragmatic Ambiguity:
Refers to the ambiguity arising from the context or the speaker’s intent.
For example, "Can you pass the salt?" may seem like a question but is
often interpreted as a request.
47. What are the benefits of a text classification system? Give an example. (5 Marks)
Benefits:
o Automation: Automates the categorization of large text datasets, saving time and
effort compared to manual classification.
o Accuracy: A well-trained classifier can achieve high accuracy in identifying and
categorizing text, reducing human error.
o Scalability: It can scale to handle large amounts of data efficiently.
Example:
o A spam filter in an email system uses text classification to automatically identify
and move spam emails to the spam folder, keeping the inbox clean and organized.
48. Explain the Building Blocks of Semantic System? (5 Marks)
1. Lexical Semantics:
o Understanding the meaning of individual words and how they combine to form
phrases and sentences. It involves analyzing word meanings, synonyms, and
antonyms.
2. Compositional Semantics:
o The process of deriving the meaning of a sentence or text based on the meanings
of its individual components (words and phrases).
3. Pragmatics:
o Understanding meaning in context, such as how the meaning of a sentence may
change depending on the situation or speaker's intent.
4. World Knowledge:
o Incorporating external knowledge or common sense to enhance the understanding
of text, which is essential for disambiguating meanings.
Definition:
o Dependency parsing is the process of analyzing the grammatical structure of a
sentence by establishing relationships between "head" words and their
dependents. It helps in understanding the syntactic structure and relationships
within the sentence.
How it works:
o Dependency Tree: A dependency tree is constructed where the root of the tree
represents the main verb or predicate of the sentence, and all other words are
connected to the root based on syntactic dependencies.
o Dependencies: The words in the sentence are linked to their syntactic head (e.g.,
the subject of the verb, the object of the verb) with directed edges. Each word in
the sentence has a specific dependency relation with its head.
o Example:
Sentence: "The cat sat on the mat."
"sat" is the root of the sentence. "The" is a dependent of "cat", and "on" is
the head of the prepositional phrase "on the mat."
Applications:
o Information Extraction: Dependency parsing helps in identifying relations
between entities, such as "John (subject) gave (verb) a book (object) to Mary
(indirect object)".
o Machine Translation: Understanding the grammatical structure of a source
sentence and translating it into another language while maintaining meaning.
o Question Answering: Helps in identifying the relevant components of a question
and matching them to the answer correctly.
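Dependency trees like the one described above can be produced with an off-the-shelf parser; a sketch assuming spaCy and its small English model are installed (python -m spacy download en_core_web_sm), with exact labels depending on the model:

# Dependency parsing of the example sentence with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

for token in doc:
    # word, its dependency relation, and the head word it attaches to
    print(f"{token.text:<5} --{token.dep_:>6}--> {token.head.text}")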
51. What are the steps involved in pre-processing data for NLP? (5 Marks)
1. Text Cleaning:
o Remove unnecessary elements such as HTML tags, special characters, and
irrelevant text like advertisements.
2. Tokenization:
o Break the text into smaller units like words or sentences. For example, "I love
NLP" is tokenized into ["I", "love", "NLP"].
3. Lowercasing:
o Convert all text to lowercase to ensure that the model treats words like "Cat" and
"cat" as the same.
4. Removing Stopwords:
o Remove common words (e.g., "the", "is", "and") that do not contribute much
meaning to the text, which helps in reducing noise.
5. Stemming or Lemmatization:
o Reduce words to their root form (e.g., "running" becomes "run") to treat related
words as the same.
6. Vectorization:
o Convert the processed text into numerical representations (e.g., TF-IDF, bag-of-
words, or word embeddings) for further machine learning processing.
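A compact sketch of steps 1-5 above; NLTK's PorterStemmer is an assumption, and the stop-word set here is a tiny toy subset of a real stop-word list:

# Minimal pre-processing pipeline: clean, tokenize, lowercase, remove stopwords, stem.
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "is", "and", "a", "an", "in", "of", "to"}
stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                  # 1. remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())          # 2-3. tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 4. remove stopwords
    return [stemmer.stem(t) for t in tokens]              # 5. stemming

print(preprocess("<p>The cats are running in the garden</p>"))
# e.g. ['cat', 'are', 'run', 'garden']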
52. What are some common applications of chatbots in various industries? (10 Marks)
Customer Service:
o Chatbots are used by companies like Amazon and Zappos to handle customer
queries, resolve issues, and provide product recommendations without human
intervention.
Healthcare:
o Chatbots like Babylon Health offer medical consultations by analyzing symptoms
and providing possible diagnoses, helping patients get quick medical advice.
E-commerce:
o Many e-commerce websites use chatbots to assist customers in browsing
products, answering product-related questions, and helping with purchases.
Banking:
o Chatbots help customers with balance inquiries, recent transactions, bill
payments, and loan application processes, making banking services more
accessible and efficient.
Education:
o Chatbots are used in online courses and platforms to assist students by answering
common questions, providing study materials, and helping with exam preparation.
Travel and Hospitality:
o Travel agencies use chatbots to assist customers with booking flights, hotels, and
even offering personalized travel suggestions based on preferences.
53. Compute the minimum edit distance in transforming the word DOG to COW using
Levenshtein distance, i.e., insertion = deletion = 1 and substitution = 2. (10 Marks)
Steps:
o Align the two words position by position: D→C, O→O, G→W.
o Substitute "D" with "C": cost 2.
o "O" matches "O": cost 0.
o Substitute "G" with "W": cost 2.
o Minimum edit distance = 2 + 0 + 2 = 4.
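For verification, a small pure-Python sketch of the dynamic-programming edit distance with insertion = deletion = 1 and the substitution cost as a parameter (the function name is illustrative):

# Weighted edit distance (insertion = deletion = 1, substitution = sub_cost).
def edit_distance(a, b, sub_cost=2):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all remaining characters of a
    for j in range(n + 1):
        d[0][j] = j                               # insert all remaining characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

print(edit_distance("DOG", "COW", sub_cost=2))       # 4: D->C and G->W, O matches
print(edit_distance("kitten", "sitting", sub_cost=1))  # 3 with unit substitution cost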
54. What are word embeddings in NLP and how can they be used in various NLP
applications? (10 Marks)
Word Embeddings:
o Word embeddings are a type of word representation that allows words to be
represented as vectors in a continuous vector space. These vectors capture the
semantic meaning of the words, with similar words having similar vector
representations.
How they work:
o Embeddings are typically learned using unsupervised learning techniques such as
Word2Vec, GloVe, or fastText. These models learn the relationship between
words based on their context within large corpora.
Applications:
o Semantic Similarity: Word embeddings enable the calculation of word similarity
by comparing the cosine similarity between word vectors. For instance, "king"
and "queen" will be closer in the vector space than "king" and "car".
o Machine Translation: Embeddings improve translation by capturing the meaning
of words in different languages, helping to map semantically similar words across
languages.
o Text Classification: Word embeddings improve text classification tasks by
capturing the semantic meaning of words, enhancing the model's ability to
classify text into categories like sentiment or topic.
o Named Entity Recognition (NER): Word embeddings help recognize entities
(e.g., person names, locations) in text by understanding their context and
relationships with other words.
o Chatbots and Question Answering: Embeddings improve the chatbot’s
understanding of user input, helping the system generate more accurate and
relevant responses.
55. Do you believe there are any distinctions between prediction and classification?
Illustrate with an example. (5 Marks)
Prediction:
o Prediction is the process of estimating a continuous output variable based on input
data. The output can take any real value within a range.
o Example: Predicting the temperature tomorrow based on historical weather data.
The output could be a temperature value, say 30°C.
Classification:
o Classification, on the other hand, involves categorizing the data into discrete
classes or labels. The output variable is a class or category rather than a
continuous value.
o Example: Classifying whether an email is spam or not based on certain features.
The output will be either "spam" or "not spam" (binary classification).
Given feedback: "Good = Yes, Fast = Yes, Cheap = No" (to classify)
1. Prior Probabilities:
o P(Pos) = 2/4
o P(Neg) = 2/4
2. Likelihoods:
o P(Good = Yes | Pos) = 1/2
o P(Fast = Yes | Pos) = 1/2
o P(Cheap = No | Pos) = 1/2
o P(Good = Yes | Neg) = 1/2
o P(Fast = Yes | Neg) = 0/2 = 0
o P(Cheap = No | Neg) = 1/2
3. Naive Bayes Calculation:
o P(Pos | "Good" = Yes, "Fast" = Yes, "Cheap" = No) ∝ P(Good = Yes | Pos) *
P(Fast = Yes | Pos) * P(Cheap = No | Pos) * P(Pos)
o P(Pos | "Good" = Yes, "Fast" = Yes, "Cheap" = No) ∝ (1/2) * (1/2) * (1/2) * (2/4)
= 0.125
o P(Neg | "Good" = Yes, "Fast" = Yes, "Cheap" = No) ∝ P(Good = Yes | Neg) *
P(Fast = Yes | Neg) * P(Cheap = No | Neg) * P(Neg)
o P(Neg | "Good" = Yes, "Fast" = Yes, "Cheap" = No) ∝ (1/2) * (0) * (1/2) * (2/4)
=0
4. Conclusion:
o Since the posterior score for Positive (0.0625) is greater than that for Negative (0), the
feedback is classified as Positive.
60. A weather dataset is given for predicting whether a person will play tennis. (10 Marks)
1. Prior Probabilities:
o P(Play Tennis = Yes) = 3/6
o P(Play Tennis = No) = 3/6
2. Likelihoods:
o P(Outlook = Rain | Yes) = 2/3
o P(Temperature = Mild | Yes) = 1/3
o P(Humidity = High | Yes) = 1/3
o P(Wind = Strong | Yes) = 1/3
o P(Outlook = Rain | No) = 1/3
o P(Temperature = Mild | No) = 1/3
o P(Humidity = High | No) = 2/3
o P(Wind = Strong | No) = 1/3
3. Naive Bayes Calculation:
o P(Play Tennis = Yes | Outlook = Rain, Temperature = Mild, Humidity = High,
Wind = Strong) ∝ P(Outlook = Rain | Yes) * P(Temperature = Mild | Yes) *
P(Humidity = High | Yes) * P(Wind = Strong | Yes) * P(Yes)
o P(Play Tennis = Yes | Outlook = Rain, Temperature = Mild, Humidity = High,
Wind = Strong) ∝ (2/3) * (1/3) * (1/3) * (1/3) * (3/6) ≈ 0.012
o P(Play Tennis = No | Outlook = Rain, Temperature = Mild, Humidity = High,
Wind = Strong) ∝ P(Outlook = Rain | No) * P(Temperature = Mild | No) *
P(Humidity = High | No) * P(Wind = Strong | No) * P(No)
o P(Play Tennis = No | Outlook = Rain, Temperature = Mild, Humidity = High,
Wind = Strong) ∝ (1/3) * (1/3) * (2/3) * (1/3) * (3/6) ≈ 0.012
4. Conclusion:
o Since the posterior scores for "Yes" and "No" are equal, the model cannot confidently
predict; based on this data, it can go either way.
Module 4
Content-Based Filtering works by recommending items similar to the ones the user has
interacted with based on the item's features or content:
1. Feature Extraction: The system identifies key features of the items, such as
genre, keywords, or price.
2. User Profile Creation: It builds a profile for each user, capturing their
preferences based on the items they've liked or interacted with.
3. Similarity Calculation: The system computes the similarity between items using
measures like cosine similarity or Euclidean distance.
4. Recommendation Generation: Items with the highest similarity to the user’s
profile are recommended.
o Example: If a user likes action movies, the system will recommend more action
movies using genre as the key feature.
Text Summarization is the process of reducing a long piece of text to a shorter version,
capturing the key points and important information. It can be done in two ways:
1. Extractive Summarization: Selects important sentences or segments directly
from the original text.
2. Abstractive Summarization: Generates new sentences that capture the meaning
of the original text, often using NLP techniques.
A Chatbot is a type of Conversational Agent that interacts with users through text or
voice-based communication. It is designed to simulate human conversations, answering
queries or performing tasks autonomously.
1. Types:
Rule-Based Chatbots: Use predefined rules for responses.
AI-Based Chatbots: Use machine learning and natural language
processing to understand and respond to complex queries.
2. Applications:
Customer service, virtual assistants, e-commerce support, and
conversational interfaces for various services.
Example: A chatbot on an e-commerce site assisting users in finding
products.
Advantages of AI in Chatbots:
1. Natural Language Understanding: AI-powered chatbots can understand and
process natural language, enabling more human-like conversations.
2. Contextual Understanding: AI chatbots can remember previous interactions and
provide contextually relevant responses, making them more personalized.
3. 24/7 Availability: AI chatbots can operate round-the-clock, providing instant
support and responses at any time.
4. Scalability: AI chatbots can handle a large volume of interactions simultaneously,
making them scalable for businesses with high customer interaction volumes.
5. Learning and Improvement: AI chatbots improve over time as they learn from
user interactions, refining their responses for better accuracy.
Retrieval-Based Model:
o A Retrieval-Based Model is a type of chatbot or conversational agent that
provides responses by retrieving the most appropriate answer from a predefined
set of responses or templates. It does not generate new content but selects the
most relevant response from a database or a knowledge base.
o Example: When asked about business hours, the chatbot retrieves the exact
response from a set of pre-stored information.
17. State the concept of Information retrieval (IR) based question answering. (2 Marks)
IR-based question answering retrieves relevant documents or passages from a collection in
response to a user's question and then extracts or ranks the text span that best answers it,
rather than generating an answer from scratch.
Key Differences:
o Scope: Web search focuses on the internet, while IR can be applied to any
document-based data.
o Ranking Algorithms: Web search engines use complex algorithms like
PageRank to rank results, while IR typically uses relevance measures like cosine
similarity or TF-IDF.
19. Define Recommendation based on User Ratings using appropriate example. (5 Marks)
Sentiment Analysis is the process of determining the emotional tone or sentiment behind
a piece of text. It is often used in analyzing customer feedback, social media posts, or
product reviews.
1. Positive Sentiment: Indicates approval or satisfaction.
2. Negative Sentiment: Indicates disapproval or dissatisfaction.
3. Neutral Sentiment: Indicates a lack of strong opinion.
Example: The review "This product exceeded my expectations!" would be labelled positive,
while "The delivery was late and the item arrived damaged" would be labelled negative.
22. Explain the concept of the Recommendation System with real-life examples. (5 Marks)
Real-life Examples:
1. Netflix: Recommends movies and TV shows based on the user’s viewing history
and ratings, using collaborative filtering and content-based methods.
2. Amazon: Recommends products based on the items you have previously
purchased or viewed, as well as what other customers with similar behaviors have
bought.
3. Spotify: Suggests music based on the user’s listening patterns, favorite artists, or
genres, combining collaborative filtering and content-based recommendations.
Example:
o Movie Recommendations: If User A and User B both liked the same movies, and
User A also liked a movie that User B hasn’t seen, the system will recommend
that movie to User B.
o Amazon’s Product Recommendations: If users who bought a specific book also
bought another one, the system will recommend the second book to others who
bought the first.
26. What are steps involved in Latent Dirichlet Allocation (LDA)? (5 Marks)
1. Choose the number of topics K and preprocess the corpus (tokenize, remove stop words,
build a document-term matrix).
2. Randomly assign each word in each document to one of the K topics.
3. Iteratively re-estimate the assignments (e.g., with Gibbs sampling or variational
inference): for each word, update its topic based on how prevalent that topic is in the
document and how prevalent the word is in that topic.
4. Repeat until the topic assignments stabilize.
5. Output the document-topic distributions and the topic-word distributions.
Example: Analyzing tweets about a new product launch to assess public opinion and
consumer reactions.
Chatbot Architectures:
1. Rule-Based Architecture:
Relies on predefined responses and rules based on keyword matching.
Suitable for basic queries but lacks the ability to handle complex or
ambiguous inputs.
2. Retrieval-Based Architecture:
Selects the most relevant response from a set of pre-stored responses
based on the user’s input.
Does not generate new responses but relies on existing ones in the
database.
3. Generative Architecture:
Uses machine learning models (e.g., neural networks) to generate dynamic
responses.
Can handle more open-ended conversations and learn from user
interactions.
4. Hybrid Architecture:
Combines both rule-based and generative approaches to offer a more
robust chatbot experience, enabling better handling of diverse queries.
29. Illustrate Multi-document summarization. (5 Marks)
Multi-document Summarization:
o Concept: In multi-document summarization, the goal is to create a single, concise
summary that represents the most important information from multiple
documents.
o Process:
1. Document Collection: Collect a set of related documents on a specific
topic.
2. Text Preprocessing: Clean the documents by removing stop words,
stemming, and other irrelevant elements.
3. Text Segmentation and Clustering: Divide the documents into segments
or clusters based on similarity.
4. Summarization: Extract key sentences or information from each
document and combine them into a coherent summary.
Example: Summarizing multiple news articles about the same event into a single,
concise summary.
Topic Modeling:
o Concept: Topic modeling is a technique used in natural language processing
(NLP) to discover the hidden thematic structure in a large collection of text. It
helps in identifying topics or themes that occur across the documents.
o Common Methods:
1. Latent Dirichlet Allocation (LDA): A probabilistic model that assumes
each document is a mixture of topics, and each word in the document is
attributable to one of the document's topics.
2. Non-Negative Matrix Factorization (NMF): Factorizes the document-
term matrix into two lower-dimensional matrices, one for the topic
distribution and one for word distribution.
Example: Using topic modeling to categorize articles in a news dataset into topics like
politics, sports, entertainment, etc.
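A small sketch of LDA-based topic modeling with scikit-learn on a few made-up documents; on such tiny data the two recovered topics are only indicative, but they should roughly separate "politics" and "sports" vocabulary.

```python
# Sketch of LDA topic modelling with scikit-learn on a handful of toy documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the election results and the new government policy",
    "parliament debated the policy after the election",
    "the team won the football match in the final minute",
    "the striker scored twice and the team won the league",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"Topic {topic_idx}: {top_terms}")   # e.g. a 'politics' topic and a 'sports' topic
```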
Extraction-Based Summarization:
o Concept: This technique involves selecting important sentences or segments from
the original text and combining them to create a summary. It extracts the most
relevant parts without altering or generating new content.
o Process:
1. Text Preprocessing: Clean and preprocess the text, removing unnecessary
elements like stop words.
2. Sentence Ranking: Use algorithms to rank sentences based on their
relevance to the main topic (e.g., TF-IDF, TextRank).
3. Extraction: Select the top-ranked sentences or segments and combine
them to form a coherent summary.
Example: A news article about a political event may extract key sentences that highlight
the main points such as the event's location, time, and major developments.
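A minimal sketch of this extractive approach: each sentence is scored by the sum of its TF-IDF weights, and the top-ranked sentences are kept in their original order. The short article text is made up for illustration.

```python
# Extractive summarization sketch: rank sentences by total TF-IDF weight.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = (
    "The city council approved the new transit plan on Tuesday. "
    "The plan adds three bus routes and extends the tram line. "
    "Council members debated for several hours before the vote. "
    "Funding will come from a mix of state grants and local taxes."
)
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

scores = np.asarray(X.sum(axis=1)).ravel()       # one importance score per sentence
top_k = 2
top_idx = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k])

print(" ".join(sentences[i] for i in top_idx))   # top sentences, in original order
```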
38. What is LDA and how is it different from others? (10 Marks)
Abstractive Summarization:
o Concept: Abstractive summarization involves generating new sentences that
convey the most important information from the original text. Unlike extractive
summarization, which directly selects parts of the text, abstractive summarization
rewrites the content.
o Example:
Original Text: "The Eiffel Tower is one of the most famous landmarks in
Paris, France. It was designed by engineer Gustave Eiffel and completed
in 1889. The tower stands at a height of 324 meters and attracts millions of
visitors every year."
Abstractive Summary: "The Eiffel Tower in Paris, designed by Gustave
Eiffel in 1889, is a popular tourist attraction, standing 324 meters tall."
o Methods:
Seq2Seq Models: Use sequence-to-sequence models, often based on
LSTM or Transformer networks, to generate summaries.
GPT-based Models: Pre-trained models like GPT-3 can be used to
generate abstractive summaries by learning the language and context.
o Challenges:
Maintaining coherence and meaning while generating new sentences.
Handling long documents where the summary should still capture the
essence.
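A minimal sketch of the pretrained-model route using the Hugging Face transformers library; it assumes the package (with a PyTorch or TensorFlow backend) is installed and that the default summarization checkpoint can be downloaded on first use.

```python
# Sketch of abstractive summarization with a pretrained seq2seq model.
# Assumes the Hugging Face `transformers` package is installed; the default
# summarization checkpoint is downloaded on first run.
from transformers import pipeline

summarizer = pipeline("summarization")

text = (
    "The Eiffel Tower is one of the most famous landmarks in Paris, France. "
    "It was designed by engineer Gustave Eiffel and completed in 1889. "
    "The tower stands at a height of 324 meters and attracts millions of visitors every year."
)

result = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])   # a newly generated sentence, not a copied one
```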
Content-based Filtering:
o Advantages:
1. No Need for Other Users' Data: Recommendations depend only on item
features and the individual user's own profile, so data about other users is not required.
2. Transparency: The rationale for recommendations is based on item
features, making it easier to explain why an item was recommended.
3. Personalization: Can tailor recommendations based on the individual’s
preferences and behaviors.
o Disadvantages:
1. Limited to User’s Past Preferences: Recommendations are often limited
to similar items, which might not help in discovering new content.
2. Cold Start Problem for Items: If there is no information about an item, it
can’t be recommended, which might limit the system's effectiveness.
3. Lack of Diversity: The system might recommend similar items too often,
leading to a lack of variety.
Collaborative Filtering:
o Advantages:
1. No Need for Item Features: Works based on user-item interactions, so it
can recommend items even if detailed item features are not available.
2. Discover New Items: Can recommend items that a user might not have
found on their own, leading to serendipitous discoveries.
3. Widely Applicable: Works well in domains where user preferences are
critical, such as movies, music, or e-commerce.
o Disadvantages:
1. Cold Start Problem for New Users: If there is not enough data about a
user, it can be challenging to make recommendations.
2. Scalability Issues: As the number of users and items increases,
collaborative filtering models can become computationally expensive.
3. Sparsity: In large systems, many users may not have interacted with
enough items, leading to sparse user-item interaction matrices.
42. In the context of natural language processing, how can we leverage the concepts of TF-
IDF, training set, validation set, test set, and stop words to improve the accuracy and
effectiveness of machine learning models and algorithms? Additionally, what are some
potential challenges and considerations when working with these concepts, and how can we
address them? (10 Marks)
Leverage in NLP:
1. TF-IDF (Term Frequency-Inverse Document Frequency):
How it helps: TF-IDF helps quantify the importance of words in a
document relative to a collection (corpus). It reduces the influence of
common words (e.g., "the," "and") while highlighting unique terms
relevant to the context, aiding in text classification, clustering, and other
NLP tasks.
Application: Use TF-IDF for feature extraction in tasks like document
classification, topic modeling, and search engines.
2. Training, Validation, and Test Sets:
Training Set: Used to train the machine learning model, adjusting its
parameters based on the data.
Validation Set: Helps tune the hyperparameters and select the best model
by evaluating performance on unseen data.
Test Set: Evaluates the final model’s performance to assess its
generalization ability.
How they help: These sets help prevent overfitting, ensuring the model is
trained well and can generalize to new, unseen data.
3. Stop Words:
How it helps: Stop words (like "and," "is," "in") carry little meaning for
certain NLP tasks, so removing them can improve the signal-to-noise ratio
and reduce computational complexity.
Application: Stop word removal is critical in tasks like text classification,
information retrieval, and sentiment analysis.
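A short sketch tying the three points together with scikit-learn: stop-word-aware TF-IDF features, a train/validation/test split, and a simple classifier. The tiny labelled dataset is invented for illustration and is far smaller than anything a real model would need.

```python
# TF-IDF features + stop-word removal + train/validation/test split (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

texts = [
    "great product, works perfectly", "terrible quality, broke in a day",
    "love it, highly recommend", "awful experience, do not buy",
    "excellent value for the money", "worst purchase I have made",
    "really happy with this item", "completely useless and overpriced",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative

# Roughly 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.4, random_state=0, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp)

vectorizer = TfidfVectorizer(stop_words="english")   # stop-word removal + TF-IDF weighting
clf = LogisticRegression().fit(vectorizer.fit_transform(X_train), y_train)

print("validation accuracy:", clf.score(vectorizer.transform(X_val), y_val))
print("test accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```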
Challenges and Considerations:
1. Overfitting: The model may memorize the training data and fail to generalize to
new data.
Solution: Use regularization techniques and cross-validation to detect and
mitigate overfitting.
2. Data Imbalance: If the dataset has class imbalances, the model might be biased
toward the majority class.
Solution: Use techniques like oversampling, undersampling, or class
weighting to address this issue.
3. Stop Words in Context: In some cases, stop words carry meaning (e.g., in
sentiment analysis, "not good").
Solution: Carefully evaluate which stop words are relevant to the task
before removal.
43. Describe how extraction-based and abstraction-based summarizations vary from one
another. How would you go about creating an extractive summarization system? (10
Marks)
Extraction-Based Summarization:
o Concept: In extraction-based summarization, important sentences or phrases are
extracted directly from the original document to form a summary. It involves
selecting the most significant portions without altering the original content.
o Example: Selecting key sentences from a news article and concatenating them to
form a brief summary.
Abstraction-Based Summarization:
o Concept: In abstraction-based summarization, new sentences are generated by
paraphrasing the content of the original document. It involves rewriting and
synthesizing the key information.
o Example: Generating a short summary that conveys the same meaning of an
article but with different words or structure.
Creating an Extractive Summarization System:
1. Text Preprocessing: Clean the data by removing stop words and punctuation,
and by stemming or lemmatizing the words.
2. Feature Extraction: Convert the text into numerical features using methods like
TF-IDF or word embeddings.
3. Sentence Scoring: Use techniques like TF-IDF, PageRank (TextRank), or
clustering to score and rank sentences based on their importance.
4. Sentence Selection: Select the highest-scoring sentences to form the summary
while ensuring coherence and readability.
5. Post-processing: Optionally, reformat the sentences for smoother readability and
coherence.
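A sketch of steps 1-4 in a TextRank-style variant: TF-IDF sentence vectors, a cosine-similarity graph, and PageRank for sentence scoring. It assumes scikit-learn and networkx (2.6+, for from_numpy_array) are installed; the sample article is made up.

```python
# TextRank-style extractive summarizer: sentence similarity graph + PageRank.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summary(text: str, top_k: int = 2) -> str:
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)   # steps 1-2
    sim = cosine_similarity(tfidf)                                           # pairwise similarity
    graph = nx.from_numpy_array(sim)                                         # sentence graph
    ranks = nx.pagerank(graph)                                               # step 3: scoring
    best = sorted(sorted(ranks, key=ranks.get, reverse=True)[:top_k])        # step 4: selection
    return " ".join(sentences[i] for i in best)

article = (
    "Heavy rain flooded several streets in the city centre on Friday. "
    "Officials closed two bridges as water levels kept rising. "
    "Local schools cancelled classes for the day. "
    "Forecasters expect the rain to ease over the weekend."
)
print(extractive_summary(article))
```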
44. Explain how pretraining techniques such as GPT (Generative Pretrained Transformer)
contribute to improving natural language understanding tasks. Discuss the key components
and training objectives of GPT models. (10 Marks)
45. How do you use the Naïve Bayes model for collaborative filtering? (10 Marks)
Lexical Analysis:
o Concept: Lexical analysis is the process of breaking down a sequence of
characters (text) into tokens, such as words, punctuation, and symbols, which are
the basic units for syntax analysis.
o Goal: Identifies the structure of words and symbols in a language.
Semantic Analysis:
o Concept: Semantic analysis deals with interpreting the meanings of words,
phrases, or sentences in context. It focuses on the relationships between symbols
to derive meaning.
o Goal: Understanding the meaning behind the words and phrases in context.
Difference:
o Lexical Analysis: Concerned with the form and structure of words and tokens
(the surface form of the text).
o Semantic Analysis: Concerned with the meaning and interpretation of the content
(semantics).
48. Define what N-grams are in the context of Natural Language Processing (NLP). (2
Marks)
N-grams:
o Definition: N-grams are contiguous sequences of "n" items (words, characters, or
tokens) from a given text or speech. For example:
Unigram (1 word): "I", "love", "machine", "learning"
Bigram (2 words): "I love", "love machine", "machine learning"
Trigram (3 words): "I love machine", "love machine learning"
o N-grams are used for feature extraction, language modeling, and improving text
predictions.
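A minimal sketch that reproduces the unigram/bigram/trigram examples above with plain Python.

```python
# Generate n-grams from a whitespace-tokenized sentence.
def ngrams(text: str, n: int) -> list:
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love machine learning"
print(ngrams(sentence, 1))  # ['I', 'love', 'machine', 'learning']
print(ngrams(sentence, 2))  # ['I love', 'love machine', 'machine learning']
print(ngrams(sentence, 3))  # ['I love machine', 'love machine learning']
```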
49. How are N-grams utilized in Natural Language Processing (NLP)? (2 Marks)
Utilization of N-grams:
1. Feature Extraction: N-grams are used as features in machine learning models for
tasks such as text classification, sentiment analysis, and language modeling. They
capture local context by considering adjacent words.
2. Language Modeling: N-grams are employed to predict the likelihood of the next
word in a sequence by analyzing previous words in a sentence.
3. Text Generation: N-grams help in generating coherent text by predicting the next
word based on the given sequence of words.
4. Improving Accuracy: They can capture dependencies between words, helping
models understand the structure and semantics of a sentence better.
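As a brief sketch of point 1 (feature extraction), scikit-learn's CountVectorizer can emit unigram and bigram counts directly via its ngram_range parameter; the two example documents are made up.

```python
# Unigram + bigram features for downstream text classification.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love machine learning", "machine learning loves data"]
vectorizer = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # includes 'machine', 'machine learning', ...
print(X.toarray())                          # document-by-feature count matrix
```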
Data Augmentation:
o Definition: Data augmentation refers to techniques that artificially expand the
size of a training dataset by applying transformations to the existing data. These
transformations can include adding noise, changing the order of words, or
replacing words with synonyms.
o Purpose: It is mainly used to improve model generalization, prevent overfitting,
and enhance performance, especially when data is scarce.
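Two of the transformations mentioned above, sketched in a few lines; the small synonym dictionary is a hand-made, hypothetical example rather than a real thesaurus.

```python
# Simple data augmentation sketches: synonym replacement and random word swap.
import random

SYNONYMS = {"good": ["great", "fine"], "movie": ["film"], "quick": ["fast", "speedy"]}  # hypothetical

def synonym_replace(text: str) -> str:
    words = text.split()
    return " ".join(random.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words)

def random_swap(text: str) -> str:
    words = text.split()
    if len(words) > 1:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

random.seed(0)
original = "a good movie with a quick plot"
print(synonym_replace(original))   # e.g. "a great film with a fast plot"
print(random_swap(original))       # same words, shuffled order
```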
51. How would you create a Recommender System for text inputs? (5 Marks)
52. Discuss how the popular word embedding technique Word2Vec is implemented in the
algorithms: Continuous Bag of Words (CBOW) model, and the Skip-Gram model. (5
Marks)
Word2Vec Models:
1. Continuous Bag of Words (CBOW):
Concept: The CBOW model predicts the target word given a context
(surrounding words). It uses the surrounding words in a fixed window to
predict the center word.
Example: In the sentence "The cat sat on the mat," the model will use the
words "The," "cat," "on," "the" to predict the center word "sat."
2. Skip-Gram Model:
Concept: The Skip-Gram model does the inverse of CBOW by predicting
the context words given a target word.
Example: Given the word "sat," the model will predict the context words
like "The," "cat," "on," "the."
3. Training Process: Both models use a neural network to map words to vectors
(embeddings), and the embeddings are updated through backpropagation based on
prediction accuracy.
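A small sketch of both variants with gensim (the 4.x API is assumed): sg=0 trains the CBOW model and sg=1 the Skip-Gram model. The toy corpus is far too small to learn meaningful vectors and only demonstrates the calls.

```python
# CBOW (sg=0) vs Skip-Gram (sg=1) with gensim; tiny corpus for illustration only.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "and", "a", "dog", "played"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["cat"][:5])                       # first few dimensions of the CBOW embedding
print(skipgram.wv.most_similar("cat", topn=2))  # nearest neighbours under Skip-Gram
```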
Collaborative Filtering:
o Concept: Collaborative filtering recommends items based on past interactions or
ratings of users. It assumes that if users agree on one item, they will also agree on
other items.
o Types:
User-Based: Recommends items by finding similar users.
Item-Based: Recommends items that are similar to items the user has
interacted with.
o Advantages:
Requires no item metadata, since it relies purely on user-item interaction data.
Can capture complex patterns of user preferences.
o Disadvantages:
Suffers from the cold-start problem for new users or items.
Requires large amounts of user data.
Content-Based Filtering:
o Concept: Content-based filtering recommends items based on the features of the
items and the user’s previous interactions. It compares item features like
keywords, categories, or attributes to make recommendations.
o Advantages:
Works well for new items, since recommendations rely on item features rather than interaction history from other users.
Can be highly personalized, focusing on item attributes.
o Disadvantages:
Limited to recommending items similar to what the user has already
interacted with.
May lead to narrow recommendations (lack of diversity).
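A minimal sketch of content-based filtering: items whose descriptions are most similar (by TF-IDF cosine similarity) to an item the user liked are recommended. The item catalogue is invented for illustration.

```python
# Content-based filtering sketch: similarity over item descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "Inception":    "science fiction thriller about dreams and heists",
    "Interstellar": "science fiction drama about space travel and time",
    "Titanic":      "romantic drama set on an ocean liner",
    "The Notebook": "romantic drama about enduring love",
}
names = list(items)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(items.values())
sim = cosine_similarity(tfidf)

liked = "Inception"
i = names.index(liked)
ranked = sorted((j for j in range(len(names)) if j != i), key=lambda j: sim[i, j], reverse=True)
print([names[j] for j in ranked[:2]])   # items most similar to the liked item
```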
55. Analyze different chatbot architectures in NLP, such as rule-based, retrieval-based, and
generative models, assessing their effectiveness based on scalability, response quality, and
adaptability. (10 Marks)
58. Describe five different applications of Natural Language Processing (NLP) in various
fields such as healthcare, finance, customer service, and education. (5 Marks)
Applications of NLP:
1. Healthcare:
Clinical Text Analysis: NLP is used to analyze medical records and
clinical notes to extract relevant information, such as diagnoses, treatment
plans, and drug prescriptions.
Medical Chatbots: NLP-powered chatbots can assist patients by
answering medical questions and providing basic health advice.
2. Finance:
Sentiment Analysis: NLP is used to analyze news articles, social media
posts, and financial reports to gauge market sentiment, helping investors
make informed decisions.
Fraud Detection: NLP can be applied to identify suspicious patterns in
financial transactions by analyzing the text in communication or
transaction details.
3. Customer Service:
Chatbots and Virtual Assistants: NLP-based chatbots help automate
customer support by understanding and responding to customer queries in
real-time.
Email and Ticket Classification: NLP is used to categorize and prioritize
customer service emails and tickets, directing them to the appropriate
support teams.
4. Education:
Automated Grading: NLP is used to evaluate and grade open-ended
responses or essays based on predefined rubrics.
Language Learning: NLP-powered tools help in language learning by
providing context-aware feedback and translating text.
5. E-Commerce:
Product Recommendation: NLP techniques analyze user reviews and
product descriptions to suggest relevant products to customers based on
their preferences and past behaviors.
59. Explain the importance of natural language understanding (NLU) in chatbot
development. (5 Marks)
60. Explain the architecture of the ChatGPT model in Natural Language Processing (NLP).
(5 Marks)
ChatGPT Architecture:
1. Transformer Model: ChatGPT is based on the transformer architecture, which
employs self-attention mechanisms to understand and process text data. The
transformer allows the model to focus on relevant parts of the input sequence,
making it highly effective for natural language tasks.
2. Encoder-Decoder Structure: In earlier models, transformers utilized an encoder-
decoder structure. ChatGPT, however, is based on a decoder-only architecture,
where it generates text by predicting the next word based on the given input
sequence.
3. Pretraining: The model is pretrained on large amounts of text data to learn the
statistical properties of language. This helps ChatGPT understand grammar,
syntax, and context.
4. Fine-tuning: After pretraining, ChatGPT is fine-tuned on specific datasets and
human feedback to improve its conversational abilities and make it more aligned
with user expectations.
5. Response Generation: ChatGPT generates responses using a process called
autoregression, where the model predicts one word at a time, using the previously
generated words as context.
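A minimal sketch of autoregressive generation using the publicly available GPT-2 checkpoint via Hugging Face transformers; this assumes the package is installed and that the model downloads on first use. GPT-2 is a base language model, not ChatGPT itself, so it only illustrates the next-token mechanism described in point 5.

```python
# Autoregressive (next-token) generation with a small GPT-style model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Natural language processing is"
output = generator(prompt, max_new_tokens=20, do_sample=False)
print(output[0]["generated_text"])   # the prompt continued one token at a time
```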
61. What are some of the ways in which data augmentation can be done in NLP projects?
(5 Marks)
67. Can statistical techniques be used to perform the task of machine translation? If so,
explain in brief. (10 Marks)
Text Summarization:
1. Definition: Text summarization is the process of reducing a large body of text
into a shorter version while preserving the key information. It can be done in two
ways:
Extractive Summarization: Extracting important sentences or phrases
directly from the original text and combining them to form a summary.
Abstractive Summarization: Generating a new summary that
paraphrases the original text by understanding its meaning.
2. Multiple Document Text Summarization:
Definition: Multiple document summarization aims to create a concise
summary from several documents on the same topic. The goal is to
combine and filter out redundant information while ensuring all key points
are captured.
Process:
Document Clustering: Group similar documents together based
on their content.
Sentence Extraction: Identify key sentences from each document.
Redundancy Removal: Eliminate duplicated information to
ensure a compact summary.
3. Diagram:
A diagram for multiple document summarization could depict:
Step 1: Multiple documents are fed into the system.
Step 2: Key sentences are extracted from each document.
Step 3: Similar content across documents is identified.
Step 4: Redundant information is removed to form the final
summary.
Abstraction-Based Summarization:
1. Definition: Abstraction-based summarization involves generating a summary that
paraphrases the original text, creating a more natural and concise representation of
the information. This method requires deeper understanding and language
generation.
2. Process:
Understanding the Input: The system processes the input text to extract
meaning rather than simply selecting key sentences.
Paraphrasing: The system then generates a new, shorter version of the
text that captures the essence of the original content.
3. Example:
Original Text: “The rain caused severe flooding across the city. People
had to be rescued from their homes as the water levels rose.”
Abstraction-Based Summary: “Heavy rain led to widespread flooding,
requiring rescue operations.”
4. Key Points:
Language Generation: It is based on language generation techniques that
may include machine learning models like transformers.
Fluency: Abstraction-based summaries tend to be more fluent and natural,
unlike extractive methods that directly copy sentences from the text.
Content-Based Filtering:
1. Advantages:
Personalization: Recommends items similar to those a user has interacted
with, providing a personalized experience.
No Cold Start for Items: New items can be recommended if they have
enough metadata (e.g., genre, description).
Transparency: The reasons for recommending items are clearer because
they are based on the content the user has liked.
2. Disadvantages:
Limited Discovery: Users may only be recommended items similar to
what they’ve seen, leading to a lack of variety.
Dependency on Item Metadata: It requires detailed metadata, which may
not always be available.
Collaborative Filtering:
1. Advantages:
Discover New Items: It can recommend items that the user may never
have come across, expanding their preferences.
No Need for Item Metadata: It doesn’t rely on item characteristics,
making it useful for items with sparse metadata.
2. Disadvantages:
Cold Start Problem: It struggles with new users or new items that lack
sufficient interaction data.
Scalability Issues: As the number of users and items increases,
collaborative filtering models can become computationally expensive.
THANK YOU