
Natural Language Processing (NLP)

Question Bank - All Questions, Module Wise
By Sourav

Module 1 ->

1. Mention two practical applications of NER. (2 Marks)

 Information Extraction: NER is used to extract structured information from


unstructured text, such as identifying company names or dates in news articles.
 Question Answering Systems: NER helps to identify entities such as persons, locations,
or organizations to enhance the accuracy of answers in AI-driven question-answering
systems.

2. With examples, explain the different types of NER attributes. (10 Marks)

 Person (PER): Refers to the identification of people's names, such as "Barack Obama" or
"Albert Einstein."
 Organization (ORG): Identifies names of companies, institutions, and other
organizations. For example, "Google," "Apple," and "United Nations."
 Location (LOC): Refers to places such as countries, cities, or landmarks. Example:
"New York," "India," "Mount Everest."
 Date/Time (DATE/TIME): Identifies expressions of time, such as "Monday," "April
5th," or "2022."
 Money (MONEY): Refers to monetary amounts, like "$500," "₹1000," or "£200."
 Percentage (PERCENT): Identifies percentages, e.g., "5%" or "20 percent."
 Miscellaneous (MISC): Other entities that do not fit into the above categories, like event
names or product names. Example: "World Cup," "iPhone."
3. What do you understand about Natural Language Processing? (2 Marks)

 Natural Language Processing (NLP) is a field of artificial intelligence that focuses on


the interaction between computers and human languages, enabling machines to read,
understand, and generate human language.

4. What are stop words? (2 Marks)

 Stop words are commonly occurring words such as "the," "is," "in," and "and," which are
usually filtered out in text processing because they do not carry significant meaning for
analysis.

5. List any two real-life applications of NLP. (2 Marks)

 Machine Translation: Translating text from one language to another, like Google
Translate.
 Sentiment Analysis: Analyzing social media posts, reviews, or customer feedback to
determine sentiment (positive, negative, neutral).

6. Explain the difference between precision and recall in information retrieval. (5


Marks)

 Precision: Measures the accuracy of the retrieved results. It is the ratio of relevant
documents retrieved to the total number of documents retrieved. Formula:
Precision = True Positives / (True Positives + False Positives)
 Recall: Measures how many relevant documents were retrieved out of all the relevant
documents available. Formula:
Recall = True Positives / (True Positives + False Negatives)
 Difference: Precision focuses on minimizing false positives, while recall focuses on
minimizing false negatives. An ideal system balances both.

7. What is NLTK? (2 Marks)


 NLTK (Natural Language Toolkit) is a Python library used for working with human
language data. It provides tools for text processing, tokenization, classification, and more.

8. What is Multi Word Tokenization? (2 Marks)

 Multi Word Tokenization is the process of breaking a text into meaningful multi-word
units (e.g., "New York," "United States") rather than just individual words.

9. What are stems? (2 Marks)

 Stems are the base or root forms of words obtained by removing prefixes and suffixes.
For example, "running" becomes "run," and "studies" becomes "studi." (Mapping "better"
to "good" is lemmatization rather than stemming.)

10. What are called affixes? (2 Marks)

 Affixes are prefixes, suffixes, infixes, or circumfixes attached to a root word to modify its
meaning. Examples: "un-" (prefix) in "undo," "-ing" (suffix) in "running."

11. What is lexicon? (2 Marks)

 A lexicon is a collection or database of words and their meanings, used by a language


processing system to understand and generate text.

12. Why is Multi-word tokenization preferred over single word tokenization? (2


Marks)

 Multi-word tokenization is preferred because certain concepts or names are represented


by multiple words (e.g., "San Francisco" or "United Nations"), and treating them as a
single token ensures accuracy in understanding.

13. What is sentence segmentation? (2 Marks)


 Sentence segmentation is the process of dividing a text into individual sentences based
on punctuation marks, such as periods, exclamation points, or question marks.

14. Why is sentence segmentation important? (2 Marks)

 Sentence segmentation is crucial because it helps in parsing, understanding context, and


breaking down complex text into manageable parts for further processing.

15. What is morphology in NLP? (2 Marks)

 Morphology in NLP refers to the study of the structure of words, including the study of
prefixes, suffixes, and root forms of words.

16. List the different types of morphology available. (2 Marks)

 Inflectional Morphology: Changes that occur in words to express grammatical features


such as tense, case, or number.
 Derivational Morphology: Changes that create new words by adding prefixes or
suffixes, altering the meaning or part of speech.

17. What is the difference between NLP and NLU? (2 Marks)

 NLP (Natural Language Processing) focuses on enabling machines to process and


analyze human language.
 NLU (Natural Language Understanding) is a subfield of NLP that focuses specifically
on understanding the meaning behind the text, such as semantics and intent.

18. Give some popular examples of Corpus. (2 Marks)

 Corpus Examples:
o Brown Corpus: A collection of texts that covers different genres, such as news
articles and fiction.
o Reuters Corpus: A large collection of news stories used for text classification
tasks.
19. State the difference between word and sentence tokenization. (2 Marks)

 Word Tokenization: Divides a text into individual words.


 Sentence Tokenization: Divides a text into individual sentences.

20. What are the phases of problem-solving in NLP? (5 Marks)

 Text Preprocessing: Tokenization, stemming, stop word removal.


 Feature Extraction: Identifying relevant features from text data.
 Model Building: Applying machine learning algorithms to solve specific tasks like
classification or translation.
 Post-processing: Enhancing the output based on the results of the model.

21. Explain the process of word tokenization with an example. (5 Marks)

Word tokenization is the process of splitting a sentence or text into individual words or tokens. This is a
fundamental step in Natural Language Processing (NLP) that enables further text analysis.

Process of Word Tokenization:

1. Input Text: The text, such as a sentence, is provided for tokenization.


Example: "I love NLP!"
2. Splitting by Spaces: The text is split based on spaces and punctuation marks to separate words.
3. Handling Punctuation: Punctuation marks are either treated as separate tokens or attached to
words. For example, "I love NLP!" becomes ["I", "love", "NLP", "!"].
4. Token Output: The result is a list of words or tokens that can be used for further analysis.

Example:

 Sentence: "I love NLP."


 Tokens: ["I", "love", "NLP", "."]

Importance:

 Tokenization breaks down text into manageable units, which is essential for tasks like text
classification, Named Entity Recognition (NER), and machine learning models.
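
A minimal Python sketch of this process using NLTK's word_tokenize, assuming NLTK is installed and its "punkt" tokenizer data has been downloaded:

import nltk
nltk.download("punkt")                      # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

tokens = word_tokenize("I love NLP!")
print(tokens)                               # expected: ['I', 'love', 'NLP', '!']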

22. How does Named Entity Recognizer work? (5 Marks)

Named Entity Recognition (NER) is a subtask of information extraction that identifies and classifies
named entities in text into predefined categories like persons, organizations, locations, dates, etc.
How NER Works:

1. Preprocessing: The text is first tokenized and processed to identify words, punctuation, and
other structural components.
2. Feature Extraction: Features such as part-of-speech (POS) tags, word shapes (e.g., capitalized
words), and surrounding words are extracted.
3. NER Model: Machine learning models (like CRFs, HMMs, or deep learning models) are used to
classify each token in the sentence into a named entity or non-entity.
4. Classification: The tokens are classified into categories such as:
o Person: "Barack Obama"
o Location: "New York"
o Organization: "Google"
o Date: "January 1, 2020"

Example:

 Input: "Apple Inc. was founded by Steve Jobs in Cupertino."


 NER Output:
o Organization: "Apple Inc."
o Person: "Steve Jobs"
o Location: "Cupertino"
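
A minimal sketch of the same example with spaCy, assuming spaCy and its small English pipeline (en_core_web_sm) are installed; the exact labels follow spaCy's scheme (e.g., GPE for geopolitical locations such as cities):

import spacy

nlp = spacy.load("en_core_web_sm")                     # pretrained pipeline with an NER component
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
for ent in doc.ents:
    print(ent.text, ent.label_)                        # e.g. Apple Inc. ORG / Steve Jobs PERSON / Cupertino GPE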

23. What are the benefits of eliminating stop words? Give some examples where
stop word elimination may be harmful. (5 Marks)

Stop words are common words (such as "the", "is", "at", etc.) that are usually filtered out during text
processing because they don’t carry significant meaning for certain tasks like information retrieval or
classification.

Benefits of Eliminating Stop Words:

1. Improved Model Efficiency: Reduces the size of the text data, which can speed up the analysis
and improve processing efficiency.
2. Better Focus on Important Terms: Helps focus on more meaningful words, which are crucial for
understanding the main topics or context.
3. Reduced Noise: Removes common words that may introduce noise into machine learning
models.

Example of Harmful Elimination:

1. Context Loss: In some cases, removing stop words may alter the meaning. For example, “I am
reading a book” becomes “reading book,” which can change the intended meaning.
2. For Sentiment Analysis: Words like "not" or "very" can significantly alter sentiment (e.g., "not
good" vs. "good"). Removing such words can lead to misinterpretation.
24. What do you mean by RegEx? Explain with example. (5 Marks)

Regular Expressions (RegEx) are sequences of characters that form search patterns. They are used for
pattern matching within strings, allowing tasks like search, replace, or validate data formats.

Key Concepts in RegEx:

1. Pattern Matching: Defines a search pattern to find specific character sequences in strings.
2. Special Characters:
o ".": Matches any single character.
o "*": Matches zero or more occurrences of the preceding character.
o "^": Matches the beginning of a string.
o "$": Matches the end of a string.

Example:

 Regex: \d{3}-\d{2}-\d{4}
o Explanation: This matches a pattern like "123-45-6789" (social security number format).
o Input: "My SSN is 123-45-6789."
o Matches: "123-45-6789"
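
A minimal sketch of the same pattern with Python's re module:

import re

pattern = r"\d{3}-\d{2}-\d{4}"                # the SSN-style pattern from above
text = "My SSN is 123-45-6789."
match = re.search(pattern, text)
if match:
    print(match.group())                      # 123-45-6789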

25. Explain Dependency Parsing in NLP? (5 Marks)

Dependency Parsing is a technique in NLP that establishes relationships between words in a sentence,
where the words are connected based on grammatical structure. Each word depends on another word,
forming a tree-like structure.

How Dependency Parsing Works:

1. Sentence Input: A sentence like "I love Natural Language Processing."


2. Word Relationships: Identify the dependency relationships between words. For example, "love"
is the root verb, and "I" is the subject of "love."
3. Parse Tree Construction: A tree is created where each word points to its head word, forming
parent-child relationships.
o Root: "love"
o Subject: "I" → "love"
o Object: "Processing" → "love"
Importance:

 Understanding Sentence Structure: It helps understand syntactic structures, which is useful for
tasks like machine translation and question answering.

26. Write a regular expression to represent a set of all strings over {a, b} of even
length. (5 Marks)

To create a regular expression for all strings over the alphabet {a, b} of even length, we need to ensure
that every string consists of pairs of characters.

Regular Expression:

 Regex: ^(aa|bb|ab|ba)*$

Explanation:

 aa, bb, ab, and ba represent valid pairs of characters.


 * means zero or more repetitions of these pairs.
 ^ and $ ensure that the entire string is considered from start to end.

Example:

 Valid strings: "aa", "bb", "abab", "aabb"


 Invalid strings: "a", "b", "aba", "abbba" (all of odd length)

27. Write a regular expression to represent a set of all strings over {a, b} of
length 4 starting with an a. (5 Marks)

Regular Expression:

 Regex: ^a(b|a){3}$

Explanation:

 ^a: The string must start with the character "a."


 (b|a){3}: Followed by exactly 3 characters, each of which can be either "a" or "b."
 $: End of the string.
Example:

 Valid strings: "aaaa", "abaa", "abba", "abbb"


 Invalid strings: "baba" (does not start with "a"), "aaa" (not of length 4)

28. Write a regular expression to represent a set of all strings over {a, b}
containing at least one 'a'. (5 Marks)

Regular Expression:

 Regex: ^(.*a.*)$

Explanation:

 .*: Matches zero or more occurrences of any character (a or b).


 a: Ensures that at least one 'a' is present somewhere in the string.
 ^ and $: Anchor the pattern so that the entire string is matched.

Example:

 Valid strings: "a", "ab", "ba", "baba"


 Invalid strings: "b", "bb"
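
A quick sanity check of the three patterns from questions 26 to 28, using Python's re.fullmatch on a few sample strings (a small illustrative test, not an exhaustive one):

import re

even_length   = re.compile(r"^(aa|bb|ab|ba)*$")   # question 26: even-length strings over {a, b}
length4_a     = re.compile(r"^a(b|a){3}$")        # question 27: length 4, starting with 'a'
contains_an_a = re.compile(r"^(.*a.*)$")          # question 28: at least one 'a'

for s in ["aa", "abab", "aba", "abbb", "bb", "b"]:
    print(s,
          bool(even_length.fullmatch(s)),
          bool(length4_a.fullmatch(s)),
          bool(contains_an_a.fullmatch(s)))
# e.g. "aba" matches only the 'contains an a' pattern, while "abab" matches all three.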

29. Compare and contrast NLTK and Spacy, highlighting their differences. (5
Marks)

NLTK (Natural Language Toolkit) vs. Spacy:

1. Ease of Use:
o NLTK: More beginner-friendly, with a wide range of tools and libraries.
o Spacy: Designed for real-world applications, with a focus on speed and efficiency.
2. Speed:
o NLTK: Slower, as it is more focused on research and educational use.
o Spacy: Optimized for performance and production-level tasks.
3. Pre-trained Models:
o NLTK: Limited pre-trained models for tasks like NER.
o Spacy: Offers state-of-the-art pre-trained models for tasks like POS tagging, dependency
parsing, and NER.
4. Use Cases:
o NLTK: Ideal for educational purposes, research, and exploration.
o Spacy: Better suited for production systems and applications requiring speed and
scalability.
30. What is a Bag of Words? Explain with examples. (5 Marks)

Bag of Words (BoW) is a simple representation of text data where each unique word is treated as a
feature, ignoring grammar and word order but keeping track of word frequencies.

How it works:

1. Tokenization: Break the text into individual words (tokens).


2. Create Vocabulary: List all unique words in the dataset.
3. Count Frequencies: Count the frequency of each word in each document.

Example:

 Documents:
o Doc 1: "I love NLP"
o Doc 2: "NLP is fun"
 Vocabulary: ["I", "love", "NLP", "is", "fun"]
 BoW Representation:
o Doc 1: [1, 1, 1, 0, 0] ("I"=1, "love"=1, "NLP"=1, "is"=0, "fun"=0)
o Doc 2: [0, 0, 1, 1, 1] ("I"=0, "love"=0, "NLP"=1, "is"=1, "fun"=1)
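
A minimal sketch of the same two-document example in plain Python, with no external libraries:

docs = ["I love NLP", "NLP is fun"]

# Build the vocabulary from all unique words, in order of first appearance.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# Represent each document as a vector of word counts over the vocabulary.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)     # ['I', 'love', 'NLP', 'is', 'fun']
print(vectors)   # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]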

31. Differentiate regular grammar and regular expression. (5 Marks)

Regular Grammar vs. Regular Expression:

1. Definition:
o Regular Grammar: A formal grammar that can be used to describe regular languages,
typically used in computational theory.
o Regular Expression (RegEx): A sequence of characters that forms a search pattern, used
for matching strings in text.
2. Use:
o Regular Grammar: Defines rules to generate strings in a language.
o Regular Expression: Defines patterns to search or match strings.
3. Application:
o Regular Grammar: Often used in theoretical computer science to describe languages.
o Regular Expression: Used in programming for pattern matching, text searching, and
validation.
4. Syntax:
o Regular Grammar: Involves production rules like S → aS | bS | ε.
o Regular Expression: Uses symbols like *, +, ?, etc., for matching text.
32. Describe the word and sentence tokenization steps with the help of an
example. (10 Marks)

Word Tokenization:

1. Input Text: "I love NLP."


2. Splitting by Spaces: The text is split into words using spaces as delimiters.
3. Handling Punctuation: Punctuation is treated as separate tokens.
o Tokens: ["I", "love", "NLP", "."]

Sentence Tokenization:

1. Input Text: "I love NLP. It's amazing!"


2. Sentence Splitting: The text is split into sentences based on punctuation marks like periods,
exclamations, etc.
o Sentences: ["I love NLP.", "It's amazing!"]

Example of Both:

 Input: "I love NLP. It's amazing!"


 Word Tokens: ["I", "love", "NLP", ".", "It's", "amazing", "!"]
 Sentence Tokens: ["I love NLP.", "It's amazing!"]
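
A minimal sketch of both steps with NLTK, assuming its "punkt" models are available; note that NLTK splits the contraction "It's" into "It" and "'s", which differs slightly from the hand-worked tokens above:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "I love NLP. It's amazing!"
print(sent_tokenize(text))   # ["I love NLP.", "It's amazing!"]
print(word_tokenize(text))   # ['I', 'love', 'NLP', '.', 'It', "'s", 'amazing', '!']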

33. How can the common challenges faced in morphological analysis in natural
language processing be overcome? (10 Marks)

Challenges in Morphological Analysis:

1. Ambiguity: Words may have multiple forms or meanings, e.g., "run" can be a verb or noun.
2. Rich Morphology: Languages like Turkish or Finnish have many forms of a word based on
inflections.
3. Out-of-Vocabulary Words: New or rare words that don’t appear in dictionaries.

Solutions:

1. Use of Morphological Analyzers: Tools like the Porter Stemmer or Snowball Stemmer can
handle many common inflections.
2. Contextual Analysis: Implementing context-based analysis to disambiguate word meanings,
such as using POS tagging.
3. Machine Learning Models: Train models on large datasets to predict rare word forms and
handle out-of-vocabulary words.
4. Lexical Resources: Use comprehensive lexicons or dictionaries for handling rich morphology in
specific languages.
34. Derive Minimum Edit Distance Algorithm and compute the minimum edit
distance between the words "MAM" and "MADAM". (10 Marks)

Minimum Edit Distance Algorithm:

The Minimum Edit Distance algorithm computes the minimum number of operations (insertions,
deletions, or substitutions) required to convert one string into another.

1. Initialization: Create a matrix where the cell (i, j) represents the edit distance between the
first i characters of the first word and the first j characters of the second word.
2. Recurrence Relation:
o If the characters are equal: cost(i, j) = cost(i-1, j-1)
o If the characters are different: cost(i, j) = 1 + min(cost(i-1, j), cost(i,
j-1), cost(i-1, j-1))
3. Result: The bottom-right cell of the matrix will contain the minimum edit distance.

Example:

 Words: "MAM" and "MADAM"


 DP Matrix (rows = "MAM", columns = "MADAM", including the empty-string row and column):

        ""   M   A   D   A   M
   ""    0   1   2   3   4   5
   M     1   0   1   2   3   4
   A     2   1   0   1   2   3
   M     3   2   1   1   2   2

 The bottom-right cell gives the minimum edit distance = 2 (insert 'D' and insert 'A' to turn "MAM" into "MADAM").
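
A minimal dynamic-programming sketch of the algorithm in Python, with unit costs for insertion, deletion, and substitution:

def min_edit_distance(source: str, target: str) -> int:
    m, n = len(source), len(target)
    # dp[i][j] = edit distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all remaining source characters
    for j in range(n + 1):
        dp[0][j] = j                              # insert all remaining target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution (or match)
    return dp[m][n]

print(min_edit_distance("MAM", "MADAM"))   # 2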

35. Discuss the problem-solving approaches of any two real-life applications of


Information Extraction and NER in Natural Language Processing. (10 Marks)

Applications of Information Extraction (IE) and NER:

1. Medical Data Extraction:


o Approach: NER is used to extract named entities like drug names, diseases, and patient
information from medical documents and research papers.
o Challenges: Medical terms can be complex, and synonyms must be handled carefully.
o Solution: Implementing domain-specific models and fine-tuning pre-trained NER
systems.
2. News Article Categorization:
o Approach: IE and NER are used to extract relevant entities like locations, organizations,
and dates from news articles.
o Challenges: Handling ambiguous terms and varied formats in news articles.
o Solution: Building custom NER models for each category and using post-processing
techniques to refine results.

36. How to solve any application of NLP? Justify with an example. (10 Marks)

Solving an NLP Application:

1. Define the Problem:


o Understand the problem and the type of NLP task (e.g., text classification, sentiment
analysis, named entity recognition, etc.).
o Example: Sentiment Analysis for customer reviews.
2. Data Collection:
o Collect relevant datasets. For sentiment analysis, this could be customer reviews from
an e-commerce platform.
3. Preprocessing:
o Clean the text by removing noise such as special characters, stop words, and irrelevant
data.
o Tokenize the text, convert to lowercase, and lemmatize the words.
o Example: "I love this product!" becomes ["love", "product"].
4. Feature Extraction:
o Extract meaningful features from the text using techniques such as Bag of Words (BoW),
TF-IDF, or word embeddings (Word2Vec).
o Example: Convert the processed text into a feature vector.
5. Model Building:
o Choose an appropriate model (e.g., Naive Bayes, Support Vector Machines, or Neural
Networks).
o Train the model using the labeled data (positive/negative reviews).
6. Evaluation:
o Evaluate the model’s performance using metrics such as accuracy, precision, recall, and
F1-score.
o Example: Model predicts positive/negative sentiment for reviews and is tested using a
validation set.
7. Deployment:
o Once the model is validated, deploy it in a real-world environment (e.g., a web
application that automatically analyzes customer feedback).

37. What is Corpora? Define the steps of creating a corpus for a specific task. (10
Marks)
Corpora:

 Definition: A corpus is a large and structured set of texts that is used for statistical analysis,
natural language processing, and machine learning applications. It provides a foundation for
training NLP models.

Steps for Creating a Corpus:

1. Identify the Task:


o Clearly define the NLP task (e.g., named entity recognition, machine translation,
sentiment analysis).
2. Collect Text Data:
o Gather text data relevant to the task from sources such as websites, books, news
articles, or social media.
3. Text Cleaning and Preprocessing:
o Clean the data by removing noise, irrelevant information, and non-text elements.
o Tokenize, remove stop words, and standardize the format (e.g., lowercase, punctuation
handling).
4. Annotation:
o For supervised learning, annotate the data with the necessary labels (e.g., for sentiment
analysis, label reviews as positive or negative).
5. Formatting:
o Organize the corpus into a structured format, such as CSV, JSON, or XML, depending on
the application.
6. Validation:
o Verify the accuracy of the annotations and the relevance of the text data for the task.
7. Corpus Augmentation:
o Augment the corpus by adding additional data or using techniques like data
augmentation (e.g., paraphrasing or back-translation).

38. What is Information Extraction? (5 Marks)

Information Extraction (IE) is the process of automatically extracting structured information from
unstructured text. It aims to identify specific entities, relationships, and events within the text to convert
it into a structured format.

Key Tasks:

1. Named Entity Recognition (NER): Identifying proper nouns like names of people, places,
organizations, dates, etc.
2. Relationship Extraction: Identifying relationships between entities (e.g., "Alice works at XYZ
Corp").
3. Event Extraction: Identifying events and their participants (e.g., "Bob met Alice on January 1st").
Example:

 Text: "Apple announced its new iPhone in San Francisco on September 10, 2020."
 Extracted Information:
o Entity: "Apple" (organization)
o Event: "announced"
o Location: "San Francisco"
o Date: "September 10, 2020"

39. State the different applications of Sentiment analysis and Opinion mining
with examples. Write down the variations as well. (10 Marks)

Applications of Sentiment Analysis and Opinion Mining:

1. Customer Feedback Analysis:


o Sentiment analysis is used to analyze customer reviews and feedback to determine
overall satisfaction.
o Example: Analyzing restaurant reviews to gauge customer sentiments ("Positive: Good
food", "Negative: Slow service").
2. Social Media Monitoring:
o Analyzing social media platforms like Twitter or Facebook for public sentiment on
brands, products, or topics.
o Example: Tracking sentiment around a product launch ("Excited" vs "Disappointed").
3. Market Research:
o Opinion mining is used to analyze consumer opinions on products or services for market
insights.
o Example: Analyzing opinions on a newly launched smartphone to assess its market
reception.
4. Political Sentiment Analysis:
o Analyzing public sentiment toward political figures, parties, or policies.
o Example: Analyzing Twitter data to measure public support for a political candidate.
5. Brand Reputation Management:
o Businesses use sentiment analysis to monitor public sentiment toward their brand and
products.
o Example: Monitoring mentions of a brand to detect negative sentiments that might
affect its reputation.

Variations:

1. Fine-Grained Sentiment Analysis: Detects sentiments at a more granular level (e.g., determining
if a customer is happy with specific features).
2. Aspect-Based Sentiment Analysis: Focuses on identifying sentiments about specific aspects
(e.g., "The battery life of the phone is great" vs "The camera quality is poor").
40. State a few applications of Information Retrieval. (5 Marks)

Applications of Information Retrieval (IR):

1. Web Search Engines:


o Example: Google, Bing use IR to retrieve the most relevant web pages based on user
queries.
2. Document Search:
o Example: Legal or academic research where IR helps retrieve documents based on
keywords or phrases.
3. Recommendation Systems:
o Example: Platforms like Amazon and Netflix use IR techniques to recommend products
or movies based on user preferences.
4. Medical Information Retrieval:
o Example: Searching medical databases to retrieve relevant articles or papers based on
specific queries.

41. What is text normalization? (10 Marks)

Text Normalization:

Text normalization is the process of transforming text into a standardized format to reduce variations
and make it easier to process. It involves tasks like:

1. Lowercasing: Converting all text to lowercase to ensure uniformity.


o Example: "HELLO" becomes "hello".
2. Removing Special Characters: Eliminating punctuation, special characters, or any irrelevant
symbols.
o Example: "Hello!!" becomes "Hello".
3. Stemming: Reducing words to their base or root form.
o Example: "running" becomes "run".
4. Lemmatization: Similar to stemming but considers the word’s context to convert it to a proper
root form.
o Example: "better" becomes "good".
5. Expanding Contractions: Replacing contractions like "don’t" with "do not".
6. Removing Stop Words: Eliminating common words that do not add meaningful information.
o Example: "The cat sat on the mat" becomes "cat sat mat".
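
A minimal normalization sketch in Python; the contraction map and stop-word list here are small illustrative stand-ins rather than complete resources, and stemming or lemmatization could be added on top with a library such as NLTK:

import re

STOP_WORDS = {"the", "is", "in", "on", "a", "an", "and"}
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "it's": "it is"}

def normalize(text):
    text = text.lower()                                   # lowercasing
    for short, full in CONTRACTIONS.items():              # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", "", text)                  # remove punctuation/special characters
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]     # remove stop words

print(normalize("The cat SAT on the mat!"))   # ['cat', 'sat', 'mat']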

42. Do you think any differences present between tokenization and


normalization? Justify your answer with examples. (10 Marks)
Differences between Tokenization and Normalization:

1. Tokenization:
o Definition: The process of splitting text into smaller units, such as words, phrases, or
sentences.
o Purpose: Helps break down text into manageable chunks for further analysis.
o Example:
 Input: "I love NLP!"
 Tokens: ["I", "love", "NLP", "!"]
2. Normalization:
o Definition: The process of standardizing text by converting it into a uniform format.
o Purpose: Helps reduce variations by making the text consistent for analysis.
o Example:
 Input: "I'm loving NLP!"
 Normalized: "I am loving NLP"

Justification:

 Tokenization deals with breaking text into units, while Normalization ensures uniformity by
addressing case, punctuation, and other inconsistencies.

43. What makes part-of-speech (POS) tagging crucial in NLP, in your opinion?
Give an example to back up your response. (5 Marks)

Importance of Part-of-Speech (POS) Tagging in NLP:

1. Understanding Sentence Structure:


o POS tagging is essential for understanding the syntactic structure of a sentence, as it
helps identify the role of each word.
o It aids in identifying nouns, verbs, adjectives, etc., which is necessary for tasks like
parsing, machine translation, and sentiment analysis.
2. Improves Accuracy in NLP Tasks:
o POS tagging is critical for various applications such as Named Entity Recognition (NER),
machine translation, and information retrieval, as it helps determine the grammatical
structure.
3. Example:
o Sentence: "The bank is near the river."
o POS Tags: "The" (Determiner), "bank" (Noun), "is" (Verb), "near" (Preposition), "the"
(Determiner), "river" (Noun).
o Without POS tagging, "bank" could be interpreted as a financial institution or the side of
a river, and understanding its meaning depends on its context in the sentence.

44. Criticize the shortcomings of the fundamental Top-Down Parser. (5 Marks)


Shortcomings of the Top-Down Parser:

1. Inefficiency in Handling Ambiguity:


o Top-down parsers struggle when faced with ambiguous sentences because they attempt
to expand all possible rules first, which leads to excessive backtracking.
2. Excessive Search Space:
o It searches for all possible derivations of the sentence, resulting in a large search space.
This is computationally expensive and inefficient, especially for complex sentences.
3. Lack of Robustness:
o Top-down parsers may fail if the input sentence does not match the grammar exactly.
This often leads to incorrect parsing or failure to parse the sentence.
4. Example:
o For the sentence "I saw the man with a telescope," a top-down parser may misinterpret
the sentence structure due to ambiguity in "with a telescope." The parser would need to
try different rule applications, leading to inefficiency.

45. Do you believe there are any distinctions between prediction and
classification? Illustrate with an example. (5 Marks)

Differences Between Prediction and Classification:

1. Prediction:
o Definition: Prediction refers to estimating a continuous value based on input features. It
typically involves regression models where the output is a real number.
o Example: Predicting the temperature for tomorrow based on historical data.
2. Classification:
o Definition: Classification is the process of categorizing input data into predefined classes
or categories. It involves assigning labels to data points.
o Example: Classifying emails as either "spam" or "not spam."

Key Distinction:

 Prediction deals with continuous values, while classification deals with discrete labels or
categories.

46. Explain the connection between word tokenization and phrase tokenization
using examples. How do both tokenization methods contribute to the
development of NLP applications? (10 Marks)

Connection Between Word Tokenization and Phrase Tokenization:

1. Word Tokenization:
o Definition: The process of breaking down a text into individual words.
o Purpose: It’s often the first step in NLP preprocessing, allowing the model to understand
and process each word independently.
o Example:
 Sentence: "I love NLP."
 Tokens: ["I", "love", "NLP"]
2. Phrase Tokenization:
o Definition: The process of grouping multiple words into meaningful units or phrases,
often to preserve context and meaning.
o Purpose: Useful for recognizing multi-word expressions like named entities, common
phrases, or keywords.
o Example:
 Sentence: "Natural Language Processing is fascinating."
 Phrases: ["Natural Language Processing", "fascinating"]

Connection and Contribution to NLP Applications:

 Word Tokenization focuses on breaking down the sentence into words, while Phrase
Tokenization focuses on understanding and grouping important multi-word units (such as
entities, technical terms, etc.).
 Both tokenization techniques contribute to NLP applications such as Named Entity Recognition
(NER), machine translation, sentiment analysis, and information retrieval by ensuring
meaningful segmentation of the text, preserving context, and improving accuracy.

47. “Natural Language Processing (NLP) has many real-life applications across
various industries.”- List any two real-life applications of Natural Language
Processing. (5 Marks)

Two Real-life Applications of NLP:

1. Customer Support Automation:


o NLP is used in chatbots and virtual assistants to handle customer queries automatically
and improve user experience.
o Example: Chatbots in customer service to respond to FAQs and troubleshoot issues,
reducing the need for human intervention.
2. Medical Diagnosis and Documentation:
o NLP is used to process medical texts like clinical notes, research papers, and patient
records. It helps in extracting meaningful information, improving diagnosis, and
organizing medical records.
o Example: Extracting symptoms and diseases from patient records to assist healthcare
professionals.
48. "Find all strings of length 5 or less in the regular set represented by the
following regular expressions:

(a) (ab + a)(aa + b)


(b) (ab + b*a)*a" (5 Marks)

Solution:

1. (ab + a)(aa + b):

o This represents strings consisting of "ab" or "a", followed by either "aa" or "b".
o The language therefore contains exactly four strings, all of length 5 or less:
 "ab", "aaa", "abb", "abaa"
2. (ab + b*a)*a:
o This represents zero or more blocks, each of which is either "ab" or zero or more "b"s
followed by an "a", with a final "a" at the end.
o Strings of length 5 or less:
 Length 1: "a"; Length 2: "aa"; Length 3: "aaa", "aba", "baa"
 Length 4: "aaaa", "aaba", "abaa", "baaa", "bbaa"
 Length 5: "aaaaa", "aaaba", "aabaa", "abaaa", "ababa", "abbaa", "baaaa", "baaba", "babaa", "bbaaa", "bbbaa"

49. "Write regular expressions for the following languages.

1. The set of all alphabetic strings;


2. The set of all lowercase alphabetic strings ending in a b;
3. The set of all strings from the alphabet a, b such that each a is immediately preceded by and
immediately followed by a b" (10 Marks)

Solutions:

1. Set of all alphabetic strings:


o Regular Expression: ^[a-zA-Z]+$
2. Set of all lowercase alphabetic strings ending in a b:
o Regular Expression: ^[a-z]*b$
3. Set of all strings from the alphabet a, b such that each a is immediately preceded by and
immediately followed by a b:
o Regular Expression: ^b+(ab+)*$ (strings consisting only of "b"s satisfy the condition
vacuously; the empty string can also be allowed with ^(b+(ab+)*)?$)

50. Explain Rule-based POS tagging. (5 Marks)

Rule-based Part-of-Speech (POS) Tagging:

1. Definition:
o Rule-based POS tagging uses a set of handcrafted rules to assign POS tags to words in a
sentence. These rules are based on the word's context, surrounding words, and
syntactic patterns.
2. Process:
o A word is assigned a tag based on predefined patterns, such as if a word follows a
determiner (DT), it is tagged as a noun (NN).
o Example: In the sentence "The cat is on the mat," "The" would be tagged as a
determiner (DT), "cat" as a noun (NN), and "is" as a verb (VB).
3. Advantages:
o Simple and interpretable.
o Can be highly accurate with a well-defined set of rules.
4. Disadvantages:
o Rules must be manually created, which can be time-consuming.
o Limited flexibility to handle ambiguities in language.
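
A minimal rule-based tagging sketch using NLTK's RegexpTagger, where each entry is a handcrafted pattern tried in order; the deliberately crude rules also show the limitation noted above (here "is" gets mis-tagged as a plural noun):

from nltk.tag import RegexpTagger

patterns = [
    (r".*ing$", "VBG"),                 # gerunds
    (r".*ed$", "VBD"),                  # simple past
    (r"^(The|the|A|a|An|an)$", "DT"),   # determiners
    (r".*s$", "NNS"),                   # plural nouns
    (r".*", "NN"),                      # default: tag everything else as a noun
]
tagger = RegexpTagger(patterns)
print(tagger.tag(["The", "cat", "is", "sleeping"]))
# [('The', 'DT'), ('cat', 'NN'), ('is', 'NNS'), ('sleeping', 'VBG')]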

51. Differentiate regular grammar and regular expression (5 Marks)

Regular Grammar:

 Definition: A formal grammar that generates a regular language. It consists of production rules
where the left-hand side is a non-terminal and the right-hand side is either a terminal or a
combination of terminal and non-terminal symbols.
 Example:
o Production: S → aS | b
 Usage: Used for defining the structure of languages in automata theory.

Regular Expression:

 Definition: A sequence of characters that defines a search pattern, primarily for string matching
within texts. It is more compact and concise for pattern matching.
 Example:
o Regular expression: a(b|a)*
 Usage: Used in text searching, pattern matching, and string manipulation tasks.

Key Differences:

 Concept: Regular grammar is used to generate strings, while regular expressions are used to
match patterns in strings.
 Flexibility: Regular expressions provide more succinct and flexible ways to represent patterns
than regular grammar.

52. What is NLTK? (2 Marks)


 Definition: The Natural Language Toolkit (NLTK) is a Python library used for working with human
language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources,
along with tools for text processing, classification, tokenization, stemming, tagging, and parsing.
 Usage: It's widely used in NLP research and education for text analysis, linguistic data
processing, and building NLP applications.

53. What is Multi Word Tokenization? (2 Marks)

 Definition: Multi-Word Tokenization refers to the process of identifying and grouping multiple
words that together represent a single unit or meaning, such as named entities (e.g., "New
York") or multi-word expressions (e.g., "high school").
 Example:
o Tokenizing "New York City" as a single token rather than three separate words "New,"
"York," and "City."

54. What is sentence segmentation? (2 Marks)

 Definition: Sentence segmentation is the process of dividing a text into individual sentences. It
involves identifying sentence boundaries, typically by recognizing punctuation marks such as
periods, question marks, and exclamation points.
 Example:
o Given the text: "Hello! How are you? I am fine.", the segmentation would be: ["Hello!",
"How are you?", "I am fine."]

55. What is morphology in NLP? (2 Marks)

 Definition: Morphology is the study of the structure and form of words. In NLP, it refers to the
process of analyzing and understanding the internal structure of words, including their prefixes,
suffixes, and root forms.
 Example: The word "running" can be broken down into the root word "run" and the suffix "-
ing."

56. Give some popular examples of Corpus. (2 Marks)

 Examples:
1. Brown Corpus: A collection of texts representing a wide variety of written genres.
2. WordNet: A lexical database of English, where words are grouped into sets of
synonyms.
3. Reuters-21578: A collection of news documents, commonly used for text classification
tasks.

57. What do you mean by word tokenization? (2 Marks)

 Definition: Word tokenization is the process of splitting a text into individual words or tokens.
This is often the first step in text preprocessing for NLP tasks.
 Example:
o Sentence: "I love NLP."
o Tokens: ["I", "love", "NLP"]

58. Find the minimum edit distance between two strings ELEPHANT and
RELEVANT? (10 Marks)

Solution:

The minimum edit distance between two strings is calculated using the Levenshtein Distance algorithm,
which finds the minimum number of operations (insertions, deletions, and substitutions) required to
transform one string into another.

1. Strings:
o ELEPHANT
o RELEVANT
2. Operations:
o Insert, delete, or substitute characters to transform one string into the other.
3. Calculation:
o Using dynamic programming, the minimum edit distance between "ELEPHANT" and
"RELEVANT" is calculated as 3 (for example: insert 'R' at the beginning, substitute 'P'
with 'V', and delete 'H').

59. If str1 = "SUNDAY" and str2 = "SATURDAY" is given, calculate the


minimum edit distance between the two strings. (10 Marks)

Solution:

1. Strings:
o SUNDAY
o SATURDAY
2. Operations:
o Insertions, deletions, or substitutions are required to convert one string into the other.
3. Minimum Edit Distance:
o The minimum edit distance between "SUNDAY" and "SATURDAY" is 3 (insert 'A', insert
'T', and substitute 'N' with 'R').

60. List the different types of morphology available. (5 Marks)

Types of Morphology:

1. Inflectional Morphology:
o Deals with the modification of words to express different grammatical categories such
as tense, case, or number.
o Example: "running" (verb) → "runs" (third-person singular).
2. Derivational Morphology:
o Involves the creation of new words by adding prefixes or suffixes to existing words.
o Example: "happy" → "unhappy" (prefix), "teach" → "teacher" (suffix).
3. Compounding:
o The process of combining two or more words to create a new word.
o Example: "toothbrush" (tooth + brush).
4. Clitics:
o Words or morphemes that cannot stand alone but attach to other words (e.g.,
contractions).
o Example: "I'll" (I + will), "he's" (he + is).

61. What is Stemming? (2 Marks)

 Definition: Stemming is the process of reducing a word to its root form by removing affixes
(prefixes or suffixes).
 Example: "running" → "run", "happier" → "happi".

62. What is Corpus in NLP? (2 Marks)

 Definition: A corpus is a large and structured set of texts used for linguistic analysis or as
training data for NLP models. It contains a collection of written or spoken texts, often annotated
with additional linguistic information.
 Example: The Brown Corpus, containing a diverse collection of text from different genres.

63. State with example the difference between stemming and lemmatization. (5
Marks)
Stemming:

 Definition: The process of removing affixes from words to get the root form, which might not
always be a valid word.
 Example: "running" → "run", "studies" → "studi" (the output may not be a valid dictionary word).

Lemmatization:

 Definition: The process of reducing a word to its base or dictionary form, called the lemma. It
considers the word's context and part of speech.
 Example: "better" → "good", "running" → "run".

Key Differences:

 Stemming may produce non-dictionary words, while lemmatization always results in a valid
word.
 Lemmatization considers the word's meaning and context, whereas stemming only focuses on
removing affixes.
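
A minimal comparison sketch with NLTK, assuming the WordNet corpus has been downloaded for the lemmatizer; the lemmatizer needs a part-of-speech hint to map "better" to "good":

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                      # 'run'   (suffix stripping)
print(stemmer.stem("studies"))                      # 'studi' (not a dictionary word)
print(lemmatizer.lemmatize("studies", pos="n"))     # 'study'
print(lemmatizer.lemmatize("better", pos="a"))      # 'good'  (uses WordNet and POS context)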

64. Write down the different stages of NLP pipeline. (10 Marks)

Stages of NLP Pipeline:

1. Text Acquisition:
o Collecting text data from various sources such as websites, documents, or databases.
2. Text Preprocessing:
o Includes tasks like tokenization, removing stop words, punctuation, and special
characters.
o Example: "Hello, World!" → ["Hello", "World"]
3. Part-of-Speech Tagging:
o Assigning POS tags (like noun, verb, adjective) to each token.
4. Named Entity Recognition (NER):
o Identifying entities such as names, dates, or locations in the text.
5. Syntactic Parsing:
o Analyzing the syntactic structure of the sentence to understand grammar relationships.
6. Semantic Analysis:
o Understanding the meaning of the text using techniques like sentiment analysis or word
embeddings.
7. Text Generation:
o Generating new text, such as summarization or text completion, based on the processed
data.
8. Output:
o The final output such as translated text, summarized text, or chatbot responses.
65. What is your understanding about Chatbot in the context of NLP? (10
Marks)

 Definition: A chatbot is an AI-powered application that simulates human conversation through


text or voice interactions. It uses Natural Language Processing (NLP) to interpret user inputs and
generate appropriate responses.

Key Components of a Chatbot:

1. Natural Language Understanding (NLU):


o The chatbot processes the user’s input to understand the meaning. It involves tasks like
intent detection and entity recognition.
o Example: "What's the weather today?" → Intent: Weather Inquiry, Entity: Today
2. Dialogue Management:
o Handles the flow of conversation, deciding the next best response based on the context
and user history.
o Example: After answering a weather query, the chatbot may ask, "Would you like to
know more?"
3. Natural Language Generation (NLG):
o Converts structured data into natural language responses.
o Example: Given a weather report, the chatbot might respond, "The weather today is
sunny with a high of 25°C."

Types of Chatbots:

 Rule-based: Follows predefined scripts and rules to respond to queries.


 AI-based: Uses machine learning and NLP models to generate responses and adapt over time.

66. Write short note on text pre-processing in the context of NLP. Discuss
outliers and how to handle them (10 Marks)

 Text Preprocessing: It refers to the series of steps taken to clean and prepare raw text data for
further analysis or modeling. Preprocessing ensures that the data is in a usable form for machine
learning algorithms.

Steps in Text Preprocessing:

1. Tokenization: Splitting text into individual words or sentences.


2. Lowercasing: Converting all characters to lowercase to ensure uniformity.
3. Removing Stop Words: Stop words (e.g., "the", "is", "in") are removed as they do not contribute
to meaningful analysis.
4. Stemming/Lemmatization: Reducing words to their base form.
5. Removing Punctuation: Removing symbols like commas, periods, etc., that do not add value in
analysis.
Outliers in Text Data:

 Definition: Outliers are unusual or inconsistent data points that deviate significantly from the
rest of the dataset.
 Handling Outliers:
o Removing Outliers: Remove text entries that are too short or too long.
o Normalizing Outliers: Transforming outliers into a more standardized format, such as
lowercasing or fixing spelling mistakes.

67. Explain with example the challenges with sentence tokenization. (5 Marks)

 Sentence Tokenization: The process of dividing a block of text into individual sentences.

Challenges:

1. Punctuation Ambiguities:
o Example: "I saw the movie. It was great." can be easily tokenized into two sentences.
But, "Dr. Smith is a doctor." could be incorrectly split at the period.
2. Abbreviations:
o Abbreviations like "U.S." or "e.g." could cause incorrect sentence boundary detection.
They don’t indicate the end of a sentence.
3. Complex Sentence Structures:
o Sentences with quotes, parentheses, or embedded clauses can confuse tokenizers.
Example: "He said, 'I will help you later.'" might be incorrectly tokenized.
4. Multilingual Issues:
o Some languages like Chinese do not have explicit sentence-ending punctuation, making
sentence segmentation difficult.

68. Explain some of the common NLP tasks. (5 Marks)

Common NLP Tasks:

1. Tokenization:
o Splitting text into smaller units like words, sentences, or sub-words.
o Example: "I love NLP" → ["I", "love", "NLP"]
2. Part-of-Speech (POS) Tagging:
o Assigning grammatical categories (noun, verb, adjective) to words in a sentence.
o Example: "The cat sleeps." → [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
3. Named Entity Recognition (NER):
o Identifying and classifying named entities (e.g., person names, locations, dates).
o Example: "Barack Obama was born in Hawaii." → [("Barack Obama", "PERSON"),
("Hawaii", "LOCATION")]
4. Sentiment Analysis:
o Determining the sentiment (positive, negative, neutral) of a given text.
o Example: "I love this product!" → Positive sentiment.
5. Machine Translation:
o Translating text from one language to another.
o Example: "Bonjour" → "Hello" (French to English).

69. What do you mean by text extraction and cleanup? Discuss with examples.
(10 Marks)

 Text Extraction: The process of extracting relevant pieces of text from a larger corpus or
document. It involves identifying and retrieving specific information.

Text Extraction Example:

 Extracting all email addresses from a document.


o Example: From the text "Contact us at support@example.com", extracting
"support@example.com".
 Text Cleanup: Refers to removing unnecessary or irrelevant data from the text, such as HTML
tags, special characters, or whitespace, to make the text clean and consistent for analysis.

Text Cleanup Example:

 Original: "Hello!! How are you? <br> Have a good day! <br> "
 Cleaned: "Hello How are you Have a good day"

70. What is word sense ambiguity in NLP? Explain with examples. (5 Marks)

 Word Sense Ambiguity: Occurs when a word has multiple meanings or senses, and determining
the correct sense depends on context.

Example 1:

 The word "bank":


o Sense 1: A financial institution.
o Sense 2: The side of a river.
 Example 2:
o The word "bark":
 Sense 1: The sound a dog makes.
 Sense 2: The outer covering of a tree.
Challenge: Resolving word sense ambiguity is essential for accurate meaning extraction in NLP
tasks like machine translation and text classification.

71. Write short note on Bag of Words (BOW). (10 Marks)

 Bag of Words (BOW): A popular model used for representing text data where each word in a
document is treated as a unique token without considering the order of words. It focuses on the
frequency of words to build a vector representation.

How BOW Works:

1. Tokenization: Split the text into individual words.


2. Vocabulary Construction: Create a vocabulary of all unique words in the document corpus.
3. Vectorization: Represent each document as a vector where each dimension corresponds to a
word in the vocabulary, and the value is the frequency of the word in that document.

Example:

 Document 1: "I love NLP"


 Document 2: "NLP is great"

Vocabulary: ["I", "love", "NLP", "is", "great"]

Document 1: [1, 1, 1, 0, 0]

Document 2: [0, 0, 1, 1, 1]

72. Explain Homonymy with example. (CO1 BL1) 2 Marks

 Homonymy refers to the phenomenon where two or more words share the same spelling
or pronunciation but have different meanings.
 Example:
o "Bank" (a financial institution) vs. "Bank" (the side of a river).
Homonyms can create ambiguity in text, which requires disambiguation in NLP
tasks.

73. Define WordNet. (CO1 BL1) 2 Marks

 WordNet is a large lexical database of the English language, which groups words into
sets of synonyms called synsets.
 Words in WordNet are interlinked through semantic relationships like synonymy,
antonymy, hyponymy, and meronymy, helping machines understand word meanings.
 Example: The synset for the word "dog" includes words like "canine", "pooch", and
"pup".
74. Consider a document containing 100 words wherein the word apple appears 5 times
and assume we have 10 million documents and the word apple appears in one thousandth
of these. Then, calculate the term frequency and inverse document frequency. (CO4 BL5)
10 Marks

 Term Frequency (TF): The number of times a term appears in a document divided by
the total number of terms in the document.
o TF = (5 / 100) = 0.05
 Inverse Document Frequency (IDF): Measures how important a term is in the entire
corpus.
o IDF = log10(Total Documents / Documents containing the word)
o IDF = log10(10,000,000 / 10,000) = log10(1000) = 3
 TF-IDF: TF * IDF = 0.05 * 3 = 0.15
 The term "apple" has a relatively low TF but a moderate IDF, indicating its relevance in
the context of the larger corpus.
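
A quick arithmetic check of the calculation above in Python, using a base-10 logarithm:

import math

tf = 5 / 100                                  # term frequency = 0.05
idf = math.log10(10_000_000 / 10_000)         # inverse document frequency = 3.0
print(tf * idf)                               # 0.15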

75. Explain the relationship between Singular Value Decomposition, Matrix Completion,
and Matrix Factorization. (CO1 BL3) 5 Marks

 Singular Value Decomposition (SVD) is a matrix factorization technique that


decomposes a matrix into three components: U (left singular vectors), Σ (diagonal matrix
of singular values), and V (right singular vectors).
 Matrix Completion involves filling in missing values in a matrix. SVD can be used to
approximate the missing values by low-rank matrix completion.
 Matrix Factorization is a broader technique that expresses a matrix as the product of
two or more smaller matrices. SVD is one approach for matrix factorization.
 In NLP, these techniques are used in collaborative filtering (e.g., recommendation
systems) where user-item matrices have missing values that need to be filled.

76. Give two examples that illustrate the significance of regular expressions in NLP. (CO1
BL1) 5 Marks

1. Text Cleaning: Regular expressions are used to remove unwanted characters (e.g.,
punctuation, special symbols) from raw text data to clean the input before processing it.
Example: Removing digits or punctuation from text: re.sub(r"[^a-zA-Z\s]", "",
text).
2. Tokenization: Regular expressions can help in splitting a text into words or phrases
based on specific patterns.
Example: Tokenizing a sentence into words using a regular expression pattern like
r"\b\w+\b".

77. Why is multiword tokenization preferable over single word tokenization in NLP? Give
examples. (CO1 BL1) 5 Marks

 Multiword Tokenization is important because many phrases in natural language have


meanings that cannot be inferred from individual words alone.
 Example 1: "New York" should be tokenized as a single entity (named location) rather
than two separate tokens ("New" and "York").
 Example 2: "United States" should be recognized as a single unit, not two tokens.
 Reason: Multiword tokenization helps maintain the semantic meaning of phrases and
improves the accuracy of downstream tasks like Named Entity Recognition (NER) and
machine translation.

78. Differentiate between formal language and natural language. (CO3 BL1) 10 Marks

 Formal Language:
1. A formal language is a set of strings that follow a specific syntactic structure
defined by a set of rules or grammar (e.g., programming languages, mathematical
expressions).
2. It is unambiguous and precise, allowing machines to process it without errors.
3. Examples include languages like C, Python, and SQL.
 Natural Language:
1. A natural language is a language that has evolved naturally among humans for
communication (e.g., English, Spanish, Hindi).
2. It is often ambiguous and subject to nuances, slang, idioms, and cultural context,
which makes it harder for machines to process.
3. Examples include English, Bengali, Chinese.
 Key Differences:
1. Formal languages are syntactically rigid, while natural languages are flexible.
2. Natural languages can be ambiguous, while formal languages aim to eliminate
ambiguity.

79. Explain lexicon, lexeme and the different types of relations that hold between lexemes.
(CO1 BL1) 10 Marks

 Lexicon:
o A lexicon is a collection or dictionary of all the words in a language, along with
information about their meanings, pronunciations, and grammatical properties.
o In NLP, a lexicon helps machines understand and process words more effectively.
 Lexeme:
o A lexeme is a fundamental unit of meaning in the lexicon, representing a set of
forms related to a single word.
o It is an abstract unit, like the word "run" which includes variations like "runs,"
"ran," "running."
 Relations Between Lexemes:
1. Synonymy: Lexemes that have similar meanings (e.g., "big" and "large").
2. Antonymy: Lexemes that have opposite meanings (e.g., "hot" and "cold").
3. Hyponymy: A lexeme that represents a specific instance of a broader category
(e.g., "dog" is a hyponym of "animal").
4. Meronymy: Lexemes that represent a part-whole relationship (e.g., "wheel" is a
part of "car").

80. State the advantages of bottom-up chart parser compared to top-down parsing. (CO1
BL1) 10 Marks

 Bottom-Up Chart Parser:


o Advantages:
1. More efficient in handling ambiguity in the input, as it builds syntactic
structures incrementally from the leaves (words) upwards.
2. Handles incomplete input more effectively and can build parse trees
without requiring full knowledge of the entire sentence upfront.
3. Works well with grammars that are left-recursive, unlike top-down
parsers which may fail on such grammars.
4. Can easily accommodate ungrammatical inputs, by making partial
parse trees.
5. It can parse more types of grammars (including context-free grammars)
and offers greater flexibility in syntax analysis.
 Comparison with Top-Down Parsing:
o Top-Down Parsing:
1. Starts with the start symbol and tries to derive the sentence by expanding
non-terminals, which can be inefficient for ambiguous or incomplete
sentences.
2. May require backtracking if it makes wrong assumptions about the
sentence structure.
3. Can struggle with left-recursive rules, where recursion occurs at the start
of the sentence, causing infinite recursion.

82. Describe the Skip-gram model and its intuition in word embeddings. (CO1 BL2) 10
Marks

 Skip-gram Model:
1. The Skip-gram model is a neural network-based technique used in word
embeddings (such as Word2Vec) to learn vector representations of words.
2. Given a target word, the Skip-gram model predicts the surrounding words
(context words) within a defined window size in a sentence.
3. For example, in the sentence "The cat sat on the mat," if "sat" is the target word,
the model tries to predict context words like "cat," "on," "the," "mat."
 Intuition:
1. The core idea is that words appearing in similar contexts tend to have similar
meanings, which is captured through vector representations.
2. By training the model to predict surrounding words, the Skip-gram model learns
to assign words with similar meanings to similar vector positions in the
embedding space.
3. It is particularly useful for learning word vectors when there is a large corpus, as
it focuses on predicting the context for each target word in the text.
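
A minimal Skip-gram sketch using gensim's Word2Vec (sg=1 selects Skip-gram rather than CBOW); the parameter names follow gensim 4.x, and the two-sentence corpus is only illustrative, far too small for meaningful embeddings:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
print(model.wv["cat"][:5])              # first 5 dimensions of the learned vector for "cat"
print(model.wv.most_similar("cat"))     # words whose vectors are closest to "cat"'s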

83. Explain the concept of Term Frequency-Inverse Document Frequency (TF-IDF) based
ranking in information retrieval. (CO1 BL2) 10 Marks

 TF-IDF (Term Frequency-Inverse Document Frequency):


1. Term Frequency (TF) measures how frequently a term appears in a document.
2. Inverse Document Frequency (IDF) measures how important a term is by
considering how rare it is across all documents in a corpus.
3. The TF-IDF score of a word in a document is calculated as:

TF-IDF(t, d) = TF(t, d) × IDF(t)

4. Term Frequency (TF) is simply the number of times a word appears in a
document divided by the total number of words in that document.
5. Inverse Document Frequency (IDF) is calculated as:

IDF(t) = log(N / DF(t))

where N is the total number of documents, and DF(t) is the number of
documents containing the word.

 Ranking:
1. Documents with higher TF-IDF scores for a query term are considered more
relevant because they contain terms that are not overly common (low IDF) and
are more significant to the specific document (high TF).
2. This helps in ranking documents by relevance to the user's query, improving the
quality of search results in information retrieval systems.
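
Illustrative sketch: the definitions above can be computed by hand for a tiny, made-up corpus of three documents in plain Python.

import math

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "ran"]]
N = len(docs)                                      # total number of documents

def tf(term, doc):
    return doc.count(term) / len(doc)              # term frequency within one document

def idf(term):
    df = sum(1 for d in docs if term in d)         # number of documents containing the term
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("cat", docs[0]), 3))            # positive: "cat" is not in every document
print(round(tf_idf("the", docs[0]), 3))            # 0.0: "the" is in all documents, so IDF = log(1) = 0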

84. Tokenize and tag the following sentence: (CO1 BL1) 2 Marks

 Sentence: "I love programming."


 Tokenization: ["I", "love", "programming", "."]
 POS Tagging: [("I", "PRP"), ("love", "VBP"), ("programming", "NN"), (".", ".")]
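
This result can be reproduced with NLTK (one of the toolkits referenced elsewhere in this document); the exact tag assigned to "programming" may vary with the tagger model, and the resource names differ slightly across NLTK versions.

import nltk

nltk.download("punkt", quiet=True)                         # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)    # POS tagger model

tokens = nltk.word_tokenize("I love programming.")
print(tokens)                # ['I', 'love', 'programming', '.']
print(nltk.pos_tag(tokens))  # e.g. [('I', 'PRP'), ('love', 'VBP'), ('programming', ...), ('.', '.')]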

85. What different pronunciations and parts-of-speech are involved? (CO1 BL1) 2 Marks

 Pronunciations:
1. Words like "lead" can be pronounced differently depending on their meaning
(present tense verb vs. noun referring to the element).
 Parts-of-speech:
1. Homographs, such as "lead," can be both a noun and a verb.
2. The parts-of-speech involved may include nouns (NN), verbs (VB), and
adjectives (JJ).

86. Compute the edit distance (using insertion cost 1, deletion cost 1, substitution cost 1) of
“intention” and “execution”. Show your work using the edit distance grid. (CO1 BL4) 10
Marks
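
 Edit Distance Calculation:

1. With insertion, deletion, and substitution costs all equal to 1, filling the standard
dynamic-programming grid gives a minimum edit distance of 5, e.g. via the alignment:
delete "i", substitute "n"->"e", substitute "t"->"x", keep "e", insert "c", substitute
"n"->"u", and then "t i o n" matches.
2. The final edit distance is 5.

A minimal dynamic-programming sketch (the function name edit_distance is illustrative) that
fills the same grid and also verifies the answers to questions 95 and 96:

def edit_distance(src, tgt, ins=1, dele=1, sub=1):
    # dp[i][j] = minimum cost of turning src[:i] into tgt[:j]
    dp = [[0] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    for i in range(1, len(src) + 1):
        dp[i][0] = i * dele
    for j in range(1, len(tgt) + 1):
        dp[0][j] = j * ins
    for i in range(1, len(src) + 1):
        for j in range(1, len(tgt) + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j] + dele,       # deletion
                           dp[i][j - 1] + ins,        # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(src)][len(tgt)]

print(edit_distance("intention", "execution"))   # 5
print(edit_distance("kitten", "sitting"))        # 3 (question 96)
print(edit_distance("MAMA", "MADAAM"))           # 3 (question 95)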

87. What is the purpose of constructing corpora in Natural Language Processing (NLP)
research? (CO1 BL2) 5 Marks

 Purpose of Constructing Corpora:


1. Training and Testing: Corpora serve as the primary data source for training and
testing NLP models. They allow researchers to evaluate the effectiveness of
algorithms in real-world scenarios.
2. Resource for Lexicons: Corpora help in building lexicons, which are essential for
tasks like part-of-speech tagging, named entity recognition, and word sense
disambiguation.
3. Understanding Linguistic Patterns: Large corpora help researchers understand
linguistic patterns and structures that are crucial for tasks like syntactic parsing,
machine translation, and speech recognition.
4. Evaluation of Models: Corpora provide standard benchmarks for evaluating the
performance of various NLP systems.

88. What role do regular expressions play in searching and manipulating text data? (CO1
BL3) 5 Marks

 Role of Regular Expressions:


1. Pattern Matching: Regular expressions enable efficient searching for specific
patterns within text, such as dates, phone numbers, or email addresses.
2. Text Manipulation: They can be used for text replacements, formatting, and
extractions.
3. Data Validation: Regular expressions are used to validate input, such as checking
if an email follows the correct format.
4. Data Cleaning: In NLP, they are used to clean and preprocess text, removing
unwanted characters or formatting errors.
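
Illustrative sketch of these four roles using Python's re module (the patterns and the sample string are made up for demonstration):

import re

text = "Order #123 shipped on 2023-05-12.  Contact: 555-0101!!"

print(re.findall(r"\d{4}-\d{2}-\d{2}", text))          # pattern matching: find ISO-style dates
print(re.sub(r"\s+", " ", text))                        # manipulation: collapse repeated whitespace
print(bool(re.fullmatch(r"\d{3}-\d{4}", "555-0101")))   # validation: does a string fit a format?
print(re.sub(r"[^\w\s]", "", text))                     # cleaning: strip punctuation characters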

89. Explain the purpose of WordNet in Natural Language Processing (NLP). (CO1 BL4) 10
Marks

 Purpose of WordNet:
1. Lexical Database: WordNet is a large lexical database of English, organized into
sets of synonyms called synsets. It groups words based on their meanings and
relationships.
2. Semantic Relationships: It provides semantic relationships between words, such
as synonymy, antonymy, hypernymy, and hyponymy.
3. Word Sense Disambiguation: WordNet is commonly used in tasks like word
sense disambiguation, where the correct meaning of a word is determined based
on context.
4. Information Retrieval: It is used in information retrieval systems to improve
search results by understanding word meanings more deeply and identifying
relevant documents.
5. NLP Applications: WordNet plays a crucial role in machine translation, text
summarization, and sentiment analysis by providing a richer semantic
understanding of words.
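
Illustrative sketch: WordNet can be queried directly through NLTK's wordnet corpus reader (the data must be downloaded once); the senses printed depend on the installed WordNet version.

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)        # one-time download of the WordNet data

for syn in wn.synsets("bank")[:3]:          # first few senses (synsets) of "bank"
    print(syn.name(), "->", syn.definition())

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.hypernyms())                      # more general concepts for this sense of "dog"
print(dog.path_similarity(cat))             # path-based similarity between the two senses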

90. What is Pragmatic Ambiguity in NLP? (CO1 BL4) 10 Marks


 Pragmatic Ambiguity:
1. Pragmatic ambiguity occurs when a word or sentence can have multiple meanings
depending on the context, particularly in how the meaning is interpreted based on
the situation.
2. This ambiguity is related to how language is used in real-world scenarios,
including aspects such as intentions, assumptions, and social context.
3. For example, the sentence "Can you pass the salt?" could be interpreted as a
question or a request depending on the situation.
4. Resolving pragmatic ambiguity involves understanding the broader context of a
conversation, including the speaker’s intent, the relationship between participants,
and prior knowledge shared between them.
5. Handling this type of ambiguity in NLP often requires sophisticated techniques
such as discourse analysis or incorporating world knowledge into models.

91. Describe the class of strings matched by the following regular expressions: a. [a-zA-Z]+ b.
[A-Z][a-z]* (CO1 BL4) 10 Marks

 a. [a-zA-Z]+:
1. This regular expression matches strings that consist of one or more alphabetical
characters (both lowercase and uppercase).
2. It can match words such as "hello", "HELLO", or "Hello" but will not match
numbers, spaces, or any non-alphabetic characters.
 b. [A-Z][a-z]*:
1. This regular expression matches strings where the first character is an uppercase
letter (A-Z), followed by zero or more lowercase letters (a-z).
2. It can match strings like "Hello", "World", "Apple", but not strings like "hello",
"world", or "123".

92. Extract all email addresses from the following: “Contact us at [email protected] or
[email protected].” (CO1 BL4) 10 Marks

 Regular Expression for Email Extraction:


1. A typical regular expression for extracting email addresses can be:

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

2. Matches:
[email protected]
[email protected]
93. This regex is intended to match one or more uppercase letters followed by zero or more
digits: [A-Z]+[0-9]*. However, it has a problem. What is it, and how can it be fixed? (CO1 BL4)
10 Marks

 Problem:
1. The regular expression [A-Z]+[0-9]* can match any string that starts with one or
more uppercase letters and is optionally followed by digits. However, it will also
match strings where there are no digits, which might not be intended.
2. For example, the regex would match "HELLO" (which has no digits).
 Fix:
1. To ensure the regex only matches strings that have at least one uppercase letter
followed by at least one digit, you can modify the regular expression:

[A-Z]+[0-9]+

2. This ensures that after the uppercase letters, there must be one or more digits.

94. Write a regex to find all dates in a text. The date formats should include: DD-MM-
YYYY, MM-DD-YYYY, YYYY-MM-DD. (CO1 BL4) 10 Marks

 Regular Expression for Dates:


1. To match dates in the formats DD-MM-YYYY, MM-DD-YYYY, and YYYY-
MM-DD, the regular expression can be:

(\d{2})-(\d{2})-(\d{4})|(\d{4})-(\d{2})-(\d{2})

2. This matches:
 12-05-2023 (DD-MM-YYYY)
 05-12-2023 (MM-DD-YYYY)
 2023-12-05 (YYYY-MM-DD)

95. Compute the minimum edit distance between the words MAMA and MADAAM. (CO1
BL5) 10 Marks

 Edit Distance Grid Calculation:


1. Using insertion cost = 1, deletion cost = 1, and substitution cost = 1:
 Start with a grid for "MAMA" vs "MADAAM".
2. The final edit distance is 3.

96. Evaluate the minimum edit distance in transforming the word ‘kitten’ to ‘sitting’ using
insertion, deletion, and substitution cost as 1. (CO1 BL5) 10 Marks

 Edit Distance Calculation:


1. Construct an edit distance grid for "kitten" and "sitting":
2. The final edit distance is 3.

Module 2 
1. What are language models? (2 Marks)

Answer:
A language model is a statistical model that is used to predict the next word or sequence of
words in a sentence. It assigns a probability to a sequence of words, based on the prior
occurrences of the sequence in a corpus. Language models are fundamental in many NLP tasks,
such as speech recognition, text generation, and machine translation.

2. Describe the n-gram model with a specific example. (2 Marks)

Answer:
The n-gram model is a type of probabilistic language model where the probability of a word
depends on the previous n-1 words. For example, in a bigram model (n=2), the probability of the
next word depends only on the previous word.
Example: "I am learning NLP."

 Unigram: P(I), P(am), P(learning), P(NLP)


 Bigram: P(I, am), P(am, learning), P(learning, NLP)

3. Write two differences between bi-gram and tri-gram models. (2 Marks)

Answer:

1. Number of words considered:


o Bigram considers pairs of consecutive words (n=2).
o Trigram considers triples of consecutive words (n=3).
2. Context length:
o Bigram provides context from just one preceding word.
o Trigram provides context from two preceding words.

4. What is chain rule of probability? (2 Marks)


Answer:
The chain rule of probability is a way to break down a joint probability distribution into a
product of conditional probabilities. It helps in computing the probability of a sequence of events
by considering the conditional probabilities of each event given the previous ones.
Mathematically, for a sequence of words w_1, w_2, ..., w_n:
P(w_1, w_2, ..., w_n) = P(w_1) · P(w_2 | w_1) · ... · P(w_n | w_1, w_2, ..., w_{n-1})

5. When we are considering the bigram model, what approximation/s do we


make to the actual formula to calculate probability? (5 Marks)

Answer:
In the bigram model, we make the following approximations:

1. Markov Assumption: The probability of a word depends only on the immediately


preceding word. This approximation ignores longer dependencies in the sequence.
2. Simplified Conditional Probability: The probability of a word given the entire history
is approximated by the probability of the word given only the previous word.
Formula:
P(w_n | w_1, w_2, ..., w_{n-1}) ≈ P(w_n | w_{n-1})

6. What is a Markov assumption? (2 Marks)

Answer:
The Markov assumption states that the future state of a process depends only on the current state,
and not on the sequence of events that preceded it. In the context of language models, it means
that the probability of a word depends only on the immediately preceding word (or a fixed
number of previous words in the case of n-grams).

7. What is maximum likelihood estimation used for? (2 Marks)

Answer:
Maximum Likelihood Estimation (MLE) is used to estimate the parameters of a probability
distribution or model that maximizes the likelihood of observing the given data. In language
modeling, MLE is used to estimate the probabilities of word sequences in n-gram models based
on the frequency of occurrences in the training data.
8. Given a word w_n and the previous word w_{n-1}, how do we
normalize the count of bigrams? State the formula for the same. (10 Marks)

Answer:
To normalize the count of bigrams, we use the following formula for the Relative Frequency of
a bigram (w_{n-1}, w_n):
P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
Where:

 C(w_{n-1}, w_n) is the count of the bigram (w_{n-1}, w_n)
 C(w_{n-1}) is the count of the word w_{n-1}
This formula normalizes the bigram count by dividing it by the frequency of the first
word in the bigram.
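
A minimal sketch of this normalization on a toy word sequence, using plain Python counters:

from collections import Counter

tokens = "i am sam sam i am i am sam".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = C(prev, word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "am"))    # C(i, am) / C(i)    = 3 / 3 = 1.0
print(bigram_prob("am", "sam"))  # C(am, sam) / C(am) = 2 / 3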

9. What is relative frequency in n-gram model? (2 Marks)

Answer:
Relative frequency in the n-gram model refers to the ratio of the occurrence of a particular n-
gram to the total number of occurrences of the (n-1)-gram that precedes it. It is used to estimate
the probability of a sequence of words.
Formula:
P(w_n | w_1, w_2, ..., w_{n-1}) = C(w_1, w_2, ..., w_n) / C(w_1, w_2, ..., w_{n-1})

10. What are the building blocks of semantic system? (5 Marks)

Answer:
The building blocks of a semantic system include:

1. Lexical Semantics: It involves understanding the meaning of individual words and their
relationships, including synonyms, antonyms, and polysemy.
2. Compositional Semantics: It is concerned with how words combine to form phrases and
sentences with specific meanings.
3. Pragmatics: This refers to how context influences the interpretation of meaning in
communication.
4. World Knowledge: It involves the real-world knowledge and common-sense reasoning
needed to understand certain expressions or sentences.
5. Reference and Inference: Understanding how terms refer to objects or concepts in the
world and how inferences are drawn from text.
11. Discuss lexical ambiguity. (5 Marks)

Answer:
Lexical ambiguity occurs when a word has multiple meanings or senses. This can happen when
the word belongs to different categories or has several interpretations based on context.
Example:

 "Bank":
1. The side of a river (noun).
2. A financial institution (noun).
The correct meaning depends on the surrounding context, making it lexically
ambiguous.

12. Discuss semantic ambiguity. (5 Marks)

Answer:
Semantic ambiguity arises when a phrase or sentence can be interpreted in multiple ways due to
the meanings of words or the structure of the sentence itself. This is often because of the
vagueness or multiple interpretations of meanings.
Example:

 "I saw the man with the telescope."


o Interpretation 1: I saw a man who was holding a telescope.
o Interpretation 2: I used a telescope to see the man.
This ambiguity comes from the way the words are structured or interpreted.

13. Discuss syntactic ambiguity. (5 Marks)

Answer:
Syntactic ambiguity occurs when a sentence or phrase can be interpreted in multiple ways
because of its syntactic structure. Different parse trees or sentence structures lead to multiple
meanings.
Example:

 "I shot an elephant in my pajamas."


o Interpretation 1: I shot an elephant while wearing pajamas.
o Interpretation 2: I shot an elephant that was in my pajamas.
The ambiguity arises from the placement of the phrase "in my pajamas."

14. What is the need for meaning representation? (5 Marks)


Answer:
Meaning representation is necessary for machines to understand and process human language. It
allows the system to encode knowledge in a structured form that facilitates reasoning,
interpretation, and inference.
Key reasons:

1. Disambiguation: Helps resolve ambiguities and provides clarity about word meanings in
context.
2. Information Retrieval: Improves the accuracy of search results by better understanding
the meaning of queries.
3. Natural Language Understanding: Supports tasks like machine translation, question
answering, and summarization.

15. What is the major difference between lexical analysis and semantic analysis
in NLP? (5 Marks)

Answer:

 Lexical Analysis: It involves analyzing the structure of words and breaking them into
tokens. This is the process of converting raw input into meaningful units such as words,
symbols, and punctuation.
 Semantic Analysis: It is concerned with extracting meaning from the text. Semantic
analysis involves resolving ambiguity and interpreting the relationships between words
and phrases in a way that machines can understand.
Difference: Lexical analysis focuses on word-level processing, while semantic analysis
deals with understanding the meaning of words and sentences.

16. Name two language modeling toolkits. (2 Marks)

Answer:

1. NLTK (Natural Language Toolkit)


2. Stanford NLP

17. With examples, explain the different types of parts of speech attributes. (5
Marks)

Answer:
Parts of speech (POS) attributes define the role of words in sentences. Common types include:
1. Noun: Denotes a person, place, thing, or idea.
Example: "dog", "city"
2. Verb: Represents an action or state.
Example: "run", "is"
3. Adjective: Describes a noun.
Example: "beautiful", "large"
4. Adverb: Modifies a verb, adjective, or another adverb.
Example: "quickly", "very"
5. Pronoun: Replaces a noun.
Example: "he", "it"
6. Preposition: Shows relationships between nouns and other words.
Example: "in", "on"
7. Conjunction: Connects words or phrases.
Example: "and", "but"

18. Explain extrinsic evaluation of the N-gram model and the difficulties related
to it. (10 Marks)

Answer:
Extrinsic evaluation refers to assessing the N-gram model by evaluating its performance on an
external task, such as speech recognition, text generation, or machine translation, where the
model is applied to real-world problems.
Difficulties:

1. Dependence on Task: The evaluation is task-dependent, and the model's performance


may vary across different tasks.
2. Evaluation Metrics: Choosing the right evaluation metric (e.g., BLEU for translation) is
crucial and may not always reflect the model's true effectiveness.
3. Data Dependence: The N-gram model's performance heavily depends on the quality and
size of the training data, which might not always be representative of real-world
scenarios.
4. Overfitting: The model might overfit to specific patterns in the training data, reducing
generalizability.

19. With an example, explain the path-based similarity check for two words. (5
Marks)

Answer:
Path-based similarity checks compare two words by examining the shortest path between them in
a lexical database like WordNet. The similarity score depends on how closely related the words
are in the network.
Example:
 For the words "cat" and "dog," a path-based similarity measure might calculate the
shortest path from "cat" to "dog" in WordNet. If they share a common parent (like
"animal"), their similarity score would be higher.
 Cat -> Animal -> Dog

20. Define Homonymy, Polysemy, and Synonymy with examples. (5 Marks)

Answer:

 Homonymy: Words that have the same form but different meanings.
Example: "bat" (flying mammal) vs. "bat" (sports equipment).
 Polysemy: A single word has multiple related meanings.
Example: "bank" (financial institution) vs. "bank" (side of a river).
 Synonymy: Words that have similar or identical meanings.
Example: "happy" and "joyful."

21. How does WordNet assist in extracting semantic information from a corpus?
(5 Marks)

Answer:
WordNet is a lexical database that organizes words into sets of synonyms (synsets), with
relationships between them. It aids in extracting semantic information in several ways:

1. Synonymy: Helps in identifying words with similar meanings, useful in tasks like word
sense disambiguation and information retrieval.
2. Hyponymy/Hypernymy: Identifies hierarchical relationships between words (e.g., "dog"
is a hyponym of "animal").
3. Meronymy: Identifies part-whole relationships (e.g., "wheel" is a meronym of "car").
4. Path-based similarity: Helps compute semantic similarity between words based on the
paths connecting them in the hierarchy.

22. How does NLP employ computational lexical semantics? Explain. (5 Marks)

Answer:
In computational lexical semantics, NLP systems represent and analyze the meaning of words
and their relationships computationally. This involves:

1. Word Sense Disambiguation (WSD): Resolving ambiguity in words that have multiple
meanings (e.g., "bat" as a flying mammal or a sport's equipment).
2. Semantic Role Labeling (SRL): Assigning roles to words in a sentence, helping the
system understand the relationships (e.g., identifying agents, actions, and objects).
3. Vector Representations: Using techniques like Word2Vec or GloVe, words are
represented in vector space, where semantic similarity can be captured by vector
proximity.
4. Corpus-based Analysis: Using large corpora to model semantic relationships, improve
understanding, and extract meaning from context.

23. What are the problems with basic path-based similarity measures, and how
are they reformed through information content similarity metrics? (10 Marks)

Answer:
Problems with Path-based Similarity Measures:

1. Superficial Relationships: Path-based measures rely on the structural path in the lexical
network, which may not always capture deep semantic similarity.
2. Lack of Context: Path-based methods do not consider the frequency or context of word
usage, potentially leading to misleading similarity scores.
3. Limited Coverage: Some words may not have direct paths between them, making it hard
to compute a meaningful similarity.

Reform through Information Content (IC) Similarity:

1. Incorporates Frequency Information: IC-based measures use the frequency of word


occurrences in large corpora, considering less frequent words as more informative.
2. Contextual Relevance: IC measures focus on the amount of information conveyed by a
word based on its occurrence in a corpus, improving semantic accuracy.
3. Better Disambiguation: Helps distinguish between different senses of a word,
improving the robustness of similarity computations.

24. Explain the extended Lesk algorithm with an example. (5 Marks)

Answer:
The Lesk algorithm is a popular method for word sense disambiguation. It compares the
dictionary definitions (or glosses) of word senses to select the most appropriate meaning based
on overlap with surrounding words.
Extended Lesk Algorithm improves this by including contextual words not just in the
surrounding window but throughout the entire document.
Example:

 For the word "bank", the basic Lesk algorithm compares the definitions of "bank" (a
financial institution) with surrounding words like "money" and "investment".
 In the extended version, it also considers broader context from the text like "river" or
"water" to choose "bank" (side of a river) as the correct sense.
25. State the difference in properties of Rule-based POS tagging and Stochastic
POS tagging. (5 Marks)

Answer:

 Rule-based POS tagging:


1. Relies on predefined linguistic rules to assign POS tags.
2. High precision when the rules are well-defined, but may struggle with ambiguities
or edge cases.
3. No learning involved; relies purely on expert knowledge and linguistic structures.
 Stochastic POS tagging:
1. Uses statistical methods (e.g., Hidden Markov Models) to predict POS tags based
on observed probabilities.
2. Handles ambiguity by considering context and probabilities.
3. Can learn from data, improving over time, but may not be as interpretable as rule-
based methods.

26. What is stochastic POS tagging? What are the properties of stochastic POS
tagging? (10 Marks)

Answer:
Stochastic POS tagging involves using statistical models (such as Hidden Markov Models or
Conditional Random Fields) to assign POS tags based on observed frequencies and probabilities
in a given corpus. It predicts the most likely POS tags by considering the context of surrounding
words.
Properties:

1. Context-dependent: Unlike rule-based tagging, stochastic tagging takes into account the
context in which a word appears.
2. Probabilistic: It uses probabilities to make decisions, often yielding higher accuracy on
ambiguous words.
3. Data-driven: Stochastic models learn from annotated corpora, improving accuracy as
more data is provided.
4. Handles Ambiguity: Stochastic POS tagging is particularly effective in dealing with
ambiguous words, as it can consider both the word's form and its context.

27. What is rule-based POS tagging and what are the properties of the same? (10
Marks)
Answer:
Rule-based POS tagging relies on manually crafted rules based on linguistic knowledge to
assign POS tags to words. It uses patterns, lexical information, and syntactic rules to tag words in
sentences.
Properties:

1. Expert Knowledge: Requires domain expertise and linguistic rules to be created.


2. Precision: High precision on texts where the rules are applicable, particularly in formal
and structured text.
3. Limited Flexibility: Struggles with text that doesn't conform to the predefined rules or
includes unusual or ambiguous cases.
4. No Learning: It doesn’t improve with exposure to more data, as it is not based on
statistical learning but on predefined rules.

28. Give examples to illustrate how the n-gram approach is utilized in word
prediction. (10 Marks)

Answer:
The n-gram approach is widely used in word prediction tasks, where the goal is to predict the
next word in a sequence based on the previous ones.
Example (Bi-gram model):

 Context: "I am going to the"


The bi-gram model looks at the last word ("the") and predicts the next word. If the
model has seen "the" followed by "store" in the training data, it will predict "store" as the
next word.

Example (Tri-gram model):

 Context: "She went to the"


The tri-gram model looks at the last two words ("to the") and predicts the next word. If
"to the" was followed by "market" in training data, it will predict "market."

29. Highlight transformation-based tagging and the working of the same. (10
Marks)

Answer:
Transformation-based tagging is a hybrid POS tagging approach that combines rule-based
methods with machine learning techniques.

1. Initial Tagging: The process starts with an initial tagging using a simple rule-based
approach or a stochastic model.
2. Transformation Rules: Then, a set of transformation rules are applied to correct the
initial tagging. These rules specify conditions under which certain tags can be changed
(e.g., if a noun follows a determiner, it should be tagged as a noun).
3. Iterative Refinement: These transformation rules are learned from the training data and
applied iteratively, improving the accuracy of the POS tags.

30. State the difference between structured data and unstructured data. (10
Marks)

Answer:

 Structured Data:
1. Organized in tables or databases with predefined models (e.g., relational
databases).
2. Easily searchable and processed using SQL.
3. Example: Employee records, transaction logs, or sensor data.
 Unstructured Data:
1. No predefined structure and cannot be stored in traditional relational databases.
2. Often text-heavy and requires specialized techniques (e.g., NLP) for analysis.
3. Example: Emails, social media posts, images, and videos.

31. What is semi-structured data? Explain with an example. (5 Marks)

Answer:
Semi-structured data has some form of organization but does not conform to a rigid structure
like structured data. It typically contains tags or markers to separate elements but does not follow
a strict schema.
Example:

 XML or JSON files: Data is stored in a flexible format with key-value pairs or tags but
does not fit into a fixed table structure.
o XML example:

<person>
<name>John</name>
<age>30</age>
<city>New York</city>
</person>

32. How does a supervised machine learning algorithm contribute to text


classification? (5 Marks)

Answer:
In text classification, a supervised machine learning algorithm uses labeled data (texts with
predefined categories) to learn patterns and relationships between the input features and output
labels.

1. Training Phase: The algorithm is trained using a labeled dataset, where each document
or text is associated with a category (e.g., spam or not spam).
2. Feature Extraction: The algorithm extracts features (e.g., word frequency, n-grams, or
term frequency) to represent the text data.
3. Prediction: Once trained, the model can classify new, unseen text into predefined
categories based on learned patterns.
Example: A Naive Bayes classifier for spam email detection, where emails labeled as
"spam" or "not spam" are used to train the model.

33. List the uses of emotion analytics. (5 Marks)

Answer:
Emotion analytics involves analyzing textual, vocal, or visual data to detect emotions. Some
common uses are:

1. Customer Feedback: Analyzing customer reviews, social media comments, or surveys


to gauge customer satisfaction or dissatisfaction.
2. Market Research: Assessing consumer sentiment and emotional reactions to advertising,
branding, and products.
3. Healthcare: Understanding emotional states in patients through speech or text to monitor
mental health.
4. Social Media Monitoring: Tracking public sentiment about trends, news, or events in
real time.
5. Human-Computer Interaction: Improving user experience by tailoring responses based
on emotional feedback.

34. Say you are an employee of a renowned food delivery company and your
superior has asked you to do a market survey to search for potential competition
and zero down to areas where your company needs to improve to be the top
company in the market. How will you approach this task and accomplish the
goal? (10 Marks)

Answer:

1. Identify Competitors:
o Research local and national competitors offering food delivery services. Focus on
market leaders and emerging competitors.
o Examine customer reviews, ratings, and media coverage to identify popular
competitors.
2. Analyze Competitor Services:
o
Compare delivery times, pricing models, menu options, payment methods, app
features, and customer support channels.
o Identify unique selling propositions (USPs) that set competitors apart.
3. Customer Sentiment Analysis:
o Conduct surveys or analyze social media and review platforms to understand
customers' opinions and complaints about competitors.
o Focus on areas like food quality, delivery speed, customer service, and app
usability.
4. Spot Weaknesses and Opportunities:
o Highlight gaps in competitors’ offerings or services where your company can
improve (e.g., more menu variety, loyalty programs, faster delivery times).
o Research potential market opportunities in untapped areas, such as healthier food
options or a focus on sustainability.
5. Recommendations for Improvement:
o Based on research, propose strategies like enhancing delivery times, introducing
better customer support, and upgrading the mobile app to make it more user-
friendly.
o Suggest marketing strategies to increase brand awareness and attract more
customers.

35. Explain a classic search model with a diagram. (5 Marks)

Answer:
A classic search model refers to the way search algorithms work, such as the Breadth-First
Search (BFS) or Depth-First Search (DFS). These models explore search spaces or trees to
find a goal node or solution.

 Breadth-First Search (BFS): Explores all neighbors of a node before moving to the next
level.
 Depth-First Search (DFS): Explores as far down a branch as possible before
backtracking.

Diagram (for BFS):

        A
       / \
      B   C
     / \   \
    D   E   F

 BFS explores nodes level by level: A → B → C → D → E → F.
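
A minimal BFS sketch over the tree above, using an explicit queue (the adjacency list mirrors the diagram):

from collections import deque

graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}

def bfs(start):
    order, queue, seen = [], deque([start]), {start}
    while queue:
        node = queue.popleft()           # visit the oldest discovered node first
        order.append(node)
        for nbr in graph[node]:
            if nbr not in seen:          # enqueue each child exactly once
                seen.add(nbr)
                queue.append(nbr)
    return order

print(bfs("A"))   # ['A', 'B', 'C', 'D', 'E', 'F'] - visited level by level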

36. Why is part-of-speech (POS) tagging required in NLP? (5 Marks)


Answer:
POS tagging is essential in Natural Language Processing (NLP) for several reasons:

1. Disambiguation: It helps in distinguishing between words that can serve different parts
of speech (e.g., "bank" as a financial institution or the side of a river).
2. Syntax Structure: It aids in understanding the sentence structure, helping the system
determine subject-object relationships.
3. Text Analysis: POS tagging is used for information extraction, sentiment analysis, and
machine translation by classifying words as nouns, verbs, adjectives, etc.
4. Improves Accuracy: POS tagging increases the accuracy of downstream NLP tasks,
such as parsing and named entity recognition (NER).

37. What is vocabulary in NLP? (2 Marks)

Answer:
Vocabulary in NLP refers to the collection of unique words or tokens that are used in a given
text or corpus. The vocabulary is typically built from the set of words found in a dataset and is
crucial for tasks like text classification, word embeddings, and language modeling.

38. What do you mean by Information Extraction? (2 Marks)

Answer:
Information Extraction (IE) refers to the process of automatically extracting structured
information from unstructured text. It typically involves identifying entities (e.g., names, dates,
locations), relationships, and events within the text to convert it into a more structured form, like
databases or knowledge graphs.

39. What is morphological parsing? Explain the steps of morphological parser?


(5 Marks)

Answer:
Morphological Parsing is the process of analyzing the structure of words to determine their root
forms, prefixes, suffixes, and other morphological features.
Steps of Morphological Parsing:

1. Segmentation: Break the word into morphemes (smallest meaningful units).


2. Lexical Analysis: Identify the root and affixes of a word.
3. Syntactic Analysis: Analyze the syntactic structure of the word.
4. Rule Application: Apply morphological rules (e.g., stemming, inflection, derivation) to
understand the word's meaning.
5. Output: The parser outputs the root form and relevant morphological features.

40. What is BOW (Bag of Words)? (5 Marks)

Answer:
The Bag of Words (BOW) model is a simple representation of text where each document is
represented as an unordered set of words, disregarding grammar and word order but keeping
multiplicity.

 Each word in a document is treated as a separate feature.


 BOW is commonly used in text classification, sentiment analysis, and information
retrieval.
 Example: For the sentence "I love NLP", the BOW representation would be: {"I": 1,
"love": 1, "NLP": 1}.

41. State the difference between formal language and natural language. (5
Marks)

Answer:

 Formal Language:
o Rigorous syntax and semantics (e.g., programming languages, mathematical
expressions).
o Has well-defined rules that do not change.
o Can be processed by computers without ambiguity.
 Natural Language:
o Used by humans for everyday communication (e.g., English, Spanish).
o Contains ambiguities, irregularities, and flexible grammar rules.
o Difficult for computers to process due to complexity and variations in meaning.

42. Assume there are 4 topics namely, Cricket, Movies, Politics and Geography
and 4 documents D1, D2, D3, and D4, each containing equal number of words.
These words are taken from a pool of 4 distinct words namely, {Shah Rukh,
Wicket, Mountains, Parliament} and there can be repetitions of these 4 words in
each document. Assume you want to recreate document D3. Explain the process
you would follow to achieve this and reason as how recreating document D3 can
help us understand the topic of D3 (10 Marks)

Answer:
1. Term Frequency (TF):
o First, calculate the term frequency of each word in all documents, including
document D3.
2. Topic Modeling:
o Use techniques like Latent Dirichlet Allocation (LDA) or TF-IDF to associate
words with topics.
o Based on the frequencies of the words, reconstruct the topic for D3. For example,
if "Mountains" and "Parliament" appear frequently in D3, it likely relates to
Geography and Politics.
3. Recreate Document D3:
o After identifying the most frequent terms associated with the topics, recreate D3
by choosing words based on the identified topics.
o The process of recreating D3 helps in understanding the content and topic of
the document, as the word distribution reflects the topic it represents.

43. What is text parsing? (2 Marks)

Answer:
Text Parsing refers to the process of analyzing a sequence of words or sentences to extract
meaningful structures or components. This typically involves breaking down text into parts like
sentences, phrases, or words, and analyzing their grammatical or syntactic structure. Parsing
helps in identifying parts of speech, sentence components (subjects, predicates), and
relationships between them.

44. Explain Sentiment analysis in market research? (2 Marks)

Answer:
Sentiment Analysis in market research refers to the process of analyzing customer opinions,
feedback, or reviews to determine their emotional tone (positive, negative, or neutral). By
understanding sentiment, companies can gain insights into customer perceptions of products,
services, or brands. This helps in improving product offerings, targeting marketing campaigns,
and enhancing customer experiences.

45. Describe Hidden Markov Models. (2 Marks)

Answer:
Hidden Markov Models (HMMs) are statistical models used to represent systems that follow a
Markov process with unobservable (hidden) states. HMMs consist of:

1. States: The system’s internal state (hidden and not directly observable).
2. Observations: The observed data or output influenced by the hidden states.
3. Transitions: Probabilities of transitioning between states.
4. Emission Probabilities: Likelihood of an observation being generated from a state.

HMMs are widely used in speech recognition, part-of-speech tagging, and time-series analysis.

46. State and explain in details the main advantage of Latent Dirichlet Allocation
methodology over Probabilistic Latent Semantic Analysis for building a
Recommender system? (10 Marks)

Answer:
Latent Dirichlet Allocation (LDA) vs. Probabilistic Latent Semantic Analysis (PLSA):

 LDA is a generative probabilistic model that assumes each document is a mixture of


topics and each topic is a distribution over words. It models the distribution of topics in a
more flexible manner using Dirichlet distributions.
 Main Advantage of LDA over PLSA:
o Dirichlet Prior: LDA uses a Dirichlet distribution for topic distributions, which
helps in regularizing the topic distribution per document. PLSA, on the other
hand, does not have a Dirichlet prior and can overfit the data, especially when
dealing with sparse data.
o Scalability: LDA is more scalable to large datasets due to its ability to estimate
topic distributions in a more structured way.
o Better Generalization: LDA’s use of Dirichlet priors helps in better
generalization to unseen data, improving the quality of recommendations in a
recommender system.
o Interpretability: LDA’s structure allows for more meaningful and interpretable
topics, which is beneficial for understanding the factors influencing user
preferences in a recommender system.

47. Explain in details how the Matrix Factorization technique used for building
Recommender Systems effectively boils down to solving a Regression problem. (5
Marks)

Answer:
Matrix Factorization in recommender systems involves decomposing a large, sparse matrix
(user-item ratings matrix) into two smaller matrices: one representing user features and the
other representing item features.

1. Goal: The goal is to approximate the original matrix by multiplying the user and item
matrices. The factorization aims to minimize the difference between the predicted ratings
and the actual ratings (i.e., minimize the error).
2. Regression Problem: The matrix factorization can be viewed as a regression problem,
where the predicted rating for a user-item pair is a function of the dot product of the
user's and item's feature vectors. The error in prediction can be minimized using
techniques like Stochastic Gradient Descent (SGD), which adjusts the feature vectors
(user and item) based on the error, similar to how regression models adjust coefficients.
3. Optimization: The optimization process, similar to regression, tries to minimize the sum
of squared errors between the actual ratings and the predicted ratings.
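
Illustrative sketch of this view: each observed rating contributes a squared-error term, and stochastic gradient descent updates the user and item feature vectors in the same way a regression updates its coefficients. The ratings matrix below is invented.

import numpy as np

# Hypothetical 4-user x 3-item rating matrix; 0 means "not rated".
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5],
              [0, 2, 4]], dtype=float)
n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))    # user feature vectors
Q = rng.normal(scale=0.1, size=(n_items, k))    # item feature vectors
lr, reg = 0.01, 0.05                            # learning rate and L2 regularization

for epoch in range(200):
    for u in range(n_users):
        for i in range(n_items):
            if R[u, i] > 0:                     # use only observed ratings
                err = R[u, i] - P[u] @ Q[i]     # regression-style residual
                P[u] += lr * (err * Q[i] - reg * P[u])
                Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 2))                     # predicted ratings, including the missing cells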

48. What are the two main approaches used in computational linguistics for Part
of Speech (POS) tagging? (5 Marks)

Answer:
The two main approaches for Part of Speech (POS) tagging are:

1. Rule-Based POS Tagging:


o Relies on a set of hand-crafted linguistic rules that assign tags based on the
context of a word and its relationship with surrounding words.
o Example: If a word appears after a determiner, it's likely to be a noun.
2. Stochastic POS Tagging:
o Uses probabilistic models like Hidden Markov Models (HMMs) to predict the
most likely POS tags based on a given word's context and the probabilities of
sequences of tags.
o The model is trained on labeled data to learn tag probabilities and transitions
between tags.

49. What is WordNet? (2 Marks)

Answer:
WordNet is a lexical database of the English language. It organizes words into sets of synonyms
called synsets, each representing a distinct concept. Words in WordNet are related to each other
through various semantic relations, such as hypernyms (generalizations), hyponyms (specific
instances), meronyms (part-whole relationships), and antonyms.

50. Describe the hierarchy of relationships in WordNet. (5 Marks)

Answer:
WordNet organizes words into a network of interrelated synsets (sets of synonyms) and defines
several semantic relationships between them:

1. Hypernymy: A hypernym is a more general term for a word. Example: "Animal" is a


hypernym of "Dog".
2. Hyponymy: A hyponym is a more specific term within a category. Example: "Dog" is a
hyponym of "Animal".
3. Meronymy: A meronym represents part-whole relationships. Example: "Wheel" is a
meronym of "Car".
4. Antonymy: Antonyms are words with opposite meanings. Example: "Hot" and "Cold".
5. Troponyms: A troponym describes a manner or way of performing an action. Example:
"Jog" is a troponym of "Run".

These relationships form a hierarchical structure in WordNet that helps in semantic analysis.

51. How are morphological operations applied in NLP? (5 Marks)

Answer:
Morphological operations in NLP involve analyzing and processing words into their
components, such as stems, roots, prefixes, and suffixes. These operations are applied to
understand word meanings, word variations, and structure.

1. Stemming: Reduces words to their root form. Example: "Running" → "Run".


2. Lemmatization: Reduces words to their lemma (dictionary form), considering the word's
context. Example: "Better" → "Good".
3. Inflection: Handles variations in words based on tense, number, or case. Example:
"Walked" → "Walk".
4. Derivation: Creates new words by adding prefixes or suffixes. Example: "Happy" →
"Happiness".

These operations are essential for standardizing words and improving text analysis tasks like
classification and information retrieval.

52. Explain the concept of hypernyms, hyponyms, heteronyms in WordNet. (10


Marks)

Answer:

1. Hypernyms:
o A hypernym is a word that denotes a broader category or class. In WordNet,
hypernyms represent generalizations of specific concepts.
o Example: "Vehicle" is a hypernym of "Car".
2. Hyponyms:
o A hyponym is a more specific word that falls under a broader category. In
WordNet, hyponyms are more specific instances of a general concept.
o Example: "Apple" is a hyponym of "Fruit".
3. Heteronyms:
o Heteronyms are words that have the same spelling but different meanings and
pronunciations depending on the context.
o Example: "Lead" (to guide) vs. "Lead" (the metal).

These relationships help in word sense disambiguation and improving semantic understanding in
NLP.

53. Discuss the advantages and disadvantages of CBOW and Skip-gram models.
(10 Marks)

Answer:
Continuous Bag of Words (CBOW) and Skip-gram are two models used for word embeddings
in Word2Vec:

1. CBOW Model:
o Advantages:
 Efficient for small datasets: CBOW works well when training on smaller
datasets.
 Faster training: It tends to be faster since it uses the surrounding context
to predict the target word.
o Disadvantages:
 Less effective for rare words: It struggles to capture the meanings of rare
words.
2. Skip-gram Model:
o Advantages:
 Good for larger datasets: Skip-gram works well with large corpora and
can capture richer relationships between words.
 Better for rare words: It excels at learning embeddings for rare words.
o Disadvantages:
 Slower training: It is computationally expensive and slower compared to
CBOW.

54. Explain the process of text classification, focusing on Naïve Bayes' Text
Classification algorithm. (10 Marks)

Answer:
Text classification is the process of categorizing text into predefined labels or classes. The
Naïve Bayes algorithm is a simple and effective approach for text classification, based on Bayes'
theorem.

1. Step 1 - Data Preprocessing:


o Tokenize the text into words and remove stop words.
o Convert words into features (using methods like TF-IDF or Bag of Words).
2. Step 2 - Model Training:
o The Naïve Bayes algorithm assumes that words are conditionally independent
given the class. It computes the prior probabilities for each class and the
likelihood of each word given a class.
3. Step 3 - Classification:
o For a new document, calculate the posterior probability for each class and assign
the class with the highest probability.
4. Advantages of Naïve Bayes:
o Works well with high-dimensional datasets like text.
o Computationally efficient.
5. Limitations:
o Assumption of word independence may not hold true in many cases, reducing
accuracy.
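
Illustrative sketch of these steps with scikit-learn's CountVectorizer and MultinomialNB; the tiny spam/ham texts and labels below are invented for demonstration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "lowest price guaranteed today",
         "meeting moved to noon tomorrow", "please review the attached report"]
labels = [1, 1, 0, 0]                       # 1 = spam, 0 = not spam

# Steps 1-2: bag-of-words features; Step 3: Naive Bayes classification.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free prize"]))        # most likely class for new text
print(model.predict_proba(["claim your free prize"]))  # posterior probability per class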

55. How do you use naïve bayes model for collaborative filtering? (5 Marks)

Answer:
Naïve Bayes can be used for Collaborative Filtering by treating users’ interactions with items
as categorical data. The idea is to predict a user’s preferences based on the probabilistic
relationships between items and user behaviors.

1. User and Item Interaction:


o Treat each item (movie, product, etc.) as a feature and each user’s rating or
interaction as a class label.
2. Modeling:
o For each user, use Naïve Bayes to predict the likelihood of the user’s interaction
with an item (like a rating, preference, or behavior).
o For example, the probability of user A liking item X is computed based on the
user’s history with similar items (e.g., movies or products).
3. Training:
o Use the historical interactions to train the Naïve Bayes model. It will estimate the
likelihood of a user liking an item based on features like item characteristics and
past preferences.
4. Prediction:
o For a new item, the model calculates the probabilities and predicts whether a user
will interact with or like it.

56. Is lexical analysis different from semantic? How? (10 Marks)

Answer:
Lexical Analysis and Semantic Analysis are two distinct stages in Natural Language Processing
(NLP), each serving different purposes:
1. Lexical Analysis:
o Definition: Lexical analysis refers to the process of breaking down the input text
into tokens, which are the smallest units of meaningful data (words, punctuation,
etc.).
o Purpose: The primary aim is to identify the structure and parts of speech (nouns,
verbs, etc.), normalize them (e.g., stemming or lemmatization), and convert them
into a machine-readable format.
o Example: Converting the sentence "Cats are playing" into tokens: ["Cats", "are",
"playing"].
2. Semantic Analysis:
o Definition: Semantic analysis deals with understanding the meaning of words,
phrases, and sentences in context. It involves deriving the meaning or "sense" of
words and how they relate to each other.
o Purpose: The goal is to extract the deeper meaning or context behind words,
resolve ambiguities, and link words to concepts.
o Example: The word “bat” could refer to a flying mammal or a piece of sports
equipment. Semantic analysis disambiguates this based on context (e.g., "He
swung the bat" vs. "A bat flew across the sky").

Key Difference:

 Lexical analysis focuses on structure (tokenization, parsing), while semantic analysis


focuses on meaning (interpretation, context).

57. Define what N-grams are in the context of Natural Language Processing
(NLP). (5 Marks)

Answer:
N-grams in NLP are contiguous sequences of n items (words, characters, etc.) from a given text
or speech. They are used for various tasks like language modeling, text prediction, and feature
extraction. The value of n determines the length of the sequence:

1. Unigrams (1-grams): Single words or characters. Example: "I", "am", "happy".


2. Bigrams (2-grams): Pairs of consecutive words. Example: "I am", "am happy".
3. Trigrams (3-grams): Triplets of consecutive words. Example: "I am happy".

N-grams are used to model the probability of the occurrence of words or sequences, often
applied in text classification and predictive text tasks.

58. What are word embeddings in the context of Natural Language Processing
(NLP)? (10 Marks)
Answer:
Word Embeddings are a type of word representation that allows words to be represented as
vectors in a continuous vector space. Word embeddings capture the semantic meaning of words
by placing similar words closer together in the vector space.

1. Definition: Word embeddings map words to dense vectors (compared to sparse one-hot
encoding), where words with similar meanings have similar vector representations.
2. How Word Embeddings Work:
o They are learned from large text corpora using models like Word2Vec, GloVe, or
FastText.
o These models learn embeddings by capturing contextual information about words
based on their co-occurrence patterns in large datasets.
3. Example:
o In Word2Vec, there are two main models:
 Skip-gram: Given a word, predict the surrounding words.
 CBOW (Continuous Bag of Words): Given surrounding words, predict
the target word.
4. Applications of Word Embeddings:
o Semantic Similarity: Words with similar meanings will have similar embeddings
(e.g., "king" and "queen").
o Word Analogies: Embeddings allow for operations like "King - Man + Woman =
Queen".
o Text Classification: Word embeddings can be used as features in machine
learning models for text classification, sentiment analysis, etc.

59. What is "vector semantics" in NLP, and why is it useful for understanding
word meanings? (10 Marks)

Answer:
Vector Semantics in NLP refers to the representation of words as vectors in a high-dimensional
space, where the geometric relationships between vectors capture their semantic relationships.

1. Definition: Words are represented as vectors, and their meanings are interpreted based on
their position in this vector space. Similar words have similar vector representations (i.e.,
they are closer in the vector space).
2. How It Works:
o Words that share similar contexts or occur in similar contexts (in large corpora)
are placed closer together in the vector space.
o Word2Vec and GloVe are popular techniques used to generate vector
embeddings for words.
3. Applications:
o Word Similarity: Words that are semantically similar (e.g., "dog" and "puppy")
will have similar vector representations.
o Word Analogies: Vector semantics can be used for tasks like solving analogies
(e.g., "king" - "man" + "woman" = "queen").
o Text Classification: Word vectors are used to capture word meaning for
classification tasks, helping machines understand context and intent.

Why It’s Useful:

 Contextual Understanding: Vector semantics allows for nuanced representation of


words based on context, overcoming the limitations of traditional one-hot encoding.
 Semantic Tasks: It enables tasks like semantic similarity, word analogy, and sentiment
analysis by using geometric properties of word vectors.

60. Discuss a significant limitation of TF-IDF. (2 Marks)

Answer:
A significant limitation of TF-IDF (Term Frequency-Inverse Document Frequency) is that it
does not capture semantic meaning or context between words. TF-IDF is purely based on the
frequency of terms in documents and does not account for the relationships or meanings of
words beyond their occurrence.

1. Example: The words "dog" and "hound" may have similar meanings but will be treated
as completely independent terms in TF-IDF, ignoring their semantic relationship.
2. Impact: This limitation can affect tasks like information retrieval and text classification,
where understanding word meaning and context is crucial.

61. Discuss the application of regular expressions in Natural Language


Processing (NLP), emphasizing their role in text processing tasks. Provide
examples. (5 Marks)

Answer:
Regular expressions (regex) are used extensively in Natural Language Processing (NLP) for
text processing tasks. They provide a powerful mechanism to identify, match, and extract
specific patterns from text.

1. Applications of Regular Expressions in NLP:


o Tokenization: Regex can help split text into words, sentences, or other
meaningful components.
o Text Preprocessing: Removing unwanted characters, special symbols, and
stopwords using regex.
o Pattern Matching: Identifying specific patterns like email addresses, phone
numbers, dates, etc.
o Named Entity Recognition (NER): Extracting entities like names, locations, and
dates.
2. Example:
o To match an email address:
r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
 This regex will match typical email formats.
3. Advantages:
o Fast pattern matching.
o Flexible for different text-processing needs.
o Helps in data cleaning and extraction.

62. Explain the concept of N-grams in NLP and with examples discuss their
importance in language modelling to demonstrate how N-grams capture
sequential patterns in text data. (10 Marks)

Answer:
N-grams are contiguous sequences of n words in a given text, widely used in language
modeling for capturing sequential relationships in text.

1. Definition:
o An N-gram is a sequence of n words that appear together in a text corpus.
 Unigram: Single word (e.g., "I", "am", "hungry").
 Bigram: Two consecutive words (e.g., "I am", "am hungry").
 Trigram: Three consecutive words (e.g., "I am hungry").
2. Importance in Language Modeling:
o N-grams help capture local dependencies between words, such as common word
pairs or triplets.
o They are used in statistical language models (e.g., n-gram models) to predict
the next word in a sequence based on the previous words.
3. Example:
o In the sentence "I am happy today", the bigrams would be: "I am", "am happy",
"happy today".
o These help the model learn context and predict the next word.
4. Sequential Pattern Capture:
o N-grams help in capturing the sequential relationships in language. For example,
"I am" is a common phrase, whereas "am I" might not be as frequent,
demonstrating how N-grams capture word order.

63. Explain the significance of n-grams in the design of any text classification
system using examples. (5 Marks)

Answer:
N-grams play a significant role in text classification systems by providing a way to represent
text data that accounts for word sequences and context, which is essential for tasks like sentiment
analysis, spam detection, etc.
1. N-grams in Text Classification:
o N-grams help capture the local context of words, which improves the classifier’s
ability to understand sequences and relationships between terms in a document.
o They are often used as features for classification algorithms like Naive Bayes or
SVM.
2. Example:
o In sentiment analysis, bigrams like "good movie" or "not good" could provide
important clues about sentiment.
 "good movie" might be classified as positive, while "not good" would
indicate negative sentiment.
3. Why N-grams Are Useful:
o They preserve the word order and help capture contextual dependencies that
unigrams (single words) cannot.
o N-grams allow the classifier to learn which word combinations are relevant for
distinguishing categories.

64. Discuss the disadvantage of unigram in information extraction. (5 Marks)

Answer:
The unigram model treats each word as independent, ignoring word order and context, which
can lead to several disadvantages in information extraction tasks.

1. Disadvantages:
o Lack of Context: Unigrams cannot capture the relationship between words. For
example, "New York" as a location would be treated as two separate words
("New" and "York"), missing the context.
o Ambiguity: Unigrams do not resolve word ambiguities. For example, "bank"
could mean a financial institution or the side of a river, but unigrams cannot
distinguish between the two without context.
o Inaccurate Feature Representation: Information extraction often relies on
understanding the order and proximity of words, which unigrams fail to capture.
2. Example:
o In the sentence "I went to the bank", unigram models might fail to understand the
meaning because "bank" could refer to a financial institution or the side of a river.

65. Define homographs and provide an example. (2 Marks)

Answer:
Homographs are words that are spelled the same but have different meanings and sometimes
different pronunciations, depending on the context.

1. Example:
o Lead (to guide) vs. Lead (a heavy metal).
o In "She will lead the team," "lead" refers to guiding. In "The lead pipe," "lead"
refers to the metal.

66. How is the Levenshtein distance algorithm used to find similar words to a
given word? (10 Marks)

Answer:
The Levenshtein distance algorithm calculates the minimum number of single-character
edits (insertions, deletions, or substitutions) required to convert one string into another. It is
useful for finding similar words by measuring how closely a word matches another.

1. Steps:
o The algorithm computes a matrix where each cell (i, j) represents the minimum
number of edits needed to transform the first i characters of one word into the
first j characters of the other.
o It then uses dynamic programming to fill the matrix based on the choices of
insertion, deletion, or substitution.
2. Use in Finding Similar Words:
o Words with lower Levenshtein distances from a given word are considered more
similar.
o For example, given the word "kitten," similar words like "sitting" or "mitten"
would have a small Levenshtein distance.
3. Example:
o Transform "kitten" to "sitting":
 Substitute "k" with "s": "sitten" (1 edit)
 Substitute "e" with "i": "sittin" (1 edit)
 Insert "g" at the end: "sitting" (1 edit)
 Total distance = 3 edits.
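A minimal dynamic-programming implementation of the algorithm described above (unit cost for insertion, deletion, and substitution); the function is a sketch written for this question bank:

```python
# Standard DP formulation: dp[i][j] = minimum edits to turn a[:i] into b[:j].
def levenshtein(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j                     # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```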

67. Define heteronyms and provide an example. (2 Marks)

Answer:
Heteronyms are words that are spelled the same but have different meanings and are
pronounced differently depending on the context.

1. Example:
o Lead (to guide) vs. Lead (a type of metal).
 "She will lead the team" (pronounced /liːd/).
 "The pipe is made of lead" (pronounced /lɛd/).
68. Explain the concept of polysemy and provide an example. (2 Marks)

Answer:
Polysemy refers to a single word that has multiple meanings, which are related by extension or
metaphor.

1. Example:
o Bank can mean:
 A financial institution.
 The side of a river (e.g., "river bank").
 A place to store or accumulate (e.g., "blood bank").

69. Define synonyms and antonyms and provide examples of each. (2 Marks)

Answer:

 Synonyms: Words that have the same or nearly the same meaning.
o Example: "Happy" and "Joyful".
 Antonyms: Words that have opposite meanings.
o Example: "Happy" and "Sad".

70. We are given the following corpus:

<s> I am sam </s>

<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam</s>
Using a bigram language model with add-one smoothing, what is P(Sam | am)? Include <s> &
</s> in your counts just like any other token. (10 Marks)
Answer:
To calculate the probability P(Sam | am) using a bigram model with add-one smoothing:

1. Bigram Count:
o Count of the bigram (am, Sam): 2 (treating "sam" and "Sam" as the same token, it
occurs in "<s> I am sam </s>" and "<s> I am Sam </s>").
o Count of "am": 3 (it appears once in each of the first three sentences).
2. Add-One Smoothing:
o To apply add-one smoothing, we add 1 to the bigram count and add the
vocabulary size V to the denominator.
3. Formula:
o P(Sam | am) = (count(am, Sam) + 1) / (count(am) + V)
 where V is the vocabulary size (number of unique tokens, including <s>
and </s>).
4. Calculation:
o Count of (am, Sam) = 2.
o Count of "am" = 3.
o Vocabulary size (V) = 11 (unique tokens: <s>, </s>, "I", "am", "Sam", "do",
"not", "like", "green", "eggs", "and").

P(Sam | am) = (2 + 1) / (3 + 11) = 3/14 ≈ 0.214.
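A short Python sketch (illustrative only; it normalizes the lowercase "sam" of the first sentence to "Sam", as done in the counts above) that recomputes this value:

```python
# Sketch: recomputing P(Sam | am) with add-one smoothing on the toy corpus.
from collections import Counter

corpus = [
    "<s> I am Sam </s>",   # lowercase "sam" in the question normalized to "Sam"
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

V = len(unigrams)  # vocabulary size, counting <s> and </s> like any other token -> 11
p = (bigrams[("am", "Sam")] + 1) / (unigrams["am"] + V)
print(V, p)        # 11  0.214... (= 3/14)
```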

71. Comment on the validity of the following statements:

a) Rule-based taggers are non-deterministic

b) Stochastic taggers are language independent
c) Brill’s tagger is a rule-based tagger (10 Marks)
Answer:
a) Rule-based taggers are non-deterministic:

 This is incorrect. Rule-based taggers are typically deterministic, as they use predefined
linguistic rules to tag words.

b) Stochastic taggers are language independent:

 This is incorrect. Stochastic taggers, such as Hidden Markov Models (HMMs), depend
on probabilistic models that are trained on specific languages and thus are not language-
independent.

c) Brill’s tagger is a rule-based tagger:

 This is correct. Brill’s tagger is a rule-based tagger that applies transformation rules to
the output of an initial tagger to improve accuracy.

Module 3 
1. In the context of natural language processing, how can we leverage the concepts of TF-
IDF, training set, validation set, test set, and stop words to improve the accuracy and
effectiveness of machine learning models and algorithms? Additionally, what are some
potential challenges and considerations when working with these concepts, and how can we
address them? (5 Marks)
 TF-IDF (Term Frequency-Inverse Document Frequency): It is a statistical measure
used to evaluate the importance of a word in a document relative to a corpus. By using
TF-IDF, we can reduce the weight of commonly occurring words (like stop words) and
increase the weight of rare but meaningful words. This improves the feature
representation for models, enhancing the accuracy.
 Training Set: This is used to train the machine learning model. A good training set
ensures that the model learns the underlying patterns.
 Validation Set: This set helps in tuning hyperparameters, avoiding overfitting, and
evaluating the model’s performance during training. It's crucial for model selection.
 Test Set: This set is used after training and validation to evaluate the final model
performance.
 Stop Words: Common words like "the", "and", "is", etc., can often be removed to avoid
unnecessary noise and reduce dimensionality in NLP tasks.

Challenges and Considerations:

 Overfitting: Model may perform well on the training set but fail on unseen data. This can
be avoided using cross-validation and regularization.
 Imbalanced Data: If the training set contains more examples from one class than the
other, the model can be biased. This can be addressed using resampling techniques or
class weights.
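A hedged sketch showing these pieces together, assuming scikit-learn is available; the documents and labels are invented:

```python
# Sketch: TF-IDF features, a train/test split, and stop-word removal (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

docs = ["free offer win money now", "meeting at noon tomorrow",
        "win a free prize today", "project status update attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy data)

# Hold out a test set; in practice a validation split is carved out of the training part
# (or k-fold cross-validation is used) for hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.25, random_state=0)

vectorizer = TfidfVectorizer(stop_words="english")   # drops stop words, up-weights rarer terms
clf = LogisticRegression().fit(vectorizer.fit_transform(X_train), y_train)
print(clf.score(vectorizer.transform(X_test), y_test))
```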

2. Define text classification. (2 Marks)

 Text Classification: It is the task of categorizing text into predefined labels or categories.
For example, classifying emails as spam or non-spam.

3. Describe the ways of Information Extraction from unstructured text. (5 Marks)

 Named Entity Recognition (NER): Identifying entities like names, dates, and
organizations.
 Part-of-Speech Tagging (POS): Assigning parts of speech to each word.
 Chunking: Grouping words into meaningful chunks like noun phrases.
 Relation Extraction: Identifying relationships between entities in a text.
 Text Segmentation: Dividing text into smaller, manageable pieces.

4. Explain ad-hoc retrieval problems. (2 Marks)

 Ad-hoc Retrieval Problems: These are search problems where the query is not pre-
defined, and the goal is to retrieve relevant documents or information based on an
individual query. It requires efficient retrieval mechanisms to match documents to a
query.

5. What aspects of ad-hoc retrieval problems are addressed by Information Retrieval

research? (2 Marks)
 Information Retrieval Research: It focuses on improving the effectiveness of search
engines by addressing issues like query formulation, ranking algorithms, and retrieval
precision/recall trade-offs.

6. What are the contents of an Information Retrieval model? (2 Marks)

 Contents: It consists of a query processor, an index (like inverted index), a retrieval

algorithm, and ranking functions.

8. Describe how hand-coded rules help in performing text classification. (5 Marks)

 Hand-Coded Rules: These are manually created rules that specify patterns or conditions
for classifying text. They typically involve looking for certain keywords, phrases, or
structures in the text and assigning it to a specific class based on these predefined
patterns.
 Example: In spam detection, a rule might look for words like "free", "buy now", or "win"
to classify an email as spam. Hand-coded rules can be effective when the data is simple
or well-defined.
 Challenges: Hand-coded rules are labor-intensive to create and maintain, and they may
not generalize well to unseen data. They are also limited in handling complex language
patterns.

9. What are the machine learning approaches used for text classification? (5 Marks)

 Supervised Learning: This is the most common approach, where a labeled training
dataset is used to train a model. Examples include:
o Naive Bayes Classifier: A probabilistic classifier based on Bayes' theorem, often
used for text classification.
o Support Vector Machines (SVM): SVMs use hyperplanes to classify texts into
different categories.
o Decision Trees: These models use tree-like structures to make decisions based on
feature values.
o Deep Learning: Neural networks, especially recurrent neural networks (RNNs)
or convolutional neural networks (CNNs), have been highly effective for text
classification tasks.

10. What is/are the drawback/s of the Naive Bayes classifier? (5 Marks)

 Assumption of Independence: Naive Bayes assumes that the features (words) are
independent of each other, which often does not hold in real-world text data.
 Limited to Simple Models: It does not capture complex relationships between words or
higher-order dependencies in the text.
 Sensitivity to Imbalanced Data: Naive Bayes can be biased towards the majority class
in imbalanced datasets, leading to poor performance on the minority class.

11. Explain the result of Multinomial Naïve Bayes Independence Assumptions. (5 Marks)
 Multinomial Naïve Bayes Assumptions: This variation of Naive Bayes assumes that the
features (words) are generated from a multinomial distribution and that each word in a
document is generated independently.
 Result: The model calculates the probability of a document belonging to a class based on
the frequency of words in the document. This assumption simplifies the computation, but
it can lead to poor performance when word dependencies exist in the text.

12. Write two NLP applications where we can use the bag-of-words technique. (5 Marks)

 Spam Email Classification: Bag-of-Words can be used to represent emails as vectors of

word frequencies and classify them as spam or not spam.
 Sentiment Analysis: In sentiment analysis, Bag-of-Words can represent text reviews by
counting the occurrences of positive or negative words, helping classify the sentiment of
the text.

13. What is the problem with the maximum likelihood for the Multinomial Naive Bayes
classifier? How to resolve? (10 Marks)

 Problem with Maximum Likelihood: In the Multinomial Naive Bayes classifier, the
maximum likelihood estimation (MLE) of probabilities can assign zero probability to
words that do not appear in the training set. This leads to issues when such words appear
in the test set, as it results in an overall zero probability for the document.
 Solution: This can be addressed by applying Laplace Smoothing (or add-one
smoothing), which adds a small constant (usually 1) to all word counts, ensuring that no
word has a zero probability. This adjustment ensures that words not seen in the training
data still have a non-zero probability.
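In scikit-learn's Multinomial Naive Bayes this smoothing is exposed as the alpha parameter; a hedged sketch with invented data:

```python
# Sketch: add-one (Laplace) smoothing via MultinomialNB's alpha parameter (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["buy cheap meds", "cheap offer now", "lunch with team", "team meeting notes"]
train_labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)

# alpha=1.0 adds one to every word count per class, so a word seen only in "spam"
# (e.g. "offer") still gets a small non-zero probability under "ham" instead of zero.
clf = MultinomialNB(alpha=1.0).fit(X, train_labels)

# "free" never appeared in training, so the vectorizer simply drops it; the remaining
# words still yield a sensible prediction thanks to smoothing.
print(clf.predict(vec.transform(["cheap free offer"])))  # 'spam' on this toy data
```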

14. Explain the confusion matrix that can be generated in terms of a spam detector. (5
Marks)

 Confusion Matrix: A confusion matrix is a table used to evaluate the performance of a

classification algorithm. It summarizes the results of predictions, including:
o True Positives (TP): Emails correctly classified as spam.
o True Negatives (TN): Emails correctly classified as non-spam.
o False Positives (FP): Non-spam emails incorrectly classified as spam (type I
error).
o False Negatives (FN): Spam emails incorrectly classified as non-spam (type II
error).

This matrix helps in calculating performance metrics like accuracy, precision, recall, and F1-
score.
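A small sketch of such a matrix for a spam detector, assuming scikit-learn; the label lists are invented:

```python
# Sketch: confusion matrix for a toy spam detector (scikit-learn assumed; labels invented).
from sklearn.metrics import confusion_matrix, classification_report

y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham",  "ham", "spam", "spam", "ham"]

# Rows = actual class, columns = predicted class (order fixed by `labels`).
print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))
# [[2 1]   -> 2 TP (spam caught), 1 FN (spam missed)
#  [1 2]]  -> 1 FP (ham wrongly flagged), 2 TN (ham passed through)

print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
```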

15. How k-fold cross validation is used for evaluating a text classifier? (5 Marks)

 k-Fold Cross-Validation: In k-fold cross-validation, the dataset is divided into k equally

sized subsets. The model is trained on k-1 subsets and tested on the remaining subset.
This process is repeated k times, with each subset used as the test set once. The final
performance is averaged across all k iterations, providing a more robust estimate of the
model's performance.
 Use in Text Classification: k-fold cross-validation helps in evaluating the classifier's
generalization ability and prevents overfitting by ensuring the model is validated on
different subsets of the data.
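A hedged sketch with scikit-learn (the review snippets and labels are invented):

```python
# Sketch: 5-fold cross-validation of a TF-IDF + Naive Bayes pipeline (scikit-learn assumed).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

docs = ["great product", "terrible service", "loved it", "awful experience",
        "highly recommend", "would not buy again", "works perfectly", "broke in a day",
        "excellent support", "very disappointed"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative (toy data)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(model, docs, labels, cv=5, scoring="accuracy")
print(scores, scores.mean())  # one accuracy per fold, then their average
```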

16. Explain practical issues of a text classifier and how to solve them. (5 Marks)

 Practical Issues:
o Data Imbalance: If one class is much more frequent than the other, the model
may bias towards the majority class. This can be addressed using techniques like
SMOTE (Synthetic Minority Over-sampling Technique) or class weights.
o Noise and Irrelevant Features: Unimportant or noisy features can degrade
performance. This can be mitigated through feature selection and removing stop
words.
o Overfitting: If a model performs well on the training data but poorly on the test
set, it may be overfitting. This can be reduced by using regularization techniques
or increasing training data.

17. What are the types of Text classification techniques? (5 Marks)

 Rule-Based Methods: Involves manually coded rules to classify text. These methods are
easy to interpret but can be labor-intensive.
 Statistical Methods: These include machine learning algorithms like Naive Bayes, SVM,
and decision trees.
 Deep Learning Methods: These use neural networks like CNNs, RNNs, or transformers
for text classification tasks. They are effective for handling large datasets and complex
language features.

18. Give any 3 different evaluation metrics available for text classification. Explain with
examples. (10 Marks)

 Accuracy: The proportion of correct predictions over total predictions. Example: If 90

out of 100 emails are correctly classified, accuracy = 90%.
 Precision: The proportion of true positives out of all predicted positives. Example: If 8
out of 10 emails predicted as spam are actually spam, precision = 80%.
 Recall: The proportion of true positives out of all actual positives. Example: If 8 out of
10 spam emails are detected, recall = 80%.

These metrics help assess different aspects of the classifier’s performance.

19. What are the evaluation measures to be undertaken to judge the performance of a
matrix? (2 Marks)

 Evaluation Measures:
o Accuracy: Correct predictions over total predictions.
o Precision, Recall, and F1-Score: Evaluates the classifier’s performance in
identifying each class, especially in imbalanced datasets.

20. With a schematic diagram explain Word2vec type of word embedding. (5 Marks)

 Word2Vec: Word2Vec is a technique that learns word representations by predicting

surrounding words in a context window (Skip-Gram model) or predicting a word given
its context (Continuous Bag of Words - CBOW). Words that appear in similar contexts
will have similar vector representations.
o Schematic Diagram: A simple diagram would show input words fed into the
neural network, which predicts context words (or vice versa), generating word
embeddings.
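A hedged sketch using the gensim library (assumed installed, version 4 or later); the toy corpus is invented:

```python
# Hedged sketch: training Word2Vec on a toy corpus with gensim (version 4+ API assumed).
from gensim.models import Word2Vec

sentences = [["the", "dog", "barked"], ["the", "puppy", "barked"],
             ["the", "cat", "meowed"], ["the", "kitten", "meowed"]]

# sg=1 selects the Skip-Gram architecture (predict context words from a word);
# sg=0 would select CBOW (predict a word from its context).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

# Words used in similar contexts end up with similar vectors.
print(model.wv.most_similar("dog", topn=2))
```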

21. Explain the working of Doc2Vec type of word embedding with labelled diagram. (5
Marks)

 Doc2Vec: Doc2Vec is an extension of Word2Vec, designed to generate vector

representations not just for words but for entire documents or sentences. It works by
appending a unique "document ID" (or context) vector to each word in the document
during training, which helps capture the semantic meaning of the whole document.
 Working:
o Similar to Word2Vec, the model tries to predict the surrounding words, but it also
learns a vector for the document.
o The model updates the document vector to represent the entire context of the
document.
o The resulting vector is a dense representation of the document, capturing its
semantic meaning.
 Schematic Diagram:
o A diagram showing a document input, which consists of several words and a
document vector, with the neural network learning the relationships between
words and the document.
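A hedged sketch with gensim (assumed installed, version 4 or later); the documents are invented:

```python
# Hedged sketch: Doc2Vec with gensim (version 4+ API assumed); documents are invented.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["the movie was great", "the film was fantastic", "the weather is cold today"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# Each document gets its own learned vector ("document ID") alongside the word vectors.
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=200)

# Infer a vector for unseen text and find the most similar training document.
vec = model.infer_vector("a fantastic movie".split())
print(model.dv.most_similar([vec], topn=1))  # likely one of the two movie reviews
```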

22. With example explain the following word to sequence analysis:- (5 Marks)

a) Vector Semantic:

 Explanation: Vector semantics involves representing words as vectors in a continuous

vector space. In this space, semantically similar words are placed closer together. This
allows for capturing relationships between words (e.g., king - man + woman = queen).
 Example: Word2Vec uses vector semantics to represent words in a high-dimensional
space where words with similar meanings are close to each other. For instance, "dog" and
"puppy" would have similar vectors.

b) Probabilistic Language Model:

 Explanation: Probabilistic language models estimate the probability of a sequence of
words occurring together. These models capture the likelihood of word sequences,
helping in tasks like text generation or speech recognition.
 Example: A probabilistic language model might calculate the probability of the phrase "I
am going to the store" based on observed frequencies of word sequences in a training
corpus.

23. Define opinion mining. (2 Marks)

 Opinion Mining: Opinion mining, also known as sentiment analysis, is the process of
determining the sentiment expressed in a piece of text (positive, negative, or neutral). It
typically involves analyzing social media posts, product reviews, or customer feedback to
understand people's attitudes toward a subject.
 Example: A review that states, "This phone is amazing!" would be classified as positive
sentiment, while "This phone is terrible!" would be negative sentiment.

24. What are the aspects taken into account while collecting feedback of brands for
sentiment analysis? (5 Marks)

 Aspects Considered:
o Text Source: Feedback sources such as social media, product reviews, surveys, or
customer service interactions are crucial for analyzing sentiment.
o Tone: The tone of the feedback (positive, negative, neutral) helps in
understanding the overall sentiment about a brand.
o Keywords/Phrases: Keywords related to product features, customer service, or
brand perception are important in sentiment analysis.
o Context: The context in which feedback is provided (e.g., after a product launch
or service experience) can influence sentiment interpretation.
o Entity Recognition: Identifying the specific product or service mentioned in the
feedback, such as brand name, product types, or features.

25. What is intent analysis? (2 Marks)

 Intent Analysis: Intent analysis is the process of determining the purpose or goal behind
a piece of text. It helps in understanding why a user is interacting with a system or a
brand.
 Example: In customer support, intent analysis can determine whether the customer is
seeking help, making a purchase, or giving feedback.

26. Explain emotion analysis. (2 Marks)

 Emotion Analysis: Emotion analysis involves detecting the emotional tone behind words
in a text to understand the feelings expressed by the writer, such as happiness, sadness,
anger, or surprise.
 Example: A review like "I am so happy with this product!" would be identified as
expressing joy.
27. How does emotional analytics work? (5 Marks)

 Emotional Analytics works by:


1. Data Collection: Collecting text data from various sources like social media,
reviews, or conversations.
2. Emotion Classification: Using models trained to recognize emotions such as joy,
anger, or sadness in the text.
3. Natural Language Processing: Applying NLP techniques (tokenization,
sentiment analysis) to process and analyze the text data.
4. Emotion Detection: The model assigns an emotion label to the text based on its
content.
5. Results Interpretation: Analyzing trends in emotions for applications like
market sentiment or customer feedback.

28. Naïve Bayes classifier is not so naïve – explain. (5 Marks)

 Explanation: The term "naive" in Naïve Bayes classifier refers to the assumption of
feature independence, which is often unrealistic in real-world data. However, despite this
assumption, the classifier performs well in many practical applications, such as spam
detection and sentiment analysis, because it simplifies the problem significantly.
 Reason for success: Even with the naive assumption, the algorithm often yields
surprisingly good results by using probabilistic reasoning, especially when combined
with techniques like Laplace smoothing.

29. With detailed steps explain the working of Multinomial Naive Bayes learning. (5
Marks)

1. Data Preprocessing:
o Convert all text data into numerical format using techniques like TF-IDF or bag-
of-words.
2. Calculate Prior Probabilities:
o Calculate the prior probability of each class (e.g., spam or not spam) based on the
frequency of each class in the training data.
3. Calculate Conditional Probabilities:
o For each class, calculate the likelihood of each word appearing in the class using
the training data.
4. Bayes’ Theorem Application:
o Use Bayes' Theorem to combine prior and conditional probabilities to make
predictions for new data.
5. Classification:
o Choose the class with the highest posterior probability for classification.

30. What is micro averaging and macro averaging? Explain with an example. (10 Marks)

 Micro Averaging:
o In micro-averaging, the individual class predictions are aggregated first (i.e., the
true positives, false positives, and false negatives across all classes), and then the
metrics (precision, recall, F1 score) are computed based on the aggregated values.
 Macro Averaging:
o In macro-averaging, the metrics for each class are calculated independently, and
then the average of the metrics for each class is taken.
 Example:
o If we have two classes, A and B, and for each class, we calculate precision and
recall:
 Micro Average: We sum up all true positives, false positives, and false
negatives across both classes, and then compute precision and recall.
 Macro Average: We calculate precision and recall for each class
separately, and then compute the average of the precision and recall
values.
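A small sketch with scikit-learn showing how the two averages differ on the same predictions (labels invented):

```python
# Sketch: micro vs. macro averaged precision on the same predictions (scikit-learn assumed).
from sklearn.metrics import precision_score

y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "B", "B", "A"]

# Micro: pool TP and FP over all classes, then compute a single precision.
print(precision_score(y_true, y_pred, average="micro"))  # 4/6 ≈ 0.667
# Macro: compute precision per class (A: 3/4, B: 1/2), then average them.
print(precision_score(y_true, y_pred, average="macro"))  # (0.75 + 0.5) / 2 = 0.625
```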

31. State 3 opinion mining techniques with proper explanation. (10 Marks)

1. Lexicon-based Approach:
o Uses predefined lists of positive and negative words to determine sentiment. Each
word is assigned a sentiment score, and the overall sentiment of a text is
calculated based on the words present in the text.
2. Machine Learning-based Approach:
o Utilizes classification algorithms like Naive Bayes, SVM, or deep learning
models to classify text into positive, negative, or neutral sentiments. The model is
trained on labeled datasets to learn sentiment patterns.
3. Hybrid Approach:
o Combines both lexicon-based and machine learning-based methods. It uses
lexicon for sentiment scoring and machine learning for classification, improving
accuracy by leveraging both techniques.

32. What issue crops up for Information Retrieval based on keyword search in case of a
huge size document? (5 Marks)

 Issue:
o As the size of the document increases, the retrieval process becomes slower and
more resource-intensive.
o Keyword search becomes inefficient due to the sheer volume of data, leading to
longer response times and increased computational costs.
o Handling complex queries and providing relevant results in large documents can
be challenging, especially when the terms are ambiguous or have multiple
meanings.
 Solution:
o Indexing: Using efficient indexing techniques like inverted indexing to quickly
locate keywords and their occurrences in the document.
o Document Partitioning: Breaking down large documents into smaller,
manageable chunks to improve retrieval performance.
o Vector-based Search: Employing vector space models, such as TF-IDF, for
better ranking and relevance in large documents.

33. What are the initial stages of text processing? (10 Marks)

1. Text Collection:
o Gather raw text data from various sources like websites, books, articles, or social
media platforms.
2. Text Normalization:
o Standardizing the text by converting it to lowercase, removing punctuation,
special characters, and unnecessary white spaces.
3. Tokenization:
o Breaking the text into smaller units such as words, sentences, or subword tokens.
4. Stop Word Removal:
o Removing common words such as "the", "is", "in", which do not carry significant
meaning in the context of the analysis.
5. Stemming and Lemmatization:
o Reducing words to their base or root form to consolidate similar terms. For
example, "running" becomes "run".
6. POS Tagging:
o Assigning part-of-speech tags to words (e.g., noun, verb, adjective) to understand
their role in the sentence.
7. Named Entity Recognition (NER):
o Identifying and classifying named entities (e.g., persons, organizations, locations)
in the text.
8. Vectorization:
o Converting the processed text into a numerical form, such as using bag-of-words
or TF-IDF techniques, for further analysis or modeling.

34. What is the goal of an IR system? (10 Marks)

 Goal of Information Retrieval (IR) System:


o The primary goal of an IR system is to efficiently and accurately retrieve relevant
information from a large collection of documents based on user queries.
o The system aims to:
 Relevance: Ensure the results are closely related to the user’s query.
 Efficiency: Return results in a quick and resource-efficient manner.
 Ranking: Rank the retrieved documents based on relevance, helping the
user find the most pertinent information easily.
 Scalability: Handle large datasets and adapt to increasing volumes of
documents.
 User Satisfaction: Provide the most useful, accurate, and timely
information to the user.

35. What are the different ways to use Bag-of-words representation for text classification?
(10 Marks)
1. Simple Count Vectorization:
o Count the occurrences of each word in the text and use these counts as feature
vectors for classification.
2. TF-IDF (Term Frequency-Inverse Document Frequency):
o Calculate the term frequency (TF) and multiply it by the inverse document
frequency (IDF) to give weight to less common, more informative words.
3. N-Grams Representation:
o Extend the bag-of-words model by using n-grams (unigrams, bigrams, trigrams)
to capture local word order and contextual information.
4. Sparse Matrix Representation:
o Represent the text data as a sparse matrix, where most of the elements are zero, to
save memory and computational resources.
5. Dimensionality Reduction:
o Apply techniques like Principal Component Analysis (PCA) or Latent Semantic
Analysis (LSA) to reduce the dimensionality of the feature space and focus on the
most important features.

36. State the difference between sentiment analysis, intent analysis, and emotion analysis.
(10 Marks)

 Sentiment Analysis:
o Focuses on determining the overall sentiment or opinion expressed in the text
(e.g., positive, negative, neutral).
o Example: Analyzing customer reviews to determine whether the feedback is
positive or negative about a product.
 Intent Analysis:
o Aims to understand the underlying purpose or goal behind the text (e.g., asking
for help, making a purchase, giving feedback).
o Example: Identifying whether a user's query to a chatbot is about troubleshooting
or product inquiry.
 Emotion Analysis:
o Detects and classifies the emotional tone in the text (e.g., joy, anger, sadness,
surprise).
o Example: Analyzing social media posts to understand the public's emotional
reaction to an event or news.

37. How is sentiment analysis used by different brands to assess the status of the market
after launching a product? (10 Marks)

 Sentiment Analysis for Market Assessment:


o Monitoring Social Media: Brands track sentiments expressed on social media
platforms, blogs, and forums to gauge the public's perception of a new product.
o Customer Reviews: Sentiment analysis is applied to customer reviews on e-
commerce sites to understand satisfaction levels and identify areas for
improvement.
o Feedback Loops: Brands use sentiment analysis to create real-time feedback
loops, allowing them to react quickly to customer concerns or positive feedback.
o Market Sentiment Trends: Analyzing sentiment over time to identify shifts in
customer opinions and align marketing or product strategies accordingly.
o Competitor Analysis: Sentiment analysis helps compare consumer reactions to
the brand's product versus competitors’ products, providing insights into
competitive advantages.

38. Mention a few practical applications of emotion analysis by emotion recognition. (10
Marks)

1. Customer Experience:
o Emotion analysis can be applied in customer support to analyze emotions in
interactions and improve service quality based on emotional cues.
2. Market Research:
o Brands use emotion analysis to understand consumer emotions towards their
products or advertisements, influencing marketing strategies.
3. Healthcare:
o Emotion recognition can help in mental health diagnosis and therapy,
understanding emotional states of patients and offering appropriate interventions.
4. Human-Computer Interaction:
o Emotion analysis enhances user experience by making systems more responsive
to the user’s emotional state, such as in virtual assistants or gaming.
5. Education:
o Emotion recognition can help educators understand students' emotional
engagement and adapt teaching strategies accordingly.

39. Step by step explain how Naive Bayes classifier can be used for text classification. (10
Marks)

1. Preprocessing:
o Collect the dataset, clean the text data, and convert it into a numerical form using
techniques like bag-of-words or TF-IDF.
2. Feature Extraction:
o Extract features (e.g., word frequencies) from the text for use in the classifier.
3. Model Training:
o For each class (e.g., spam, not spam), calculate the prior probability and
likelihood of each word in the class using the training data.
4. Apply Bayes' Theorem:
o Use Bayes' theorem to compute the posterior probability for each class given the
word frequencies in the text.
5. Classification:
o Choose the class with the highest posterior probability as the predicted class for
the new text.
6. Model Evaluation:
o Evaluate the model’s performance using metrics like accuracy, precision, recall,
and F1 score on the test data.

40. What are the 4 steps of text normalization? (5 Marks)

1. Lowercasing:
o Convert all text to lowercase to ensure uniformity and avoid duplicate word
entries (e.g., "Apple" and "apple").
2. Removing Punctuation:
o Remove all punctuation marks (e.g., periods, commas, exclamation marks) as
they don't contribute to the text analysis.
3. Removing Stop Words:
o Remove common words like "and", "the", "is" that don't carry significant meaning
in the analysis.
4. Stemming/Lemmatization:
o Reduce words to their root form (e.g., "running" becomes "run") to consolidate
similar terms.
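A hedged sketch of the four steps with NLTK (assumes the stopwords corpus can be downloaded); the input sentence is invented:

```python
# Hedged sketch of the four normalization steps using NLTK.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

text = "Running FAST, the runners were running daily!"

text = text.lower()                                                # 1. lowercasing
text = text.translate(str.maketrans("", "", string.punctuation))   # 2. remove punctuation
words = [w for w in text.split() if w not in stopwords.words("english")]  # 3. stop words
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])  # 4. stemming -> ['run', 'fast', 'runner', 'run', 'daili']
```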

41. Highlight practical applications of text classification concept. (10 Marks)

 Email Filtering:
o Classifying emails as spam or non-spam using text classification algorithms. This
helps in organizing incoming emails and preventing spam from cluttering the
inbox.
 Sentiment Analysis:
o Text classification can be used to analyze customer feedback, product reviews, or
social media posts to determine whether the sentiment is positive, negative, or
neutral.
 Document Categorization:
o Categorizing large collections of documents into predefined categories (e.g., news
articles classified into topics like sports, politics, technology) for easier retrieval.
 Content Recommendation Systems:
o Based on user preferences or reading history, text classification can recommend
similar articles, books, or movies by categorizing content based on the user's
interest.
 Chatbot Responses:
o Text classification helps chatbots understand user queries and respond
accordingly by classifying user input into predefined categories of intents (e.g.,
booking a ticket, checking weather).

42. What is Named Entity Recognition (NER)? (2 Marks)

 Named Entity Recognition (NER) is a subtask of information extraction that seeks to

locate and classify named entities in text into predefined categories, such as names of
persons, organizations, locations, dates, and more.
43. How is Named Entity Recognition useful in NLP applications? (5 Marks)

 Search Engines:
o NER helps in improving search engine results by identifying and indexing named
entities, enabling better content retrieval related to specific entities like people or
places.
 Question Answering Systems:
o In question-answering applications, NER aids in extracting relevant named
entities from a user's question to provide accurate answers.
 Information Extraction:
o NER assists in automatically extracting important information, such as people,
places, or events, from large unstructured datasets or documents.
 Sentiment Analysis:
o By recognizing named entities, sentiment analysis can be more accurate in
determining the sentiment expressed towards specific individuals or
organizations.

44. How k-fold cross validation is used for evaluating a text classifier? (10 Marks)

1. Data Splitting:
o Divide the entire dataset into k equal-sized "folds". Typically, k is chosen as 5 or
10, but it can be adjusted based on dataset size.
2. Training and Testing:
o For each fold, use k-1 folds for training the model and the remaining fold for
testing the model.
3. Model Evaluation:
o Repeat this process for each fold and calculate the evaluation metric (e.g.,
accuracy, precision, recall) for each iteration.
4. Result Averaging:
o Average the evaluation scores obtained from all k iterations to get a more robust
estimate of the model's performance.
5. Model Selection:
o The final averaged score can be used to select the best model configuration or
tune hyperparameters.

45. Explain the fundamental concepts of Natural Language Processing (NLP) and discuss
its significance in today's digital era, providing examples of real-world applications and
potential future advancements. (5 Marks)

 Fundamental Concepts:
o Text Preprocessing: Cleaning and preparing text data by tokenizing, stemming,
and removing stop words.
o Text Representation: Converting text into numerical representations like bag-of-
words, TF-IDF, or word embeddings (e.g., Word2Vec).
o Language Models: Building models that understand and generate human
language, such as n-grams or deep learning-based models like GPT.
 Significance:
o NLP enables machines to interact with humans in natural language, which is
crucial for applications such as virtual assistants (e.g., Siri, Alexa), translation
services (e.g., Google Translate), and content recommendations.
o Real-World Applications:
 Sentiment analysis in social media monitoring.
 Chatbots for customer service in e-commerce.
 Text summarization for news articles.
o Future Advancements:
 More sophisticated conversational AI with deeper understanding and
emotional intelligence.
 Real-time multilingual translation and more accurate voice recognition.

46. What is Ambiguity? Explain different types of ambiguity in NLP. (5 Marks)

 Ambiguity in NLP refers to the phenomenon where a single word or sentence can have
multiple meanings depending on the context in which it is used.
 Types of Ambiguity:
o Lexical Ambiguity:
 Occurs when a word has multiple meanings. For example, "bank" can
refer to a financial institution or the side of a river.
o Syntactic Ambiguity:
 Arises when the structure of a sentence allows for multiple interpretations.
For example, "I saw the man with the telescope" could mean either "I saw
a man who had a telescope" or "I used a telescope to see the man."
o Semantic Ambiguity:
 Occurs when a sentence can have different meanings due to the
interpretation of words. For example, "The chicken is ready to eat" could
mean the chicken is cooked and ready to be eaten, or the chicken is ready
to eat something else.
o Pragmatic Ambiguity:
 Refers to the ambiguity arising from the context or the speaker’s intent.
For example, "Can you pass the salt?" may seem like a question but is
often interpreted as a request.

47. What are the benefits of a text classification system? Give an example. (5 Marks)

 Benefits:
o Automation: Automates the categorization of large text datasets, saving time and
effort compared to manual classification.
o Accuracy: A well-trained classifier can achieve high accuracy in identifying and
categorizing text, reducing human error.
o Scalability: It can scale to handle large amounts of data efficiently.
 Example:
o A spam filter in an email system uses text classification to automatically identify
and move spam emails to the spam folder, keeping the inbox clean and organized.
48. Explain the Building Blocks of Semantic System? (5 Marks)

1. Lexical Semantics:
o Understanding the meaning of individual words and how they combine to form
phrases and sentences. It involves analyzing word meanings, synonyms, and
antonyms.
2. Compositional Semantics:
o The process of deriving the meaning of a sentence or text based on the meanings
of its individual components (words and phrases).
3. Pragmatics:
o Understanding meaning in context, such as how the meaning of a sentence may
change depending on the situation or speaker's intent.
4. World Knowledge:
o Incorporating external knowledge or common sense to enhance the understanding
of text, which is essential for disambiguating meanings.

49. What is NLTK? How is it different from Spacy? (5 Marks)

 NLTK (Natural Language Toolkit):


o A comprehensive library for working with human language data (text) in Python.
It provides tools for text processing, classification, tokenization, stemming,
tagging, parsing, and more.
 Spacy:
o A modern NLP library designed for fast and efficient text processing. It is
optimized for production usage and offers state-of-the-art models for tasks like
named entity recognition, dependency parsing, and part-of-speech tagging.
 Key Differences:
o Performance: Spacy is faster and more efficient, while NLTK is more focused on
educational and research purposes.
o Ease of Use: Spacy provides a more user-friendly API, while NLTK offers more
flexibility and a wider range of NLP functionalities.
o Use Case: NLTK is best suited for experimentation and learning, while Spacy is
ideal for building real-world applications.

50. Explain Dependency Parsing in NLP? (10 Marks)

 Definition:
o Dependency parsing is the process of analyzing the grammatical structure of a
sentence by establishing relationships between "head" words and their
dependents. It helps in understanding the syntactic structure and relationships
within the sentence.
 How it works:
o Dependency Tree: A dependency tree is constructed where the root of the tree
represents the main verb or predicate of the sentence, and all other words are
connected to the root based on syntactic dependencies.
o Dependencies: The words in the sentence are linked to their syntactic head (e.g.,
the subject of the verb, the object of the verb) with directed edges. Each word in
the sentence has a specific dependency relation with its head.
o Example:
 Sentence: "The cat sat on the mat."
 "sat" is the root of the sentence. "The" is a dependent of "cat", and "on" is
the head of the prepositional phrase "on the mat."
 Applications:
o Information Extraction: Dependency parsing helps in identifying relations
between entities, such as "John (subject) gave (verb) a book (object) to Mary
(indirect object)".
o Machine Translation: Understanding the grammatical structure of a source
sentence and translating it into another language while maintaining meaning.
o Question Answering: Helps in identifying the relevant components of a question
and matching them to the answer correctly.
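A minimal sketch with spaCy, assuming the small English model "en_core_web_sm" is installed:

```python
# Hedged sketch: printing the dependency relations of the example sentence with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

for token in doc:
    # token.dep_ is the dependency relation; token.head is the word it attaches to.
    print(f"{token.text:<5} --{token.dep_}--> {token.head.text}")
# e.g. "cat --nsubj--> sat", "sat --ROOT--> sat", "mat --pobj--> on"
```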

51. What are the steps involved in pre-processing data for NLP? (5 Marks)

1. Text Cleaning:
o Remove unnecessary elements such as HTML tags, special characters, and
irrelevant text like advertisements.
2. Tokenization:
o Break the text into smaller units like words or sentences. For example, "I love
NLP" is tokenized into ["I", "love", "NLP"].
3. Lowercasing:
o Convert all text to lowercase to ensure that the model treats words like "Cat" and
"cat" as the same.
4. Removing Stopwords:
o Remove common words (e.g., "the", "is", "and") that do not contribute much
meaning to the text, which helps in reducing noise.
5. Stemming or Lemmatization:
o Reduce words to their root form (e.g., "running" becomes "run") to treat related
words as the same.
6. Vectorization:
o Convert the processed text into numerical representations (e.g., TF-IDF, bag-of-
words, or word embeddings) for further machine learning processing.

52. What are some common applications of chatbots in various industries? (10 Marks)

 Customer Service:
o Chatbots are used by companies like Amazon and Zappos to handle customer
queries, resolve issues, and provide product recommendations without human
intervention.
 Healthcare:
o Chatbots like Babylon Health offer medical consultations by analyzing symptoms
and providing possible diagnoses, helping patients get quick medical advice.
 E-commerce:
o Many e-commerce websites use chatbots to assist customers in browsing
products, answering product-related questions, and helping with purchases.
 Banking:
o Chatbots help customers with balance inquiries, recent transactions, bill
payments, and loan application processes, making banking services more
accessible and efficient.
 Education:
o Chatbots are used in online courses and platforms to assist students by answering
common questions, providing study materials, and helping with exam preparation.
 Travel and Hospitality:
o Travel agencies use chatbots to assist customers with booking flights, hotels, and
even offering personalized travel suggestions based on preferences.

53. Compute the minimum edit distance in transforming the word DOG to COW using
Levenshtein distance, i.e., insertion = deletion = 1 and substitution = 2. (10 Marks)

 Levenshtein Distance Algorithm:


o We need to calculate the minimum number of insertions, deletions, or
substitutions required to transform one string into another. The insertion and
deletion cost is 1, and substitution cost is 2.

 The minimum edit distance is 4.

Steps:

1. Substitute 'D' with 'C' (cost 2).
2. Substitute 'G' with 'W' (cost 2).
3. 'O' matches in both words, so it contributes no cost.

The total minimum edit distance is 2 + 2 = 4. (Replacing a substitution by a deletion plus an
insertion also costs 1 + 1 = 2, so no cheaper transformation exists.)
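The same dynamic-programming recurrence with these weights confirms the result; a short illustrative sketch in plain Python:

```python
# Sketch: edit distance with insertion = deletion = 1 and substitution = 2.
def edit_distance(a, b, ins=1, dele=1, sub=2):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i * dele
    for j in range(n + 1):
        dp[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j] + dele,       # deletion
                           dp[i][j - 1] + ins,        # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[m][n]

print(edit_distance("DOG", "COW"))  # 4
```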

54. What are word embeddings in NLP and how can they be used in various NLP
applications? (10 Marks)

 Word Embeddings:
o Word embeddings are a type of word representation that allows words to be
represented as vectors in a continuous vector space. These vectors capture the
semantic meaning of the words, with similar words having similar vector
representations.
 How they work:
o Embeddings are typically learned using unsupervised learning techniques such as
Word2Vec, GloVe, or fastText. These models learn the relationship between
words based on their context within large corpora.
 Applications:
o Semantic Similarity: Word embeddings enable the calculation of word similarity
by comparing the cosine similarity between word vectors. For instance, "king"
and "queen" will be closer in the vector space than "king" and "car".
o Machine Translation: Embeddings improve translation by capturing the meaning
of words in different languages, helping to map semantically similar words across
languages.
o Text Classification: Word embeddings improve text classification tasks by
capturing the semantic meaning of words, enhancing the model's ability to
classify text into categories like sentiment or topic.
o Named Entity Recognition (NER): Word embeddings help recognize entities
(e.g., person names, locations) in text by understanding their context and
relationships with other words.
o Chatbots and Question Answering: Embeddings improve the chatbot’s
understanding of user input, helping the system generate more accurate and
relevant responses

55. Do you believe there are any distinctions between prediction and classification?
Illustrate with an example. (5 Marks)

 Prediction:
o Prediction is the process of estimating a continuous output variable based on input
data. The output can take any real value within a range.
o Example: Predicting the temperature tomorrow based on historical weather data.
The output could be a temperature value, say 30°C.
 Classification:
o Classification, on the other hand, involves categorizing the data into discrete
classes or labels. The output variable is a class or category rather than a
continuous value.
o Example: Classifying whether an email is spam or not based on certain features.
The output will be either "spam" or "not spam" (binary classification).

In summary, prediction focuses on continuous outputs, while classification focuses on discrete

outputs.
56. How do lexical resources like WordNet contribute to lexical semantics in NLP? How
does lexical ambiguity impact NLP tasks such as machine translation or sentiment
analysis? (5 Marks)

 WordNet and Lexical Semantics:


o WordNet is a lexical database that groups words into sets of synonyms called
synsets. It defines relationships between these synsets, such as hypernyms (more
general terms) and hyponyms (more specific terms).
o This helps in lexical semantics, which involves understanding the meanings and
relationships of words in a language. WordNet provides valuable semantic
information that can enhance tasks like word sense disambiguation, synonym
detection, and understanding context in NLP applications.
 Lexical Ambiguity:
o Lexical ambiguity occurs when a word has multiple meanings. For instance,
"bank" can refer to a financial institution or the side of a river.
o This ambiguity can impact tasks like machine translation (MT), where the
incorrect sense of a word could lead to incorrect translations. In sentiment
analysis, ambiguity in words like "great" (which could mean "good" or
"sarcastic") may lead to misclassification of sentiment.
o Solution: Techniques like word sense disambiguation (WSD) and context-based
analysis help resolve lexical ambiguity.

57. Analyze the purpose of topic modeling in text analysis. (5 Marks)

 Purpose of Topic Modeling:


o Topic modeling is a technique used to identify abstract topics or themes in a
collection of text documents. It helps in understanding large sets of unstructured
data by grouping words that frequently occur together into topics.
o Applications:
 Document Clustering: Automatically grouping similar documents, useful
for organizing large collections of articles, research papers, or news
stories.
 Content Summarization: Helping summarize documents by identifying
the main topics and presenting them to the user.
 Recommendation Systems: By understanding the topics within
documents, topic modeling can be used to recommend related documents
based on shared themes.
o Example: Latent Dirichlet Allocation (LDA) is a popular algorithm for topic
modeling that can help identify themes like "sports," "politics," and "technology"
in news articles.
58. Given the following dataset, classify whether a new email is spam or not using Naïve
Bayes. (10 Marks)

Given email: "Offer = Yes, Win = Yes, Money = Yes" (to classify)

Step-by-Step Naive Bayes Calculation:

1. Calculate Prior Probabilities:


o P(Spam = 1) = 2/5
o P(Spam = 0) = 3/5
2. Calculate Likelihoods (conditional probabilities):
o P(Offer = Yes | Spam = 1) = 2/2 (since both spam emails contain "Offer")
o P(Win = Yes | Spam = 1) = 1/2 (only one spam email contains "Win")
o P(Money = Yes | Spam = 1) = 1/2 (only one spam email contains "Money")
o P(Offer = Yes | Spam = 0) = 1/3
o P(Win = Yes | Spam = 0) = 2/3
o P(Money = Yes | Spam = 0) = 1/3
3. Apply Naive Bayes Formula:
o P(Spam = 1 | Offer = Yes, Win = Yes, Money = Yes) ∝ P(Offer = Yes | Spam =
1) * P(Win = Yes | Spam = 1) * P(Money = Yes | Spam = 1) * P(Spam = 1)
o P(Spam = 1 | Offer = Yes, Win = Yes, Money = Yes) ∝ (1) * (1/2) * (1/2) * (2/5)
= 0.1
o P(Spam = 0 | Offer = Yes, Win = Yes, Money = Yes) ∝ P(Offer = Yes | Spam =
0) * P(Win = Yes | Spam = 0) * P(Money = Yes | Spam = 0) * P(Spam = 0)
o P(Spam = 0 | Offer = Yes, Win = Yes, Money = Yes) ∝ (1/3) * (2/3) * (1/3) *
(3/5) = 2/45 ≈ 0.044
4. Conclusion:
o Since P(Spam = 1 | Offer = Yes, Win = Yes, Money = Yes) > P(Spam = 0), the
email is classified as Spam.
59. A company wants to classify customer feedback as "Positive" or "Negative" based on
word occurrences. (10 Marks)

Step-by-Step Naive Bayes Calculation:

1. Prior Probabilities:
o P(Pos) = 2/4
o P(Neg) = 2/4
2. Likelihoods:
o P(Good = Yes | Pos) = 1/2
o P(Fast = Yes | Pos) = 1/2
o P(Cheap = No | Pos) = 1/2
o P(Good = Yes | Neg) = 1/2
o P(Fast = Yes | Neg) = 0/2 = 0
o P(Cheap = No | Neg) = 1/2
3. Naive Bayes Calculation:
o P(Pos | "Good" = Yes, "Fast" = Yes, "Cheap" = No) ∝ P(Good = Yes | Pos) *
P(Fast = Yes | Pos) * P(Cheap = No | Pos) * P(Pos)
o P(Pos | "Good" = Yes, "Fast" = Yes, "Cheap" = No) ∝ (1/2) * (1/2) * (1/2) * (2/4)
= 1/16 ≈ 0.0625
o P(Neg | "Good" = Yes, "Fast" = Yes, "Cheap" = No) ∝ P(Good = Yes | Neg) *
P(Fast = Yes | Neg) * P(Cheap = No | Neg) * P(Neg)
o P(Neg | "Good" = Yes, "Fast" = Yes, "Cheap" = No) ∝ (1/2) * (0) * (1/2) * (2/4)
=0
4. Conclusion:
o Since P(Pos) > P(Neg), the feedback is classified as Positive.
60. A weather dataset is given for predicting whether a person will play tennis. (10 Marks)

Given weather conditions:

 Outlook = Rain, Temperature = Mild, Humidity = High, Wind = Strong.

Step-by-Step Naive Bayes Calculation:

1. Prior Probabilities:
o P(Play Tennis = Yes) = 3/6
o P(Play Tennis = No) = 3/6
2. Likelihoods:
o P(Outlook = Rain | Yes) = 2/3
o P(Temperature = Mild | Yes) = 1/3
o P(Humidity = High | Yes) = 1/3
o P(Wind = Strong | Yes) = 1/3
o P(Outlook = Rain | No) = 1/3
o P(Temperature = Mild | No) = 1/3
o P(Humidity = High | No) = 2/3
o P(Wind = Strong | No) = 1/3
3. Naive Bayes Calculation:
o P(Play Tennis = Yes | Outlook = Rain, Temperature = Mild, Humidity = High,
Wind = Strong) ∝ P(Outlook = Rain | Yes) * P(Temperature = Mild | Yes) *
P(Humidity = High | Yes) * P(Wind = Strong | Yes) * P(Yes)
o P(Play Tennis = Yes | Outlook = Rain, Temperature = Mild, Humidity = High,
Wind = Strong) ∝ (2/3) * (1/3) * (1/3) * (1/3) * (3/6) = 2/162 ≈ 0.012
o P(Play Tennis = No | Outlook = Rain, Temperature = Mild, Humidity = High,
Wind = Strong) ∝ P(Outlook = Rain | No) * P(Temperature = Mild | No) *
P(Humidity = High | No) * P(Wind = Strong | No) * P(No)
o P(Play Tennis = No | Outlook = Rain, Temperature = Mild, Humidity = High,
Wind = Strong) ∝ (1/3) * (1/3) * (2/3) * (1/3) * (3/6) = 2/162 ≈ 0.012
4. Conclusion:
o Since P(Play Tennis = Yes) = P(Play Tennis = No), the model cannot confidently
predict. Based on the data, it can go either way.

Module 4 

1. Examine the broad classification of Recommendation systems. (5 Marks)

 Recommendation Systems can be broadly classified into three types:


1. Content-Based Filtering:
 Recommends items based on the features or attributes of the items and the
user’s preferences.
 Example: If a user likes movies of a particular genre (e.g., action), the
system recommends more movies in that genre.
2. Collaborative Filtering:
 Based on the user-item interactions, this method recommends items based
on the preferences of similar users.
 Example: Users who liked the same movies or products as you are used to
recommend new items.
3. Hybrid Systems:
 Combines content-based filtering and collaborative filtering to improve
recommendation accuracy and overcome the limitations of each approach.

2. Explain the working principle of Content Based recommendation system. (5 Marks)

 Content-Based Filtering works by recommending items similar to the ones the user has
interacted with based on the item's features or content:
1. Feature Extraction: The system identifies key features of the items, such as
genre, keywords, or price.
2. User Profile Creation: It builds a profile for each user, capturing their
preferences based on the items they've liked or interacted with.
3. Similarity Calculation: The system computes the similarity between items using
measures like cosine similarity or Euclidean distance.
4. Recommendation Generation: Items with the highest similarity to the user’s
profile are recommended.
o Example: If a user likes action movies, the system will recommend more action
movies using genre as the key feature.
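A hedged sketch of this idea using TF-IDF over item descriptions and cosine similarity (scikit-learn assumed; the item descriptions are invented):

```python
# Sketch: recommend items whose descriptions are most similar to what the user liked.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "Movie A": "action thriller explosions chase",
    "Movie B": "action adventure hero chase",
    "Movie C": "romantic comedy love story",
}
titles = list(items)
X = TfidfVectorizer().fit_transform(items.values())

sim = cosine_similarity(X)            # item-to-item similarity from the descriptions
liked = titles.index("Movie A")       # suppose the user liked Movie A

ranked = sorted(zip(titles, sim[liked]), key=lambda t: -t[1])
print(ranked)  # Movie A itself first (similarity 1.0), then Movie B, then Movie C
```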

3. Explain the working principle of collaborative filtering system. (5 Marks)

 Collaborative Filtering makes recommendations based on user-item interactions or

preferences:
1. User-Item Interaction Matrix: The system constructs a matrix showing which
items a user has interacted with or rated.
2. Similarity Calculation: It computes similarity between users (user-based) or
items (item-based) using techniques like cosine similarity or Pearson correlation.
3. Prediction: Based on the similarity between users or items, the system predicts
how much a user will like an unseen item.
4. Recommendation Generation: The system recommends items that similar users
have rated highly but that the user has not interacted with yet.
o Example: If User A likes movies X and Y, and User B likes movies Y and Z, the
system will recommend movie Z to User A.
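A hedged sketch of user-based collaborative filtering on an invented rating matrix (NumPy and scikit-learn assumed):

```python
# Sketch: predict how User A would rate unseen items from similar users' ratings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not rated yet" (toy data).
ratings = np.array([
    [5, 4, 0, 1],   # User A
    [4, 5, 4, 1],   # User B
    [1, 1, 0, 5],   # User C
])

user_sim = cosine_similarity(ratings)   # similarity between users' rating vectors
target = 0                              # recommend for User A
weights = user_sim[target]

# Predicted score per item = similarity-weighted average of all users' ratings.
pred = weights @ ratings / weights.sum()

unseen = ratings[target] == 0           # only item index 2 is unrated by User A
print(pred[unseen])                     # its predicted score, the recommendation candidate
```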

4. Define the evaluation metrics of the recommendation system. (2 Marks)

 Evaluation Metrics for recommendation systems include:


1. Precision: Measures the proportion of recommended items that are relevant to the
user.
 Precision = (Number of relevant items recommended) / (Total items
recommended)
2. Recall: Measures the proportion of relevant items that were actually
recommended.
 Recall = (Number of relevant items recommended) / (Total relevant items)
3. F1-Score: A balance between precision and recall.
 F1 = 2 * (Precision * Recall) / (Precision + Recall)
4. Mean Absolute Error (MAE): Measures the average of absolute differences
between predicted ratings and actual ratings.
5. Root Mean Square Error (RMSE): Measures the square root of the average of
squared differences between predicted and actual ratings.

5. Give the definition of Hybrid recommendation systems. (2 Marks)

 Hybrid Recommendation Systems combine multiple recommendation techniques

(content-based filtering, collaborative filtering, and others) to improve the accuracy and
diversity of recommendations. By merging different models, hybrid systems aim to
leverage the strengths of each approach while minimizing their weaknesses.
o Example: Netflix uses a hybrid system combining both collaborative filtering and
content-based filtering to suggest movies.

6. What are Conversational Agents? (2 Marks)

 Conversational Agents are AI systems designed to simulate human-like conversation

with users through text or voice. They can understand and respond to natural language
queries and are used for various applications, such as virtual assistants and customer
service bots.
o Example: Siri, Alexa, or Google Assistant are conversational agents.

7. What is text Summarization? (2 Marks)

 Text Summarization is the process of reducing a long piece of text to a shorter version,
capturing the key points and important information. It can be done in two ways:
1. Extractive Summarization: Selects important sentences or segments directly
from the original text.
2. Abstractive Summarization: Generates new sentences that capture the meaning
of the original text, often using NLP techniques.

8. Explain Item-Based Collaborative Filtering. (5 Marks)

 Item-Based Collaborative Filtering is a variant of collaborative filtering where the

recommendation is based on the similarity between items rather than users:
1. Create Item-Item Similarity Matrix: Calculate similarity scores between items
based on user interactions (e.g., users who liked item A also liked item B).
2. Prediction: For a given user, the system recommends items similar to the ones
they have already interacted with.
3. Recommendation: Items with high similarity scores to the user’s liked items are
recommended.
o Example: If a user likes "The Dark Knight," the system may recommend similar
movies like "Inception."

9. State some applications of topic modelling. (2 Marks)

 Applications of Topic Modelling:


1. Content Organization: Automatically categorizing large sets of documents into
topics for easier browsing.
2. Content Recommendation: Recommending articles, papers, or books based on
the topics a user has shown interest in.
3. Trend Analysis: Identifying emerging topics and trends by analyzing large-scale
text data.
4. Summarization: Generating summaries based on the dominant topics of a
document collection.

10. What’s the need for text summarization? (2 Marks)

 Need for Text Summarization:


1. Time-Saving: Summarization helps users quickly understand large volumes of
information without reading everything.
2. Improved Information Retrieval: Summarized content can help in better search
results and indexing.
3. Efficient Decision Making: Summarized information aids in quicker decision-
making in various sectors like business, research, and news.
4. Content Extraction: Extracts only the relevant information from large text
bodies, reducing cognitive overload.

11. A Chatbot is known as a Conversational agent- Explain. (5 Marks)

 A Chatbot is a type of Conversational Agent that interacts with users through text or
voice-based communication. It is designed to simulate human conversations, answering
queries or performing tasks autonomously.
1. Types:
 Rule-Based Chatbots: Use predefined rules for responses.
 AI-Based Chatbots: Use machine learning and natural language
processing to understand and respond to complex queries.
2. Applications:
 Customer service, virtual assistants, e-commerce support, and
conversational interfaces for various services.
 Example: A chatbot on an e-commerce site assisting users in finding
products.

12. What is the advantage of artificial intelligence in chatbots? (5 Marks)

 Advantages of AI in Chatbots:
1. Natural Language Understanding: AI-powered chatbots can understand and
process natural language, enabling more human-like conversations.
2. Contextual Understanding: AI chatbots can remember previous interactions and
provide contextually relevant responses, making them more personalized.
3. 24/7 Availability: AI chatbots can operate round-the-clock, providing instant
support and responses at any time.
4. Scalability: AI chatbots can handle a large volume of interactions simultaneously,
making them scalable for businesses with high customer interaction volumes.
5. Learning and Improvement: AI chatbots improve over time as they learn from
user interactions, refining their responses for better accuracy.

13. State the concept of Retrieval-based model. (2 Marks)

 Retrieval-Based Model:
o A Retrieval-Based Model is a type of chatbot or conversational agent that
provides responses by retrieving the most appropriate answer from a predefined
set of responses or templates. It does not generate new content but selects the
most relevant response from a database or a knowledge base.
o Example: When asked about business hours, the chatbot retrieves the exact
response from a set of pre-stored information.
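A minimal Python sketch of a retrieval-based responder: the stored answer whose associated question is most similar (by TF-IDF cosine similarity) to the user query is returned. The FAQ entries below are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Predefined question -> response pairs (the "knowledge base")
faq = {
    "what are your business hours": "We are open 9 AM to 5 PM, Monday to Friday.",
    "how do I reset my password": "Click 'Forgot password' on the login page.",
    "where is your office located": "Our office is in Kolkata.",
}
questions = list(faq)

vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform(questions)

def respond(query):
    # Pick the stored question most similar to the query and return its answer
    sims = cosine_similarity(vectorizer.transform([query]), question_matrix).ravel()
    return faq[questions[sims.argmax()]]

print(respond("tell me your business hours"))  # -> the stored business-hours reply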

14. Define Question answering system. (2 Marks)

 A Question Answering System (QA System) is an information retrieval system
designed to answer questions posed by users in natural language. It uses techniques from
natural language processing (NLP), information retrieval, and machine learning to extract
answers from a knowledge base or database.
o Example: A search engine that provides direct answers to queries like "What is
the capital of France?"

15. Give some examples of Question answering system. (2 Marks)

 Examples of Question Answering Systems:
1. Google Search: Provides answers directly at the top of the results for certain
types of queries.
2. Siri, Alexa, Google Assistant: Virtual assistants that answer questions verbally
based on voice queries.
3. IBM Watson: A sophisticated QA system used for healthcare, legal, and other
domains that can answer complex questions based on structured data.
4. StackExchange: Provides user-generated answers to technical and community-
driven questions.
16. What is User-Based Collaborative Filtering? (2 Marks)

 User-Based Collaborative Filtering:
o This technique recommends items to a user by identifying other similar users
based on their interaction history or ratings. The system recommends items that
other similar users have liked.
o Example: If User A and User B have rated movies similarly, the system will
recommend movies that User B liked but User A hasn't seen yet.

17. State the concept of Information retrieval (IR) based question answering. (2 Marks)

 Information Retrieval (IR)-Based Question Answering:
o An IR-based QA system searches for relevant documents or text from a large
corpus based on the input query. The system then extracts the answer from the
retrieved document.
o It involves using search engines, text indexing, and ranking methods to find
relevant information and provide a response.
o Example: A system that pulls relevant information from articles to answer
questions like "What are the symptoms of COVID-19?"

18. Compare Information Retrieval and Web Search. (5 Marks)

 Information Retrieval vs. Web Search:
1. Information Retrieval (IR):
 IR focuses on finding relevant documents from a large corpus of data
based on a user’s query.
 It retrieves documents and provides a ranked list based on relevance.
 Examples: Academic research databases, digital libraries.
2. Web Search:
 Web search is a specific application of IR focused on retrieving relevant
web pages from the internet.
 It indexes web pages and provides results based on relevance and ranking
algorithms.
 Examples: Google, Bing, Yahoo.

Key Differences:

o Scope: Web search focuses on the internet, while IR can be applied to any
document-based data.
o Ranking Algorithms: Web search engines use complex algorithms like
PageRank to rank results, while IR typically uses relevance measures like cosine
similarity or TF-IDF.

19. Define Recommendation based on User Ratings using appropriate example. (5 Marks)

 Recommendation Based on User Ratings:
o This approach suggests items to users based on the ratings given by other users. It
uses collaborative filtering to recommend items that users with similar
preferences have rated highly.
o Example: In a movie recommendation system like Netflix, if a user rates "The
Dark Knight" highly, the system might recommend other movies with similar
ratings by other users, like "Inception" or "Interstellar."

20. Describe sentiment analysis with an example. (5 Marks)

 Sentiment Analysis is the process of determining the emotional tone or sentiment behind
a piece of text. It is often used in analyzing customer feedback, social media posts, or
product reviews.
1. Positive Sentiment: Indicates approval or satisfaction.
2. Negative Sentiment: Indicates disapproval or dissatisfaction.
3. Neutral Sentiment: Indicates a lack of strong opinion.

Example:

o Positive: "I love this product, it works perfectly!"
o Negative: "The product stopped working after a week, very disappointed."
o Neutral: "The product arrived on time."

21. Explain the different types of recommendation systems. (5 Marks)

 Types of Recommendation Systems:
1. Content-Based Recommendation:
 Recommends items based on the attributes or features of the items and the
user’s preferences.
 Example: Recommending books in a specific genre based on the user’s
past reading history.
2. Collaborative Filtering:
 Uses the preferences or behavior of other users to make recommendations.
 Example: Suggesting movies based on what other users with similar tastes
have watched.
3. Hybrid Systems:
 Combines both content-based and collaborative filtering methods for
better recommendations.
 Example: Netflix’s recommendation system uses both content and
collaborative-based methods.
4. Knowledge-Based Systems:
 Uses domain knowledge and specific rules to recommend items.
 Example: A travel website recommending vacation destinations based on
a user’s past preferences and budget.

22. Explain the concept of the Recommendation System with real-life examples. (5 Marks)

 Concept of Recommendation System:
A recommendation system predicts and suggests items or content to users based on their
preferences, behavior, or demographic information. These systems utilize algorithms and
data analysis to provide personalized suggestions.

Real-life Examples:

1. Netflix: Recommends movies and TV shows based on the user’s viewing history
and ratings, using collaborative filtering and content-based methods.
2. Amazon: Recommends products based on the items you have previously
purchased or viewed, as well as what other customers with similar behaviors have
bought.
3. Spotify: Suggests music based on the user’s listening patterns, favorite artists, or
genres, combining collaborative filtering and content-based recommendations.

23. Illustrate two kinds of conversational agents. (5 Marks)

 Two Kinds of Conversational Agents:
1. Rule-Based Chatbots:
 Operates based on a predefined set of rules and keywords. The chatbot
selects responses based on the specific input it receives.
 Example: A customer service bot that responds with fixed replies like
"What is your query?" or "Our office hours are 9 AM to 5 PM."
2. AI-Based Chatbots:
 Uses machine learning and natural language processing (NLP) to
understand context, learn from interactions, and generate personalized
responses.
 Example: Virtual assistants like Siri or Google Assistant, which can
process complex queries and provide dynamic, context-aware responses.
24. Explain Collaborative Recommendation System with example. (5 Marks)

 Collaborative Recommendation System:
o This system recommends items to users based on the preferences of other users. It
assumes that users who have agreed in the past will agree in the future about
items.

Example:

o Movie Recommendations: If User A and User B both liked the same movies, and
User A also liked a movie that User B hasn’t seen, the system will recommend
that movie to User B.
o Amazon’s Product Recommendations: If users who bought a specific book also
bought another one, the system will recommend the second book to others who
bought the first.

25. Describe the most common use-cases of sentiment analysis. (5 Marks)

 Common Use-Cases of Sentiment Analysis:
1. Customer Feedback: Analyzing product reviews and feedback to gauge
customer satisfaction and identify areas for improvement.
2. Social Media Monitoring: Analyzing tweets, posts, and comments to assess
public sentiment toward a brand, event, or topic.
3. Market Research: Understanding consumer opinions and trends to guide product
development and marketing strategies.
4. Political Sentiment Analysis: Analyzing public sentiment regarding political
candidates or policies, often used during elections.
5. Brand Monitoring: Monitoring the sentiment surrounding a company’s products
or services to improve brand perception.

26. What are steps involved in Latent Dirichlet Allocation (LDA)? (5 Marks)

 Steps Involved in LDA:
1. Initialization: Each word in the corpus is randomly assigned to one of the topics.
2. E-Step (Expectation): For each word, the algorithm estimates the probability of
the word being assigned to each topic based on the words in the document and the
words assigned to each topic.
3. M-Step (Maximization): The model re-estimates the topic distribution for each
document and the word distribution for each topic.
4. Iteration: The algorithm repeats steps 2 and 3, refining the topic-word and
document-topic distributions until convergence is reached.
5. Output: After convergence, the algorithm outputs the topics and their associated
words, which represent the underlying topics in the document corpus.
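A minimal sketch of these steps in Python using scikit-learn's LatentDirichletAllocation; the four toy documents and the choice of two topics are assumptions made only for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match after a late goal",
    "the striker scored twice in the final game",
    "the election results were announced by the commission",
    "the new policy was debated in parliament today",
]

# LDA works on word counts (bag of words), not TF-IDF
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words of each discovered topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"Topic {k}: {top_words}")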

27. Describe Twitter sentiment analysis. (5 Marks)

 Twitter Sentiment Analysis:
o Concept: Twitter sentiment analysis involves analyzing tweets to determine the
sentiment (positive, negative, neutral) behind the text.
o Process:
1. Data Collection: Collect tweets using Twitter’s API or web scraping.
2. Preprocessing: Clean the tweets by removing stop words, punctuation,
and irrelevant text.
3. Sentiment Classification: Use a model (e.g., Naive Bayes, SVM, or deep
learning) to classify the sentiment of each tweet.
4. Analysis: Aggregate results to understand overall public sentiment about a
topic, product, or event.

Example: Analyzing tweets about a new product launch to assess public opinion and
consumer reactions.

28. Define the Chatbot Architectures. (5 Marks)

 Chatbot Architectures:
1. Rule-Based Architecture:
 Relies on predefined responses and rules based on keyword matching.
 Suitable for basic queries but lacks the ability to handle complex or
ambiguous inputs.
2. Retrieval-Based Architecture:
 Selects the most relevant response from a set of pre-stored responses
based on the user’s input.
 Does not generate new responses but relies on existing ones in the
database.
3. Generative Architecture:
 Uses machine learning models (e.g., neural networks) to generate dynamic
responses.
 Can handle more open-ended conversations and learn from user
interactions.
4. Hybrid Architecture:
 Combines both rule-based and generative approaches to offer a more
robust chatbot experience, enabling better handling of diverse queries.
29. Illustrate Multi-document summarization. (5 Marks)

 Multi-document Summarization:
o Concept: In multi-document summarization, the goal is to create a single, concise
summary that represents the most important information from multiple
documents.
o Process:
1. Document Collection: Collect a set of related documents on a specific
topic.
2. Text Preprocessing: Clean the documents by removing stop words,
stemming, and other irrelevant elements.
3. Text Segmentation and Clustering: Divide the documents into segments
or clusters based on similarity.
4. Summarization: Extract key sentences or information from each
document and combine them into a coherent summary.

Example: Summarizing multiple news articles about the same event into a single,
concise summary.

30. Define topic modeling. (5 Marks)

 Topic Modeling:
o Concept: Topic modeling is a technique used in natural language processing
(NLP) to discover the hidden thematic structure in a large collection of text. It
helps in identifying topics or themes that occur across the documents.
o Common Methods:
1. Latent Dirichlet Allocation (LDA): A probabilistic model that assumes
each document is a mixture of topics, and each word in the document is
attributable to one of the document's topics.
2. Non-Negative Matrix Factorization (NMF): Factorizes the document-
term matrix into two lower-dimensional matrices, one for the topic
distribution and one for word distribution.

Example: Using topic modeling to categorize articles in a news dataset into topics like
politics, sports, entertainment, etc.

31. Describe Extraction-based summarization. (5 Marks)

 Extraction-Based Summarization:
o Concept: This technique involves selecting important sentences or segments from
the original text and combining them to create a summary. It extracts the most
relevant parts without altering or generating new content.
o Process:
1. Text Preprocessing: Clean and preprocess the text, removing unnecessary
elements like stop words.
2. Sentence Ranking: Use algorithms to rank sentences based on their
relevance to the main topic (e.g., TF-IDF, TextRank).
3. Extraction: Select the top-ranked sentences or segments and combine
them to form a coherent summary.

Example: A news article about a political event may extract key sentences that highlight
the main points such as the event's location, time, and major developments.

32. State the differences between Extraction-based summarization and Abstraction-based
summarization. (5 Marks)

 Differences between Extraction-based and Abstraction-based Summarization:
1. Method:
 Extraction-based: Selects and reuses sentences directly from the original
text.
 Abstraction-based: Generates new sentences, paraphrasing or
summarizing the original text.
2. Approach:
 Extraction-based: Simpler, relies on selecting important information
directly.
 Abstraction-based: More complex, requires understanding of the content
to generate new sentences.
3. Output:
 Extraction-based: The summary consists of parts of the original text.
 Abstraction-based: The summary may contain newly generated sentences
with paraphrased content.
4. Complexity:
 Extraction-based: Easier to implement but can lack fluency.
 Abstraction-based: More sophisticated and generates more human-like
summaries.

33. Classify Recommendation techniques with examples. (10 Marks)

 Classification of Recommendation Techniques:
1. Collaborative Filtering:
 Concept: Recommends items based on the preferences of other similar
users.
 Example: Movie recommendations on Netflix based on users with similar
tastes.
2. Content-Based Filtering:
 Concept: Recommends items based on the features of the item and a
user’s past behavior.
 Example: Recommending books on Amazon based on the user’s
previously purchased genres.
3. Hybrid Recommendation Systems:
 Concept: Combines both collaborative and content-based methods to
improve recommendations.
 Example: YouTube recommends videos based on both your viewing
history and the preferences of similar users.
4. Knowledge-Based Recommendations:
 Concept: Recommendations based on explicit knowledge, such as
preferences or requirements.
 Example: A recommendation system for real estate that suggests houses
based on a user’s budget, preferred location, and family size.
5. Demographic-Based Recommendations:
 Concept: Recommends items based on demographic information such as
age, gender, or location.
 Example: A clothing store recommends outfits based on the user's gender
and age group.

34. Illustrate different Summarization techniques. (10 Marks)

 Different Summarization Techniques:
1. Extractive Summarization:
 Concept: Involves extracting key sentences or phrases directly from the
original text and combining them into a coherent summary.
 Example: Extracting sentences from a news article to create a summary.
2. Abstractive Summarization:
 Concept: Generates new sentences to convey the most important
information, paraphrasing the original text.
 Example: Using a model to rewrite a news article in a shorter form,
maintaining the key details.
3. Single-Document Summarization:
 Concept: Summarizes information from a single document.
 Example: Summarizing a research paper into a short abstract.
4. Multi-Document Summarization:
 Concept: Combines relevant information from multiple documents into a
single concise summary.
 Example: Summarizing a set of news articles covering the same event
into one summary.
5. Topic-Based Summarization:
 Concept: Summarizes content based on specific topics or themes within a
document.
 Example: A summary of a research paper that highlights the key topics
discussed, like methods, results, and conclusions.

35. Explain the Use-Cases of the Recommendation System. (10 Marks)

 Use-Cases of Recommendation Systems:
1. E-Commerce:
 Use: Suggesting products based on past purchases or browsing history.
 Example: Amazon recommends products like accessories or
complementary items to customers.
2. Music and Video Streaming:
 Use: Recommending content based on user preferences or behavior.
 Example: Spotify and Netflix recommend songs, albums, or movies based
on your listening and watching patterns.
3. Social Media:
 Use: Suggesting friends, posts, or pages based on user activity or
connections.
 Example: Facebook suggests friends or pages based on mutual
connections.
4. Online News:
 Use: Recommending articles or news based on user interests and browsing
history.
 Example: Google News recommends articles based on your reading
habits.
5. Education Platforms:
 Use: Recommending courses or learning materials based on the user’s
interests or past courses.
 Example: Coursera or Udemy recommend courses based on user
preferences and learning history.

36. Differentiate collaborative filtering and content-based systems. (10 Marks)

 Difference between Collaborative Filtering and Content-Based Systems:
1. Basis of Recommendation:
 Collaborative Filtering: Recommends items based on the behavior and
preferences of similar users.
 Content-Based Filtering: Recommends items based on the features of the
items and the user’s past behavior.
2. Data Dependency:
 Collaborative Filtering: Relies on user-item interaction data, such as
ratings, purchases, or views.
 Content-Based Filtering: Relies on item features (such as genre,
description, or specifications) and user preferences.
3. Cold Start Problem:
 Collaborative Filtering: Faces the cold start problem when there is
insufficient user interaction data (new users/items).
 Content-Based Filtering: Does not suffer as much from the cold start
problem, as long as item descriptions are available.
4. Examples:
 Collaborative Filtering: Movie recommendations on Netflix based on
similar user preferences.
 Content-Based Filtering: Product recommendations on Amazon based on
past purchases.

37. Define the steps of sentiment analysis. (10 Marks)

 Steps of Sentiment Analysis:
1. Data Collection:
 Collect the text data from sources like reviews, social media posts, or
news articles.
 Example: Gather product reviews or tweets about a brand.
2. Text Preprocessing:
 Clean and prepare the data by removing unnecessary elements like stop
words, punctuation, and special characters.
 Example: Removing “the,” “is,” and “@” symbols from a tweet.
3. Tokenization:
 Split the text into smaller units, such as words or phrases, to better
understand the structure.
 Example: The sentence "I love this phone!" is split into ["I", "love", "this",
"phone"].
4. Feature Extraction:
 Convert the text data into numerical or categorical features using methods
like TF-IDF, word embeddings, or sentiment lexicons.
 Example: Convert words into vectors using Word2Vec or TF-IDF.
5. Sentiment Classification:
 Apply a classification algorithm (e.g., Naive Bayes, SVM, deep learning)
to classify the sentiment into categories like positive, negative, or neutral.
 Example: Using a Naive Bayes classifier to categorize a review as
“positive” or “negative.”
6. Evaluation:
 Evaluate the accuracy of the sentiment analysis model using metrics such
as precision, recall, and F1-score.
 Example: Checking how accurately the model classifies sentiment in test
data.
7. Interpretation:
 Interpret the results and understand the implications of the sentiment
analysis.
 Example: Analyzing the customer sentiment toward a product to
determine its market success.
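A minimal end-to-end sketch of these steps in Python with scikit-learn; the tiny labelled dataset below is invented, and a real system would need far more training data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Steps 1-2: a tiny, already-cleaned labelled dataset (invented)
texts = ["I love this product", "Terrible, it broke after a week",
         "Absolutely fantastic quality", "Very disappointed with the service"]
labels = ["positive", "negative", "positive", "negative"]

# Steps 3-5: TF-IDF handles tokenization and feature extraction, Naive Bayes classifies
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

# Steps 6-7: apply the model to new text and interpret the output
print(model.predict(["The product works perfectly, I love it"]))  # expected: ['positive']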

38. What is LDA and how is it different from others? (10 Marks)

 LDA (Latent Dirichlet Allocation):
o Concept: LDA is a generative probabilistic model used for topic modeling. It
assumes that each document is a mixture of topics, and each topic is a mixture of
words. LDA is used to discover hidden thematic structure in large collections of
text.
o How LDA Works:
1. Document Representation: Each document is represented as a
probability distribution over topics.
2. Topic Distribution: Topics are represented as distributions over words.
3. Process:
 LDA assigns words in each document to one of the topics based on
the probability distribution.
 The model infers the hidden topics based on word co-occurrence
patterns across documents.
o Differences from Other Models:
1. LDA vs. TF-IDF:
 LDA: Finds latent topics within the text.
 TF-IDF: Measures the importance of a word in relation to a
document in a collection.
2. LDA vs. NMF (Non-negative Matrix Factorization):
 LDA: Uses a probabilistic approach to assign words to topics.
 NMF: Uses a matrix factorization approach to decompose the
document-term matrix.
3. LDA vs. Word2Vec:
 LDA: Focuses on discovering topics that summarize a collection
of text.
 Word2Vec: Focuses on learning word representations by
capturing semantic similarities between words.

39. With example illustrate Abstractive summarization. (10 Marks)

 Abstractive Summarization:
o Concept: Abstractive summarization involves generating new sentences that
convey the most important information from the original text. Unlike extractive
summarization, which directly selects parts of the text, abstractive summarization
rewrites the content.
o Example:
 Original Text: "The Eiffel Tower is one of the most famous landmarks in
Paris, France. It was designed by engineer Gustave Eiffel and completed
in 1889. The tower stands at a height of 324 meters and attracts millions of
visitors every year."
 Abstractive Summary: "The Eiffel Tower in Paris, designed by Gustave
Eiffel in 1889, is a popular tourist attraction, standing 324 meters tall."
o Methods:
 Seq2Seq Models: Use sequence-to-sequence models, often based on
LSTM or Transformer networks, to generate summaries.
 GPT-based Models: Pre-trained models like GPT-3 can be used to
generate abstractive summaries by learning the language and context.
o Challenges:
 Maintaining coherence and meaning while generating new sentences.
 Handling long documents where the summary should still capture the
essence.
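A minimal sketch of abstractive summarization using the Hugging Face transformers pipeline, which wraps a pre-trained seq2seq model (the default model weights are downloaded on first use); the input is the Eiffel Tower example above.

from transformers import pipeline

# Loads a default pre-trained summarization model (Transformer-based seq2seq)
summarizer = pipeline("summarization")

text = ("The Eiffel Tower is one of the most famous landmarks in Paris, France. "
        "It was designed by engineer Gustave Eiffel and completed in 1889. "
        "The tower stands at a height of 324 meters and attracts millions of "
        "visitors every year.")

result = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])  # a newly generated, paraphrased summary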

41. Illustrate the advantages and disadvantages of a Content-based and collaborative
filtering recommendation system. (10 Marks)

 Content-based Filtering:
o Advantages:
1. No Need for Other Users' Data: Relies only on item features and the
individual user's own profile, so it can work without interaction data
from a large user base.
2. Transparency: The rationale for recommendations is based on item
features, making it easier to explain why an item was recommended.
3. Personalization: Can tailor recommendations based on the individual’s
preferences and behaviors.
o Disadvantages:
1. Limited to User’s Past Preferences: Recommendations are often limited
to similar items, which might not help in discovering new content.
2. Cold Start Problem for Items: If there is no information about an item, it
can’t be recommended, which might limit the system's effectiveness.
3. Lack of Diversity: The system might recommend similar items too often,
leading to a lack of variety.
 Collaborative Filtering:
o Advantages:
1. No Need for Item Features: Works based on user-item interactions, so it
can recommend items even if detailed item features are not available.
2. Discover New Items: Can recommend items that a user might not have
found on their own, leading to serendipitous discoveries.
3. Widely Applicable: Works well in domains where user preferences are
critical, such as movies, music, or e-commerce.
o Disadvantages:
1. Cold Start Problem for New Users: If there is not enough data about a
user, it can be challenging to make recommendations.
2. Scalability Issues: As the number of users and items increases,
collaborative filtering models can become computationally expensive.
3. Sparsity: In large systems, many users may not have interacted with
enough items, leading to sparse user-item interaction matrices.

42. In the context of natural language processing, how can we leverage the concepts of TF-
IDF, training set, validation set, test set, and stop words to improve the accuracy and
effectiveness of machine learning models and algorithms? Additionally, what are some
potential challenges and considerations when working with these concepts, and how can we
address them? (10 Marks)

 Leverage in NLP:
1. TF-IDF (Term Frequency-Inverse Document Frequency):
 How it helps: TF-IDF helps quantify the importance of words in a
document relative to a collection (corpus). It reduces the influence of
common words (e.g., "the," "and") while highlighting unique terms
relevant to the context, aiding in text classification, clustering, and other
NLP tasks.
 Application: Use TF-IDF for feature extraction in tasks like document
classification, topic modeling, and search engines.
2. Training, Validation, and Test Sets:
 Training Set: Used to train the machine learning model, adjusting its
parameters based on the data.
 Validation Set: Helps tune the hyperparameters and select the best model
by evaluating performance on unseen data.
 Test Set: Evaluates the final model’s performance to assess its
generalization ability.
 How they help: These sets help prevent overfitting, ensuring the model is
trained well and can generalize to new, unseen data.
3. Stop Words:
 How it helps: Stop words (like "and," "is," "in") carry little meaning for
certain NLP tasks, so removing them can improve the signal-to-noise ratio
and reduce computational complexity.
 Application: Stop word removal is critical in tasks like text classification,
information retrieval, and sentiment analysis.
 Challenges and Considerations:
1. Overfitting: The model may memorize the training data and fail to generalize to
new data.
 Solution: Use regularization techniques and cross-validation to detect and
mitigate overfitting.
2. Data Imbalance: If the dataset has class imbalances, the model might be biased
toward the majority class.
 Solution: Use techniques like oversampling, undersampling, or class
weighting to address this issue.
3. Stop Words in Context: In some cases, stop words carry meaning (e.g., in
sentiment analysis, "not good").
 Solution: Carefully evaluate which stop words are relevant to the task
before removal.
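A minimal sketch in Python tying these pieces together: English stop words are dropped during TF-IDF feature extraction, the vectorizer is fitted only on the training split, and separate validation and test splits are held out. The toy reviews and labels are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["great phone", "terrible battery life", "not good at all", "excellent camera",
         "awful screen", "really love it", "worst purchase ever", "very happy with it"]
labels = [1, 0, 0, 1, 0, 1, 0, 1]

# Hold out a test set first, then carve a validation set out of the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(texts, labels, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

vectorizer = TfidfVectorizer(stop_words="english")   # note: this list also drops "not"
X_train_tfidf = vectorizer.fit_transform(X_train)    # fit ONLY on training data
X_val_tfidf = vectorizer.transform(X_val)            # reuse the fitted vocabulary
print(X_train_tfidf.shape, X_val_tfidf.shape)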

43. Describe how extraction-based and abstraction-based summarizations vary from one
another. How would you go about creating an extractive summarization system? (10
Marks)

 Extraction-Based Summarization:
o Concept: In extraction-based summarization, important sentences or phrases are
extracted directly from the original document to form a summary. It involves
selecting the most significant portions without altering the original content.
o Example: Selecting key sentences from a news article and concatenating them to
form a brief summary.
 Abstraction-Based Summarization:
o Concept: In abstraction-based summarization, new sentences are generated by
paraphrasing the content of the original document. It involves rewriting and
synthesizing the key information.
o Example: Generating a short summary that conveys the same meaning of an
article but with different words or structure.
 Creating an Extractive Summarization System:
1. Text Preprocessing: Clean the data by removing stop words, punctuation, and
stemming or lemmatizing the words.
2. Feature Extraction: Convert the text into numerical features using methods like
TF-IDF or word embeddings.
3. Sentence Scoring: Use techniques like TF-IDF, PageRank (TextRank), or
clustering to score and rank sentences based on their importance.
4. Sentence Selection: Select the highest-scoring sentences to form the summary
while ensuring coherence and readability.
5. Post-processing: Optionally, reformat the sentences for smoother readability and
coherence.
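A minimal Python sketch of steps 1-4 above using TF-IDF sentence scoring; the example passage is invented, and a real system would use a proper sentence tokenizer instead of splitting on ". ".

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = ("The Eiffel Tower was completed in 1889. "
        "It stands 324 meters tall. "
        "Millions of tourists visit it every year. "
        "The weather in Paris was mild yesterday.")
sentences = [s.strip() for s in text.split(". ") if s.strip()]

# Score each sentence by the sum of the TF-IDF weights of its words
tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
scores = np.asarray(tfidf.sum(axis=1)).ravel()

# Keep the two highest-scoring sentences, in their original order
top = sorted(np.argsort(scores)[::-1][:2])
print(" ".join(sentences[i] for i in top))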

44. Explain how pretraining techniques such as GPT (Generative Pretrained Transformer)
contribute to improving natural language understanding tasks. Discuss the key components
and training objectives of GPT models. (10 Marks)

 GPT (Generative Pretrained Transformer):
o How it Improves NLP Tasks: GPT uses a transformer architecture to model
language, capturing contextual relationships between words. Pretraining on large
amounts of text allows the model to learn general language patterns, making it
highly effective for various NLP tasks, such as text generation, summarization,
and question-answering.
o Key Components:
1. Transformer Architecture: A deep learning model based on self-
attention mechanisms that can process words in parallel, rather than
sequentially, improving speed and scalability.
2. Pretraining: GPT is trained on vast amounts of unlabeled text to predict
the next word in a sequence, enabling it to learn general linguistic patterns.
3. Fine-Tuning: After pretraining, GPT is fine-tuned on task-specific
datasets, adjusting the model to better perform tasks like classification or
translation.
o Training Objectives:
1. Language Modeling: Pretraining involves predicting the next word or
token in a sequence (autoregressive modeling), allowing the model to
generate coherent text.
2. Transfer Learning: By pretraining on a broad corpus, GPT models can
be fine-tuned for specific NLP tasks using smaller task-specific datasets,
making them highly adaptable and efficient.

45. How do you use naïve bayes model for collaborative filtering? (10 Marks)

 Naïve Bayes in Collaborative Filtering:
o Concept: Collaborative filtering aims to predict user preferences based on
historical interactions. Naïve Bayes can be applied to predict whether a user will
like an item based on the ratings or interactions of similar users.
o Steps to Apply Naïve Bayes:
1. Data Representation: Represent user-item interactions as binary features
(e.g., like or dislike, 1 or 0) in a matrix form.
2. Feature Extraction: For each item, calculate the probability that a user
likes or dislikes it, based on the features (ratings or interactions) from
other users.
3. Model Training: Use the Naïve Bayes classifier to compute the
conditional probabilities of user-item pairs, assuming independence
between the items.
4. Prediction: For an unseen user-item pair, calculate the posterior
probability and predict the user's preference based on the Naïve Bayes
model.
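A minimal sketch of these steps with scikit-learn's BernoulliNB, where each row is a user's binary likes of three other items and the label says whether that user liked the target item; all of the numbers below are invented.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: liked item A / item B / item C (1 = liked, 0 = not)
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
# Label: did this user like the target item?
y = np.array([1, 1, 0, 0, 1])

model = BernoulliNB()
model.fit(X, y)

new_user = np.array([[1, 0, 1]])          # likes A and C, not B
print(model.predict(new_user))            # predicted preference for the target item
print(model.predict_proba(new_user))      # posterior probabilities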

46. Is lexical analysis different from semantic? How? (5 Marks)

 Lexical Analysis:
o Concept: Lexical analysis is the process of breaking down a sequence of
characters (text) into tokens, such as words, punctuation, and symbols, which are
the basic units for syntax analysis.
o Goal: Identifies the structure of words and symbols in a language.
 Semantic Analysis:
o Concept: Semantic analysis deals with interpreting the meanings of words,
phrases, or sentences in context. It focuses on the relationships between symbols
to derive meaning.
o Goal: Understanding the meaning behind the words and phrases in context.
 Difference:
o Lexical Analysis: Concerned with the form and structure of the language
(syntax).
o Semantic Analysis: Concerned with the meaning and interpretation of the content
(semantics).

47. How does NLP help in sentiment analysis? (2 Marks)

 Sentiment Analysis in NLP:
o NLP techniques enable sentiment analysis by processing and understanding the
emotions conveyed in a piece of text (e.g., positive, negative, or neutral).
o It uses text preprocessing, tokenization, and feature extraction (e.g., TF-IDF, word
embeddings) to analyze sentiment and classify text based on the underlying
emotions.

48. Define what N-grams are in the context of Natural Language Processing (NLP). (2
Marks)

 N-grams:
o Definition: N-grams are contiguous sequences of "n" items (words, characters, or
tokens) from a given text or speech. For example:
 Unigram (1 word): "I", "love", "machine", "learning"
 Bigram (2 words): "I love", "love machine", "machine learning"
 Trigram (3 words): "I love machine", "love machine learning"
o N-grams are used for feature extraction, language modeling, and improving text
predictions.

49. How are N-grams utilized in Natural Language Processing (NLP)? (2 Marks)

 Utilization of N-grams:
1. Feature Extraction: N-grams are used as features in machine learning models for
tasks such as text classification, sentiment analysis, and language modeling. They
capture local context by considering adjacent words.
2. Language Modeling: N-grams are employed to predict the likelihood of the next
word in a sequence by analyzing previous words in a sentence.
3. Text Generation: N-grams help in generating coherent text by predicting the next
word based on the given sequence of words.
4. Improving Accuracy: They can capture dependencies between words, helping
models understand the structure and semantics of a sentence better.
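A minimal sketch generating the unigrams, bigrams and trigrams of the example sentence above with NLTK; plain whitespace splitting stands in for a full tokenizer.

from nltk.util import ngrams

tokens = "I love machine learning".split()

print(list(ngrams(tokens, 1)))  # unigrams: ('I',), ('love',), ...
print(list(ngrams(tokens, 2)))  # bigrams: ('I', 'love'), ('love', 'machine'), ...
print(list(ngrams(tokens, 3)))  # trigrams: ('I', 'love', 'machine'), ...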

50. What is meant by data augmentation? (2 Marks)

 Data Augmentation:
o Definition: Data augmentation refers to techniques that artificially expand the
size of a training dataset by applying transformations to the existing data. These
transformations can include adding noise, changing the order of words, or
replacing words with synonyms.
o Purpose: It is mainly used to improve model generalization, prevent overfitting,
and enhance performance, especially when data is scarce.

51. How would you create a Recommender System for text inputs? (5 Marks)

 Creating a Recommender System for Text Inputs:
1. Preprocessing: Clean and tokenize the input text by removing stop words,
punctuation, and applying stemming or lemmatization.
2. Feature Extraction: Convert the text into numerical features using techniques
like TF-IDF or word embeddings (Word2Vec, GloVe).
3. Model Selection:
 Content-Based Filtering: Use cosine similarity or other distance metrics
to recommend similar items based on text content.
 Collaborative Filtering: Recommend items based on user-item
interaction patterns, where text is used to identify similarity between
items.
4. Prediction: Use the similarity scores to recommend items with the highest
similarity to the input text.
5. Evaluation: Evaluate the model using metrics like precision, recall, and F1-score
to ensure the recommendations are relevant.
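A minimal content-based sketch in Python: the item whose description has the highest TF-IDF cosine similarity to the input text is recommended. The item descriptions and query are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "Inception": "a thief steals secrets through dream sharing technology",
    "The Notebook": "a romantic story of enduring love",
    "Interstellar": "astronauts travel through space and a wormhole to save humanity",
}
query = "science fiction about space travel and other worlds"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(list(items.values()) + [query])

# Similarity of the query (last row) against every item description
sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
ranked = sorted(zip(items, sims), key=lambda pair: -pair[1])
print(ranked[0][0])  # the most similar item to the input text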

52. Discuss how the popular word embedding technique Word2Vec is implemented in the
algorithms: Continuous Bag of Words (CBOW) model, and the Skip-Gram model. (5
Marks)

 Word2Vec Models:
1. Continuous Bag of Words (CBOW):
 Concept: The CBOW model predicts the target word given a context
(surrounding words). It uses the surrounding words in a fixed window to
predict the center word.
 Example: In the sentence "The cat sat on the mat," the model will use the
words "The," "cat," "on," "the" to predict the center word "sat."
2. Skip-Gram Model:
 Concept: The Skip-Gram model does the inverse of CBOW by predicting
the context words given a target word.
 Example: Given the word "sat," the model will predict the context words
like "The," "cat," "on," "the."
3. Training Process: Both models use a neural network to map words to vectors
(embeddings), and the embeddings are updated through backpropagation based on
prediction accuracy.
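A minimal sketch with gensim, where the sg flag selects between the two training algorithms; the two toy sentences are far too small to learn meaningful vectors and are only illustrative.

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # sg=0 -> CBOW
skip_gram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-Gram

print(cbow.wv["cat"][:5])                        # first few dimensions of one embedding
print(skip_gram.wv.most_similar("cat", topn=2))  # nearest words in the toy vector space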

53. Differentiate collaborative filtering and content-based systems. (10 Marks)

 Collaborative Filtering:
o Concept: Collaborative filtering recommends items based on past interactions or
ratings of users. It assumes that if users agree on one item, they will also agree on
other items.
o Types:
 User-Based: Recommends items by finding similar users.
 Item-Based: Recommends items that are similar to items the user has
interacted with.
o Advantages:
 Needs no item metadata: recommendations come purely from user-item
interactions, so it works even for items that are hard to describe.
 Can capture complex patterns of user preferences.
o Disadvantages:
 Suffers from the cold-start problem for new users or items.
 Requires large amounts of user data.
 Content-Based Filtering:
o Concept: Content-based filtering recommends items based on the features of the
items and the user’s previous interactions. It compares item features like
keywords, categories, or attributes to make recommendations.
o Advantages:
 Works well for new users since it doesn’t require user history.
 Can be highly personalized, focusing on item attributes.
o Disadvantages:
 Limited to recommending items similar to what the user has already
interacted with.
 May lead to narrow recommendations (lack of diversity).

54. Explain the Use-Cases of the Recommendation System. (10 Marks)

 Use-Cases of Recommendation Systems:
1. E-Commerce:
 Example: Amazon’s recommendation system suggests products based on
user browsing history, similar product recommendations, and user
preferences.
2. Streaming Services:
 Example: Netflix and Spotify use recommendation systems to suggest
movies, shows, and music based on past user choices and ratings.
3. Online Shopping Platforms:
 Example: Platforms like eBay and Flipkart recommend items based on a
user’s purchasing history and behavior.
4. Social Media:
 Example: YouTube recommends videos based on users’ watching history,
liked videos, and subscriptions.
5. Job Portals:
 Example: LinkedIn and Glassdoor recommend job postings to users based
on their profile, skills, and previous searches.
6. Educational Platforms:
 Example: Coursera and Udemy recommend courses based on a learner's
preferences, past courses taken, and trending courses.

55. Analyze different chatbot architectures in NLP, such as rule-based, retrieval-based, and
generative models, assessing their effectiveness based on scalability, response quality, and
adaptability. (10 Marks)

 Different Chatbot Architectures:
1. Rule-Based Chatbots:
 Concept: These chatbots follow predefined rules and patterns to generate
responses. They use if-else conditions and predefined templates.
 Effectiveness:
 Scalability: Poor scalability as new rules need to be manually
added.
 Response Quality: Limited to predefined responses, often leading
to less natural conversations.
 Adaptability: Low adaptability, requires manual updates to handle
new scenarios.
2. Retrieval-Based Chatbots:
 Concept: These chatbots fetch the most relevant response from a
predefined set of responses based on user input.
 Effectiveness:
 Scalability: Scalable as it only requires adding more response
templates.
 Response Quality: Can produce high-quality responses if well-
designed but still limited by the predefined responses.
 Adaptability: Moderate adaptability, requires retraining for more
complex scenarios.
3. Generative Chatbots:
 Concept: These chatbots generate responses using deep learning models
like transformers (e.g., GPT). They don’t rely on predefined responses but
generate new responses based on the input.
 Effectiveness:
 Scalability: Highly scalable as they can learn from vast datasets
and generate diverse responses.
 Response Quality: Can produce high-quality, contextually
relevant responses, offering a more human-like conversation.
 Adaptability: High adaptability, as they can improve with
continuous training and are capable of handling diverse scenarios.

56. Discuss how ChatGPT utilizes large-scale pretraining and transformer-based
architectures to generate contextually relevant responses. (5 Marks)

 ChatGPT and Large-Scale Pretraining:
1. Pretraining: ChatGPT, based on OpenAI’s GPT (Generative Pretrained
Transformer) models, is pretrained on massive amounts of text data. This
pretraining phase involves learning the structure, grammar, and relationships in
the language by predicting the next word in a sentence.
2. Transformer Architecture: The core of ChatGPT is the transformer model,
which uses self-attention mechanisms to understand contextual dependencies
between words in a sentence, even over long ranges. This allows the model to
generate coherent and contextually relevant responses by focusing on key parts of
the input.
3. Contextual Understanding: By utilizing a large-scale training corpus, the model
learns to produce responses that are not just grammatically correct but
contextually aligned with the conversation.
4. Fine-tuning: After pretraining, the model is fine-tuned on specific datasets with
human feedback, improving its ability to generate more natural and meaningful
interactions.

57. Elaborate Collaborative Recommendation System with example. (5 Marks)

 Collaborative Recommendation System:
1. Concept: Collaborative filtering works by leveraging the past behaviors or
interactions of users to recommend items. It operates on the assumption that if
two users had similar preferences in the past, they are likely to have similar
preferences in the future.
2. Types:
 User-Based Collaborative Filtering: This method recommends items by
finding users with similar preferences. For example, if User A and User B
have liked similar movies, the system would recommend to User A
movies that User B has liked but they have not yet seen.
 Item-Based Collaborative Filtering: In this approach, items are
recommended based on their similarity to items the user has already liked
or interacted with.
3. Example: In a movie recommendation system, if a user likes movies like
"Inception" and "The Matrix," the system may recommend "Interstellar" based on
the preferences of other users who liked similar movies.

58. Describe five different applications of Natural Language Processing (NLP) in various
fields such as healthcare, finance, customer service, and education. (5 Marks)

 Applications of NLP:
1. Healthcare:
 Clinical Text Analysis: NLP is used to analyze medical records and
clinical notes to extract relevant information, such as diagnoses, treatment
plans, and drug prescriptions.
 Medical Chatbots: NLP-powered chatbots can assist patients by
answering medical questions and providing basic health advice.
2. Finance:
 Sentiment Analysis: NLP is used to analyze news articles, social media
posts, and financial reports to gauge market sentiment, helping investors
make informed decisions.
 Fraud Detection: NLP can be applied to identify suspicious patterns in
financial transactions by analyzing the text in communication or
transaction details.
3. Customer Service:
 Chatbots and Virtual Assistants: NLP-based chatbots help automate
customer support by understanding and responding to customer queries in
real-time.
 Email and Ticket Classification: NLP is used to categorize and prioritize
customer service emails and tickets, directing them to the appropriate
support teams.
4. Education:
 Automated Grading: NLP is used to evaluate and grade open-ended
responses or essays based on predefined rubrics.
 Language Learning: NLP-powered tools help in language learning by
providing context-aware feedback and translating text.
5. E-Commerce:
 Product Recommendation: NLP techniques analyze user reviews and
product descriptions to suggest relevant products to customers based on
their preferences and past behaviors.
59. Explain the importance of natural language understanding (NLU) in chatbot
development? (5 Marks)

 Importance of Natural Language Understanding (NLU) in Chatbot Development:
1. Interpretation of User Intent: NLU enables the chatbot to understand the user's
intention behind a query or statement. It helps in distinguishing between different
types of user input (questions, commands, requests) and interpreting them
correctly.
2. Entity Recognition: NLU identifies entities (such as dates, locations, products) in
user input, which is crucial for providing precise and relevant responses. For
example, in a weather chatbot, NLU will extract the city name to provide the
weather forecast.
3. Context Awareness: NLU helps the chatbot maintain context across multiple
turns in a conversation. It understands not just individual words, but the context of
the entire interaction, which allows for more natural conversations.
4. Improving User Experience: By interpreting the nuances and variations in
human language, NLU enables chatbots to deliver more accurate, personalized,
and engaging responses, improving user satisfaction.
5. Scalability: With good NLU capabilities, chatbots can scale to handle more
complex queries and support a broader range of interactions, making them more
adaptable to different industries and use cases.

60. Explain the architecture of the ChatGPT model in Natural Language Processing (NLP).
(5 Marks)

 ChatGPT Architecture:
1. Transformer Model: ChatGPT is based on the transformer architecture, which
employs self-attention mechanisms to understand and process text data. The
transformer allows the model to focus on relevant parts of the input sequence,
making it highly effective for natural language tasks.
2. Encoder-Decoder Structure: In earlier models, transformers utilized an encoder-
decoder structure. ChatGPT, however, is based on a decoder-only architecture,
where it generates text by predicting the next word based on the given input
sequence.
3. Pretraining: The model is pretrained on large amounts of text data to learn the
statistical properties of language. This helps ChatGPT understand grammar,
syntax, and context.
4. Fine-tuning: After pretraining, ChatGPT is fine-tuned on specific datasets and
human feedback to improve its conversational abilities and make it more aligned
with user expectations.
5. Response Generation: ChatGPT generates responses using a process called
autoregression, where the model predicts one word at a time, using the previously
generated words as context.
61. What are some of the ways in which data augmentation can be done in NLP projects?
(5 Marks)

 Data Augmentation in NLP:
1. Synonym Replacement: Replacing words with their synonyms using tools like
WordNet to diversify training data while maintaining the meaning of the sentence.
2. Back Translation: Translating a sentence into another language and then
translating it back into the original language. This introduces variations while
preserving the original meaning.
3. Text Generation: Using pretrained models like GPT to generate new training
samples that are similar in context to the original data.
4. Random Insertion/Deletion: Inserting or deleting words at random positions in
the sentence to introduce variations without changing the core meaning of the
text.
5. Word Embedding Perturbation: Adding noise to word embeddings to slightly
change the representation of words and increase the model’s robustness to small
changes in input.
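A minimal sketch of two of the strategies above, synonym replacement via WordNet (requires a one-time nltk.download('wordnet')) and random word deletion; the example sentence is invented.

import random
from nltk.corpus import wordnet   # requires: nltk.download('wordnet') once

def synonym_replace(tokens, n=1):
    # Replace n randomly chosen words with a WordNet synonym when one exists
    tokens = tokens[:]
    for _ in range(n):
        i = random.randrange(len(tokens))
        synsets = wordnet.synsets(tokens[i])
        if synsets:
            tokens[i] = synsets[0].lemmas()[0].name().replace("_", " ")
    return tokens

def random_delete(tokens, p=0.2):
    # Drop each word with probability p, keeping the original if everything is dropped
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens

sentence = "the movie was surprisingly good and very enjoyable".split()
print(synonym_replace(sentence))
print(random_delete(sentence))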

62. Compare Information Retrieval and Web Search. (5 Marks)

 Information Retrieval vs. Web Search:
1. Definition:
 Information Retrieval (IR): The process of obtaining relevant documents
or data from a large collection, usually based on text queries. The goal is
to return documents that are most relevant to the user’s query.
 Web Search: A specific application of information retrieval that involves
searching the web, typically using search engines like Google or Bing, to
return results from the internet.
2. Scope:
 IR: Typically focuses on structured or unstructured datasets like academic
papers, digital libraries, or corporate data.
 Web Search: Focuses on the entire internet, pulling results from websites,
blogs, news articles, and other publicly available online content.
3. Search Engine Optimization (SEO):
 IR: Does not necessarily depend on SEO practices but on relevance and
ranking of documents based on query matching.
 Web Search: Highly influenced by SEO techniques that aim to improve
the ranking of web pages on search engines.
4. Data Source:
 IR: Uses a curated dataset or knowledge base for retrieval.
 Web Search: Retrieves data from billions of web pages, often relying on
crawlers and indexers.
5. Use Cases:
 IR: Used in enterprise search, academic research, and digital libraries.
 Web Search: Used for general-purpose internet search to find
information, products, services, and media.

63. Explain Item-Based Collaborative Filtering. (5 Marks)

 Item-Based Collaborative Filtering:
1. Concept: Item-based collaborative filtering focuses on identifying similarities
between items rather than users. It recommends items based on the similarity
between items that the user has already interacted with.
2. How it Works:
 Item Similarity Calculation: The system calculates the similarity
between items by looking at users who have rated or interacted with both
items. For example, if users who liked Movie A also liked Movie B, these
two items are considered similar.
 Recommendation Generation: Once similarities between items are
calculated, the system recommends items that are similar to the ones the
user has interacted with. If a user liked Movie A, the system would
recommend Movie B.
3. Advantages:
 Scalability: It’s often easier to compute item similarities since there are
usually fewer items than users.
 Handling Cold-Start Problem: Item-based filtering can work better for
new users, as it doesn’t rely heavily on user-based data.
4. Example: In an online shopping platform, if a user buys a pair of shoes, the
system may recommend similar shoes or other fashion items based on item-based
collaborative filtering.

64. What is dialogue management in a chatbot model? (2 Marks)

 Dialogue Management in a Chatbot:
1. Definition: Dialogue management refers to the component of a chatbot that
controls the flow of conversation. It determines the chatbot’s response based on
the current user input and the context of the conversation.
2. Role: It maintains the conversation context, decides when to ask follow-up
questions, and handles transitions between different topics. Dialogue management
ensures that the chatbot responds appropriately and keeps the interaction coherent.
3. Approaches:
 Rule-Based: Predefined rules determine the next response based on the
user’s input.
 Machine Learning-Based: The system uses algorithms to predict the
most appropriate response based on the conversation history.
65. How do pre-trained language models like GPT-3 contribute to chatbot development?
(10 Marks)

 Pre-trained Language Models like GPT-3 in Chatbot Development:
1. Contextual Understanding: GPT-3, being a large-scale pre-trained language
model, is capable of understanding and generating contextually relevant text. This
allows chatbots to have more natural and fluid conversations with users, as GPT-3
generates responses based on the full context rather than individual queries.
2. Handling Diverse Topics: GPT-3 is trained on a wide variety of data sources,
allowing it to handle a broad range of topics. This is crucial for chatbots that need
to provide accurate information on various subjects, from customer support to
general queries.
3. Natural Responses: GPT-3 generates human-like responses, improving the
chatbot’s ability to provide coherent and engaging interactions, which reduces the
likelihood of users feeling like they are interacting with a machine.
4. Reduction of Manual Rule Setting: Traditional chatbots often rely on a set of
predefined rules and templates. GPT-3 reduces the need for extensive rule
creation by understanding and generating responses without specific rule-based
programming.
5. Adaptability: Pre-trained models like GPT-3 can be fine-tuned to specific
domains, improving their performance in specialized areas. For instance, GPT-3
can be fine-tuned for use in customer service, medical advice, or technical
troubleshooting.
6. Real-Time Learning: Although GPT-3 doesn’t "learn" in real time, it can be
updated with new training data to improve responses over time. This helps
chatbots stay current with evolving trends or information.
7. Example: A customer support chatbot powered by GPT-3 can provide responses
that feel personalized and contextually aware, making the interaction feel more
natural and human-like.

66. What is User-Based Collaborative Filtering? (2 Marks)

 User-Based Collaborative Filtering:
1. Concept: User-based collaborative filtering recommends items to a user based on
the preferences of similar users. The idea is that if two users have a history of
liking similar items, they will continue to have similar preferences in the future.
2. How it Works:
 User Similarity: The system identifies users who have similar tastes or
ratings by comparing their past behavior (e.g., movie ratings, product
purchases).
 Recommendation Generation: Items that similar users have liked are
recommended to the target user. For example, if User A and User B liked
similar movies, the system would recommend to User A movies that User
B has liked but User A hasn’t seen.
3. Challenges: User-based collaborative filtering may struggle with cold-start
problems, where new users or items have insufficient data for generating
recommendations.
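A minimal sketch of the idea in Python: find the most similar other user by cosine similarity of rating vectors, then suggest an item they rated highly that the target user has not yet seen. The ratings and titles below are invented.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not yet rated"
ratings = np.array([
    [5, 4, 0, 0],   # target user (row 0)
    [4, 5, 5, 0],
    [1, 0, 2, 5],
])
items = ["Inception", "The Matrix", "Interstellar", "The Notebook"]

similarities = cosine_similarity(ratings)[0]
similarities[0] = -1                      # ignore similarity with oneself
most_similar_user = similarities.argmax()

unseen = [i for i in range(len(items)) if ratings[0, i] == 0]
best = max(unseen, key=lambda i: ratings[most_similar_user, i])
print(items[best])  # -> 'Interstellar'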

67. Can statistical techniques be used to perform the task of machine translation? If so,
explain in brief. (10 Marks)

 Use of Statistical Techniques in Machine Translation:
1. Statistical Machine Translation (SMT):
 Definition: SMT involves using statistical models to translate text
between languages based on patterns derived from large corpora of
parallel text (text in both source and target languages).
 Techniques:
 Word Alignment: Words in the source language are mapped to
words in the target language based on statistical frequencies
observed in parallel corpora. For example, if the word “cat” in
English frequently corresponds to “gato” in Spanish, the model
learns this relationship.
 Phrase-Based Translation: SMT can also use phrases instead of
individual words to capture more natural translation. For example,
the phrase “good morning” in English may align with “buenos
días” in Spanish.
 Language Models: A target language model is built by analyzing
the frequency and probability of words or phrases occurring in the
target language. This helps in generating fluent translations.

2. Process:
 Training Phase: A statistical model is trained on large amounts of parallel
text, allowing it to learn language pair relationships and probability
distributions for translating one language to another.
 Decoding Phase: Given a source sentence, the system uses the trained
statistical model to predict the most likely translation by considering
word/phrase alignments and language model probabilities.
3. Challenges:
 Handling Ambiguity: SMT models struggle with ambiguities, such as
words that have multiple meanings based on context.
 Fluency: SMT may produce grammatically incorrect or awkward
translations because it focuses heavily on word/phrase-level translation
rather than sentence structure.
4. Example:
 English to French Translation: For the sentence “I like dogs,” an SMT
system might align “I” with “Je,” “like” with “aime,” and “dogs” with
“chiens,” producing the translation “J’aime les chiens.”
68. Explain text summarization and multiple document text summarization with neat
diagram. (10 Marks)

 Text Summarization:
1. Definition: Text summarization is the process of reducing a large body of text
into a shorter version while preserving the key information. It can be done in two
ways:
 Extractive Summarization: Extracting important sentences or phrases
directly from the original text and combining them to form a summary.
 Abstractive Summarization: Generating a new summary that
paraphrases the original text by understanding its meaning.
2. Multiple Document Text Summarization:
 Definition: Multiple document summarization aims to create a concise
summary from several documents on the same topic. The goal is to
combine and filter out redundant information while ensuring all key points
are captured.
 Process:
 Document Clustering: Group similar documents together based
on their content.
 Sentence Extraction: Identify key sentences from each document.
 Redundancy Removal: Eliminate duplicated information to
ensure a compact summary.
3. Diagram:
 A diagram for multiple document summarization could depict:
 Step 1: Multiple documents are fed into the system.
 Step 2: Key sentences are extracted from each document.
 Step 3: Similar content across documents is identified.
 Step 4: Redundant information is removed to form the final
summary.

69. With example, illustrate Abstraction-based summarization. (5 Marks)

 Abstraction-Based Summarization:
1. Definition: Abstraction-based summarization involves generating a summary that
paraphrases the original text, creating a more natural and concise representation of
the information. This method requires deeper understanding and language
generation.
2. Process:
 Understanding the Input: The system processes the input text to extract
meaning rather than simply selecting key sentences.
 Paraphrasing: The system then generates a new, shorter version of the
text that captures the essence of the original content.
3. Example:
 Original Text: “The rain caused severe flooding across the city. People
had to be rescued from their homes as the water levels rose.”
 Abstraction-Based Summary: “Heavy rain led to widespread flooding,
requiring rescue operations.”
4. Key Points:
 Language Generation: It is based on language generation techniques that
may include machine learning models like transformers.
 Fluency: Abstraction-based summaries tend to be more fluent and natural,
unlike extractive methods that directly copy sentences from the text.

70. Illustrate the advantages and disadvantages of a Content-based and Collaborative
Filtering recommendation system. (5 Marks)

 Content-Based Filtering:
1. Advantages:
 Personalization: Recommends items similar to those a user has interacted
with, providing a personalized experience.
 No Cold Start for Items: New items can be recommended if they have
enough metadata (e.g., genre, description).
 Transparency: The reasons for recommending items are clearer because
they are based on the content the user has liked.
2. Disadvantages:
 Limited Discovery: Users may only be recommended items similar to
what they’ve seen, leading to a lack of variety.
 Dependency on Item Metadata: It requires detailed metadata, which may
not always be available.
 Collaborative Filtering:
1. Advantages:
 Discover New Items: It can recommend items that the user may never
have come across, expanding their preferences.
 No Need for Item Metadata: It doesn’t rely on item characteristics,
making it useful for items with sparse metadata.
2. Disadvantages:
 Cold Start Problem: It struggles with new users or new items that lack
sufficient interaction data.
 Scalability Issues: As the number of users and items increases,
collaborative filtering models can become computationally expensive.

THANK YOU
