NLP SEE
Q1) Explain Information Retrieval (IR) and the classical problem in an Information Retrieval (IR) system.
Ans. Information Retrieval (IR) is the process of obtaining relevant information from a large
repository, such as a database or the web, based on user queries.
Key Concepts:
1. Document Collection: A repository of structured or unstructured text documents.
2. Queries: User input specifying the required information.
3. Relevance: Matching documents based on their content and user query.
4. Ranking: Arranging documents in order of relevance to the query.
Example: Search engines like Google use IR techniques to provide relevant results for user
queries.
Algorithm Concept:
The Boolean Model uses set operations to retrieve documents that satisfy a query's logical
conditions. The model assumes exact matching and outputs a binary decision: relevant or not
relevant.
Steps or Procedure:
1. Indexing: Create an inverted index mapping terms to document IDs.
2. Query Parsing: Convert the user query into a logical expression.
3. Set Operations: Perform set-based operations (union, intersection, or complement) on
document IDs based on Boolean operators.
4. Result Retrieval: Return documents satisfying the query criteria without ranking.
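A minimal Python sketch of these steps (the toy documents and the AND query are illustrative assumptions, not part of the original notes):

```python
# Boolean retrieval: inverted index + set operations (no ranking).
docs = {
    1: "information retrieval with the boolean model",
    2: "vector space model for retrieval",
    3: "boolean logic and set operations",
}

# 1. Indexing: map each term to the set of document IDs that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# 2-4. Query "boolean AND retrieval": intersect the two posting sets.
result = index.get("boolean", set()) & index.get("retrieval", set())
print(result)  # {1} -- documents matching both terms, returned without ranking
```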
Advantages:
1. Simplicity: The model is easy to understand and implement, especially for users familiar
with Boolean logic.
2. Precision: Allows users to retrieve specific documents using exact query matching
criteria.
3. Efficiency: Works well for small datasets with straightforward and well-defined queries.
4. Structured Queries: Handles queries involving complex logical combinations using AND,
OR, and NOT operators.
5. Customization: Offers flexibility to craft queries based on specific requirements using
Boolean expressions.
6. Low Computational Requirements: Does not require advanced computational resources,
making it lightweight.
Disadvantages:
1. No Ranking: Fails to rank retrieved documents, making it difficult to prioritize the most
relevant ones.
2. Exact Match Dependency: Ineffective for ambiguous or incomplete queries, as it relies
on exact term matching.
3. No Partial Matching: Does not support fuzzy search or term similarity, limiting retrieval
capabilities.
4. Rigid Query Structure: Users must precisely formulate queries, which can be
challenging for complex or vague information needs.
5. No Semantic Understanding: Fails to capture the relationships or context between
terms.
Vector Space Model (VSM)
Key Components
1. Documents as Vectors: Each document is represented as a vector of terms in a multi-
dimensional space.
2. Queries as Vectors: Queries are treated similarly, represented as vectors of terms.
3. Vector Components: Components are term weights, often calculated using TF-IDF
(Term Frequency-Inverse Document Frequency).
4. Cosine Similarity: Measures the cosine of the angle between query and document
vectors to assess similarity.
Algorithm Concept
The model calculates the cosine of the angle between query and document vectors in a vector
space. Smaller angles (cosine values closer to 1) indicate higher similarity.
Formula
cos(θ) = (q · d) / (|q| × |d|)
Where:
− q = query vector and d = document vector
− q · d = dot product of the query and document vectors
− |q|, |d| = Euclidean norms (magnitudes) of the vectors
Steps or Procedure
1. Vector Representation: Represent documents and queries as term-weighted vectors
using TF-IDF or other weighting schemes.
2. Dot Product Calculation: Compute the dot product between the query vector and each
document vector.
3. Magnitude Calculation: Calculate the Euclidean norm (magnitude) of the vectors.
4. Similarity Measurement: Use the cosine similarity formula to compute similarity scores
for each document.
5. Result Ranking: Rank documents based on similarity scores for query relevance.
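A short sketch of these steps using scikit-learn's TfidfVectorizer and cosine_similarity (the corpus and query below are illustrative assumptions; scikit-learn must be installed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information retrieval with the boolean model",
    "vector space model ranks documents by cosine similarity",
    "stemming reduces words to their root form",
]
query = ["vector space model for document ranking"]

# 1. Vector representation: TF-IDF weighted term vectors.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

# 2-4. Cosine similarity between the query vector and every document vector.
scores = cosine_similarity(query_vector, doc_vectors)[0]

# 5. Rank documents by similarity score (highest first).
ranking = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
print(ranking)  # the second document (index 1) should score highest
```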
Advantages
1. Relevance Ranking: Assigns scores, allowing documents to be ranked by similarity.
2. Partial Matching: Retrieves documents even with partial query matches.
3. Scalability: Works well with large datasets using sparse representations.
4. Mathematical Foundation: Provides a structured, mathematical approach to measuring
relevance.
Disadvantages
1. Dimensionality Issues: High-dimensional space increases computational complexity.
2. Synonym Limitations: Cannot capture semantic similarities or resolve word
ambiguities.
3. Weight Sensitivity: Accuracy depends on term weighting schemes like TF-IDF.
4. No Contextual Understanding: Ignores word order or deeper contextual relationships.
Graph: vector space plot of the query and document vectors; the angle θ between them determines the cosine similarity.
Q5) Stemming.
Ans. Stemming in NLP
Concept Meaning:
Stemming in Natural Language Processing (NLP) reduces words to their root or base form by
removing suffixes or prefixes, improving text normalization and search efficiency.
Key Components
1. Word Roots: The base form of a word that retains its core meaning.
2. Affixes Removal: Eliminating prefixes, suffixes, or inflectional endings.
3. Algorithms: Rules or statistical methods used to derive word stems.
Types of Stemming
1. Porter Stemmer: Widely used and removes common suffixes based on a set of rules.
2. Lancaster Stemmer: A more aggressive stemmer, reducing words to shorter roots.
3. Snowball Stemmer: An improved version of Porter with support for multiple languages.
4. Lovins Stemmer: Early and rule-based, focusing on removing longest matching suffixes.
5. Suffix-Stripping Stemmer: Removes suffixes based on predefined patterns or heuristics.
6. Corpus-Based Stemmer: Utilizes a specific corpus to identify stem patterns statistically.
7. Hybrid Stemmer: Combines rule-based and statistical methods for better performance.
8. Light Stemmer: Focuses on minor affix removal, often used in non-English languages
like Arabic.
9. Inflectional Stemmer: Deals only with inflectional endings like plurals or verb
conjugations.
10. Rule-Based Stemmer: Applies explicit linguistic rules for stripping affixes.
11. Machine Learning Stemmer: Learns stemming patterns using supervised learning
models trained on labeled data.
Algorithm Concept
Stemming algorithms apply a sequence of transformation rules to remove affixes iteratively or
by matching against pre-defined patterns. The goal is to simplify word forms consistently
without altering meaning significantly.
Example: the Porter stemmer reduces "running" and "runs" to "run", and "connection" and "connected" to "connect".
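A minimal sketch using NLTK's PorterStemmer (assumes the nltk package is installed; the word list is illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "connected", "connection", "studies"]:
    print(word, "->", stemmer.stem(word))
# Expected Porter stems: run, run, connect, connect, studi
```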
Q6) Comparison Between Google Assistant, Apple Siri, and Amazon Alexa
Feature | Google Assistant | Apple Siri | Amazon Alexa
Device Compatibility | Works with Android phones, smart TVs, speakers, and IoT devices. | Exclusive to Apple devices, with limited third-party access. | Extensive support for smart home devices and IoT.
Smart Home Integration | Works with the Google Home ecosystem. | Integrated with Apple HomeKit for compatible devices. | Strong ecosystem of Alexa-enabled smart devices.
App Integration | Supports Google apps like Calendar, Maps, and Gmail. | Integrates with Apple apps like Safari, Mail, and Reminders. | Integrates with Amazon services like Prime, Kindle, and Music.
Music Services | Compatible with YouTube Music, Spotify, and others. | Works with Apple Music and limited third-party services. | Supports Amazon Music, Spotify, and many third-party apps.
Developer Support | Open platform for developers to build Google Actions. | Limited developer customization via Siri Shortcuts. | Alexa Skills Kit allows extensive third-party integrations.
MODULE 4
Q) How does Latent Semantic Analysis (LSA) evaluate student self-explanations (e.g., in iSTART)?
Ans.
1. Dimensionality Reduction: LSA reduces high-dimensional word space into a smaller set
of latent concepts using Singular Value Decomposition (SVD), capturing underlying
semantic structures.
2. Term-Document Matrix: Represents words as rows and documents (self-explanations)
as columns, with values indicating term frequency or importance.
3. Latent Semantic Space: Maps words and explanations into a lower-dimensional space
where similar meanings are closer.
4. Semantic Similarity: Measures how closely a student’s self-explanation aligns with the
ideal response by comparing vector representations in latent space.
5. Conceptual Understanding: It helps uncover hidden semantic structures, enabling
evaluations based on conceptual meaning rather than surface-level matching.
6. SVD Decomposition: Breaks the term-document matrix into three matrices (U, Σ, and
V), retaining only the most significant features to reduce noise.
7. Handling Synonymy: Captures relationships between synonyms (e.g., "car" and
"automobile") by mapping them to the same latent concept.
8. Noise Reduction: Removes irrelevant or redundant data, ensuring semantic analysis
focuses only on meaningful patterns in the text.
9. Similarity Scoring: Calculates cosine similarity between vectors of student and expected
explanations, providing a quantifiable match score.
10. Scalability: Efficiently evaluates a large number of responses by summarizing text into a
compact representation.
11. Data-Driven Approach: Relies on mathematical techniques, ensuring objective
evaluations without requiring manually crafted semantic rules.
12. Knowledge Representation: Models conceptual understanding, making it effective for
assessing the depth of student explanations in ISTART.
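A minimal sketch of this LSA pipeline with scikit-learn (TruncatedSVD over a TF-IDF term-document matrix); the "ideal" and "student" explanations below are invented illustrations, not iSTART data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Ideal explanation followed by two student self-explanations (toy examples).
texts = [
    "the car accelerates because the engine applies force",     # ideal answer
    "the automobile speeds up since the motor produces force",  # student A (paraphrase)
    "plants absorb sunlight to perform photosynthesis",         # student B (off-topic)
]

# Term-document matrix, then SVD down to a small latent semantic space.
tfidf = TfidfVectorizer().fit_transform(texts)
svd = TruncatedSVD(n_components=2, random_state=0)
latent = svd.fit_transform(tfidf)

# Cosine similarity between the ideal explanation and each student response.
scores = cosine_similarity(latent[0:1], latent[1:])
print(scores)  # student A should score closer to the ideal than student B
```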
MODULE 3
Q1) Extracting relations from text in NLP (definition, concept, challenges). (IAT)
Ans. Relation Extraction (RE) :
− Identifies relationships between entities in text (e.g., "John works at Google" identifies a
works-at relation).
− Aims to structure unstructured text into knowledge graphs or databases.
Entities :
− Core elements in the text, such as names, organizations, or locations (e.g., John, Google).
− Extraction requires identifying these entities for relation identification.
Types of Relations :
− Binary Relations: Two entities connected by a single relation (e.g., "Person works-at
Organization").
− N-ary Relations: Multiple entities connected by a relation (e.g., "Person treated for
Disease by Doctor").
Applications : Used in knowledge graph construction, question-answering systems, and
automated data enrichment tasks.
Challenges :
− Ambiguity: the same words can express different relations depending on context.
− Implicit relations: many relations are not signalled by explicit trigger words or phrases.
− Data scarcity: supervised approaches need large amounts of labeled training data, which is expensive to create.
− Domain variation: extractors trained on one domain often generalize poorly to another.
− Long-distance and cross-sentence relations are difficult to capture.
Q2) Difference between Rule-Based Methods and Supervised Learning for extracting relations from text in NLP. (IAT)
Aspect | Rule-Based Methods | Supervised Learning
Definition | Uses predefined linguistic rules for relation extraction. | Learns relations from labeled data using machine learning.
Scalability | Difficult to scale due to manual rule creation. | Scalable with sufficient training data.
Accuracy | Highly accurate for well-defined patterns. | Accuracy depends on data quality and model training.
Generalization | Poor at generalizing to unseen data. | Can generalize well with diverse training data.
Data Requirement | Works with minimal data but depends on rule quality. | Requires large amounts of labeled data.
Q3) Explain the steps of extracting a relation from text with an example.
Ans. Objective: Extract the relationship works-at from the sentence: "Alice works at Google."
1. Input Text - Start with raw text data, e.g., "Alice works at Google."
2. Preprocessing - Clean the text by removing punctuation, lowercasing (if required), and
tokenizing into individual words or phrases: ["Alice", "works", "at", "Google"]
3. Named Entity Recognition (NER) - Identify named entities in the text, such as people,
organizations, or locations.
Example: Alice is labeled as a Person, and Google as an Organization.
4. Dependency Parsing - Analyze grammatical structure to find relationships between
words (e.g., subject, object).
Example:
a. works → subject → Alice
b. works → object → Google
5. Feature Extraction - Extract features such as Part-of-Speech (POS) tags, dependency
paths, and surrounding context for the identified entities.
Example: POS of Alice = Noun, Google = Proper Noun, Dependency Path = works-at.
6. Relation Classification - Use a trained model (e.g., based on BERT or a rule-based
method) to determine the relationship between entities.
Output: works-at
7. Postprocessing - Validate and format the extracted relations for structured storage or
display.
Example: (Alice, works-at, Google)
8. Output - Store the relation in a database or knowledge graph, or display it for user
interaction.
Final Output: (Entity 1: Alice, Relation: works-at, Entity 2: Google)
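A minimal rule-based sketch of this pipeline using spaCy (assumes spaCy and its en_core_web_sm model are installed); it looks for a subject of the verb "work" linked to an object of the preposition "at":

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alice works at Google.")

# Step 3: Named Entity Recognition.
print([(ent.text, ent.label_) for ent in doc.ents])  # typically [('Alice', 'PERSON'), ('Google', 'ORG')]

# Steps 4-7: dependency-based rule for the works-at relation.
relations = []
for token in doc:
    if token.lemma_ == "work" and token.pos_ == "VERB":
        subjects = [c for c in token.children if c.dep_ == "nsubj"]
        preps = [c for c in token.children if c.dep_ == "prep" and c.text.lower() == "at"]
        objects = [g for p in preps for g in p.children if g.dep_ == "pobj"]
        if subjects and objects:
            relations.append((subjects[0].text, "works-at", objects[0].text))

print(relations)  # expected: [('Alice', 'works-at', 'Google')]
```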
MODULE 2
Q1) Regular Expressions in Natural Language Processing (NLP). (IAT)
Ans. A Regular Expression (Regex) is a sequence of characters defining a search pattern,
primarily used for text processing tasks like searching, matching, and manipulating strings in
NLP.
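A small illustration with Python's re module (the pattern and the sample sentence are illustrative assumptions):

```python
import re

text = "Email the report to alice@example.com by Friday."

# \b\w+\b matches individual word tokens: a simple regex-based tokenizer.
tokens = re.findall(r"\b\w+\b", text)

# A basic (not fully RFC-compliant) pattern for e-mail addresses.
email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

print(tokens)
print(email.group())  # alice@example.com
```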
Regular expressions are typically implemented with Finite State Automata (FSA), which recognize the same class of regular languages.
Formula :
− An FSA is represented mathematically as a 5-tuple: (Q,Σ,δ,q0 ,F)
Where:
1. Q: Set of finite states.
2. Σ: Input alphabet (set of symbols).
3. δ: Transition function (δ: Q×Σ→Q).
4. q0 : Initial state (q0 ∈ Q).
5. F: Set of final/accepting states (F⊆Q).
Types of Finite State Automata with Examples
1. Deterministic Finite Automata (DFA) :
− Each input leads to one and only one state.
− Eg : Recognizing binary strings ending with "01" (see the sketch after this list).
2. Non-Deterministic Finite Automata (NDFA)
− Transitions may lead to multiple states for a given input.
− Example: Recognizing binary strings containing "101".
3. ε-NFA (Epsilon-NFA)
− Includes ε-transitions (move without consuming input).
− Example: Recognizing strings starting with "a" or "b".
4. Pushdown Automata (PDA)
− Uses a stack for memory, handles context-free languages.
− Example: Parsing nested parentheses.
5. Turing Machine
− Infinite tape for memory; models any computable function.
− Example: Complex string transformations.
6. Linear Bounded Automata (LBA)
− Turing Machine with bounded tape size.
− Example: Recognizing strings of equal 'a's and 'b's.
7. Probabilistic FSA
− Each transition is associated with a probability.
− Example: Speech recognition.
8. Weighted FSA
− Each transition has weights (cost or frequency).
− Example: Pathfinding in syntax trees.
9. Mealy Machine
− Outputs are determined by transitions.
− Example: Translating Morse code.
10. Moore Machine
− Outputs are associated with states.
− Example: Encoding strings into binary.
11. Two-Way FSA
− Reads input in both directions.
− Example: Palindrome detection.
12. Timed Automata
− Transitions depend on time constraints.
− Example: Real-time processing in chatbots.
13. Hybrid Automata
− Combines discrete and continuous transitions.
− Example: Speech-to-text systems with audio features.
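As referenced in type 1 above, a minimal Python sketch of a DFA that accepts binary strings ending with "01" (the state names and transition table are illustrative):

```python
# DFA for binary strings ending with "01".
# q0: start, q1: last symbol was 0, q2: accepting (string ends with "01").
transitions = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}

def accepts(string):
    state = "q0"
    for symbol in string:
        state = transitions[(state, symbol)]
    return state == "q2"

print(accepts("1101"))  # True  (ends with "01")
print(accepts("0110"))  # False
```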
Advantages – Simplicity, Efficiency (operates in linear time), Versatility (supports tasks like tokenization and spell-checking).
Applications in NLP – Tokenization, Spell Checking (verifying the validity of words), Named Entity Recognition (NER) (identifying entities such as names, organizations, and locations), Information Retrieval (matching queries with documents).
Q3) What is a parser? Explain the types of parsing in NLP.
Ans. A parser is a component of the Natural Language Processing pipeline that processes input
sentences to derive their grammatical structure. It converts unstructured or semi-structured data
into a structured format.
Block Diagram :
• Input sentence → apply grammar rules (context-free grammar) → parse tree (hierarchical representation of sentence structure)
Types of Parsing:
1. Top-Down Parsing
− Starts from the root of the parse tree and expands down to the leaves.
− Example: Recursive Descent Parsing.
− Use: Recognizes valid strings based on the grammar.
2. Bottom-Up Parsing
− Starts from the input symbols (leaves) and builds up to the root.
− Use: Identifies structures as it reads the sentence.
3. Syntactic Parsing
− Focuses on grammar rules to construct the structure.
− Example: Context-Free Grammar (CFG).
− Use: Syntax tree generation for sentences.
4. Semantic Parsing
− Maps natural language to logical forms or meaning representations.
− Use: Question answering and dialogue systems.
5. Dependency Parsing
− Analyzes grammatical dependencies between words.
− Example: "She enjoys reading" → "enjoys" is the head, "she" and "reading" are
dependents.
6. Constituency Parsing
− Breaks sentences into constituents like phrases.
− Example: "The big cat" → Noun Phrase (NP).
7. Statistical Parsing
− Uses probabilistic methods to determine the most likely parse.
− Example: Probabilistic Context-Free Grammar (PCFG).
Advantages - Syntactic Analysis (Extracts sentence structure), Flexibility, Error Detection,
Supports AI Models (chatbots and translators).