NLP SEE

MODULE 5

Q1) Explain Information Retrieval (IR) and the classical problem in Information Retrieval (IR) systems.
Ans. Information Retrieval (IR) is the process of obtaining relevant information from a large
repository, such as a database or the web, based on user queries.

Key Concepts:
1. Document Collection: A repository of structured or unstructured text documents.
2. Queries: User input specifying the required information.
3. Relevance: Matching documents based on their content and user query.
4. Ranking: Arranging documents in order of relevance to the query.

Example: Search engines like Google use IR techniques to provide relevant results for user
queries.

Design Features of IR:


i. Document Indexing: Efficiently organize documents using structures like inverted
indexes for quick retrieval.
ii. Query Processing: Parse, normalize, and expand user queries to handle diverse formats
and improve performance.
iii. Relevance Ranking: Rank documents based on relevance using algorithms like TF-IDF
or BM25.
iv. Scalability: Manage large datasets with efficient storage, distributed indexing, and
retrieval mechanisms.
v. Natural Language Support: Process queries and documents using stemming,
lemmatization, and phrase detection.
vi. User Feedback Integration: Enable iterative refinement of results through relevance
feedback.
vii. Semantic Search: Match concepts rather than keywords using contextual understanding.
viii. Multimedia Retrieval: Support various content types, including text, images, videos,
and audio.
ix. Real-Time Updates: Allow dynamic addition or modification of indexed documents.
x. Cross-Language Retrieval: Facilitate multilingual searching using machine translation
or language-independent indexing.
xi. Personalization: Tailor results based on user preferences, search history, and behavior.
xii. Security and Privacy: Ensure data confidentiality and protection for sensitive queries or
documents.

Classical Problems in IR Systems

Ad-Hoc Retrieval Problem


• Definition: A core problem where the system retrieves a ranked list of documents
relevant to a user's specific query without prior knowledge of the query.
• Characteristics:
− The user query is often vague or ambiguous.
− The document collection is static.
− Relevance ranking is crucial.
• Challenges:
− Determining user intent accurately.
− Ranking documents effectively despite query ambiguity.
• Key Components:
− Query Processing: Parsing and understanding the user query.
− Indexing: Precomputing document features for efficient matching.
− Matching and Ranking: Scoring documents based on relevance metrics like TF-IDF or
BM25.
• Real-World Example: When a user searches for "best phones," the system retrieves
relevant documents from a static collection and ranks them based on factors like reviews,
features, and popularity.

Q2) Types of Information Retrieval (IR) Models


Ans. The various types include:
1. Classical IR Model: Includes Boolean, vector space, and probabilistic models, focusing
on document-term relationships and relevance.
2. Non-Classical IR Model: Explores semantic, fuzzy, and graph-based models to address
challenges of contextual and uncertain data retrieval.
3. Alternative IR Model: Introduces ranking, hybrid, and knowledge-based methods,
combining multiple approaches to improve search accuracy.
4. Boolean Model: Represents queries using logical operators (AND, OR, NOT) to retrieve
exact matches from documents.
5. Vector Space Model: Represents documents and queries as vectors in multi-dimensional
space; uses cosine similarity for ranking relevance.
6. Probabilistic Model: Predicts the probability of a document's relevance to a query using
probabilistic reasoning.
7. Language Model: Uses statistical language models to rank documents based on the
likelihood of generating the user query.
8. Latent Semantic Indexing (LSI): Extracts hidden semantic structures in documents for
improved similarity matching.
9. Extended Boolean Model: Enhances classical Boolean with partial matching and
ranking for more flexible retrieval.
10. Ranking Models: Focuses on ranking algorithms (e.g., PageRank) to prioritize
documents based on their importance or authority.
11. Neural Network Models: Uses deep learning architectures like BERT and transformers
to capture semantic relationships and context.
12. Graph-Based Models: Represents documents and terms as nodes in a graph to analyze
relationships and rank documents.
13. Hybrid Models: Combines classical and non-classical models, leveraging their strengths
for enhanced accuracy and versatility in retrieval.

Q3) Boolean Model.

Ans. Boolean Model in Information Retrieval


Concept Meaning:
The Boolean Model represents queries and documents using binary logic, where terms are
matched exactly based on logical operators like AND, OR, and NOT.
Key Components:
1. Documents: A collection of indexed documents containing terms.
2. Queries: Logical expressions using Boolean operators (AND, OR, NOT) to specify
retrieval criteria.
3. Operators:
a. AND: Retrieves documents containing all specified terms.
b. OR: Retrieves documents containing at least one of the specified terms.
c. NOT: Excludes documents containing the specified term.

Algorithm Concept:
The Boolean Model uses set operations to retrieve documents that satisfy a query's logical
conditions. The model assumes exact matching and outputs a binary decision: relevant or not
relevant.

Steps or Procedure:
1. Indexing: Create an inverted index mapping terms to document IDs.
2. Query Parsing: Convert the user query into a logical expression.
3. Set Operations: Perform set-based operations (union, intersection, or complement) on
document IDs based on Boolean operators.
4. Result Retrieval: Return documents satisfying the query criteria without ranking.
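
These steps can be sketched in a few lines of Python. This is a minimal illustration only; the documents and the query below are invented for the example, not taken from any real system:

```python
# Minimal Boolean retrieval sketch: build an inverted index, then answer
# a query with set operations (AND = intersection, OR = union, NOT = difference).
docs = {
    1: "cheap phones with good camera",
    2: "best laptops for students",
    3: "cheap laptops with good battery",
}

# Indexing: map each term to the set of document IDs containing it.
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

# Query: cheap AND good AND NOT camera
result = (inverted_index.get("cheap", set())
          & inverted_index.get("good", set())
          & (all_ids - inverted_index.get("camera", set())))

print(result)  # {3} -> only document 3 satisfies the query; no ranking is produced
```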

Advantages:
1. Simplicity: The model is easy to understand and implement, especially for users familiar
with Boolean logic.
2. Precision: Allows users to retrieve specific documents using exact query matching
criteria.
3. Efficiency: Works well for small datasets with straightforward and well-defined queries.
4. Structured Queries: Handles queries involving complex logical combinations using AND,
OR, and NOT operators.
5. Customization: Offers flexibility to craft queries based on specific requirements using
Boolean expressions.
6. Low Computational Requirements: Does not require advanced computational resources,
making it lightweight.

Disadvantages:
1. No Ranking: Fails to rank retrieved documents, making it difficult to prioritize the most
relevant ones.
2. Exact Match Dependency: Ineffective for ambiguous or incomplete queries, as it relies
on exact term matching.
3. No Partial Matching: Does not support fuzzy search or term similarity, limiting retrieval
capabilities.
4. Rigid Query Structure: Users must precisely formulate queries, which can be
challenging for complex or vague information needs.
5. No Semantic Understanding: Fails to capture the relationships or context between
terms.

Q4) Vector Space Model and Cosine Similarity

Ans. Concept Meaning


The Vector Space Model represents documents and queries as vectors in a multi-dimensional
space, enabling similarity measurement using cosine similarity.

Key Components
1. Documents as Vectors: Each document is represented as a vector of terms in a multi-
dimensional space.
2. Queries as Vectors: Queries are treated similarly, represented as vectors of terms.
3. Vector Components: Components are term weights, often calculated using TF-IDF
(Term Frequency-Inverse Document Frequency).
4. Cosine Similarity: Measures the cosine of the angle between query and document
vectors to assess similarity.

Algorithm Concept
The model calculates the cosine of the angle between query and document vectors in a vector
space. Smaller angles (cosine values closer to 1) indicate higher similarity.

Formula:

cos(Q, D) = (Q · D) / (||Q|| × ||D||)

Where:

• Q · D: Dot product of the query (Q) and document (D) vectors.
• ||Q||, ||D||: Magnitudes (Euclidean norms) of the query and document vectors.

Steps or Procedure
1. Vector Representation: Represent documents and queries as term-weighted vectors
using TF-IDF or other weighting schemes.
2. Dot Product Calculation: Compute the dot product between the query vector and each
document vector.
3. Magnitude Calculation: Calculate the Euclidean norm (magnitude) of the vectors.
4. Similarity Measurement: Use the cosine similarity formula to compute similarity scores
for each document.
5. Result Ranking: Rank documents based on similarity scores for query relevance.
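
A compact sketch of this procedure using scikit-learn; the documents and query are illustrative, and TfidfVectorizer with cosine_similarity carry out the weighting and the formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information retrieval ranks documents by relevance",
    "cosine similarity measures the angle between vectors",
    "stemming reduces words to their root form",
]
query = "rank documents with cosine similarity"

# Step 1: represent documents and the query as TF-IDF weighted vectors.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

# Steps 2-4: cosine_similarity computes dot products normalized by vector magnitudes.
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Step 5: rank documents by similarity score (highest first).
ranking = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
print(ranking)
```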
Advantages
1. Relevance Ranking: Assigns scores, allowing documents to be ranked by similarity.
2. Partial Matching: Retrieves documents even with partial query matches.
3. Scalability: Works well with large datasets using sparse representations.
4. Mathematical Foundation: Provides a structured, mathematical approach to measuring
relevance.

Disadvantages
1. Dimensionality Issues: High-dimensional space increases computational complexity.
2. Synonym Limitations: Cannot capture semantic similarities or resolve word
ambiguities.
3. Weight Sensitivity: Accuracy depends on term weighting schemes like TF-IDF.
4. No Contextual Understanding: Ignores word order or deeper contextual relationships.

Graph: Query and document vectors plotted in term space; the angle between them reflects their similarity (figure omitted).

Q5) Stemming.
Ans. Stemming in NLP
Concept Meaning:
Stemming in Natural Language Processing (NLP) reduces words to their root or base form by
removing suffixes or prefixes, improving text normalization and search efficiency.

Key Components
1. Word Roots: The base form of a word that retains its core meaning.
2. Affixes Removal: Eliminating prefixes, suffixes, or inflectional endings.
3. Algorithms: Rules or statistical methods used to derive word stems.

Types of Stemming

1. Porter Stemmer: Widely used and removes common suffixes based on a set of rules.
2. Lancaster Stemmer: A more aggressive stemmer, reducing words to shorter roots.
3. Snowball Stemmer: An improved version of Porter with support for multiple languages.
4. Lovins Stemmer: Early and rule-based, focusing on removing longest matching suffixes.
5. Suffix-Stripping Stemmer: Removes suffixes based on predefined patterns or heuristics.
6. Corpus-Based Stemmer: Utilizes a specific corpus to identify stem patterns statistically.
7. Hybrid Stemmer: Combines rule-based and statistical methods for better performance.
8. Light Stemmer: Focuses on minor affix removal, often used in non-English languages
like Arabic.
9. Inflectional Stemmer: Deals only with inflectional endings like plurals or verb
conjugations.
10. Rule-Based Stemmer: Applies explicit linguistic rules for stripping affixes.
11. Machine Learning Stemmer: Learns stemming patterns using supervised learning
models trained on labeled data.

Algorithm Concept
Stemming algorithms apply a sequence of transformation rules to remove affixes iteratively or
by matching against pre-defined patterns. The goal is to simplify word forms consistently
without altering meaning significantly.

Why Use Stemming?


1. Search Optimization: Reduces query and document terms to common stems for better
matches.
2. Text Normalization: Converts words to a consistent base form, simplifying text
analysis.
3. Reduced Ambiguity: Minimizes the confusion around words that have similar meanings.
4. Storage Efficiency: Reduces storage requirements by collapsing similar words.
5. Indexing: When creating applications that search a specific text in a document, using
stemming for indexing assists in retrieving relevant documents.
6. Language Flexibility: Handles word variations in different grammatical contexts
effectively.
7. Efficiency: Improves the accuracy of ML/DL models, since the model does not have to
deal with many inflected word forms.
8. Cost-Effective: Decreases computational overhead by reducing vocabulary size.

Example:
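A small sketch using NLTK's stemmers (assumes the nltk package is installed; the word list is arbitrary):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["running", "studies", "happily", "organization"]

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

for w in words:
    # Print the stem produced by each algorithm side by side.
    print(w, "->", porter.stem(w), lancaster.stem(w), snowball.stem(w))
# e.g. "running" -> "run" with all three; Lancaster is usually the most aggressive.
```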
Q6) Comparison Between Google Assistant, Apple Siri, and Amazon Alexa

1. Platform Integration
− Google Assistant: Android, iOS, smart devices, and the Google ecosystem.
− Apple Siri: iOS, macOS, watchOS, and the Apple ecosystem.
− Amazon Alexa: Amazon Echo, Fire devices, and Alexa-enabled hardware.

2. Voice Recognition
− Google Assistant: Advanced; supports multiple user profiles.
− Apple Siri: Limited to a single user profile, but improving.
− Amazon Alexa: Multi-user voice recognition for personalized responses.

3. Language Support
− Google Assistant: Over 40 languages and regional dialects.
− Apple Siri: Supports 20+ languages with strong regional variations.
− Amazon Alexa: Over 15 languages, but fewer dialects than Google Assistant.

4. Device Compatibility
− Google Assistant: Works with Android phones, smart TVs, speakers, and IoT devices.
− Apple Siri: Exclusively for Apple devices, with limited third-party access.
− Amazon Alexa: Extensive support for smart home devices and IoT.

5. Search Capability
− Google Assistant: Leverages Google Search for detailed and accurate results.
− Apple Siri: Relies on Apple services; less detailed web results.
− Amazon Alexa: Uses Bing and Amazon's data sources.

6. Smart Home Integration
− Google Assistant: Works with the Google Home ecosystem.
− Apple Siri: Integrated with Apple HomeKit for compatible devices.
− Amazon Alexa: Strong ecosystem of Alexa-enabled smart devices.

7. App Integration
− Google Assistant: Supports Google apps like Calendar, Maps, and Gmail.
− Apple Siri: Integrates with Apple apps like Safari, Mail, and Reminders.
− Amazon Alexa: Amazon services like Prime, Kindle, and Music.

8. Music Services
− Google Assistant: Compatible with YouTube Music, Spotify, and others.
− Apple Siri: Works with Apple Music and limited third-party services.
− Amazon Alexa: Supports Amazon Music, Spotify, and many third-party apps.

9. Developer Support
− Google Assistant: Open platform for developers to build Google Actions.
− Apple Siri: Limited developer customization via Siri Shortcuts.
− Amazon Alexa: Alexa Skills Kit allows extensive third-party integrations.

10. Shopping Integration
− Google Assistant: Integrates with Google Shopping for e-commerce.
− Apple Siri: Limited e-commerce features; mainly in-app purchases.
− Amazon Alexa: Strong focus on Amazon shopping and reordering products.

11. Offline Support
− Google Assistant: Some basic commands work offline.
− Apple Siri: Basic functions like setting alarms work offline.
− Amazon Alexa: Limited offline capabilities; relies heavily on the internet.

12. Personalization
− Google Assistant: Offers personalized responses using Google Account data.
− Apple Siri: Personalization tied to Apple ID but less detailed.
− Amazon Alexa: Provides tailored experiences using Amazon account data.

MODULE 4

Q1) Synonym Matching and Semantic Matching. (IAT)

Ans. The differences are as follows:

1. Definition
− Synonym Matching: Matches words based on similar meanings (synonyms) to allow flexibility.
− Semantic Matching: Analyzes the broader meaning by understanding the context of phrases and sentences.

2. Purpose
− Synonym Matching: Aims to account for varying word choices that convey the same concept.
− Semantic Matching: Focuses on deeper comprehension by assessing meaning beyond individual words.

3. Technique
− Synonym Matching: Uses databases like WordNet or synonym dictionaries.
− Semantic Matching: Utilizes advanced NLP models like BERT or Word2Vec for similarity.

4. Scope
− Synonym Matching: Limited to individual word-level matching without broader context.
− Semantic Matching: Evaluates whole sentences and phrases to capture intent and meaning.

5. Application in ISTART
− Synonym Matching: Useful for varied vocabulary in student explanations without rigid terms.
− Semantic Matching: Assesses complete explanations to understand if student intent aligns with the target meaning.

6. Example
− Synonym Matching: Matches "joyful" for "happy" if the meanings align closely.
− Semantic Matching: Recognizes "Nature needs balance" as similar to "ecosystem relies on balance."

7. Advantages
− Synonym Matching: Increases accuracy by accepting vocabulary variations in responses.
− Semantic Matching: Provides deeper insight by understanding student intent and overall context.

8. Limitations
− Synonym Matching: Limited to surface-level meanings and may miss contextual depth.
− Semantic Matching: Resource-intensive and may yield false positives if phrases are semantically similar but contextually inaccurate.

9. Reliance on Vocabulary
− Synonym Matching: Heavily dependent on specific vocabulary.
− Semantic Matching: Relies less on specific vocabulary and more on conceptual relevance.

10. Implementation Complexity
− Synonym Matching: Low computational needs; suitable for simple applications.
− Semantic Matching: High computational needs, involving complex machine learning models.

11. Suitability for Basic Concepts
− Synonym Matching: Effective for assessing straightforward, consistent concepts.
− Semantic Matching: Ideal for abstract or complex ideas needing interpretive flexibility.

12. Error Types
− Synonym Matching: Prone to false negatives if synonyms are absent from the database.
− Semantic Matching: Risk of false positives.

13. NLP Dependency
− Synonym Matching: Relies on basic NLP resources like lexical databases.
− Semantic Matching: Uses sophisticated models (e.g., transformers) for semantic analysis.
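
To make the synonym-matching side concrete, here is a minimal sketch using NLTK's WordNet interface; synonym_match is a hypothetical helper written for this example, and it assumes the WordNet corpus has been downloaded (nltk.download('wordnet')):

```python
from nltk.corpus import wordnet

def synonyms(word):
    # Collect lemma names from all WordNet synsets of the word.
    return {lemma.name().replace("_", " ")
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()}

def synonym_match(word_a, word_b):
    # Word-level matching: true if either word appears among the other's synonyms.
    return word_b in synonyms(word_a) or word_a in synonyms(word_b)

print(synonym_match("car", "automobile"))  # expected True: they share a WordNet synset
print(synonym_match("car", "balance"))     # expected False
```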
Q2) Exact matching and Fuzzy matching. (IAT)
Ans. The differences include:
1. Definition
− Exact Matching: Compares text for identical matches in words or phrases, without any tolerance for variation.
− Fuzzy Matching: Allows approximate matching, accounting for minor variations or errors in text.

2. Purpose
− Exact Matching: Identifies precise matches, ensuring consistency and clarity in responses.
− Fuzzy Matching: Identifies similar but not identical matches to capture significant understanding.

3. Output
− Exact Matching: Binary results (match or no match).
− Fuzzy Matching: Similarity score (0 to 100%) indicating the degree of match.

4. Context Awareness
− Exact Matching: Limited context awareness; only matches identical text patterns.
− Fuzzy Matching: Considers context by allowing partial or similar matches based on word similarity.

5. Complexity
− Exact Matching: Relatively simple to implement, with lower computational requirements.
− Fuzzy Matching: More complex, involving algorithms to calculate similarity or edit distance.

6. Efficiency
− Exact Matching: Fast and computationally efficient; suitable for straightforward matching.
− Fuzzy Matching: Slower due to the additional processing required for calculating similarity scores.

7. Use in ISTART
− Exact Matching: Useful for detecting exact keywords or phrases in self-explanations.
− Fuzzy Matching: Enables flexible matching to account for slight variations in wording.

8. Effect on Meaning
− Exact Matching: Strictly matches the text, which may overlook semantic meaning.
− Fuzzy Matching: Considers semantic similarity, which helps in assessing meaning even with minor differences in wording.

9. Examples
− Exact Matching: "Climate change" matches exactly only with "Climate change."
− Fuzzy Matching: "Climate change" may match "changing climate" or "climate variation" with a high similarity score.

10. Applications in NLP
− Exact Matching: Commonly used in systems that require precise matching, such as data retrieval or code syntax.
− Fuzzy Matching: Used in search engines, spelling correction, and language understanding applications.

11. Limitations
− Exact Matching: Misses partial or near matches, limiting flexibility in capturing intent.
− Fuzzy Matching: Higher computational load and may yield false positives.

12. Algorithms Used
− Exact Matching: Hashing, dictionary lookups, or basic string comparison techniques.
− Fuzzy Matching: Euclidean distance, Jaccard similarity, cosine similarity, or Soundex.

13. Accuracy
− Exact Matching: Highly accurate for identical matches, but lacks flexibility.
− Fuzzy Matching: Flexible, balancing accuracy and tolerance; suited for varying word forms.
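
A minimal sketch of the contrast using only the Python standard library; difflib's SequenceMatcher stands in for a fuzzy similarity score (production systems typically use edit distance or dedicated fuzzy-matching libraries):

```python
from difflib import SequenceMatcher

def exact_match(a, b):
    # Binary result: identical strings (ignoring case and surrounding spaces) or not.
    return a.strip().lower() == b.strip().lower()

def fuzzy_score(a, b):
    # Similarity ratio between 0.0 and 1.0 based on matching subsequences.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(exact_match("climate change", "climate change"))    # True
print(exact_match("climate change", "changing climate"))  # False
print(round(fuzzy_score("climate change", "climate variation"), 2))  # a fairly high score, roughly 0.6-0.7
```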

Q3) Key concepts of latent semantic analysis (LSA). (IAT)


Ans. Latent Semantic Analysis (LSA) is a natural language processing technique that identifies
relationships between words and documents by reducing dimensionality.

Key Concepts include:

1. Dimensionality Reduction: LSA reduces high-dimensional word space into a smaller set
of latent concepts using Singular Value Decomposition (SVD), capturing underlying
semantic structures.
2. Term-Document Matrix: Represents words as rows and documents (self-explanations)
as columns, with values indicating term frequency or importance.
3. Latent Semantic Space: Maps words and explanations into a lower-dimensional space
where similar meanings are closer.
4. Semantic Similarity: Measures how closely a student’s self-explanation aligns with the
ideal response by comparing vector representations in latent space.
5. Conceptual Understanding: It helps uncover hidden semantic structures, enabling
evaluations based on conceptual meaning rather than surface-level matching.
6. SVD Decomposition: Breaks the term-document matrix into three matrices (U, Σ, and
V), retaining only the most significant features to reduce noise.
7. Handling Synonymy: Captures relationships between synonyms (e.g., "car" and
"automobile") by mapping them to the same latent concept.
8. Noise Reduction: Removes irrelevant or redundant data, ensuring semantic analysis
focuses only on meaningful patterns in the text.
9. Similarity Scoring: Calculates cosine similarity between vectors of student and expected
explanations, providing a quantifiable match score.
10. Scalability: Efficiently evaluates a large number of responses by summarizing text into a
compact representation.
11. Data-Driven Approach: Relies on mathematical techniques, ensuring objective
evaluations without requiring manually crafted semantic rules.
12. Knowledge Representation: Models conceptual understanding, making it effective for
assessing the depth of student explanations in ISTART.
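
A small sketch of the LSA pipeline with scikit-learn, assuming TF-IDF weighting and two latent dimensions; the student explanations and the ideal answer are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

explanations = [
    "the ecosystem relies on balance between species",
    "nature needs balance to stay healthy",
    "cars and automobiles pollute the air",
]
ideal_answer = ["a balanced ecosystem keeps nature healthy"]

# Term-document matrix (here documents are rows, terms are columns).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(explanations + ideal_answer)

# SVD: project into a low-dimensional latent semantic space (2 latent concepts here).
svd = TruncatedSVD(n_components=2, random_state=0)
latent = svd.fit_transform(X)

# Cosine similarity between each student explanation and the ideal answer in latent space.
scores = cosine_similarity(latent[:-1], latent[-1:])
print(scores.ravel())  # higher scores indicate closer conceptual alignment
```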

Q4) Applications and Advantages of LSA. (IAT)

Ans. Applications include :

1. Textual Similarity Measurement: Identifies how closely students' responses match


ideal answers, enabling accurate evaluation of comprehension.
2. Concept Discovery: Detects underlying themes or key concepts in self-explanations,
offering insights into students' understanding.
3. Vocabulary Generalization: Matches different word choices with similar meanings,
reducing the need for exact phrasing in responses.
4. Feedback Generation: Provides automated, personalized feedback by comparing student
explanations to expected semantic patterns.
5. Performance Analysis: Assesses learning progress over time by analyzing the
conceptual quality of successive self-explanations.
6. Response Clustering: Groups similar explanations, helping instructors identify common
strengths or misconceptions across multiple students.
7. Automated Scoring: Enables large-scale evaluation of self-explanations, reducing
manual grading workload while maintaining objectivity.
8. Question Categorization: Groups similar questions based on semantic patterns to
optimize learning modules in ISTART.
Advantages include :
1. Semantic Flexibility: Captures conceptual similarities even with varied vocabulary,
improving evaluation accuracy beyond keyword matching.
2. Deeper Understanding: Analyzes relationships between terms, uncovering hidden
patterns of meaning in student responses.
3. Automated Efficiency: Reduces grading time with scalable, automated analysis for large
datasets of student explanations.
4. Noise Elimination: Filters out irrelevant terms or redundant information, focusing on
meaningful contributions to self-explanations.
5. Objective Evaluation: Provides unbiased scoring based on mathematical modeling
rather than subjective human judgment.
6. Handling Synonymy: Matches responses with synonymous terms to ensure fair
evaluation of different expression styles.
7. Improved Feedback: Generates detailed insights into conceptual gaps, enabling targeted
feedback to enhance learning.
8. Cost-Effective: Reduces the need for extensive manual intervention, making it a practical
solution for large-scale educational environments.
Disadvantages include :
− Lacks Context Sensitivity: Fails to fully understand word meanings in complex
contexts, leading to potential misinterpretations.
− Ignores Word Order: Treats words as independent, overlooking grammatical structures
or sentence flow in self-explanations.
− Computational Complexity: Requires significant computational resources for singular
value decomposition, especially with large datasets.
− Static Semantic Space: Relies on fixed training data, struggling to adapt to new
vocabulary.
Q5) Difference between tokenization, stemming, and lemmatization.
Ans. The differences include :

1. Definition
− Tokenization: Splits text into smaller units (tokens) like words or sentences.
− Stemming: Reduces words to their root form by removing suffixes.
− Lemmatization: Converts words to their base form (lemma).

2. Purpose
− Tokenization: Helps in separating meaningful units for processing.
− Stemming: Simplifies words to common roots for efficient comparison.
− Lemmatization: Provides meaningful base forms, aiding accurate analysis of word meaning.

3. Output
− Tokenization: Individual tokens like words or punctuation.
− Stemming: Truncated root forms (e.g., "running" to "run").
− Lemmatization: Grammatically accurate base forms (e.g., "better" to "good").

4. Context Awareness
− Tokenization: Does not consider word meaning or grammatical context.
− Stemming: Lacks context awareness; may produce incomplete or incorrect roots.
− Lemmatization: Considers linguistic context, producing semantically accurate lemmas.

5. Complexity
− Tokenization: Simple and direct splitting process.
− Stemming: Moderate complexity with basic rule-based approaches.
− Lemmatization: Higher complexity, using dictionaries and language rules.

6. Efficiency
− Tokenization: Very fast and computationally light.
− Stemming: Faster than lemmatization but less accurate.
− Lemmatization: Slower due to context-based analysis, but highly accurate.

7. Effect on Meaning
− Tokenization: Does not alter word meaning.
− Stemming: Can distort meaning due to oversimplification (e.g., "better" to "bet").
− Lemmatization: Preserves original meaning by producing accurate root forms.

8. Examples
− Tokenization: "She is running fast." → "She", "is", "running", "fast".
− Stemming: "running" → "run", "better" → "bet".
− Lemmatization: "running" → "run", "better" → "good".

9. Applications in NLP
− Tokenization: Commonly used in text preprocessing and word frequency analysis.
− Stemming: Used in quick text matching and retrieval systems.
− Lemmatization: Essential for semantic analysis and natural language understanding.

10. Algorithms Used
− Tokenization: Simple split functions or regular expressions.
− Stemming: Porter, Snowball, or Lovins stemmer.
− Lemmatization: WordNet or similar lexical databases.

11. Accuracy
− Tokenization: Highly accurate at breaking up text, but carries no semantic meaning.
− Stemming: Moderately accurate but lacks context.
− Lemmatization: Highly accurate with semantic relevance.

12. Suitability for ISTART
− Tokenization: Essential for basic text processing.
− Stemming: Useful for reducing word-form variation.
− Lemmatization: Ideal for deeper understanding of self-explanations.
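
The three operations side by side in NLTK (a sketch; it assumes the punkt and wordnet resources have already been downloaded with nltk.download):

```python
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = "She is running fast"

# Tokenization: split the sentence into word tokens.
tokens = word_tokenize(sentence)          # ['She', 'is', 'running', 'fast']

# Stemming: rule-based suffix stripping with no grammatical context.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # e.g. 'running' -> 'run'

# Lemmatization: dictionary-based base forms; POS hints improve accuracy.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
```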

MODULE 3
Q1) Extracting Relations from Text in NLP (definition, concept, challenges). (IAT)
Ans. Relation Extraction (RE) :
− Identifies relationships between entities in text (e.g., "John works at Google" identifies a
works-at relation).
− Aims to structure unstructured text into knowledge graphs or databases.
Entities :
− Core elements in the text, such as names, organizations, or locations (e.g., John, Google).
− Extraction requires identifying these entities for relation identification.

Types of Relations :

− Binary Relations: Two entities connected by a single relation (e.g., "Person works-at
Organization").
− N-ary Relations: Multiple entities connected by a relation (e.g., "Person treated for
Disease by Doctor").
Applications : Used in knowledge graph construction, question-answering systems, and
automated data enrichment tasks.

Challenges :

1. Language Ambiguity - Natural ambiguity causes multiple meanings, complicating


accurate extraction and interpretation.
2. Expression Variability -Different ways of expressing ideas challenge models in
generalizing extraction tasks.
3. Context Dependency - Relevant information often requires context, which models may
struggle to capture accurately.
4. Data Scarcity - Limited labeled data hinders model performance in effective information
extraction.
5. Noisy Data Handling - Typos, errors, and irrelevant content hide key information,
complicating extraction.
6. Domain Jargon - Specialized language in domains reduces model accuracy in extraction.
7. Complex Sentence Parsing - Complex structures with nested clauses can confuse
models, hindering parsing.
8. Language Diversity - Syntax and grammar variations across languages complicate
multi-lingual extraction.
9. Sentiment Detection - Sarcasm and figurative expressions make accurate sentiment
extraction challenging.
10. Scalability - Growing data volumes demand scalable models that maintain accuracy and
speed.
11. Training Bias - Biased data can lead to biased extraction, affecting fairness in outcomes.
12. System Integration - Integrating extraction systems with existing platforms complicates
data flow.

Q2) Difference between Rule-Based Methods and Supervised Learning for extracting relations
from text in NLP. (IAT)

Ans. The differences include :


1. Definition
− Rule-Based Methods: Use predefined linguistic rules for relation extraction.
− Supervised Learning Methods: Learn relations from labeled data using machine learning.

2. Dependency
− Rule-Based Methods: Rely on handcrafted rules and linguistic knowledge.
− Supervised Learning Methods: Require annotated datasets for training.

3. Flexibility
− Rule-Based Methods: Limited to specific patterns and languages.
− Supervised Learning Methods: Adaptable to various domains and languages.

4. Scalability
− Rule-Based Methods: Difficult to scale due to manual rule creation.
− Supervised Learning Methods: Scalable with sufficient training data.

5. Accuracy
− Rule-Based Methods: Highly accurate for well-defined patterns.
− Supervised Learning Methods: Accuracy depends on data quality and model training.

6. Complexity Handling
− Rule-Based Methods: Struggle with ambiguous or complex sentences.
− Supervised Learning Methods: Handle ambiguity using features like embeddings and attention.

7. Resource Requirement
− Rule-Based Methods: Require linguistic expertise and time for rule design.
− Supervised Learning Methods: Need computational resources for training models.

8. Generalization
− Rule-Based Methods: Poor at generalizing to unseen data.
− Supervised Learning Methods: Can generalize well with diverse training data.

9. Adaptation
− Rule-Based Methods: Time-intensive to adapt to new domains.
− Supervised Learning Methods: Easily adapted by retraining on new datasets.

10. Error Propagation
− Rule-Based Methods: Rules propagate errors if incorrectly designed.
− Supervised Learning Methods: Errors depend on model overfitting or dataset bias.

11. Data Requirement
− Rule-Based Methods: Work with minimal data but depend on rule quality.
− Supervised Learning Methods: Require large amounts of labeled data.

12. Language Dependence
− Rule-Based Methods: Highly dependent on language syntax and structure.
− Supervised Learning Methods: Can work across languages with appropriate data.

13. Maintenance
− Rule-Based Methods: Rules need frequent updates as data patterns change.
− Supervised Learning Methods: Models are updated through retraining.

Q3) Workflow Example: Extracting Relations from Text. (IAT)

Ans. Objective: Extract the relationship works-at from the sentence: "Alice works at Google."

1. Input Text - Start with raw text data, e.g., "Alice works at Google."
2. Preprocessing - Clean the text by removing punctuation, lowercasing (if required), and
tokenizing into individual words or phrases: ["Alice", "works", "at", "Google"]
3. Named Entity Recognition (NER) - Identify named entities in the text, such as people,
organizations, or locations.
Example: Alice is labeled as a Person, and Google as an Organization.
4. Dependency Parsing - Analyze grammatical structure to find relationships between
words (e.g., subject, object).
Example:
a. works → subject → Alice
b. works → object → Google
5. Feature Extraction - Extract features such as Part-of-Speech (POS) tags, dependency
paths, and surrounding context for the identified entities.
Example: POS of Alice = Noun, Google = Proper Noun, Dependency Path = works-at.
6. Relation Classification - Use a trained model (e.g., based on BERT or a rule-based
method) to determine the relationship between entities.
Output: works-at
7. Postprocessing - Validate and format the extracted relations for structured storage or
display.
Example: (Alice, works-at, Google)
8. Output - Store the relation in a database or knowledge graph, or display it for user
interaction.
Final Output: (Entity 1: Alice, Relation: works-at, Entity 2: Google)
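
A hedged spaCy sketch of steps 2-7 for this sentence; it assumes the en_core_web_sm model is installed, and the final rule-based check is an illustrative stand-in for a trained relation classifier:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alice works at Google.")

# Named Entity Recognition: collect entities by label (PERSON, ORG, ...).
entities = {ent.label_: ent.text for ent in doc.ents}

# Dependency parsing: inspect each token's dependency label and syntactic head.
for token in doc:
    print(token.text, token.dep_, token.head.text)

# Naive relation classification: if a PERSON and an ORG occur around the verb
# "work", emit a works-at relation (a rule-based stand-in for a learned model).
if "PERSON" in entities and "ORG" in entities and any(t.lemma_ == "work" for t in doc):
    print((entities["PERSON"], "works-at", entities["ORG"]))
# Expected output: ('Alice', 'works-at', 'Google')
```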

Q4) Techniques and Approaches in extracting relations from text in NLP.


Ans. The techniques include :
1. Rule-Based Methods: Use handcrafted linguistic rules to identify and extract predefined
relationships between entities in text.
2. Pattern Matching: Employ regular expressions or dependency patterns to detect and
extract specific relationships in structured sentences.
3. Supervised Learning: Train models on labeled datasets to predict relations using
features like embeddings and dependency paths.
4. Unsupervised Learning: Identify relations by clustering similar patterns or contexts
without relying on annotated data.
5. Semi-Supervised Learning: Combine small labeled datasets with large unlabeled
corpora to infer relations using bootstrapping techniques.
6. Dependency Parsing: Analyze grammatical dependencies between words to detect
syntactic relationships relevant for relation extraction.
7. Named Entity Recognition (NER): Identify entities like names, organizations, and
locations as a precursor to extracting their relationships.
8. Attention Mechanisms: Highlight relevant parts of text using attention layers to focus on
relationship-relevant words or phrases.
9. Contextual Embeddings: Use models like BERT or GPT to capture semantic context
and improve relation extraction accuracy.
10. Hybrid Models: Combine multiple techniques, such as rule-based and deep learning, to
enhance the accuracy of relation extraction.
11. Relation Clustering: Group similar relations into clusters to identify new, latent
relationships in text.

MODULE 2
Q1) Regular Expressions in Natural Language Processing (NLP). (IAT)
Ans. A Regular Expression (Regex) is a sequence of characters defining a search pattern,
primarily used for text processing tasks like searching, matching, and manipulating strings in
NLP.

Formulating Regular Expressions

1. Basic Characters and Symbols:


i. a, b, c, etc.: Matches the exact character.
ii. .(dot): Matches any single character except newline.
iii. [abc]: Matches any character a, b, or c.
iv. [^abc]: Matches any character except a, b, or c.
2. Quantifiers:
i. *: Matches 0 or more occurrences (e.g., a* matches "", "a", "aa", etc.).
ii. +: Matches 1 or more occurrences (e.g., a+ matches "a", "aa", etc.).
iii. ?: Matches 0 or 1 occurrence (e.g., a? matches "", "a").
iv. {n}: Matches exactly n occurrences.
v. {n,}: Matches n or more occurrences.
vi. {n,m}: Matches between n and m occurrences.
3. Anchors:
i. ^: Matches the start of a string.
ii. $: Matches the end of a string.
4. Special Sequences:
i. \d: Matches any digit (equivalent to [0-9]).
ii. \D: Matches any non-digit.
iii. \w: Matches any word character (letters, digits, or underscore).
iv. \W: Matches any non-word character.
v. \s: Matches whitespace.
vi. \S: Matches any non-whitespace.
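
A few of these constructs in Python's re module (the sample string is arbitrary):

```python
import re

text = "Order 42 shipped to alice@example.com on 2024-05-01."

# \d+ : one or more digits; findall returns every match.
print(re.findall(r"\d+", text))                          # ['42', '2024', '05', '01']

# Character classes, quantifiers, and a literal '-' combined: a simple date pattern.
print(re.search(r"\d{4}-\d{2}-\d{2}", text).group())     # '2024-05-01'

# \w plus '.' in a character class and a literal '@' for a rough e-mail match.
print(re.search(r"[\w.]+@[\w.]+", text).group())         # 'alice@example.com'

# ^ anchor with the case-insensitive modifier (?i).
print(bool(re.match(r"(?i)^order", text)))               # True
```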

Properties of Regular Expressions:

1. Conciseness - Enables compact representation of complex patterns.


2. Flexibility - Can handle diverse text-processing needs like validation, extraction, and
substitution.
3. Universality - Supported across multiple programming languages and NLP libraries like
Python (re module), Java, and Perl.
4. Non-deterministic - Regex can match multiple patterns for the same input depending on
the greedy/lazy matching behavior.
5. Deterministic Finite Automaton (DFA) Basis - Regular expressions correspond to
DFAs, ensuring efficient pattern matching.
6. Pattern Hierarchy -Regex can represent a hierarchy of patterns from simple to complex.
Example: (ab|cd)* matches repetitions of "ab" or "cd".
7. Case Sensitivity - Regex patterns are case-sensitive unless specified with modifiers like
(?i).
8. Error-Prone - Complex patterns can lead to hard-to-debug errors.
9. Efficient Matching - Optimized regex engines ensure fast matching even for large text
corpora.

Q2) Finite State Automata. (IAT)

Ans. Concept Meaning:


A Finite State Automaton (FSA) is a computational model used in NLP to recognize patterns,
parse text, and process strings in a sequential manner. It operates on an input string and
transitions between finite states based on input symbols, accepting or rejecting the string based
on its final state.

Formula :
− An FSA is represented mathematically as a 5-tuple: (Q, Σ, δ, q0, F)
Where:
1. Q: Set of finite states.
2. Σ: Input alphabet (set of symbols).
3. δ: Transition function (δ: Q × Σ → Q).
4. q0: Initial state (q0 ∈ Q).
5. F: Set of final/accepting states (F ⊆ Q).
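
This 5-tuple maps directly to code. Below is a minimal Python sketch of the DFA example listed further down (binary strings ending in "01"); the state names q0-q2 are arbitrary labels:

```python
# DFA recognizing binary strings that end in "01".
# Q = {q0, q1, q2}, Sigma = {'0', '1'}, start state = q0, F = {q2}
delta = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}
accepting = {"q2"}

def accepts(string):
    state = "q0"
    for symbol in string:
        state = delta[(state, symbol)]   # exactly one next state per input symbol
    return state in accepting

print(accepts("1101"))  # True  (ends in "01")
print(accepts("1100"))  # False
```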
Types of Finite State Automata with Examples
1. Deterministic Finite Automata (DFA) :
− Each input leads to one and only one state.
− Eg : Recognizing binary strings ending with "01".
2. Non-Deterministic Finite Automata (NDFA)
− Transitions may lead to multiple states for a given input.
− Example: Recognizing binary strings containing "101".
3. ε-NFA (Epsilon-NFA)
− Includes ε-transitions (move without consuming input).
− Example: Recognizing strings starting with "a" or "b".
4. Pushdown Automata (PDA)
− Uses a stack for memory, handles context-free languages.
− Example: Parsing nested parentheses.
5. Turing Machine
− Infinite tape for memory; models any computable function.
− Example: Complex string transformations.
6. Linear Bounded Automata (LBA)
− Turing Machine with bounded tape size.
− Example: Recognizing strings of equal 'a's and 'b's.
7. Probabilistic FSA
− Each transition is associated with a probability.
− Example: Speech recognition.
8. Weighted FSA
− Each transition has weights (cost or frequency).
− Example: Pathfinding in syntax trees.

9. Mealy Machine
− Outputs are determined by transitions.
− Example: Translating Morse code.
10. Moore Machine
− Outputs are associated with states.
− Example: Encoding strings into binary.
11. Two-Way FSA
− Reads input in both directions.
− Example: Palindrome detection.
12. Timed Automata
− Transitions depend on time constraints.
− Example: Real-time processing in chatbots.
13. Hybrid Automata
− Combines discrete and continuous transitions.
− Example: Speech-to-text systems with audio features.

Advantages – Simplicity, Efficiency (Operates in linear time), Versatility (supports tasks like
tokenization, spell-checking)

Disadvantages – Limited Expressiveness (cannot process context-sensitive languages), Scalability
issues, Inflexibility (cannot handle nested structures).

Applications in NLP – Tokenization, Spell Checking (verifying the validity of words), Named Entity
Recognition (NER) (identifying names of people, places, organizations, etc.), Information Retrieval (matching queries with documents).

Q3) Parsing in NLP.

Ans. A parser is a component of the Natural Language Processing pipeline that processes input
sentences to derive their grammatical structure. It converts unstructured or semi-structured data
into a structured format.

Block Diagram:

• Input sentence → apply grammar rules (context-free grammar) → parse tree (hierarchical representation of sentence structure)

Types of Parsing:
1. Top-Down Parsing
− Starts from the root of the parse tree and expands down to the leaves.
− Example: Recursive Descent Parsing.
− Use: Recognizes valid strings based on grammar.

2. Bottom-Up Parsing
− Starts from the input symbols (leaves) and builds up to the root.
− Use: Identifies structures as it reads the sentence.

3. Syntactic Parsing
− Focuses on grammar rules to construct the structure.
− Example: Context-Free Grammar (CFG).
− Use: Syntax tree generation for sentences.

4. Semantic Parsing
− Maps natural language to logical forms or meaning representations.
− Use: Question answering and dialogue systems.

5. Dependency Parsing
− Analyzes grammatical dependencies between words.
− Example: "She enjoys reading" → "enjoys" is the head, "she" and "reading" are
dependents.

6. Constituency Parsing
− Breaks sentences into constituents like phrases.
− Example: "The big cat" → Noun Phrase (NP).

7. Statistical Parsing
− Uses probabilistic methods to determine the most likely parse.
− Example: Probabilistic Context-Free Grammar (PCFG).
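
As a toy illustration of grammar-driven (top-down, constituency) parsing, here is a sketch using NLTK's CFG and recursive-descent parser; the grammar is hand-written and covers only this one sentence:

```python
import nltk

# A tiny context-free grammar (illustrative; real grammars are far larger).
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | N
    VP -> V NP
    Det -> 'the'
    N  -> 'cat' | 'mouse'
    V  -> 'chased'
""")

parser = nltk.RecursiveDescentParser(grammar)   # top-down parsing
sentence = "the cat chased the mouse".split()

for tree in parser.parse(sentence):
    print(tree)   # prints the parse tree, e.g. (S (NP (Det the) (N cat)) (VP ...))
```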
Advantages - Syntactic Analysis (Extracts sentence structure), Flexibility, Error Detection,
Supports AI Models (chatbots and translators).

Disadvantages – Complexity (computationally intensive), Grammar Dependency (requires
comprehensive grammar rules for accuracy), Domain Limitations (not suitable for non-standard
text formats).

Applications - Machine Translation, Speech Recognition, Text-to-Speech Systems, Question
Answering Systems, Information Extraction.
