NLP unit 1 notes
NLP is a versatile field with applications spanning multiple industries. Below is a detailed exploration of its uses
across different domains:
1. Healthcare
NLP is revolutionizing healthcare by improving efficiency, enhancing diagnosis, and making medical knowledge
more accessible.
Clinical Documentation: Automating the transcription of doctors' notes and patient reports into structured data for electronic health records (EHRs).
Medical Research: Extracting key insights from medical journals, clinical trial data, and research papers
to aid drug development.
Disease Diagnosis: Using patient symptoms and reports to identify potential diseases.
Patient Interaction: Chatbots for answering health-related questions or guiding patients.
Sentiment Analysis for Mental Health: Identifying signs of depression, anxiety, or suicidal tendencies
from text inputs or social media posts.
2. E-commerce and Retail
NLP enhances the customer experience, streamlines operations, and provides valuable insights in the retail
space.
Product Recommendations: Analyzing customer reviews and preferences to suggest products.
Chatbots for Customer Support: Automating responses to customer inquiries.
Sentiment Analysis: Understanding customer opinions on products or services through reviews.
Search Optimization: Enabling semantic search to show results based on the intent behind customer
queries.
Personalized Marketing: Crafting personalized messages and promotions using customer data.
3. Education
NLP transforms learning by making it more interactive and personalized.
E-Learning Platforms: Enhancing online learning by summarizing content and creating quizzes.
Essay Scoring: Automating the evaluation of student assignments and providing detailed feedback.
Language Learning: Providing pronunciation assistance, grammar correction, and vocabulary building.
Chatbots: Assisting students with administrative queries and academic resources.
Accessibility: Enabling learning for students with disabilities through text-to-speech or speech-to-text
systems.
4. Customer Service
NLP powers customer support systems that enhance user interaction and efficiency.
Chatbots: Resolving customer queries in real-time through text or voice-based assistants.
Email Automation: Sorting and responding to customer emails using intent analysis.
Sentiment Analysis: Identifying dissatisfied customers and prioritizing their issues.
Knowledge Management: Creating and managing FAQs or self-help resources based on customer
queries.
Challenges in NLP
1. Ambiguity in Language
Lexical Ambiguity: Words can have multiple meanings based on context (e.g., "bat" can refer to an animal or a sports item).
Syntactic Ambiguity: A sentence can be structured in different ways, leading to different interpretations (e.g., "I saw the man with the telescope" could mean the man had the telescope or I used the telescope).
Semantic Ambiguity: Even when syntax is clear, the meaning can remain ambiguous due to context (e.g., "bank" could mean a financial institution or a riverbank); a code sketch of disambiguating "bank" follows this list.
2. Context Understanding
Human language often relies on implicit context, background knowledge, and cultural understanding
that computers struggle to grasp.
For example, sarcasm, idioms, and metaphors require an understanding of context that is beyond literal interpretation (e.g., "Break a leg!" means good luck, not physical harm).
3. Data Limitations
Bias in Data: NLP models often reflect the biases present in their training data, which can lead to unfair
or discriminatory outcomes.
Low-Resource Languages: Most NLP research focuses on widely spoken languages like English, leaving
many languages with limited or no resources for NLP applications.
Domain-Specific Data: Models trained on general data may not perform well in specialized fields like
medicine or law without extensive retraining.
4. Computational Complexity
Training and deploying large NLP models like GPT or BERT require significant computational resources.
Real-time processing of large datasets or live language inputs can be computationally intensive, leading
to delays or cost constraints.
Conclusion
NLP's challenges stem from the intricacies of human language and the limitations of current AI technologies.
While these hurdles can slow progress, ongoing advancements in machine learning, data processing, and ethical AI are steadily addressing these issues. The future of NLP holds promise for more accurate, robust, and human-like language understanding systems.
Levels of Language Analysis in NLP
1. Syntax
Syntax focuses on the structure and rules of language, such as grammar and sentence formation. NLP tasks in
this domain aim to understand how words and phrases are arranged to create meaningful sentences.
Key NLP Tasks:
1. Part-of-Speech (POS) Tagging
o Identifying the grammatical category of words in a sentence (e.g., noun, verb, adjective); a code sketch follows this list.
o Example:
Sentence: "The dog barks loudly."
POS Tags: Determiner (The), Noun (dog), Verb (barks), Adverb (loudly).
2. Parsing
o Analyzing the grammatical structure of a sentence and identifying relationships between words.
o Types of Parsing:
Syntactic Parsing (Dependency Parsing): Identifies relationships between words (e.g.,
subject, object).
Constituency Parsing: Breaks a sentence into sub-phrases or constituents (e.g., noun
phrase, verb phrase).
3. Sentence Boundary Detection
o Identifying where sentences begin and end in unstructured text.
4. Word Segmentation
o Particularly important for languages without clear word boundaries (e.g., Chinese, Japanese).
5. Grammar Correction
o Detecting and correcting grammatical errors in text.
6. Morphological Analysis
o Breaking down words into their root forms and affixes to understand their structure (e.g., plural
forms, tenses).
o Example: "running" → root: run, suffix: -ing.
2. Semantics
Semantics deals with the meaning of words, phrases, and sentences. The focus here is on understanding what
is being communicated rather than how it is structured.
Key NLP Tasks:
1. Named Entity Recognition (NER)
o Identifying entities like names, dates, locations, and organizations in text (see the code sketch after this list).
o Example: "Apple Inc. was founded in California in 1976." → Entities: Apple Inc. (Organization), California (Location), 1976 (Date).
2. Word Sense Disambiguation (WSD)
o Determining the correct meaning of a word based on context.
o Example: "The bank of the river is beautiful." → "bank" refers to the side of a river, not a finan-
cial institution.
3. Semantic Role Labeling (SRL)
o Identifying the roles that words play in a sentence, such as subject, object, and action.
o Example: "John baked a cake for Mary."
Agent (Who?): John
Action: baked
Object (What?): a cake
Recipient (For whom?): Mary
4. Coreference Resolution
o Identifying when two or more expressions in text refer to the same entity.
o Example: "John said he would come." → "he" refers to John.
5. Textual Entailment
o Determining whether one sentence logically follows from another.
o Example:
Sentence 1: "All cats are animals, and Fluffy is a cat."
Sentence 2: "Fluffy is an animal." → Entailed.
6. Semantic Similarity
o Measuring how similar two pieces of text are in meaning.
o Example: "The car is red" and "The automobile is crimson" have similar meanings.
7. Knowledge Graph Construction
o Extracting structured knowledge from text and organizing it into a graph of entities and their relationships.
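As a concrete illustration of the semantics-level tasks above, here is a minimal NER sketch using spaCy. It assumes spaCy is installed and that the small English model has been fetched with `python -m spacy download en_core_web_sm`; the entity labels produced are model-dependent.

```python
# A minimal NER sketch with spaCy. Assumes the en_core_web_sm model has been
# downloaded separately (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded in California in 1976.")

# Each recognized span carries a label such as ORG, GPE (location), or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected (model-dependent): "Apple Inc. ORG", "California GPE", "1976 DATE"
```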
3. Pragmatics
Pragmatics focuses on how language is used in specific contexts, considering speaker intent, social norms, and
real-world knowledge. This level goes beyond literal meaning to include implied and contextual meanings.
Key NLP Tasks:
1. Sentiment Analysis
o Determining the emotional tone or sentiment expressed in a text (positive, negative, neutral); a code sketch follows this list.
o Example: "I love this product!" → Positive sentiment.
2. Dialogue Systems and Chatbots
o Enabling conversational agents to understand and respond appropriately in context.
o Example: In a customer support chatbot, responding to: "I’d like to return a product."
3. Sarcasm and Irony Detection
o Identifying sarcastic or ironic statements, which often require understanding tone and context.
o Example: "Great, another meeting!" → Sarcasm.
4. Discourse Analysis
o Understanding the flow of ideas and relationships between sentences or paragraphs in a document.
o Example: Identifying how a sentence contributes to the overall argument in a text.
5. Anaphora Resolution
o Resolving references to earlier parts of a text.
o Example: "Maria went to the park. She enjoyed it." → "She" refers to Maria, and "it" refers to
the park.
6. Speech Act Recognition
o Determining the intent behind a statement (e.g., question, request, command, suggestion).
o Example: "Could you open the window?" → Request, not a literal question.
7. Intent Detection
o Understanding what the user wants to achieve through their input in a system.
o Example: "Book me a flight to New York." → Intent: Flight booking.
8. Emotion Recognition
o Detecting emotions expressed in text (e.g., joy, sadness, anger).
o Example: "I can't believe I lost my keys!" → Emotion: Frustration.
9. Presupposition and Implicature Detection
o Identifying assumptions and implied meanings not explicitly stated.
o Example: "Have you stopped smoking?" → Presupposes the person used to smoke.
Information Retrieval Models
1. Boolean Model
Definition
The Boolean model represents documents and queries as sets of terms (keywords). It uses logical operators
(AND, OR, NOT) to match queries to documents.
Key Features
Simple and intuitive.
Based on exact matching: a document is either relevant or not, based on the query's logical expression.
Terms are treated as binary values (present/absent).
Example
Query: (car AND engine) OR (bike AND wheel)
Document 1: "The car engine is powerful."
o Matches the query because it contains "car" and "engine."
Document 2: "The bike has a wheel."
o Matches the query because it contains "bike" and "wheel."
Advantages
Simple to implement and understand.
Suitable for applications where precision is more important than recall (e.g., legal or patent searches).
Disadvantages
No ranking of results: all matching documents are equally relevant.
Doesn't consider term frequency or partial matching.
Requires users to create complex Boolean queries.
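To make the matching rule concrete, here is a minimal sketch of the Boolean model applied to the example above: each document is reduced to a binary set of terms, and the query (car AND engine) OR (bike AND wheel) becomes plain set logic. The third document is an illustrative non-match.

```python
# A minimal sketch of Boolean retrieval: documents as term sets, queries as
# set logic over those sets.
docs = {
    1: "The car engine is powerful.",
    2: "The bike has a wheel.",
    3: "The car is parked outside.",
}

# Binary representation: each term is simply present or absent.
term_sets = {doc_id: set(text.lower().rstrip(".").split()) for doc_id, text in docs.items()}

def matches(terms):
    """Evaluate (car AND engine) OR (bike AND wheel) over one term set."""
    return {"car", "engine"} <= terms or {"bike", "wheel"} <= terms

for doc_id, terms in sorted(term_sets.items()):
    print(f"Document {doc_id}: {'match' if matches(terms) else 'no match'}")
# Document 1: match, Document 2: match, Document 3: no match
```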
2. Probabilistic Model
Definition
The Probabilistic Model assumes that there is a probability of relevance for each document given a query. The
system ranks documents based on their likelihood of being relevant.
Key Features
Uses probabilities to estimate relevance.
Incorporates term weighting and query expansion.
Commonly used models: Binary Independence Model (BIM) and Bayesian Network-based Models.
Example
Query: "sports car"
Document 1: Contains "sports" and "car." Probability of relevance: 0.9.
Document 2: Contains only "car." Probability of relevance: 0.4.
Document 1 is ranked higher due to its higher probability.
Advantages
Provides ranked results with probabilistic relevance scores.
Incorporates feedback to improve ranking (e.g., relevance feedback).
Rests on a more formal theoretical foundation than the Vector Space Model.
Disadvantages
Requires accurate probability estimates, which can be difficult to obtain.
Assumes independence of terms, which is often unrealistic.
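The illustrative relevance probabilities above would in practice come from a ranking function. As a sketch of probabilistic ranking, here is a compact implementation of BM25, a widely used probabilistic retrieval function; the toy corpus and the parameter values k1 and b are illustrative defaults, not tuned settings.

```python
# A compact BM25 sketch: documents are scored by a probabilistically motivated
# combination of term frequency, document length, and inverse document
# frequency. Corpus and parameters are illustrative.
import math

docs = ["the sports car is fast", "the car is parked", "football is a sport"]
query = ["sports", "car"]
k1, b = 1.5, 0.75  # common default parameter values

tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N  # average document length

def idf(term):
    # Smoothed inverse document frequency as used by BM25.
    df = sum(1 for d in tokenized if term in d)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def score(doc_tokens):
    s = 0.0
    for term in query:
        tf = doc_tokens.count(term)
        s += idf(term) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return s

# Documents covering more of the query rank higher.
for text, toks in zip(docs, tokenized):
    print(round(score(toks), 3), "|", text)
```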
Comparison of NLP Models
1. Definition
o Rule-Based Model: Uses hand-crafted linguistic rules (grammar, lexicons) to analyze and generate text.
o Statistical Model: Relies on statistical patterns derived from large datasets.
o Information Retrieval Model: Focuses on retrieving relevant documents based on queries.
o Rule-Based Machine Translation Model: Employs linguistic rules to translate text between languages.
o Probabilistic Graphical Model: Represents relationships between variables using probabilistic graphs.
2. Approach
o Rule-Based Model: Deterministic; relies on human-defined logic.
o Statistical Model: Data-driven; uses probabilities and patterns in text.
o Information Retrieval Model: Query-document matching based on similarity or relevance metrics.
o Rule-Based Machine Translation Model: Uses bilingual dictionaries and syntactic rules for translation.
o Probabilistic Graphical Model: Models conditional dependencies between random variables.
3. Training Data
o Rule-Based Model: Minimal or none; uses predefined rules and linguistic resources.
o Statistical Model: Requires large corpora for training.
o Information Retrieval Model: Uses indexed document collections and user queries.
o Rule-Based Machine Translation Model: Requires bilingual dictionaries and grammar rules.
o Probabilistic Graphical Model: Requires data to estimate probabilities and conditional dependencies.
4. Key Techniques
o Rule-Based Model: Grammar rules, lexicons, finite-state automata.
o Statistical Model: n-grams, hidden Markov models (HMM), maximum likelihood estimation.
o Information Retrieval Model: Boolean models, vector space models, probabilistic models.
o Rule-Based Machine Translation Model: Parsing, lexical mapping, transfer rules, interlingua methods.
o Probabilistic Graphical Model: Bayesian networks, Markov random fields, conditional random fields.
5. Accuracy
o Rule-Based Model: High for well-defined, small-scale tasks.
o Statistical Model: Improves with the amount and quality of data.
o Information Retrieval Model: Depends on the model (Boolean, vector space, or probabilistic).
o Rule-Based Machine Translation Model: Good for constrained domains but struggles with complex sentences.
o Probabilistic Graphical Model: High if relationships between variables are correctly modeled.
6. Interpretability
o Rule-Based Model: Highly interpretable due to explicit rules.
o Statistical Model: Less interpretable; depends on statistical weights and probabilities.
o Information Retrieval Model: Moderate; relevance scoring is often explainable.
o Rule-Based Machine Translation Model: Interpretable due to explicit linguistic rules.
o Probabilistic Graphical Model: Moderate; requires understanding of probabilistic dependencies.
7. Scalability
o Rule-Based Model: Limited to specific domains and languages.
o Statistical Model: Scalable with sufficient data.
o Information Retrieval Model: Scales well with large document collections.
o Rule-Based Machine Translation Model: Limited scalability; requires extensive human effort for rules.
o Probabilistic Graphical Model: Scales well if computational resources are sufficient.
8. Advantages
o Rule-Based Model: Transparent and interpretable; effective for tasks like tokenization or grammar checking.
o Statistical Model: Adapts to patterns in large datasets; handles ambiguity and variation in language effectively.
o Information Retrieval Model: Easy to adapt to document retrieval tasks; supports ranking and relevance-based retrieval.
o Rule-Based Machine Translation Model: Clear rules that work well for specific language pairs; doesn't require large parallel corpora.
o Probabilistic Graphical Model: Captures dependencies and uncertainties effectively; useful in structured prediction tasks (e.g., POS tagging, parsing).
9. Disadvantages
o Rule-Based Model: Brittle; can't handle exceptions or ambiguities well.
o Statistical Model: Requires significant computational resources and data.
o Information Retrieval Model: Limited to retrieval; doesn't generate or analyze text.
o Rule-Based Machine Translation Model: Struggles with idioms, context, and non-literal translations.
o Probabilistic Graphical Model: Computationally intensive; requires probabilistic knowledge.
10. Applications
o Rule-Based Model: Tokenization, stemming, grammar correction.
o Statistical Model: Language modeling, POS tagging, machine translation.
o Information Retrieval Model: Search engines, document ranking, query matching.
o Rule-Based Machine Translation Model: Early machine translation systems (e.g., SYSTRAN).
o Probabilistic Graphical Model: Named entity recognition, dependency parsing, semantic role labeling.
11. Examples
o Rule-Based Model: Linguistic rule-based chatbots, grammar checkers (e.g., Grammarly).
o Statistical Model: n-gram models, HMM-based POS tagging, statistical machine translation (e.g., Google SMT).
o Information Retrieval Model: TF-IDF, cosine similarity, BM25.
o Rule-Based Machine Translation Model: SYSTRAN, Apertium.
o Probabilistic Graphical Model: CRF for POS tagging, Bayesian networks for speech recognition.
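The Examples entries above list TF-IDF with cosine similarity as a classic information-retrieval pipeline. Here is a minimal sketch of that pipeline using scikit-learn; the corpus and query are illustrative.

```python
# A minimal TF-IDF + cosine-similarity retrieval sketch with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The car engine is powerful.",
    "The bike has a wheel.",
    "Engines power sports cars.",
]

# Represent each document as a TF-IDF weighted term vector.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

# Rank documents by cosine similarity to the query in the same vector space.
query_vector = vectorizer.transform(["car engine"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
for text, s in sorted(zip(corpus, scores), key=lambda pair: -pair[1]):
    print(round(s, 3), "|", text)
```

Note that plain TF-IDF matches exact terms only: the third document ("Engines power sports cars.") scores zero here because "engines" and "cars" are different tokens from "engine" and "car", which is why stemming or lemmatization is often applied first.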
Key Insights
Rule-based models are interpretable but limited to well-defined, small-scale tasks.
Statistical models offer adaptability and scalability but require large amounts of data.
Information retrieval models excel in search and ranking tasks but don't analyze or generate text.
Rule-based machine translation models are outdated and largely replaced by data-driven models.
Probabilistic graphical models capture complex relationships but are resource-intensive.