NLP (KCS 072)

AKTU - UNIT 1
Challenges and Origins of Natural Language
Processing (NLP)
Ambiguity: Words and sentences can have multiple meanings depending on context.

Complexity: Language has diverse sentence structures and rules.

Context Sensitivity: Meaning changes with context, making it hard for machines to understand.

Word Sense Disambiguation: Words like "bank" can mean different things in different contexts.

Scarcity of Data: Annotated data for training models is limited and expensive.

Speech Variability: Accents, noise, and pronunciation differences complicate speech recognition.
ORIGINS
1950s: NLP began with rule-based methods, focusing on machine translation using predefined linguistic rules (e.g.,
Georgetown-IBM experiment).

1960s-1980s: Symbolic approaches, like Chomsky’s generative grammar, aimed to model language using grammar
rules.

1980s: Statistical models, such as n-grams and Hidden Markov Models (HMMs), emerged to analyze language based
on observed data and probabilities.

1990s: Machine learning techniques, including decision trees and maximum entropy models, were applied to NLP
tasks like POS tagging and named entity recognition.

2000s-2010s: Deep learning models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks improved the modeling of sequential data.

2018-Present: The introduction of transformers (e.g., BERT, GPT) revolutionized NLP, enabling context understanding
and leading to pre-trained language models for a wide range of tasks.
GRAMMAR-BASED MODEL vs STATISTICAL MODEL

Grammar-Based Model:
Rule-Based: These models follow predefined language rules (like grammar rules).
Syntax-Focused: They focus on the structure of sentences, ensuring grammatical correctness.
Precise: Grammar-based models are accurate when applied to specific tasks, such as sentence parsing.
Rigid: They struggle with language variations or ambiguity, like slang or idioms.
Manual Effort: Creating and updating grammar rules requires a lot of manual work.

Statistical Model:
Data-Driven: These models learn from large amounts of text data.
Probabilistic: They predict the next word based on probabilities from past data.
Flexible: Statistical models can handle different types of language, including slang or new terms.
Scalable: They work well with large datasets and real-world applications.
Handles Uncertainty: They manage cases with missing or unknown words using techniques like smoothing.
TOKENIZATION
Tokenization is the process of splitting text into smaller units called tokens. These
tokens are typically words, subwords, or characters, depending on the level of
granularity chosen for a given task. The goal of tokenization is to break down a sentence
or text into manageable pieces that can be processed by a machine learning model.

I love programming => I + love + programming


Hello, world! => Hello + , + world + !
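
As a minimal Python sketch of the idea (the helper names and the regular expression below are illustrative, not a standard library API), whitespace splitting and a simple regex show two common levels of tokenization:

import re

def whitespace_tokenize(text):
    # Simplest scheme: split on whitespace; punctuation stays attached to words.
    return text.split()

def regex_tokenize(text):
    # Finer scheme: runs of letters/digits and individual punctuation marks
    # become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokenize("I love programming"))  # ['I', 'love', 'programming']
print(regex_tokenize("Hello, world!"))            # ['Hello', ',', 'world', '!']

In practice, library tokenizers (e.g., NLTK's word_tokenize or the subword tokenizers used by transformer models) handle many more edge cases than this sketch.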
HOW IS IT USED IN NLP?
Text Preprocessing: Tokenization is one of the first steps in text preprocessing. It converts raw text
into a format that can be analyzed, helping to identify individual words or subwords for further
processing.
Word Representation: After tokenization, each token can be represented as a vector (e.g., through
word embeddings like Word2Vec or GloVe), enabling the model to understand the semantic
meaning of each word; a brief Word2Vec sketch follows this list.
Handling Punctuation: Tokenization helps separate words from punctuation, which is important
for tasks like sentiment analysis or named entity recognition (NER), where punctuation can
influence meaning.
Feature Extraction: For tasks such as text classification or language modeling, tokenization breaks
text into smaller chunks, making it easier to extract features like word frequency or word context.
Efficiency in Models: Tokenization reduces the complexity of text and allows for more efficient
computation in models by converting variable-length sentences into a fixed set of tokens that can
be processed faster.
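
Following up on the Word Representation point above, here is a minimal sketch using the gensim library (assumed to be installed); the toy corpus and hyperparameters are illustrative only:

# Train a tiny Word2Vec model on tokenized sentences (toy corpus).
from gensim.models import Word2Vec

corpus = [
    ["i", "love", "programming"],
    ["i", "love", "natural", "language", "processing"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=20)

print(model.wv["programming"].shape)   # (50,) - one vector per token
print(model.wv.most_similar("love"))   # tokens with the most similar vectors

A real model would be trained on a much larger corpus before the similarity scores become meaningful.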
N-GRAMS
An N-gram is a sequence of N consecutive words or characters from a text. The value of N
determines the length of the sequence:

Unigrams (1-gram): Single words (e.g., "cat").
Bigrams (2-gram): Pairs of consecutive words (e.g., "black cat").
Trigrams (3-gram): Triplets of consecutive words (e.g., "the black cat").

N-grams help capture the relationship between words based on their order in the text and are used
in tasks like language modeling and text generation.
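
A minimal Python sketch of N-gram extraction (the ngrams helper below is written from scratch for illustration):

# Extract all N-grams of length n from a list of tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "black", "cat", "sat"]
print(ngrams(tokens, 1))  # [('the',), ('black',), ('cat',), ('sat',)]
print(ngrams(tokens, 2))  # [('the', 'black'), ('black', 'cat'), ('cat', 'sat')]
print(ngrams(tokens, 3))  # [('the', 'black', 'cat'), ('black', 'cat', 'sat')]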

USAGE IN NLP!
1. Language Modeling: N-grams are used to predict the next word in a sentence by looking at the previous N-1 words.
2. Text Prediction: N-grams are used in autocomplete systems, where the next word is predicted based on previous
words.
3. Speech Recognition: N-grams help improve accuracy by predicting the next word based on the context of previous
words.
Language Modeling Techniques
Smoothing
Purpose: Prevents zero probabilities for unseen N-grams.
Methods:
Laplace Smoothing: Adds 1 to every N-gram count (and the vocabulary size to the denominator) so that no N-gram has zero probability.
Good-Turing Smoothing: Re-estimates probabilities from frequency-of-frequency counts; for example, the number of N-grams seen exactly once is used to estimate the probability mass of unseen N-grams.
Kneser-Ney Smoothing: Discounts higher-order counts and weights lower-order N-grams by the number of distinct contexts in which a word appears, rather than its raw frequency.

Interpolation:
Purpose: Combines probabilities from different N-gram models (e.g., unigram, bigram, trigram) to improve
predictions.
Example: A weighted average of bigram and unigram probabilities.

Backoff:
Purpose: If a higher-order N-gram is missing, the model "backs off" to a lower-order N-gram.
Example: If trigram data is unavailable, use bigram or unigram data for probability estimation.
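
A minimal Python sketch of add-one (Laplace) smoothing and a simplified backoff on a toy corpus; real models estimate counts from large corpora and use discounted backoff (e.g., Katz) or interpolation weights tuned on held-out data:

# Toy bigram model with Laplace smoothing and a simple unigram backoff.
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

def laplace_bigram_prob(prev, word):
    # Add-one smoothing: (count(prev, word) + 1) / (count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

def backoff_prob(prev, word):
    # If the bigram was never seen, "back off" to the unigram estimate.
    if bigram_counts[(prev, word)] > 0:
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    return unigram_counts[word] / len(tokens)

print(laplace_bigram_prob("the", "cat"))  # seen bigram, smoothed (2/7)
print(laplace_bigram_prob("the", "dog"))  # unseen bigram, still non-zero (1/7)
print(backoff_prob("cat", "the"))         # unseen bigram, falls back to P("the") = 2/6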
PART OF SPEECH TAGGING
Part-of-Speech (POS) tagging is the process of assigning a grammatical category (such as noun, verb, adjective)
to each word in a sentence. The goal is to identify the syntactic role of each word, which is essential for tasks
like syntactic parsing, machine translation, and information retrieval.
There are 3 Methods:
Rule-Based PoS Tagging
The Rule-Based Approach to POS tagging uses predefined rules based on word context to assign tags. For example, a
word following "the" is likely a noun, as in "the dog." While accurate for common cases, it requires manual effort and
struggles with ambiguous words.
Stochastic (Statistical) Approach
The Stochastic (Statistical) Approach uses probabilities based on data. It learns from large tagged datasets to predict
tags. For example, if "dog" often follows "the," it’s tagged as a noun. This method is more flexible and handles new words,
but requires a lot of data and is more complex.
Hybrid Approach
The Hybrid Approach combines both methods. It uses rule-based tagging for common or straightforward cases and relies
on stochastic models for more complex or ambiguous ones. This approach benefits from the strengths of both methods:
the accuracy of rules and the flexibility of statistical models.
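
As a quick illustration of the stochastic approach in practice, NLTK ships a pretrained statistical tagger (this assumes nltk is installed and its tokenizer/tagger data has been downloaded, e.g. via nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")):

import nltk

tokens = nltk.word_tokenize("The dog barks")
print(nltk.pos_tag(tokens))
# Tags along the lines of [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')];
# exact tags depend on the tagger's training data.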
ENGLISH MORPHOLOGY
Morphology is the study of how words are formed. In Natural Language Processing (NLP), it helps to break down
words into smaller units (called morphemes) and understand how they change in different contexts. This is
important for tasks like text analysis and understanding word meanings.
It helps break down complex words into simple parts.
It improves tasks like POS tagging, where we identify whether a word is a noun, verb, etc.
It helps normalize text, making it easier for algorithms to understand.

HOW NLP UNDERSTANDS AND PROCESSES MORPHOLOGY

Morphemes:
A morpheme is the smallest unit of meaning in a word.
Free morphemes: Words that can stand alone, like "book" or "cat."
Bound morphemes: Parts of words that can't stand alone, like "-ed" in "walked" or "un-" in "undo."
Inflectional Morphology:
This refers to changes in words to show things like tense (past, present),
number (singular, plural), or possession.
Example:
"cat" → "cats" (plural)
"run" → "ran" (past tense)
Derivational Morphology:
This involves adding prefixes or suffixes to words to create new words or change
their meaning.
Example:
"happy" → "happiness"
"teach" → "teacher"
Stemming:
The process of reducing a word to its base form by removing prefixes or suffixes.
Example:
"running", "runner" → "run"

Compounding:
Combining two words to create a new word.
Example:
"tooth" + "brush" = "toothbrush"
"sun" + "flower" = "sunflower"

Lemmatization:
Similar to stemming, but it converts a word to its base form based on its meaning,
using a dictionary.
Example:
"better" → "good"
"running" → "run"
HMM (Hidden Markov Model)
Hidden Markov Models (HMMs) are a type of probabilistic model used in Part-of-Speech (POS) tagging. In HMMs, the task is to assign a sequence of POS
tags to a sequence of words in a sentence, based on two key probabilities:
1. Transition Probability: The probability of one tag following another tag.
2. Emission Probability: The probability of a word being associated with a particular tag.

Steps in HMM-based POS Tagging:


Training:
-The HMM is trained using a labeled corpus (text where words are already tagged with their correct POS).
-The model learns:
Transition probabilities: e.g., P(Noun | Determiner) = probability that a noun follows a determiner.
Emission probabilities: e.g., P("dog" | Noun) = probability that the word "dog" is tagged as a noun.

Tagging (Decoding):
-For a given sentence, HMM uses the learned probabilities to predict the sequence of POS tags for the words.
-The Viterbi algorithm is used to find the most probable sequence of tags, given the observed words. The Viterbi algorithm computes the
best path through the tag sequence by considering both the transition and emission probabilities.

Prediction:
-Once trained, the model assigns tags to new sentences based on the word sequence and the learned probabilities.
-For example, in the sentence "The dog barks", the model predicts "The" as a Determiner, "dog" as a Noun, and "barks" as a Verb based on
the transition and emission probabilities.
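
A minimal Viterbi sketch for this example sentence; the tiny hand-set transition and emission probabilities below are illustrative assumptions, not values learned from a real corpus:

tags = ["DET", "NOUN", "VERB"]

start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}       # P(tag at sentence start)
trans_p = {                                            # P(next tag | current tag)
    "DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
    "VERB": {"DET": 0.4,  "NOUN": 0.4,  "VERB": 0.2},
}
emit_p = {                                             # P(word | tag)
    "DET":  {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.7, "barks": 0.2},
    "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.7},
}

def viterbi(words):
    # V[i][tag] = (best probability of any tag sequence ending in `tag`
    #              at position i, backpointer to the previous tag)
    V = [{t: (start_p[t] * emit_p[t][words[0]], None) for t in tags}]
    for i in range(1, len(words)):
        V.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][pt][0] * trans_p[pt][t] * emit_p[t][words[i]], pt)
                for pt in tags
            )
            V[i][t] = (prob, prev)
    # Trace back the most probable tag sequence from the best final tag.
    best = max(tags, key=lambda t: V[-1][t][0])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        best = V[i][best][1]
        path.append(best)
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))   # ['DET', 'NOUN', 'VERB']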
