NLP (KCS 072)

AKTU - UNIT 1
Challenges and Origins of Natural Language
Processing (NLP)
Ambiguity: Words and sentences can have multiple meanings depending on context.

Complexity: Language has diverse sentence structures and rules.

Context Sensitivity: Meaning changes with context, making it hard for machines to understand.

Word Sense Disambiguation: Words like "bank" can mean different things in different contexts.

Scarcity of Data: Annotated data for training models is limited and expensive.

Speech Variability: Accents, noise, and pronunciation differences complicate speech recognition.
ORIGINS
1950s: NLP began with rule-based methods, focusing on machine translation using predefined linguistic rules (e.g.,
Georgetown-IBM experiment).

1960s-1980s: Symbolic approaches, like Chomsky’s generative grammar, aimed to model language using grammar
rules.

1980s: Statistical models, such as n-grams and Hidden Markov Models (HMMs), emerged to analyze language based
on observed data and probabilities.

1990s: Machine learning techniques, including decision trees and maximum entropy models, were applied to NLP
tasks like POS tagging and named entity recognition.

2000s-2010s: Deep learning models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks improved the modeling of sequential data.

2018-Present: The introduction of transformers (e.g., BERT, GPT) revolutionized NLP, enabling context understanding
and leading to pre-trained language models for a wide range of tasks.
GRAMMAR-BASED MODEL vs STATISTICAL MODEL

Grammar-Based Model:
Rule-Based: These models follow predefined language rules (like grammar rules).
Syntax-Focused: They focus on the structure of sentences, ensuring grammatical correctness.
Precise: Grammar-based models are accurate when applied to specific tasks, such as sentence parsing.
Rigid: They struggle with language variations or ambiguity, like slang or idioms.
Manual Effort: Creating and updating grammar rules requires a lot of manual work.

Statistical Model:
Data-Driven: These models learn from large amounts of text data.
Probabilistic: They predict the next word based on probabilities from past data.
Flexible: Statistical models can handle different types of language, including slang or new terms.
Scalable: They work well with large datasets and real-world applications.
Handles Uncertainty: They manage cases with missing or unknown words using techniques like smoothing.
TOKENIZATION
Tokenization is the process of splitting text into smaller units called tokens. These
tokens are typically words, subwords, or characters, depending on the level of
granularity chosen for a given task. The goal of tokenization is to break down a sentence
or text into manageable pieces that can be processed by a machine learning model.

I love programming => I + love + programming


Hello, world! => Hello + , + world + !
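
As a minimal Python sketch of the idea (the helper names and the regular expression below are illustrative, not a standard library API), whitespace splitting and a simple regex show two common levels of tokenization:

import re

def whitespace_tokenize(text):
    # Simplest scheme: split on whitespace; punctuation stays attached to words.
    return text.split()

def regex_tokenize(text):
    # Finer scheme: runs of letters/digits and individual punctuation marks
    # become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokenize("I love programming"))  # ['I', 'love', 'programming']
print(regex_tokenize("Hello, world!"))            # ['Hello', ',', 'world', '!']

In practice, library tokenizers (e.g., NLTK's word_tokenize or the subword tokenizers used by transformer models) handle many more edge cases than this sketch.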
HOW IS IT USED IN NLP?
Text Preprocessing: Tokenization is one of the first steps in text preprocessing. It converts raw text
into a format that can be analyzed, helping to identify individual words or subwords for further
processing.
Word Representation: After tokenization, each token can be represented as a vector (e.g., through
word embeddings like Word2Vec or GloVe), enabling the model to understand the semantic
meaning of each word; a brief Word2Vec sketch follows this list.
Handling Punctuation: Tokenization helps separate words from punctuation, which is important
for tasks like sentiment analysis or named entity recognition (NER), where punctuation can
influence meaning.
Feature Extraction: For tasks such as text classification or language modeling, tokenization breaks
text into smaller chunks, making it easier to extract features like word frequency or word context.
Efficiency in Models: Tokenization reduces the complexity of text and allows for more efficient
computation in models by converting variable-length sentences into a fixed set of tokens that can
be processed faster.
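
Following up on the Word Representation point above, here is a minimal sketch using the gensim library (assumed to be installed); the toy corpus and hyperparameters are illustrative only:

# Train a tiny Word2Vec model on tokenized sentences (toy corpus).
from gensim.models import Word2Vec

corpus = [
    ["i", "love", "programming"],
    ["i", "love", "natural", "language", "processing"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=20)

print(model.wv["programming"].shape)   # (50,) - one vector per token
print(model.wv.most_similar("love"))   # tokens with the most similar vectors

A real model would be trained on a much larger corpus before the similarity scores become meaningful.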
N-GRAMS
An N-gram is a sequence of N consecutive words or characters from a text. The value of N
determines the length of the sequence:

Unigrams (1-gram): Single words (e.g., "cat").
Bigrams (2-gram): Pairs of consecutive words (e.g., "black cat").
Trigrams (3-gram): Triplets of consecutive words (e.g., "the black cat").

N-grams help capture the relationship between words based on their order in the text and are used
in tasks like language modeling and text generation.
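
A minimal Python sketch of N-gram extraction (the ngrams helper below is written from scratch for illustration):

# Extract all N-grams of length n from a list of tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "black", "cat", "sat"]
print(ngrams(tokens, 1))  # [('the',), ('black',), ('cat',), ('sat',)]
print(ngrams(tokens, 2))  # [('the', 'black'), ('black', 'cat'), ('cat', 'sat')]
print(ngrams(tokens, 3))  # [('the', 'black', 'cat'), ('black', 'cat', 'sat')]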

USAGE IN NLP!
1. Language Modeling: N-grams are used to predict the next word in a sentence by looking at the previous N-1 words.
2. Text Prediction: N-grams are used in autocomplete systems, where the next word is predicted based on previous
words.
3. Speech Recognition: N-grams help improve accuracy by predicting the next word based on the context of previous
words.
Language Modeling Techniques
Smoothing
Purpose: Prevents zero probabilities for unseen N-grams.
Methods:
Laplace Smoothing: Adds 1 to every N-gram count (and the vocabulary size to the denominator) so that no N-gram has zero probability.
Good-Turing Smoothing: Re-estimates probabilities from frequency-of-frequency counts; for example, the number of N-grams seen exactly once is used to estimate the probability mass of unseen N-grams.
Kneser-Ney Smoothing: Discounts higher-order counts and weights lower-order N-grams by the number of distinct contexts in which a word appears, rather than its raw frequency.

Interpolation:
Purpose: Combines probabilities from different N-gram models (e.g., unigram, bigram, trigram) to improve
predictions.
Example: A weighted average of bigram and unigram probabilities.

Backoff:
Purpose: If a higher-order N-gram is missing, the model "backs off" to a lower-order N-gram.
Example: If trigram data is unavailable, use bigram or unigram data for probability estimation.
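
A minimal Python sketch of add-one (Laplace) smoothing and a simplified backoff on a toy corpus; real models estimate counts from large corpora and use discounted backoff (e.g., Katz) or interpolation weights tuned on held-out data:

# Toy bigram model with Laplace smoothing and a simple unigram backoff.
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

def laplace_bigram_prob(prev, word):
    # Add-one smoothing: (count(prev, word) + 1) / (count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

def backoff_prob(prev, word):
    # If the bigram was never seen, "back off" to the unigram estimate.
    if bigram_counts[(prev, word)] > 0:
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    return unigram_counts[word] / len(tokens)

print(laplace_bigram_prob("the", "cat"))  # seen bigram, smoothed (2/7)
print(laplace_bigram_prob("the", "dog"))  # unseen bigram, still non-zero (1/7)
print(backoff_prob("cat", "the"))         # unseen bigram, falls back to P("the") = 2/6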
PART OF SPEECH TAGGING
Part-of-Speech (POS) tagging is the process of assigning a grammatical category (such as noun, verb, adjective)
to each word in a sentence. The goal is to identify the syntactic role of each word, which is essential for tasks
like syntactic parsing, machine translation, and information retrieval.
There are 3 Methods:
Rule-Based PoS Tagging
The Rule-Based Approach to POS tagging uses predefined rules based on word context to assign tags. For example, a
word following "the" is likely a noun, as in "the dog." While accurate for common cases, it requires manual effort and
struggles with ambiguous words.
Stochastic (Statistical) Approach
The Stochastic (Statistical) Approach uses probabilities based on data. It learns from large tagged datasets to predict
tags. For example, if "dog" often follows "the," it’s tagged as a noun. This method is more flexible and handles new words,
but requires a lot of data and is more complex.
Hybrid Approach
The Hybrid Approach combines both methods. It uses rule-based tagging for common or straightforward cases and relies
on stochastic models for more complex or ambiguous ones. This approach benefits from the strengths of both methods:
the accuracy of rules and the flexibility of statistical models.
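
As a quick illustration of the stochastic approach in practice, NLTK ships a pretrained statistical tagger (this assumes nltk is installed and its tokenizer/tagger data has been downloaded, e.g. via nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")):

import nltk

tokens = nltk.word_tokenize("The dog barks")
print(nltk.pos_tag(tokens))
# Tags along the lines of [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')];
# exact tags depend on the tagger's training data.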
ENGLISH MORPHOLOGY
Morphology is the study of how words are formed. In Natural Language Processing (NLP), it helps to break down
words into smaller units (called morphemes) and understand how they change in different contexts. This is
important for tasks like text analysis and understanding word meanings.
It helps break down complex words into simple parts.
It improves tasks like POS tagging, where we identify whether a word is a noun, verb, etc.
It helps normalize text, making it easier for algorithms to understand.

HOW NLP UNDERSTANDS AND PROCESSES MORPHOLOGY

Morphemes:
A morpheme is the smallest unit of meaning in a word.
Free morphemes: Words that can stand alone, like "book" or "cat."
Bound morphemes: Parts of words that can't stand alone, like "-ed" in "walked" or "un-" in "undo."
Inflectional Morphology:
This refers to changes in words to show things like tense (past, present),
number (singular, plural), or possession.
Example:
"cat" → "cats" (plural)
"run" → "ran" (past tense)
Derivational Morphology:
This involves adding prefixes or suffixes to words to create new words or change
their meaning.
Example:
"happy" → "happiness"
"teach" → "teacher"
Stemming:
The process of reducing a word to its base form by removing prefixes or suffixes.
Example:
"running", "runner" → "run"

Compounding:
Combining two words to create a new word.
Example:
"tooth" + "brush" = "toothbrush"
"sun" + "flower" = "sunflower"

Lemmatization:
Similar to stemming, but it converts a word to its base form based on its meaning,
using a dictionary.
Example:
"better" → "good"
"running" → "run"
HMM (Hidden Markov Model)
Hidden Markov Models (HMMs) are a type of probabilistic model used in Part-of-Speech (POS) tagging. In HMMs, the task is to assign a sequence of POS
tags to a sequence of words in a sentence, based on two key probabilities:
1. Transition Probability: The probability of one tag following another tag.
2. Emission Probability: The probability of a word being associated with a particular tag.

Steps in HMM-based POS Tagging:


Training:
-The HMM is trained using a labeled corpus (text where words are already tagged with their correct POS).
-The model learns:
Transition probabilities: e.g., P(Noun | Determiner) = probability that a noun follows a determiner.
Emission probabilities: e.g., P("dog" | Noun) = probability that the word "dog" is tagged as a noun.

Tagging (Decoding):
-For a given sentence, HMM uses the learned probabilities to predict the sequence of POS tags for the words.
-The Viterbi algorithm is used to find the most probable sequence of tags, given the observed words. The Viterbi algorithm computes the
best path through the tag sequence by considering both the transition and emission probabilities.

Prediction:
-Once trained, the model assigns tags to new sentences based on the word sequence and the learned probabilities.
-For example, in the sentence "The dog barks", the model predicts "The" as a Determiner, "dog" as a Noun, and "barks" as a Verb based on
the transition and emission probabilities.
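
A minimal Viterbi sketch for this example sentence; the tiny hand-set transition and emission probabilities below are illustrative assumptions, not values learned from a real corpus:

tags = ["DET", "NOUN", "VERB"]

start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}       # P(tag at sentence start)
trans_p = {                                            # P(next tag | current tag)
    "DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
    "VERB": {"DET": 0.4,  "NOUN": 0.4,  "VERB": 0.2},
}
emit_p = {                                             # P(word | tag)
    "DET":  {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.7, "barks": 0.2},
    "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.7},
}

def viterbi(words):
    # V[i][tag] = (best probability of any tag sequence ending in `tag`
    #              at position i, backpointer to the previous tag)
    V = [{t: (start_p[t] * emit_p[t][words[0]], None) for t in tags}]
    for i in range(1, len(words)):
        V.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][pt][0] * trans_p[pt][t] * emit_p[t][words[i]], pt)
                for pt in tags
            )
            V[i][t] = (prob, prev)
    # Trace back the most probable tag sequence from the best final tag.
    best = max(tags, key=lambda t: V[-1][t][0])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        best = V[i][best][1]
        path.append(best)
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))   # ['DET', 'NOUN', 'VERB']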
