Natural Language Processing
MODULE IV
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
Introduction
Natural Language Processing (NLP) leverages statistical methods to analyse and generate human
language. Two fundamental concepts in this domain are probability and information theory.
Understanding these concepts is crucial for developing effective NLP models and applications.
Probability theory is used to model the likelihood of various linguistic events. Here are some key
applications:
Language Modeling:
n-grams: These are contiguous sequences of n items from a given sample of text. For example, bigrams
(n=2) and trigrams (n=3) are used to predict the next word in a sentence by considering the context of
the previous words.
Introduction
Markov Chains: These models treat each word as a state and analyze the probability of transitioning
from one word to another. This is useful for tasks like text generation and speech recognition.
Text Classification:
Naive Bayes: This algorithm uses Bayes' theorem to classify text based on the probability of certain
words appearing in different categories. It is widely used in sentiment analysis and spam detection.
Conditional Probability:
This measures the probability of an event given that another event has occurred. It is fundamental in
tasks like part-of-speech tagging and named entity recognition, where the probability of a word being a
certain part of speech or entity depends on the context.
Introduction
Information theory provides a framework for quantifying the amount of information in data, which is
essential for various NLP tasks:
1. Entropy:
Entropy measures the uncertainty or unpredictability of a random variable. In NLP, it quantifies
the amount of information required to describe a dataset. Higher entropy indicates greater
unpredictability, while lower entropy indicates more predictability.
Cross-Entropy Loss: This measures the difference between two probability distributions and is
used to evaluate the performance of machine learning models by comparing the predicted
distribution with the true distribution.
Introduction
2. Mutual Information:
Mutual information quantifies the dependency between two random variables. In NLP, it is used for
feature selection, where features with higher mutual information scores are more informative for the
model.
3. Kullback-Leibler (KL) Divergence:
KL divergence measures the difference between two probability distributions. It is used to compare
the true distribution of data with the predicted distribution, helping in model evaluation and
regularization techniques like variational inference in Bayesian neural networks.
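As a concrete toy illustration of KL divergence, the sketch below compares two small hand-made distributions in Python; the numbers are assumptions chosen only for demonstration.
```python
import numpy as np

# Two made-up probability distributions over the same three outcomes.
p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

# KL divergence: D(p || q) = sum_x p(x) * log2( p(x) / q(x) )
kl_pq = np.sum(p * np.log2(p / q))
print(f"KL(p || q) = {kl_pq:.4f} bits")   # positive, because q differs from p
```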
Practical use of Information Theory in NLP
Feature Selection: Using mutual information to identify the most informative features for building
predictive models, thereby improving model performance by reducing dimensionality and removing
irrelevant features.
Decision Trees: Information gain, based on entropy, is used to split nodes in decision trees, leading to
more efficient and accurate models by reducing uncertainty about the target variable.
Regularization and Model Selection: KL divergence is used in regularization techniques to minimize
the difference between the approximate and true posterior distributions, achieving better model
regularization and performance.
Information Bottleneck: This method aims to find a compressed representation of the input data that
retains maximal information about the output, used in deep learning for learning efficient representations.
Probability Concepts
Probability and Information Theory are foundational to understanding and building Natural
Language Processing (NLP) systems. They provide the mathematical frameworks for modelling
uncertainty, quantifying information, and optimizing decision-making in language tasks.
Probability theory helps in modelling and predicting linguistic phenomena by managing
uncertainties inherent in natural language. Since human language is complex, ambiguous, and
context-dependent, probabilistic methods allow us to make inferences and predictions about
language data.
Information Theory quantifies the amount of information and helps in designing efficient
encoding, compression, and communication systems for text data. It is essential for
understanding language representation and optimizing models.
Applications of Probability in NLP
Language Modeling: Estimating the likelihood of a sequence of words (e.g., P(w1, w2, …, wn)).
Part-of-Speech Tagging: Computing the most probable sequence of tags given a sequence
of words using Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs).
Machine Translation: Determining the probability of a target language sentence given a
source language sentence (P(target | source)).
Speech Recognition: Identifying the most probable sequence of words corresponding to an
audio signal.
Key Concepts of Probability in NLP
Random Variables: Representing linguistic features (e.g., word frequencies).
Conditional Probability: Modeling dependencies (e.g., P(word | previous words)).
Bayes’ Theorem: Used in spam filtering and probabilistic reasoning
(P(A|B) = P(B|A) · P(A) / P(B)).
Basics of Probability Theory
Random Variables: A random variable is a variable whose value is unknown or a function that assigns values
to each of an experiment’s outcomes. Random variables are often designated by letters and can be classified
as discrete and continuous.
What is a discrete random variable?
Discrete random variables take on a countable number of distinct values. Consider an experiment where a coin is
tossed three times. If X represents the number of times that the coin comes up heads, then X is a discrete random
variable that can only have the values 0, 1, 2, or 3 (from no heads in three successive coin tosses to all heads). No
other value is possible for X.
What is a continuous random variable?
Continuous random variables can represent any value within a specified range or interval and can take on an
infinite number of possible values. An example of a continuous random variable would be an experiment that
involves measuring the amount of rainfall in a city over a year or the average height of a random group of 25
people.
Probability distributions
A probability distribution is a statistical function that describes all the possible values and
likelihoods that a random variable can take within a given range. This range will be bounded
between the minimum and maximum possible values. However, where a particular value is likely
to fall within the distribution depends on several factors, such as the distribution's mean, standard deviation, skewness, and kurtosis.
Types of probability distribution
Binomial
The binomial distribution evaluates the probability of an event occurring several times
over a given number of trials given the event’s probability in each trial. It may be
generated by keeping track of how many free throws a basketball player makes in a game,
where 1 = a basket and 0 = a miss.
Another example would be to use a coin and figure out the probability of that coin
coming up heads in 10 straight flips. A binomial distribution is discrete rather than
continuous because each trial has only two possible outcomes, so the number of successes can take only whole-number values.
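As an illustrative sketch (the numbers below are made up), the binomial probability of exactly k successes in n trials with per-trial success probability p can be computed directly:
```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# Probability of exactly 7 heads in 10 flips of a fair coin.
print(binomial_pmf(7, 10, 0.5))  # about 0.117
```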
Types of probability distribution
Normal
The most commonly used distribution is the normal distribution. This is used frequently in
finance, investing, science, and engineering. The normal distribution is fully characterized by
its mean and standard deviation. The distribution is not skewed, so it is symmetric, and it is
depicted as a bell-shaped curve when plotted. The standard normal distribution is defined by a
mean (average) of zero and a standard deviation of one, with a skew of zero and kurtosis = 3.
Types of probability distribution
Poisson distribution
The Poisson distribution is a discrete probability distribution that models the number of
events occurring within a fixed interval of time or space. These events must happen
independently of each other, and the average rate (mean number of occurrences) must be
constant.
The key characteristic of the Poisson distribution is that it describes the probability of a
given number of events happening within a specified interval when the events are rare and
independent.
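For illustration, the Poisson probability of exactly k events at average rate λ is λ^k · e^(−λ) / k!; the sketch below evaluates it for made-up values (a rare word averaging 0.5 occurrences per document).
```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events in an interval, given average rate lam)."""
    return (lam ** k) * exp(-lam) / factorial(k)

# Made-up example: a rare word occurs 0.5 times per document on average;
# probability of seeing it exactly twice in one document.
print(poisson_pmf(2, 0.5))  # about 0.076
```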
Types of probability distribution
Bernoulli distribution
A Bernoulli distribution is a discrete probability distribution that describes a random
experiment with only two possible outcomes: success (usually denoted by 1) or failure
(usually denoted by 0).
Gaussian distribution
A Gaussian distribution, also known as a normal distribution, is a type of continuous
probability distribution characterized by its bell-shaped curve. It's one of the most
commonly used probability distributions in statistics and many other fields.
What is Entropy?
Entropy is a measure of uncertainty or randomness associated with a random variable. In the context of NLP, it
quantifies the amount of information contained in a message or text. It is calculated as:
H(X) = - Σ P(x) * log₂ P(x)
Where:
H(X): Entropy of the random variable X
P(x): Probability of the event x
Cross entropy : Cross-entropy measures the difference between two probability distributions. In NLP, it's used to
evaluate the performance of language models. It is calculated as H(p, q) = - Σ p(x) * log₂ q(x)
Where,
p(x): True probability distribution
q(x): Predicted probability distribution
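A minimal sketch of both formulas in Python, using small made-up distributions (the values of p and q are illustrative assumptions, not taken from any corpus):
```python
import numpy as np

def entropy(p):
    """H(X) = -sum P(x) * log2 P(x); zero-probability events contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) * log2 q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p_true = [0.7, 0.2, 0.1]    # made-up "true" distribution
q_model = [0.5, 0.3, 0.2]   # made-up model prediction

print(entropy(p_true))                 # about 1.16 bits
print(cross_entropy(p_true, q_model))  # always >= entropy(p_true)
```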
Bayes’ Theorem
Bayes' Theorem is a fundamental rule of probability that allows us to update our beliefs about
the probability of an event based on new evidence. In the context of NLP, it's used to calculate
the probability of a particular class or category given a piece of text.
The formula is: P(A|B) = P(B|A) * P(A) / P(B)
Where:
P(A|B): Probability of event A given event B
P(B|A): Probability of event B given event A
P(A): Prior probability of event A
P(B): Prior probability of event B
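A small numerical sketch of the formula applied to spam filtering; all probabilities here are made-up values purely for illustration.
```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Made-up numbers: "free" appears in 60% of spam emails, 20% of all mail is spam,
# and "free" appears in 15% of all mail.
p_spam_given_free = bayes(p_b_given_a=0.6, p_a=0.2, p_b=0.15)
print(p_spam_given_free)  # 0.8
```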
Chain rule of conditional probability
The chain rule in probability is a fundamental concept that helps us calculate the probability of
a sequence of events. Let's break it down from the basics:
Conditional probability is the probability of an event occurring given that another event has
already occurred. It's denoted as P(A|B), which reads “the probability of A given B”.
The chain rule formula states that the probability of a sequence of events can be calculated by
multiplying the conditional probabilities of each event. Mathematically, it's represented as:
P(A ∩ B ∩ C) = P(A) × P(B|A) × P(C|A ∩ B)
Here:
- P(A ∩ B ∩ C) is the probability of the sequence of events A, B, and C occurring.
- P(A) is the probability of event A occurring.
- P(B|A) is the conditional probability of event B occurring given that event A has occurred.
- P(C|A ∩ B) is the conditional probability of event C occurring given that events A and B have occurred.
Chain rule of conditional probability- Example
Suppose we want to calculate the joint probability that a person is a smoker (S), has a family
history of smoking (F), and is over 30 years old (O).
We can use the chain rule formula:
P(S ∩ F ∩ O) = P(O) × P(F|O) × P(S|F ∩ O)
Assuming we have the following probabilities:
- P(O) = 0.6 (probability of being over 30 years old)
- P(F|O) = 0.4 (probability of having a family history of smoking given that you're over 30 years old)
- P(S|F ∩ O) = 0.7 (probability of being a smoker given that you have a family history of smoking and are over 30 years old)
Using the chain rule formula, we get:
P(S ∩ F ∩ O) = 0.6 × 0.4 × 0.7 = 0.168
Therefore, the probability of a person being a smoker, having a family history of smoking, and being over 30 years
old is approximately 16.8%.
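The same worked example as a short snippet (using the made-up probabilities from above):
```python
p_O = 0.6           # P(over 30)
p_F_given_O = 0.4   # P(family history | over 30)
p_S_given_FO = 0.7  # P(smoker | family history and over 30)

# Chain rule: P(S ∩ F ∩ O) = P(O) * P(F|O) * P(S|F ∩ O)
p_joint = p_O * p_F_given_O * p_S_given_FO
print(p_joint)  # 0.168
```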
Applications of Bayes’ Theorem in NLP
Text Classification: Calculating the probability that a document belongs to a specific class
given its features.
Information Retrieval: Ranking documents based on their relevance to a query.
Machine Translation: Selecting the most probable translation for a given word or phrase.
Example: Suppose we want to classify an email as spam or not spam. We can use Bayes'
Theorem to calculate the probability that an email is spam given the presence of certain
keywords (e.g., "free," "urgent," "money").
Practical applications of Probability in NLP
Language Modeling: Language modeling, or LM, is the use of various statistical and
probabilistic techniques to determine the probability of a given sequence of words
occurring in a sentence. Language models analyze bodies of text data to provide a basis
for their word predictions.
Example: N-gram Models:
Simple models that predict the next word based on the previous N words.
Limited by data sparsity and the curse of dimensionality.
Practical applications of Probability in NLP
Text classification: Text classification analyzes the content and meaning of text to assign it
a label. It can be used to organize and structure large amounts of unstructured text, such as
legal documents, contracts, research documents, and email.
Example: Naive Bayes Classifier:
Assumes feature independence to simplify calculations.
Effective for many text classification tasks, especially with large datasets.
Maximum Entropy Classifier:
More flexible than Naive Bayes, but can be computationally expensive.
Often used for tasks like sentiment analysis and topic classification.
Practical applications of Probability in NLP
Information extraction (IE) is the task of automatically extracting structured information
from unstructured and/or semi-structured machine-readable documents and other
electronically represented sources. Typically, this involves processing human language texts by
means of natural language processing (NLP).
Example:
Probabilistic Retrieval Models:
Rank documents based on their probability of relevance to a query.
Incorporate language models to improve retrieval accuracy.
Language Modelling
Language modelling is the task of determining the probability of a sequence of words. Language modelling is
used in various applications such as Speech Recognition, Spam filtering, etc. Language modelling is the key aim
behind implementing many state-of-the-art Natural Language Processing models.
Two methods of Language Modeling:
Statistical Language Modelling: Statistical Language Modelling, or Language Modelling, is the development of
probabilistic models that can predict the next word in a sequence given the words that precede it.
N-gram language modelling is a common example.
Neural Language Modelling: Neural network methods are achieving better results than classical methods both on
standalone language models and when models are incorporated into larger models on challenging tasks like
speech recognition and machine translation. One common way of building a neural language model is through word
embeddings.
Introduction to N-gram models in NLP
An N-gram model is a type of probabilistic language model used in Natural Language
Processing (NLP) to predict the likelihood of a sequence of words. The term "N-gram" refers
to a contiguous sequence of 'n' items from a given sample of text or speech. These items can
be characters, words, or even phonemes. The most common types of N-grams are:
Unigrams (n=1): Single words.
Bigrams (n=2): Pairs of consecutive words.
Trigrams (n=3): Sequences of three consecutive words.
Higher-order N-grams: Sequences of more than three words.
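A minimal sketch of extracting these N-grams from a tokenized sentence (the helper name `ngrams` is an illustrative choice, not a standard API):
```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This article is on NLP".split()
print(ngrams(tokens, 1))  # unigrams: ('This',), ('article',), ...
print(ngrams(tokens, 2))  # bigrams:  ('This', 'article'), ('article', 'is'), ...
print(ngrams(tokens, 3))  # trigrams: ('This', 'article', 'is'), ...
```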
N-gram Language Model
N-gram can be defined as the contiguous sequence of n items from a given sample of text or
speech. The items can be letters, words, or base pairs according to the application. N-grams are
typically collected from a text or speech corpus (a long text dataset).
For instance, N-grams can be unigrams like (“This”, “article”, “is”, “on”, “NLP”) or bigrams
(“This article”, “article is”, “is on”, “on NLP”).
An N-gram language model predicts the probability of a given N-gram within any sequence
of words in a language. A well-crafted N-gram model can effectively predict the next word in
a sentence, which is essentially determining the value of p(w∣h), where h is the history or
context and w is the word to predict.
N-gram Models
Let’s begin with the task of computing P(w|h), the probability of a word w given some history h.
Suppose the history h is “The water of Walden Pond is so beautifully” and we want to know the
probability that the next word is blue: P(blue | The water of Walden Pond is so beautifully)
One way to estimate this probability is directly from relative frequency counts: take a very large
corpus, count the number of times we see “The water of Walden Pond is so beautifully”, and count
the number of times this is followed by “blue” . This would be answering the question “Out of the
times we saw the history h, how many times was it followed by the word w”, as follows:
P(blue | The water of Walden Pond is so beautifully) =
C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)
N-gram Models
N-gram models work by calculating the probability of a word given the previous 'n-1' words. This is
done using the following steps:
Corpus Collection: A large corpus of text is collected to train the model.
Tokenization: The text is tokenized into individual words or characters.
Counting N-grams: The frequency of each N-gram in the corpus is counted.
Calculating Probabilities: The probability of each N-gram is calculated using the frequency counts.
For example, the probability of a bigram P(wi | wi−1) is given by:
P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1)
where Count(wi−1, wi) is the number of times the bigram (wi−1, wi) appears in the corpus,
and Count(wi−1) is the number of times the word wi−1 appears in the corpus.
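A minimal sketch of these steps on a tiny made-up corpus, estimating the bigram probability from raw counts:
```python
from collections import Counter

corpus = ["this article is on NLP", "this article is short"]  # made-up corpus

# Tokenization
sentences = [s.split() for s in corpus]

# Counting unigrams and bigrams
unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) = Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("article", "is"))  # 2/2 = 1.0
print(bigram_prob("is", "on"))       # 1/2 = 0.5
```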
Applications of N-gram in industry
N-gram models are widely used in various NLP tasks due to their simplicity and effectiveness.
Some of the key applications include:
Language Modelling: N-gram models are fundamental in language modelling, where they
predict the next word in a sequence based on the previous 'n-1' words. This is useful for tasks
like text generation and speech recognition.
Text Generation: By predicting the next word in a sequence, N-gram models can generate
coherent and contextually relevant text. This is used in applications like auto-completion and
chatbots.
Spelling Correction: N-gram models can be used to correct spelling errors by identifying and
suggesting the most probable word sequences based on the context.
Applications of N-gram in industry
Sentiment Analysis: N-grams help in capturing local context and dependencies in text
data, which is crucial for sentiment analysis tasks where understanding the context is
important for accurate classification.
Machine Translation: N-gram models are used in statistical machine translation systems
to translate text from one language to another by modelling the probability of word
sequences in both languages.
Smoothing Techniques for N-gram Model
Smoothing techniques are essential in Natural Language Processing (NLP) to handle the issue of
data sparsity and improve the generalization of language models. These techniques adjust the
estimated probabilities of n-grams to ensure that all possible word sequences have a non-zero
probability, even if they do not appear in the training data. Here are some of the most commonly
used smoothing techniques in NLP.
Smoothing Techniques for N-gram Model
1. Additive Smoothing (Laplace Smoothing)
Additive smoothing is one of the simplest and most widely used smoothing techniques. It involves
adding a small constant value (usually denoted as α) to the count of each n-gram. The adjusted
probability is then calculated as:
P(wi | wi−1) = (Count(wi−1, wi) + α) / (Count(wi−1) + αV)
where V is the vocabulary size. This method ensures that no probability is zero, even for unseen n-grams.
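A minimal sketch of an add-α (Laplace) smoothed bigram estimate; the counts, vocabulary size, and α below are illustrative assumptions:
```python
def laplace_bigram_prob(prev, word, bigram_counts, unigram_counts,
                        vocab_size, alpha=1.0):
    """P(word | prev) = (Count(prev, word) + alpha) / (Count(prev) + alpha * V)."""
    return ((bigram_counts.get((prev, word), 0) + alpha)
            / (unigram_counts.get(prev, 0) + alpha * vocab_size))

# Made-up counts: the bigram ("on", "python") was never seen, yet its smoothed
# probability is small but non-zero instead of exactly zero.
print(laplace_bigram_prob("on", "python", {}, {"on": 1}, vocab_size=7))  # 0.125
```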
Smoothing Techniques for N-gram Model
2. Good-Turing Smoothing
Good-Turing smoothing reallocates the probability mass from seen n-grams to unseen n-grams. It
estimates the probability of unseen n-grams by using the count of n-grams that occur once in the
training data. The formula for the adjusted count in Good-Turing smoothing is:
c* = (c + 1) × Nc+1 / Nc
where c is the count of the n-gram, Nc is the number of n-grams that occur c times, and Nc+1 is
the number of n-grams that occur c+1 times.
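A short sketch of the adjusted-count step using made-up frequency-of-frequency counts:
```python
def good_turing_adjusted_count(c, freq_of_freq):
    """c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of distinct
    n-grams observed exactly c times."""
    return (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]

# Made-up counts: 100 n-grams seen once, 40 seen twice, 15 seen three times.
N = {1: 100, 2: 40, 3: 15}
print(good_turing_adjusted_count(1, N))  # 2 * 40 / 100 = 0.8
print(good_turing_adjusted_count(2, N))  # 3 * 15 / 40 = 1.125
```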
Smoothing Techniques for N-gram Model
3. Kneser-Ney Smoothing
Kneser-Ney smoothing is an advanced technique that considers the context of n-grams. It adjusts
the probability estimates based on the frequency of n-grams and their contexts. The formula for
Kneser-Ney smoothing involves a recursive calculation that considers lower-order n-grams; for bigrams it can be written as:
P(wi | wi−1) = max(Count(wi−1, wi) − d, 0) / Count(wi−1) + λ(wi−1) × Pcontinuation(wi)
where d is a discount factor, λ(wi−1) is a normalization factor that distributes the discounted probability mass, and Pcontinuation(wi) is the lower-order continuation probability of wi.
Smoothing Techniques for N-gram Model
4. Witten-Bell Smoothing
Witten-Bell smoothing estimates the probability of unseen events by considering the number of
unique events observed in the training data. It adjusts the probability estimates based on the number of
unique n-grams and their counts; for bigrams, the interpolated estimate can be written as:
P(wi | wi−1) = (Count(wi−1, wi) + T(wi−1) × P(wi)) / (Count(wi−1) + T(wi−1))
where T(wi−1) is the number of unique n-grams that follow wi−1, and P(wi) is the lower-order (unigram) estimate.
Smoothing Techniques for N-gram Model
5. Jelinek-Mercer Smoothing
Jelinek-Mercer smoothing combines the maximum likelihood estimate with a lower-order model
using a linear interpolation technique. It calculates the probability as a weighted sum of the maximum
likelihood estimate and a lower-order model:
P(wi | wi−1) = λ × PML(wi | wi−1) + (1 − λ) × P(wi)
where λ is a weighting factor that balances the contribution of the higher- and lower-order models.
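A minimal sketch of the interpolation, with a made-up weight λ and made-up component probabilities:
```python
def jelinek_mercer(p_bigram_ml, p_unigram, lam=0.7):
    """P_JM(w | prev) = lam * P_ML(w | prev) + (1 - lam) * P(w)."""
    return lam * p_bigram_ml + (1 - lam) * p_unigram

# An unseen bigram (ML estimate 0) still receives probability mass
# from the lower-order unigram model.
print(jelinek_mercer(p_bigram_ml=0.0, p_unigram=0.01))  # 0.003
```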
Advantages of N-gram model
N-gram models offer several advantages that make them indispensable in various NLP
applications:
Simplicity: They are easy to understand and implement, making them a good starting point
for many NLP tasks.
Efficiency: N-gram models are computationally efficient and can handle large datasets
effectively.
Effectiveness: Despite their simplicity, N-gram models often perform well in
practice, especially for tasks that require capturing local dependencies in text data.
Limitations of N-gram model
While N-gram models are powerful, they also have some limitations:
Sparsity: As 'n' increases, the number of possible N-grams grows exponentially, leading to
data sparsity issues where many N-grams may not appear in the training corpus.
Contextual Limitations: N-gram models capture only local context and dependencies, which
may not be sufficient for understanding long-range dependencies in text data.
Data Requirements: They require large amounts of training data to accurately estimate
probabilities for all possible N-grams, which can be a challenge for low-resource languages
or specialized domains.
Machine Learning Models for NLP
Natural Language Processing (NLP) is a dynamic field within artificial intelligence that
bridges the gap between human language and machine interpretation. A range of machine
learning models and tools are employed to solve various NLP tasks like sentiment analysis,
machine translation, text summarization, and entity recognition. Below are key models and
tools widely used in NLP, along with their capabilities and applications:
Naïve Bayes Model
Naive Bayes is a family of simple yet effective probabilistic classifiers based on Bayes' theorem,
assuming independence between features. Despite its simplicity, it performs remarkably well for
many NLP tasks.
Applications:
Sentiment Analysis: Naive Bayes can categorize text into sentiments (positive, negative, neutral) based
on word probabilities.
Spam Filtering: It’s extensively used in email systems to distinguish spam emails from legitimate ones
by analysing textual content.
Text Classification: Useful for classifying documents into predefined categories, such as news topics or
product reviews.
Naïve Bayes Model
The Naive Bayes model is a probabilistic machine learning algorithm used for classification
tasks. It is based on Bayes' Theorem and assumes that the features are conditionally independent
given the class label. This assumption, often referred to as the "naive" assumption, simplifies the
computation and makes the model efficient and easy to implement.
Bayes' Theorem: The foundation of the Naive Bayes model is Bayes' Theorem, which describes
the probability of an event based on prior knowledge of conditions that might be related to the
event.
Mathematically, Bayes’ theorem can be expressed as:
P(A|B) = P(B|A) * P(A) / P(B)
Naïve Bayes Model
Where P(A∣B) is the posterior probability of class A given predictor B.
P(B∣A) is the likelihood, which is the probability of predictor B given class A.
P(A) is the prior probability of class A.
P(B) is the prior probability of predictor B.
Conditional Independence: The Naive Bayes model assumes that all features are independent of each
other given the class label. This means that the presence of one feature does not affect the presence of
another feature. This assumption simplifies the computation significantly.
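A brief sketch of a Naive Bayes text classifier using scikit-learn; the tiny labelled corpus below is made up, and a real system would be trained on far more data.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 1 = spam, 0 = not spam.
docs = ["win money now", "urgent free offer", "meeting at noon", "see you tomorrow"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # bag-of-words feature counts
model = MultinomialNB().fit(X, labels)    # Bayes' theorem + independence assumption

print(model.predict(vectorizer.transform(["free money offer"])))  # likely [1]
```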
Naïve Bayes Model
1. Text Classification: Naive Bayes is widely used in text classification tasks such as spam
detection, sentiment analysis, and topic classification. Its ability to handle high-
dimensional data makes it particularly effective in these applications.
2. Spam Filtering: One of the most popular applications of Naive Bayes is in email spam
filtering. The model can quickly classify emails as spam or not spam based on the presence
of certain words or phrases.
3. Sentiment Analysis: Naive Bayes models are used to determine the sentiment of a piece of
text, such as a review or a social media post. This is useful for businesses to understand
customer opinions and feedback.
Naïve Bayes Model
4. Medical Diagnosis: In healthcare, Naive Bayes can be used to predict the likelihood of a
patient having a certain disease based on symptoms and other medical data. This helps in
early diagnosis and treatment planning.
5. Market Analysis: Naive Bayes is used in marketing to classify customers into different
segments based on their purchasing behaviour, demographics, and other attributes. This
helps in targeted marketing and personalized recommendations.
Maximum Entropy Model
The Maximum Entropy Model (MaxEnt) is a probabilistic framework used in machine learning
and natural language processing (NLP) to estimate the probability distribution of a set of
outcomes. The principle of maximum entropy suggests choosing the probability distribution that
maximizes entropy while satisfying a set of constraints derived from the observed data. This
approach ensures that the model makes the least assumptions about the data, making it a powerful
tool for modelling complex systems where the underlying relationships are not well understood.
Mathematically, the maximum entropy principle can be expressed as:
P(x) = (1/Z) * exp( Σi λi fi(x) )
Maximum Entropy Model
Where P(x) is the probability distribution.
Z is the normalization constant (partition function).
λi are the Lagrange multipliers.
fi(x) are the feature functions.
Maximum Entropy Models have been widely used in various NLP tasks due to their ability to handle
complex dependencies and incorporate multiple sources of evidence
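A minimal NumPy sketch of evaluating this distribution for a handful of candidate outcomes; the feature values and the λ weights are made-up assumptions.
```python
import numpy as np

# Rows = candidate outcomes x, columns = feature functions f_i(x) (made-up values).
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
lambdas = np.array([0.5, 1.2])     # made-up Lagrange multipliers

scores = F @ lambdas               # sum_i lambda_i * f_i(x) for each outcome
Z = np.sum(np.exp(scores))         # partition function (normalization constant)
P = np.exp(scores) / Z             # maximum entropy distribution

print(P, P.sum())                  # the probabilities sum to 1
```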
Applications of Maximum Entropy Model
Part-of-Speech (POS) Tagging: POS tagging involves assigning a grammatical category (e.g.,
noun, verb, adjective) to each word in a sentence. MaxEnt models are effective in this task because
they can combine various contextual features to make accurate predictions.
Named Entity Recognition (NER): NER involves identifying and classifying named entities in
text into predefined categories such as names of persons, organizations, locations, etc. MaxEnt
models can leverage multiple features like word context, capitalization, and part-of-speech tags to
improve recognition accuracy.
Text Classification: Text classification tasks, such as sentiment analysis and topic classification,
benefit from MaxEnt models due to their ability to handle high-dimensional feature spaces and
incorporate diverse sources of information.
Applications of Maximum Entropy Model
Language Modeling:
MaxEnt models can be used to build language models that predict the probability of a word given its
context. This is useful for tasks like speech recognition and machine translation.
Text Segmentation:
MaxEnt models are also used in text segmentation tasks, where the goal is to divide a text into
meaningful segments such as sentences or paragraphs. This involves combining various linguistic
features to make accurate segmentation decisions.
In summary, Maximum Entropy Models are a versatile and powerful tool in NLP, capable of handling complex
dependencies and incorporating diverse sources of evidence. Their applications span across various tasks, from POS
tagging and NER to text classification and language modeling, making them an essential component of modern NLP
systems.
Evaluation Metrics for NLP Tasks
In Natural Language Processing (NLP), evaluating the performance of models is crucial to
ensure they meet the desired standards. Three key metrics used for this purpose are Precision,
Recall, and F1-score. These metrics are particularly useful in classification tasks such as
sentiment analysis, named entity recognition, and text classification.
1. Precision: Precision measures the proportion of correctly predicted positive instances (true
positives) out of all instances predicted as positive (true positives + false positives). Precision is
about being precise. It tells us how many of the instances identified as positive are actually
positive. A high precision means that the model is good at avoiding false positives.
Evaluation Metrics for NLP Tasks
2. Recall: Recall measures the proportion of correctly predicted positive instances out of all actual
positive instances. It evaluates how well the model identifies true positives.
A high recall indicates that the model captures most of the actual positives. It is crucial in scenarios
where missing true positives is costly.
Use cases:
In disease screening, high recall ensures that most patients with the disease are correctly identified.
In fraud detection, it minimizes undetected fraudulent activities.
Recall is about being thorough. It tells us how many of the actual positive instances were correctly
identified by the model. A high recall means that the model is good at avoiding false negatives.
Evaluation Metrics for NLP Tasks
3. F1-Score: The F1-score is the harmonic mean of Precision and Recall, providing a balanced
metric that considers both false positives and false negatives. It is useful when there is an uneven
class distribution or when both precision and recall are equally important.
The F1-score is useful when you need a balance between precision and recall, especially in cases where
there is an uneven class distribution (imbalanced datasets). It ranges from 0 to 1, with 1 being the best
possible score.
Use cases:
•Sentiment Analysis: Ensures accurate classification of both positive and negative sentiments.
•Named Entity Recognition (NER): Balances identifying entities correctly while minimizing false positives.
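A short sketch computing all three metrics from raw confusion counts; the true-positive, false-positive, and false-negative numbers below are made up.
```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Made-up confusion counts for a spam classifier.
print(precision_recall_f1(tp=80, fp=20, fn=40))  # approx (0.8, 0.667, 0.727)
```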
Trade-offs and considerations
Precision vs. Recall Trade-off: There is often a trade-off between precision and recall. Improving one
can negatively impact the other. For example, a model that predicts everything as positive will have
high recall but low precision. Conversely, a model that predicts very few positives will have high
precision but low recall.
F1-score for Imbalanced Datasets: The F1-score is particularly useful in scenarios with imbalanced
datasets where one class significantly outnumbers the other. It helps in providing a more nuanced
understanding of model performance by balancing precision and recall.
Ad

More Related Content

Similar to NLP Msc Computer science S2 Kerala University (20)

Probability
ProbabilityProbability
Probability
Neha Raikar
 
Advanced statistics
Advanced statisticsAdvanced statistics
Advanced statistics
Romel Villarubia
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
pujashri1975
 
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
theijes
 
2 or more samples
2 or more samples2 or more samples
2 or more samples
SudhakarNayak11
 
Basic statistics 1
Basic statistics  1Basic statistics  1
Basic statistics 1
Kumar P
 
3 es timation-of_parameters[1]
3 es timation-of_parameters[1]3 es timation-of_parameters[1]
3 es timation-of_parameters[1]
Fernando Jose Damayo
 
Allerton
AllertonAllerton
Allerton
mustafa sarac
 
Lecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptxLecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptx
shakirRahman10
 
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Lifeng (Aaron) Han
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
kevig
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
ijnlc
 
Different types of distributions
Different types of distributionsDifferent types of distributions
Different types of distributions
RajaKrishnan M
 
Basen Network
Basen NetworkBasen Network
Basen Network
guestf7d226
 
Multivariate and Conditional Distribution
Multivariate and Conditional DistributionMultivariate and Conditional Distribution
Multivariate and Conditional Distribution
ssusered887b
 
Analysis of variance
Analysis of varianceAnalysis of variance
Analysis of variance
Shakeel Nouman
 
Basic probability theory and statistics
Basic probability theory and statisticsBasic probability theory and statistics
Basic probability theory and statistics
Learnbay Datascience
 
Inorganic CHEMISTRY
Inorganic CHEMISTRYInorganic CHEMISTRY
Inorganic CHEMISTRY
Saikumar raja
 
Sampling and Central Limit Theorem_18_01_23 new.pptx
Sampling and Central Limit Theorem_18_01_23 new.pptxSampling and Central Limit Theorem_18_01_23 new.pptx
Sampling and Central Limit Theorem_18_01_23 new.pptx
Universitas Pelita Harapan
 
STSTISTICS AND PROBABILITY THEORY .pptx
STSTISTICS AND PROBABILITY THEORY  .pptxSTSTISTICS AND PROBABILITY THEORY  .pptx
STSTISTICS AND PROBABILITY THEORY .pptx
VenuKumar65
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
pujashri1975
 
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
theijes
 
Basic statistics 1
Basic statistics  1Basic statistics  1
Basic statistics 1
Kumar P
 
Lecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptxLecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptx
shakirRahman10
 
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Lifeng (Aaron) Han
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
kevig
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
ijnlc
 
Different types of distributions
Different types of distributionsDifferent types of distributions
Different types of distributions
RajaKrishnan M
 
Multivariate and Conditional Distribution
Multivariate and Conditional DistributionMultivariate and Conditional Distribution
Multivariate and Conditional Distribution
ssusered887b
 
Basic probability theory and statistics
Basic probability theory and statisticsBasic probability theory and statistics
Basic probability theory and statistics
Learnbay Datascience
 
Sampling and Central Limit Theorem_18_01_23 new.pptx
Sampling and Central Limit Theorem_18_01_23 new.pptxSampling and Central Limit Theorem_18_01_23 new.pptx
Sampling and Central Limit Theorem_18_01_23 new.pptx
Universitas Pelita Harapan
 
STSTISTICS AND PROBABILITY THEORY .pptx
STSTISTICS AND PROBABILITY THEORY  .pptxSTSTISTICS AND PROBABILITY THEORY  .pptx
STSTISTICS AND PROBABILITY THEORY .pptx
VenuKumar65
 

Recently uploaded (20)

ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
The Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLabThe Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLab
Journal of Soft Computing in Civil Engineering
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Journal of Soft Computing in Civil Engineering
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Journal of Soft Computing in Civil Engineering
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
Ad

NLP Msc Computer science S2 Kerala University

  • 1. Natural Language Processing MODULE IV Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 2. Introduction Natural Language Processing (NLP) leverages statistical methods to analyse and generate human language. Two fundamental concepts in this domain are probability and information theory. Understanding these concepts is crucial for developing effective NLP models and applications. Probability theory is used to model the likelihood of various linguistic events. Here are some key applications: Language Modeling: n-grams: These are contiguous sequences of n items from a given sample of text. For example, bigrams (n=2) and trigrams (n=3) are used to predict the next word in a sentence by considering the context of the previous words. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 3. Introduction Markov Chains: These models treat each word as a state and analyze the probability of transitioning from one word to another. This is useful for tasks like text generation and speech recognition. Text Classification: Naive Bayes: This algorithm uses Bayes' theorem to classify text based on the probability of certain words appearing in different categories. It is widely used in sentiment analysis and spam detection. Conditional Probability: This measures the probability of an event given that another event has occurred. It is fundamental in tasks like part-of-speech tagging and named entity recognition, where the probability of a word being a certain part of speech or entity depends on the context. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 4. Introduction Information theory provides a framework for quantifying the amount of information in data, which is essential for various NLP tasks: Entropy: Entropy measures the uncertainty or unpredictability of a random variable. In NLP, it quantifies the amount of information required to describe a dataset. Higher entropy indicates greater unpredictability, while lower entropy indicates more predictability. Cross-Entropy Loss: This measures the difference between two probability distributions and is used to evaluate the performance of machine learning models by comparing the predicted distribution with the true distribution. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 5. Introduction 2. Mutual Information: Mutual information quantifies the dependency between two random variables. In NLP, it is used for feature selection, where features with higher mutual information scores are more informative for the model. 3. Kullback-Leibler (KL) Divergence: KL divergence measures the difference between two probability distributions. It is used to compare the true distribution of data with the predicted distribution, helping in model evaluation and regularization techniques like variational inference in Bayesian neural networks. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 6. Practical use of Information Theory in NLP Feature Selection: Using mutual information to identify the most informative features for building predictive models, thereby improving model performance by reducing dimensionality and removing irrelevant features. Decision Trees: Information gain, based on entropy, is used to split nodes in decision trees, leading to more efficient and accurate models by reducing uncertainty about the target variable. Regularization and Model Selection: KL divergence is used in regularization techniques to minimize the difference between the approximate and true posterior distributions, achieving better model regularization and performance. Information Bottleneck: This method aims to find a compressed representation of the input data that retains maximal information about the output, used in deep learning for learning efficient representations. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 7. Probability Concepts Probability and Information Theory are foundational to understanding and building Natural Language Processing (NLP) systems. They provide the mathematical frameworks for modelling uncertainty, quantifying information, and optimizing decision-making in language tasks. Probability theory helps in modelling and predicting linguistic phenomena by managing uncertainties inherent in natural language. Since human language is complex, ambiguous, and context-dependent, probabilistic methods allow us to make inferences and predictions about language data. Information Theory quantifies the amount of information and helps in designing efficient encoding, compression, and communication systems for text data. It is essential for understanding language representation and optimizing models. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 8. Applications of Probability in NLP Language Modeling: Estimating the likelihood of a sequence of words (e.g., P(w1,w2,…,wn). Part-of-Speech Tagging: Computing the most probable sequence of tags given a sequence of words using Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs). Machine Translation: Determining the probability of a target language sentence given a source language sentence (P(target∣source)P(target∣source)). Speech Recognition: Identifying the most probable sequence of words corresponding to an audio signal. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 9. Key Concepts of Probability in NLP Random Variables: Representing linguistic features (e.g., word frequencies). Conditional Probability: Modeling dependencies (e.g., P(word ∣ previous words)P(word ∣ previous words)). Bayes’ Theorem: Used in spam filtering and probabilistic reasoning (P(A∣B)=P(B∣A)⋅P(A)P(B)P(A∣B)=P(B)P(B∣A)⋅P(A)). Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 10. Basics of Probability Theory Random Variables: A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes. Random variables are often designated by letters and can be classified as discrete and continuous. What is a discrete random variable ? Discrete random variables take on a countable number of distinct values. Consider an experiment where a coin is tossed three times. If X represents the number of times that the coin comes up heads, then X is a discrete random variable that can only have the values 0, 1, 2, or 3 (from no heads in three successive coin tosses to all heads). No other value is possible for X. What is a continuous variable ? Continuous random variables can represent any value within a specified range or interval and can take on an infinite number of possible values. An example of a continuous random variable would be an experiment that involves measuring the amount of rainfall in a city over a year or the average height of a random group of 25 people. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 11. Probability distributions A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This range will be bounded between the minimum and maximum possible values. However, where the possible value is likely to be plotted on the probability distribution depends on several factors. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 12. Types of probability distribution Binomial The binomial distribution evaluates the probability of an event occurring several times over a given number of trials, given the event's probability in each trial. It may be generated by keeping track of how many free throws a basketball player makes in a game, where 1 = a basket and 0 = a miss. Another example would be to use a coin and figure out the probability of that coin coming up heads in 10 straight flips. A binomial distribution is discrete rather than continuous because each trial has only two valid outcomes, one or zero. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 13. Types of probability distribution Normal The most commonly used distribution is the normal distribution. It is used frequently in finance, investing, science, and engineering. The normal distribution is fully characterized by its mean and standard deviation. It is symmetric (not skewed) and is depicted as a bell-shaped curve when plotted. The standard normal distribution has a mean (average) of zero and a standard deviation of one, with a skew of zero and kurtosis of 3. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 14. Types of probability distribution Poisson distribution The Poisson distribution is a discrete probability distribution that models the number of events occurring within a fixed interval of time or space. These events must happen independently of each other, and the average rate (mean number of occurrences) must be constant. The key characteristic of the Poisson distribution is that it describes the probability of a given number of events happening within a specified interval when the events are rare and independent. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 15. Types of probability distribution Bernoulli distribution A Bernoulli distribution is a discrete probability distribution that describes a random experiment with only two possible outcomes: success (usually denoted by 1) or failure (usually denoted by 0). Gaussian distribution A Gaussian distribution, also known as a normal distribution, is a type of continuous probability distribution characterized by its bell-shaped curve. It's one of the most commonly used probability distributions in statistics and many other fields. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 16. What is Entropy Entropy is a measure of uncertainty or randomness associated with a random variable. In the context of NLP, it quantifies the amount of information contained in a message or text:
H(X) = - Σ P(x) * log₂ P(x)
where H(X) is the entropy of the random variable X and P(x) is the probability of the event x.
Cross-entropy: Cross-entropy measures the difference between two probability distributions. In NLP, it is used to evaluate the performance of language models. It is calculated as:
H(p, q) = - Σ p(x) * log₂ q(x)
where p(x) is the true probability distribution and q(x) is the predicted probability distribution.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
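The following short Python sketch evaluates the two formulas above on a toy four-event distribution; p_true and q_model are hypothetical values chosen only for illustration.

import math

def entropy(p):
    # H(X) = -sum P(x) * log2 P(x), skipping zero-probability events
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum p(x) * log2 q(x); p is the true distribution, q the model's
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p_true = [0.5, 0.25, 0.125, 0.125]   # true distribution (toy values)
q_model = [0.4, 0.3, 0.2, 0.1]       # model's predicted distribution (toy values)

print(entropy(p_true))                 # 1.75 bits
print(cross_entropy(p_true, q_model))  # about 1.80 bits; always >= the entropy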
• 17. Bayes' Theorem Bayes' Theorem is a fundamental rule of probability that allows us to update our beliefs about the probability of an event based on new evidence. In the context of NLP, it is used to calculate the probability of a particular class or category given a piece of text. The formula is:
P(A|B) = P(B|A) * P(A) / P(B)
where P(A|B) is the probability of event A given event B, P(B|A) is the probability of event B given event A, P(A) is the prior probability of event A, and P(B) is the prior probability of event B.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 18. Chain rule of conditional probability The chain rule in probability is a fundamental concept that helps us calculate the probability of a sequence of events. Let's break it down from the basics: Conditional probability is the probability of an event occurring given that another event has already occurred. It is denoted as P(A|B), which reads "the probability of A given B". The chain rule states that the probability of a sequence of events can be calculated by multiplying the conditional probabilities of each event. Mathematically, it is represented as:
P(A ∩ B ∩ C) = P(A) × P(B|A) × P(C|A ∩ B)
Here, P(A ∩ B ∩ C) is the probability of events A, B, and C all occurring; P(A) is the probability of event A occurring; P(B|A) is the conditional probability of event B occurring given that event A has occurred; and P(C|A ∩ B) is the conditional probability of event C occurring given that events A and B have occurred.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 19. Chain rule of conditional probability - Example Suppose we want to calculate the joint probability that a person is a smoker (S), has a family history of smoking (F), and is over 30 years old (O). Using the chain rule:
P(S ∩ F ∩ O) = P(O) × P(F|O) × P(S|F ∩ O)
Assume the following probabilities: P(O) = 0.6 (probability of being over 30 years old); P(F|O) = 0.4 (probability of having a family history of smoking given that the person is over 30); P(S|F ∩ O) = 0.7 (probability of being a smoker given a family history of smoking and being over 30).
Applying the chain rule: P(S ∩ F ∩ O) = 0.6 × 0.4 × 0.7 = 0.168. Therefore, the probability that a person is a smoker, has a family history of smoking, and is over 30 years old is approximately 16.8%.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 20. Applications of Bayes' theorem in NLP Text Classification: Calculating the probability that a document belongs to a specific class given its features. Information Retrieval: Ranking documents based on their relevance to a query. Machine Translation: Selecting the most probable translation for a given word or phrase. Example: Suppose we want to classify an email as spam or not spam. We can use Bayes' Theorem to calculate the probability that an email is spam given the presence of certain keywords (e.g., "free," "urgent," "money"). Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
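A small numeric sketch of the spam example above, applying Bayes' theorem to a single keyword ("free"); every probability below is hypothetical and chosen only for illustration.

# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam = 0.3             # prior probability that an email is spam (assumed)
p_free_given_spam = 0.6  # P("free" | spam) (assumed)
p_free_given_ham = 0.05  # P("free" | not spam) (assumed)

# P("free") by the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # ~0.837: the keyword sharply raises the spam probability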
• 21. Practical applications of Probability in NLP Language Modeling: Language modeling, or LM, is the use of statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions. Example: N-gram Models: Simple models that predict the next word based on the previous N−1 words. They are limited by data sparsity and the curse of dimensionality. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 22. Practical applications of Probability in NLP Text classification: Text classification analyzes the content and meaning of text to assign it a label. It can be used to organize and structure large amounts of unstructured text, such as legal documents, contracts, research documents, and email. Example: Naive Bayes Classifier: Assumes feature independence to simplify calculations. Effective for many text classification tasks, especially with large datasets. Maximum Entropy Classifier: More flexible than Naive Bayes, but can be computationally expensive. Often used for tasks like sentiment analysis and topic classification. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 23. Practical applications of Probability in NLP Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Example: Probabilistic Retrieval Models: Rank documents based on their probability of relevance to a query and incorporate language models to improve retrieval accuracy. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 24. Language Modelling Language modelling is the way of determining the probability of any sequence of words. Language modelling is used in various applications such as Speech Recognition, Spam filtering, etc. Language modelling is the key aim behind implementing many state-of-the-art Natural Language Processing models. Two methods of Language Modelling: Statistical Language Modelling: Statistical Language Modelling, or Language Modelling, is the development of probabilistic models that can predict the next word in a sequence given the words that precede it. Examples include N-gram language modelling. Neural Language Modelling: Neural network methods achieve better results than classical methods, both as standalone language models and when incorporated into larger models for challenging tasks like speech recognition and machine translation. One way of building a neural language model is through word embeddings. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 25. Introduction to N-gram models in NLP An N-gram model is a type of probabilistic language model used in Natural Language Processing (NLP) to predict the likelihood of a sequence of words. The term "N-gram" refers to a contiguous sequence of 'n' items from a given sample of text or speech. These items can be characters, words, or even phonemes. The most common types of N-grams are:
Unigrams (n=1): Single words.
Bigrams (n=2): Pairs of consecutive words.
Trigrams (n=3): Sequences of three consecutive words.
Higher-order N-grams: Sequences of more than three words.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
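A minimal Python sketch of extracting unigrams, bigrams, and trigrams from a tokenized sentence; the ngrams helper defined here is illustrative, not a library function.

def ngrams(tokens, n):
    # Return all contiguous n-grams (as tuples) from a list of tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This article is on NLP".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams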
• 26. ‘N’ gram Language Model An N-gram can be defined as a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs according to the application. The N-grams are typically collected from a text or speech corpus (a long text dataset). For instance, N-grams can be unigrams like ("This", "article", "is", "on", "NLP") or bigrams ("This article", "article is", "is on", "on NLP"). An N-gram language model predicts the probability of a given N-gram within any sequence of words in a language. A well-crafted N-gram model can effectively predict the next word in a sentence, which is essentially determining the value of p(w|h), where h is the history or context and w is the word to predict. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 27. ‘N’ gram models Let's begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is "The water of Walden Pond is so beautifully" and we want to know the probability that the next word is blue: P(blue | The water of Walden Pond is so beautifully). One way to estimate this probability is directly from relative frequency counts: take a very large corpus, count the number of times we see "The water of Walden Pond is so beautifully", and count the number of times this is followed by "blue". This answers the question "Out of the times we saw the history h, how many times was it followed by the word w?", as follows:
P(blue | The water of Walden Pond is so beautifully) = C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 28. ‘N’ gram models N-gram models work by calculating the probability of a word given the previous 'n-1' words. This is done using the following steps:
Corpus Collection: A large corpus of text is collected to train the model.
Tokenization: The text is tokenized into individual words or characters.
Counting N-grams: The frequency of each N-gram in the corpus is counted.
Calculating Probabilities: The probability of each N-gram is calculated from the frequency counts. For example, the probability of a bigram is given by
P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1)
where Count(wi−1, wi) is the number of times the bigram (wi−1, wi) appears in the corpus, and Count(wi−1) is the number of times the word wi−1 appears in the corpus.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
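The steps above can be sketched in a few lines of Python: count unigrams and bigrams over a corpus and apply the estimate P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1). The two-sentence corpus is a toy example invented for illustration; real models are trained on much larger text.

from collections import Counter

corpus = [
    "the water of the pond is blue".split(),
    "the water is cold".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(prev_word, word):
    # Maximum likelihood estimate of P(word | prev_word)
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("water", "of"))  # 1/2 = 0.5
print(bigram_prob("water", "is"))  # 1/2 = 0.5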
  • 29. Applications of N-gram in industry N-gram models are widely used in various NLP tasks due to their simplicity and effectiveness. Some of the key applications include: Language Modelling: N-gram models are fundamental in language modelling, where they predict the next word in a sequence based on the previous 'n-1' words. This is useful for tasks like text generation and speech recognition. Text Generation: By predicting the next word in a sequence, N-gram models can generate coherent and contextually relevant text. This is used in applications like auto-completion and chatbots. Spelling Correction: N-gram models can be used to correct spelling errors by identifying and suggesting the most probable word sequences based on the context. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 30. Applications of N-gram in industry Sentiment Analysis: N-grams help in capturing local context and dependencies in text data, which is crucial for sentiment analysis tasks where understanding the context is important for accurate classification. Machine Translation: N-gram models are used in statistical machine translation systems to translate text from one language to another by modelling the probability of word sequences in both languages. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 31. Smoothing Techniques for N-gram Model Smoothing techniques are essential in Natural Language Processing (NLP) to handle the issue of data sparsity and improve the generalization of language models. These techniques adjust the estimated probabilities of n-grams to ensure that all possible word sequences have a non-zero probability, even if they do not appear in the training data. Here are some of the most commonly used smoothing techniques in NLP. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 32. Smoothing Techniques for N-gram Model 1. Additive Smoothing (Laplace Smoothing) Additive smoothing is one of the simplest and most widely used smoothing techniques. It involves adding a small constant value (usually denoted as α) to the count of each n-gram. The adjusted bigram probability is then calculated as
P(wi | wi−1) = (Count(wi−1, wi) + α) / (Count(wi−1) + αV)
where V is the vocabulary size. This method ensures that no probability is zero, even for unseen n-grams.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
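A short sketch of the add-α estimate above, assuming toy bigram and unigram counts invented for illustration; with α = 1 this is classic add-one smoothing.

from collections import Counter

bigram_counts = Counter({("the", "water"): 2, ("water", "is"): 1})
unigram_counts = Counter({"the": 3, "water": 2, "is": 1, "cold": 1})

def smoothed_bigram_prob(prev_word, word, alpha=1.0):
    # (Count(prev, word) + alpha) / (Count(prev) + alpha * V), with V the vocabulary size
    V = len(unigram_counts)
    return (bigram_counts[(prev_word, word)] + alpha) / (unigram_counts[prev_word] + alpha * V)

print(smoothed_bigram_prob("water", "is"))    # seen bigram: (1+1)/(2+4) = 0.333
print(smoothed_bigram_prob("water", "cold"))  # unseen bigram still gets (0+1)/(2+4) = 0.167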
• 33. Smoothing Techniques for N-gram Model 2. Good-Turing Smoothing Good-Turing smoothing reallocates probability mass from seen n-grams to unseen n-grams. It estimates the probability of unseen n-grams by using the count of n-grams that occur exactly once in the training data. Good-Turing smoothing replaces each raw count c with an adjusted count
c* = (c + 1) · Nc+1 / Nc
where c is the count of the n-gram, Nc is the number of n-grams that occur c times, and Nc+1 is the number of n-grams that occur c+1 times.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
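A minimal sketch of the adjusted count above, assuming a toy table of n-gram counts; falling back to the raw count when Nc or Nc+1 is unavailable is a simplification of what practical implementations do.

from collections import Counter

ngram_counts = Counter({"a b": 3, "b c": 1, "c d": 1, "d e": 2, "e f": 1})
freq_of_freq = Counter(ngram_counts.values())   # Nc: here {1: 3, 2: 1, 3: 1}

def good_turing_count(c):
    # c* = (c + 1) * N_{c+1} / N_c, falling back to c when the counts are missing
    if freq_of_freq[c] == 0 or freq_of_freq[c + 1] == 0:
        return c
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

total = sum(ngram_counts.values())
print(good_turing_count(1))      # 2 * 1 / 3 ≈ 0.667: mass is taken from once-seen n-grams
print(freq_of_freq[1] / total)   # 3/8 = 0.375: probability mass reserved for unseen n-grams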
• 34. Smoothing Techniques for N-gram Model 3. Kneser-Ney Smoothing Kneser-Ney smoothing is an advanced technique that considers the context of n-grams. It adjusts the probability estimates based on the frequency of n-grams and their contexts. The formula for Kneser-Ney smoothing involves a recursive calculation that considers lower-order n-grams; for a bigram it is commonly written as
P(wi | wi−1) = max(Count(wi−1, wi) − d, 0) / Count(wi−1) + λ(wi−1) · Pcontinuation(wi)
where d is a discount factor and λ(wi−1) is a normalization factor.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 35. Smoothing Techniques for N-gram Model 4. Witten-Bell Smoothing Witten-Bell smoothing estimates the probability of unseen events by considering the number of unique events observed in the training data. It adjusts the probability estimates based on the number of unique n-grams and their counts; one common interpolated formulation for a bigram is
P(wi | wi−1) = λ(wi−1) · PML(wi | wi−1) + (1 − λ(wi−1)) · P(wi), with λ(wi−1) = Count(wi−1) / (Count(wi−1) + T(wi−1))
where T(wi−1) is the number of unique word types that follow wi−1.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 36. Smoothing Techniques for N-gram Model 5. Jelinek-Mercer Smoothing Jelinek-Mercer smoothing combines the maximum likelihood estimate with a lower-order model using linear interpolation. It calculates the probability as a weighted sum of the maximum likelihood estimate and a lower-order model:
P(wi | wi−1) = λ · PML(wi | wi−1) + (1 − λ) · P(wi)
where λ is a weighting factor that balances the contributions of the higher- and lower-order models.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
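A sketch of the interpolation above for a bigram model, mixing the bigram maximum likelihood estimate with the unigram probability; the counts and the weight λ = 0.7 are hypothetical.

from collections import Counter

unigram_counts = Counter({"the": 3, "water": 2, "is": 1, "cold": 1})
bigram_counts = Counter({("the", "water"): 2, ("water", "is"): 1})
total_tokens = sum(unigram_counts.values())

def jelinek_mercer(prev_word, word, lam=0.7):
    # lam * P_ML(word | prev_word) + (1 - lam) * P(word)
    p_bigram = (bigram_counts[(prev_word, word)] / unigram_counts[prev_word]
                if unigram_counts[prev_word] else 0.0)
    p_unigram = unigram_counts[word] / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

print(jelinek_mercer("water", "is"))    # seen bigram: interpolated probability
print(jelinek_mercer("water", "cold"))  # unseen bigram still receives unigram mass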
  • 37. Advantages of N-gram model N-gram models offer several advantages that make them indispensable in various NLP applications: Simplicity: They are easy to understand and implement, making them a good starting point for many NLP tasks. Efficiency: N-gram models are computationally efficient and can handle large datasets effectively. Effectiveness: Despite their simplicity, N-gram models often perform well in practice, especially for tasks that require capturing local dependencies in text data. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 38. Limitations of N-gram model While N-gram models are powerful, they also have some limitations: Sparsity: As 'n' increases, the number of possible N-grams grows exponentially, leading to data sparsity issues where many N-grams may not appear in the training corpus. Contextual Limitations: N-gram models capture only local context and dependencies, which may not be sufficient for understanding long-range dependencies in text data. Data Requirements: They require large amounts of training data to accurately estimate probabilities for all possible N-grams, which can be a challenge for low-resource languages or specialized domains. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 39. Machine Learning Models for NLP Natural Language Processing (NLP) is a dynamic field within artificial intelligence that bridges the gap between human language and machine interpretation. A range of machine learning models and tools are employed to solve various NLP tasks like sentiment analysis, machine translation, text summarization, and entity recognition. Below are key models and tools widely used in NLP, along with their capabilities and applications: Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 40. Naïve Bayes Model Naive Bayes is a family of simple yet effective probabilistic classifiers based on Bayes' theorem, assuming independence between features. Despite its simplicity, it performs remarkably well for many NLP tasks. Applications: Sentiment Analysis: Naive Bayes can categorize text into sentiments (positive, negative, neutral) based on word probabilities. Spam Filtering: It’s extensively used in email systems to distinguish spam emails from legitimate ones by analysing textual content. Text Classification: Useful for classifying documents into predefined categories, such as news topics or product reviews. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 41. Naïve Bayes Model The Naive Bayes model is a probabilistic machine learning algorithm used for classification tasks. It is based on Bayes' Theorem and assumes that the features are conditionally independent given the class label. This assumption, often referred to as the "naive" assumption, simplifies the computation and makes the model efficient and easy to implement. Bayes' Theorem: The foundation of the Naive Bayes model is Bayes' Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. Mathematically, Bayes' theorem can be expressed as
P(A|B) = P(B|A) * P(A) / P(B)
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 42. Naïve Bayes Model Where P(A|B) is the posterior probability of class A given predictor B; P(B|A) is the likelihood, i.e., the probability of predictor B given class A; P(A) is the prior probability of class A; and P(B) is the prior probability of predictor B. Conditional Independence: The Naive Bayes model assumes that all features are independent of each other given the class label. This means that the presence of one feature does not affect the presence of another feature, which simplifies the computation significantly. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 43. Naïve Bayes Model Text Classification: Naive Bayes is widely used in text classification tasks such as spam detection, sentiment analysis, and topic classification. Its ability to handle high-dimensional data makes it particularly effective in these applications. Spam Filtering: One of the most popular applications of Naive Bayes is in email spam filtering. The model can quickly classify emails as spam or not spam based on the presence of certain words or phrases. Sentiment Analysis: Naive Bayes models are used to determine the sentiment of a piece of text, such as a review or a social media post. This is useful for businesses to understand customer opinions and feedback. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
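A minimal scikit-learn sketch of the spam-filtering use case above, combining CountVectorizer with MultinomialNB; the four training emails and their labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "win free money now", "urgent prize claim your reward",               # toy spam
    "meeting rescheduled to monday", "please review the attached report", # toy ham
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["claim your free reward now"]))         # likely 'spam'
print(model.predict(["see the report before the meeting"]))  # likely 'ham'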
  • 44. Naïve Bayes Model 4. Medical Diagnosis: In healthcare, Naive Bayes can be used to predict the likelihood of a patient having a certain disease based on symptoms and other medical data. This helps in early diagnosis and treatment planning. 5. Market Analysis: Naive Bayes is used in marketing to classify customers into different segments based on their purchasing behaviour, demographics, and other attributes. This helps in targeted marketing and personalized recommendations. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 45. Maximum Entropy Model The Maximum Entropy Model (MaxEnt) is a probabilistic framework used in machine learning and natural language processing (NLP) to estimate the probability distribution of a set of outcomes. The principle of maximum entropy suggests choosing the probability distribution that maximizes entropy while satisfying a set of constraints derived from the observed data. This approach ensures that the model makes the fewest assumptions about the data, making it a powerful tool for modelling complex systems where the underlying relationships are not well understood. Mathematically, the maximum entropy distribution takes the log-linear (exponential) form
P(x) = (1/Z) · exp( Σi λi fi(x) )
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 46. Maximum Entropy Model Where P(x) is the probability distribution, Z is the normalization constant (partition function), λi are the Lagrange multipliers, and fi(x) are the feature functions. Maximum Entropy Models have been widely used in various NLP tasks due to their ability to handle complex dependencies and incorporate multiple sources of evidence. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
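In practice, maximum entropy classifiers for text are commonly implemented as multinomial logistic regression; the sketch below uses scikit-learn's TfidfVectorizer and LogisticRegression on a toy sentiment dataset invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and bad acting",
         "wonderful performance", "boring and disappointing"]
labels = ["pos", "neg", "pos", "neg"]

# A MaxEnt-style classifier: a log-linear model over TF-IDF features
maxent = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
maxent.fit(texts, labels)
print(maxent.predict(["a wonderful and great film"]))  # likely 'pos'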
  • 47. Applications of Maximum Entropy Model Part-of-Speech (POS) Tagging: POS tagging involves assigning a grammatical category (e.g., noun, verb, adjective) to each word in a sentence. MaxEnt models are effective in this task because they can combine various contextual features to make accurate predictions. Named Entity Recognition (NER): NER involves identifying and classifying named entities in text into predefined categories such as names of persons, organizations, locations, etc. MaxEnt models can leverage multiple features like word context, capitalization, and part-of-speech tags to improve recognition accuracy. Text Classification: Text classification tasks, such as sentiment analysis and topic classification, benefit from MaxEnt models due to their ability to handle high-dimensional feature spaces and incorporate diverse sources of information. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 48. Applications of Maximum Entropy Model Language Modeling: MaxEnt models can be used to build language models that predict the probability of a word given its context. This is useful for tasks like speech recognition and machine translation. Text Segmentation: MaxEnt models are also used in text segmentation tasks, where the goal is to divide a text into meaningful segments such as sentences or paragraphs. This involves combining various linguistic features to make accurate segmentation decisions. In summary, Maximum Entropy Models are a versatile and powerful tool in NLP, capable of handling complex dependencies and incorporating diverse sources of evidence. Their applications span various tasks, from POS tagging and NER to text classification and language modeling, making them an essential component of modern NLP systems. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 49. Evaluation Metrics for NLP Tasks In Natural Language Processing (NLP), evaluating the performance of models is crucial to ensure they meet the desired standards. Three key metrics used for this purpose are Precision, Recall, and F1-score. These metrics are particularly useful in classification tasks such as sentiment analysis, named entity recognition, and text classification. 1. Precision: Precision measures the proportion of correctly predicted positive instances (true positives) out of all instances predicted as positive (true positives + false positives). Precision is about being precise. It tells us how many of the instances identified as positive are actually positive. A high precision means that the model is good at avoiding false positives. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 50. Evaluation Metrics for NLP Tasks 2. Recall: Recall measures the proportion of correctly predicted positive instances out of all actual positive instances. It evaluates how well the model identifies true positives. A high recall indicates that the model captures most of the actual positives. It is crucial in scenarios where missing true positives is costly. Use cases: In disease screening, high recall ensures that most patients with the disease are correctly identified. In fraud detection, it minimizes undetected fraudulent activities. Recall is about being thorough. It tells us how many of the actual positive instances were correctly identified by the model. A high recall means that the model is good at avoiding false negatives. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 51. Evaluation Metrics for NLP Tasks 3. F1-Score The F1-score is the harmonic mean of Precision and Recall: F1 = 2 · (Precision · Recall) / (Precision + Recall). It provides a balanced metric that considers both false positives and false negatives, and it is especially useful when there is an uneven class distribution (imbalanced datasets) or when precision and recall are equally important. It ranges from 0 to 1, with 1 being the best possible score. Use cases: Sentiment Analysis: Ensures accurate classification of both positive and negative sentiments. Named Entity Recognition (NER): Balances identifying entities correctly while minimizing false positives. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
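A small sketch that computes Precision, Recall, and F1 both by hand (from true positive, false positive, and false negative counts) and with scikit-learn's precision_recall_fscore_support; the prediction vectors are toy data.

from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = positive class (e.g., spam)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75 0.75 0.75

p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(p, r, f)                 # matches the manual values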
  • 52. Trade-offs and consideration Precision vs. Recall Trade-off: There is often a trade-off between precision and recall. Improving one can negatively impact the other. For example, a model that predicts everything as positive will have high recall but low precision. Conversely, a model that predicts very few positives will have high precision but low recall. F1-score for Imbalanced Datasets: The F1-score is particularly useful in scenarios with imbalanced datasets where one class significantly outnumbers the other. It helps in providing a more nuanced understanding of model performance by balancing precision and recall. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum