Natural Language Processing
MODULE IV
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
Introduction
Natural Language Processing (NLP) leverages statistical methods to analyse and generate human
language. Two fundamental concepts in this domain are probability and information theory.
Understanding these concepts is crucial for developing effective NLP models and applications.
Probability theory is used to model the likelihood of various linguistic events. Here are some key
applications:
Language Modeling:
n-grams: These are contiguous sequences of n items from a given sample of text. For example, bigrams
(n=2) and trigrams (n=3) are used to predict the next word in a sentence by considering the context of
the previous words.
Introduction
Markov Chains: These models treat each word as a state and analyze the probability of transitioning
from one word to another. This is useful for tasks like text generation and speech recognition.
Text Classification:
Naive Bayes: This algorithm uses Bayes' theorem to classify text based on the probability of certain
words appearing in different categories. It is widely used in sentiment analysis and spam detection.
Conditional Probability:
This measures the probability of an event given that another event has occurred. It is fundamental in
tasks like part-of-speech tagging and named entity recognition, where the probability of a word being a
certain part of speech or entity depends on the context.
Introduction
Information theory provides a framework for quantifying the amount of information in data, which is
essential for various NLP tasks:
1. Entropy:
Entropy measures the uncertainty or unpredictability of a random variable. In NLP, it quantifies
the amount of information required to describe a dataset. Higher entropy indicates greater
unpredictability, while lower entropy indicates more predictability.
Cross-Entropy Loss: This measures the difference between two probability distributions and is
used to evaluate the performance of machine learning models by comparing the predicted
distribution with the true distribution.
Introduction
2. Mutual Information:
Mutual information quantifies the dependency between two random variables. In NLP, it is used for
feature selection, where features with higher mutual information scores are more informative for the
model.
3. Kullback-Leibler (KL) Divergence:
KL divergence measures the difference between two probability distributions. It is used to compare
the true distribution of data with the predicted distribution, helping in model evaluation and
regularization techniques like variational inference in Bayesian neural networks.
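As a concrete toy illustration of KL divergence, the sketch below compares two small hand-made distributions in Python; the numbers are assumptions chosen only for demonstration.
```python
import numpy as np

# Two made-up probability distributions over the same three outcomes.
p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

# KL divergence: D(p || q) = sum_x p(x) * log2( p(x) / q(x) )
kl_pq = np.sum(p * np.log2(p / q))
print(f"KL(p || q) = {kl_pq:.4f} bits")   # positive, because q differs from p
```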
Practical use of Information Theory in NLP
Feature Selection: Using mutual information to identify the most informative features for building
predictive models, thereby improving model performance by reducing dimensionality and removing
irrelevant features.
Decision Trees: Information gain, based on entropy, is used to split nodes in decision trees, leading to
more efficient and accurate models by reducing uncertainty about the target variable.
Regularization and Model Selection: KL divergence is used in regularization techniques to minimize
the difference between the approximate and true posterior distributions, achieving better model
regularization and performance.
Information Bottleneck: This method aims to find a compressed representation of the input data that
retains maximal information about the output, used in deep learning for learning efficient representations.
Probability Concepts
Probability and Information Theory are foundational to understanding and building Natural
Language Processing (NLP) systems. They provide the mathematical frameworks for modelling
uncertainty, quantifying information, and optimizing decision-making in language tasks.
Probability theory helps in modelling and predicting linguistic phenomena by managing
uncertainties inherent in natural language. Since human language is complex, ambiguous, and
context-dependent, probabilistic methods allow us to make inferences and predictions about
language data.
Information Theory quantifies the amount of information and helps in designing efficient
encoding, compression, and communication systems for text data. It is essential for
understanding language representation and optimizing models.
Applications of Probability in NLP
Language Modeling: Estimating the likelihood of a sequence of words (e.g., P(w1, w2, …, wn)).
Part-of-Speech Tagging: Computing the most probable sequence of tags given a sequence
of words using Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs).
Machine Translation: Determining the probability of a target language sentence given a
source language sentence (P(target | source)).
Speech Recognition: Identifying the most probable sequence of words corresponding to an
audio signal.
Key Concepts of Probability in NLP
Random Variables: Representing linguistic features (e.g., word frequencies).
Conditional Probability: Modeling dependencies (e.g., P(word | previous words)).
Bayes’ Theorem: Used in spam filtering and probabilistic reasoning
(P(A|B) = P(B|A) · P(A) / P(B)).
Basics of Probability Theory
Random Variables: A random variable is a variable whose value is unknown or a function that assigns values
to each of an experiment’s outcomes. Random variables are often designated by letters and can be classified
as discrete and continuous.
What is a discrete random variable?
Discrete random variables take on a countable number of distinct values. Consider an experiment where a coin is
tossed three times. If X represents the number of times that the coin comes up heads, then X is a discrete random
variable that can only have the values 0, 1, 2, or 3 (from no heads in three successive coin tosses to all heads). No
other value is possible for X.
What is a continuous random variable?
Continuous random variables can represent any value within a specified range or interval and can take on an
infinite number of possible values. An example of a continuous random variable would be an experiment that
involves measuring the amount of rainfall in a city over a year or the average height of a random group of 25
people.
Probability distributions
A probability distribution is a statistical function that describes all the possible values and
likelihoods that a random variable can take within a given range. This range will be bounded
between the minimum and maximum possible values. However, where a particular value is likely
to fall within the distribution depends on several factors, such as the distribution's mean, standard deviation, skewness, and kurtosis.
Types of probability distribution
Binomial
The binomial distribution evaluates the probability of an event occurring several times
over a given number of trials given the event’s probability in each trial. It may be
generated by keeping track of how many free throws a basketball player makes in a game,
where 1 = a basket and 0 = a miss.
Another example would be to use a coin and figure out the probability of that coin
coming up heads in 10 straight flips. A binomial distribution is discrete rather than
continuous because each trial has only two possible outcomes, so the number of successes can take only whole-number values.
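As an illustrative sketch (the numbers below are made up), the binomial probability of exactly k successes in n trials with per-trial success probability p can be computed directly:
```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# Probability of exactly 7 heads in 10 flips of a fair coin.
print(binomial_pmf(7, 10, 0.5))  # about 0.117
```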
Types of probability distribution
Normal
The most commonly used distribution is the normal distribution. This is used frequently in
finance, investing, science, and engineering. The normal distribution is fully characterized by
its mean and standard deviation. The distribution is not skewed, so it is symmetric, and it is
depicted as a bell-shaped curve when plotted. The standard normal distribution is defined by a
mean (average) of zero and a standard deviation of one, with a skew of zero and kurtosis = 3.
Types of probability distribution
Poisson distribution
The Poisson distribution is a discrete probability distribution that models the number of
events occurring within a fixed interval of time or space. These events must happen
independently of each other, and the average rate (mean number of occurrences) must be
constant.
The key characteristic of the Poisson distribution is that it describes the probability of a
given number of events happening within a specified interval when the events are rare and
independent.
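For illustration, the Poisson probability of exactly k events at average rate λ is λ^k · e^(−λ) / k!; the sketch below evaluates it for made-up values (a rare word averaging 0.5 occurrences per document).
```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events in an interval, given average rate lam)."""
    return (lam ** k) * exp(-lam) / factorial(k)

# Made-up example: a rare word occurs 0.5 times per document on average;
# probability of seeing it exactly twice in one document.
print(poisson_pmf(2, 0.5))  # about 0.076
```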
Types of probability distribution
Bernoulli distribution
A Bernoulli distribution is a discrete probability distribution that describes a random
experiment with only two possible outcomes: success (usually denoted by 1) or failure
(usually denoted by 0).
Gaussian distribution
A Gaussian distribution, also known as a normal distribution, is a type of continuous
probability distribution characterized by its bell-shaped curve. It's one of the most
commonly used probability distributions in statistics and many other fields.
What is Entropy?
Entropy is a measure of uncertainty or randomness associated with a random variable. In the context of NLP, it
quantifies the amount of information contained in a message or text. It is calculated as:
H(X) = - Σ P(x) * log₂ P(x)
Where:
H(X): Entropy of the random variable X
P(x): Probability of the event x
Cross entropy : Cross-entropy measures the difference between two probability distributions. In NLP, it's used to
evaluate the performance of language models. It is calculated as H(p, q) = - Σ p(x) * log₂ q(x)
Where,
p(x): True probability distribution
q(x): Predicted probability distribution
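A minimal sketch of both formulas in Python, using small made-up distributions (the values of p and q are illustrative assumptions, not taken from any corpus):
```python
import numpy as np

def entropy(p):
    """H(X) = -sum P(x) * log2 P(x); zero-probability events contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) * log2 q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p_true = [0.7, 0.2, 0.1]    # made-up "true" distribution
q_model = [0.5, 0.3, 0.2]   # made-up model prediction

print(entropy(p_true))                 # about 1.16 bits
print(cross_entropy(p_true, q_model))  # always >= entropy(p_true)
```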
Bayes’ Theorem
Bayes' Theorem is a fundamental rule of probability that allows us to update our beliefs about
the probability of an event based on new evidence. In the context of NLP, it's used to calculate
the probability of a particular class or category given a piece of text.
The formula is: P(A|B) = P(B|A) * P(A) / P(B)
Where:
P(A|B): Probability of event A given event B
P(B|A): Probability of event B given event A
P(A): Prior probability of event A
P(B): Prior probability of event B
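A small numerical sketch of the formula applied to spam filtering; all probabilities here are made-up values purely for illustration.
```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Made-up numbers: "free" appears in 60% of spam emails, 20% of all mail is spam,
# and "free" appears in 15% of all mail.
p_spam_given_free = bayes(p_b_given_a=0.6, p_a=0.2, p_b=0.15)
print(p_spam_given_free)  # 0.8
```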
Chain rule of conditional probability
The chain rule in probability is a fundamental concept that helps us calculate the probability of
a sequence of events. Let's break it down from the basics:
Conditional probability is the probability of an event occurring given that another event has
already occurred. It's denoted as P(A|B), which reads “the probability of A given B”.
The chain rule formula states that the probability of a sequence of events can be calculated by
multiplying the conditional probabilities of each event. Mathematically, it's represented as:
P(A ∩ B ∩ C) = P(A) × P(B|A) × P(C|A ∩ B)
Here:
- P(A ∩ B ∩ C) is the probability of the sequence of events A, B, and C occurring.
- P(A) is the probability of event A occurring.
- P(B|A) is the conditional probability of event B occurring given that event A has occurred.
- P(C|A ∩ B) is the conditional probability of event C occurring given that events A and B have occurred.
Chain rule of conditional probability- Example
Suppose we want to calculate the joint probability that a person is a smoker (S), has a family
history of smoking (F), and is over 30 years old (O).
We can use the chain rule formula:
P(S ∩ F ∩ O) = P(O) × P(F|O) × P(S|F ∩ O)
Assuming we have the following probabilities:
- P(O) = 0.6 (probability of being over 30 years old)
- P(F|O) = 0.4 (probability of having a family history of smoking given that you're over 30 years old)
- P(S|F ∩ O) = 0.7 (probability of being a smoker given that you have a family history of smoking and are over 30 years old)
Using the chain rule formula, we get:
P(S ∩ F ∩ O) = 0.6 × 0.4 × 0.7 = 0.168
Therefore, the probability of a person being a smoker, having a family history of smoking, and being over 30 years
old is approximately 16.8%.
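The same worked example as a short snippet (using the made-up probabilities from above):
```python
p_O = 0.6           # P(over 30)
p_F_given_O = 0.4   # P(family history | over 30)
p_S_given_FO = 0.7  # P(smoker | family history and over 30)

# Chain rule: P(S ∩ F ∩ O) = P(O) * P(F|O) * P(S|F ∩ O)
p_joint = p_O * p_F_given_O * p_S_given_FO
print(p_joint)  # 0.168
```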
Applications of Bayes’ Theorem in NLP
Text Classification: Calculating the probability that a document belongs to a specific class
given its features.
Information Retrieval: Ranking documents based on their relevance to a query.
Machine Translation: Selecting the most probable translation for a given word or phrase.
Example: Suppose we want to classify an email as spam or not spam. We can use Bayes'
Theorem to calculate the probability that an email is spam given the presence of certain
keywords (e.g., "free," "urgent," "money").
Practical applications of Probability in NLP
Language Modeling: Language modeling, or LM, is the use of various statistical and
probabilistic techniques to determine the probability of a given sequence of words
occurring in a sentence. Language models analyze bodies of text data to provide a basis
for their word predictions.
Example: N-gram Models:
Simple models that predict the next word based on the previous N words.
Limited by data sparsity and the curse of dimensionality.
Practical applications of Probability in NLP
Text classification: Text classification analyzes the content and meaning of text to assign it
a label. It can be used to organize and structure large amounts of unstructured text, such as
legal documents, contracts, research documents, and email.
Example: Naive Bayes Classifier:
Assumes feature independence to simplify calculations.
Effective for many text classification tasks, especially with large datasets.
Maximum Entropy Classifier:
More flexible than Naive Bayes, but can be computationally expensive.
Often used for tasks like sentiment analysis and topic classification.
Practical applications of Probability in NLP
Information extraction (IE) is the task of automatically extracting structured information
from unstructured and/or semi-structured machine-readable documents and other
electronically represented sources. Typically, this involves processing human language texts by
means of natural language processing (NLP).
Example:
Probabilistic Retrieval Models:
Rank documents based on their probability of relevance to a query.
Incorporate language models to improve retrieval accuracy.
Language Modelling
Language modelling is the task of determining the probability of a sequence of words. Language modelling is
used in various applications such as Speech Recognition, Spam filtering, etc. Language modelling is the key aim
behind implementing many state-of-the-art Natural Language Processing models.
Two methods of Language Modeling:
Statistical Language Modelling: Statistical Language Modelling, or Language Modelling, is the development of
probabilistic models that can predict the next word in a sequence given the words that precede it.
N-gram language modelling is a common example.
Neural Language Modelling: Neural network methods are achieving better results than classical methods both on
standalone language models and when models are incorporated into larger models on challenging tasks like
speech recognition and machine translation. One common way of building a neural language model is through word
embeddings.
Introduction to N-gram models in NLP
An N-gram model is a type of probabilistic language model used in Natural Language
Processing (NLP) to predict the likelihood of a sequence of words. The term "N-gram" refers
to a contiguous sequence of 'n' items from a given sample of text or speech. These items can
be characters, words, or even phonemes. The most common types of N-grams are:
Unigrams (n=1): Single words.
Bigrams (n=2): Pairs of consecutive words.
Trigrams (n=3): Sequences of three consecutive words.
Higher-order N-grams: Sequences of more than three words.
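A minimal sketch of extracting these N-grams from a tokenized sentence (the helper name `ngrams` is an illustrative choice, not a standard API):
```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This article is on NLP".split()
print(ngrams(tokens, 1))  # unigrams: ('This',), ('article',), ...
print(ngrams(tokens, 2))  # bigrams:  ('This', 'article'), ('article', 'is'), ...
print(ngrams(tokens, 3))  # trigrams: ('This', 'article', 'is'), ...
```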
N-gram Language Model
N-gram can be defined as the contiguous sequence of n items from a given sample of text or
speech. The items can be letters, words, or base pairs according to the application. N-grams are
typically collected from a text or speech corpus (a long text dataset).
For instance, N-grams can be unigrams like (“This”, “article”, “is”, “on”, “NLP”) or bigrams
(“This article”, “article is”, “is on”, “on NLP”).
An N-gram language model predicts the probability of a given N-gram within any sequence
of words in a language. A well-crafted N-gram model can effectively predict the next word in
a sentence, which is essentially determining the value of p(w∣h), where h is the history or
context and w is the word to predict.
N-gram Models
Let’s begin with the task of computing P(w|h), the probability of a word w given some history h.
Suppose the history h is “The water of Walden Pond is so beautifully” and we want to know the
probability that the next word is blue: P(blue | The water of Walden Pond is so beautifully)
One way to estimate this probability is directly from relative frequency counts: take a very large
corpus, count the number of times we see “The water of Walden Pond is so beautifully”, and count
the number of times this is followed by “blue” . This would be answering the question “Out of the
times we saw the history h, how many times was it followed by the word w”, as follows:
P(blue | The water of Walden Pond is so beautifully) =
C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)
N-gram Models
N-gram models work by calculating the probability of a word given the previous 'n-1' words. This is
done using the following steps:
Corpus Collection: A large corpus of text is collected to train the model.
Tokenization: The text is tokenized into individual words or characters.
Counting N-grams: The frequency of each N-gram in the corpus is counted.
Calculating Probabilities: The probability of each N-gram is calculated using the frequency counts.
For example, the probability of a bigram P(wi | wi−1) is given by:
P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1)
where Count(wi−1, wi) is the number of times the bigram (wi−1, wi) appears in the corpus,
and Count(wi−1) is the number of times the word wi−1 appears in the corpus.
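A minimal sketch of these steps on a tiny made-up corpus, estimating the bigram probability from raw counts:
```python
from collections import Counter

corpus = ["this article is on NLP", "this article is short"]  # made-up corpus

# Tokenization
sentences = [s.split() for s in corpus]

# Counting unigrams and bigrams
unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) = Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("article", "is"))  # 2/2 = 1.0
print(bigram_prob("is", "on"))       # 1/2 = 0.5
```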
Applications of N-gram in industry
N-gram models are widely used in various NLP tasks due to their simplicity and effectiveness.
Some of the key applications include:
Language Modelling: N-gram models are fundamental in language modelling, where they
predict the next word in a sequence based on the previous 'n-1' words. This is useful for tasks
like text generation and speech recognition.
Text Generation: By predicting the next word in a sequence, N-gram models can generate
coherent and contextually relevant text. This is used in applications like auto-completion and
chatbots.
Spelling Correction: N-gram models can be used to correct spelling errors by identifying and
suggesting the most probable word sequences based on the context.
Applications of N-gram in industry
Sentiment Analysis: N-grams help in capturing local context and dependencies in text
data, which is crucial for sentiment analysis tasks where understanding the context is
important for accurate classification.
Machine Translation: N-gram models are used in statistical machine translation systems
to translate text from one language to another by modelling the probability of word
sequences in both languages.
Smoothing Techniques for N-gram Model
Smoothing techniques are essential in Natural Language Processing (NLP) to handle the issue of
data sparsity and improve the generalization of language models. These techniques adjust the
estimated probabilities of n-grams to ensure that all possible word sequences have a non-zero
probability, even if they do not appear in the training data. Here are some of the most commonly
used smoothing techniques in NLP.
Smoothing Techniques for N-gram Model
1. Additive Smoothing (Laplace Smoothing)
Additive smoothing is one of the simplest and most widely used smoothing techniques. It involves
adding a small constant value (usually denoted as α) to the count of each n-gram. The adjusted
probability is then calculated as:
P(wi | wi−1) = (Count(wi−1, wi) + α) / (Count(wi−1) + αV)
where V is the vocabulary size. This method ensures that no probability is zero, even for unseen n-grams.
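A minimal sketch of an add-α (Laplace) smoothed bigram estimate; the counts, vocabulary size, and α below are illustrative assumptions:
```python
def laplace_bigram_prob(prev, word, bigram_counts, unigram_counts,
                        vocab_size, alpha=1.0):
    """P(word | prev) = (Count(prev, word) + alpha) / (Count(prev) + alpha * V)."""
    return ((bigram_counts.get((prev, word), 0) + alpha)
            / (unigram_counts.get(prev, 0) + alpha * vocab_size))

# Made-up counts: the bigram ("on", "python") was never seen, yet its smoothed
# probability is small but non-zero instead of exactly zero.
print(laplace_bigram_prob("on", "python", {}, {"on": 1}, vocab_size=7))  # 0.125
```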
Smoothing Techniques for N-gram Model
2. Good-Turing Smoothing
Good-Turing smoothing reallocates the probability mass from seen n-grams to unseen n-grams. It
estimates the probability of unseen n-grams by using the count of n-grams that occur once in the
training data. The formula for the adjusted count in Good-Turing smoothing is:
c* = (c + 1) × Nc+1 / Nc
where c is the count of the n-gram, Nc is the number of n-grams that occur c times, and Nc+1 is
the number of n-grams that occur c+1 times.
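A short sketch of the adjusted-count step using made-up frequency-of-frequency counts:
```python
def good_turing_adjusted_count(c, freq_of_freq):
    """c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of distinct
    n-grams observed exactly c times."""
    return (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]

# Made-up counts: 100 n-grams seen once, 40 seen twice, 15 seen three times.
N = {1: 100, 2: 40, 3: 15}
print(good_turing_adjusted_count(1, N))  # 2 * 40 / 100 = 0.8
print(good_turing_adjusted_count(2, N))  # 3 * 15 / 40 = 1.125
```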
Smoothing Techniques for N-gram Model
3. Kneser-Ney Smoothing
Kneser-Ney smoothing is an advanced technique that considers the context of n-grams. It adjusts
the probability estimates based on the frequency of n-grams and their contexts. The formula for
Kneser-Ney smoothing involves a recursive calculation that considers lower-order n-grams; for bigrams it can be written as:
P(wi | wi−1) = max(Count(wi−1, wi) − d, 0) / Count(wi−1) + λ(wi−1) × Pcontinuation(wi)
where d is a discount factor, λ(wi−1) is a normalization factor that distributes the discounted probability mass, and Pcontinuation(wi) is the lower-order continuation probability of wi.
Smoothing Techniques for N-gram Model
4. Witten-Bell Smoothing
Witten-Bell smoothing estimates the probability of unseen events by considering the number of
unique events observed in the training data. It adjusts the probability estimates based on the number of
unique n-grams and their counts; for bigrams, the interpolated estimate can be written as:
P(wi | wi−1) = (Count(wi−1, wi) + T(wi−1) × P(wi)) / (Count(wi−1) + T(wi−1))
where T(wi−1) is the number of unique n-grams that follow wi−1, and P(wi) is the lower-order (unigram) estimate.
Smoothing Techniques for N-gram Model
5. Jelinek-Mercer Smoothing
Jelinek-Mercer smoothing combines the maximum likelihood estimate with a lower-order model
using a linear interpolation technique. It calculates the probability as a weighted sum of the maximum
likelihood estimate and a lower-order model:
P(wi | wi−1) = λ × PML(wi | wi−1) + (1 − λ) × P(wi)
where λ is a weighting factor that balances the contribution of the higher- and lower-order models.
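A minimal sketch of the interpolation, with a made-up weight λ and made-up component probabilities:
```python
def jelinek_mercer(p_bigram_ml, p_unigram, lam=0.7):
    """P_JM(w | prev) = lam * P_ML(w | prev) + (1 - lam) * P(w)."""
    return lam * p_bigram_ml + (1 - lam) * p_unigram

# An unseen bigram (ML estimate 0) still receives probability mass
# from the lower-order unigram model.
print(jelinek_mercer(p_bigram_ml=0.0, p_unigram=0.01))  # 0.003
```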
Advantages of N-gram model
N-gram models offer several advantages that make them indispensable in various NLP
applications:
Simplicity: They are easy to understand and implement, making them a good starting point
for many NLP tasks.
Efficiency: N-gram models are computationally efficient and can handle large datasets
effectively.
Effectiveness: Despite their simplicity, N-gram models often perform well in
practice, especially for tasks that require capturing local dependencies in text data.
Limitations of N-gram model
While N-gram models are powerful, they also have some limitations:
Sparsity: As 'n' increases, the number of possible N-grams grows exponentially, leading to
data sparsity issues where many N-grams may not appear in the training corpus.
Contextual Limitations: N-gram models capture only local context and dependencies, which
may not be sufficient for understanding long-range dependencies in text data.
Data Requirements: They require large amounts of training data to accurately estimate
probabilities for all possible N-grams, which can be a challenge for low-resource languages
or specialized domains.
Machine Learning Models for NLP
Natural Language Processing (NLP) is a dynamic field within artificial intelligence that
bridges the gap between human language and machine interpretation. A range of machine
learning models and tools are employed to solve various NLP tasks like sentiment analysis,
machine translation, text summarization, and entity recognition. Below are key models and
tools widely used in NLP, along with their capabilities and applications:
Naïve Bayes Model
Naive Bayes is a family of simple yet effective probabilistic classifiers based on Bayes' theorem,
assuming independence between features. Despite its simplicity, it performs remarkably well for
many NLP tasks.
Applications:
Sentiment Analysis: Naive Bayes can categorize text into sentiments (positive, negative, neutral) based
on word probabilities.
Spam Filtering: It’s extensively used in email systems to distinguish spam emails from legitimate ones
by analysing textual content.
Text Classification: Useful for classifying documents into predefined categories, such as news topics or
product reviews.
Naïve Bayes Model
The Naive Bayes model is a probabilistic machine learning algorithm used for classification
tasks. It is based on Bayes' Theorem and assumes that the features are conditionally independent
given the class label. This assumption, often referred to as the "naive" assumption, simplifies the
computation and makes the model efficient and easy to implement.
Bayes' Theorem: The foundation of the Naive Bayes model is Bayes' Theorem, which describes
the probability of an event based on prior knowledge of conditions that might be related to the
event.
Mathematically, Bayes’ theorem can be expressed as:
P(A|B) = P(B|A) * P(A) / P(B)
Naïve Bayes Model
Where P(A∣B) is the posterior probability of class A given predictor B.
P(B∣A) is the likelihood, which is the probability of predictor B given class A.
P(A) is the prior probability of class A.
P(B) is the prior probability of predictor B.
Conditional Independence: The Naive Bayes model assumes that all features are independent of each
other given the class label. This means that the presence of one feature does not affect the presence of
another feature. This assumption simplifies the computation significantly.
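A brief sketch of a Naive Bayes text classifier using scikit-learn; the tiny labelled corpus below is made up, and a real system would be trained on far more data.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 1 = spam, 0 = not spam.
docs = ["win money now", "urgent free offer", "meeting at noon", "see you tomorrow"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # bag-of-words feature counts
model = MultinomialNB().fit(X, labels)    # Bayes' theorem + independence assumption

print(model.predict(vectorizer.transform(["free money offer"])))  # likely [1]
```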
Naïve Bayes Model
1. Text Classification: Naive Bayes is widely used in text classification tasks such as spam
detection, sentiment analysis, and topic classification. Its ability to handle high-
dimensional data makes it particularly effective in these applications.
2. Spam Filtering: One of the most popular applications of Naive Bayes is in email spam
filtering. The model can quickly classify emails as spam or not spam based on the presence
of certain words or phrases.
3. Sentiment Analysis: Naive Bayes models are used to determine the sentiment of a piece of
text, such as a review or a social media post. This is useful for businesses to understand
customer opinions and feedback.
Naïve Bayes Model
4. Medical Diagnosis: In healthcare, Naive Bayes can be used to predict the likelihood of a
patient having a certain disease based on symptoms and other medical data. This helps in
early diagnosis and treatment planning.
5. Market Analysis: Naive Bayes is used in marketing to classify customers into different
segments based on their purchasing behaviour, demographics, and other attributes. This
helps in targeted marketing and personalized recommendations.
Maximum Entropy Model
The Maximum Entropy Model (MaxEnt) is a probabilistic framework used in machine learning
and natural language processing (NLP) to estimate the probability distribution of a set of
outcomes. The principle of maximum entropy suggests choosing the probability distribution that
maximizes entropy while satisfying a set of constraints derived from the observed data. This
approach ensures that the model makes the least assumptions about the data, making it a powerful
tool for modelling complex systems where the underlying relationships are not well understood.
Mathematically, the maximum entropy principle can be expressed as:
P(x) = (1/Z) * exp( Σi λi fi(x) )
Maximum Entropy Model
Where P(x) is the probability distribution.
Z is the normalization constant (partition function).
λi are the Lagrange multipliers.
fi(x) are the feature functions.
Maximum Entropy Models have been widely used in various NLP tasks due to their ability to handle
complex dependencies and incorporate multiple sources of evidence
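A minimal NumPy sketch of evaluating this distribution for a handful of candidate outcomes; the feature values and the λ weights are made-up assumptions.
```python
import numpy as np

# Rows = candidate outcomes x, columns = feature functions f_i(x) (made-up values).
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
lambdas = np.array([0.5, 1.2])     # made-up Lagrange multipliers

scores = F @ lambdas               # sum_i lambda_i * f_i(x) for each outcome
Z = np.sum(np.exp(scores))         # partition function (normalization constant)
P = np.exp(scores) / Z             # maximum entropy distribution

print(P, P.sum())                  # the probabilities sum to 1
```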
Applications of Maximum Entropy Model
Part-of-Speech (POS) Tagging: POS tagging involves assigning a grammatical category (e.g.,
noun, verb, adjective) to each word in a sentence. MaxEnt models are effective in this task because
they can combine various contextual features to make accurate predictions.
Named Entity Recognition (NER): NER involves identifying and classifying named entities in
text into predefined categories such as names of persons, organizations, locations, etc. MaxEnt
models can leverage multiple features like word context, capitalization, and part-of-speech tags to
improve recognition accuracy.
Text Classification: Text classification tasks, such as sentiment analysis and topic classification,
benefit from MaxEnt models due to their ability to handle high-dimensional feature spaces and
incorporate diverse sources of information.
Applications of Maximum Entropy Model
Language Modeling:
MaxEnt models can be used to build language models that predict the probability of a word given its
context. This is useful for tasks like speech recognition and machine translation.
Text Segmentation:
MaxEnt models are also used in text segmentation tasks, where the goal is to divide a text into
meaningful segments such as sentences or paragraphs. This involves combining various linguistic
features to make accurate segmentation decisions.
In summary, Maximum Entropy Models are a versatile and powerful tool in NLP, capable of handling complex
dependencies and incorporating diverse sources of evidence. Their applications span across various tasks, from POS
tagging and NER to text classification and language modeling, making them an essential component of modern NLP
systems.
Evaluation Metrics for NLP Tasks
In Natural Language Processing (NLP), evaluating the performance of models is crucial to
ensure they meet the desired standards. Three key metrics used for this purpose are Precision,
Recall, and F1-score. These metrics are particularly useful in classification tasks such as
sentiment analysis, named entity recognition, and text classification.
1. Precision: Precision measures the proportion of correctly predicted positive instances (true
positives) out of all instances predicted as positive (true positives + false positives). Precision is
about being precise. It tells us how many of the instances identified as positive are actually
positive. A high precision means that the model is good at avoiding false positives.
Evaluation Metrics for NLP Tasks
2. Recall: Recall measures the proportion of correctly predicted positive instances out of all actual
positive instances. It evaluates how well the model identifies true positives.
A high recall indicates that the model captures most of the actual positives. It is crucial in scenarios
where missing true positives is costly.
Use cases:
In disease screening, high recall ensures that most patients with the disease are correctly identified.
In fraud detection, it minimizes undetected fraudulent activities.
Recall is about being thorough. It tells us how many of the actual positive instances were correctly
identified by the model. A high recall means that the model is good at avoiding false negatives.
Evaluation Metrics for NLP Tasks
3. F1-Score: The F1-score is the harmonic mean of Precision and Recall, providing a balanced
metric that considers both false positives and false negatives. It is useful when there is an uneven
class distribution or when both precision and recall are equally important.
The F1-score is useful when you need a balance between precision and recall, especially in cases where
there is an uneven class distribution (imbalanced datasets). It ranges from 0 to 1, with 1 being the best
possible score.
Use cases:
•Sentiment Analysis: Ensures accurate classification of both positive and negative sentiments.
•Named Entity Recognition (NER): Balances identifying entities correctly while minimizing false positives.
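A short sketch computing all three metrics from raw confusion counts; the true-positive, false-positive, and false-negative numbers below are made up.
```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Made-up confusion counts for a spam classifier.
print(precision_recall_f1(tp=80, fp=20, fn=40))  # approx (0.8, 0.667, 0.727)
```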
Trade-offs and considerations
Precision vs. Recall Trade-off: There is often a trade-off between precision and recall. Improving one
can negatively impact the other. For example, a model that predicts everything as positive will have
high recall but low precision. Conversely, a model that predicts very few positives will have high
precision but low recall.
F1-score for Imbalanced Datasets: The F1-score is particularly useful in scenarios with imbalanced
datasets where one class significantly outnumbers the other. It helps in providing a more nuanced
understanding of model performance by balancing precision and recall.
Ad

More Related Content

Similar to NLP Msc Computer science S2 Kerala University (20)

Probability
ProbabilityProbability
Probability
Neha Raikar
 
Advanced statistics
Advanced statisticsAdvanced statistics
Advanced statistics
Romel Villarubia
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
pujashri1975
 
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
theijes
 
2 or more samples
2 or more samples2 or more samples
2 or more samples
SudhakarNayak11
 
Basic statistics 1
Basic statistics  1Basic statistics  1
Basic statistics 1
Kumar P
 
3 es timation-of_parameters[1]
3 es timation-of_parameters[1]3 es timation-of_parameters[1]
3 es timation-of_parameters[1]
Fernando Jose Damayo
 
Allerton
AllertonAllerton
Allerton
mustafa sarac
 
Lecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptxLecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptx
shakirRahman10
 
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Lifeng (Aaron) Han
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
kevig
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
ijnlc
 
Different types of distributions
Different types of distributionsDifferent types of distributions
Different types of distributions
RajaKrishnan M
 
Basen Network
Basen NetworkBasen Network
Basen Network
guestf7d226
 
Multivariate and Conditional Distribution
Multivariate and Conditional DistributionMultivariate and Conditional Distribution
Multivariate and Conditional Distribution
ssusered887b
 
Analysis of variance
Analysis of varianceAnalysis of variance
Analysis of variance
Shakeel Nouman
 
Basic probability theory and statistics
Basic probability theory and statisticsBasic probability theory and statistics
Basic probability theory and statistics
Learnbay Datascience
 
Inorganic CHEMISTRY
Inorganic CHEMISTRYInorganic CHEMISTRY
Inorganic CHEMISTRY
Saikumar raja
 
Sampling and Central Limit Theorem_18_01_23 new.pptx
Sampling and Central Limit Theorem_18_01_23 new.pptxSampling and Central Limit Theorem_18_01_23 new.pptx
Sampling and Central Limit Theorem_18_01_23 new.pptx
Universitas Pelita Harapan
 
STSTISTICS AND PROBABILITY THEORY .pptx
STSTISTICS AND PROBABILITY THEORY  .pptxSTSTISTICS AND PROBABILITY THEORY  .pptx
STSTISTICS AND PROBABILITY THEORY .pptx
VenuKumar65
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
pujashri1975
 
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
Application of Central Limit Theorem to Study the Student Skills in Verbal, A...
theijes
 
Basic statistics 1
Basic statistics  1Basic statistics  1
Basic statistics 1
Kumar P
 
Lecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptxLecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptx
shakirRahman10
 
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Lifeng (Aaron) Han
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
kevig
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
ijnlc
 
Different types of distributions
Different types of distributionsDifferent types of distributions
Different types of distributions
RajaKrishnan M
 
Multivariate and Conditional Distribution
Multivariate and Conditional DistributionMultivariate and Conditional Distribution
Multivariate and Conditional Distribution
ssusered887b
 
Basic probability theory and statistics
Basic probability theory and statisticsBasic probability theory and statistics
Basic probability theory and statistics
Learnbay Datascience
 
Sampling and Central Limit Theorem_18_01_23 new.pptx
Sampling and Central Limit Theorem_18_01_23 new.pptxSampling and Central Limit Theorem_18_01_23 new.pptx
Sampling and Central Limit Theorem_18_01_23 new.pptx
Universitas Pelita Harapan
 
STSTISTICS AND PROBABILITY THEORY .pptx
STSTISTICS AND PROBABILITY THEORY  .pptxSTSTISTICS AND PROBABILITY THEORY  .pptx
STSTISTICS AND PROBABILITY THEORY .pptx
VenuKumar65
 

Recently uploaded (20)

ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
The Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLabThe Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLab
Journal of Soft Computing in Civil Engineering
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Journal of Soft Computing in Civil Engineering
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Journal of Soft Computing in Civil Engineering
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
Ad

NLP Msc Computer science S2 Kerala University

  • 1. Natural Language Processing MODULE IV Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 2. Introduction Natural Language Processing (NLP) leverages statistical methods to analyse and generate human language. Two fundamental concepts in this domain are probability and information theory. Understanding these concepts is crucial for developing effective NLP models and applications. Probability theory is used to model the likelihood of various linguistic events. Here are some key applications: Language Modeling: n-grams: These are contiguous sequences of n items from a given sample of text. For example, bigrams (n=2) and trigrams (n=3) are used to predict the next word in a sentence by considering the context of the previous words. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 3. Introduction Markov Chains: These models treat each word as a state and analyze the probability of transitioning from one word to another. This is useful for tasks like text generation and speech recognition. Text Classification: Naive Bayes: This algorithm uses Bayes' theorem to classify text based on the probability of certain words appearing in different categories. It is widely used in sentiment analysis and spam detection. Conditional Probability: This measures the probability of an event given that another event has occurred. It is fundamental in tasks like part-of-speech tagging and named entity recognition, where the probability of a word being a certain part of speech or entity depends on the context. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 4. Introduction Information theory provides a framework for quantifying the amount of information in data, which is essential for various NLP tasks: Entropy: Entropy measures the uncertainty or unpredictability of a random variable. In NLP, it quantifies the amount of information required to describe a dataset. Higher entropy indicates greater unpredictability, while lower entropy indicates more predictability. Cross-Entropy Loss: This measures the difference between two probability distributions and is used to evaluate the performance of machine learning models by comparing the predicted distribution with the true distribution. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 5. Introduction 2. Mutual Information: Mutual information quantifies the dependency between two random variables. In NLP, it is used for feature selection, where features with higher mutual information scores are more informative for the model. 3. Kullback-Leibler (KL) Divergence: KL divergence measures the difference between two probability distributions. It is used to compare the true distribution of data with the predicted distribution, helping in model evaluation and regularization techniques like variational inference in Bayesian neural networks. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 6. Practical use of Information Theory in NLP Feature Selection: Using mutual information to identify the most informative features for building predictive models, thereby improving model performance by reducing dimensionality and removing irrelevant features. Decision Trees: Information gain, based on entropy, is used to split nodes in decision trees, leading to more efficient and accurate models by reducing uncertainty about the target variable. Regularization and Model Selection: KL divergence is used in regularization techniques to minimize the difference between the approximate and true posterior distributions, achieving better model regularization and performance. Information Bottleneck: This method aims to find a compressed representation of the input data that retains maximal information about the output, used in deep learning for learning efficient representations. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 7. Probability Concepts Probability and Information Theory are foundational to understanding and building Natural Language Processing (NLP) systems. They provide the mathematical frameworks for modelling uncertainty, quantifying information, and optimizing decision-making in language tasks. Probability theory helps in modelling and predicting linguistic phenomena by managing uncertainties inherent in natural language. Since human language is complex, ambiguous, and context-dependent, probabilistic methods allow us to make inferences and predictions about language data. Information Theory quantifies the amount of information and helps in designing efficient encoding, compression, and communication systems for text data. It is essential for understanding language representation and optimizing models. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 8. Applications of Probability in NLP Language Modeling: Estimating the likelihood of a sequence of words (e.g., P(w1,w2,…,wn). Part-of-Speech Tagging: Computing the most probable sequence of tags given a sequence of words using Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs). Machine Translation: Determining the probability of a target language sentence given a source language sentence (P(target∣source)P(target∣source)). Speech Recognition: Identifying the most probable sequence of words corresponding to an audio signal. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 9. Key Concepts of Probability in NLP Random Variables: Representing linguistic features (e.g., word frequencies). Conditional Probability: Modeling dependencies (e.g., P(word ∣ previous words)P(word ∣ previous words)). Bayes’ Theorem: Used in spam filtering and probabilistic reasoning (P(A∣B)=P(B∣A)⋅P(A)P(B)P(A∣B)=P(B)P(B∣A)⋅P(A)). Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 10. Basics of Probability Theory Random Variables: A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes. Random variables are often designated by letters and can be classified as discrete and continuous. What is a discrete random variable ? Discrete random variables take on a countable number of distinct values. Consider an experiment where a coin is tossed three times. If X represents the number of times that the coin comes up heads, then X is a discrete random variable that can only have the values 0, 1, 2, or 3 (from no heads in three successive coin tosses to all heads). No other value is possible for X. What is a continuous variable ? Continuous random variables can represent any value within a specified range or interval and can take on an infinite number of possible values. An example of a continuous random variable would be an experiment that involves measuring the amount of rainfall in a city over a year or the average height of a random group of 25 people. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 11. Probability distributions A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This range will be bounded between the minimum and maximum possible values. However, where the possible value is likely to be plotted on the probability distribution depends on several factors. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 12. Types of probability distribution Binomial The binomial distribution evaluates the probability of an event occurring several times over a given number of trials, given the event's probability in each trial. It may be generated by keeping track of how many free throws a basketball player makes in a game, where 1 = a basket and 0 = a miss. Another example would be to use a coin and figure out the probability of that coin coming up heads in 10 straight flips. A binomial distribution is discrete rather than continuous because each trial has only two valid outcomes, one or zero. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 13. Types of probability distribution Normal The most commonly used distribution is the normal distribution. It is used frequently in finance, investing, science, and engineering. The normal distribution is fully characterized by its mean and standard deviation. It is symmetric (not skewed) and is depicted as a bell-shaped curve when plotted. The standard normal distribution has a mean (average) of zero and a standard deviation of one, with a skew of zero and kurtosis of 3. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 14. Types of probability distribution Poisson distribution The Poisson distribution is a discrete probability distribution that models the number of events occurring within a fixed interval of time or space. These events must happen independently of each other, and the average rate (mean number of occurrences) must be constant. The key characteristic of the Poisson distribution is that it describes the probability of a given number of events happening within a specified interval when the events are rare and independent. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 15. Types of probability distribution Bernoulli distribution A Bernoulli distribution is a discrete probability distribution that describes a random experiment with only two possible outcomes: success (usually denoted by 1) or failure (usually denoted by 0). Gaussian distribution A Gaussian distribution, also known as a normal distribution, is a type of continuous probability distribution characterized by its bell-shaped curve. It's one of the most commonly used probability distributions in statistics and many other fields. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 16. What is Entropy Entropy is a measure of uncertainty or randomness associated with a random variable. In the context of NLP, it quantifies the amount of information contained in a message or text:
H(X) = - Σ P(x) * log₂ P(x)
where H(X) is the entropy of the random variable X and P(x) is the probability of the event x.
Cross-entropy: Cross-entropy measures the difference between two probability distributions. In NLP, it is used to evaluate the performance of language models. It is calculated as:
H(p, q) = - Σ p(x) * log₂ q(x)
where p(x) is the true probability distribution and q(x) is the predicted probability distribution.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
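The following short Python sketch evaluates the two formulas above on a toy four-event distribution; p_true and q_model are hypothetical values chosen only for illustration.

import math

def entropy(p):
    # H(X) = -sum P(x) * log2 P(x), skipping zero-probability events
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum p(x) * log2 q(x); p is the true distribution, q the model's
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p_true = [0.5, 0.25, 0.125, 0.125]   # true distribution (toy values)
q_model = [0.4, 0.3, 0.2, 0.1]       # model's predicted distribution (toy values)

print(entropy(p_true))                 # 1.75 bits
print(cross_entropy(p_true, q_model))  # about 1.80 bits; always >= the entropy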
• 17. Bayes' Theorem Bayes' Theorem is a fundamental rule of probability that allows us to update our beliefs about the probability of an event based on new evidence. In the context of NLP, it is used to calculate the probability of a particular class or category given a piece of text. The formula is:
P(A|B) = P(B|A) * P(A) / P(B)
where P(A|B) is the probability of event A given event B, P(B|A) is the probability of event B given event A, P(A) is the prior probability of event A, and P(B) is the prior probability of event B.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 18. Chain rule of conditional probability The chain rule in probability is a fundamental concept that helps us calculate the probability of a sequence of events. Let's break it down from the basics: Conditional probability is the probability of an event occurring given that another event has already occurred. It is denoted as P(A|B), which reads "the probability of A given B". The chain rule states that the probability of a sequence of events can be calculated by multiplying the conditional probabilities of each event. Mathematically, it is represented as:
P(A ∩ B ∩ C) = P(A) × P(B|A) × P(C|A ∩ B)
Here, P(A ∩ B ∩ C) is the probability of events A, B, and C all occurring; P(A) is the probability of event A occurring; P(B|A) is the conditional probability of event B occurring given that event A has occurred; and P(C|A ∩ B) is the conditional probability of event C occurring given that events A and B have occurred.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 19. Chain rule of conditional probability - Example Suppose we want to calculate the joint probability that a person is a smoker (S), has a family history of smoking (F), and is over 30 years old (O). Using the chain rule:
P(S ∩ F ∩ O) = P(O) × P(F|O) × P(S|F ∩ O)
Assume the following probabilities: P(O) = 0.6 (probability of being over 30 years old); P(F|O) = 0.4 (probability of having a family history of smoking given that the person is over 30); P(S|F ∩ O) = 0.7 (probability of being a smoker given a family history of smoking and being over 30).
Applying the chain rule: P(S ∩ F ∩ O) = 0.6 × 0.4 × 0.7 = 0.168. Therefore, the probability that a person is a smoker, has a family history of smoking, and is over 30 years old is approximately 16.8%.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 20. Applications of Bayes' theorem in NLP Text Classification: Calculating the probability that a document belongs to a specific class given its features. Information Retrieval: Ranking documents based on their relevance to a query. Machine Translation: Selecting the most probable translation for a given word or phrase. Example: Suppose we want to classify an email as spam or not spam. We can use Bayes' Theorem to calculate the probability that an email is spam given the presence of certain keywords (e.g., "free," "urgent," "money"). Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
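A small numeric sketch of the spam example above, applying Bayes' theorem to a single keyword ("free"); every probability below is hypothetical and chosen only for illustration.

# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam = 0.3             # prior probability that an email is spam (assumed)
p_free_given_spam = 0.6  # P("free" | spam) (assumed)
p_free_given_ham = 0.05  # P("free" | not spam) (assumed)

# P("free") by the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # ~0.837: the keyword sharply raises the spam probability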
• 21. Practical applications of Probability in NLP Language Modeling: Language modeling, or LM, is the use of statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions. Example: N-gram Models: Simple models that predict the next word based on the previous N−1 words. They are limited by data sparsity and the curse of dimensionality. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 22. Practical applications of Probability in NLP Text classification: Text classification analyzes the content and meaning of text to assign it a label. It can be used to organize and structure large amounts of unstructured text, such as legal documents, contracts, research documents, and email. Example: Naive Bayes Classifier: Assumes feature independence to simplify calculations. Effective for many text classification tasks, especially with large datasets. Maximum Entropy Classifier: More flexible than Naive Bayes, but can be computationally expensive. Often used for tasks like sentiment analysis and topic classification. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 23. Practical applications of Probability in NLP Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Example: Probabilistic Retrieval Models: Rank documents based on their probability of relevance to a query and incorporate language models to improve retrieval accuracy. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 24. Language Modelling Language modelling is the way of determining the probability of any sequence of words. Language modelling is used in various applications such as Speech Recognition, Spam filtering, etc. Language modelling is the key aim behind implementing many state-of-the-art Natural Language Processing models. Two methods of Language Modelling: Statistical Language Modelling: Statistical Language Modelling, or Language Modelling, is the development of probabilistic models that can predict the next word in a sequence given the words that precede it. Examples include N-gram language modelling. Neural Language Modelling: Neural network methods achieve better results than classical methods, both as standalone language models and when incorporated into larger models for challenging tasks like speech recognition and machine translation. One way of building a neural language model is through word embeddings. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 25. Introduction to N-gram models in NLP An N-gram model is a type of probabilistic language model used in Natural Language Processing (NLP) to predict the likelihood of a sequence of words. The term "N-gram" refers to a contiguous sequence of 'n' items from a given sample of text or speech. These items can be characters, words, or even phonemes. The most common types of N-grams are:
Unigrams (n=1): Single words.
Bigrams (n=2): Pairs of consecutive words.
Trigrams (n=3): Sequences of three consecutive words.
Higher-order N-grams: Sequences of more than three words.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
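A minimal Python sketch of extracting unigrams, bigrams, and trigrams from a tokenized sentence; the ngrams helper defined here is illustrative, not a library function.

def ngrams(tokens, n):
    # Return all contiguous n-grams (as tuples) from a list of tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This article is on NLP".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams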
• 26. ‘N’ gram Language Model An N-gram can be defined as a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs according to the application. The N-grams are typically collected from a text or speech corpus (a long text dataset). For instance, N-grams can be unigrams like ("This", "article", "is", "on", "NLP") or bigrams ("This article", "article is", "is on", "on NLP"). An N-gram language model predicts the probability of a given N-gram within any sequence of words in a language. A well-crafted N-gram model can effectively predict the next word in a sentence, which is essentially determining the value of p(w|h), where h is the history or context and w is the word to predict. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 27. ‘N’ gram models Let's begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is "The water of Walden Pond is so beautifully" and we want to know the probability that the next word is blue: P(blue | The water of Walden Pond is so beautifully). One way to estimate this probability is directly from relative frequency counts: take a very large corpus, count the number of times we see "The water of Walden Pond is so beautifully", and count the number of times this is followed by "blue". This answers the question "Out of the times we saw the history h, how many times was it followed by the word w?", as follows:
P(blue | The water of Walden Pond is so beautifully) = C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 28. ‘N’ gram models N-gram models work by calculating the probability of a word given the previous 'n-1' words. This is done using the following steps:
Corpus Collection: A large corpus of text is collected to train the model.
Tokenization: The text is tokenized into individual words or characters.
Counting N-grams: The frequency of each N-gram in the corpus is counted.
Calculating Probabilities: The probability of each N-gram is calculated from the frequency counts. For example, the probability of a bigram is given by
P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1)
where Count(wi−1, wi) is the number of times the bigram (wi−1, wi) appears in the corpus, and Count(wi−1) is the number of times the word wi−1 appears in the corpus.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
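The steps above can be sketched in a few lines of Python: count unigrams and bigrams over a corpus and apply the estimate P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1). The two-sentence corpus is a toy example invented for illustration; real models are trained on much larger text.

from collections import Counter

corpus = [
    "the water of the pond is blue".split(),
    "the water is cold".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(prev_word, word):
    # Maximum likelihood estimate of P(word | prev_word)
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("water", "of"))  # 1/2 = 0.5
print(bigram_prob("water", "is"))  # 1/2 = 0.5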
  • 29. Applications of N-gram in industry N-gram models are widely used in various NLP tasks due to their simplicity and effectiveness. Some of the key applications include: Language Modelling: N-gram models are fundamental in language modelling, where they predict the next word in a sequence based on the previous 'n-1' words. This is useful for tasks like text generation and speech recognition. Text Generation: By predicting the next word in a sequence, N-gram models can generate coherent and contextually relevant text. This is used in applications like auto-completion and chatbots. Spelling Correction: N-gram models can be used to correct spelling errors by identifying and suggesting the most probable word sequences based on the context. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 30. Applications of N-gram in industry Sentiment Analysis: N-grams help in capturing local context and dependencies in text data, which is crucial for sentiment analysis tasks where understanding the context is important for accurate classification. Machine Translation: N-gram models are used in statistical machine translation systems to translate text from one language to another by modelling the probability of word sequences in both languages. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 31. Smoothing Techniques for N-gram Model Smoothing techniques are essential in Natural Language Processing (NLP) to handle the issue of data sparsity and improve the generalization of language models. These techniques adjust the estimated probabilities of n-grams to ensure that all possible word sequences have a non-zero probability, even if they do not appear in the training data. Here are some of the most commonly used smoothing techniques in NLP. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 32. Smoothing Techniques for N-gram Model 1. Additive Smoothing (Laplace Smoothing) Additive smoothing is one of the simplest and most widely used smoothing techniques. It involves adding a small constant value (usually denoted as α) to the count of each n-gram. The adjusted bigram probability is then calculated as
P(wi | wi−1) = (Count(wi−1, wi) + α) / (Count(wi−1) + αV)
where V is the vocabulary size. This method ensures that no probability is zero, even for unseen n-grams.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
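A short sketch of the add-α estimate above, assuming toy bigram and unigram counts invented for illustration; with α = 1 this is classic add-one smoothing.

from collections import Counter

bigram_counts = Counter({("the", "water"): 2, ("water", "is"): 1})
unigram_counts = Counter({"the": 3, "water": 2, "is": 1, "cold": 1})

def smoothed_bigram_prob(prev_word, word, alpha=1.0):
    # (Count(prev, word) + alpha) / (Count(prev) + alpha * V), with V the vocabulary size
    V = len(unigram_counts)
    return (bigram_counts[(prev_word, word)] + alpha) / (unigram_counts[prev_word] + alpha * V)

print(smoothed_bigram_prob("water", "is"))    # seen bigram: (1+1)/(2+4) = 0.333
print(smoothed_bigram_prob("water", "cold"))  # unseen bigram still gets (0+1)/(2+4) = 0.167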
• 33. Smoothing Techniques for N-gram Model 2. Good-Turing Smoothing Good-Turing smoothing reallocates probability mass from seen n-grams to unseen n-grams. It estimates the probability of unseen n-grams by using the count of n-grams that occur exactly once in the training data. Good-Turing smoothing replaces each raw count c with an adjusted count
c* = (c + 1) · Nc+1 / Nc
where c is the count of the n-gram, Nc is the number of n-grams that occur c times, and Nc+1 is the number of n-grams that occur c+1 times.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
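A minimal sketch of the adjusted count above, assuming a toy table of n-gram counts; falling back to the raw count when Nc or Nc+1 is unavailable is a simplification of what practical implementations do.

from collections import Counter

ngram_counts = Counter({"a b": 3, "b c": 1, "c d": 1, "d e": 2, "e f": 1})
freq_of_freq = Counter(ngram_counts.values())   # Nc: here {1: 3, 2: 1, 3: 1}

def good_turing_count(c):
    # c* = (c + 1) * N_{c+1} / N_c, falling back to c when the counts are missing
    if freq_of_freq[c] == 0 or freq_of_freq[c + 1] == 0:
        return c
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

total = sum(ngram_counts.values())
print(good_turing_count(1))      # 2 * 1 / 3 ≈ 0.667: mass is taken from once-seen n-grams
print(freq_of_freq[1] / total)   # 3/8 = 0.375: probability mass reserved for unseen n-grams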
• 34. Smoothing Techniques for N-gram Model 3. Kneser-Ney Smoothing Kneser-Ney smoothing is an advanced technique that considers the context of n-grams. It adjusts the probability estimates based on the frequency of n-grams and their contexts. The formula for Kneser-Ney smoothing involves a recursive calculation that considers lower-order n-grams; for a bigram it is commonly written as
P(wi | wi−1) = max(Count(wi−1, wi) − d, 0) / Count(wi−1) + λ(wi−1) · Pcontinuation(wi)
where d is a discount factor and λ(wi−1) is a normalization factor.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 35. Smoothing Techniques for N-gram Model 4. Witten-Bell Smoothing Witten-Bell smoothing estimates the probability of unseen events by considering the number of unique events observed in the training data. It adjusts the probability estimates based on the number of unique n-grams and their counts; one common interpolated formulation for a bigram is
P(wi | wi−1) = λ(wi−1) · PML(wi | wi−1) + (1 − λ(wi−1)) · P(wi), with λ(wi−1) = Count(wi−1) / (Count(wi−1) + T(wi−1))
where T(wi−1) is the number of unique word types that follow wi−1.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 36. Smoothing Techniques for N-gram Model 5. Jelinek-Mercer Smoothing Jelinek-Mercer smoothing combines the maximum likelihood estimate with a lower-order model using linear interpolation. It calculates the probability as a weighted sum of the maximum likelihood estimate and a lower-order model:
P(wi | wi−1) = λ · PML(wi | wi−1) + (1 − λ) · P(wi)
where λ is a weighting factor that balances the contributions of the higher- and lower-order models.
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
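A sketch of the interpolation above for a bigram model, mixing the bigram maximum likelihood estimate with the unigram probability; the counts and the weight λ = 0.7 are hypothetical.

from collections import Counter

unigram_counts = Counter({"the": 3, "water": 2, "is": 1, "cold": 1})
bigram_counts = Counter({("the", "water"): 2, ("water", "is"): 1})
total_tokens = sum(unigram_counts.values())

def jelinek_mercer(prev_word, word, lam=0.7):
    # lam * P_ML(word | prev_word) + (1 - lam) * P(word)
    p_bigram = (bigram_counts[(prev_word, word)] / unigram_counts[prev_word]
                if unigram_counts[prev_word] else 0.0)
    p_unigram = unigram_counts[word] / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

print(jelinek_mercer("water", "is"))    # seen bigram: interpolated probability
print(jelinek_mercer("water", "cold"))  # unseen bigram still receives unigram mass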
  • 37. Advantages of N-gram model N-gram models offer several advantages that make them indispensable in various NLP applications: Simplicity: They are easy to understand and implement, making them a good starting point for many NLP tasks. Efficiency: N-gram models are computationally efficient and can handle large datasets effectively. Effectiveness: Despite their simplicity, N-gram models often perform well in practice, especially for tasks that require capturing local dependencies in text data. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 38. Limitations of N-gram model While N-gram models are powerful, they also have some limitations: Sparsity: As 'n' increases, the number of possible N-grams grows exponentially, leading to data sparsity issues where many N-grams may not appear in the training corpus. Contextual Limitations: N-gram models capture only local context and dependencies, which may not be sufficient for understanding long-range dependencies in text data. Data Requirements: They require large amounts of training data to accurately estimate probabilities for all possible N-grams, which can be a challenge for low-resource languages or specialized domains. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 39. Machine Learning Models for NLP Natural Language Processing (NLP) is a dynamic field within artificial intelligence that bridges the gap between human language and machine interpretation. A range of machine learning models and tools are employed to solve various NLP tasks like sentiment analysis, machine translation, text summarization, and entity recognition. Below are key models and tools widely used in NLP, along with their capabilities and applications: Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 40. Naïve Bayes Model Naive Bayes is a family of simple yet effective probabilistic classifiers based on Bayes' theorem, assuming independence between features. Despite its simplicity, it performs remarkably well for many NLP tasks. Applications: Sentiment Analysis: Naive Bayes can categorize text into sentiments (positive, negative, neutral) based on word probabilities. Spam Filtering: It’s extensively used in email systems to distinguish spam emails from legitimate ones by analysing textual content. Text Classification: Useful for classifying documents into predefined categories, such as news topics or product reviews. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 41. Naïve Bayes Model The Naive Bayes model is a probabilistic machine learning algorithm used for classification tasks. It is based on Bayes' Theorem and assumes that the features are conditionally independent given the class label. This assumption, often referred to as the "naive" assumption, simplifies the computation and makes the model efficient and easy to implement. Bayes' Theorem: The foundation of the Naive Bayes model is Bayes' Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. Mathematically, Bayes' theorem can be expressed as
P(A|B) = P(B|A) * P(A) / P(B)
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 42. Naïve Bayes Model Where P(A|B) is the posterior probability of class A given predictor B; P(B|A) is the likelihood, i.e., the probability of predictor B given class A; P(A) is the prior probability of class A; and P(B) is the prior probability of predictor B. Conditional Independence: The Naive Bayes model assumes that all features are independent of each other given the class label. This means that the presence of one feature does not affect the presence of another feature, which simplifies the computation significantly. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 43. Naïve Bayes Model Text Classification: Naive Bayes is widely used in text classification tasks such as spam detection, sentiment analysis, and topic classification. Its ability to handle high-dimensional data makes it particularly effective in these applications. Spam Filtering: One of the most popular applications of Naive Bayes is in email spam filtering. The model can quickly classify emails as spam or not spam based on the presence of certain words or phrases. Sentiment Analysis: Naive Bayes models are used to determine the sentiment of a piece of text, such as a review or a social media post. This is useful for businesses to understand customer opinions and feedback. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
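A minimal scikit-learn sketch of the spam-filtering use case above, combining CountVectorizer with MultinomialNB; the four training emails and their labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "win free money now", "urgent prize claim your reward",               # toy spam
    "meeting rescheduled to monday", "please review the attached report", # toy ham
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["claim your free reward now"]))         # likely 'spam'
print(model.predict(["see the report before the meeting"]))  # likely 'ham'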
  • 44. Naïve Bayes Model 4. Medical Diagnosis: In healthcare, Naive Bayes can be used to predict the likelihood of a patient having a certain disease based on symptoms and other medical data. This helps in early diagnosis and treatment planning. 5. Market Analysis: Naive Bayes is used in marketing to classify customers into different segments based on their purchasing behaviour, demographics, and other attributes. This helps in targeted marketing and personalized recommendations. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 45. Maximum Entropy Model The Maximum Entropy Model (MaxEnt) is a probabilistic framework used in machine learning and natural language processing (NLP) to estimate the probability distribution of a set of outcomes. The principle of maximum entropy suggests choosing the probability distribution that maximizes entropy while satisfying a set of constraints derived from the observed data. This approach ensures that the model makes the fewest assumptions about the data, making it a powerful tool for modelling complex systems where the underlying relationships are not well understood. Mathematically, the maximum entropy distribution takes the log-linear (exponential) form
P(x) = (1/Z) · exp( Σi λi fi(x) )
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 46. Maximum Entropy Model Where P(x) is the probability distribution, Z is the normalization constant (partition function), λi are the Lagrange multipliers, and fi(x) are the feature functions. Maximum Entropy Models have been widely used in various NLP tasks due to their ability to handle complex dependencies and incorporate multiple sources of evidence. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
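In practice, maximum entropy classifiers for text are commonly implemented as multinomial logistic regression; the sketch below uses scikit-learn's TfidfVectorizer and LogisticRegression on a toy sentiment dataset invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and bad acting",
         "wonderful performance", "boring and disappointing"]
labels = ["pos", "neg", "pos", "neg"]

# A MaxEnt-style classifier: a log-linear model over TF-IDF features
maxent = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
maxent.fit(texts, labels)
print(maxent.predict(["a wonderful and great film"]))  # likely 'pos'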
  • 47. Applications of Maximum Entropy Model Part-of-Speech (POS) Tagging: POS tagging involves assigning a grammatical category (e.g., noun, verb, adjective) to each word in a sentence. MaxEnt models are effective in this task because they can combine various contextual features to make accurate predictions. Named Entity Recognition (NER): NER involves identifying and classifying named entities in text into predefined categories such as names of persons, organizations, locations, etc. MaxEnt models can leverage multiple features like word context, capitalization, and part-of-speech tags to improve recognition accuracy. Text Classification: Text classification tasks, such as sentiment analysis and topic classification, benefit from MaxEnt models due to their ability to handle high-dimensional feature spaces and incorporate diverse sources of information. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 48. Applications of Maximum Entropy Model Language Modeling: MaxEnt models can be used to build language models that predict the probability of a word given its context. This is useful for tasks like speech recognition and machine translation. Text Segmentation: MaxEnt models are also used in text segmentation tasks, where the goal is to divide a text into meaningful segments such as sentences or paragraphs. This involves combining various linguistic features to make accurate segmentation decisions. In summary, Maximum Entropy Models are a versatile and powerful tool in NLP, capable of handling complex dependencies and incorporating diverse sources of evidence. Their applications span various tasks, from POS tagging and NER to text classification and language modeling, making them an essential component of modern NLP systems. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 49. Evaluation Metrics for NLP Tasks In Natural Language Processing (NLP), evaluating the performance of models is crucial to ensure they meet the desired standards. Three key metrics used for this purpose are Precision, Recall, and F1-score. These metrics are particularly useful in classification tasks such as sentiment analysis, named entity recognition, and text classification. 1. Precision: Precision measures the proportion of correctly predicted positive instances (true positives) out of all instances predicted as positive (true positives + false positives). Precision is about being precise. It tells us how many of the instances identified as positive are actually positive. A high precision means that the model is good at avoiding false positives. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 50. Evaluation Metrics for NLP Tasks 2. Recall: Recall measures the proportion of correctly predicted positive instances out of all actual positive instances. It evaluates how well the model identifies true positives. A high recall indicates that the model captures most of the actual positives. It is crucial in scenarios where missing true positives is costly. Use cases: In disease screening, high recall ensures that most patients with the disease are correctly identified. In fraud detection, it minimizes undetected fraudulent activities. Recall is about being thorough. It tells us how many of the actual positive instances were correctly identified by the model. A high recall means that the model is good at avoiding false negatives. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
• 51. Evaluation Metrics for NLP Tasks 3. F1-Score The F1-score is the harmonic mean of Precision and Recall: F1 = 2 · (Precision · Recall) / (Precision + Recall). It provides a balanced metric that considers both false positives and false negatives, and it is especially useful when there is an uneven class distribution (imbalanced datasets) or when precision and recall are equally important. It ranges from 0 to 1, with 1 being the best possible score. Use cases: Sentiment Analysis: Ensures accurate classification of both positive and negative sentiments. Named Entity Recognition (NER): Balances identifying entities correctly while minimizing false positives. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
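A small sketch that computes Precision, Recall, and F1 both by hand (from true positive, false positive, and false negative counts) and with scikit-learn's precision_recall_fscore_support; the prediction vectors are toy data.

from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = positive class (e.g., spam)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75 0.75 0.75

p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(p, r, f)                 # matches the manual values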
  • 52. Trade-offs and consideration Precision vs. Recall Trade-off: There is often a trade-off between precision and recall. Improving one can negatively impact the other. For example, a model that predicts everything as positive will have high recall but low precision. Conversely, a model that predicts very few positives will have high precision but low recall. F1-score for Imbalanced Datasets: The F1-score is particularly useful in scenarios with imbalanced datasets where one class significantly outnumbers the other. It helps in providing a more nuanced understanding of model performance by balancing precision and recall. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum