
4. Information Theory in Machine Learning

Information Theory
Researchers have pondered how to quantify information since the early 1900s, and in 1948 Claude Shannon published a phenomenal article called "A Mathematical Theory of Communication." This paper birthed the field of Information Theory. By definition, Information Theory is the study of the quantification, storage, and communication of information, but it is much more than that: it has made significant contributions to statistical physics, computer science, economics, and other fields.

The primary focus of Shannon's paper was the general communication system, as he was working at Bell Labs when he published the article. It established several important concepts, such as information entropy and redundancy. Today, its core fundamentals are applied in lossless data compression, lossy data compression, and channel coding.

The techniques used in Information Theory are probabilistic in nature and usually deal with two specific quantities: entropy and mutual information.

Shannon Entropy (or just Entropy)


Entropy is a measure of the uncertainty of a random variable, or equivalently, of the average amount of information required to describe it. Suppose X is a discrete random variable that can take any value in a set χ, which we assume here to be finite. The probability distribution of X is p(x) = Pr{X = x}, x ∈ χ, and the entropy is defined as

H(X) = −E[log p(X)] = −∑ p(x) log p(x),

where the sum runs over all x ∈ χ.
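To make this concrete, here is a minimal sketch (not from the original article) of the entropy calculation in NumPy; the function name entropy and the example probabilities are made up purely for illustration.

import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x p(x) log p(x) of a discrete distribution.

    Assumes p is an array of probabilities summing to 1.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # by convention, 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

# A fair coin has maximum entropy (1 bit); a heavily biased coin has less.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.47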

Joint and Conditional Entropy

The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

H(X, Y) = −E[log p(X, Y)] = −∑ p(x, y) log p(x, y),

where the sum runs over all pairs (x, y). The conditional entropy H(Y | X) = −E[log p(Y | X)] is the uncertainty that remains in Y once X is known, and the chain rule ties the two together: H(X, Y) = H(X) + H(Y | X).
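As a quick numerical sketch (assuming a small, made-up joint pmf table p_xy), the joint entropy and the chain-rule form of the conditional entropy can be computed as follows:

import numpy as np

# Hypothetical joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_xy = H(p_xy)                  # joint entropy H(X, Y) = -E[log p(X, Y)]
H_x = H(p_xy.sum(axis=1))       # marginal entropy H(X)
H_y_given_x = H_xy - H_x        # chain rule: H(Y | X) = H(X, Y) - H(X)
print(H_xy, H_x, H_y_given_x)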

Relative Entropy

Relative Entropy is somewhat different, as it moves on from random variables to distributions. It is a measure of the "distance" between two distributions, although it is not a true metric: it is not symmetric and does not satisfy the triangle inequality. A more intuitive way to put it: relative entropy, or KL-Divergence, denoted by D(p||q), is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p. For discrete distributions it is defined as

D(p||q) = ∑ p(x) log ( p(x) / q(x) ),

where the sum runs over all x with p(x) > 0.
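A minimal sketch of D(p||q) in NumPy (the distributions p and q below are made-up examples) shows both the "inefficiency" interpretation and the fact that the divergence is not symmetric:

import numpy as np

def kl_divergence(p, q, base=2):
    """D(p || q) = sum_x p(x) log(p(x) / q(x)).

    Assumes p and q are distributions over the same finite set and that
    q(x) > 0 wherever p(x) > 0 (otherwise the divergence is infinite).
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))) / np.log(base)

p = [0.5, 0.5]   # true distribution
q = [0.9, 0.1]   # assumed distribution
print(kl_divergence(p, q))                          # ~0.74 bits of inefficiency
print(kl_divergence(p, p))                          # 0.0 when q matches p
print(kl_divergence(p, q) == kl_divergence(q, p))   # False: not symmetric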
Mutual Information

Mutual Information is a measure of the amount of information that one random variable contains about another random variable. Alternatively, it can be defined as the reduction in the uncertainty of one variable due to knowledge of the other. The technical definition is as follows: consider two random variables X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y):

I(X; Y) = D( p(x, y) || p(x)p(y) ) = ∑ p(x, y) log ( p(x, y) / (p(x)p(y)) ).
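Since I(X; Y) is just a relative entropy, it can be computed directly from a joint pmf table. The sketch below (with made-up joint tables) checks the two extreme cases, independence and perfect dependence:

import numpy as np

def mutual_information(p_xy):
    """I(X; Y) = D( p(x, y) || p(x)p(y) ), computed from a joint pmf table."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))

# Independent variables carry no information about each other: I(X; Y) = 0.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # 0.0
# Perfectly dependent variables share all their entropy: 1 bit here.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))       # 1.0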


Applications of Information Theory in Machine Learning

Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. The core algorithm used here is ID3, developed by Ross Quinlan. It employs a top-down greedy search and partitions the data into subsets of homogeneous data. The ID3 algorithm decides each partition by measuring the homogeneity of the sample using entropy: if the sample is completely homogeneous, the entropy is 0, and if the sample is uniformly divided across classes, the entropy is at its maximum. But entropy alone does not directly determine the construction of the tree. The algorithm relies on Information Gain, which is the decrease in entropy after a dataset is split on an attribute. If you think about it intuitively, you will see that this is actually the Mutual Information mentioned above: mutual information is the reduction in the uncertainty of one variable given the value of the other. In a DT, we calculate the entropy of the target variable, split the dataset on an attribute, and subtract the weighted entropy of the resulting subsets from the previous entropy value. This is Information Gain and, obviously, Mutual Information in play.
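Here is a small sketch of that calculation (the entropy and information_gain helpers, and the toy outlook/play data, are made up for illustration, in the spirit of the classic ID3 examples):

import numpy as np

def entropy(labels):
    """Entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Decrease in label entropy after splitting on a categorical feature."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    split_entropy = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        split_entropy += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - split_entropy

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))   # ~0.67 bits gained by splitting on outlook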
Cross-Entropy

Cross-entropy is a concept very similar to Relative Entropy. Relative entropy measures how the approximating distribution q diverges from the true distribution p at each sample point, whereas cross-entropy compares the true distribution p with the approximating distribution q directly:

H(p, q) = −∑ p(x) log q(x) = H(p) + D(p||q).

Cross-entropy is a term heavily used in the field of deep learning. It serves as a loss function that measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.
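For the binary case this is the familiar log loss. A minimal sketch (the labels and predicted probabilities below are invented) illustrates how the loss grows as predictions drift away from the true labels:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy -[y log q + (1 - y) log(1 - q)] over the samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = [1, 0, 1, 1]
print(binary_cross_entropy(y_true, [0.9, 0.1, 0.8, 0.7]))  # confident and correct: low loss
print(binary_cross_entropy(y_true, [0.6, 0.6, 0.4, 0.3]))  # poorly calibrated: higher loss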

KL-Divergence

K-L Divergence, or Relative Entropy, is also a topic embedded in the deep learning literature, specifically in Variational Autoencoders (VAEs). Rather than mapping each input to a single point, a VAE encodes it as a Gaussian distribution in the latent space. It is desirable for these distributions to be regularized so that they overlap well within the latent space; K-L divergence between each encoded Gaussian and the standard normal prior measures this and is added to the loss function.
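For a diagonal Gaussian encoder and a standard normal prior, this K-L term has a simple closed form, 0.5 ∑ (σ² + μ² − 1 − log σ²) over the latent dimensions. The sketch below (with made-up μ and log σ² values standing in for an encoder's outputs) computes it:

import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.

    In a VAE, mu and log_var would come from the encoder network, and this
    value is added to the reconstruction loss as a regularizer.
    """
    mu, log_var = np.asarray(mu, dtype=float), np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# An encoding that already matches the prior costs nothing...
print(kl_to_standard_normal(mu=[0.0, 0.0], log_var=[0.0, 0.0]))    # 0.0
# ...while one far from the prior is penalized.
print(kl_to_standard_normal(mu=[3.0, -2.0], log_var=[1.0, -1.0]))  # ~7.04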

K-L Divergence is also used in t-SNE. t-SNE is a dimensionality reduction technique mainly used to visualize high-dimensional data. It converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
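In practice you rarely implement t-SNE by hand; assuming scikit-learn is available, a typical usage sketch on the built-in digits dataset looks like this:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Project the 64-dimensional digits data down to 2 dimensions for plotting.
X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)   # (1797, 2)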

Calculating imbalance in target class distribution

Entropy can be used to calculate target class imbalance. If we consider the target feature as a random variable with two classes, a balanced set (a 50/50 split) has the maximum entropy, as we saw in the case of the coin toss, but if the split is skewed and one class has 90% prevalence, there is less knowledge to be gained and hence lower entropy. Applying the same entropy calculation to a multiclass target variable, we can check whether it is balanced with a single quantified value, albeit an average that masks the individual class probabilities.
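A simple way to turn this into a single number is to normalize the entropy of the class distribution by its maximum possible value; the helper name balance_score and the toy label arrays below are made up for illustration:

import numpy as np

def balance_score(labels):
    """Entropy of the class distribution, normalized to [0, 1].

    1.0 means a perfectly balanced target; values near 0 indicate heavy skew.
    Assumes the labels contain at least two distinct classes.
    """
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    H = -np.sum(p * np.log2(p))
    return H / np.log2(len(p))       # divide by the maximum possible entropy

print(balance_score([0] * 50 + [1] * 50))   # 1.0  (balanced)
print(balance_score([0] * 90 + [1] * 10))   # ~0.47 (heavily skewed)
print(balance_score([0, 0, 1, 1, 2, 2]))    # 1.0  (balanced three-class target)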
