Information Theory in Machine Learning
Information Theory
Researchers have pondered how to quantify information since the early 1900s, and in 1948,
Claude Shannon published a phenomenal article called “A Mathematical Theory of
Communication.” This paper birthed the field of Information Theory. Information Theory, by
definition, is the study of the quantification, storage, and communication of information. But
it is so much more than that: it has made significant contributions to Statistical Physics,
Computer Science, Economics, and other fields.
The primary focus of Shannon’s paper was the general communication system, as he was
working at Bell Labs when he published the article. It established several important
concepts, such as information entropy and redundancy. Today, its core fundamentals are
applied in lossless data compression, lossy data compression, and channel coding.
The techniques used in Information Theory are probabilistic in nature and usually deal with
two specific quantities: Entropy and Mutual Information.
The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint
distribution p(x, y) is defined as

H(X, Y) = − Σ_x Σ_y p(x, y) log p(x, y)
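As a minimal sketch, assuming an illustrative 2x2 joint probability table, the joint entropy can be computed in Python with NumPy:

```python
import numpy as np

# Assumed example: joint distribution p(x, y) for two binary variables,
# stored as a 2x2 probability table that sums to 1.
p_xy = np.array([[0.25, 0.25],
                 [0.25, 0.25]])

# H(X, Y) = -sum over x, y of p(x, y) * log2 p(x, y)
# (cells with p = 0 are skipped, since 0 * log 0 is treated as 0)
nonzero = p_xy[p_xy > 0]
joint_entropy = -np.sum(nonzero * np.log2(nonzero))
print(joint_entropy)  # 2.0 bits for this uniform 2x2 table
```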
Relative Entropy
Relative entropy, or KL-Divergence, denoted by D(p||q), is defined as

D(p||q) = Σ_x p(x) log ( p(x) / q(x) )

A more intuitive way to put it: it is a measure of the inefficiency of assuming that the
distribution is q when the true distribution is p.
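A small sketch of this quantity, assuming two illustrative discrete distributions p and q over the same three outcomes:

```python
import numpy as np

# Assumed example: true distribution p and approximating distribution q.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# D(p || q) = sum over x of p(x) * log2( p(x) / q(x) ),
# defined when q(x) > 0 wherever p(x) > 0.
kl = np.sum(p * np.log2(p / q))
print(kl)  # a small positive value; it is 0 only when p and q are identical
```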
Mutual Information
Mutual Information is a measure of the amount of information that one random variable
contains about another random variable. Alternatively, it can be defined as the reduction in
uncertainty of one variable due to knowledge of the other. The technical definition is as
follows: consider two random variables X and Y with a joint probability mass function p(x, y)
and marginal probability mass functions p(x) and p(y). The mutual information I(X; Y) is the
relative entropy between the joint distribution and the product distribution p(x)p(y):

I(X; Y) = Σ_x Σ_y p(x, y) log ( p(x, y) / (p(x) p(y)) )
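As a sketch under an assumed 2x2 joint distribution, mutual information can be computed directly from this definition:

```python
import numpy as np

# Assumed example: joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

# Marginals p(x) and p(y) obtained by summing the joint table.
p_x = p_xy.sum(axis=1, keepdims=True)  # shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)  # shape (1, 2)

# I(X; Y) = sum over x, y of p(x, y) * log2( p(x, y) / (p(x) p(y)) )
mask = p_xy > 0
mi = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))
print(mi)  # positive here, because X and Y are dependent in this table
```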
Decision Trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification
and regression. The goal is to create a model that predicts the value of a target variable by
learning simple decision rules inferred from the data features. The core algorithm used here is
called ID3, which was developed by Ross Quinlan. It employs a top-down greedy search
approach and involves partitioning the data into subsets of homogeneous examples. The ID3
algorithm decides where to partition by measuring the homogeneity of the sample using entropy:
if the sample is completely homogeneous, the entropy is 0, and if the sample is evenly divided
between classes, the entropy is at its maximum. Entropy alone, however, does not directly drive
the construction of the tree. The algorithm relies on Information Gain, which is the decrease in
entropy after a dataset is split on an attribute. Thought of intuitively, this is exactly the Mutual
Information mentioned above: Mutual Information is the reduction in uncertainty of one variable
given the value of the other. In a decision tree, we calculate the entropy of the target variable,
split the dataset on an attribute, and subtract the weighted entropy of the target after the split
from the entropy before the split. This difference is the Information Gain, and it is Mutual
Information in play.
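The calculation described above can be sketched in a few lines of Python; the dataset below is an assumed toy example, not part of ID3 itself:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(labels, attribute):
    """Entropy before the split minus the weighted entropy after
    splitting on the attribute values (the quantity ID3 maximizes)."""
    total = len(labels)
    after = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        after += (len(subset) / total) * entropy(subset)
    return entropy(labels) - after

# Assumed toy data: binary target labels and one candidate split attribute.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
a = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b'])
print(information_gain(y, a))  # 1.0 bit: this split separates the classes perfectly
```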
Cross-Entropy
Cross-entropy is a concept closely related to Relative Entropy. Relative entropy measures how
the approximating distribution q diverges from the true distribution p at each sample point,
whereas cross-entropy directly compares the true distribution p with the approximating
distribution q. Cross-entropy is a term heavily used in the field of deep learning. It serves as a
loss function that measures the performance of a classification model whose output is a
probability value between 0 and 1. Cross-entropy loss increases as the predicted probability
diverges from the actual label.
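As a small sketch, the binary cross-entropy loss for a handful of assumed predictions and labels looks like this:

```python
import numpy as np

# Assumed example: true labels (0 or 1) and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.6, 0.95])

# Binary cross-entropy: -[y log(p) + (1 - y) log(1 - p)], averaged over samples.
eps = 1e-12  # guards against log(0)
loss = -np.mean(y_true * np.log(y_pred + eps)
                + (1 - y_true) * np.log(1 - y_pred + eps))
print(loss)  # grows as the predicted probabilities drift away from the true labels
```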
KL-Divergence
K-L Divergence, or Relative Entropy, is also embedded in the deep learning literature,
specifically in Variational Autoencoders (VAEs). A VAE encodes each input as a Gaussian
distribution in the latent space rather than as a single discrete point. It is desirable for these
distributions to be regularized so that they overlap within the latent space; K-L divergence
measures this and is added to the loss function.
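For a diagonal Gaussian posterior q(z|x) = N(mu, sigma^2) and a standard normal prior, this K-L term has a closed form; the sketch below uses assumed example values for the encoder outputs:

```python
import numpy as np

# Assumed example: encoder outputs for a batch of 2 samples with a
# 3-dimensional latent space (means and log-variances of q(z|x)).
mu = np.array([[0.5, -0.2, 0.1],
               [1.0,  0.3, -0.4]])
log_var = np.array([[-0.1, 0.2, 0.0],
                    [ 0.1, -0.3, 0.2]])

# KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 ),
# summed over the latent dimensions of each sample.
kl_per_sample = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=1)
print(kl_per_sample)  # this term is added to the reconstruction loss of the VAE
```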
K-L Divergence is also used in t-SNE. t-SNE is a dimensionality reduction technique mainly
used to visualize high-dimensional data. It converts similarities between data points into joint
probabilities and tries to minimize the Kullback-Leibler divergence between the joint
probabilities of the low-dimensional embedding and those of the high-dimensional data.
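A brief usage sketch with scikit-learn's TSNE, which performs this minimization internally; the random input data here is purely an assumption for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

# Assumed example: 100 points in 50 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# t-SNE minimizes KL(P || Q) between the pairwise-similarity distributions
# of the original space (P) and the 2-D embedding (Q).
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)      # (100, 2)
print(tsne.kl_divergence_)  # the KL divergence value the optimizer ended at
```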
Entropy can also be used to measure target class imbalance. If we consider the target variable
as a random variable with two classes, a balanced set (a 50/50 split) has the maximum entropy,
as we saw in the case of the coin toss. But if the split is skewed and one class has 90%
prevalence, there is less information to be gained, hence a lower entropy. Using the chain rule
for entropy, we can check whether a multiclass target variable is balanced with a single
quantified value, albeit an average that masks the individual class probabilities.
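A sketch of this check, assuming hypothetical target arrays; an entropy near log2(number of classes) indicates a balanced target:

```python
import numpy as np

def class_balance_entropy(labels):
    """Entropy (in bits) of the empirical class distribution of the target."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# Assumed example targets: a balanced binary set and a 90/10 skewed one.
balanced = np.array([0] * 50 + [1] * 50)
skewed = np.array([0] * 90 + [1] * 10)

print(class_balance_entropy(balanced))  # 1.0 bit, the maximum for two classes
print(class_balance_entropy(skewed))    # about 0.47 bits, flagging the imbalance
```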