Information Theory in Machine Learning
Information Theory
Researchers have pondered how to quantify information since the early 1900s, and in 1948,
Claude Shannon published a phenomenal article called “A Mathematical Theory of
Communication.” This paper birthed the field of Information Theory. Information Theory, by
definition, is the study of the quantification, storage, and communication of information. But
it is so much more than that: it has made significant contributions to Statistical Physics,
Computer Science, Economics, and other fields.
The primary focus of Shannon’s paper was the general communication system, as he was
working at Bell Labs when he published the article. It established several important
concepts, such as information entropy and redundancy. Today, its core fundamentals are
applied in lossless data compression, lossy data compression, and channel coding.
The techniques used in Information Theory are probabilistic in nature and usually deal with
two specific quantities: Entropy and Mutual Information.
The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint
distribution p(x, y) is defined as

H(X, Y) = − Σ_x Σ_y p(x, y) log p(x, y)
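As a minimal sketch, assuming an illustrative 2x2 joint probability table, the joint entropy can be computed in Python with NumPy:

```python
import numpy as np

# Assumed example: joint distribution p(x, y) for two binary variables,
# stored as a 2x2 probability table that sums to 1.
p_xy = np.array([[0.25, 0.25],
                 [0.25, 0.25]])

# H(X, Y) = -sum over x, y of p(x, y) * log2 p(x, y)
# (cells with p = 0 are skipped, since 0 * log 0 is treated as 0)
nonzero = p_xy[p_xy > 0]
joint_entropy = -np.sum(nonzero * np.log2(nonzero))
print(joint_entropy)  # 2.0 bits for this uniform 2x2 table
```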
Relative Entropy
Relative entropy, or KL-Divergence, denoted by D(p||q), is defined as

D(p||q) = Σ_x p(x) log ( p(x) / q(x) )

A more intuitive way to put it: it is a measure of the inefficiency of assuming that the
distribution is q when the true distribution is p.
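A small sketch of this quantity, assuming two illustrative discrete distributions p and q over the same three outcomes:

```python
import numpy as np

# Assumed example: true distribution p and approximating distribution q.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# D(p || q) = sum over x of p(x) * log2( p(x) / q(x) ),
# defined when q(x) > 0 wherever p(x) > 0.
kl = np.sum(p * np.log2(p / q))
print(kl)  # a small positive value; it is 0 only when p and q are identical
```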
Mutual Information
Mutual Information is a measure of the amount of information that one random variable
contains about another random variable. Alternatively, it can be defined as the reduction in
uncertainty of one variable due to knowledge of the other. The technical definition is as
follows: consider two random variables X and Y with a joint probability mass function p(x, y)
and marginal probability mass functions p(x) and p(y). The mutual information I(X; Y) is the
relative entropy between the joint distribution and the product distribution p(x)p(y):

I(X; Y) = Σ_x Σ_y p(x, y) log ( p(x, y) / (p(x) p(y)) )
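As a sketch under an assumed 2x2 joint distribution, mutual information can be computed directly from this definition:

```python
import numpy as np

# Assumed example: joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

# Marginals p(x) and p(y) obtained by summing the joint table.
p_x = p_xy.sum(axis=1, keepdims=True)  # shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)  # shape (1, 2)

# I(X; Y) = sum over x, y of p(x, y) * log2( p(x, y) / (p(x) p(y)) )
mask = p_xy > 0
mi = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))
print(mi)  # positive here, because X and Y are dependent in this table
```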
Decision Trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification
and regression. The goal is to create a model that predicts the value of a target variable by
learning simple decision rules inferred from the data features. The core algorithm used here is
called ID3, which was developed by Ross Quinlan. It employs a top-down greedy search
approach and involves partitioning the data into subsets of homogeneous examples. The ID3
algorithm decides where to partition by measuring the homogeneity of the sample using entropy:
if the sample is completely homogeneous, the entropy is 0, and if the sample is evenly divided
between classes, the entropy is at its maximum. Entropy alone, however, does not directly drive
the construction of the tree. The algorithm relies on Information Gain, which is the decrease in
entropy after a dataset is split on an attribute. Thought of intuitively, this is exactly the Mutual
Information mentioned above: Mutual Information is the reduction in uncertainty of one variable
given the value of the other. In a decision tree, we calculate the entropy of the target variable,
split the dataset on an attribute, and subtract the weighted entropy of the target after the split
from the entropy before the split. This difference is the Information Gain, and it is Mutual
Information in play.
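The calculation described above can be sketched in a few lines of Python; the dataset below is an assumed toy example, not part of ID3 itself:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(labels, attribute):
    """Entropy before the split minus the weighted entropy after
    splitting on the attribute values (the quantity ID3 maximizes)."""
    total = len(labels)
    after = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        after += (len(subset) / total) * entropy(subset)
    return entropy(labels) - after

# Assumed toy data: binary target labels and one candidate split attribute.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
a = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b'])
print(information_gain(y, a))  # 1.0 bit: this split separates the classes perfectly
```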
Cross-Entropy
Cross-entropy is a concept closely related to Relative Entropy. Relative entropy measures how
the approximating distribution q diverges from the true distribution p at each sample point,
whereas cross-entropy directly compares the true distribution p with the approximating
distribution q. Cross-entropy is a term heavily used in the field of deep learning. It serves as a
loss function that measures the performance of a classification model whose output is a
probability value between 0 and 1. Cross-entropy loss increases as the predicted probability
diverges from the actual label.
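As a small sketch, the binary cross-entropy loss for a handful of assumed predictions and labels looks like this:

```python
import numpy as np

# Assumed example: true labels (0 or 1) and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.6, 0.95])

# Binary cross-entropy: -[y log(p) + (1 - y) log(1 - p)], averaged over samples.
eps = 1e-12  # guards against log(0)
loss = -np.mean(y_true * np.log(y_pred + eps)
                + (1 - y_true) * np.log(1 - y_pred + eps))
print(loss)  # grows as the predicted probabilities drift away from the true labels
```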
KL-Divergence
K-L Divergence, or Relative Entropy, is also embedded in the deep learning literature,
specifically in Variational Autoencoders (VAEs). A VAE encodes each input as a Gaussian
distribution in the latent space rather than as a single discrete point. It is desirable for these
distributions to be regularized so that they overlap within the latent space; K-L divergence
measures this and is added to the loss function.
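For a diagonal Gaussian posterior q(z|x) = N(mu, sigma^2) and a standard normal prior, this K-L term has a closed form; the sketch below uses assumed example values for the encoder outputs:

```python
import numpy as np

# Assumed example: encoder outputs for a batch of 2 samples with a
# 3-dimensional latent space (means and log-variances of q(z|x)).
mu = np.array([[0.5, -0.2, 0.1],
               [1.0,  0.3, -0.4]])
log_var = np.array([[-0.1, 0.2, 0.0],
                    [ 0.1, -0.3, 0.2]])

# KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 ),
# summed over the latent dimensions of each sample.
kl_per_sample = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=1)
print(kl_per_sample)  # this term is added to the reconstruction loss of the VAE
```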
K-L Divergence is also used in t-SNE. t-SNE is a dimensionality reduction technique mainly
used to visualize high-dimensional data. It converts similarities between data points into joint
probabilities and tries to minimize the Kullback-Leibler divergence between the joint
probabilities of the low-dimensional embedding and those of the high-dimensional data.
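A brief usage sketch with scikit-learn's TSNE, which performs this minimization internally; the random input data here is purely an assumption for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

# Assumed example: 100 points in 50 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# t-SNE minimizes KL(P || Q) between the pairwise-similarity distributions
# of the original space (P) and the 2-D embedding (Q).
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)      # (100, 2)
print(tsne.kl_divergence_)  # the KL divergence value the optimizer ended at
```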
Entropy can also be used to measure target class imbalance. If we consider the target variable
as a random variable with two classes, a balanced set (a 50/50 split) has the maximum entropy,
as we saw in the case of the coin toss. But if the split is skewed and one class has 90%
prevalence, there is less information to be gained, hence a lower entropy. Using the chain rule
for entropy, we can check whether a multiclass target variable is balanced with a single
quantified value, albeit an average that masks the individual class probabilities.
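A sketch of this check, assuming hypothetical target arrays; an entropy near log2(number of classes) indicates a balanced target:

```python
import numpy as np

def class_balance_entropy(labels):
    """Entropy (in bits) of the empirical class distribution of the target."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# Assumed example targets: a balanced binary set and a 90/10 skewed one.
balanced = np.array([0] * 50 + [1] * 50)
skewed = np.array([0] * 90 + [1] * 10)

print(class_balance_entropy(balanced))  # 1.0 bit, the maximum for two classes
print(class_balance_entropy(skewed))    # about 0.47 bits, flagging the imbalance
```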