
Opening the black box of Deep Neural Networks via Information
(Ravid Shwartz-Ziv and Naftali Tishby)

An overview by Philip Amortila and Nicolas Gagné
Problem:
The usual "black box" story:
"Despite their great success, there is still no comprehensive understanding of the internal organization of Deep Neural Networks."
Solution:
Opening the black box using mutual information!
Spoiler alert!
As we train a deep neural network, its layers will

1. Gain "information" about the true label and the input.

2. Then gain "information" about the true label but lose "information" about the input!

During that second stage, the network forgets the details of the input that are irrelevant to the task.

(Not unlike Picasso's deconstruction of the bull.)
Spoiler alert!
Before delving in, we first need to define what we mean by "information".

By information, we mean mutual information.
What is mutual information?
For discrete random variables X and Y:

• Entropy of X given nothing:    H(X) := −Σᵢ p(xᵢ) log p(xᵢ)

• Entropy of X given y:          H(X|y) := −Σᵢ p(xᵢ|y) log p(xᵢ|y)

• Entropy of X given Y:          H(X|Y) := Σⱼ p(yⱼ) H(X|yⱼ)

Mutual information between X and Y:    I(X;Y) := H(X) − H(X|Y)
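The definitions above can be checked numerically. A minimal sketch (not from the slides; the joint table `p_xy` is an illustrative assumption):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H in bits of a discrete distribution p (zero entries skipped)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) - H(X|Y) for a joint probability table p_xy[i, j] = p(x_i, y_j)."""
    p_x = p_xy.sum(axis=1)  # marginal over Y
    p_y = p_xy.sum(axis=0)  # marginal over X
    # H(X|Y) = sum_j p(y_j) * H(X | Y = y_j)
    h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j])
                      for j in range(len(p_y)) if p_y[j] > 0)
    return entropy(p_x) - h_x_given_y

# Example: X fully determines Y, so I(X;Y) = H(X) = 1 bit.
p_xy = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
print(mutual_information(p_xy))  # prints 1.0
```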
Setup:
Given a deep neural network where X and Y are discrete random variables with conditional distribution p(x|y):

We identify the propagated values at layer i with the vector Ti, and doing so for every layer, we get the following Markov chain:

Y → X → T1 → T2 → ... → Tk

(Hint: Markov chain rhymes with data processing inequality.)

I(Y; Ti) ≥ I(Y; Ti+n)   and   I(X; Ti) ≥ I(X; Ti+n)

Next, pick your favourite layer, say Ti.
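The slides do not spell out how these quantities are estimated for a real layer. In the paper this is done by discretizing each unit's activation into bins; the sketch below is a hypothetical reconstruction that assumes 30 equal-width bins over tanh's range [-1, 1], and uses the fact that T is a deterministic function of X, so I(X;T) = H(T) when the sampled inputs are distinct:

```python
import numpy as np
from collections import Counter

def entropy_of_counts(counts, n):
    """Shannon entropy in bits from a Counter of symbol frequencies out of n samples."""
    p = np.array(list(counts.values())) / n
    return -np.sum(p * np.log2(p))

def binned_mi(y, t, n_bins=30):
    """Estimate I(Y;T) and H(T) by binning a layer's activations.

    t: array of shape (samples, units), assumed to lie in tanh's range [-1, 1].
    Returns (I(Y;T), H(T)); H(T) equals I(X;T) when T is deterministic in X.
    """
    edges = np.linspace(-1.0, 1.0, n_bins + 1)[1:-1]   # interior bin edges
    symbols = [tuple(row) for row in np.digitize(t, edges)]  # one symbol per sample
    n = len(symbols)
    h_t = entropy_of_counts(Counter(symbols), n)
    # H(T|Y) = sum over labels of p(y) * H(T | Y = y)
    h_t_given_y = 0.0
    for label in set(y):
        idx = [i for i, yi in enumerate(y) if yi == label]
        h_t_given_y += (len(idx) / n) * entropy_of_counts(
            Counter(symbols[i] for i in idx), len(idx))
    return h_t - h_t_given_y, h_t

# Tiny illustration: a 3-unit "layer" on 4 samples with binary labels.
y = [0, 0, 1, 1]
t = np.array([[-0.9, 0.1,  0.3],
              [-0.8, 0.2,  0.3],
              [ 0.7, 0.2, -0.5],
              [ 0.6, 0.1, -0.4]])
i_yt, h_t = binned_mi(y, t)
```

The bin count (30) matches the discretization commonly attributed to the paper's experiments, but the exact estimator details here are an assumption.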
Next, pick your favorite layer, say T.
We will plot T's current location on the "information plane": the horizontal axis is I(X;T), "how much it knows about X", and the vertical axis is I(Y;T), "how much it knows about Y".

Then, we train our deep neural network for a bit, and plot T's new location.

We train a bit more...

And as we train, we trace the layer's trajectory in the information plane.
Let's see what it looks like for a fixed data point (image, "bull").

[Figure: as training proceeds, the layer's point moves through the information plane, with axes I(X;T) "how much it knows about X" and I(Y;T) "how much it knows about Y"; early on it sits near points labelled (image, "dog") and (image, "goat"), and eventually reaches (image, "bull").]
Numerical Experiments and Results
Examining the dynamics of SGD in the mutual information plane

Experimental Setup
• Explored fully connected feed-forward neural nets, with no other architecture constraints: 7 fully connected hidden layers with widths 12-10-7-5-4-3-2

• sigmoid activation on the final layer, tanh activation on all other layers

• Binary decision task on synthetic data: D ~ P(X, Y)

• Experiments with 50 different randomized weight initializations and 50 different datasets generated from the same distribution

• Trained with SGD to minimize the cross-entropy loss function, with no regularization
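A minimal sketch of a forward pass through this architecture, assuming the first width (12) is the input dimension and adding a single sigmoid output unit for the binary task (the slides do not specify the output layer); the weight initialization is an illustrative choice, not the authors':

```python
import numpy as np

# 12-d input, hidden widths from the slide, one sigmoid output unit (an assumption).
WIDTHS = [12, 10, 7, 5, 4, 3, 2, 1]

rng = np.random.default_rng(0)
weights = [rng.normal(0, 1 / np.sqrt(n_in), size=(n_in, n_out))
           for n_in, n_out in zip(WIDTHS[:-1], WIDTHS[1:])]
biases = [np.zeros(n_out) for n_out in WIDTHS[1:]]

def forward(x):
    """Return the activations T1..Tk of every layer for a batch of inputs x."""
    layers = []
    t = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = t @ w + b
        # tanh everywhere except the final layer, which is sigmoid
        t = 1 / (1 + np.exp(-z)) if i == len(weights) - 1 else np.tanh(z)
        layers.append(t)
    return layers

x = rng.integers(0, 2, size=(4, 12)).astype(float)  # 4 random 12-bit inputs
acts = forward(x)
```

Keeping every layer's activation vector, rather than only the output, is what makes it possible to place each Ti on the information plane.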
Dynamics in the Information Plane
• All 50 test runs follow similar paths in the information plane

• Two different phases of training: a fast 'ERM reduction' phase and a slower 'representation compression' phase

• In the first phase (~400 epochs), the test error is rapidly reduced (increase in I(T;Y))

• In the second phase (from 400 to 9000 epochs), the error is relatively unchanged but the layers lose input information (decrease in I(X;T))
Representation Compression
• The loss of input information occurs without any form of regularization

• This prevents overfitting, since the layers lose irrelevant information ("generalizing by forgetting")

• However, overfitting can still occur with less data

[Figure: information-plane trajectories for networks trained on 5%, 45%, and 85% of the data respectively]
Phase transitions
• The 'ERM reduction' phase is called a drift phase, where the gradients are large and the weights are changing rapidly (high signal-to-noise)

• The 'representation compression' phase is called a diffusion phase, where the gradients are small compared to their variance (low signal-to-noise)

[Figure: per-layer gradient means and standard deviations over epochs; the phase transition occurs at the dotted line]
Effectiveness of Deep Nets
• Because of the low signal-to-noise ratio in the diffusion phase, the final weights obtained by the DNN are effectively random

• Across different experiments, the correlations between the weights of different neurons in the same layer were very small

• "This indicates that there is a huge number of different networks with essentially optimal performance, and attempts to interpret single weights or single neurons in such networks can be meaningless"
Why go deep?
• Faster ERM minimization (in epochs)

• Faster representation compression (in epochs)
Discussions/Disclaimers
• The claims being made are certainly very interesting, although the scope of experiments is limited: only one specific distribution and one specific network are examined.

• The paper acknowledges that different setups need to be tested: do the results hold with different decision rules and network architectures? And is this observed in "real world" problems?
Discussions/Disclaimers
• As of now there is some controversy about whether or not these claims hold up: see "On the Information Bottleneck Theory of Deep Learning" (Saxe et al., 2018)

• "Here we show that none of these claims hold true in the general case. [...] we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed [...] Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent."
The End
