
Opening the black box of Deep Neural Networks via Information
(Ravid Shwartz-Ziv and Naftali Tishby)

An overview by Philip Amortila and Nicolas Gagné
Problem:
The usual "black box" story:
"Despite their great success, there is still no comprehensive understanding of the internal organization of Deep Neural Networks."
Solution:
Opening the black box using mutual information!
Spoiler alert!
As we train a deep neural network, its layers will

1. Gain "information" about the true label and the input.

2. Then gain "information" about the true label but lose "information" about the input!

During that second stage, the network forgets the details of the input that are irrelevant to the task.

(Not unlike Picasso's deconstruction of the bull.)
Spoiler alert!
Before delving in, we first need to define what we mean by "information".

By information, we mean mutual information.
What is mutual information?
For discrete random variables X and Y:

• Entropy of X given nothing:    H(X) := −Σᵢ p(xᵢ) log p(xᵢ)

• Entropy of X given y:          H(X|y) := −Σᵢ p(xᵢ|y) log p(xᵢ|y)

• Entropy of X given Y:          H(X|Y) := Σⱼ p(yⱼ) H(X|yⱼ)

Mutual information between X and Y:    I(X;Y) := H(X) − H(X|Y)
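The definitions above can be checked numerically. A minimal sketch (not from the slides; the joint table `p_xy` is an illustrative assumption):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H in bits of a discrete distribution p (zero entries skipped)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) - H(X|Y) for a joint probability table p_xy[i, j] = p(x_i, y_j)."""
    p_x = p_xy.sum(axis=1)  # marginal over Y
    p_y = p_xy.sum(axis=0)  # marginal over X
    # H(X|Y) = sum_j p(y_j) * H(X | Y = y_j)
    h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j])
                      for j in range(len(p_y)) if p_y[j] > 0)
    return entropy(p_x) - h_x_given_y

# Example: X fully determines Y, so I(X;Y) = H(X) = 1 bit.
p_xy = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
print(mutual_information(p_xy))  # prints 1.0
```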
Setup:
Given a deep neural network where X and Y are discrete random variables with conditional distribution p(x|y):

We identify the propagated values at layer i with the vector Ti, and doing so for every layer, we get the following Markov chain:

Y → X → T1 → T2 → ... → Tk

(Hint: Markov chain rhymes with data processing inequality.)

I(Y; Ti) ≥ I(Y; Ti+n)   and   I(X; Ti) ≥ I(X; Ti+n)

Next, pick your favourite layer, say Ti.
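The slides do not spell out how these quantities are estimated for a real layer. In the paper this is done by discretizing each unit's activation into bins; the sketch below is a hypothetical reconstruction that assumes 30 equal-width bins over tanh's range [-1, 1], and uses the fact that T is a deterministic function of X, so I(X;T) = H(T) when the sampled inputs are distinct:

```python
import numpy as np
from collections import Counter

def entropy_of_counts(counts, n):
    """Shannon entropy in bits from a Counter of symbol frequencies out of n samples."""
    p = np.array(list(counts.values())) / n
    return -np.sum(p * np.log2(p))

def binned_mi(y, t, n_bins=30):
    """Estimate I(Y;T) and H(T) by binning a layer's activations.

    t: array of shape (samples, units), assumed to lie in tanh's range [-1, 1].
    Returns (I(Y;T), H(T)); H(T) equals I(X;T) when T is deterministic in X.
    """
    edges = np.linspace(-1.0, 1.0, n_bins + 1)[1:-1]   # interior bin edges
    symbols = [tuple(row) for row in np.digitize(t, edges)]  # one symbol per sample
    n = len(symbols)
    h_t = entropy_of_counts(Counter(symbols), n)
    # H(T|Y) = sum over labels of p(y) * H(T | Y = y)
    h_t_given_y = 0.0
    for label in set(y):
        idx = [i for i, yi in enumerate(y) if yi == label]
        h_t_given_y += (len(idx) / n) * entropy_of_counts(
            Counter(symbols[i] for i in idx), len(idx))
    return h_t - h_t_given_y, h_t

# Tiny illustration: a 3-unit "layer" on 4 samples with binary labels.
y = [0, 0, 1, 1]
t = np.array([[-0.9, 0.1,  0.3],
              [-0.8, 0.2,  0.3],
              [ 0.7, 0.2, -0.5],
              [ 0.6, 0.1, -0.4]])
i_yt, h_t = binned_mi(y, t)
```

The bin count (30) matches the discretization commonly attributed to the paper's experiments, but the exact estimator details here are an assumption.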
Next, pick your favorite layer, say T.
We will plot T's current location on the "information plane": the horizontal axis is I(X;T), "how much it knows about X", and the vertical axis is I(Y;T), "how much it knows about Y".

Then, we train our deep neural network for a bit, and plot T's new location.

We train a bit more...

And as we train, we trace the layer's trajectory in the information plane.
Let's see what it looks like for a fixed data point (image, "bull").

[Figure: as training proceeds, the layer's point moves through the information plane, with axes I(X;T) "how much it knows about X" and I(Y;T) "how much it knows about Y"; early on it sits near points labelled (image, "dog") and (image, "goat"), and eventually reaches (image, "bull").]
Numerical Experiments and Results
Examining the dynamics of SGD in the mutual information plane

Experimental Setup
• Explored fully connected feed-forward neural nets, with no other architecture constraints: 7 fully connected hidden layers with widths 12-10-7-5-4-3-2

• sigmoid activation on the final layer, tanh activation on all other layers

• Binary decision task on synthetic data: D ~ P(X, Y)

• Experiments with 50 different randomized weight initializations and 50 different datasets generated from the same distribution

• Trained with SGD to minimize the cross-entropy loss function, with no regularization
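A minimal sketch of a forward pass through this architecture, assuming the first width (12) is the input dimension and adding a single sigmoid output unit for the binary task (the slides do not specify the output layer); the weight initialization is an illustrative choice, not the authors':

```python
import numpy as np

# 12-d input, hidden widths from the slide, one sigmoid output unit (an assumption).
WIDTHS = [12, 10, 7, 5, 4, 3, 2, 1]

rng = np.random.default_rng(0)
weights = [rng.normal(0, 1 / np.sqrt(n_in), size=(n_in, n_out))
           for n_in, n_out in zip(WIDTHS[:-1], WIDTHS[1:])]
biases = [np.zeros(n_out) for n_out in WIDTHS[1:]]

def forward(x):
    """Return the activations T1..Tk of every layer for a batch of inputs x."""
    layers = []
    t = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = t @ w + b
        # tanh everywhere except the final layer, which is sigmoid
        t = 1 / (1 + np.exp(-z)) if i == len(weights) - 1 else np.tanh(z)
        layers.append(t)
    return layers

x = rng.integers(0, 2, size=(4, 12)).astype(float)  # 4 random 12-bit inputs
acts = forward(x)
```

Keeping every layer's activation vector, rather than only the output, is what makes it possible to place each Ti on the information plane.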
Dynamics in the Information Plane
• All 50 test runs follow similar paths in the information plane

• Two different phases of training: a fast 'ERM reduction' phase and a slower 'representation compression' phase

• In the first phase (~400 epochs), the test error is rapidly reduced (increase in I(T;Y))

• In the second phase (from 400 to 9000 epochs), the error is relatively unchanged but the layers lose input information (decrease in I(X;T))
Representation Compression
• The loss of input information occurs without any form of regularization

• This prevents overfitting, since the layers lose irrelevant information ("generalizing by forgetting")

• However, overfitting can still occur with less data

[Figure: information-plane trajectories for networks trained on 5%, 45%, and 85% of the data respectively]
Phase transitions
• The 'ERM reduction' phase is called a drift phase, where the gradients are large and the weights are changing rapidly (high signal-to-noise)

• The 'representation compression' phase is called a diffusion phase, where the gradients are small compared to their variance (low signal-to-noise)

[Figure: per-layer gradient means and standard deviations over epochs; the phase transition occurs at the dotted line]
Effectiveness of Deep Nets
• Because of the low signal-to-noise ratio in the diffusion phase, the final weights obtained by the DNN are effectively random

• Across different experiments, the correlations between the weights of different neurons in the same layer were very small

• "This indicates that there is a huge number of different networks with essentially optimal performance, and attempts to interpret single weights or single neurons in such networks can be meaningless"
Why go deep?
• Faster ERM minimization (in epochs)

• Faster representation compression (in epochs)
Discussions/Disclaimers
• The claims being made are certainly very interesting, although the scope of experiments is limited: only one specific distribution and one specific network are examined.

• The paper acknowledges that different setups need to be tested: do the results hold with different decision rules and network architectures? And is this observed in "real world" problems?
Discussions/Disclaimers
• As of now there is some controversy about whether or not these claims hold up: see "On the Information Bottleneck Theory of Deep Learning" (Saxe et al., 2018)

• "Here we show that none of these claims hold true in the general case. [...] we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed [...] Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent."
The End
