IFT6085 Presentation IB
An overview by
Philip Amortila and Nicolas Gagné
Problem:
The usual "black box" story:
"Despite their great success, there is still
no comprehensive understanding of the
internal organization of Deep Neural
Networks."
"bull"
? ? ?
2
Solution:
Opening the black box using mutual information!
[Figure: the same image and the "bull" output, but the network's internals are now probed with mutual information.]
Spoiler alert!
As we train a deep neural network, its layers will
1. Gain "information" about the true label and the input.
Spoiler alert!
Before delving in, we first need to define what we mean by "information".
What is mutual information?
For discrete random variables X and Y:
• Entropy of X given nothing: $H(X) := -\sum_{i=1}^{n} p(x_i) \log p(x_i)$
• Entropy of X given y: $H(X|y) := -\sum_{i=1}^{n} p(x_i|y) \log p(x_i|y)$
• Entropy of X given Y: $H(X|Y) := \sum_{j=1}^{m} p(y_j) H(X|y_j)$
• Mutual information between X and Y: $I(X;Y) := H(X) - H(X|Y)$
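To make these definitions concrete, here is a minimal Python sketch (not from the slides) that computes H(X), H(X|Y), and I(X;Y) from a joint distribution table; the example table p_xy is hypothetical:

import numpy as np

def entropy(p):
    # Shannon entropy H = -sum p log2 p, skipping zero-probability entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    # I(X;Y) = H(X) - H(X|Y), where p_xy[i, j] = P(X = x_i, Y = y_j).
    p_y = p_xy.sum(axis=0)                       # marginal of Y
    h_x = entropy(p_xy.sum(axis=1))              # H(X), "given nothing"
    # H(X|Y) = sum_j p(y_j) * H(X | y_j)
    h_x_given_y = sum(py * entropy(p_xy[:, j] / py)
                      for j, py in enumerate(p_y) if py > 0)
    return h_x - h_x_given_y

# Hypothetical joint distribution of two binary variables:
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(p_xy))  # about 0.28 bits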
Setup:
Given a deep neural network where X and Y are discrete random variables, with the input X related to the label Y through the conditional distribution p(x|y).
We identify the propagated values at layer 1 with the vector T1, and more generally the propagated values at layer i with the vector Ti, so that every layer has its corresponding vector.
[Figure: the network, with each layer's activations labelled by its vector Ti.]
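As a concrete (hypothetical) illustration of identifying each layer's propagated values with a vector Ti, here is a minimal PyTorch sketch that records them with forward hooks; the network and layer sizes are placeholders, not the presenters' code:

import torch
import torch.nn as nn

# Placeholder feed-forward network; any nn.Sequential works the same way.
net = nn.Sequential(nn.Linear(12, 10), nn.Tanh(),
                    nn.Linear(10, 7), nn.Tanh(),
                    nn.Linear(7, 2))

activations = {}  # layer index -> T_i for the current batch

def save_activation(i):
    def hook(module, inputs, output):
        activations[i] = output.detach()  # the propagated values T_i
    return hook

# Record the output of every nonlinearity, i.e. each hidden layer's T_i.
for i, module in enumerate(net):
    if isinstance(module, nn.Tanh):
        module.register_forward_hook(save_activation(i))

x = torch.randn(5, 12)  # a batch of 5 random inputs
_ = net(x)              # the forward pass fills `activations`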
Next, pick your favorite layer, say T.
We will plot T's current location on the "information plane".
[Plot: the information plane, with I(X;T) on the x-axis ("how much it knows about X") and I(Y;T) on the y-axis ("how much it knows about Y").]
Then, we train our deep neural network for a bit.
We train a bit more...
And as we train, we trace the layer's trajectory in the information plane.
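Placing a layer T on the plane requires estimates of I(X;T) and I(Y;T); the underlying paper obtains them by discretizing each neuron's activation into bins and computing empirical mutual information. Below is a rough Python sketch under that binning assumption; the bin count and the toy data are illustrative choices, not the authors' exact procedure:

import numpy as np
from collections import Counter

def discretize(t, n_bins=30):
    # Map each activation vector to a tuple of bin indices
    # (tanh activations lie in [-1, 1]).
    bins = np.linspace(-1, 1, n_bins)
    return [tuple(row) for row in np.digitize(t, bins)]

def mi_from_samples(a, b):
    # Empirical I(A;B) in bits from paired samples of discrete variables:
    # I = sum_{a,b} p(a,b) * log2( p(a,b) / (p(a) p(b)) ).
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * np.log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

# Toy usage with random data (shapes are illustrative):
rng = np.random.default_rng(0)
t = np.tanh(rng.normal(size=(1000, 3)))  # activations T of a width-3 layer
y = list(rng.integers(0, 2, size=1000))  # binary labels Y
x_ids = list(range(1000))                # each input X is distinct
print(mi_from_samples(x_ids, discretize(t)),  # I(X;T)
      mi_from_samples(y, discretize(t)))      # I(Y;T)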
Let's see what it looks like for a fixed data point ( [image], "bull" ).
[Plot: the information plane, with I(X;T) on the x-axis ("how much it knows about X") and I(Y;T) on the y-axis ("how much it knows about Y"). As training proceeds, points for data points labelled "dog", "goat", and "bull" appear one by one and move through the plane.]
Numerical Experiments and Results
Examining the dynamics of SGD in the mutual information plane
Experimental Setup
• Explored fully connected feed-forward neural nets, with no other architectural constraints: 7 fully connected hidden layers with widths 12-10-7-5-4-3-2 (a minimal sketch of this architecture follows below).
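For concreteness, here is a minimal PyTorch sketch of a net with these hidden widths; the input dimensionality, activation functions, and output layer are assumptions, and the experiments' actual implementation may differ:

import torch.nn as nn

widths = [12, 10, 7, 5, 4, 3, 2]  # hidden widths from the setup above

layers, in_dim = [], 12           # 12-dimensional input (assumption)
for w in widths:
    layers += [nn.Linear(in_dim, w), nn.Tanh()]  # tanh hidden units (assumption)
    in_dim = w
layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]   # binary output (assumption)

net = nn.Sequential(*layers)
print(net)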