Jeff Byers - Machine Learning and Advanced Statistics
• Note, these are choices we make and NOT fundamental attributes of the
universe. They reside in our minds and associated artifacts (e.g., databases,
journal articles, computer programs) as a representation of the universe.
• These choices are both necessary and arbitrary, and there is no way to avoid
making them.
• We should NEVER become confused and push even our most useful
concepts onto the universe, imagining they have a reality independent
of the representations in our minds.
• The key is to identify those properties of the universe that remain after we
have removed the effect of our arbitrary choices in how to represent it.
Historical Note: Doing this for coordinate systems led Einstein to General Relativity and the failure to
do this for probability has led to the interpretation morass of Quantum Mechanics.
Types of Models
X – Properties (model space)
Y – Observations (data space)
Statistical resolvability of a model:
$$p(\mathrm{model} \mid \mathrm{obs}) = \frac{p(\mathrm{data}=\mathrm{obs} \mid \mathrm{model})\, p(\mathrm{model})}{p(\mathrm{data}=\mathrm{obs})}$$
Bayesians “pull back” the observations to the model space as a posterior probability.
Bayesian Inference: Looking backwards through the model
Physicists usually think about what model can explain the data.
BUT what about a different view of this question:
NEW DATA → updated REPRESENTATION. After N samples, the prior over the parameters $(\theta_1, \theta_2)$ is updated to the posterior:
$$p(\theta_1, \theta_2 \mid x) = \frac{p(x \mid \theta_1, \theta_2)\, p(\theta_1, \theta_2)}{p(x)}$$
posterior = likelihood × prior / evidence
UPDATE: the Posterior then serves as the Prior for the next observation.
Bayesian Inference: Looking backwards through the model
Goal: Learn the fairness or bias, B, of a coin.
A sequence of coin tosses forms the data set.
Assign a prior probability $p(B)$ based on beliefs.
Likelihood of a single toss:
$$p(x = H \mid B) = B, \qquad p(x = T \mid B) = 1 - B$$
NOTE: B is an unknown parameter.
$$p(B \mid x) = \frac{p(x \mid B)\, p(B)}{p(x)}$$
posterior = likelihood × prior / evidence
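As a concrete illustration (not part of the original slides), the NumPy sketch below performs this update; it assumes a Beta prior on B, which is conjugate to the Bernoulli likelihood, and also shows the same sequential update for an arbitrary prior tabulated on a grid.

```python
# Minimal sketch (assumption: Beta prior on the bias B, conjugate to the
# Bernoulli likelihood p(x=H|B) = B, p(x=T|B) = 1 - B).
import numpy as np

def posterior_beta(tosses, a=1.0, b=1.0):
    """Closed-form posterior Beta(a + #H, b + #T) after observing `tosses`."""
    heads = sum(1 for t in tosses if t == "H")
    tails = len(tosses) - heads
    return a + heads, b + tails

def posterior_grid(tosses, prior, grid):
    """Same update for an arbitrary prior p(B) tabulated on a grid of B values."""
    post = np.array(prior, dtype=float)
    for t in tosses:
        like = grid if t == "H" else 1.0 - grid   # p(x | B)
        post = like * post                        # likelihood x prior
        post /= post.sum()                        # divide by the evidence p(x)
    return post

tosses = list("HHTHTHHH")
a, b = posterior_beta(tosses)
print("Posterior mean of B:", a / (a + b))

grid = np.linspace(0.0, 1.0, 501)
post = posterior_grid(tosses, prior=np.ones_like(grid), grid=grid)
print("Grid posterior mean:", np.sum(grid * post))
```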
At some point, we must make a decision …
Decision-theoretic perspective:
• Define a set of probability models, $p(X \mid \theta)$, for the data, X, indexed by a parameter, $\theta \in \Theta$.
• Define a procedure $d(X)$ that operates on the data to produce a decision.
• Define a loss function, $L(\theta, d(X))$.
• The goal is to use the loss function to compare procedures via a risk, R; however, both arguments are unknown! (See the sketch below.)
Beliefs about the data you might sample place a probability $p(X)$ on the DATA SPACE; beliefs about models prior to acquiring data place a probability $p(\theta)$ on the MODEL SPACE. A statistical estimation procedure $\hat{\theta} = d(X)$ maps the data space to the model space, and the loss function $L(\theta, d(X))$ compares the estimate with the true parameter.
PAST ← NOW → FUTURE:
$$\text{DATA} \rightarrow X_{t-2} \rightarrow X_{t-1} \rightarrow X_{t} \rightarrow X_{t+1} \rightarrow X_{t+2} \rightarrow \cdots$$
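To make the risk comparison concrete (an illustrative sketch, not from the slides), the following NumPy snippet estimates $R(\theta, d) = E_X[L(\theta, d(X))]$ by Monte Carlo for two hypothetical procedures under squared-error loss:

```python
# Minimal sketch: comparing two decision procedures d(X) by their risk
# R(theta, d) = E_X[ L(theta, d(X)) ] for the model X_1..X_n ~ Normal(theta, 1).
import numpy as np

rng = np.random.default_rng(0)

def risk(d, theta, n=10, trials=20000):
    X = rng.normal(theta, 1.0, size=(trials, n))   # draws from p(X | theta)
    loss = (d(X) - theta) ** 2                     # L(theta, d(X))
    return loss.mean()                             # Monte Carlo estimate of R

def sample_mean(X):
    return X.mean(axis=1)

def shrunk_mean(X):                                # a hypothetical shrinkage rule
    return 0.8 * X.mean(axis=1)

for theta in (0.0, 1.0, 3.0):
    print(theta, risk(sample_mean, theta), risk(shrunk_mean, theta))
# Neither procedure dominates for every theta -- this is why both arguments of
# L(theta, d(X)) being unknown forces further choices (minimax, Bayes risk, ...).
```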
The University of Florida Sparse Matrix Collection, T. A. Davis and Y. Hu, ACM Transactions on Mathematical Software, Vol. 38, Issue 1, 2011, pp. 1-25. http://www.cise.ufl.edu/research/sparse/matrices/synopsis
From rules to intelligence
Prosthesis for the mind:
From data to symbolic reasoning
https://www.fiverr.com/bilalahmedd/machine-learning-data-science-tensor-flow-python
Neural Networks
A neural network computes $y = f_W(x)$, and training adjusts each weight by gradient descent on a cost function $C$:
$$w_{ij}(t+1) = w_{ij}(t) - \eta\, \frac{\partial C}{\partial w_{ij}}$$
where $\eta$ is the learning rate.
MIT CSAIL
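A minimal NumPy sketch of this training rule (an illustration, not code from the course) applies the same weight update to a tiny one-hidden-layer network:

```python
# Minimal sketch: one hidden layer, trained with the update
# w_ij(t+1) = w_ij(t) - eta * dC/dw_ij, where C is a squared-error cost.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
Y = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
eta = 0.5                                                     # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(5000):
    # forward pass: y = f_W(x)
    H = sigmoid(X @ W1 + b1)
    Yhat = sigmoid(H @ W2 + b2)
    # backward pass: gradients of C = 0.5 * sum((Yhat - Y)^2)
    dY = (Yhat - Y) * Yhat * (1 - Yhat)
    dH = (dY @ W2.T) * H * (1 - H)
    # gradient-descent update of every weight w_ij
    W2 -= eta * H.T @ dY;  b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dH;  b1 -= eta * dH.sum(axis=0)

print(np.round(Yhat, 2))   # approaches the XOR targets [0, 1, 1, 0]
```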
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”, Anh Nguyen, Jason Yosinski and Jeff Clune, CVPR 2015, pp. 427-436.
So What’s Happening?
The data lies along a manifold constrained to low dimensions by the generative mechanism: $d \ll D$. However, the DL Neural Network typically chops up the space with hyperplanes to form non-local decision boundaries for the classes.
Exception!
So What’s Happening?
Figure: real data vs. adversarial data.
https://thomas-tanay.github.io/post--L2-regularization/
Generative Adversarial Network (GAN): Bug as Feature
“This, and the variations that are now being proposed is
the most interesting idea in the last 10 years in ML, in
my opinion.” -Yann LeCun (2016)
More Supervised Learning to the rescue … LABEL = { Real, Fake }
Figure: the discriminator’s YES/NO (Real/Fake) decision vs. the data manifold, $d \ll D$.
This is a form of Implicit Density Estimation.
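A minimal PyTorch sketch (illustrative only; the toy data, network sizes, and training settings are assumptions, not from the slides) shows the “supervised learning to the rescue” idea: the discriminator is an ordinary Real/Fake classifier, and the generator is trained to fool it.

```python
# Minimal GAN sketch: D is a supervised classifier with LABEL = {Real, Fake};
# G is trained to make D answer "Real" for its samples.
import torch
from torch import nn, optim

def real_data(n):
    # toy "real" distribution: a Gaussian blob near (2, -1)
    return torch.randn(n, 2) * 0.3 + torch.tensor([2.0, -1.0])

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))             # noise z -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G, opt_D = optim.Adam(G.parameters(), lr=1e-3), optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # --- train D: label real samples 1, generated samples 0 ---
    x_real, x_fake = real_data(64), G(torch.randn(64, 8)).detach()
    loss_D = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake), torch.zeros(64, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # --- train G: try to make D call its samples "Real" ---
    x_fake = G(torch.randn(64, 8))
    loss_G = bce(D(x_fake), torch.ones(64, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(torch.randn(5, 8)))   # samples should drift toward the "real" cluster near (2, -1)
```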
Spectral Dynamics
Dynamical Systems Theory · Random Matrix Theory
Statistical Mechanics of Knowledge
“Atom of Decidability”
PHYSICS
INFORMATION: 0 / 1
LANGUAGE: F / T
Cantor’s Nightmare: Developing a proper fear of the infinite
The real number system is a very unstable representation for modelling.
Likewise, functions on the reals are strange objects. Why?
More topics
Mathematical Background: Thinking of functions as “really big” vectors
The function should be viewed as the limit of a probabilistic association between the elements of the sets in the domain and range.
Many problems in machine learning and statistics are fundamentally about finding a decomposition of a space into two subspaces and then learning a mapping between them using a finite data set (a toy sketch follows the definitions below).
$$z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad x = (x_1, \ldots, x_d), \qquad y = (y_1, \ldots, y_{D-d})$$
Entire space: $z \in E = (B \times F)|_U$
Base space: $x \in B \subset \mathbb{R}^{d}$
Fiber: $y \in F \subset \mathbb{R}^{D-d}$
Trivializing neighborhood: $U \subset B$
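As a toy illustration of “decompose the space into two subspaces, then learn a mapping between them from finite data” (a sketch with assumed choices: PCA for the split and linear least squares for the map, neither of which is prescribed by the slides):

```python
# Illustrative sketch: split R^D into a d-dim "base" subspace and its (D-d)-dim
# "fiber" complement, then learn the base -> fiber map from a finite data set.
import numpy as np

rng = np.random.default_rng(2)
D, d, N = 5, 2, 400

# Synthetic data near a d-dimensional manifold embedded in R^D.
x = rng.normal(size=(N, d))                       # base coordinates
A = rng.normal(size=(d, D))
z = np.tanh(x @ A) + 0.01 * rng.normal(size=(N, D))

# Decompose the space with PCA (SVD of the centered data).
zc = z - z.mean(axis=0)
_, _, Vt = np.linalg.svd(zc, full_matrices=False)
base = zc @ Vt[:d].T                              # coordinates x in B
fiber = zc @ Vt[d:].T                             # coordinates y in F

# Learn the mapping base -> fiber by linear least squares.
W, *_ = np.linalg.lstsq(base, fiber, rcond=None)
print("fiber prediction RMS error:", np.sqrt(np.mean((base @ W - fiber) ** 2)))
```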
Vapnik’s Gamble: Discriminative Supervised Learning
“When solving a problem of interest, do not solve a more general problem as an intermediate step.” – Vladimir Vapnik
$$L' = L \circ f^{-1}: \qquad y \;\xrightarrow{\;\text{``}f^{-1}\text{''}\;}\; \hat{x} \;\xrightarrow{\;L\;}\; \hat{\ell}$$
Coarse-grained retrieval: map the observation y directly to a label,
$$L': \quad p(\ell \mid y) = \int dx\; p(\ell \mid x)\, p(x \mid y)$$
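A small discrete sketch of this marginalization (illustrative, with made-up tabulated distributions):

```python
# Toy discrete sketch: coarse-grained retrieval
# p(l | y) = sum_x p(l | x) p(x | y), with all distributions tabulated.
import numpy as np

rng = np.random.default_rng(3)
nx, ny, nl = 6, 4, 2

p_x = np.full(nx, 1.0 / nx)                        # prior over properties x
p_y_given_x = rng.dirichlet(np.ones(ny), size=nx)  # forward model p(y | x)
p_l_given_x = rng.dirichlet(np.ones(nl), size=nx)  # labeling p(l | x)

# Bayes: p(x | y) = p(y | x) p(x) / p(y)
p_xy = p_y_given_x * p_x[:, None]
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y[None, :]

# Marginalize out the properties: p(l | y) = sum_x p(l | x) p(x | y)
p_l_given_y = p_l_given_x.T @ p_x_given_y
print(p_l_given_y.sum(axis=0))   # each column sums to 1
```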
Summary of perspectives on Data-to-Decisions
Data: properties x with $p(x)$ and entropy $H(X)$; observations y with $p(y)$ and entropy $H(Y)$. Forward model / likelihood: $f$, $p(y \mid x)$, with mutual information $I(X, Y)$.
Decisions: labels $\ell$ with $p(\ell)$, $H(L)$ (via $L$, $p(\ell \mid x)$, $I(X, L)$) and $\ell'$ with $p(\ell')$, $H(L')$ (via $L'$, $p(\ell' \mid y)$, $I(Y, L')$).
Label mixing / confusion matrix: $p(\ell' \mid \ell)$
$$I(L, L') = H(L) - H(L \mid L'), \qquad C = \max_{p(\ell \mid x)} I(L, L')$$
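For instance (with a hypothetical 2×2 confusion matrix), the label-level mutual information can be computed directly:

```python
# Illustrative sketch: mutual information between true and decided labels,
# I(L, L') = H(L) - H(L | L'), computed from a (made-up) joint label distribution.
import numpy as np

# joint p(l, l') -- rows: true label l, columns: decision l'
p_joint = np.array([[0.40, 0.10],
                    [0.05, 0.45]])

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

p_l = p_joint.sum(axis=1)                         # p(l)
p_lp = p_joint.sum(axis=0)                        # p(l')
H_L = entropy(p_l)
H_L_given_Lp = entropy(p_joint) - entropy(p_lp)   # H(L, L') - H(L')
print("I(L, L') =", H_L - H_L_given_Lp, "bits")
```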
Information-theoretic view of remote sensing
Using the “Information Bottleneck” approach. Property space: x; Observation space: y; QR indices: $\ell$.
Mappings: $f: X \to Y$, $y = f(x)$; $\quad L: X \to \mathcal{L}$, $\ell = L(x)$; $\quad L': Y \to \mathcal{L}'$, $\ell' = L'(y)$.
Classification Design (to support the experiment):
Given the possible observations of the property space, what are the reliable classification schemes?
Quantizing the property space, x, to preserve the observations, y:
$$p(\ell \mid x) = \frac{p(\ell)}{Z(x, \beta)} \exp\!\left( -\beta \sum_y p(y \mid x) \log \frac{p(y \mid x)}{p(y \mid \ell)} \right)$$
$$p(y \mid \ell) = \frac{1}{p(\ell)} \sum_x p(\ell \mid x)\, p(y \mid x)\, p(x), \qquad p(\ell) = \sum_x p(\ell \mid x)\, p(x)$$
The parameter $\beta$ represents the tradeoff between compression of x and the predictive accuracy of L.
Experimental Design (to support the classification):
Given a classification scheme of the property space, what are the reliable observations?
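A compact numerical sketch of these self-consistent updates (toy discrete distributions, arbitrary β; an illustration rather than a production Information Bottleneck solver):

```python
# Illustrative sketch of the Information Bottleneck self-consistent equations
# on a toy discrete p(x, y); the sizes and beta are arbitrary choices.
import numpy as np

rng = np.random.default_rng(4)
nx, ny, nl, beta = 8, 5, 3, 5.0

p_x = np.full(nx, 1.0 / nx)
p_y_given_x = rng.dirichlet(np.ones(ny), size=nx)        # p(y | x)

# Initialize a random soft quantization p(l | x).
p_l_given_x = rng.dirichlet(np.ones(nl), size=nx)

for _ in range(200):
    p_l = p_l_given_x.T @ p_x                            # p(l) = sum_x p(l|x) p(x)
    p_y_given_l = (p_l_given_x * p_x[:, None]).T @ p_y_given_x / p_l[:, None]
    # KL divergence D[ p(y|x) || p(y|l) ] for every (x, l) pair
    kl = np.array([[np.sum(p_y_given_x[x] * np.log(p_y_given_x[x] / p_y_given_l[l]))
                    for l in range(nl)] for x in range(nx)])
    logits = np.log(p_l)[None, :] - beta * kl
    p_l_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
    p_l_given_x /= p_l_given_x.sum(axis=1, keepdims=True)  # normalization Z(x, beta)

print(np.round(p_l_given_x, 3))   # converged soft classification of the property space
```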
Bivariate Gaussian setup:
$$z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_{xx}^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_{yy}^2 \end{pmatrix}, \qquad |\Sigma| = \sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4$$
With precision matrix $\Lambda = \Sigma^{-1}$, the PDF is
$$p(x, y) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2} q^2}, \qquad
q^2 = \frac{\sigma_{yy}^2 (x - \mu_x)^2 + \sigma_{xx}^2 (y - \mu_y)^2 - 2\,\sigma_{xy}^2 (x - \mu_x)(y - \mu_y)}{\sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4}$$
Completing the square in the Gaussian: with $B = \sigma_{xy}^2 / \sigma_{xx}^2$,
$$\big( (y - \mu_y) - B (x - \mu_x) \big)^2 = (y - \mu_y)^2 - 2 B (x - \mu_x)(y - \mu_y) + B^2 (x - \mu_x)^2$$
so that
$$q^2 = \frac{\sigma_{xx}^2 \big( (y - \mu_y) - B (x - \mu_x) \big)^2 + \big( \sigma_{yy}^2 - B^2 \sigma_{xx}^2 \big)(x - \mu_x)^2}{\sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4}
= \frac{\big( y - \mu_y(x) \big)^2}{\tilde{\sigma}_{yy}^2} + \frac{(x - \mu_x)^2}{\sigma_{xx}^2}$$
where $\mu_y(x) = \mu_y + \dfrac{\sigma_{xy}^2}{\sigma_{xx}^2}\,(x - \mu_x)$ and $\tilde{\sigma}_{yy}^2 = \sigma_{xx}^{-2}\big( \sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4 \big)$.
The joint density therefore factors into a conditional times a marginal:
$$p(x, y) = \big( 2\pi \tilde{\sigma}_{yy}^2 \big)^{-1/2} \exp\!\left[ -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} \right] \big( 2\pi \sigma_{xx}^2 \big)^{-1/2} \exp\!\left[ -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right]$$
Marginals:
$$p(x \mid \mu_x, \sigma_{xx}^2) = \big( 2\pi \sigma_{xx}^2 \big)^{-1/2} \exp\!\left[ -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right], \qquad
p(y \mid \mu_y, \sigma_{yy}^2) = \big( 2\pi \sigma_{yy}^2 \big)^{-1/2} \exp\!\left[ -\frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} \right]$$
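A quick numerical check of this conditioning result (an illustrative sketch with arbitrary numbers; the matrix entry written $\sigma_{xy}^2$ is the off-diagonal covariance):

```python
# Numerical check of the bivariate Gaussian conditioning formulas:
# mu_y(x) = mu_y + (s_xy^2/s_xx^2)(x - mu_x),  var(y|x) = s_yy^2 - s_xy^4/s_xx^2.
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.5]])            # [[s_xx^2, s_xy^2], [s_xy^2, s_yy^2]]

z = rng.multivariate_normal(mu, Sigma, size=2_000_000)
x, y = z[:, 0], z[:, 1]

x0 = 2.0                                  # condition on a thin slab around x = x0
sel = np.abs(x - x0) < 0.01
print("empirical  E[y|x0], Var[y|x0]:", y[sel].mean(), y[sel].var())

mu_y_x0 = mu[1] + Sigma[0, 1] / Sigma[0, 0] * (x0 - mu[0])
var_y_x = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
print("analytic   E[y|x0], Var[y|x0]:", mu_y_x0, var_y_x)
```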
Mutual information: Gaussians
Bivariate Gaussian PDF (factored) and marginals:
$$p(x, y) = \big( 2\pi \tilde{\sigma}_{yy}^2 \big)^{-1/2} \exp\!\left[ -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} \right] \big( 2\pi \sigma_{xx}^2 \big)^{-1/2} \exp\!\left[ -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right] = p(y \mid x)\, p(x)$$
$$p(x \mid \mu_x, \sigma_{xx}^2) = \big( 2\pi \sigma_{xx}^2 \big)^{-1/2} \exp\!\left[ -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right], \qquad
p(y \mid \mu_y, \sigma_{yy}^2) = \big( 2\pi \sigma_{yy}^2 \big)^{-1/2} \exp\!\left[ -\frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} \right]$$
Mean shift (linear regression): $\mu_y(x) = \mu_y + \dfrac{\sigma_{xy}^2}{\sigma_{xx}^2}\,(x - \mu_x)$
Variance reduction: $\tilde{\sigma}_{yy}^2 - \sigma_{yy}^2 = -\sigma_{xy}^4\, \sigma_{xx}^{-2} \le 0$ (Note: no x-dependence.)
Mutual information:
$$I(X, Y) = \int_{-\infty}^{\infty} dx\, p(x) \int_{-\infty}^{\infty} dy\, p(y \mid x) \ln \frac{p(y \mid x)}{p(y)}$$
$$\ln p(y \mid x) - \ln p(y) = -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} + \frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} + \tfrac{1}{2} \ln \big( \sigma_{yy}^2\, \tilde{\sigma}_{yy}^{-2} \big)$$
Useful decomposition:
$$(y - \mu_y)^2 = \big( y - \mu_y(x) \big)^2 + 2 \big( y - \mu_y(x) \big)\big( \mu_y(x) - \mu_y \big) + \big( \mu_y(x) - \mu_y \big)^2$$
so that
$$\big( 2\pi \tilde{\sigma}_{yy}^2 \big)^{-1/2} \int_{-\infty}^{\infty} dy\, (y - \mu_y)^2 \exp\!\left[ -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} \right] = \tilde{\sigma}_{yy}^2 + \big( \mu_y(x) - \mu_y \big)^2$$
Mutual information: Gaussians
Evaluating the inner integral:
$$\int_{-\infty}^{\infty} dy\, p(y \mid x) \ln \frac{p(y \mid x)}{p(y)}
= \big( 2\pi \tilde{\sigma}_{yy}^2 \big)^{-1/2} \int_{-\infty}^{\infty} dy\, \exp\!\left[ -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} \right]
\left[ \frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} - \frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} + \tfrac{1}{2} \ln \big( \sigma_{yy}^2 \tilde{\sigma}_{yy}^{-2} \big) \right]$$
$$= \tfrac{1}{2} \ln \big( \sigma_{yy}^2 \tilde{\sigma}_{yy}^{-2} \big) - \tfrac{1}{2} + \tfrac{1}{2}\, \sigma_{yy}^{-2} \left[ \tilde{\sigma}_{yy}^2 + \big( \mu_y(x) - \mu_y \big)^2 \right]$$
Averaging over $p(x)$, with $\big\langle \big( \mu_y(x) - \mu_y \big)^2 \big\rangle_x = \sigma_{xy}^4\, \sigma_{xx}^{-2}$:
$$I(X, Y) = \tfrac{1}{2} \ln \big( \sigma_{yy}^2 \tilde{\sigma}_{yy}^{-2} \big) - \tfrac{1}{2} + \tfrac{1}{2}\, \sigma_{yy}^{-2} \big( \tilde{\sigma}_{yy}^2 + \sigma_{xx}^{-2} \sigma_{xy}^4 \big)$$
Since $\sigma_{yy}^{-2} \big( \tilde{\sigma}_{yy}^2 + \sigma_{xx}^{-2} \sigma_{xy}^4 \big) - 1 = \sigma_{yy}^{-2} \big( \sigma_{yy}^2 - \sigma_{xx}^{-2} \sigma_{xy}^4 \big) + \sigma_{xx}^{-2} \sigma_{yy}^{-2} \sigma_{xy}^4 - 1 = 0$, the last two terms cancel and
$$I(X, Y) = \tfrac{1}{2} \ln \frac{\sigma_{yy}^2}{\tilde{\sigma}_{yy}^2} = \tfrac{1}{2} \ln \frac{\sigma_{xx}^2 \sigma_{yy}^2}{|\Sigma|}$$
Equivalently,
$$I(X, Y) = H(X) + H(Y) - H(X, Y) = \ln \frac{\sigma_{xx}\, \sigma_{yy}}{|\Sigma|^{1/2}} = \ln \frac{A_0}{A}$$
Proposed generalization (multivariate):
$$I(X, Y) = \ln \frac{|\Sigma_{XX}|^{1/2}\, |\Sigma_{YY}|^{1/2}}{|\Sigma|^{1/2}}$$
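A short numerical sanity check of this closed form (illustrative, arbitrary covariance values), comparing it with a Monte Carlo estimate of $E[\ln p(y \mid x) - \ln p(y)]$:

```python
# Illustrative check of the Gaussian mutual-information formula
# I(X, Y) = (1/2) ln( s_xx^2 s_yy^2 / |Sigma| ), in nats.
import numpy as np

rng = np.random.default_rng(6)
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.5]])
s_xx2, s_yy2, s_xy2 = Sigma[0, 0], Sigma[1, 1], Sigma[0, 1]

I_closed = 0.5 * np.log(s_xx2 * s_yy2 / np.linalg.det(Sigma))

z = rng.multivariate_normal([0.0, 0.0], Sigma, size=1_000_000)
x, y = z[:, 0], z[:, 1]
mu_y_x = (s_xy2 / s_xx2) * x                          # conditional mean mu_y(x)
var_y_x = s_yy2 - s_xy2 ** 2 / s_xx2                  # conditional variance
log_p_y_given_x = -0.5 * np.log(2 * np.pi * var_y_x) - (y - mu_y_x) ** 2 / (2 * var_y_x)
log_p_y = -0.5 * np.log(2 * np.pi * s_yy2) - y ** 2 / (2 * s_yy2)
I_mc = np.mean(log_p_y_given_x - log_p_y)

print("closed form:", I_closed, " Monte Carlo:", I_mc)   # the two should agree
```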
Class Schedules
Advanced Statistics for Physics
Statistical Mechanics of Complex Systems