Jeff Byers - Machine Learning and Advanced Statistics
• Note, these are choices we make and NOT fundamental attributes of the
universe. They reside in our minds and associated artifacts (e.g., databases,
journal articles, computer programs) as a representation of the universe.
• These choices are both necessary and arbitrary, and there is no way to avoid
making them.
• We should NEVER become confused and push even our most useful
concepts onto the universe, imagining they have a reality independent
of the representations in our minds.
• The key is to identify those properties of the universe that remain after we
have removed the effect of our arbitrary choices in how to represent it.
Historical Note: Doing this for coordinate systems led Einstein to General Relativity and the failure to
do this for probability has led to the interpretation morass of Quantum Mechanics.
Types of Models
X – Properties (model space)
Y – Observations (data space)
Statistical resolvability of a model:
$$p(\mathrm{model} \mid \mathrm{obs}) = \frac{p(\mathrm{data}=\mathrm{obs} \mid \mathrm{model})\, p(\mathrm{model})}{p(\mathrm{data}=\mathrm{obs})}$$
Bayesians “pull back” the observations to the model space as a posterior probability.
Bayesian Inference: Looking backwards through the model
Physicists usually think about what model can explain the data.
BUT what about a different view of this question:
NEW DATA → updated REPRESENTATION. After N samples, the prior over the parameters $(\theta_1, \theta_2)$ is updated to the posterior:
$$p(\theta_1, \theta_2 \mid x) = \frac{p(x \mid \theta_1, \theta_2)\, p(\theta_1, \theta_2)}{p(x)}$$
posterior = likelihood × prior / evidence
UPDATE: the Posterior then serves as the Prior for the next observation.
Bayesian Inference: Looking backwards through the model
Goal: Learn the fairness or bias, B, of a coin.
A sequence of coin tosses forms the data set.
Assign a prior probability $p(B)$ based on beliefs.
Likelihood of a single toss:
$$p(x = H \mid B) = B, \qquad p(x = T \mid B) = 1 - B$$
NOTE: B is an unknown parameter.
$$p(B \mid x) = \frac{p(x \mid B)\, p(B)}{p(x)}$$
posterior = likelihood × prior / evidence
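As a concrete illustration (not part of the original slides), the NumPy sketch below performs this update; it assumes a Beta prior on B, which is conjugate to the Bernoulli likelihood, and also shows the same sequential update for an arbitrary prior tabulated on a grid.

```python
# Minimal sketch (assumption: Beta prior on the bias B, conjugate to the
# Bernoulli likelihood p(x=H|B) = B, p(x=T|B) = 1 - B).
import numpy as np

def posterior_beta(tosses, a=1.0, b=1.0):
    """Closed-form posterior Beta(a + #H, b + #T) after observing `tosses`."""
    heads = sum(1 for t in tosses if t == "H")
    tails = len(tosses) - heads
    return a + heads, b + tails

def posterior_grid(tosses, prior, grid):
    """Same update for an arbitrary prior p(B) tabulated on a grid of B values."""
    post = np.array(prior, dtype=float)
    for t in tosses:
        like = grid if t == "H" else 1.0 - grid   # p(x | B)
        post = like * post                        # likelihood x prior
        post /= post.sum()                        # divide by the evidence p(x)
    return post

tosses = list("HHTHTHHH")
a, b = posterior_beta(tosses)
print("Posterior mean of B:", a / (a + b))

grid = np.linspace(0.0, 1.0, 501)
post = posterior_grid(tosses, prior=np.ones_like(grid), grid=grid)
print("Grid posterior mean:", np.sum(grid * post))
```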
At some point, we must make a decision …
Decision-theoretic perspective:
• Define a set of probability models, $p(X \mid \theta)$, for the data, X, indexed by a parameter, $\theta \in \Theta$.
• Define a procedure $d(X)$ that operates on the data to produce a decision.
• Define a loss function, $L(\theta, d(X))$.
• The goal is to use the loss function to compare procedures via a risk, R; however, both arguments are unknown! (See the sketch below.)
Beliefs about the data you might sample place a probability $p(X)$ on the DATA SPACE; beliefs about models prior to acquiring data place a probability $p(\theta)$ on the MODEL SPACE. A statistical estimation procedure $\hat{\theta} = d(X)$ maps the data space to the model space, and the loss function $L(\theta, d(X))$ compares the estimate with the true parameter.
PAST ← NOW → FUTURE:
$$\text{DATA} \rightarrow X_{t-2} \rightarrow X_{t-1} \rightarrow X_{t} \rightarrow X_{t+1} \rightarrow X_{t+2} \rightarrow \cdots$$
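To make the risk comparison concrete (an illustrative sketch, not from the slides), the following NumPy snippet estimates $R(\theta, d) = E_X[L(\theta, d(X))]$ by Monte Carlo for two hypothetical procedures under squared-error loss:

```python
# Minimal sketch: comparing two decision procedures d(X) by their risk
# R(theta, d) = E_X[ L(theta, d(X)) ] for the model X_1..X_n ~ Normal(theta, 1).
import numpy as np

rng = np.random.default_rng(0)

def risk(d, theta, n=10, trials=20000):
    X = rng.normal(theta, 1.0, size=(trials, n))   # draws from p(X | theta)
    loss = (d(X) - theta) ** 2                     # L(theta, d(X))
    return loss.mean()                             # Monte Carlo estimate of R

def sample_mean(X):
    return X.mean(axis=1)

def shrunk_mean(X):                                # a hypothetical shrinkage rule
    return 0.8 * X.mean(axis=1)

for theta in (0.0, 1.0, 3.0):
    print(theta, risk(sample_mean, theta), risk(shrunk_mean, theta))
# Neither procedure dominates for every theta -- this is why both arguments of
# L(theta, d(X)) being unknown forces further choices (minimax, Bayes risk, ...).
```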
The University of Florida Sparse Matrix Collection, T. A. Davis and Y. Hu, ACM Transactions on Mathematical Software, Vol. 38, Issue 1, 2011, pp. 1-25. http://www.cise.ufl.edu/research/sparse/matrices/synopsis
From rules to intelligence
Prosthesis for the mind:
From data to symbolic reasoning
https://www.fiverr.com/bilalahmedd/machine-learning-data-science-tensor-flow-python
Neural Networks
A neural network computes $y = f_W(x)$, and training adjusts each weight by gradient descent on a cost function $C$:
$$w_{ij}(t+1) = w_{ij}(t) - \eta\, \frac{\partial C}{\partial w_{ij}}$$
where $\eta$ is the learning rate.
MIT CSAIL
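A minimal NumPy sketch of this training rule (an illustration, not code from the course) applies the same weight update to a tiny one-hidden-layer network:

```python
# Minimal sketch: one hidden layer, trained with the update
# w_ij(t+1) = w_ij(t) - eta * dC/dw_ij, where C is a squared-error cost.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
Y = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
eta = 0.5                                                     # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(5000):
    # forward pass: y = f_W(x)
    H = sigmoid(X @ W1 + b1)
    Yhat = sigmoid(H @ W2 + b2)
    # backward pass: gradients of C = 0.5 * sum((Yhat - Y)^2)
    dY = (Yhat - Y) * Yhat * (1 - Yhat)
    dH = (dY @ W2.T) * H * (1 - H)
    # gradient-descent update of every weight w_ij
    W2 -= eta * H.T @ dY;  b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dH;  b1 -= eta * dH.sum(axis=0)

print(np.round(Yhat, 2))   # approaches the XOR targets [0, 1, 1, 0]
```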
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”, Anh Nguyen, Jason Yosinski and Jeff Clune, CVPR 2015, pp. 427-436.
So What’s Happening?
The data lies along a manifold constrained to low dimensions by the generative mechanism: $d \ll D$. However, the DL Neural Network typically chops up the space with hyperplanes to form non-local decision boundaries for the classes.
Exception!
So What’s Happening?
Figure: real data vs. adversarial data.
https://thomas-tanay.github.io/post--L2-regularization/
Generative Adversarial Network (GAN): Bug as Feature
“This, and the variations that are now being proposed is
the most interesting idea in the last 10 years in ML, in
my opinion.” -Yann LeCun (2016)
More Supervised Learning to the rescue … LABEL = { Real, Fake }
Figure: the discriminator’s YES/NO (Real/Fake) decision vs. the data manifold, $d \ll D$.
This is a form of Implicit Density Estimation.
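A minimal PyTorch sketch (illustrative only; the toy data, network sizes, and training settings are assumptions, not from the slides) shows the “supervised learning to the rescue” idea: the discriminator is an ordinary Real/Fake classifier, and the generator is trained to fool it.

```python
# Minimal GAN sketch: D is a supervised classifier with LABEL = {Real, Fake};
# G is trained to make D answer "Real" for its samples.
import torch
from torch import nn, optim

def real_data(n):
    # toy "real" distribution: a Gaussian blob near (2, -1)
    return torch.randn(n, 2) * 0.3 + torch.tensor([2.0, -1.0])

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))             # noise z -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G, opt_D = optim.Adam(G.parameters(), lr=1e-3), optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # --- train D: label real samples 1, generated samples 0 ---
    x_real, x_fake = real_data(64), G(torch.randn(64, 8)).detach()
    loss_D = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake), torch.zeros(64, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # --- train G: try to make D call its samples "Real" ---
    x_fake = G(torch.randn(64, 8))
    loss_G = bce(D(x_fake), torch.ones(64, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(torch.randn(5, 8)))   # samples should drift toward the "real" cluster near (2, -1)
```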
Spectral Dynamics
Dynamical Systems Theory · Random Matrix Theory
Statistical Mechanics of Knowledge
“Atom of Decidability”
PHYSICS
INFORMATION: 0 / 1
LANGUAGE: F / T
Cantor’s Nightmare: Developing a proper fear of the infinite
The real number system is a very unstable representation for modelling.
Likewise, functions on the reals are strange objects. Why?
More topics
Mathematical Background: Thinking of functions as “really big” vectors
The function should be viewed as the limit of a probabilistic association between the elements of the sets in the domain and range.
Many problems in machine learning and statistics are fundamentally about finding a decomposition of a space into two subspaces and then learning a mapping between them using a finite data set (a toy sketch follows the definitions below).
$$z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad x = (x_1, \ldots, x_d), \qquad y = (y_1, \ldots, y_{D-d})$$
Entire space: $z \in E = (B \times F)|_U$
Base space: $x \in B \subset \mathbb{R}^{d}$
Fiber: $y \in F \subset \mathbb{R}^{D-d}$
Trivializing neighborhood: $U \subset B$
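As a toy illustration of “decompose the space into two subspaces, then learn a mapping between them from finite data” (a sketch with assumed choices: PCA for the split and linear least squares for the map, neither of which is prescribed by the slides):

```python
# Illustrative sketch: split R^D into a d-dim "base" subspace and its (D-d)-dim
# "fiber" complement, then learn the base -> fiber map from a finite data set.
import numpy as np

rng = np.random.default_rng(2)
D, d, N = 5, 2, 400

# Synthetic data near a d-dimensional manifold embedded in R^D.
x = rng.normal(size=(N, d))                       # base coordinates
A = rng.normal(size=(d, D))
z = np.tanh(x @ A) + 0.01 * rng.normal(size=(N, D))

# Decompose the space with PCA (SVD of the centered data).
zc = z - z.mean(axis=0)
_, _, Vt = np.linalg.svd(zc, full_matrices=False)
base = zc @ Vt[:d].T                              # coordinates x in B
fiber = zc @ Vt[d:].T                             # coordinates y in F

# Learn the mapping base -> fiber by linear least squares.
W, *_ = np.linalg.lstsq(base, fiber, rcond=None)
print("fiber prediction RMS error:", np.sqrt(np.mean((base @ W - fiber) ** 2)))
```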
Vapnik’s Gamble: Discriminative Supervised Learning
“When solving a problem of interest, do not solve a more general problem as an intermediate step.” – Vladimir Vapnik
$$L' = L \circ f^{-1}: \qquad y \;\xrightarrow{\;\text{``}f^{-1}\text{''}\;}\; \hat{x} \;\xrightarrow{\;L\;}\; \hat{\ell}$$
Coarse-grained retrieval: map the observation y directly to a label,
$$L': \quad p(\ell \mid y) = \int dx\; p(\ell \mid x)\, p(x \mid y)$$
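A small discrete sketch of this marginalization (illustrative, with made-up tabulated distributions):

```python
# Toy discrete sketch: coarse-grained retrieval
# p(l | y) = sum_x p(l | x) p(x | y), with all distributions tabulated.
import numpy as np

rng = np.random.default_rng(3)
nx, ny, nl = 6, 4, 2

p_x = np.full(nx, 1.0 / nx)                        # prior over properties x
p_y_given_x = rng.dirichlet(np.ones(ny), size=nx)  # forward model p(y | x)
p_l_given_x = rng.dirichlet(np.ones(nl), size=nx)  # labeling p(l | x)

# Bayes: p(x | y) = p(y | x) p(x) / p(y)
p_xy = p_y_given_x * p_x[:, None]
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y[None, :]

# Marginalize out the properties: p(l | y) = sum_x p(l | x) p(x | y)
p_l_given_y = p_l_given_x.T @ p_x_given_y
print(p_l_given_y.sum(axis=0))   # each column sums to 1
```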
Summary of perspectives on Data-to-Decisions
Data: properties x with $p(x)$ and entropy $H(X)$; observations y with $p(y)$ and entropy $H(Y)$. Forward model / likelihood: $f$, $p(y \mid x)$, with mutual information $I(X, Y)$.
Decisions: labels $\ell$ with $p(\ell)$, $H(L)$ (via $L$, $p(\ell \mid x)$, $I(X, L)$) and $\ell'$ with $p(\ell')$, $H(L')$ (via $L'$, $p(\ell' \mid y)$, $I(Y, L')$).
Label mixing / confusion matrix: $p(\ell' \mid \ell)$
$$I(L, L') = H(L) - H(L \mid L'), \qquad C = \max_{p(\ell \mid x)} I(L, L')$$
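For instance (with a hypothetical 2×2 confusion matrix), the label-level mutual information can be computed directly:

```python
# Illustrative sketch: mutual information between true and decided labels,
# I(L, L') = H(L) - H(L | L'), computed from a (made-up) joint label distribution.
import numpy as np

# joint p(l, l') -- rows: true label l, columns: decision l'
p_joint = np.array([[0.40, 0.10],
                    [0.05, 0.45]])

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

p_l = p_joint.sum(axis=1)                         # p(l)
p_lp = p_joint.sum(axis=0)                        # p(l')
H_L = entropy(p_l)
H_L_given_Lp = entropy(p_joint) - entropy(p_lp)   # H(L, L') - H(L')
print("I(L, L') =", H_L - H_L_given_Lp, "bits")
```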
Information-theoretic view of remote sensing
Using the “Information Bottleneck” approach. Property space: x; Observation space: y; QR indices: $\ell$.
Mappings: $f: X \to Y$, $y = f(x)$; $\quad L: X \to \mathcal{L}$, $\ell = L(x)$; $\quad L': Y \to \mathcal{L}'$, $\ell' = L'(y)$.
Classification Design (to support the experiment):
Given the possible observations of the property space, what are the reliable classification schemes?
Quantizing the property space, x, to preserve the observations, y:
$$p(\ell \mid x) = \frac{p(\ell)}{Z(x, \beta)} \exp\!\left( -\beta \sum_y p(y \mid x) \log \frac{p(y \mid x)}{p(y \mid \ell)} \right)$$
$$p(y \mid \ell) = \frac{1}{p(\ell)} \sum_x p(\ell \mid x)\, p(y \mid x)\, p(x), \qquad p(\ell) = \sum_x p(\ell \mid x)\, p(x)$$
The parameter $\beta$ represents the tradeoff between compression of x and the predictive accuracy of L.
Experimental Design (to support the classification):
Given a classification scheme of the property space, what are the reliable observations?
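A compact numerical sketch of these self-consistent updates (toy discrete distributions, arbitrary β; an illustration rather than a production Information Bottleneck solver):

```python
# Illustrative sketch of the Information Bottleneck self-consistent equations
# on a toy discrete p(x, y); the sizes and beta are arbitrary choices.
import numpy as np

rng = np.random.default_rng(4)
nx, ny, nl, beta = 8, 5, 3, 5.0

p_x = np.full(nx, 1.0 / nx)
p_y_given_x = rng.dirichlet(np.ones(ny), size=nx)        # p(y | x)

# Initialize a random soft quantization p(l | x).
p_l_given_x = rng.dirichlet(np.ones(nl), size=nx)

for _ in range(200):
    p_l = p_l_given_x.T @ p_x                            # p(l) = sum_x p(l|x) p(x)
    p_y_given_l = (p_l_given_x * p_x[:, None]).T @ p_y_given_x / p_l[:, None]
    # KL divergence D[ p(y|x) || p(y|l) ] for every (x, l) pair
    kl = np.array([[np.sum(p_y_given_x[x] * np.log(p_y_given_x[x] / p_y_given_l[l]))
                    for l in range(nl)] for x in range(nx)])
    logits = np.log(p_l)[None, :] - beta * kl
    p_l_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
    p_l_given_x /= p_l_given_x.sum(axis=1, keepdims=True)  # normalization Z(x, beta)

print(np.round(p_l_given_x, 3))   # converged soft classification of the property space
```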
Bivariate Gaussian setup:
$$z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_{xx}^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_{yy}^2 \end{pmatrix}, \qquad |\Sigma| = \sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4$$
With precision matrix $\Lambda = \Sigma^{-1}$, the PDF is
$$p(x, y) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2} q^2}, \qquad
q^2 = \frac{\sigma_{yy}^2 (x - \mu_x)^2 + \sigma_{xx}^2 (y - \mu_y)^2 - 2\,\sigma_{xy}^2 (x - \mu_x)(y - \mu_y)}{\sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4}$$
Completing the square in the Gaussian: with $B = \sigma_{xy}^2 / \sigma_{xx}^2$,
$$\big( (y - \mu_y) - B (x - \mu_x) \big)^2 = (y - \mu_y)^2 - 2 B (x - \mu_x)(y - \mu_y) + B^2 (x - \mu_x)^2$$
so that
$$q^2 = \frac{\sigma_{xx}^2 \big( (y - \mu_y) - B (x - \mu_x) \big)^2 + \big( \sigma_{yy}^2 - B^2 \sigma_{xx}^2 \big)(x - \mu_x)^2}{\sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4}
= \frac{\big( y - \mu_y(x) \big)^2}{\tilde{\sigma}_{yy}^2} + \frac{(x - \mu_x)^2}{\sigma_{xx}^2}$$
where $\mu_y(x) = \mu_y + \dfrac{\sigma_{xy}^2}{\sigma_{xx}^2}\,(x - \mu_x)$ and $\tilde{\sigma}_{yy}^2 = \sigma_{xx}^{-2}\big( \sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4 \big)$.
The joint density therefore factors into a conditional times a marginal:
$$p(x, y) = \big( 2\pi \tilde{\sigma}_{yy}^2 \big)^{-1/2} \exp\!\left[ -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} \right] \big( 2\pi \sigma_{xx}^2 \big)^{-1/2} \exp\!\left[ -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right]$$
Marginals:
$$p(x \mid \mu_x, \sigma_{xx}^2) = \big( 2\pi \sigma_{xx}^2 \big)^{-1/2} \exp\!\left[ -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right], \qquad
p(y \mid \mu_y, \sigma_{yy}^2) = \big( 2\pi \sigma_{yy}^2 \big)^{-1/2} \exp\!\left[ -\frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} \right]$$
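A quick numerical check of this conditioning result (an illustrative sketch with arbitrary numbers; the matrix entry written $\sigma_{xy}^2$ is the off-diagonal covariance):

```python
# Numerical check of the bivariate Gaussian conditioning formulas:
# mu_y(x) = mu_y + (s_xy^2/s_xx^2)(x - mu_x),  var(y|x) = s_yy^2 - s_xy^4/s_xx^2.
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.5]])            # [[s_xx^2, s_xy^2], [s_xy^2, s_yy^2]]

z = rng.multivariate_normal(mu, Sigma, size=2_000_000)
x, y = z[:, 0], z[:, 1]

x0 = 2.0                                  # condition on a thin slab around x = x0
sel = np.abs(x - x0) < 0.01
print("empirical  E[y|x0], Var[y|x0]:", y[sel].mean(), y[sel].var())

mu_y_x0 = mu[1] + Sigma[0, 1] / Sigma[0, 0] * (x0 - mu[0])
var_y_x = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
print("analytic   E[y|x0], Var[y|x0]:", mu_y_x0, var_y_x)
```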
Mutual information: Gaussians
Bivariate Gaussian PDF (factored) and marginals:
$$p(x, y) = \big( 2\pi \tilde{\sigma}_{yy}^2 \big)^{-1/2} \exp\!\left[ -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} \right] \big( 2\pi \sigma_{xx}^2 \big)^{-1/2} \exp\!\left[ -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right] = p(y \mid x)\, p(x)$$
$$p(x \mid \mu_x, \sigma_{xx}^2) = \big( 2\pi \sigma_{xx}^2 \big)^{-1/2} \exp\!\left[ -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right], \qquad
p(y \mid \mu_y, \sigma_{yy}^2) = \big( 2\pi \sigma_{yy}^2 \big)^{-1/2} \exp\!\left[ -\frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} \right]$$
Mean shift (linear regression): $\mu_y(x) = \mu_y + \dfrac{\sigma_{xy}^2}{\sigma_{xx}^2}\,(x - \mu_x)$
Variance reduction: $\tilde{\sigma}_{yy}^2 - \sigma_{yy}^2 = -\sigma_{xy}^4\, \sigma_{xx}^{-2} \le 0$ (Note: no x-dependence.)
Mutual information:
$$I(X, Y) = \int_{-\infty}^{\infty} dx\, p(x) \int_{-\infty}^{\infty} dy\, p(y \mid x) \ln \frac{p(y \mid x)}{p(y)}$$
$$\ln p(y \mid x) - \ln p(y) = -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} + \frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} + \tfrac{1}{2} \ln \big( \sigma_{yy}^2\, \tilde{\sigma}_{yy}^{-2} \big)$$
Useful decomposition:
$$(y - \mu_y)^2 = \big( y - \mu_y(x) \big)^2 + 2 \big( y - \mu_y(x) \big)\big( \mu_y(x) - \mu_y \big) + \big( \mu_y(x) - \mu_y \big)^2$$
so that
$$\big( 2\pi \tilde{\sigma}_{yy}^2 \big)^{-1/2} \int_{-\infty}^{\infty} dy\, (y - \mu_y)^2 \exp\!\left[ -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} \right] = \tilde{\sigma}_{yy}^2 + \big( \mu_y(x) - \mu_y \big)^2$$
Mutual information: Gaussians
Evaluating the inner integral:
$$\int_{-\infty}^{\infty} dy\, p(y \mid x) \ln \frac{p(y \mid x)}{p(y)}
= \big( 2\pi \tilde{\sigma}_{yy}^2 \big)^{-1/2} \int_{-\infty}^{\infty} dy\, \exp\!\left[ -\frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} \right]
\left[ \frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} - \frac{\big( y - \mu_y(x) \big)^2}{2 \tilde{\sigma}_{yy}^2} + \tfrac{1}{2} \ln \big( \sigma_{yy}^2 \tilde{\sigma}_{yy}^{-2} \big) \right]$$
$$= \tfrac{1}{2} \ln \big( \sigma_{yy}^2 \tilde{\sigma}_{yy}^{-2} \big) - \tfrac{1}{2} + \tfrac{1}{2}\, \sigma_{yy}^{-2} \left[ \tilde{\sigma}_{yy}^2 + \big( \mu_y(x) - \mu_y \big)^2 \right]$$
Averaging over $p(x)$, with $\big\langle \big( \mu_y(x) - \mu_y \big)^2 \big\rangle_x = \sigma_{xy}^4\, \sigma_{xx}^{-2}$:
$$I(X, Y) = \tfrac{1}{2} \ln \big( \sigma_{yy}^2 \tilde{\sigma}_{yy}^{-2} \big) - \tfrac{1}{2} + \tfrac{1}{2}\, \sigma_{yy}^{-2} \big( \tilde{\sigma}_{yy}^2 + \sigma_{xx}^{-2} \sigma_{xy}^4 \big)$$
Since $\sigma_{yy}^{-2} \big( \tilde{\sigma}_{yy}^2 + \sigma_{xx}^{-2} \sigma_{xy}^4 \big) - 1 = \sigma_{yy}^{-2} \big( \sigma_{yy}^2 - \sigma_{xx}^{-2} \sigma_{xy}^4 \big) + \sigma_{xx}^{-2} \sigma_{yy}^{-2} \sigma_{xy}^4 - 1 = 0$, the last two terms cancel and
$$I(X, Y) = \tfrac{1}{2} \ln \frac{\sigma_{yy}^2}{\tilde{\sigma}_{yy}^2} = \tfrac{1}{2} \ln \frac{\sigma_{xx}^2 \sigma_{yy}^2}{|\Sigma|}$$
Equivalently,
$$I(X, Y) = H(X) + H(Y) - H(X, Y) = \ln \frac{\sigma_{xx}\, \sigma_{yy}}{|\Sigma|^{1/2}} = \ln \frac{A_0}{A}$$
Proposed generalization (multivariate):
$$I(X, Y) = \ln \frac{|\Sigma_{XX}|^{1/2}\, |\Sigma_{YY}|^{1/2}}{|\Sigma|^{1/2}}$$
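A short numerical sanity check of this closed form (illustrative, arbitrary covariance values), comparing it with a Monte Carlo estimate of $E[\ln p(y \mid x) - \ln p(y)]$:

```python
# Illustrative check of the Gaussian mutual-information formula
# I(X, Y) = (1/2) ln( s_xx^2 s_yy^2 / |Sigma| ), in nats.
import numpy as np

rng = np.random.default_rng(6)
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.5]])
s_xx2, s_yy2, s_xy2 = Sigma[0, 0], Sigma[1, 1], Sigma[0, 1]

I_closed = 0.5 * np.log(s_xx2 * s_yy2 / np.linalg.det(Sigma))

z = rng.multivariate_normal([0.0, 0.0], Sigma, size=1_000_000)
x, y = z[:, 0], z[:, 1]
mu_y_x = (s_xy2 / s_xx2) * x                          # conditional mean mu_y(x)
var_y_x = s_yy2 - s_xy2 ** 2 / s_xx2                  # conditional variance
log_p_y_given_x = -0.5 * np.log(2 * np.pi * var_y_x) - (y - mu_y_x) ** 2 / (2 * var_y_x)
log_p_y = -0.5 * np.log(2 * np.pi * s_yy2) - y ** 2 / (2 * s_yy2)
I_mc = np.mean(log_p_y_given_x - log_p_y)

print("closed form:", I_closed, " Monte Carlo:", I_mc)   # the two should agree
```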
Class Schedules
Advanced Statistics for Physics
Statistical Mechanics of Complex Systems