
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Restricted Boltzmann Machines

Neural Networks and Deep Learning, Springer, 2018


Chapter 6
Restricted Boltzmann Machines

• Most neural architectures map inputs to outputs.

– Ideal for supervised models.

– Autoencoders can be used for unsupervised models by replicating the input at the output.

• Restricted Boltzmann machines are borrowed from probabilistic graphical models.

– Graph of probabilistic dependencies between binary states that are outcomes of distributions.

– Binary training data provides some examples of states.

– Ideal for unsupervised models.


Key Differences from Conventional Neural Networks

• No input-to-output mapping.

• States are discrete samples of probability distributions with interdependencies among samples.

• Training points provide examples of some (visible) states.

• Computational graph abstraction: The parameterized edges define dependencies among states.

– The computational graph abstraction is the main commonality (can be exploited for pre-training).

– Can approximately convert a sampling-based dependency into a real-valued operation to initialize a related (conventional) neural network.

Historical Significance

• Most of the practical applications of neural networks use supervised learning.

• RBMs can still be used for unsupervised pre-training of conventional neural networks and can also be extended to supervised learning.

– Replace binary state outcomes with fractional probabilities.

– Treat the fractional values as the activations of a conventional neural network.

– Pre-training owes its historical origins to RBMs.


Defining a Restricted Boltzmann Machine

[Figure: bipartite graph with hidden states h1, h2, h3 connected by undirected edges to visible states v1, v2, v3, v4.]

• Bipartite graph of binary hidden states and visible states connected by undirected edges signifying probabilistic dependencies ⇒ The bipartite structure is the origin of the word "restricted".

An Interpretable Boltzmann Machine

[Figure: the hidden states are ice-cream trucks (Ben's, Jerry's, Tom's), which the parents see. The visible states are ice creams (cones, sundae, popsicle, cup), which the child sees in the daily training data. Parents are likely to buy different items from different trucks, which is encoded in the weights; the child only sees the visible states and models the weights.]

• Undirected model ⇒ The probabilities of buying ice creams and of picking trucks depend on one another (through the weights).

What Kind of Model does a Restricted Boltzmann Machine Build?

• Probability distributions of the binary hidden and visible states depend on one another.

– Weights on edges control probabilistic dependencies.

– Training data assumed to be samples of visible states.

• We want to learn weights that are "consistent" with the training samples.

• Use energy function to force "consistency" ⇒ Unsupervised model.

• The model can use the learned weights to output samples that are consistent with the training data ⇒ Generative model.

Notations

• We assume that the binary hidden units are $h_1 \ldots h_m$ and the visible units are $v_1 \ldots v_d$.

• The bias associated with visible node $v_i$ is denoted by $b_i^{(v)}$.

• The bias associated with hidden node $h_j$ is denoted by $b_j^{(h)}$.

• The weight of the edge between visible node $v_i$ and hidden node $h_j$ is denoted by $w_{ij}$.

• Can be generalized to non-binary data with some work.


Probabilistic Relationships

• Want to learn the weights $w_{ij}$ so that samples of the training data are most "consistent" with the following relationships:

$$P(h_j = 1 \mid \overline{v}) = \frac{1}{1 + \exp\left(-b_j^{(h)} - \sum_{i=1}^{d} v_i w_{ij}\right)} \qquad (1)$$

$$P(v_i = 1 \mid \overline{h}) = \frac{1}{1 + \exp\left(-b_i^{(v)} - \sum_{j=1}^{m} h_j w_{ij}\right)} \qquad (2)$$

• Use an energy function to force consistency by minimizing the expected value of
$$E = -\sum_{i} b_i^{(v)} v_i - \sum_{j} b_j^{(h)} h_j - \sum_{i,j} w_{ij} v_i h_j$$
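The conditional distributions and the energy function translate directly into a few lines of NumPy. The following is a minimal sketch; the names W, b_v, b_h and the array shapes are illustrative assumptions, not code from the book.

```python
# A minimal NumPy sketch of Equations 1-2 and the energy function.
# Assumed shapes: W is (d, m), b_v is (d,), b_h is (m,).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(v, W, b_h):
    # Equation 1: vector of P(h_j = 1 | v) over all hidden units j.
    return sigmoid(b_h + v @ W)

def p_visible_given_hidden(h, W, b_v):
    # Equation 2: vector of P(v_i = 1 | h) over all visible units i.
    return sigmoid(b_v + W @ h)

def energy(v, h, W, b_v, b_h):
    # E(v, h) = -sum_i b_i^(v) v_i - sum_j b_j^(h) h_j - sum_{i,j} w_ij v_i h_j
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)
```
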
How Data is Generated from a Boltzmann Machine

• Data is generated by using Gibbs sampling.

• Randomly initialize the visible states and then sample the hidden states using Equation 1 (previous slide).

• Alternately sample hidden states and visible states using Equations 1 and 2 until thermal equilibrium is reached.

• A particular set of visible states at thermal equilibrium provides a sample of a binary training vector.

• The weights implicitly encode the distribution by defining probabilistic dependencies.
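As a toy sketch of this generative process, reusing the helper functions from the previous snippet (the burn-in length n_steps is an arbitrary illustrative choice, not a prescription):

```python
# Generate one approximate sample from a trained RBM by Gibbs sampling.
import numpy as np

def gibbs_sample(W, b_v, b_h, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    d, m = W.shape
    v = rng.integers(0, 2, size=d).astype(float)   # random visible initialization
    for _ in range(n_steps):                        # alternate Equations 1 and 2
        h = (rng.random(m) < p_hidden_given_visible(v, W, b_h)).astype(float)
        v = (rng.random(d) < p_visible_given_hidden(h, W, b_v)).astype(float)
    return v   # approximate sample of a binary visible vector at equilibrium
```
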
Intuition for Weights

• Consider the weights as affinities ⇒ Large positive values of $w_{ij}$ imply that the two states will be "on" together.

• We already have samples showing which visible states are "on" together.

• Weights will be learned in such a way that hidden states are connected to correlated visible states with large weights.

– Biological motivation: In Hebbian learning, a synapse between two neurons is strengthened when the neurons on either side of the synapse have highly correlated outputs.

– The contrastive divergence algorithm learns the weights.


Overview of Contrastive Divergence

• Positive phase: Draw b instances of the hidden states with the visible states fixed to each point in a mini-batch of b training points ⇒ Yields $\langle v_i, h_j \rangle_{pos}$

• Negative phase: For each of the b instances in the positive phase, continue to alternately sample visible states and hidden states from one another for r iterations ⇒ Yields $\langle v_i, h_j \rangle_{neg}$

$$w_{ij} \Leftarrow w_{ij} + \alpha \left( \langle v_i, h_j \rangle_{pos} - \langle v_i, h_j \rangle_{neg} \right)$$

$$b_i^{(v)} \Leftarrow b_i^{(v)} + \alpha \left( \langle v_i, 1 \rangle_{pos} - \langle v_i, 1 \rangle_{neg} \right)$$

$$b_j^{(h)} \Leftarrow b_j^{(h)} + \alpha \left( \langle 1, h_j \rangle_{pos} - \langle 1, h_j \rangle_{neg} \right)$$
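A sketch of one mini-batch update with r = 1 (CD-1) follows, reusing the sigmoid helper from the earlier snippet. Using the hidden probabilities rather than sampled hidden states in the update is a common practical variant and is an assumption here, not necessarily the book's exact recipe.

```python
# One CD-1 update on a mini-batch V of shape (b, d); W is (d, m).
import numpy as np

def cd1_update(V, W, b_v, b_h, alpha=0.01, seed=0):
    rng = np.random.default_rng(seed)
    b = V.shape[0]
    # Positive phase: visible states fixed to the training points.
    p_h_pos = sigmoid(b_h + V @ W)                              # (b, m)
    h_pos = (rng.random(p_h_pos.shape) < p_h_pos).astype(float)
    # Negative phase: one round of alternating sampling (r = 1).
    p_v_neg = sigmoid(b_v + h_pos @ W.T)                        # (b, d)
    v_neg = (rng.random(p_v_neg.shape) < p_v_neg).astype(float)
    p_h_neg = sigmoid(b_h + v_neg @ W)                          # (b, m)
    # Mini-batch averages of the <.,.>_pos and <.,.>_neg correlations.
    W += alpha * (V.T @ p_h_pos - v_neg.T @ p_h_neg) / b
    b_v += alpha * (V - v_neg).mean(axis=0)
    b_h += alpha * (p_h_pos - p_h_neg).mean(axis=0)
    return W, b_v, b_h
```
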
Remarks on Contrastive Divergence

• Strictly speaking, the negative phase needs a very large number of iterations to reach thermal equilibrium.

• The positive phase requires only one iteration because the visible states are fixed to the training points.

• Contrastive divergence says that only a small number of negative-phase iterations are sufficient for a "good" update of the weight vector (even without thermal equilibrium).

• In the early phases of training, one iteration is enough for a "good" update.

• Can increase the number of iterations in later phases.


Utility of Unsupervised Learning

• One can use an RBM to initialize an autoencoder for binary data (later slides).

• Treat the sigmoid-based sampling as a sigmoid activation.

• The basic idea can be extended to multilayer neural networks by using stacked RBMs.

– One of the earliest methods for pretraining.


Equivalence of Directed and Undirected Models

[Figure: the undirected RBM with weight matrix W (left) is equivalent to a directed model (right) that uses W from the visible to the hidden states and W^T from the hidden to the visible states.]

• Replace undirected edges with directed edges:

$$\overline{h} \sim \text{Sigmoid}(\overline{v}, \overline{b}^{(h)}, W)$$

$$\overline{v} \sim \text{Sigmoid}(\overline{h}, \overline{b}^{(v)}, W^T)$$

• Replace sampling with real-valued operations


Using a Trained RBM to Initialize a Conventional
Autoencoder

[Figure: the trained RBM (left) is converted into a conventional autoencoder (right). The visible states in a layer are fixed to the input data point, the hidden states (reduced features) are computed with W, and the reconstructed visible states are computed with W^T; discrete sampling is replaced with real-valued probabilities.]

• The architecture on the right uses real-valued sigmoid operations rather than discrete sampling operations ⇒ a conventional autoencoder!

$$\hat{h}_j = \frac{1}{1 + \exp\left(-b_j^{(h)} - \sum_{i=1}^{d} v_i w_{ij}\right)} \qquad (3)$$

$$\hat{v}_i = \frac{1}{1 + \exp\left(-b_i^{(v)} - \sum_{j=1}^{m} \hat{h}_j w_{ij}\right)} \qquad (4)$$
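Equations 3-4 correspond to the following deterministic forward pass, which can then be fine-tuned with backpropagation on the reconstruction error. This is a sketch reusing the names from the earlier snippets, not the book's code.

```python
# Use the trained RBM parameters as a deterministic autoencoder.
def autoencoder_forward(v, W, b_v, b_h):
    h_hat = sigmoid(b_h + v @ W)         # Equation 3: reduced features (encoder)
    v_hat = sigmoid(b_v + h_hat @ W.T)   # Equation 4: reconstruction (decoder)
    return h_hat, v_hat
```
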
Why Use an RBM to Initialize a Conventional Neural
Network?

• In the early years, conventional neural networks did not train well (especially with increased depth).

– Vanishing and exploding gradient problems.

– An RBM trains with contrastive divergence (no vanishing or exploding gradients).

• The real-valued approximation was used with stacked RBMs to initialize deep networks.

• The approach was later generalized to conventional autoencoders.


Stacked RBM

[Figure: RBM1, RBM2, and RBM3 stacked on top of one another with parameter matrices W1, W2, and W3. The hidden states of each RBM are copied in as the visible states of the next, yielding the stacked representation. The matrices W1, W2, and W3 are learned by successively training RBM1, RBM2, and RBM3 individually (pre-training phase).]

• Train different layers sequentially
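A sketch of this greedy layer-wise pre-training is shown below. It assumes a hypothetical helper train_rbm(data, n_hidden) that runs contrastive divergence (e.g., repeated cd1_update calls) and returns the learned parameters of one RBM; the name and signature are illustrative assumptions.

```python
# Greedy layer-wise pre-training of a stacked RBM (sketch).
def pretrain_stack(data, layer_sizes):
    params = []
    layer_input = data                       # (n, d) binary training matrix
    for n_hidden in layer_sizes:             # e.g., layer_sizes = [256, 128, 64]
        W, b_v, b_h = train_rbm(layer_input, n_hidden)   # hypothetical helper
        params.append((W, b_v, b_h))
        # The hidden probabilities of this RBM become the "visible"
        # data of the next RBM in the stack.
        layer_input = sigmoid(b_h + layer_input @ W)
    return params
```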


Stacked RBM to Conventional Neural Network

[Figure: the stacked RBM is unrolled into a deep autoencoder. The encoder fixes its input to the data point and applies W1, W2, and W3 to produce the code; the decoder applies W3^T, W2^T, and W1^T to reconstruct the input (target = input). Fine-tuning with backpropagation perturbs each pre-trained matrix by a learned increment: W1+E6, W2+E5, W3+E4 in the encoder and W3^T+E3, W2^T+E2, W1^T+E1 in the decoder.]
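As a sketch of the unrolling step, building on the hypothetical pretrain_stack output above (names are illustrative; the fine-tuning itself would be ordinary backpropagation on the reconstruction error):

```python
# Unroll the pre-trained stack (W1, W2, W3) into encoder/decoder layers.
def unroll(params):
    encoder = [(W, b_h) for (W, b_v, b_h) in params]               # W1, W2, W3
    decoder = [(W.T, b_v) for (W, b_v, b_h) in reversed(params)]   # W3^T, W2^T, W1^T
    return encoder, decoder

def reconstruct(v, encoder, decoder):
    # Forward pass of the unrolled autoencoder; fine-tuning then minimizes
    # the reconstruction error between the output and the input.
    x = v
    for W, b in encoder + decoder:
        x = sigmoid(b + x @ W)
    return x
```
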
Applications

• Pretraining can be used for supervised and unsupervised applications.

– Collaborative filtering: Was a component of the Netflix Prize contest.

∗ Gives different results from the autoencoder-like architecture in an earlier lecture.

– Topic models

– Classification

Collaborative Filtering

[Figure: two user-specific RBMs for collaborative filtering. Each visible unit is a one-hot (softmax) encoding of a rating, e.g., E.T. (rating = 2) and Nixon (rating = 5) for one user, and E.T. (rating = 4), Gandhi (rating = 4), Shrek (rating = 5), and Nero (rating = 3) for another. Each RBM contains visible units only for the movies rated by that user, while the hidden units h1, h2 and the weights are shared across users.]

• Changes: softmax activations for the visible units, and weights shared across the user-specific RBMs.
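A rough sketch of the resulting hidden-unit activation is given below. The weight tensor W of shape (n_movies, k, m), the dict-based rating encoding, and the function name are illustrative assumptions about one way to organize the shared parameters, not the slide's or the book's code.

```python
# Hidden-unit probabilities for one user in the collaborative-filtering RBM.
def user_hidden_probs(ratings, W, b_h, k=5):
    # ratings: {movie_index: rating in 1..k} for a single user; only the
    # movies this user has rated contribute to the activation.
    act = b_h.copy()
    for movie, r in ratings.items():
        act += W[movie, r - 1, :]   # weights are shared across all users
    return sigmoid(act)
```
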
Topic Models

[Figure: an RBM for topic modeling with binary hidden states h1 ... h4 and multinomial (softmax) visible states. The visible units share the same set of parameters, but the hidden units do not. The lexicon size d is typically larger than the document size, and the number of softmax visible units equals the document size for each RBM.]

Classification

• Can be used for unsupervised pretraining for classification.

– The goal of the RBM is only to learn features in an unsupervised way.

– The class label does not get a state in the RBM.

• Can also be used for training by treating the class label as a state.

– Hidden features are connected to both the class variables and the feature variables.

– The generative approach of RBMs does not fully optimize for classification accuracy ⇒ Need discriminative Boltzmann machines (Larochelle et al.).

Classification Architecture

[Figure: binary hidden states connected through weight matrix W to binary visible states (features) and through weight matrix U to multinomial visible states (classes).]
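As a sketch of how the hidden states see both groups of visible units (names and shapes are illustrative assumptions; the feature weights W have shape (d, m) and the class weights U have shape (k, m)):

```python
# Hidden-state probabilities when a one-hot class label is a visible state.
def hidden_probs_with_label(v, y_onehot, W, U, b_h):
    # v: binary feature vector of shape (d,); y_onehot: one-hot class (k,).
    return sigmoid(b_h + v @ W + y_onehot @ U)
```
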
Comments

• RBMs represent a special case of probabilistic graphical models.

• They provide an alternative to the autoencoder.

• They can be extended to non-binary data.

• These models are not quite as popular anymore.

• Historically significant for starting the idea of pre-training for deep models.
