
Natural Language Processing
Lecture 14: Machine Learning: Feed-forward Neural Networks, Autoencoders/embeddings, Dense networks

12/7/2019

COMS W4705
Yassine Benajiba
Perceptron Expressiveness
• The simple perceptron learning algorithm starts with an arbitrary hyperplane and adjusts it using the training data.

• The step function is not differentiable, so there is no closed-form solution.

• The perceptron produces a linear separator.

• It can only learn linearly separable patterns.

• It can represent boolean functions like and, or, and not, but not the xor function.
The problem with xor
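The xor function is not linearly separable, but it becomes representable once a hidden layer is allowed. A minimal sketch (not from the slides), with hand-set weights, showing that xor can be written as a composition of the linearly separable or and and functions:

# A minimal sketch: xor as a composition of linearly separable functions,
# using hand-set perceptron weights (values chosen for illustration).
import numpy as np

def step(z):
    return (z > 0).astype(int)

def perceptron(x, w, b):
    return step(x @ w + b)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Single units can represent OR and AND (both linearly separable) ...
h_or  = perceptron(X, np.array([1, 1]), -0.5)   # x1 OR x2
h_and = perceptron(X, np.array([1, 1]), -1.5)   # x1 AND x2

# ... and a second layer combines them into XOR = OR AND (NOT AND).
H = np.stack([h_or, h_and], axis=1)
xor = perceptron(H, np.array([1, -1]), -0.5)

print(xor)  # [0 1 1 0] -- xor needs the hidden layer; no single unit suffices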
Multi-Layer Neural Networks

[Figure: network with an input layer, a hidden layer, and an output layer]

• Basic idea: represent any (non-linear) function as a composition of soft-threshold functions. This is a form of non-linear regression.

• Lippmann 1987: Two hidden layers suffice to represent any arbitrary region (provided enough neurons), even discontinuous functions!
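As a concrete illustration (a minimal sketch, not from the slides; the layer sizes are arbitrary), the forward pass of such a network is just matrix multiplications interleaved with a soft-threshold nonlinearity:

# A minimal feed-forward pass (layer sizes chosen arbitrarily).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # input layer (4) -> hidden layer (3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # hidden layer (3) -> output layer (2)

x = np.array([1.0, 0.0, 2.0, -1.0])             # input vector
h = sigmoid(x @ W1 + b1)                        # hidden activations (soft threshold)
y = sigmoid(h @ W2 + b2)                        # network output h_w(x)
print(y)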
Activation Functions
• One problem with perceptrons is that the threshold function (step function) is not differentiable.

• It is therefore unsuitable for gradient descent.

• One alternative is the sigmoid (logistic) function g(z) = 1 / (1 + e^(-z)):

g(z) → 0 as z → -∞
g(z) → 1 as z → ∞
Activation Functions
• Two other popular activation functions:
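The slide's plots are not reproduced in this extract; the alternatives most commonly shown in this context are tanh and the rectified linear unit (ReLU). A minimal sketch of these activation functions and their derivatives (the derivatives are used later in backpropagation):

# Common differentiable activation functions and their derivatives.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)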
Output Representation
• Many NLP problems are multi-class classification problems.

• Each output neuron represents one class. Predict the class with the highest activation.

Example output activations: y0 = 0.9, y1 = 0.1, y2 = 0.7, y3 = 0.4 (predict class 0).
Softmax
• We often want the activations at the output layer to represent probabilities.

• Exponentiate the activation of each output unit and normalize by the sum over all outputs (as in log-linear models): softmax(z)_i = e^(z_i) / Σ_j e^(z_j).

• Example: raw activations z0 = 0.9, z1 = 0.1, z2 = 0.7, z3 = 0.4 become 0.35, 0.16, 0.28, 0.21 after the softmax. The network computes a probability distribution over the classes.
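A minimal sketch reproducing the numbers above (a numerically stable softmax, assuming the usual max-subtraction trick):

# Numerically stable softmax over a vector of output activations.
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([0.9, 0.1, 0.7, 0.4])
print(softmax(z).round(2))     # [0.35 0.16 0.28 0.21]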
Learning in Multi-Layer Neural Networks
• The network structure is fixed, but we want to train the weights. Assume feed-forward neural networks: no connections that form loops.

• Backpropagation Algorithm:

• Given the current weights, compute the network output and the loss function (assume multiple outputs / a vector of outputs).

• Use gradient descent to update the weights and minimize the loss.

• Problem: We only know how to do this for the last layer!

• Idea: Propagate the error backwards through the network.


Backpropagation
[Figure: feed-forward computation of the network outputs. The input vector x = (x1, ..., x4) passes through the input layer, hidden layer, and output layer to produce the output vector h_w(x) = (a1, a2), which is compared against the target vector y by the error function E_train(w). The error gradients are then propagated backwards through the network.]
Negative Log-Likelihood
(also known as cross-entropy)

• Assume the target output is a one-hot vector and c(y) is the target class for target y.

• Compute the negative log-likelihood for a single example.

• Empirical error for the entire training data:
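The formulas themselves are not reproduced in this extract; a sketch of the standard definitions they correspond to, using the softmax output P(· | x) from the previous slides:

L(x, y) = -log P(c(y) | x)

E_train(w) = 1/N Σ_{i=1..N} -log P(c(y_i) | x_i)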


Stochastic Gradient Descent
(for a single unit)
• Goal: Learn parameters that minimize the empirical error.

Randomly initialize w
for a set number of iterations T:
    shuffle training data
    for j = 1...N:
        for each wi (all weights in the network):
            wi ← wi - α ∂L(x_j, y_j)/∂wi

• α is the learning rate.

• It often makes sense to compute the gradient over batches of examples instead of just one ("mini-batch").
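A minimal sketch of this loop for a single sigmoid unit with cross-entropy loss (not from the slides; the toy data, learning rate, and batch size are placeholders), including the mini-batch variant:

# Mini-batch SGD for a single sigmoid unit with cross-entropy loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy data: 100 examples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = rng.normal(size=3)                           # randomly initialize w
alpha, T, batch_size = 0.1, 50, 10               # learning rate, iterations, mini-batch size

for t in range(T):
    order = rng.permutation(len(X))              # shuffle training data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        y_hat = sigmoid(xb @ w)
        grad = xb.T @ (y_hat - yb) / len(idx)    # gradient of cross-entropy w.r.t. w
        w = w - alpha * grad                     # SGD update
print(w)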
Backpropagation
• Simplified multi-layer case (a single unit per layer):

[Figure: x → g (weight w1) → g(x) → f (weight w2) → f(g(x)) → Loss]

• Stochastic Gradient Descent should perform the following update: wi ← wi - α ∂Loss/∂wi for i = 1, 2.

• Problem: How do we compute the gradient for parameters w1 and w2?
Chain Rule of Calculus

• To compute gradients for hidden units, we need to apply the chain rule of calculus:

The derivative of f(g(x)) with respect to x is f'(g(x)) · g'(x).
Backpropagation

[Figure: x → f (weight w1) → f(x) → g (weight w2) → g(f(x)) → Loss]
Backpropagation

[Figure: in the forward pass, x flows through a unit f (with weight w) to produce f(x), which feeds into the rest of the network and eventually the Loss. In the backward pass, the error gradient flows back through f.]

Assume we know ∂Loss/∂f(x).

We want to compute ∂Loss/∂x to propagate it back,

and ∂Loss/∂w (for the weight update).
Backpropagation

[Figure: the same forward/backward diagram as above.]

To compute these gradients, we have to know the derivative of the function f.
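A minimal sketch of these updates for the two-layer, single-unit case above (not from the slides; sigmoid activations and a squared-error loss are assumed for concreteness):

# Manual backpropagation through two single-unit layers
# (sigmoid activations and squared-error loss assumed).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0                      # single input and target
w1, w2, alpha = 0.3, -0.8, 0.1       # weights and learning rate

# Forward pass.
a1 = sigmoid(w1 * x)                 # first unit: g(x)
a2 = sigmoid(w2 * a1)                # second unit: f(g(x))
loss = 0.5 * (a2 - y) ** 2

# Backward pass: apply the chain rule layer by layer.
d_a2 = a2 - y                        # dLoss/da2
d_z2 = d_a2 * a2 * (1 - a2)          # dLoss/dz2 (sigmoid derivative)
d_w2 = d_z2 * a1                     # dLoss/dw2 (for the weight update)
d_a1 = d_z2 * w2                     # dLoss/da1 (propagated back)
d_z1 = d_a1 * a1 * (1 - a1)
d_w1 = d_z1 * x                      # dLoss/dw1

# SGD updates.
w1 -= alpha * d_w1
w2 -= alpha * d_w2
print(loss, w1, w2)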
Autoencoders
Embeddings
(Word-level semantics)
Skip-Gram Model
• Input: a single word in one-hot representation.

• Output: the probability of seeing any single word as a context word.

[Figure: the one-hot input for "eat" (|V| input neurons) feeds into d hidden neurons, which connect to |V| output neurons with softmax activation; the outputs give context-word probabilities, e.g. a: 0.02, thought: 0.0, cheese: 0.04, place: 0.03, run: 0.0.]

• The softmax function normalizes the activations of the output neurons to sum up to 1.0.
Skip-Gram Model
• Compute the error with respect to each context word.

[Figure: for the sentence "...a place to eat delicious cheese.", the target word wt = "eat" is paired with its context words wt-c, ..., wt-1, wt+1, ..., wt+c, yielding the training pairs (eat, place), (eat, to), (eat, delicious), (eat, cheese).]

• Combine the errors for each context word, then use the combined error to update the weights using back-propagation.
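A minimal sketch of generating such (target, context) training pairs from a tokenized sentence (not from the slides; the window size c is a placeholder):

# Generate skip-gram (target, context) training pairs from a tokenized sentence.
def skipgram_pairs(tokens, c=2):
    """Pair each target word with every word within a window of size c."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - c), min(len(tokens), i + c + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "a place to eat delicious cheese".split()
print(skipgram_pairs(sentence, c=2))
# includes ('eat', 'place'), ('eat', 'to'), ('eat', 'delicious'), ('eat', 'cheese'), ...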
Continuous Bag-of-Words Model (CBOW)

[Figure: the context words wt-c, ..., wt-1, wt+1, ..., wt+c are summed and averaged in the hidden layer to predict the target word wt.]

• Input: context words, averaged in the hidden layer (sketched below).

• Output: the probability that each word is the target word.
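A minimal sketch of the CBOW forward pass (not from the slides; the embedding matrices and vocabulary size are placeholders), showing the averaging of context-word vectors followed by a softmax over the vocabulary:

# CBOW forward pass: average the context-word embeddings, then softmax over |V|.
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                                   # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, d))                # input (context) embeddings
W_out = rng.normal(size=(d, V))               # output (target) weights

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cbow_forward(context_ids):
    h = W_in[context_ids].mean(axis=0)        # average context embeddings in the hidden layer
    return softmax(h @ W_out)                 # probability of each word being the target

context = [0, 2, 4, 5]                        # indices of wt-c, ..., wt+c
print(cbow_forward(context))                  # distribution over the V candidate target words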


Embeddings are Magic
(Mikolov 2016)

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)


Application: Word Pair Relationships
Using Word Embeddings
• Word2Vec:

• https://code.google.com/archive/p/word2vec/

• GloVe: Global Vectors for Word Representation

• https://nlp.stanford.edu/projects/glove/

• Can either use pre-trained word embeddings or train them on a large corpus.
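A minimal sketch of loading pre-trained vectors and reproducing the king/queen analogy, assuming the gensim library (not mentioned on the slides) and one of the pre-trained GloVe models from its downloader:

# A sketch using gensim with pre-trained GloVe vectors.
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")      # downloads pre-trained GloVe word vectors

# vector('king') - vector('man') + vector('woman') is closest to vector('queen')
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Cosine similarity as a simple measure of word relatedness.
print(kv.similarity("cheese", "food"))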
Word embeddings

[Figure: the same skip-gram network as above, with a |V|-dimensional one-hot input (e.g. "eat"), d hidden neurons, and a |V|-dimensional softmax output layer.]
Word embeddings
Pros
- Groups semantically similar words together
- A simple way to measure similarity
- A great approach to better deal with words unseen in training

Cons
- Doesn't distinguish between function and content words
- Only one representation for polysemous words
- Non-interpretable semantic dimensions

How can we build a sentence representation using word-level distributional representations?
Acknowledgments
• Some slides by Chris Kedzie
