Lecture 14 - ML (FF, Autoenc, Dense Networks)
Natural Language Processing
Lecture 14:
Machine Learning: Feed-forward Neural Networks,
Autoencoders/embeddings, Dense networks
12/7/2019
COMS W4705
Yassine Benajiba
Perceptron Expressiveness
• The simple perceptron learning algorithm starts with an arbitrary hyperplane and adjusts it using the training data.
• It can represent Boolean functions like and, or, and not, but not the xor function.
The problem with xor
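A minimal sketch (not from the slides; the epoch budget and learning rate are assumptions) of why this matters: the perceptron learning rule finds a separating hyperplane for and, but never does for xor, because no single hyperplane separates xor's positive and negative points.

import itertools

def train_perceptron(examples, epochs=50, lr=1.0):
    # perceptron learning rule: start from an arbitrary hyperplane (here w = 0, b = 0)
    # and adjust it on every misclassified training example
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in examples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if pred != y:
                w[0] += lr * (y - pred) * x1
                w[1] += lr * (y - pred) * x2
                b += lr * (y - pred)
                mistakes += 1
        if mistakes == 0:          # the data is perfectly separated
            return True
    return False                   # never converged within the epoch budget

inputs = list(itertools.product([0, 1], repeat=2))
and_data = [((x1, x2), int(x1 and x2)) for x1, x2 in inputs]
xor_data = [((x1, x2), int(x1 != x2)) for x1, x2 in inputs]
print("and learnable by a perceptron:", train_perceptron(and_data))   # True
print("xor learnable by a perceptron:", train_perceptron(xor_data))   # False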
Multi-Layer Neural Networks
g(z) → 0 as z → −∞
g(z) → 1 as z → +∞
Activation Functions
• Two other popular activation functions:
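The plots for these functions did not survive extraction, so which two are meant is an assumption here; tanh and ReLU are the usual candidates shown alongside the sigmoid. A minimal sketch:

import numpy as np

def sigmoid(z):
    # logistic function: squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # hyperbolic tangent: squashes z into (-1, 1)
    return np.tanh(z)

def relu(z):
    # rectified linear unit: 0 for negative z, identity for positive z
    return np.maximum(0.0, z)

z = np.linspace(-4, 4, 9)
print(sigmoid(z).round(2))
print(tanh(z).round(2))
print(relu(z).round(2))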
Output Representation
• Many NLP Problems are multi-class classification problems.
[Figure: an output layer with four neurons y0, y1, y2, y3 (one per class) and example activations 0.9, 0.1, 0.7, 0.4 — note the raw activations do not sum to 1]
Softmax
• We often want the activation at the output layer to represent probabilities.
[Figure: raw output activations z0 = 0.9, z1 = 0.1, z2 = 0.7, z3 = 0.4; after applying softmax they become 0.35, 0.16, 0.28, 0.21, which sum to 1.0]
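A minimal sketch of the softmax computation, reproducing the numbers above:

import numpy as np

def softmax(z):
    # subtract the max for numerical stability, then exponentiate and normalize
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.9, 0.1, 0.7, 0.4])   # raw output activations
p = softmax(z)
print(p.round(2))    # [0.35 0.16 0.28 0.21]
print(p.sum())       # 1.0 (up to floating-point rounding)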
Learning in Multi-Layer
Neural Networks
• The network structure is fixed, but we want to train the weights. Assume feed-forward neural networks: the connections form no loops.
• Backpropagation Algorithm:
[Figure: a feed-forward network mapping an input vector x = (x1, x2, x3, x4) through hidden units to an output vector hw(x) with components hw(x)1 = a1 and hw(x)2 = a2, which is compared against a target vector y by the error function Etrain(w)]
Randomly initialize w
for a set number of iterations T:
    shuffle the training data
    for j = 1...N:
        for each wi (all weights in the network):
            wi ← wi − α · ∂Ej/∂wi (α is the learning rate)
• The derivative ∂Ej/∂wi of the error on training example j is computed with the backpropagation algorithm; see the sketch after the Backpropagation slide.
Backpropagation
• Assume we already know the error terms of the layer above; to compute these terms for the current layer, we have to know the derivative of the activation function f.
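A minimal sketch tying the two previous slides together: a small feed-forward network (assumed 2-4-1 architecture, sigmoid activations, squared error, learning rate 0.5) trained with the SGD loop above, with the backpropagation step written out; xor from the earlier slides serves as toy training data.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# xor training data: inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# randomly initialize the weights (2 inputs -> 4 hidden -> 1 output)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
alpha = 0.5                                   # learning rate

for _ in range(5000):                         # "a set number of iterations T"
    order = rng.permutation(len(X))           # shuffle the training data
    for j in order:                           # for j = 1...N
        x, y = X[j], Y[j]
        # forward pass
        a1 = sigmoid(x @ W1 + b1)             # hidden activations
        a2 = sigmoid(a1 @ W2 + b2)            # output activation h_w(x)
        # backward pass: propagate the error using the derivative of the sigmoid
        delta2 = (a2 - y) * a2 * (1 - a2)         # output-layer error term
        delta1 = (delta2 @ W2.T) * a1 * (1 - a1)  # hidden-layer error term
        # gradient-descent update of every weight wi
        W2 -= alpha * np.outer(a1, delta2); b2 -= alpha * delta2
        W1 -= alpha * np.outer(x, delta1);  b1 -= alpha * delta1

# should be close to [0, 1, 1, 0] (results depend on the random initialization)
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))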
Autoencoders
Embeddings
(Word level semantics)
Skip-Gram Model
• Input: a single word in one-hot representation.
[Figure: skip-gram architecture — a one-hot input over |V| neurons (1 at the word "eat"), a hidden layer of d neurons, and an output layer of |V| neurons with softmax activation, giving e.g. 0.02 for "a", 0.0 for "thought", 0.04 for "cheese", 0.03 for "place", 0.0 for "run"]
• The softmax function normalizes the activations of the output neurons to sum to 1.0.
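A minimal sketch of the forward pass in this figure, with a toy vocabulary and assumed sizes (|V| = 6, d = 3); the weights are randomly initialized and untrained, so the output probabilities are arbitrary:

import numpy as np

rng = np.random.default_rng(1)

vocab = ["a", "thought", "cheese", "eat", "place", "run"]   # toy vocabulary
V, d = len(vocab), 3                                        # |V| and hidden size

W_in = rng.normal(0, 0.1, (V, d))    # input-to-hidden weights (the word embeddings)
W_out = rng.normal(0, 0.1, (d, V))   # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# one-hot input for the center word "eat"
x = np.zeros(V)
x[vocab.index("eat")] = 1.0

h = x @ W_in            # hidden layer: simply selects the row of W_in for "eat"
p = softmax(h @ W_out)  # probability of each vocabulary word appearing in the context

for word, prob in zip(vocab, p.round(3)):
    print(word, prob)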
Skip-Gram Model
• Compute error with respect to each context word.
[Figure: for the sentence "...a place to eat delicious cheese.", the center word wt = "eat" and its context words wt−c, ..., wt−1, wt+1, ..., wt+c ("place", "to", "delicious", "cheese") produce the training pairs (eat, place), (eat, to), (eat, delicious), (eat, cheese)]
• Combine the errors for all context words, then use the combined error to update the weights with backpropagation.
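A minimal sketch of generating the (center, context) training pairs shown in the figure, assuming a window size of c = 2:

def skipgram_pairs(tokens, c=2):
    # for each position t, pair the center word with every word
    # within c positions to its left and right
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-c, c + 1):
            if offset == 0 or not (0 <= t + offset < len(tokens)):
                continue
            pairs.append((center, tokens[t + offset]))
    return pairs

sentence = "a place to eat delicious cheese .".split()
print([p for p in skipgram_pairs(sentence) if p[0] == "eat"])
# [('eat', 'place'), ('eat', 'to'), ('eat', 'delicious'), ('eat', 'cheese')]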
Continuous Bag-of-Words
Model (CBOW)
[Figure: CBOW architecture — the context words wt−c, ..., wt−1, wt+1, ..., wt+c are projected to the hidden layer and summed (SUM), and the network predicts the center word wt]
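A minimal sketch of the CBOW forward pass in this figure, again with a toy vocabulary and untrained, randomly initialized weights (so the predicted center word is arbitrary):

import numpy as np

rng = np.random.default_rng(2)

vocab = ["a", "place", "to", "eat", "delicious", "cheese"]   # toy vocabulary
V, d = len(vocab), 3

W_in = rng.normal(0, 0.1, (V, d))    # embeddings for the context words
W_out = rng.normal(0, 0.1, (d, V))   # weights for predicting the center word

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# sum the embeddings of the context words (the SUM node in the figure)
context = ["place", "to", "delicious", "cheese"]
h = np.sum([W_in[vocab.index(w)] for w in context], axis=0)

p = softmax(h @ W_out)               # distribution over possible center words
print(vocab[p.argmax()], p.round(3))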
• https://code.google.com/archive/p/word2vec/
• https://nlp.stanford.edu/projects/glove/
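A hedged sketch of using pretrained vectors from the links above with the gensim library; the file name is the standard Google News word2vec release and the calls are standard KeyedVectors methods, but adapt the path to whatever you download (GloVe files need a small format conversion first, since they lack the word2vec header line).

from gensim.models import KeyedVectors

# load pretrained word2vec vectors (path assumes the Google News binary
# from the first link has been downloaded locally)
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.most_similar("cheese", topn=5))   # nearest neighbors in embedding space
print(vectors.similarity("cheese", "bread"))    # cosine similarity of two words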
Word embeddings
Pros
- Groups semantically similar words together
- A simple way to measure similarity
- A great approach to better deal with words unseen in the training data

Cons
- Doesn't distinguish between function words and content words
- Only one representation for polysemous words
- Non-interpretable semantic dimensions

How can we build a sentence representation using word-level distributional representations?
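One simple baseline for the question above (a sketch only, not necessarily the approach the lecture develops next): average the word vectors in each sentence and compare sentences with cosine similarity. Toy random vectors stand in for real embeddings here.

import numpy as np

rng = np.random.default_rng(3)

# toy stand-in embeddings; in practice use word2vec or GloVe vectors
vocab = ["i", "eat", "delicious", "cheese", "the", "cat", "sleeps"]
emb = {w: rng.normal(0, 1, 50) for w in vocab}

def sentence_vector(tokens):
    # average the word vectors of the tokens we have embeddings for
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = sentence_vector("i eat delicious cheese".split())
s2 = sentence_vector("the cat sleeps".split())
print(cosine(s1, s2))   # similarity of the two averaged representations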
Acknowledgments
• Some slides by Chris Kedzie