Deep Learning
optimizers, cost functions and training
regularization methods
Backpropagation
tuning hyper-parameters
GANs & Adversarial training
classification vs. regression tasks
Bayesian Deep Learning
DNN basic architectures: convolutional, recurrent, attention mechanism
Generative models
Unsupervised / Pretraining
Application example: Relation Extraction
Machine Learning
[Diagram: in the training phase, labeled data and a learning algorithm produce a learned model; in the prediction phase, the learned model maps new data to a prediction (e.g., class A)]
Example tasks: classification, anomaly detection, sequence labeling
http://mbjoseph.github.io/2013/11/27/measure.html
ML vs. Deep Learning
Most machine learning methods work well because of human-designed
representations and input features;
ML then becomes just optimizing weights to best make a final prediction
What is Deep Learning (DL) ?
A machine learning subfield of learning representations of data.
Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of)
representation by using a hierarchy of multiple layers
If you provide the system tons of information, it begins to understand it
and respond in useful ways.
https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
Why is DL useful?
o Manually designed features are often over-specified, incomplete and
take a long time to design and validate
o Learned features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable
framework for representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data
h = σ(W1 x + b1)
y = σ(W2 h + b2)
Activation functions
How do we train?
[Diagram: a 2-layer network with input x (3 units), hidden layer h (4 units) and output y (2 units)]
4 + 2 = 6 neurons (not counting inputs)
[3 x 4] + [4 x 2] = 20 weights
4 + 2 = 6 biases
26 learnable parameters
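A minimal NumPy sketch of the 2-layer network above (the random initialization is an assumption), confirming the 26-parameter count:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 12 weights + 4 biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer: 8 weights + 2 biases

x = rng.normal(size=3)          # 3 input features
h = sigmoid(W1 @ x + b1)        # h = sigma(W1 x + b1)
y = sigmoid(W2 @ h + b2)        # y = sigma(W2 h + b2)

n_params = W1.size + b1.size + W2.size + b2.size
print(y, n_params)              # 26 learnable parameters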
Demo
Training
Sample labeled data (batch) → forward it through the network, get predictions → back-propagate the errors → update the network weights
θ_j^new = θ_j^old − α dJ(θ)/dθ_j      Update each element of θ (α is the learning rate)
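A minimal NumPy sketch of this update rule on a hypothetical quadratic cost (the cost and its gradient are made up for illustration):

import numpy as np

def J(theta):                      # hypothetical cost: J(theta) = ||theta - 3||^2
    return np.sum((theta - 3.0) ** 2)

def grad_J(theta):                 # its gradient: dJ/dtheta_j = 2 (theta_j - 3)
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)
alpha = 0.1                        # learning rate
for step in range(100):
    theta = theta - alpha * grad_J(theta)   # update each element of theta
print(theta, J(theta))             # theta approaches [3, 3], J approaches 0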
[Diagram: a single neuron with weights W and bias b computes σ(x_i; W, b) from input x_i]
Activation functions
Non-linearities are needed to learn complex (non-linear) representations of
the data; otherwise the NN would be just a linear function: W1 W2 x = W x
http://cs231n.github.io/assets/nn1/layer_sizes.jpeg
http://adilmoujahid.com/images/activation.png
- Sigmoid neurons saturate and kill gradients, thus the NN will barely learn
• when the neuron's activations are 0 or 1 (saturate)
• the gradient at these regions is almost zero
• almost no signal will flow to its weights
• if the initial weights are too large then most neurons will saturate
Activation: Tanh
Takes a real-valued number and
“squashes” it into range between
-1 and 1.
R^n → [−1, 1]
http://adilmoujahid.com/images/activation.png
Activation: ReLU
Takes a real-valued number and thresholds it at zero: f(x) = max(0, x)
R^n → R_+^n
http://adilmoujahid.com/images/activation.png
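A minimal NumPy sketch of the three activations discussed (sigmoid, tanh, ReLU) and their output ranges:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes into (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                  # squashes into (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)          # thresholds at zero: R^n -> R_+^n

x = np.linspace(-5, 5, 11)
print(sigmoid(x))   # values near 0 or 1 at the ends: gradient there is almost zero
print(tanh(x))
print(relu(x))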
http://wiki.bethanycrane.com/overfitting-of-data
https://www.neuraldesigner.com/images/learning/selection_error.svg
Regularization
Dropout
• Randomly drop units (along with their
connections) during training
• Each unit retained with fixed probability
p, independent of other units
• Hyper-parameter p to be chosen (tuned)
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural
networks from overfitting." Journal of machine learning research (2014)
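A minimal NumPy sketch of dropout at training time ("inverted dropout"; the rescaling by the keep probability is a common implementation detail, not stated on the slide):

import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.5, training=True):
    # Randomly drop units: each unit is retained with probability p_keep,
    # independently of the other units.
    if not training:
        return h                                   # no dropout at test time
    mask = rng.random(h.shape) < p_keep            # independent Bernoulli mask per unit
    return h * mask / p_keep                       # rescale so the expected activation is unchanged

h = rng.normal(size=(4, 8))       # a batch of hidden activations
print(dropout(h, p_keep=0.8))     # about 20% of units are zeroed out each forward pass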
L2 = weight decay
• Regularization term that penalizes big weights, added to the objective: J_reg(θ) = J(θ) + λ Σ_k θ_k^2
• The weight decay value λ determines how dominant the regularization is during gradient computation
• Big weight decay coefficient → big penalty for big weights
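A minimal NumPy sketch of adding the L2 term λ Σ_k θ_k^2 to a cost and its gradient (the unregularized cost is a made-up example):

import numpy as np

lam = 1e-3                                    # weight decay coefficient (lambda)

def J(theta):                                 # hypothetical unregularized cost
    return np.sum((theta - 3.0) ** 2)

def J_reg(theta):
    return J(theta) + lam * np.sum(theta ** 2)        # J_reg = J + lambda * sum_k theta_k^2

def grad_J_reg(theta):
    return 2.0 * (theta - 3.0) + 2.0 * lam * theta    # gradient gains a 2*lambda*theta term

theta = np.ones(2)
print(J_reg(theta), grad_J_reg(theta))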
Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs
• n is called patience
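A minimal sketch of early stopping with patience n, using a made-up validation-error curve in place of a real training loop:

val_errors = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.55, 0.58, 0.59]   # simulated per-epoch validation error

patience = 3                       # n: stop after n epochs without improvement
best_val, best_epoch = float("inf"), 0

for epoch, val in enumerate(val_errors):
    # (in practice: train one epoch here, then compute the validation error)
    if val < best_val:
        best_val, best_epoch = val, epoch      # improvement: remember the best epoch so far
    elif epoch - best_epoch >= patience:
        print(f"early stop at epoch {epoch}; best epoch was {best_epoch}")
        break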
Tuning hyper-parameters
"Grid and random search of nine trials for optimizing a function f(x, y) = g(x) + h(y) ≈ g(x).
With grid search, nine trials only test g(x) in three distinct places.
With random search, all nine trials explore distinct values of g."
Bergstra & Bengio, "Random search for hyper-parameter optimization" (2012)
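A minimal sketch contrasting the two strategies on a toy objective of the kind in the quote, where only x really matters (the objective itself is made up):

import random

def f(x, y):
    return -(x - 0.3) ** 2 + 0.001 * y      # y has almost no effect (low effective dimensionality)

# grid search: 9 trials, but only 3 distinct values of x are ever tested
grid = [(x, y) for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)]
best_grid = max(grid, key=lambda p: f(*p))

# random search: 9 trials, all with distinct values of x
random.seed(0)
rand = [(random.random(), random.random()) for _ in range(9)]
best_rand = max(rand, key=lambda p: f(*p))

print(best_grid, f(*best_grid))
print(best_rand, f(*best_rand))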
Convolutional
Input matrix 3x3 filter
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Convolutional Neural
Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
max pool: 2x2 filters and stride 2
https://shafeentejani.github.io/assets/images/pooling.gif
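A minimal NumPy sketch of the two operations on the slides above: sliding a 3x3 filter over an input matrix (a valid convolution with no padding is assumed) and 2x2 max pooling with stride 2:

import numpy as np

def conv2d(x, k):
    # Valid 2D convolution of input x with a small filter k (e.g. 3x3).
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)   # element-wise product, then sum
    return out

def max_pool(x, size=2, stride=2):
    # Max pooling with 2x2 windows and stride 2.
    out = np.zeros((x.shape[0] // stride, x.shape[1] // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)    # toy 6x6 input matrix
k = np.ones((3, 3)) / 9.0                        # toy 3x3 averaging filter
fmap = conv2d(x, k)                              # 4x4 feature map
print(max_pool(fmap))                            # 2x2 pooled output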
CNN for text classification
Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for
Twitter Sentiment Classification." SemEval@ NAACL-HLT. 2015.
CNN with multiple filters
https://pbs.twimg.com/media/C2j-8j5UsAACgEK.jpg
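A minimal sketch of the "CNN with multiple filters" idea for text, in the spirit of Kim (2014): Conv1D filters of several widths compute vectors for n-grams, then max pooling over time groups them. TensorFlow/Keras and all dimensions here are assumptions, not the cited papers' exact setups:

import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, seq_len = 10000, 128, 100   # assumed values

inputs = layers.Input(shape=(seq_len,), dtype="int32")
emb = layers.Embedding(vocab_size, embed_dim)(inputs)

pooled = []
for width in (3, 4, 5):                                          # n-gram widths (filter sizes)
    conv = layers.Conv1D(100, width, activation="relu")(emb)     # 100 filters per width
    pooled.append(layers.GlobalMaxPooling1D()(conv))             # max-over-time pooling

merged = layers.Concatenate()(pooled)
outputs = layers.Dense(1, activation="sigmoid")(merged)          # e.g., positive vs. negative

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()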
y_t = f(h_t)
Units with short-term dependencies often have reset gates very active
Units with long-term dependencies have active update gates z
Gated Recurrent Units
(GRUs)
Main idea:
keep around memory to capture long dependencies
Allow error messages to flow at different strengths depending on the inputs
Standard RNN computes the hidden layer at the next
time step directly: h_t = σ(W^(hh) h_{t−1} + W^(hx) x_t)
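A minimal NumPy sketch of one recurrent step: the standard RNN update above, and a GRU step with reset gate r and update gate z (weight shapes are made up; the gate convention shown is one common variant):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_h, d_x = 4, 3                          # hidden and input sizes (assumed)
rng = np.random.default_rng(0)
W_hh, W_hx = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))

def rnn_step(h_prev, x_t):
    # standard RNN: h_t = sigma(W^(hh) h_{t-1} + W^(hx) x_t)
    return sigmoid(W_hh @ h_prev + W_hx @ x_t)

W_z, U_z = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
W_r, U_r = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
W_c, U_c = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))

def gru_step(h_prev, x_t):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)             # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)             # reset gate
    h_cand = np.tanh(W_c @ x_t + U_c @ (r * h_prev))  # candidate state (reset gate scales the memory)
    return z * h_prev + (1.0 - z) * h_cand            # interpolate old state and candidate

h, x = np.zeros(d_h), rng.normal(size=d_x)
print(rnn_step(h, x))
print(gru_step(h, x))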
Bahdanau D. et al. "Neural machine translation by jointly learning to align and translate." ICLR (2015)
score(h_t, h̄_s) = h_t^T h̄_s
Compare target and source hidden states
Attention - Normalization
a_t(s) = exp(score(h_t, h̄_s)) / Σ_{s'} exp(score(h_t, h̄_{s'}))
Convert into alignment weights
Attention - Context
c_t = Σ_s a_t(s) h̄_s
Build context vector: weighted average
Attention - Context
h_t = f(h_{t−1}, c_t, e_t)
Compute the next hidden state from the previous hidden state h_{t−1}, the context vector c_t and the current input embedding e_t
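A minimal NumPy sketch of the three attention steps above: dot-product scores, softmax normalization into alignment weights, and a context vector as the weighted average of the source states (all dimensions are made up):

import numpy as np

rng = np.random.default_rng(0)
d = 4
h_t = rng.normal(size=d)                 # current target hidden state
h_src = rng.normal(size=(6, d))          # source hidden states h_bar_s, one per source position

scores = h_src @ h_t                             # score(h_t, h_bar_s) = h_t^T h_bar_s, for every s
a_t = np.exp(scores) / np.sum(np.exp(scores))    # softmax -> alignment weights a_t(s)
c_t = a_t @ h_src                                # context vector: sum_s a_t(s) h_bar_s

print(a_t, a_t.sum())                    # weights sum to 1
print(c_t)                               # weighted average of the source states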
Application Example:
IMDB Movie reviews
sentiment classification
https://uofi.box.com/v/cs510DL
Binary Classification
Dataset of 25,000 movies reviews from IMDB,
labeled by sentiment (positive/negative)
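A minimal sketch of the binary classification task, assuming TensorFlow/Keras (which ships the IMDB reviews dataset); this simple embedding-plus-pooling baseline is an illustration, not the model in the linked demo:

import tensorflow as tf
from tensorflow.keras import layers

max_words, seq_len = 10000, 200
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=max_words)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=seq_len)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=seq_len)

model = tf.keras.Sequential([
    layers.Embedding(max_words, 64),
    layers.GlobalAveragePooling1D(),           # average word vectors over the review
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.2)
print(model.evaluate(x_test, y_test))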
Application Example:
Relation Extraction from text
http://www.mathcs.emory.edu/~dsavenk/slides/relation_extraction/img/distant.png
Useful for:
• knowledge base
completion
• social media analysis
• question answering
• …
Task: binary (or multi-class)
classification
sentence S = w1 w2 .. e1 .. wj .. e2 .. wn, where e1 and e2 are entities
“The new iPhone 7 Plus includes an improved camera to take amazing pictures”
Component-Whole(e1 , e2 ) ?
YES / NO
The new iPhone 7 Plus includes an improved camera that takes amazing pictures
Models: MLP
Component-Whole(e1 , e2 )
Sigmoid ?
YES / NO
Dense Layer n
…
Dense Layer 1
Zeng, D. et al. "Relation classification via convolutional deep neural network". COLING (2014)
Models: CNN (2)
Component-Whole(e1 , e2 )
Sigmoid ?
YES / NO
Nguyen, T.H., Grishman, R. “Relation extraction: Perspective from convolutional neural networks.” VS@ HLT-NAACL. (2015)
Models: Bi-GRU
Component-Whole(e1 , e2 )
Sigmoid ?
YES / NO
Attention or
Max Pooling
Bi-GRU
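A minimal sketch of the Bi-GRU relation classifier above, using the max-pooling variant (TensorFlow/Keras, vocabulary size and dimensions are assumptions): a bidirectional GRU over the sentence, max pooling over time, and a sigmoid output for the binary Component-Whole decision:

import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, seq_len = 20000, 100, 60    # assumed values

inputs = layers.Input(shape=(seq_len,), dtype="int32")
emb = layers.Embedding(vocab_size, embed_dim)(inputs)                        # word embeddings
states = layers.Bidirectional(layers.GRU(64, return_sequences=True))(emb)   # Bi-GRU states per token
pooled = layers.GlobalMaxPooling1D()(states)                                 # max pooling over time steps
outputs = layers.Dense(1, activation="sigmoid")(pooled)                      # YES / NO for the relation

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()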
Assumption:
when two entities co-occur in a sentence, a certain relation is expressed
knowledge base:
Relation          Entity 1           Entity 2
place of birth    Michael Jackson    Gary
place of birth    Barack Obama       Hawaii
…                 …                  …
text:
Barack Obama moved from Gary ….
Michael Jackson met … in Hawaii
For many ambiguous relations, mere co-occurrence does not guarantee the
existence of the relation → distant supervision produces false positives
Attention over Instances
s: representation of the sentence set
Lin et al. “Neural Relation Extraction with Selective Attention over Instances” ACL (2016) [code]
Sentence-level ATT results
NYT10 Dataset
Align Freebase relations with
New York Times corpus (NYT)
53 possible relationships
+NA (no relation between entities)
Lin et al. “Neural Relation Extraction with Selective Attention over Instances” ACL (2016) [code]
References
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from
overfitting." Journal of machine learning research (2014)
Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal
of Machine Learning Research, Feb (2012)
Kim, Y. “Convolutional Neural Networks for Sentence Classification”, EMNLP (2014)
Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network
for Twitter Sentiment Classification." SemEval@ NAACL-HLT (2015)
Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for
statistical machine translation." EMNLP (2014)
Ilya Sutskever et al. “Sequence to sequence learning with neural networks.” NIPS (2014)
Bahdanau et al. "Neural machine translation by jointly learning to align and translate." ICLR (2015)
Gal, Y., Islam, R., Ghahramani, Z. “Deep Bayesian Active Learning with Image Data.” ICML (2017)
Nair, V., Hinton, G.E. “Rectified linear units improve restricted boltzmann machines.” ICML (2010)
Ronan Collobert, et al. “Natural language processing (almost) from scratch.” JMLR (2011)
Kumar, Shantanu. "A Survey of Deep Learning Methods for Relation Extraction." arXiv preprint
arXiv:1705.03645 (2017)
Lin et al. “Neural Relation Extraction with Selective Attention over Instances” ACL (2016) [code]
Zeng, D. et al. "Relation classification via convolutional deep neural network". COLING (2014)
Nguyen, T.H., Grishman, R. "Relation extraction: Perspective from convolutional neural networks." VS@HLT-NAACL (2015)
Zhang, D., Wang, D. "Relation classification via recurrent neural networks." arXiv preprint arXiv:1508.01006 (2015)
Zhou, P. et al. "Attention-based bidirectional LSTM networks for relation classification." ACL (2016)
Mike Mintz et al. "Distant supervision for relation extraction without labeled data." ACL-IJCNLP (2009)
References & Resources
http://web.stanford.edu/class/cs224n
https://www.coursera.org/specializations/deep-learning
https://chrisalbon.com/#Deep-Learning
http://www.asimovinstitute.org/neural-network-zoo
http://cs231n.github.io/optimization-2
https://medium.com/@ramrajchandradevan/the-evolution-of-gradient-descend-optimization-
algorithm-4106a6702d39
https://arimo.com/data-science/2016/bayesian-optimization-hyperparameter-tuning
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp
https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-
networks-on-the-internet-fbb8b1ad5df8
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-
grulstm-rnn-with-python-and-theano/
http://colah.github.io/posts/2015-08-Understanding-LSTMs
https://github.com/hyperopt/hyperopt
https://github.com/tensorflow/nmt