ML for NLP-LO4
Learning Outcomes
LO 1: Understand the concept of deep learning, build artificial neural networks that traverse layers of data abstraction, and gain a solid understanding of deep learning using TensorFlow and Keras.
LO 2: Understand text processing and vectorization for an ML use case.
LO 3: Develop and build fully automated NLP algorithms using BERT and Transformers.
Transformers and BERT
1. A transformer uses an Encoder stack to model the input, and a Decoder
stack to model the output (using information from the encoder side).
2. If we have no input and just want to model the "next word", we can drop
the Encoder side of the transformer and output the next word one at a
time. This gives us GPT.
3. If we are only interested in training a language model of the input for
some other task, then we do not need the Decoder of the transformer;
that gives us BERT.
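As an illustration (a minimal sketch using the Hugging Face transformers library, which these slides do not reference), the encoder-only and decoder-only variants can be loaded and used separately:

# Minimal sketch; assumes the Hugging Face `transformers` package (with PyTorch) is installed.
# BERT is an encoder-only stack; GPT-2 is a decoder-only stack.
from transformers import BertModel, BertTokenizer, GPT2LMHeadModel, GPT2Tokenizer

# Encoder-only: produces contextual representations of the input tokens.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
enc_inputs = bert_tok("The clocks were striking", return_tensors="pt")
hidden_states = bert(**enc_inputs).last_hidden_state   # (1, num_tokens, hidden_size)

# Decoder-only: models the "next word", one token at a time.
gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt = GPT2LMHeadModel.from_pretrained("gpt2")
dec_ids = gpt_tok("The clocks were", return_tensors="pt").input_ids
next_word_logits = gpt(dec_ids).logits[:, -1, :]        # scores for the next token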
BERT (Bidirectional Encoder Representations from Transformers)
Model input dimension: 512 (maximum sequence length); the input and output vectors at each position have the same size.
BERT pretraining
• We start with independent word embeddings at the first layer.
Vector Embedding of Words
• Traditional method - Bag of Words model
– Either uses one-hot encoding.
• Each word in the vocabulary is represented by one bit position in a HUGE vector.
• For example, if we have a vocabulary of 10,000 words, and "Hello" is the 4th word in the dictionary, it would be represented by: 0 0 0 1 0 0 . . . . . . . 0 0 0
– Or uses a document representation.
• Each word in the vocabulary is represented by its presence in documents.
• For example, if we have a corpus of 1M documents, and "Hello" occurs only in the 1st, 3rd and 5th documents, it would be represented by: 1 0 1 0 1 0 . . . . . . . 0 0 0
– Context information is not utilized.
• Word embeddings
– Store each word as a point in space, where it is represented by a dense vector of a fixed number of dimensions (generally 300).
– Unsupervised, built just by reading a huge corpus.
– For example, "Hello" might be represented as: [0.4, -0.11, 0.55, 0.3 . . . 0.1, 0.02].
– Dimensions are basically projections along different axes, more of a mathematical concept.
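To make the contrast concrete, here is a small sketch (plain NumPy; the toy vocabulary and values are illustrative, not taken from the slides) of a one-hot vector versus a dense embedding:

import numpy as np

vocab = ["the", "was", "a", "hello", "cold", "day"]       # toy 6-word vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

# One-hot encoding: one bit position per word, vector length = vocabulary size.
one_hot_hello = np.zeros(len(vocab))
one_hot_hello[word_to_id["hello"]] = 1.0                  # [0, 0, 0, 1, 0, 0]

# Dense embedding: a fixed number of dimensions (here 4 instead of ~300);
# in practice the values are learned from a corpus, random placeholders here.
embedding_matrix = np.random.randn(len(vocab), 4) * 0.1
dense_hello = embedding_matrix[word_to_id["hello"]]

print(one_hot_hello)
print(dense_hello)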
Working with vectors
• Similarity to a group of words
– “Find me words most similar to cat, dog and cow”.
– Calculate the pairwise similarities and sum them, as in the sketch below:
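A minimal sketch of that computation (cosine similarity with NumPy; the vocabulary and random vectors below stand in for trained embeddings):

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar_to_group(query_words, embeddings, topn=3):
    """Score every word by the sum of its cosine similarities to the query words."""
    scores = {}
    for w, vec in embeddings.items():
        if w in query_words:
            continue
        scores[w] = sum(cosine(vec, embeddings[q]) for q in query_words)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Toy usage with random vectors standing in for trained word embeddings:
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["cat", "dog", "cow", "kitten", "table", "run"]}
print(most_similar_to_group(["cat", "dog", "cow"], embeddings))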
Applications of Word Vectors
• Word Similarity
• Machine Translation
• Part-of-Speech and Named Entity Recognition
• Relation Extraction
• Sentiment Analysis
• Co-reference Resolution
– Chaining entity mentions across multiple documents - can we find and unify the multiple
contexts in which mentions occur?
• Clustering
– Words in the same class naturally occur in similar contexts, and this feature vector can
directly be used with any conventional clustering algorithm (K-Means, agglomerative, etc.); see the sketch after this list.
A human doesn't have to waste time hand-picking useful word features to cluster on.
• Semantic Analysis of Documents
– Build word distributions for various topics, etc.
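As a sketch of the clustering bullet above (scikit-learn's KMeans applied directly to word vectors; the vectors here are random placeholders for trained embeddings):

import numpy as np
from sklearn.cluster import KMeans

words = ["cat", "dog", "cow", "paris", "london", "tokyo"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 100))   # stand-in for trained word vectors

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(word, "-> cluster", label)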
Vector Embedding of Words
• Four main methods:
– Latent Semantic Analysis/Indexing (1988)
• Term weighting-based model
• Consider occurrences of terms at document level.
– Word2Vec (2013)
• Prediction-based model.
• Consider occurrences of terms at context level.
– GloVe (2014)
• Count-based model.
• Consider occurrences of terms at context level.
– ELMo (2018)
• Language model-based.
• A different embedding for each word for each task.
Word2Vec: Local contexts
• Instead of entire documents, Word2Vec uses the words up to k
positions away from each center word.
– These words are called context words.
• Example for k=3:
– “It was a bright cold day in April, and the clocks were striking”.
– The center word (highlighted in red on the original slide) is also called the focus word.
– The context words (highlighted in blue) are the k words on either side of it, also called target words.
• Word2Vec considers all words as center words, and all
their context words.
Word2Vec: Data generation (window size = 2)
• Example: d1 = "king brave man", d2 = "queen beautiful women"

word        word one-hot encoding    neighbors          neighbors one-hot encoding
king        [1,0,0,0,0,0]            brave, man         [0,1,1,0,0,0]
brave       [0,1,0,0,0,0]            king, man          [1,0,1,0,0,0]
man         [0,0,1,0,0,0]            king, brave        [1,1,0,0,0,0]
queen       [0,0,0,1,0,0]            beautiful, women   [0,0,0,0,1,1]
beautiful   [0,0,0,0,1,0]            queen, women       [0,0,0,1,0,1]
women       [0,0,0,0,0,1]            queen, beautiful   [0,0,0,1,1,0]
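A minimal sketch (plain Python; the helper name is chosen here for illustration) of how such (word, neighbor) training pairs are generated with a window size of 2:

def generate_pairs(documents, window=2):
    """Yield (center word, context word) pairs within the given window."""
    for doc in documents:
        tokens = doc.split()
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, tokens[j]

docs = ["king brave man", "queen beautiful women"]
for center, context in generate_pairs(docs):
    print(center, "->", context)
# king -> brave, king -> man, brave -> king, brave -> man, man -> king, man -> brave, ...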
Word2Vec: main context representation models
[Figure: the two Word2Vec architectures - CBOW, where the context words w-2, w-1, w1, w2 are summed and projected to predict the center word w0, and Skip-gram, where the center word w0 is projected to predict each of its context words w-2, w-1, w1, w2.]
How does Word2Vec work?
• Represent each word as a d dimensional vector.
• Represent each context as a d dimensional vector.
• Initialize all vectors to random weights.
• Arrange the vectors in two matrices, W and C (see the sketch below).
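A minimal NumPy sketch of that setup (the vocabulary, embedding dimension and initialization scale are illustrative assumptions):

import numpy as np

vocab = ["king", "brave", "man", "queen", "beautiful", "women"]
word_to_id = {w: i for i, w in enumerate(vocab)}
d = 50                                              # embedding dimension

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), d))     # one row per word (word vectors)
C = rng.normal(scale=0.1, size=(len(vocab), d))     # one row per word (context vectors)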
Word2Vec: Neural network representation
[Figure sequence: a network with a one-hot input layer of size |Vw|, a hidden (projection) layer between weight matrices w1 and w2, and an output layer of size |Vc| with sigmoid units. Each slide feeds one training example from the table above through the network: king predicting brave and man, brave predicting king and man, man predicting king and brave, queen predicting beautiful and women, beautiful predicting queen and women, and women predicting queen and beautiful.]
Skip-gram: Training method
• The prediction problem is modeled using soft-max:
p(c | w) = exp(vc · vw) / Σ c' exp(vc' · vw)
where vw is the vector of the center word w, vc is the vector of context word c, and the sum runs over all candidate context words c'.
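A NumPy sketch of that soft-max prediction (random vectors stand in for the trained matrices W and C):

import numpy as np

vocab = ["king", "brave", "man", "queen", "beautiful", "women"]
word_to_id = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), 50))    # center-word vectors
C = rng.normal(scale=0.1, size=(len(vocab), 50))    # context-word vectors

def context_distribution(word):
    """Soft-max distribution p(c | word) over all candidate context words."""
    scores = C @ W[word_to_id[word]]                # one score per candidate context word
    exps = np.exp(scores - scores.max())            # subtract max for numerical stability
    return exps / exps.sum()

print(context_distribution("king"))                 # probabilities summing to 1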
Skip-gram: Example
• While more text remains:
– Extract a word window:
Relations Learned by Word2Vec
• A relation is defined by the vector displacement in the first column. For each start word in the
other column, the closest displaced word is shown.
• “Efficient Estimation of Word Representations in Vector Space” Tomas Mikolov, Kai Chen,
Greg Corrado, Jeffrey Dean. arXiv, 2013.
What is language modelling?
• Today’s goal: assign a probability to a sentence
– Machine Translation:
• P(high winds tonight) > P(large winds tonight)
– Spell Correction
• The office is about fifteen minuets from my house!
– P(about fifteen minutes from) > P(about fifteen minuets from)
– Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
– Also: summarization, question answering, etc.
– Reminder: the Chain Rule
P(w1, w2, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wn | w1, ..., wn-1)
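A small sketch of the chain rule with a toy bigram model (the probabilities below are invented for illustration):

# Toy bigram model: P(word | previous word); values are made up for illustration.
bigram_p = {
    ("<s>", "I"): 0.2, ("I", "saw"): 0.1, ("saw", "a"): 0.3, ("a", "van"): 0.05,
}

def sentence_probability(words):
    """Chain rule with a first-order Markov assumption: P(w1..wn) ~ prod_i P(wi | wi-1)."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram_p.get((prev, w), 1e-6)   # tiny back-off probability for unseen bigrams
        prev = w
    return p

print(sentence_probability(["I", "saw", "a", "van"]))   # 0.2 * 0.1 * 0.3 * 0.05 = 3e-4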
RNN Language Model
[Figure: an RNN language model unrolled over time. At each step it outputs a distribution ŷ<t> over the vocabulary - first P(a), P(aaron), ..., P(cats), ..., P(zulu), then P(average | cats), then P(15 | cats, average), and so on up to P(<EOS> | ...).]
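A minimal Keras sketch of such a model (the vocabulary size and layer widths are illustrative assumptions, not taken from the slides):

import tensorflow as tf

vocab_size = 10000   # assumed vocabulary size
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),               # token ids -> dense vectors
    tf.keras.layers.LSTM(256, return_sequences=True),         # one hidden state per time step
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # P(next word | history) at each step
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Training data: inputs are token-id sequences, targets are the same sequences shifted by one.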
ELMo: What input?
• Transformations are applied to each token
before it is provided as input to the first
LSTM layer.
• Pros of character embeddings:
– They allow the model to pick up on morphological
features that word-level embeddings could miss.
– They ensure a valid representation even for
out-of-vocabulary words.
– They allow the model to pick up on n-gram features
that build more powerful representations.
– The highway network layers allow for
smoother information transfer through the
input.
ELMo: Embeddings from Language Models
[Figure: ELMo architecture, showing the intermediate representation (output vector) produced at each layer.]
ELMo: Mathematical details
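A hedged NumPy sketch of the standard ELMo combination (a task-specific scale γ times a softmax-weighted sum of the per-layer representations, ELMo_k = γ · Σ_j s_j · h_{k,j}); the dimensions and function name below are assumptions for illustration:

import numpy as np

def elmo_embedding(layer_outputs, s, gamma):
    """Combine per-layer vectors h_{k,j} into one ELMo vector for token k.

    layer_outputs: array of shape (num_layers, hidden_dim), one vector per biLM layer.
    s:             unnormalized task-specific layer weights (softmax-normalized below).
    gamma:         task-specific scale factor.
    """
    weights = np.exp(s - s.max())
    weights /= weights.sum()                          # softmax over the layers
    return gamma * (weights[:, None] * layer_outputs).sum(axis=0)

# Toy usage: 3 layers (token embedding + 2 biLSTM layers), hidden size 8.
layers = np.random.randn(3, 8)
print(elmo_embedding(layers, s=np.zeros(3), gamma=1.0))   # equal weighting of the layers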
THANKS!
Any questions?