
Machine Learning for NLP

Learning Outcomes
LO 1: Apply the concepts of deep learning to build artificial neural networks, traverse layers of data abstraction, and gain a solid understanding of deep learning using TensorFlow and Keras.
LO 2: Understand text processing and vectorization for ML use cases.
LO 3: Develop and build fully automated NLP algorithms with BERT and Transformers.
LO 4: Understand the concepts of NLP, feature engineering, natural language generation, automated speech recognition, speech-to-text conversion, and text-to-speech conversion.
Machine Learning for NLP

LO3: Develop and build fully automated NLP algorithms with BERT and Transformers
Transformers and BERT
1. A transformer uses an Encoder stack to model the input and a Decoder stack to model the output (using input information from the encoder side).
2. If we have no separate input and just want to model the "next word", we can drop the Encoder side of the transformer and emit the next word one token at a time. This gives us GPT.
3. If we are only interested in training a language model of the input for some other task, we do not need the Decoder of the transformer; that gives us BERT.
BERT (Bidirectional Encoder Representations from Transformers)
Model input dimension: 512 tokens; the input and output vectors have the same size.
BERT pretraining

• ULM-FiT (2018): pre-training ideas, transfer learning in NLP.
• ELMo: bidirectional training (LSTM).
• Transformer: uses context from the left, but context from the right is still missing.
• GPT: uses the Transformer Decoder half.
• BERT: switches from Decoder to Encoder so that it can use both sides in training, and invents corresponding training tasks: the masked language model.
BERT Pretraining Task 1: masked words

15% of the input tokens are selected for masking. Out of this 15% (see the sketch below):
• 80% are replaced with [MASK],
• 10% are replaced with random words,
• 10% are left as the original words.
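A minimal sketch of this 80/10/10 corruption rule over a list of token ids; the MASK_ID and VOCAB_SIZE values are placeholders, not taken from any specific BERT implementation:

```python
import random

MASK_ID = 103          # hypothetical [MASK] token id
VOCAB_SIZE = 30000     # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Apply BERT-style masking: pick ~15% of positions, then 80/10/10."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)   # -100 = ignored in the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                          # the model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random word
            # else: 10% keep the original token unchanged
    return inputs, labels
```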
BERT Pretraining Task 2: two sentences (next sentence prediction)

• 50% of the pairs use the true second sentence,
• 50% use a random second sentence.
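A hedged sketch of how such 50/50 sentence pairs could be built from a list of consecutive sentences (illustrative only; real BERT preprocessing also handles tokenization, segment ids and length limits):

```python
import random

def make_nsp_pairs(sentences):
    """Yield (sentence_a, sentence_b, is_next): 50% true next sentence, 50% random."""
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))          # true second sentence
        else:
            pairs.append((sentences[i], random.choice(sentences), 0))  # random second sentence
    return pairs
```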
Fine-tuning BERT for other specific tasks

• MNLI (natural language inference)
• QQP (Quora Question Pairs, semantic equivalence)
• QNLI (NL inference dataset)
• STS-B (textual similarity)
• MRPC (paraphrase, Microsoft)
• RTE (textual entailment)
• SWAG (commonsense inference)
• SST-2 (Stanford Sentiment Treebank, sentiment): 215k phrases with fine-grained sentiment labels in the parse trees of 11k sentences
• CoLA (linguistic acceptability)
• SQuAD (question answering)
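As an illustration of the fine-tuning step for a sentence-pair task such as MRPC, here is a minimal sketch using the Hugging Face transformers library; the model name, label count and the tiny toy batch are assumptions, not part of the slides:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy paraphrase-style batch: two sentence pairs with 0/1 labels (made-up examples).
batch = tokenizer(["The car is red.", "A man plays guitar."],
                  ["The automobile is red.", "A woman reads a book."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # classification head on top of the [CLS] token
outputs.loss.backward()                   # one fine-tuning step
optimizer.step()
```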
Feature Extraction

• We start with independent word embeddings at the first level.
• We end up with an embedding for each word that depends on the current input.
Vector Embedding of Words

• Traditional Method - Bag of Words Model
  – Either uses one-hot encoding:
    • Each word in the vocabulary is represented by one bit position in a HUGE vector.
    • For example, if we have a vocabulary of 10,000 words and "Hello" is the 4th word in the dictionary, it would be represented by: 0 0 0 1 0 0 . . . . . . . 0 0 0
  – Or uses document representation:
    • Each word in the vocabulary is represented by its presence in documents.
    • For example, if we have a corpus of 1M documents and "Hello" occurs only in the 1st, 3rd and 5th documents, it would be represented by: 1 0 1 0 1 0 . . . . . . . 0 0 0
  – Context information is not utilized.

• Word Embeddings
  – Store each word as a point in space, where it is represented by a dense vector of a fixed number of dimensions (generally 300).
  – Unsupervised, built just by reading a huge corpus.
  – For example, "Hello" might be represented as: [0.4, 0.55, -0.11, 0.3 . . . 0.1, 0.02].
  – Dimensions are basically projections along different axes, more of a mathematical concept.
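A small sketch contrasting the two representations above; the vocabulary size, the index of "Hello" and the dense values are made-up examples:

```python
import numpy as np

vocab_size = 10_000
hello_index = 3                      # "Hello" is the 4th word in the dictionary

# One-hot / bag-of-words style: a huge, sparse vector with a single 1.
one_hot = np.zeros(vocab_size)
one_hot[hello_index] = 1.0

# Dense word embedding: a fixed number of dimensions (e.g. 300) with real values.
rng = np.random.default_rng(0)
dense = rng.normal(size=300)         # in practice learned from a large corpus, not random

print(one_hot.shape, dense.shape)    # (10000,) (300,)
```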
Example

• vector[Queen] ≈ vector[King] - vector[Man] + vector[Woman]
• vector[Paris] ≈ vector[France] - vector[Italy] + vector[Rome]
  – This can be interpreted as "France is to Paris as Italy is to Rome".
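The analogy arithmetic can be reproduced with pretrained vectors; a sketch using gensim's downloader (the model name and the availability of the download are assumptions):

```python
import gensim.downloader as api

# Small pretrained GloVe vectors (assumed available via gensim's downloader).
vectors = api.load("glove-wiki-gigaword-50")

# vector[King] - vector[Man] + vector[Woman] ≈ vector[Queen]
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# vector[France] - vector[Italy] + vector[Rome] ≈ vector[Paris]
print(vectors.most_similar(positive=["france", "rome"], negative=["italy"], topn=3))
```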
Working with vectors

• Finding the most similar words to a word w.
  – Compute the similarity from word w to all other words.
  – This is a single matrix-vector product: W · w
    • W is the word embedding matrix of |V| rows and d columns.
    • The result is a |V|-sized vector of similarities.
    • Take the indices of the k highest values.
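A numpy sketch of this matrix-vector product, assuming the rows of W are L2-normalized word vectors and w is the query word's vector:

```python
import numpy as np

def most_similar(W, w, k=5):
    """W: |V| x d embedding matrix (rows assumed L2-normalized); w: d-dim query vector."""
    sims = W @ (w / np.linalg.norm(w))        # one matrix-vector product -> |V| similarities
    top = np.argpartition(-sims, k)[:k]       # indices of the k highest values
    return top[np.argsort(-sims[top])]        # sorted by decreasing similarity
```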
Working with vectors

• Similarity to a group of words
  – "Find me words most similar to cat, dog and cow."
  – Calculate the pairwise similarities and sum them: W·cat + W·dog + W·cow
  – Now find the indices of the highest values as before.
  – Several matrix-vector products are wasteful. Better option: sum the query vectors first and do a single product, W·(cat + dog + cow), as sketched below.
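A sketch of that "better option" under the same assumptions as before; by linearity, the similarity to the summed query vector equals the sum of the individual dot-product similarities:

```python
import numpy as np

def most_similar_to_group(W, query_vecs, k=5):
    """query_vecs: vectors for e.g. 'cat', 'dog', 'cow'; returns indices of the k best matches."""
    q = np.sum(query_vecs, axis=0)            # combine the queries into one vector
    sims = W @ q                               # a single matrix-vector product
    top = np.argpartition(-sims, k)[:k]
    return top[np.argsort(-sims[top])]
```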
Applications of Word Vectors

• Word Similarity
• Machine Translation
• Part-of-Speech Tagging and Named Entity Recognition
• Relation Extraction
• Sentiment Analysis
• Co-reference Resolution
  – Chaining entity mentions across multiple documents: can we find and unify the multiple contexts in which mentions occur?
• Clustering
  – Words in the same class naturally occur in similar contexts, and this feature vector can be used directly with any conventional clustering algorithm (K-Means, agglomerative, etc.). Humans don't have to waste time hand-picking useful word features to cluster on.
• Semantic Analysis of Documents
  – Build word distributions for various topics, etc.
Vector Embedding of Words

• Four main methods described in the talk:
  – Latent Semantic Analysis/Indexing (1988)
    • Term weighting-based model.
    • Considers occurrences of terms at the document level.
  – Word2Vec (2013)
    • Prediction-based model.
    • Considers occurrences of terms at the context level.
  – GloVe (2014)
    • Count-based model.
    • Considers occurrences of terms at the context level.
  – ELMo (2018)
    • Language model-based.
    • A different embedding for each word for each task.
Word2Vec: Local contexts

• Instead of entire documents, Word2Vec uses words k positions away from each center word.
  – These words are called context words.
• Example for k = 3:
  – "It was a bright cold day in April, and the clocks were striking."
  – Center word: shown in red (also called the focus word).
  – Context words: shown in blue (also called target words).
• Word2Vec considers all words as center words, and all their context words.
Word2Vec: Data generation (window size = 2)

• Example: d1 = "king brave man", d2 = "queen beautiful women"

  word       word one-hot encoding   neighbor    neighbor one-hot encoding
  king       [1,0,0,0,0,0]           brave       [0,1,0,0,0,0]
  king       [1,0,0,0,0,0]           man         [0,0,1,0,0,0]
  brave      [0,1,0,0,0,0]           king        [1,0,0,0,0,0]
  brave      [0,1,0,0,0,0]           man         [0,0,1,0,0,0]
  man        [0,0,1,0,0,0]           king        [1,0,0,0,0,0]
  man        [0,0,1,0,0,0]           brave       [0,1,0,0,0,0]
  queen      [0,0,0,1,0,0]           beautiful   [0,0,0,0,1,0]
  queen      [0,0,0,1,0,0]           women       [0,0,0,0,0,1]
  beautiful  [0,0,0,0,1,0]           queen       [0,0,0,1,0,0]
  beautiful  [0,0,0,0,1,0]           women       [0,0,0,0,0,1]
  women      [0,0,0,0,0,1]           queen       [0,0,0,1,0,0]
  women      [0,0,0,0,0,1]           beautiful   [0,0,0,0,1,0]
Word2Vec: Data generation (window size = 2)

• Example: d1 = "king brave man", d2 = "queen beautiful women"
• The neighbors of each word can also be combined into a single multi-hot target:

  word       word one-hot encoding   neighbors          neighbor encoding
  king       [1,0,0,0,0,0]           brave, man         [0,1,1,0,0,0]
  brave      [0,1,0,0,0,0]           king, man          [1,0,1,0,0,0]
  man        [0,0,1,0,0,0]           king, brave        [1,1,0,0,0,0]
  queen      [0,0,0,1,0,0]           beautiful, women   [0,0,0,0,1,1]
  beautiful  [0,0,0,0,1,0]           queen, women       [0,0,0,1,0,1]
  women      [0,0,0,0,0,1]           queen, beautiful   [0,0,0,1,1,0]
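A sketch that generates the (word, neighbor) training pairs shown in the tables for a given window size, using the same tiny corpus:

```python
def skipgram_pairs(sentences, window=2):
    """Generate (center word, context word) pairs as in the tables above."""
    pairs = []
    for sent in sentences:
        tokens = sent.split()
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["king brave man", "queen beautiful women"]))
# [('king', 'brave'), ('king', 'man'), ('brave', 'king'), ('brave', 'man'), ...]
```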
Word2Vec: main context representation models

• Continuous Bag of Words (CBOW): the context words (w-2, w-1, w1, w2) are the input; their projections are summed to predict the center word w0 at the output.
• Skip-gram: the center word w0 is the input; its projection is used to predict each context word (w-2, w-1, w1, w2) at the output.

 Word2Vec is a predictive model.
 We will focus on the Skip-gram model.
How does Word2Vec work?

• Represent each word as a d-dimensional vector.
• Represent each context as a d-dimensional vector.
• Initialize all vectors to random weights.
• Arrange the vectors in two matrices, W and C.
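A minimal sketch of this setup; the vocabulary size and embedding dimension are placeholders:

```python
import numpy as np

V, d = 6, 50                              # vocabulary size and embedding dimension (placeholders)
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(V, d))    # word (focus) vectors, one row per vocabulary word
C = rng.normal(scale=0.1, size=(V, d))    # context vectors, one row per vocabulary word

king_id = 0
v_king = W[king_id]                       # look up the current vector for "king"
```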
Word2Vec: Neural Network representation

[Figure, repeated over several slides: a network with a one-hot input layer of size |Vw|, a hidden layer with weight matrices w1 and w2, and a sigmoid output layer of size |Vc|. Each slide activates one focus word at the input and its context words at the output: king -> brave, man; brave -> king, man; man -> king, brave; queen -> beautiful, women; beautiful -> queen, women; women -> queen, beautiful.]
Skip-gram: Training method

• The prediction problem is modeled using softmax:

  P(c | w) = exp(v_c · v_w) / Σ_{c' ∈ C} exp(v_{c'} · v_w)

  – Predict context word(s) c from focus word w.
  – Looks like logistic regression!
    • v_c and v_w are the features, and the evidence is their dot product v_c · v_w.
• The objective function (in log space):

  argmax Σ_{(w,c) ∈ D} log P(c | w) = Σ_{(w,c) ∈ D} ( v_c · v_w − log Σ_{c'} exp(v_{c'} · v_w) )
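The softmax prediction above in a few lines of numpy, using the W and C matrices from the earlier slide (a sketch, not an efficient implementation):

```python
import numpy as np

def p_context_given_word(W, C, w_id, c_id):
    """P(c | w) = exp(v_c . v_w) / sum_c' exp(v_c' . v_w)."""
    scores = C @ W[w_id]                        # dot product of v_w with every context vector
    scores -= scores.max()                      # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c_id]
```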
Skip-gram: Example

• While more text:
  – Extract a word window with focus word w and context word c.
  – Try setting the vector values such that P(c | w) is high!
  – Create a corrupt example by choosing a random word c'.
  – Try setting the vector values such that P(c' | w) is low!
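This loop is essentially skip-gram with negative sampling; a hedged sketch of one update step (the learning rate and number of corrupt words are arbitrary choices, not values from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, w_id, c_id, neg_ids, lr=0.025):
    """Push sigma(v_c . v_w) towards 1 for the true pair and towards 0 for corrupt pairs."""
    v_w = W[w_id].copy()
    grad_w = np.zeros_like(v_w)
    for cid, label in [(c_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        score = sigmoid(C[cid] @ v_w)
        grad = score - label                    # gradient of the logistic loss w.r.t. the score
        grad_w += grad * C[cid]
        C[cid] -= lr * grad * v_w               # update the context vector
    W[w_id] -= lr * grad_w                      # update the focus-word vector
```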
Relations Learned by Word2Vec

• A relation is defined by the vector displacement in the first column. For each start word in the other column, the closest displaced word is shown.

• "Efficient Estimation of Word Representations in Vector Space", Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, arXiv 2013.
What is language modelling?

• Today's goal: assign a probability to a sentence.
  – Machine Translation:
    • P(high winds tonight) > P(large winds tonight)
  – Spell Correction:
    • "The office is about fifteen minuets from my house!"
    • P(about fifteen minutes from) > P(about fifteen minuets from)
  – Speech Recognition:
    • P(I saw a van) >> P(eyes awe of an)
  – Plus summarization, question answering, etc.!
  – Reminder: the Chain Rule (written out below).
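The chain rule the slide refers to, written out for completeness:

  P(w_1, w_2, ..., w_n) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) · ... · P(w_n | w_1, ..., w_{n-1})

  For example: P(cats average 15) = P(cats) · P(average | cats) · P(15 | cats, average)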
RNN Language Model

[Figure: an RNN unrolled over the sentence "Cats average 15 hours of sleep a day. <EOS>". The first input is x<1> = 0 (with a<0> = 0) and the first output ŷ<1> is a softmax over the whole vocabulary: P(a), P(aaron), ..., P(cats), ..., P(zulu). At each later step the input is the previous word, x<t> = y<t-1>, and the output ŷ<t> is the distribution over the next word given the history: P(average | cats), P(15 | cats, average), ..., P(<EOS> | ...). The same weights W are shared across all time steps.]

• Cats average 15 hours of sleep a day. <EOS>
  – P(sentence) = P(cats) · P(average | cats) · P(15 | cats, average) · ... · P(<EOS> | ...)
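Since the course uses TensorFlow and Keras (LO 1), here is a minimal sketch of such an RNN language model; the vocabulary size and layer sizes are placeholders, not values from the slide:

```python
import tensorflow as tf

vocab_size, d = 10_000, 128                  # placeholders

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, d),
    tf.keras.layers.SimpleRNN(d, return_sequences=True),       # weights W shared across time steps
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # y_hat<t>: distribution over next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# Training pairs: the input at step t is the previous word and the target is the current word,
# so the model learns P(w_t | w_1, ..., w_{t-1}) and P(sentence) is the product over steps.
```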
Embeddings from Language Models

• The ELMo architecture trains a language model using a 2-layer bidirectional LSTM (biLM).
• What input?
  – Traditional neural language models use fixed-length word embeddings:
    • One-hot encoding.
    • Word2Vec.
    • GloVe.
    • Etc.
  – ELMo uses a more complex representation.
ELMo: What input?

• Transformations are applied to each token before it is provided as input to the first LSTM layer.
• Pros of character embeddings:
  – They allow the model to pick up on morphological features that word-level embeddings could miss.
  – They ensure a valid representation even for out-of-vocabulary words.
  – They allow the model to pick up on n-gram features that build more powerful representations.
  – The highway network layers allow for smoother information transfer through the input.
ELMo: Embeddings from Language Models

[Figure: the intermediate representations (output vectors) produced by each layer of the biLM.]
ELMo mathematical details

• The function f performs the following operation on word k of the input:

  ELMo_k^{task} = γ^{task} · Σ_{j=0}^{L} s_j^{task} · h_{k,j}^{LM}

  – where s_j^{task} represents softmax-normalized weights over the biLM layers and γ^{task} is a task-specific scaling factor.

• ELMo learns a separate representation for each task:
  – question answering, sentiment analysis, etc.
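A numpy sketch of the weighted combination above, with made-up layer activations (L = 2 biLSTM layers plus the token layer; the dimension 1024 is a placeholder):

```python
import numpy as np

def elmo_embedding(h_layers, s_weights, gamma=1.0):
    """h_layers: (L+1, d) hidden states for one token; s_weights: unnormalized per-layer weights."""
    s = np.exp(s_weights) / np.exp(s_weights).sum()    # softmax-normalized weights s_j
    return gamma * (s[:, None] * h_layers).sum(axis=0) # weighted sum over layers

h = np.random.randn(3, 1024)        # token layer + 2 biLSTM layers (placeholder values)
w = np.zeros(3)                     # learned per-task weights (uniform after softmax here)
print(elmo_embedding(h, w).shape)   # (1024,)
```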
THANKS!

Any questions?
