
Machine Learning for NLP

Learning Outcomes
LO 1: Apply the concepts of deep learning to build artificial neural networks, traverse layers of data abstraction, and gain a solid understanding of deep learning using TensorFlow and Keras.
LO 2: Understand text processing and vectorization for ML use cases.
LO 3: Develop and build fully automated NLP algorithms with BERT and Transformers.
LO 4: Understand the concepts of NLP, feature engineering, natural language generation, automated speech recognition, speech-to-text conversion, and text-to-speech conversion.
Machine Learning for NLP

LO3: Develop and build fully automated NLP algorithms with BERT and Transformers
Transformers and BERT
1. A transformer uses an Encoder stack to model the input and a Decoder stack to model the output (using input information from the encoder side).
2. If we have no separate input and just want to model the "next word", we can drop the Encoder side of the transformer and emit the next word one token at a time. This gives us GPT.
3. If we are only interested in training a language model of the input for some other task, we do not need the Decoder of the transformer; that gives us BERT.
BERT (Bidirectional Encoder Representations from Transformers)
Model input dimension: 512 tokens; the input and output vectors have the same size.
BERT pretraining

• ULM-FiT (2018): pre-training ideas, transfer learning in NLP.
• ELMo: bidirectional training (LSTM).
• Transformer: uses context from the left, but context from the right is still missing.
• GPT: uses the Transformer Decoder half.
• BERT: switches from Decoder to Encoder so that it can use both sides in training, and invents corresponding training tasks: the masked language model.
BERT Pretraining Task 1: masked words

15% of the input tokens are selected for masking. Out of this 15% (see the sketch below):
• 80% are replaced with [MASK],
• 10% are replaced with random words,
• 10% are left as the original words.
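A minimal sketch of this 80/10/10 corruption rule over a list of token ids; the MASK_ID and VOCAB_SIZE values are placeholders, not taken from any specific BERT implementation:

```python
import random

MASK_ID = 103          # hypothetical [MASK] token id
VOCAB_SIZE = 30000     # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Apply BERT-style masking: pick ~15% of positions, then 80/10/10."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)   # -100 = ignored in the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                          # the model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random word
            # else: 10% keep the original token unchanged
    return inputs, labels
```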
BERT Pretraining Task 2: two sentences (next sentence prediction)

• 50% of the pairs use the true second sentence,
• 50% use a random second sentence.
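A hedged sketch of how such 50/50 sentence pairs could be built from a list of consecutive sentences (illustrative only; real BERT preprocessing also handles tokenization, segment ids and length limits):

```python
import random

def make_nsp_pairs(sentences):
    """Yield (sentence_a, sentence_b, is_next): 50% true next sentence, 50% random."""
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))          # true second sentence
        else:
            pairs.append((sentences[i], random.choice(sentences), 0))  # random second sentence
    return pairs
```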
Fine-tuning BERT for other specific tasks

• MNLI (natural language inference)
• QQP (Quora Question Pairs, semantic equivalence)
• QNLI (NL inference dataset)
• STS-B (textual similarity)
• MRPC (paraphrase, Microsoft)
• RTE (textual entailment)
• SWAG (commonsense inference)
• SST-2 (Stanford Sentiment Treebank, sentiment): 215k phrases with fine-grained sentiment labels in the parse trees of 11k sentences
• CoLA (linguistic acceptability)
• SQuAD (question answering)
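As an illustration of the fine-tuning step for a sentence-pair task such as MRPC, here is a minimal sketch using the Hugging Face transformers library; the model name, label count and the tiny toy batch are assumptions, not part of the slides:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy paraphrase-style batch: two sentence pairs with 0/1 labels (made-up examples).
batch = tokenizer(["The car is red.", "A man plays guitar."],
                  ["The automobile is red.", "A woman reads a book."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # classification head on top of the [CLS] token
outputs.loss.backward()                   # one fine-tuning step
optimizer.step()
```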
Feature Extraction

• We start with independent word embeddings at the first level.
• We end up with an embedding for each word that depends on the current input.
Vector Embedding of Words

• Traditional Method - Bag of Words Model
  – Either uses one-hot encoding:
    • Each word in the vocabulary is represented by one bit position in a HUGE vector.
    • For example, if we have a vocabulary of 10,000 words and "Hello" is the 4th word in the dictionary, it would be represented by: 0 0 0 1 0 0 . . . . . . . 0 0 0
  – Or uses document representation:
    • Each word in the vocabulary is represented by its presence in documents.
    • For example, if we have a corpus of 1M documents and "Hello" occurs only in the 1st, 3rd and 5th documents, it would be represented by: 1 0 1 0 1 0 . . . . . . . 0 0 0
  – Context information is not utilized.

• Word Embeddings
  – Store each word as a point in space, where it is represented by a dense vector of a fixed number of dimensions (generally 300).
  – Unsupervised, built just by reading a huge corpus.
  – For example, "Hello" might be represented as: [0.4, 0.55, -0.11, 0.3 . . . 0.1, 0.02].
  – Dimensions are basically projections along different axes, more of a mathematical concept.
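A small sketch contrasting the two representations above; the vocabulary size, the index of "Hello" and the dense values are made-up examples:

```python
import numpy as np

vocab_size = 10_000
hello_index = 3                      # "Hello" is the 4th word in the dictionary

# One-hot / bag-of-words style: a huge, sparse vector with a single 1.
one_hot = np.zeros(vocab_size)
one_hot[hello_index] = 1.0

# Dense word embedding: a fixed number of dimensions (e.g. 300) with real values.
rng = np.random.default_rng(0)
dense = rng.normal(size=300)         # in practice learned from a large corpus, not random

print(one_hot.shape, dense.shape)    # (10000,) (300,)
```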
Example

• vector[Queen] ≈ vector[King] - vector[Man] + vector[Woman]
• vector[Paris] ≈ vector[France] - vector[Italy] + vector[Rome]
  – This can be interpreted as "France is to Paris as Italy is to Rome".
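The analogy arithmetic can be reproduced with pretrained vectors; a sketch using gensim's downloader (the model name and the availability of the download are assumptions):

```python
import gensim.downloader as api

# Small pretrained GloVe vectors (assumed available via gensim's downloader).
vectors = api.load("glove-wiki-gigaword-50")

# vector[King] - vector[Man] + vector[Woman] ≈ vector[Queen]
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# vector[France] - vector[Italy] + vector[Rome] ≈ vector[Paris]
print(vectors.most_similar(positive=["france", "rome"], negative=["italy"], topn=3))
```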
Working with vectors

• Finding the most similar words to a word w.
  – Compute the similarity from word w to all other words.
  – This is a single matrix-vector product: W · w
    • W is the word embedding matrix of |V| rows and d columns.
    • The result is a |V|-sized vector of similarities.
    • Take the indices of the k highest values.
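A numpy sketch of this matrix-vector product, assuming the rows of W are L2-normalized word vectors and w is the query word's vector:

```python
import numpy as np

def most_similar(W, w, k=5):
    """W: |V| x d embedding matrix (rows assumed L2-normalized); w: d-dim query vector."""
    sims = W @ (w / np.linalg.norm(w))        # one matrix-vector product -> |V| similarities
    top = np.argpartition(-sims, k)[:k]       # indices of the k highest values
    return top[np.argsort(-sims[top])]        # sorted by decreasing similarity
```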
Working with vectors

• Similarity to a group of words
  – "Find me words most similar to cat, dog and cow."
  – Calculate the pairwise similarities and sum them: W·cat + W·dog + W·cow
  – Now find the indices of the highest values as before.
  – Several matrix-vector products are wasteful. Better option: sum the query vectors first and do a single product, W·(cat + dog + cow), as sketched below.
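A sketch of that "better option" under the same assumptions as before; by linearity, the similarity to the summed query vector equals the sum of the individual dot-product similarities:

```python
import numpy as np

def most_similar_to_group(W, query_vecs, k=5):
    """query_vecs: vectors for e.g. 'cat', 'dog', 'cow'; returns indices of the k best matches."""
    q = np.sum(query_vecs, axis=0)            # combine the queries into one vector
    sims = W @ q                               # a single matrix-vector product
    top = np.argpartition(-sims, k)[:k]
    return top[np.argsort(-sims[top])]
```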
Applications of Word Vectors

• Word Similarity
• Machine Translation
• Part-of-Speech Tagging and Named Entity Recognition
• Relation Extraction
• Sentiment Analysis
• Co-reference Resolution
  – Chaining entity mentions across multiple documents: can we find and unify the multiple contexts in which mentions occur?
• Clustering
  – Words in the same class naturally occur in similar contexts, and this feature vector can be used directly with any conventional clustering algorithm (K-Means, agglomerative, etc.). Humans don't have to waste time hand-picking useful word features to cluster on.
• Semantic Analysis of Documents
  – Build word distributions for various topics, etc.
Vector Embedding of Words

• Four main methods described in the talk:
  – Latent Semantic Analysis/Indexing (1988)
    • Term weighting-based model.
    • Considers occurrences of terms at the document level.
  – Word2Vec (2013)
    • Prediction-based model.
    • Considers occurrences of terms at the context level.
  – GloVe (2014)
    • Count-based model.
    • Considers occurrences of terms at the context level.
  – ELMo (2018)
    • Language model-based.
    • A different embedding for each word for each task.
Word2Vec: Local contexts

• Instead of entire documents, Word2Vec uses words k positions away from each center word.
  – These words are called context words.
• Example for k = 3:
  – "It was a bright cold day in April, and the clocks were striking."
  – Center word: shown in red (also called the focus word).
  – Context words: shown in blue (also called target words).
• Word2Vec considers all words as center words, and all their context words.
Word2Vec: Data generation (window size = 2)

• Example: d1 = "king brave man", d2 = "queen beautiful women"

  word       word one-hot encoding   neighbor    neighbor one-hot encoding
  king       [1,0,0,0,0,0]           brave       [0,1,0,0,0,0]
  king       [1,0,0,0,0,0]           man         [0,0,1,0,0,0]
  brave      [0,1,0,0,0,0]           king        [1,0,0,0,0,0]
  brave      [0,1,0,0,0,0]           man         [0,0,1,0,0,0]
  man        [0,0,1,0,0,0]           king        [1,0,0,0,0,0]
  man        [0,0,1,0,0,0]           brave       [0,1,0,0,0,0]
  queen      [0,0,0,1,0,0]           beautiful   [0,0,0,0,1,0]
  queen      [0,0,0,1,0,0]           women       [0,0,0,0,0,1]
  beautiful  [0,0,0,0,1,0]           queen       [0,0,0,1,0,0]
  beautiful  [0,0,0,0,1,0]           women       [0,0,0,0,0,1]
  women      [0,0,0,0,0,1]           queen       [0,0,0,1,0,0]
  women      [0,0,0,0,0,1]           beautiful   [0,0,0,0,1,0]
Word2Vec: Data generation (window size = 2)

• Example: d1 = "king brave man", d2 = "queen beautiful women"
• The neighbors of each word can also be combined into a single multi-hot target:

  word       word one-hot encoding   neighbors          neighbor encoding
  king       [1,0,0,0,0,0]           brave, man         [0,1,1,0,0,0]
  brave      [0,1,0,0,0,0]           king, man          [1,0,1,0,0,0]
  man        [0,0,1,0,0,0]           king, brave        [1,1,0,0,0,0]
  queen      [0,0,0,1,0,0]           beautiful, women   [0,0,0,0,1,1]
  beautiful  [0,0,0,0,1,0]           queen, women       [0,0,0,1,0,1]
  women      [0,0,0,0,0,1]           queen, beautiful   [0,0,0,1,1,0]
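A sketch that generates the (word, neighbor) training pairs shown in the tables for a given window size, using the same tiny corpus:

```python
def skipgram_pairs(sentences, window=2):
    """Generate (center word, context word) pairs as in the tables above."""
    pairs = []
    for sent in sentences:
        tokens = sent.split()
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["king brave man", "queen beautiful women"]))
# [('king', 'brave'), ('king', 'man'), ('brave', 'king'), ('brave', 'man'), ...]
```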
Word2Vec: main context representation models

• Continuous Bag of Words (CBOW): the context words (w-2, w-1, w1, w2) are the input; their projections are summed to predict the center word w0 at the output.
• Skip-gram: the center word w0 is the input; its projection is used to predict each context word (w-2, w-1, w1, w2) at the output.

 Word2Vec is a predictive model.
 We will focus on the Skip-gram model.
How does Word2Vec work?

• Represent each word as a d-dimensional vector.
• Represent each context as a d-dimensional vector.
• Initialize all vectors to random weights.
• Arrange the vectors in two matrices, W and C.
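A minimal sketch of this setup; the vocabulary size and embedding dimension are placeholders:

```python
import numpy as np

V, d = 6, 50                              # vocabulary size and embedding dimension (placeholders)
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(V, d))    # word (focus) vectors, one row per vocabulary word
C = rng.normal(scale=0.1, size=(V, d))    # context vectors, one row per vocabulary word

king_id = 0
v_king = W[king_id]                       # look up the current vector for "king"
```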
Word2Vec: Neural Network representation

[Figure, repeated over several slides: a network with a one-hot input layer of size |Vw|, a hidden layer with weight matrices w1 and w2, and a sigmoid output layer of size |Vc|. Each slide activates one focus word at the input and its context words at the output: king -> brave, man; brave -> king, man; man -> king, brave; queen -> beautiful, women; beautiful -> queen, women; women -> queen, beautiful.]
Skip-gram: Training method

• The prediction problem is modeled using softmax:

  P(c | w) = exp(v_c · v_w) / Σ_{c' ∈ C} exp(v_{c'} · v_w)

  – Predict context word(s) c from focus word w.
  – Looks like logistic regression!
    • v_c and v_w are the features, and the evidence is their dot product v_c · v_w.
• The objective function (in log space):

  argmax Σ_{(w,c) ∈ D} log P(c | w) = Σ_{(w,c) ∈ D} ( v_c · v_w − log Σ_{c'} exp(v_{c'} · v_w) )
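The softmax prediction above in a few lines of numpy, using the W and C matrices from the earlier slide (a sketch, not an efficient implementation):

```python
import numpy as np

def p_context_given_word(W, C, w_id, c_id):
    """P(c | w) = exp(v_c . v_w) / sum_c' exp(v_c' . v_w)."""
    scores = C @ W[w_id]                        # dot product of v_w with every context vector
    scores -= scores.max()                      # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c_id]
```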
Skip-gram: Example

• While more text:
  – Extract a word window with focus word w and context word c.
  – Try setting the vector values such that P(c | w) is high!
  – Create a corrupt example by choosing a random word c'.
  – Try setting the vector values such that P(c' | w) is low!
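This loop is essentially skip-gram with negative sampling; a hedged sketch of one update step (the learning rate and number of corrupt words are arbitrary choices, not values from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, w_id, c_id, neg_ids, lr=0.025):
    """Push sigma(v_c . v_w) towards 1 for the true pair and towards 0 for corrupt pairs."""
    v_w = W[w_id].copy()
    grad_w = np.zeros_like(v_w)
    for cid, label in [(c_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        score = sigmoid(C[cid] @ v_w)
        grad = score - label                    # gradient of the logistic loss w.r.t. the score
        grad_w += grad * C[cid]
        C[cid] -= lr * grad * v_w               # update the context vector
    W[w_id] -= lr * grad_w                      # update the focus-word vector
```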
Relations Learned by Word2Vec

• A relation is defined by the vector displacement in the first column. For each start word in the other column, the closest displaced word is shown.

• "Efficient Estimation of Word Representations in Vector Space", Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, arXiv 2013.
What is language modelling?

• Today's goal: assign a probability to a sentence.
  – Machine Translation:
    • P(high winds tonight) > P(large winds tonight)
  – Spell Correction:
    • "The office is about fifteen minuets from my house!"
    • P(about fifteen minutes from) > P(about fifteen minuets from)
  – Speech Recognition:
    • P(I saw a van) >> P(eyes awe of an)
  – Plus summarization, question answering, etc.!
  – Reminder: the Chain Rule (written out below).
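The chain rule the slide refers to, written out for completeness:

  P(w_1, w_2, ..., w_n) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) · ... · P(w_n | w_1, ..., w_{n-1})

  For example: P(cats average 15) = P(cats) · P(average | cats) · P(15 | cats, average)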
RNN Language Model

[Figure: an RNN unrolled over the sentence "Cats average 15 hours of sleep a day. <EOS>". The first input is x<1> = 0 (with a<0> = 0) and the first output ŷ<1> is a softmax over the whole vocabulary: P(a), P(aaron), ..., P(cats), ..., P(zulu). At each later step the input is the previous word, x<t> = y<t-1>, and the output ŷ<t> is the distribution over the next word given the history: P(average | cats), P(15 | cats, average), ..., P(<EOS> | ...). The same weights W are shared across all time steps.]

• Cats average 15 hours of sleep a day. <EOS>
  – P(sentence) = P(cats) · P(average | cats) · P(15 | cats, average) · ... · P(<EOS> | ...)
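Since the course uses TensorFlow and Keras (LO 1), here is a minimal sketch of such an RNN language model; the vocabulary size and layer sizes are placeholders, not values from the slide:

```python
import tensorflow as tf

vocab_size, d = 10_000, 128                  # placeholders

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, d),
    tf.keras.layers.SimpleRNN(d, return_sequences=True),       # weights W shared across time steps
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # y_hat<t>: distribution over next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# Training pairs: the input at step t is the previous word and the target is the current word,
# so the model learns P(w_t | w_1, ..., w_{t-1}) and P(sentence) is the product over steps.
```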
Embeddings from Language Models

• The ELMo architecture trains a language model using a 2-layer bidirectional LSTM (biLM).
• What input?
  – Traditional neural language models use fixed-length word embeddings:
    • One-hot encoding.
    • Word2Vec.
    • GloVe.
    • Etc.
  – ELMo uses a more complex representation.
ELMo: What input?

• Transformations are applied to each token before it is provided as input to the first LSTM layer.
• Pros of character embeddings:
  – They allow the model to pick up on morphological features that word-level embeddings could miss.
  – They ensure a valid representation even for out-of-vocabulary words.
  – They allow the model to pick up on n-gram features that build more powerful representations.
  – The highway network layers allow for smoother information transfer through the input.
ELMo: Embeddings from Language Models

[Figure: the intermediate representations (output vectors) produced by each layer of the biLM.]
ELMo mathematical details

• The function f performs the following operation on word k of the input:

  ELMo_k^{task} = γ^{task} · Σ_{j=0}^{L} s_j^{task} · h_{k,j}^{LM}

  – where s_j^{task} represents softmax-normalized weights over the biLM layers and γ^{task} is a task-specific scaling factor.

• ELMo learns a separate representation for each task:
  – question answering, sentiment analysis, etc.
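A numpy sketch of the weighted combination above, with made-up layer activations (L = 2 biLSTM layers plus the token layer; the dimension 1024 is a placeholder):

```python
import numpy as np

def elmo_embedding(h_layers, s_weights, gamma=1.0):
    """h_layers: (L+1, d) hidden states for one token; s_weights: unnormalized per-layer weights."""
    s = np.exp(s_weights) / np.exp(s_weights).sum()    # softmax-normalized weights s_j
    return gamma * (s[:, None] * h_layers).sum(axis=0) # weighted sum over layers

h = np.random.randn(3, 1024)        # token layer + 2 biLSTM layers (placeholder values)
w = np.zeros(3)                     # learned per-task weights (uniform after softmax here)
print(elmo_embedding(h, w).shape)   # (1024,)
```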
THANKS!

Any questions?
