
Lexical and Vector Semantics

CSE538 - Spring 2025
Natural Language Processing
Topics
● Lexical Ambiguity (why word sense disambiguation)
● Word Vectors
● Topic Modeling

Objectives
● Define common semantic tasks in NLP and learn some approaches to solve them.
● Understand linguistic information necessary for semantic processing
● Motivate deep learning models necessary to capture language semantics.
● Learn word embeddings (the starting point for modern large language models)
(Jurafsky & Martin, SLP, 2019; Schwartz, 2011)
Word Sense Disambiguation

He put the port on the ship.

He walked along the port of the steamer.

He walked along the port next to the steamer.


Word Sense Disambiguation

He put the port on the ship.
He walked along the port of the steamer.
He walked along the port next to the steamer.

Senses of "port" as a noun:
● port.n.1 (a place (seaport or airport) where people and merchandise can enter or leave a country)
● port.n.2, port wine (sweet dark-red dessert wine originally from Portugal)
● port.n.3, embrasure, porthole (an opening (in a wall or ship or armored vehicle) for firing through)
● larboard, port.n.4 (the left side of a ship or aircraft to someone who is aboard and facing the bow or nose)
● interface, port.n.5 ((computer science) computer circuit consisting of the hardware and associated circuitry that links one device with another (especially a computer and a hard disk drive or other peripherals))

As a verb…
1. port (put or turn on the left side, of a ship) "port the helm"
2. port (bring to port) "the captain ported the ship at night"
3. port (land at or reach a port) "The ship finally ported"
4. port (turn or go to the port or left side, of a ship) "The big ship was slowly porting"
5. port (carry, bear, convey, or bring) "The small canoe could be ported easily"
6. port (carry or hold with both hands diagonally across the body, especially of weapons) "port a rifle"
7. port (drink port) "We were porting all in the club after dinner"
8. port (modify (software) for use on a different machine or platform)
Objective

great:
● great.a.1 (relatively large in size or number or extent; larger than others of its kind)
● great.a.2, outstanding (of major significance or importance)
● great.a.3 (remarkable or out of the ordinary in degree or magnitude or effect)
● bang-up, bully, corking, cracking, dandy, great.a.4, groovy, keen, neat, nifty, not bad, peachy, slap-up, swell, smashing, old (very good)
● capital, great.a.5, majuscule (uppercase)
● big, enceinte, expectant, gravid, great.a.6, large, heavy, with child (in an advanced stage of pregnancy)
● great.n.1 (a person who has achieved distinction and honor in some field)
Word Sense Disambiguation

A classification problem (candidate senses: port.n.1 … port.n.5):
General Form:
f(sent_tokens, (target_index, lemma, POS)) -> word_sense

He walked along the port next to the steamer.
Word Sense Disambiguation

A classification problem:
General Form:
f(sent_tokens, (target_index, lemma, POS)) -> word_sense

Logistic Regression (or any discriminative classifier):

P_{lemma,POS}(sense = s | features)

He walked along the port next to the steamer.

(Jurafsky, SLP 3)
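
To make the general form concrete, here is a minimal sketch (not the course's reference implementation) using scikit-learn: bag-of-words context features feed a per-lemma logistic regression. The training sentences and sense labels are invented for illustration.

```python
# Minimal sketch: WSD as classification with scikit-learn.
# The training sentences and sense labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (sentence, sense) pairs for the lemma "port" (noun) -- toy data
train = [
    ("He put the port on the ship after dinner", "port.n.2"),
    ("The sweet port paired well with dessert", "port.n.2"),
    ("The ship entered the port at dawn", "port.n.1"),
    ("Cranes lined the busy port", "port.n.1"),
    ("He walked along the port side of the steamer", "port.n.4"),
]
texts, senses = zip(*train)

# Bag-of-words ("multi-hot") features over the sentence context,
# then a discriminative classifier: P_{lemma,POS}(sense = s | features)
wsd = make_pipeline(CountVectorizer(binary=True), LogisticRegression(max_iter=1000))
wsd.fit(list(texts), list(senses))

print(wsd.predict(["He walked along the port next to the steamer"]))
```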
Distributional Hypothesis:

Wittgenstein, 1945: “The meaning of a word is its use in the language”


Distributional hypothesis -- A word’s meaning is defined by all the different
contexts it appears in (i.e. how it is “distributed” in natural language).

Firth, 1957: “You shall know a word by the company it keeps”

The nail hit the beam behind the wall.


Distributional Hypothesis

Similarity - Has the same or similar meaning.
  synonyms (same as), hypernyms (is-a), hyponyms (has-a)
  beam is-a piece of wood
  beam is similar to piece of wood

Relatedness - Any relationship:
  includes similarity but also antonyms, meronyms (part-of), etc.
  beam is-part-of a house
  beam is related to a house
  beam is similar to a house

The nail hit the beam behind the wall.
Approaches to WSD
I.e. how to operationalize the distributional hypothesis.
1. Bag of words for context
   E.g. multi-hot for any word in a defined “context”.
2. Surrounding window with positions
   E.g. one-hot per position relative to the target word.
3. Lesk algorithm
   E.g. compare context to sense definitions.
4. Selectors -- other target words that appear with the same context
   E.g. counts for any selector.
5. Contextual Embeddings
   E.g. real-valued vectors that “encode” the context (TBD).
Lesk Algorithm for WSD
I.e. compare the context to each sense's definition (gloss) and choose the sense with the most overlap.

● bank.n.1 (sloping land (especially the slope beside a body of water)) "they pulled the canoe up on the bank"; "he sat on the bank of the river and watched the currents"
● bank.n.2 (a financial institution that accepts deposits and channels the money into lending activities) "he cashed a check at the bank"; "that bank holds the mortgage on my home"
● ...
● bank.n.4 (an arrangement of similar objects in a row or in tiers) "he operated a bank of switches"
● ...
● bank.n.8 (a building in which the business of banking transacted) "the bank is on the corner of Nassau and Witherspoon"
● bank.n.9 (a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)) "the plane went into a steep bank"

The bank can guarantee deposits will cover future tuition costs, ...
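
As a rough illustration of the Lesk idea, a minimal sketch using NLTK's WordNet interface: score each sense by the overlap between the sentence context and the sense's gloss plus example sentences. This is the simplified Lesk variant, assuming nltk is installed and its WordNet data downloaded; it is not necessarily the exact variant shown in lecture.

```python
# Simplified Lesk sketch: pick the sense whose gloss/examples overlap most
# with the sentence context. Assumes `nltk` is installed and
# `nltk.download('wordnet')` has been run.
from nltk.corpus import wordnet as wn

STOP = {"the", "a", "an", "of", "in", "on", "at", "can", "will", "and", "to"}

def simplified_lesk(word, sentence, pos=wn.NOUN):
    context = {w.lower().strip(".,") for w in sentence.split()} - STOP
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word, pos=pos):
        # signature = gloss words + example-sentence words for this sense
        signature = set(sense.definition().lower().split())
        for ex in sense.examples():
            signature |= set(ex.lower().split())
        overlap = len(context & (signature - STOP))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sent = "The bank can guarantee deposits will cover future tuition costs"
print(simplified_lesk("bank", sent))   # likely picks the financial-institution sense
```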
Lesk Algorithm for WSD

● striker.n.1 (a forward on a soccer team)
● striker.n.2 (someone receiving intensive training for a naval technical rating)
● striker.n.3 (an employee on strike against an employer)
● striker.n.4 (someone who hits) "a hard hitter"; "a fine striker of the ball"; "blacksmiths are good hitters"
● striker.n.5 (the part of a mechanical device that strikes something)

He addressed the strikers at the rally.
Selectors
… a word which can take the place of another given word within the same local context (Lin, 1997)

Original version: Local context defined by dependency parse

He addressed the strikers at the rally.   ("strikers" = object of "addressed")
Selectors
… a word which can take the place of another given word within the same local context (Lin, 1997)

Original version: Local context defined by dependency parse (Lin, 1997)

Web version: Local context defined by lexical patterns matched on the Web (Schwartz, 2008).

“He addressed the * at the rally.”
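
To make the web version concrete, a small sketch: whatever fills the * slot of the lexical pattern is taken as a selector, and selector counts become features. The "matched" sentences below are invented stand-ins for web hits.

```python
# Sketch: extract selectors as the words that fill the * slot of a lexical
# pattern. The "matched" sentences are invented stand-ins for web hits.
import re
from collections import Counter

pattern = re.compile(r"He addressed the (\w+) at the rally\.")

web_matches = [
    "He addressed the crowd at the rally.",
    "He addressed the workers at the rally.",
    "He addressed the strikers at the rally.",
    "He addressed the protesters at the rally.",
    "He addressed the crowd at the rally.",
]

selectors = Counter(m.group(1) for s in web_matches if (m := pattern.search(s)))
print(selectors)   # e.g. Counter({'crowd': 2, 'workers': 1, ...})
```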


Selectors

Each candidate selector becomes a feature; the slot-fillers observed for the target's context are encoded as a count / multi-hot vector, e.g. [0, 1, 0, 0, 0, 1, 0, 0, 0, ...].

Selectors leverage hypernymy: concept1 <is-a> concept2

Why Are Selectors Effective?
Sets of selectors tend to vary extensively by word sense.
Supervised Selectors
Approaches to WSD
I.e. how to operationalize the distributional hypothesis.
1. Bag of words for context
   E.g. multi-hot for any word in a defined “context”.
2. Surrounding window with positions
   E.g. one-hot per position relative to the target word.
3. Lesk algorithm
   E.g. compare context to sense definitions.
4. Selectors -- other target words that appear with the same context
   E.g. counts for any selector.
5. Contextual Embeddings - introduced with Transformer LMs
   E.g. real-valued vectors that “encode” the context (TBD).
Vector Semantics

1. Word2vec

2. Topic Modeling - Latent Dirichlet Allocation (LDA)


Timeline: Language Modeling and Vector Semantics (~logarithmic scale)
Phases: Language Models → Vector Semantics → Neural-net-based LMs + vector embeddings

● 1913 -- Markov: probability that the next letter would be a vowel or consonant
● 1948 -- Shannon: A Mathematical Theory of Communication (first digital language model)
● Osgood: The Measurement of Meaning; Switzer: vector space models
● 1980 -- Jelinek et al. (IBM): language models for speech recognition
● Brown et al.: class-based n-gram models of natural language; Deerwester: Indexing by Latent Semantic Analysis (LSA)
● 2003 -- Blei et al.: LDA topic modeling; Bengio: (neural) language models
● 2010s -- Mikolov: word2vec; Collobert and Weston: A unified architecture for natural language processing: deep neural networks...
● 2018 -- ELMo; then GPT, BERT, RoBERTa
● ... GPT-4o, DeepSeek-R1
Objective
To embed: convert a token (or sequence) to a vector that represents meaning, or is useful for performing a downstream NLP application.
Objective

embed: port → [0, …, 0, 1, 0, …]   (one-hot)
Objective

Prefer dense vectors

one-hot is a sparse vector:
embed: port → [0, …, 0, 1, 0, …]

● Fewer parameters (weights) for the machine learning model.
● May generalize better implicitly.
● May capture synonyms.

For deep learning, in practice, dense vectors work better. Why? Roughly, fewer parameters become increasingly important when you are learning multiple layers of weights rather than just a single layer.

(Jurafsky, 2012)
Objective
To embed: convert a token (or sequence) to a vector that represents meaning.

Distributional hypothesis -- A word’s meaning is defined by all the different contexts it appears in (i.e. how it is “distributed” in natural language).

Wittgenstein, 1945: “The meaning of a word is its use in the language”
Firth, 1957: “You shall know a word by the company it keeps”

The nail hit the beam behind the wall.
Word Vectors

"one-hot encoding" (sparse):
embed: port → [0, …, 0, 1, 0, …]
Word Vectors

"vector embedding" (dense):
embed: port → [0.53, 1.5, 3.21, -2.3, .76]
Objective

embed: port → [0.53, 1.5, 3.21, -2.3, .76]

(shown alongside the noun senses listed earlier: port.n.1 … port.n.5)
Objective

embed: great → [-0.2, 0.3, -1.1, -2.1, .26] ?

(shown alongside the senses listed earlier: great.a.1 … great.a.6, great.n.1)
Word2Vec
Principle: Predict the missing word.

Similar to classification where y = context and x = word:

p(context | word)

To learn, maximize p(context | word). In practice, minimize
J = 1 - p(context | word)
Word2Vec: Context -- p(context | word)

2 Versions of Context:
1. Continuous bag of words (CBOW): predict the word from its context
2. Skip-Grams (SG): predict context words from the target

1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the weights as the embeddings.

(Jurafsky, 2017)
Skip-Grams (SG): predict context words from target -- p(context | word)

Steps:
1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the weights as the embeddings.

The nail hit the beam behind the wall.
         c1  c2  [t]   c3     c4

Positive examples (target with each word in its context window):
x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
...

Negative examples (sampled):
x = (happy, beam), y = 0
x = (think, beam), y = 0
...

k negative samples (y = 0) for every positive. How? Randomly draw from the unigram distribution, adjusted:

P_α(w) = count(w)^α / Σ_w' count(w')^α,   where α = 0.75
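
A small sketch of steps 1 and 2: build positive (context, target) pairs from a window around each target and draw k negatives from the α-adjusted unigram distribution. The corpus, window size, and k are toy choices for illustration.

```python
# Sketch of steps 1-2: positive (context, target) pairs from a window,
# plus k negatives drawn from the alpha-adjusted unigram distribution.
import numpy as np
from collections import Counter

corpus = "the nail hit the beam behind the wall".split()
window, k, alpha = 2, 2, 0.75

# unigram counts -> P_alpha(w) proportional to count(w)**alpha
counts = Counter(corpus)
vocab = sorted(counts)
probs = np.array([counts[w] ** alpha for w in vocab])
probs /= probs.sum()

rng = np.random.default_rng(0)
pairs = []   # (context_word, target_word, label)
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j == i:
            continue
        pairs.append((corpus[j], target, 1))                 # positive
        for neg in rng.choice(vocab, size=k, p=probs):       # k negatives
            pairs.append((neg, target, 0))

print(pairs[:6])
```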
Skip-Grams (SG): predict context words from target -- p(context | word)

Single context word c with target t:
P(y=1 | c, t) = σ(c · t)

All context words c_1 … c_L in the window:
P(y=1 | c_{1:L}, t) = Π_i σ(c_i · t)

Intuition: t · c is a measure of similarity. But it is not a probability! To make it one, apply the logistic activation:
σ(z) = 1 / (1 + e^(-z))

The nail hit the beam behind the wall.
         c1  c2  [t]   c3     c4
Skip-Grams (SG): predict context words from target -- p(context | word)

3a. Assume dim × |vocab| weights for each of c and t, initialized to random values (e.g. dim = 50 or dim = 300).
3b. Optimize the loss (for a target t with positive context c_pos and k sampled negatives c_neg_1 … c_neg_k):

L = -[ log σ(c_pos · t) + Σ_{i=1}^{k} log σ(-c_neg_i · t) ]

Maximizes similarity of (c, t) in the positive data (y = 1).
Minimizes similarity of (c, t) in the negative data (y = 0).
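
A compact numpy sketch of steps 3a/3b: two randomly initialized embedding matrices (targets T and contexts C), the σ(c · t) score, and a stochastic-gradient step on the negative-sampling loss above. Dimensions, learning rate, vocabulary, and training pairs are illustrative only.

```python
# Sketch of steps 3-4: learn target (T) and context (C) embeddings with the
# negative-sampling loss  -[log sigma(c_pos . t) + sum_i log sigma(-c_neg_i . t)].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
vocab = ["the", "nail", "hit", "beam", "behind", "wall", "happy", "think"]
idx = {w: i for i, w in enumerate(vocab)}
dim, lr = 50, 0.05                                   # e.g. dim = 50 or 300
T = rng.normal(scale=0.1, size=(len(vocab), dim))    # target embeddings
C = rng.normal(scale=0.1, size=(len(vocab), dim))    # context embeddings

def sgd_step(target, pos_context, neg_contexts):
    t = idx[target]
    cs = [idx[pos_context]] + [idx[w] for w in neg_contexts]
    ys = np.array([1.0] + [0.0] * len(neg_contexts))
    scores = sigmoid(C[cs] @ T[t])        # sigma(c . t) for each context
    err = scores - ys                     # gradient of the loss wrt c . t
    grad_T = err @ C[cs]
    C[cs] -= lr * err[:, None] * T[t]     # update context vectors
    T[t] -= lr * grad_T                   # update target vector
    return -np.log(scores[0]) - np.log(1 - scores[1:]).sum()   # loss value

for _ in range(200):
    loss = sgd_step("beam", "hit", ["happy", "think"])
print(round(loss, 4))                     # the loss shrinks over iterations
```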
W2V uses the same multi-class loss function as LogReg!

Logistic Regression Likelihood:  Π_i ŷ_i^(y_i) (1 - ŷ_i)^(1 - y_i)

Log Likelihood:  Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]

Log Loss:  -(1/N) Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]

Cross-Entropy Cost (a “multiclass” log loss):  -(1/N) Σ_i Σ_c y_{i,c} log ŷ_{i,c}

In vector algebra form: - mean( sum( y*log(y_pred) ) )
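
For concreteness, the "vector algebra form" above computed on toy numbers (y one-hot, y_pred predicted probabilities):

```python
# The "multiclass" log loss (cross-entropy) in the vector-algebra form above:
#   cost = -mean( sum( y * log(y_pred) ) )
import numpy as np

y = np.array([[0, 1, 0],              # one-hot true labels (toy data)
              [1, 0, 0]])
y_pred = np.array([[0.2, 0.7, 0.1],   # predicted probabilities
                   [0.6, 0.3, 0.1]])

cost = -np.mean(np.sum(y * np.log(y_pred), axis=1))
print(cost)   # ~0.434
```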


Word2Vec captures analogies (kind of)

(Jurafsky, 2017)
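
A hedged sketch of probing analogies with gensim (the package mentioned later in these slides); it assumes internet access for gensim's downloader and uses the pretrained 'glove-wiki-gigaword-100' vectors, which are not necessarily the embeddings behind the lecture figure.

```python
# Sketch: analogy queries with gensim word vectors.
# Assumes internet access for gensim's downloader; the pretrained
# 'glove-wiki-gigaword-100' vectors are an assumption, not the lecture's model.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

# vector('king') - vector('man') + vector('woman') ~= ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```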
Word2Vec: Quantitative Evaluations
1. Compare to manually annotated pairs of words: WordSim-353
(Finkelstein et al., 2002)

2. Compare to words in context (Huang et al., 2012)

3. Answer TOEFL synonym questions.
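
For evaluation 1, a minimal sketch of the usual protocol: correlate human similarity ratings with model cosine similarities using Spearman's ρ. The word pairs and ratings below are toy stand-ins rather than the actual WordSim-353 data, and the pretrained vectors are an assumption.

```python
# Sketch of evaluation 1: Spearman correlation between human similarity
# ratings and model cosine similarities. Pairs/ratings are toy values,
# not the real WordSim-353 data.
import gensim.downloader as api
from scipy.stats import spearmanr

wv = api.load("glove-wiki-gigaword-100")   # same assumed vectors as above

toy_pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46),
             ("king", "cabbage", 0.23), ("computer", "keyboard", 7.62)]

human = [r for _, _, r in toy_pairs]
model = [wv.similarity(w1, w2) for w1, w2, _ in toy_pairs]

rho, _ = spearmanr(human, model)
print(round(rho, 3))
```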


What have we learned since word2vec? (a lot, but here are 2 important points)

1. Improved loss function: GloVe embeddings (Pennington et al., 2014)

2. Word2vec itself performs very similarly to PCA on a co-occurrence matrix
("LSA", Deerwester et al., 1988 -- a much, much older technique!).
Topic Modeling

(Doig, 2014)
Topic Modeling
Topic: A group of highly related words and phrases. (aka "semantic field")

Example: Topic 1, Topic 2, …, Topic 50 from WTC responder interviews (Son et al., 2021)
Topic Modeling
Topic: A group of highly related words and phrases. (aka "semantic field")

doc 1, doc 2, doc 3, …, doc 38,692  →  extract words or phrases  →  Topic Modeling  →  Topic 1, Topic 2, …, Topic 50
Select Example Topics
Generating Topics from Documents

● Latent Dirichlet Allocation -- a Bayesian probabilistic model whereby words which appear in similar contexts (i.e. in essays that have similar sets of words) will be clustered into a prespecified number of topics.

● Rule of thumb:
● Each document receives a score per topic -- a probability: p(topic|doc).

               Doc 1    Doc 2    Doc 3
  topic 1:      .05      .03      .04
  topic 2:      .02      .01      .03
  topic 3:      .01      .03      .03
  …              …        …        …
  topic 100:    .07      .05      .06
Latent Dirichlet Allocation
(Blei et al., 2003)

● LDA specifies a Bayesian probabilistic model whereby
  ○ documents are viewed as a distribution of topics,
  ○ topics are a distribution of words.

Observed:
  W -- observed word in document m
Inferred:
  θ -- topic distribution for document m
  Z -- topic for word n in document m
  𝛗 -- word distribution for topic k
Priors:
  α -- hyperparameter for Dirichlet prior on the topics per document
  β -- hyperparameter for Dirichlet prior on the words per topic
  K -- number of topics
Latent Dirichlet Allocation
(Blei et al., 2003)

● LDA specifies a Bayesian probabilistic model whereby documents are viewed as a distribution of topics, and topics are a distribution of words.

● How to estimate (i.e. fit) the model parameters given data and priors? Common choices:
  ○ Gibbs sampling (best)
  ○ variational Bayesian inference (fastest)

● Key output: the "posterior" 𝛗 = p(word | topic), the probability of a word given a topic.
  From this and p(topic), we can get: p(topic | word)
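
A tiny worked sketch of the last step, p(topic | word) from 𝛗 = p(word | topic) and p(topic) via Bayes' rule; the 2-topic, 3-word numbers are invented.

```python
# Sketch: p(topic | word) from phi = p(word | topic) and p(topic), via Bayes:
#   p(topic | word) = p(word | topic) * p(topic) / p(word)
# The numbers below are invented for a 2-topic, 3-word toy vocabulary.
import numpy as np

vocab = ["nurse", "doctor", "rally"]
phi = np.array([[0.5, 0.4, 0.1],     # topic 0: p(word | topic 0)
                [0.1, 0.1, 0.8]])    # topic 1: p(word | topic 1)
p_topic = np.array([0.6, 0.4])       # p(topic)

p_word = p_topic @ phi                                  # marginal p(word)
p_topic_given_word = (phi * p_topic[:, None]) / p_word  # Bayes' rule

print(dict(zip(vocab, p_topic_given_word.T.round(3))))
```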
Example
Most prevalent words for 4 topics are listed at
the top and words associated with them from
a Yelp review are colored accordingly below.

Ranard, B.L., Werner, R.M., Antanavicius, T., Schwartz, H.A., Smith, R.J.,
Meisel, Z.F., Asch, D.A., Ungar, L.H. & Merchant, R.M. (2016). Yelp Reviews
Of Hospital Care Can Supplement And Inform Traditional Surveys Of The
Patient Experience Of Care. Health Affairs, 35(4), 697-705.
Topic Modeling Packages
Most reliable: Mallet (Java; uses Gibbs sampling); pymallet (slower than Mallet but high-quality results)

Ease of use: Gensim (Python; uses variational inference; implements word2vec as well)
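
A minimal gensim sketch of the "ease of use" route; the toy documents, number of topics, and passes are illustrative, and real use needs many more documents plus preprocessing.

```python
# Minimal gensim LDA sketch (variational inference). Toy documents; real use
# needs many more documents, preprocessing, and a tuned number of topics.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["nurse", "hospital", "doctor", "care", "patient"],
    ["doctor", "hospital", "er", "wait", "patient"],
    ["rally", "union", "strike", "workers", "wages"],
    ["strike", "workers", "rally", "picket", "union"],
]

dictionary = corpora.Dictionary(docs)              # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]     # bag-of-words per doc

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

print(lda.print_topics())                          # phi: p(word | topic)
print(lda.get_document_topics(corpus[0]))          # p(topic | doc)
```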
Topic Modeling
Common applications:

● Open vocabulary content analysis: describing the latent semantic categories of words or phrases present across a set of documents.

● Embeddings for a predictive task: for all topics, use p(topic|document) as a score; feed to a predictive model (e.g. a classifier).
Dimensionality reduction
PCA-Based Embeddings -- try to represent the data with only p’ dimensions
(also known as "Latent Semantic Analysis")

Supplement: SVD implementation details are not within scope, but the concept of using PCA on a word co-occurrence matrix was covered.
Dimensionality reduction
PCA-Based Embeddings -- try to represent the data with only p’ dimensions (p’ < p)
(also known as "Latent Semantic Analysis")

Original matrix: target words are observations (rows w1, w2, w3, …, wn); context words are features (columns w1, w2, w3, w4, …, wp); co-occurrence counts are the cells.

Reduced matrix: same rows (w1 … wn), but only p’ columns (c1, c2, c3, c4, …, cp’).
Concept: Dimensionality Reduction in 3-D, 2-D, and 1-D

(P = 2 → P’ = 1;  P = 3 → P’ = 2)

Data (or, at least, what we want from the data) may be accurately represented with fewer dimensions.
Concept: Dimensionality Reduction
Rank: Number of linearly independent columns of A.
(i.e. columns that can’t be derived from the other columns through addition).

Q: How many columns do we really need?

A = [ 1  -2   3
      2  -3   5
      1   1   0 ]

A: 2. The 1st column is just the sum of the other two columns … we can represent A as a linear combination of 2 vectors:

    [ 1  -2
      2  -3
      1   1 ]
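
A quick numerical check of the example above (a sketch using numpy's numerical rank):

```python
# Numerical check of the rank example: column 1 equals column 2 + column 3,
# so A has rank 2.
import numpy as np

A = np.array([[1, -2, 3],
              [2, -3, 5],
              [1,  1, 0]])

print(np.linalg.matrix_rank(A))                   # 2
print(np.allclose(A[:, 0], A[:, 1] + A[:, 2]))    # True
```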
Dimensionality reduction
SVD-Based Embeddings -- try to represent the data with only p’ dimensions

Original matrix: target words are observations (rows o1, o2, o3, …, on); context words are features (columns f1, f2, f3, f4, …, fp); co-occurrence counts are the cells.

Reduced matrix: same rows, but only p’ columns (c1, c2, c3, c4, …, cp’).
Dimensionality Reduction - PCA
Linear approximation of the data in r dimensions.

Found via Singular Value Decomposition:

X[n×p] ≅ U[n×r] D[r×r] V[p×r]ᵀ

X: original matrix, U: “left singular vectors”,
D: “singular values” (diagonal), V: “right singular vectors”
Dimensionality Reduction - PCA - Example

X[n×p] ≅ U[n×r] D[r×r] V[p×r]ᵀ

Word co-occurrence counts, plotted in two dimensions: the horizontal axis is a target word’s co-occurrence count with “hit”; the vertical axis is its co-occurrence count with “nail”.

Observation “beam”:  count(beam, hit) = 100 (horizontal dimension),  count(beam, nail) = 80 (vertical dimension).
Dimensionality Reduction - PCA

Projection (the dimensionality-reduced space) in 3 dimensions:

U[n×3] D[3×3] V[p×3]ᵀ
Dimensionality Reduction - PCA

To check how well the original matrix can be reproduced:

Z[n×p] = U D Vᵀ  -- how does Z compare to the original X?
Dimensionality Reduction - PCA

The loss function that SVD solves: minimize ‖X − U D Vᵀ‖²  (the total squared difference between the reconstruction Z = U D Vᵀ and the original X).
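
A numpy sketch tying the PCA/SVD slides together on an invented co-occurrence matrix: factor X, keep r singular values, use the rows of U·D as dense embeddings, and measure how well Z = U D Vᵀ reproduces X.

```python
# Sketch: truncated SVD of a toy word-by-context co-occurrence matrix X.
# Rows of U * D serve as dense embeddings; Z = U D V^T approximates X.
import numpy as np

words = ["beam", "nail", "wall", "wine"]
contexts = ["hit", "nail", "behind", "drink"]
X = np.array([[100., 80., 40.,  0.],      # invented co-occurrence counts
              [ 90., 10., 30.,  0.],
              [ 40., 20., 60.,  1.],
              [  2.,  0.,  1., 50.]])

U, d, Vt = np.linalg.svd(X, full_matrices=False)

r = 2                                     # keep r dimensions (r < p)
U_r, d_r, Vt_r = U[:, :r], d[:r], Vt[:r, :]

embeddings = U_r * d_r                    # one r-dim dense vector per word
Z = U_r @ np.diag(d_r) @ Vt_r             # rank-r reconstruction of X

print(dict(zip(words, embeddings.round(2))))
print("reconstruction error:", round(np.linalg.norm(X - Z), 3))
```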
Dimensionality Reduction - PCA
Linear approximation of the data in r dimensions.

Found via Singular Value Decomposition:

X[n×p] ≅ U[n×r] D[r×r] V[p×r]ᵀ

U, D, and V are unique.
D: always positive (the singular values on its diagonal).