CSE538 sp25 (4) Lexical and Vector Semantics 2-25 nlp
Semantics
Objectives
● Define common semantic tasks in NLP and learn some approaches to solving them.
● Understand linguistic information necessary for semantic processing
● Motivate deep learning models necessary to capture language semantics.
● Learn word embeddings (the starting point for modern large language models)
(Jurafsky & Martin, SLP, 2019)
(Schwartz, 2011)
Word Sense Disambiguation
A classification problem:
General Form:
f (sent_tokens, (target_index, lemma, POS)) -> word_sense
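As a rough sketch of this interface (not a method from the slides: the bag-of-words context features, window size, and logistic-regression classifier are illustrative assumptions):

```python
# Hypothetical sketch of f(sent_tokens, (target_index, lemma, POS)) -> word_sense,
# treating WSD as supervised classification over bag-of-words context features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(sent_tokens, target_index, window=3):
    """Bag-of-words features from a +/- window neighborhood of the target."""
    feats = {}
    lo, hi = max(0, target_index - window), min(len(sent_tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i != target_index:
            feats["ctx=" + sent_tokens[i].lower()] = 1.0
    return feats

def train_wsd(examples, sense_labels):
    """examples: list of (sent_tokens, (target_index, lemma, pos)); sense_labels: gold senses."""
    vec = DictVectorizer()
    X = vec.fit_transform([context_features(toks, tgt[0]) for toks, tgt in examples])
    clf = LogisticRegression(max_iter=1000).fit(X, sense_labels)
    return vec, clf

def f(sent_tokens, target, vec, clf):
    """Classify the sense of the word at target = (target_index, lemma, pos)."""
    X = vec.transform([context_features(sent_tokens, target[0])])
    return clf.predict(X)[0]  # -> predicted word_sense label
```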
Similarity - words with similar meanings (near-synonyms).
Relatedness - words associated with each other through a shared semantic field or frequent co-occurrence (e.g., coffee and cup).
[Figure: local context of a target word defined by syntactic relations, e.g. being the object of a verb.]
Web version: Local context defined by lexical patterns matched on the Web
(Schwartz, 2008).
Selectors
Leverages hypernymy:
concept1 <is-a> concept2
Why Are Selectors Effective?
Sets of selectors tend to vary extensively by word sense:
Vector Semantics
1. Word2vec
Objective
[Figure: one-hot vectors for "embed" and "port": each word is a |vocab|-length vector of 0s with a single 1 at the word's index.] (Jurafsky, 2012)
Objective
To embed: convert a token (or sequence) to a vector that represents meaning.
Distributional hypothesis -- A word's meaning is defined by all the different contexts it appears in (i.e. how it is "distributed" in natural language).
"vector embedding"
[Figure: "port" embedded as a dense vector, e.g. (0.53, 1.5, 3.21, -2.3, 0.76).]
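A tiny numpy illustration of the two representations above; the toy vocabulary, dimensions, and vector values are made up for the example:

```python
# One-hot vs. dense embedding lookup (toy numbers for illustration).
import numpy as np

vocab = ["aardvark", "embed", "port", "wine", "zebra"]   # assumed toy vocabulary
word2idx = {w: i for i, w in enumerate(vocab)}

# One-hot: a |vocab|-length vector with a single 1; carries no notion of similarity.
one_hot_port = np.zeros(len(vocab))
one_hot_port[word2idx["port"]] = 1.0                     # [0, 0, 1, 0, 0]

# Dense embedding: each row of E is a low-dimensional vector meant to represent meaning.
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 5))                     # |vocab| x dim lookup table
dense_port = E[word2idx["port"]]                         # a 5-dimensional dense vector

print(one_hot_port, dense_port)
```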
Objective
[Figure: the single dense vector for "port" (0.53, 1.5, 3.21, -2.3, 0.76) shown next to its WordNet senses.]
port.n.1 (a place (seaport or airport) where people and merchandise can enter or leave a country)
port.n.2, port wine (sweet dark-red dessert wine originally from Portugal)
port.n.3, embrasure, porthole (an opening (in a wall or ship or armored vehicle) for firing through)
larboard, port.n.4 (the left side of a ship or aircraft to someone who is aboard and facing the bow or nose)
interface, port.n.5 ((computer science) computer circuit consisting of the hardware and associated circuitry that links one device with another (especially a computer and a hard disk drive or other peripherals))
Objective
[Figure: "great" embedded as a single dense vector (-0.2, 0.3, -1.1, -2.1, 0.26), with a "?" highlighting that one vector must cover all of its WordNet senses.]
great.a.1 (relatively large in size or number or extent; larger than others of its kind)
great.a.2, outstanding (of major significance or importance)
great.a.3 (remarkable or out of the ordinary in degree or magnitude or effect)
bang-up, bully, corking, cracking, dandy, great.a.4, groovy, keen, neat, nifty, not bad, peachy, slap-up, swell, smashing, old (very good)
capital, great.a.5, majuscule (uppercase)
big, enceinte, expectant, gravid, great.a.6, large, heavy, with child (in an advanced stage of pregnancy)
great.n.1 (a person who has achieved distinction and honor in some field)
Word2Vec
Principle: Predict missing word.
p(context | word)
To learn, maximize p(context | word). In practice, minimize
J = 1 - p(context | word)
Word2Vec: Context p(context | word)
2 Versions of Context:
1. Continuous bag of words (CBOW): Predict word from context
2. Skip-Grams (SG): predict context words from target
(Jurafsky, 2017)
Skip-Grams (SG): predict context words from target
p(context | word)
Steps:
1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the weights as the embeddings.

Example: The nail hit the beam behind the wall. (target t = beam; context words c1 c2 c3 c4 = hit, the, behind, the)
Positive examples (from the context window):
x = (hit, beam), y = 1
x = (the, beam), y = 1
x = (behind, beam), y = 1
…
Negative examples (randomly sampled from the lexicon):
x = (happy, beam), y = 0
x = (think, beam), y = 0
...
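A short sketch of steps 1-2 on this sentence; the window size and the uniform negative-sampling distribution are simplifying assumptions (word2vec actually samples negatives from a smoothed unigram distribution):

```python
# Build (context, target) training pairs for skip-gram with negative sampling.
import random

def skipgram_pairs(tokens, vocab, window=2, k=2, seed=0):
    """Yield ((context_word, target_word), label): 1 for observed context words,
    0 for k randomly sampled negative words per positive example."""
    rng = random.Random(seed)
    for t_idx, target in enumerate(tokens):
        lo, hi = max(0, t_idx - window), min(len(tokens), t_idx + window + 1)
        for c_idx in range(lo, hi):
            if c_idx == t_idx:
                continue
            yield (tokens[c_idx], target), 1          # positive example
            for _ in range(k):                        # negative samples (uniform here)
                yield (rng.choice(vocab), target), 0

sentence = "the nail hit the beam behind the wall".split()
vocab = ["happy", "think", "sofa", "banana"] + sentence   # assumed lexicon for sampling
for (c, t), y in skipgram_pairs(sentence, vocab):
    if t == "beam":
        print(f"x = ({c}, {t}), y = {y}")
```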
Scoring a candidate (context, target) pair:
single context: P(y=1 | c, t) = σ(t ᐧ c)
Intuition: t ᐧ c is a measure of similarity. But it is not a probability! To make it one, apply the logistic activation:
σ(z) = 1 / (1 + e^-z)
all contexts: P(y=1 | c1, …, cL, t) = ∏(i=1..L) σ(t ᐧ ci)
Learning the embeddings (step 3, the logistic regression):
3a. Assume dim * |vocab| weights for each of c and t, initialized to random values (e.g. dim = 50 or dim = 300).
3b. Optimize the loss over the positive and negative examples.
Log Likelihood (for one target t, its positive context cpos, and k negative samples cneg_1 … cneg_k):
log L = log σ(cpos ᐧ t) + Σ(i=1..k) log σ(-cneg_i ᐧ t)
Log Loss (what is minimized):
L_CE = -[ log σ(cpos ᐧ t) + Σ(i=1..k) log σ(-cneg_i ᐧ t) ]
(Jurafsky, 2017)
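A compact numpy sketch of steps 3a-3b and 4 under the log loss above; the toy corpus, dimensions, learning rate, and epoch count are arbitrary assumptions, and real word2vec adds subsampling and a smoothed negative-sampling distribution:

```python
# Minimal skip-gram with negative sampling: two weight matrices (target T and
# context C), trained by gradient descent on the log loss above.
import numpy as np

corpus = "the nail hit the beam behind the wall".split() * 50   # toy corpus
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
dim, window, k, lr, epochs = 10, 2, 2, 0.05, 20

rng = np.random.default_rng(0)
T = rng.normal(scale=0.1, size=(len(vocab), dim))   # target embeddings (step 3a)
C = rng.normal(scale=0.1, size=(len(vocab), dim))   # context embeddings (step 3a)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(epochs):                              # step 3b: optimize the log loss
    for i, target in enumerate(corpus):
        t = w2i[target]
        lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pos = w2i[corpus[j]]
            negs = rng.integers(0, len(vocab), size=k)          # uniform negatives (simplification)
            for c, y in [(pos, 1.0)] + [(n, 0.0) for n in negs]:
                p = sigmoid(T[t] @ C[c])                        # P(y=1 | c, t) = sigma(t . c)
                grad = p - y                                    # d(log loss) / d(t . c)
                g_t, g_c = grad * C[c], grad * T[t]
                T[t] -= lr * g_t
                C[c] -= lr * g_c

embeddings = T                                       # step 4: use the weights as embeddings
print(embeddings[w2i["beam"]])
```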
Word2Vec captures analogies (kind of)
(Jurafsky, 2017)
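A hedged sketch of the analogy test using plain cosine similarity; it assumes an `embeddings` dict mapping word -> vector (e.g. loaded from pretrained word2vec vectors), and the man/king/woman example is the classic illustration from the literature rather than a result computed here:

```python
# Analogy by vector arithmetic: a is to b as c is to ?  ->  argmax cos(b - a + c, w)
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def analogy(a, b, c, embeddings, topn=1):
    """Return the topn words closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    scored = [(w, cosine(query, vec)) for w, vec in embeddings.items() if w not in (a, b, c)]
    return sorted(scored, key=lambda x: -x[1])[:topn]

# e.g. analogy("man", "king", "woman", embeddings) is expected to rank "queen" highly
# (only approximately -- hence "kind of").
```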
Word2Vec: Quantitative Evaluations
1. Compare to manually annotated pairs of words: WordSim-353
(Finkelstein et al., 2002)
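A sketch of this evaluation; the CSV file layout and the `embeddings` dict are assumptions (WordSim-353 distributes word pairs with human similarity ratings), and the standard metric is Spearman rank correlation between those ratings and the embeddings' cosine similarities:

```python
# Evaluate embeddings against human-annotated word-pair similarities (e.g. WordSim-353).
import csv
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def evaluate_wordsim(pairs_path, embeddings):
    """pairs_path: CSV with rows (word1, word2, human_score). Returns Spearman rho."""
    human, model = [], []
    with open(pairs_path) as f:
        for w1, w2, score in csv.reader(f):
            if w1 in embeddings and w2 in embeddings:   # skip out-of-vocabulary pairs
                human.append(float(score))
                model.append(cosine(embeddings[w1], embeddings[w2]))
    rho, _ = spearmanr(human, model)
    return rho
```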
Topic Modeling
Topic: A group of highly related words and phrases. (aka "semantic field")
[Figure: documents (doc 1, doc 2, doc 3, …, doc 38,692) -> extract words or phrases -> topic modeling -> Topic 1, Topic 2, …, Topic 50.]
Select Example Topics
Generating Topics from Documents
● Rule of thumb:
● Each document receives a score per topic -- a probability: p(topic|doc).
[Figure: plate diagram with parameters α, β, and θ.] (Doig, 2014)
Example
Most prevalent words for 4 topics are listed at the top, and words associated with them from a Yelp review are colored accordingly below.
Ranard, B.L., Werner, R.M., Antanavicius, T., Schwartz, H.A., Smith, R.J., Meisel, Z.F., Asch, D.A., Ungar, L.H. & Merchant, R.M. (2016). Yelp Reviews of Hospital Care Can Supplement and Inform Traditional Surveys of the Patient Experience of Care. Health Affairs, 35(4), 697-705.
Latent Dirichlet Allocation (Blei et al., 2003)
● LDA specifies a Bayesian probabilistic model whereby documents are viewed as a distribution of topics, and topics are a distribution of words.
● How to estimate (i.e. fit) the model parameters given data and priors? Common choices:
○ Gibbs sampling (best)
○ variational Bayesian inference (fastest)
● Key output: the "posterior" 𝛗 = p(word | topic), the probability of a word given a topic.
Observed:
W -- observed word in document m
Inferred:
θ -- topic distribution for document m
Z -- topic for word n in document m
𝛗 -- word distribution for topic k
Priors:
α -- hyperparameter for the Dirichlet prior on the per-document topic distributions (θ)
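A minimal sketch of fitting LDA with scikit-learn (which uses variational inference; Gibbs-sampling implementations live in other toolkits); the documents, topic count, and prior values are placeholder assumptions:

```python
# Fit LDA: documents -> theta (p(topic | doc)) and per-topic word weights (p(word | topic)).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the nurse was kind and the room was clean",
        "waited two hours in the emergency room",
        "billing department charged me twice"]          # placeholder documents

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                              # doc-word count matrix

lda = LatentDirichletAllocation(n_components=3,          # number of topics (assumed)
                                doc_topic_prior=0.1,     # alpha
                                topic_word_prior=0.01,   # beta
                                random_state=0).fit(X)

theta = lda.transform(X)                                 # p(topic | doc) for each document
words = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):               # unnormalized p(word | topic)
    top = [words[i] for i in comp.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top}")
```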
[Figure: a word-by-word co-occurrence matrix with rows and columns w1, w2, w3, …, wn; co-occurrence counts are cells.]
Concept: Dimensionality Reduction in 3-D, 2-D, and 1-D
[Figure: data in p = 2 dimensions represented with p' = 1 dimension, and data in p = 3 dimensions represented with p' = 2.]
Data (or, at least, what we want from the data) may be accurately represented with fewer dimensions.
Concept: Dimensionality Reduction
Rank: Number of linearly independent columns of A
(i.e. columns that can't be derived from the other columns through scaling and addition).
[Example matrix: its columns can be represented as a linear combination of just 2 vectors, so it has rank 2.]
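A quick numpy check of the idea; the 3-column matrix is made up for the example (the slide's original matrix did not survive extraction), with its third column equal to the first minus the second:

```python
# Rank = number of linearly independent columns.
import numpy as np

A = np.array([[1.0,  1.0, 0.0],
              [2.0, -3.0, 5.0],
              [1.0,  1.0, 0.0]])    # col3 = col1 - col2 -> only 2 independent columns

print(np.linalg.matrix_rank(A))     # 2
```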
Dimensionality Reduction
SVD-Based Embeddings -- try to represent with only p' dimensions
[Figure: an n x n word co-occurrence matrix X (rows/columns o1, o2, o3, …, on; co-occurrence counts are cells) approximated (≈) by a lower-rank factorization.]
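A sketch of SVD-based embeddings on a co-occurrence matrix; the matrix here is random stand-in data, and real pipelines often reweight counts (e.g. with PPMI) before the SVD, which goes beyond this slide:

```python
# Reduce an n x n co-occurrence matrix X to p'-dimensional embeddings via truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
n, p_prime = 100, 10
X = rng.poisson(2.0, size=(n, n)).astype(float)     # stand-in co-occurrence counts

U, S, Vt = np.linalg.svd(X, full_matrices=False)    # X = U S V^T
embeddings = U[:, :p_prime] * S[:p_prime]           # keep the top p' singular dimensions

# Low-rank reconstruction of X using only p' dimensions:
X_approx = embeddings @ Vt[:p_prime, :]
print(np.linalg.norm(X - X_approx) / np.linalg.norm(X))   # relative approximation error
```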
Dimensionality Reduction - PCA - Example
Word co-occurrence counts:
[Figure: target words plotted by their co-occurrence counts; horizontal dimension = count with "hit", vertical dimension = count with "nail".]
Observation: "beam."
count(beam, hit) = 100 -- horizontal dimension
count(beam, nail) = 80 -- vertical dimension
D: always positive (co-occurrence counts can't be negative)
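A minimal PCA sketch over toy co-occurrence counts like those above; all counts except the "beam" row are invented for the example:

```python
# PCA on word co-occurrence counts: project 2-D counts (with "hit", with "nail")
# onto their principal component.
import numpy as np
from sklearn.decomposition import PCA

words = ["beam", "board", "pie", "cake"]                 # "beam" from the slide; rest invented
counts = np.array([[100.0, 80.0],                        # count(beam, hit), count(beam, nail)
                   [90.0, 70.0],                         # invented
                   [5.0, 2.0],                           # invented
                   [8.0, 1.0]])                          # invented

pca = PCA(n_components=1)                                # reduce 2 dimensions to 1
projected = pca.fit_transform(counts)                    # each word as a single coordinate
for w, x in zip(words, projected[:, 0]):
    print(w, round(float(x), 2))
```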