
CENG7811: Applied Natural Language Processing

Week 2 & 3: Review of NLP Approaches & Text Semantics

Asst. Prof. Cagri Toraman


Computer Engineering Department
[email protected]

17.10.2024
* The Course Slides are subject to CC BY-NC. Either the original work or a derivative work can be shared with appropriate attribution, but only for noncommercial purposes.
Course Project

- Propose or Select a Project Topic (real-world use case)
  Due date: 17.10.24, 23:59
- Find and Propose Baseline SOTA Papers from the Literature
  Due date: 17.10.24, 23:59

- Review, Present, and Implement Baseline


- Write a Final Report in ACL Proceedings Style
- Present a demo
Project Topics:

- Read project_topics.pdf at ODTUClass


Recent Technologies

Machine Learning
Deep Learning

Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Self-supervised Learning
Supervised Learning

Discover patterns in the data that relate data attributes with a target (class) attribute.

Patterns are utilized to predict the values of the target attribute in future data instances.
Supervised Learning
Supervised Learning

• Classification uses an algorithm to accurately assign test data into specific categories.

• Regression uses an algorithm to understand the relationship between dependent and independent variables.
Supervised Learning

• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  • a learned classifier γ: d → c
Supervised Learning

• Input:
  • document d's full text?
    (paragraphs, sentences, words, subwords, characters)
  • document d's other features?
    (text length, author, timestamp, meta-attributes)
• How to represent d in vector space?
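As a concrete illustration, here is a minimal sketch of the whole pipeline: represent each document d as a vector and learn a classifier γ: d → c. It assumes scikit-learn; the toy documents and labels are made up, not course data.

```python
# Minimal sketch (assumption: scikit-learn; toy documents and labels, not course data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = ["great movie, loved it", "terrible plot and acting",
              "wonderful performance", "boring and predictable"]
train_labels = ["pos", "neg", "pos", "neg"]      # hand-labeled classes c1..cm

classifier = make_pipeline(
    TfidfVectorizer(),                 # represent d in vector space
    LogisticRegression(max_iter=1000)  # learn gamma: d -> c
)
classifier.fit(train_docs, train_labels)

print(classifier.predict(["loved the acting"]))  # e.g. ['pos']
```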


Supervised Learning
Supervised Learning

• Sahin, U., Kucukkaya, I. E., Ozcelik, O., & Toraman, C. (2023, September). ARC-NLP at Multimodal Hate Speech Event Detection 2023: Multimodal Methods
Boosted by Ensemble Learning, Syntactical and Entity Features. In Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of
Socio-political Events from Text (pp. 71-78).
Supervised Learning

• Sahin, U., Kucukkaya, I. E., & Toraman, C. (2023). ARC-NLP at PAN 2023: Hierarchical Long Text Classification for Trigger Detection. CLEF Working Notes, 2023.
Supervised Learning

• Any kind of classifier


• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors
•…
Supervised Learning

• LazyPredict Library
https://github.com/shankarpandala/lazypredict
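A usage sketch following the library's README (verify the API against the installed version). LazyPredict expects numeric feature matrices, so text would first need to be vectorized; the dataset here is a placeholder.

```python
# Sketch following the LazyPredict README (placeholder dataset, not course data).
from lazypredict.Supervised import LazyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)      # any numeric feature matrix works
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)   # leaderboard of many classifiers (accuracy, F1, ...)
```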
Unsupervised Learning

Data have no target (class) attribute.

We want to explore the data to find some intrinsic structures.
Unsupervised Learning
Unsupervised Learning

Clustering is a technique for finding similarity groups in data, called clusters.

Association uses different rules to find relationships between variables in a given data set.

Dimensionality reduction is a learning technique used when the number of features (or dimensions) in a given data set is too high.
Unsupervised Learning
Unsupervised Learning

• Some clustering algorithms:

- Partitional clustering (K-means)


- Hierarchical clustering

• A distance (similarity, or dissimilarity) function

• Clustering quality
- Inter-clusters distance ⇒ maximized
- Intra-clusters distance ⇒ minimized

• How to represent d in vector space?

• How to find similarity?
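A minimal K-means sketch tying these pieces together: documents are represented as tf-idf vectors, clustered, and clustering quality is summarized with the silhouette score (which rewards small intra-cluster and large inter-cluster distances). It assumes scikit-learn; the toy documents are made up.

```python
# Minimal sketch (assumption: scikit-learn; toy documents, not course data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = ["cats and dogs", "dogs are pets", "pets need food",
        "stocks fell today", "markets and stocks", "investors sold stocks"]

X = TfidfVectorizer().fit_transform(docs)            # represent d in vector space
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)                                # cluster assignment per document
print(silhouette_score(X, kmeans.labels_))           # higher = tighter, better-separated clusters
```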
Unsupervised Learning
Unsupervised Learning

• Dimensionality Reduction (linear PCA, non-linear t-SNE and UMAP)
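A small sketch of the linear and non-linear options, assuming scikit-learn (UMAP lives in the separate umap-learn package and is omitted here); the data is random toy data.

```python
# Minimal sketch (assumption: scikit-learn; random toy data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(100, 50)                          # 100 points, 50 features

X_pca = PCA(n_components=2).fit_transform(X)         # linear projection
X_tsne = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(X)       # non-linear embedding

print(X_pca.shape, X_tsne.shape)                     # (100, 2) (100, 2)
```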


Semi-supervised Learning
Self-supervised Learning

Need labeled data!
Self-supervised Learning
Text Semantics

Ambiguity

Every fifteen minutes a woman in this country gives birth. Our job is to find this woman, and stop her!
Text Semantics

Expressivity

She gave the book to Tom


She gave Tom the book
Text Semantics

Sparsity

Power Law (Zipf’s Law)


Text Semantics

How to represent the knowledge a human has/needs?

What is the “meaning” of a word or sentence?


Text Semantics

Concepts or word senses


◦ Have a complex many-to-many association with words (homonymy, multiple senses)

Have relations with each other


◦ Synonymy
◦ Antonymy
◦ Similarity
◦ Relatedness
◦ Connotation
Text Semantics

Lemmas and senses


lemma: mouse (N)

senses:
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...

(Modified from the online thesaurus WordNet)

A sense or “concept” is the meaning component of a word


Lemmas can be polysemous (have multiple senses)
Text Semantics

Synonyms have the same meaning in some or all contexts.


◦ filbert / hazelnut
◦ couch / sofa
◦ big / large
◦ automobile / car
◦ vomit / throw up
◦ water / H2O
Text Semantics

Similarity: words with similar meanings. Not synonyms, but sharing some element of meaning:

car, bicycle
cow, horse
Text Semantics

Relatedness
Words can be related in any way, perhaps via a semantic
frame or field

◦ coffee, tea: similar


◦ coffee, cup: related, not similar
Text Semantics

Antonymy: Senses that are opposites with respect to only one feature of meaning
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
Text Semantics

Positive connotations (happy, great, love)


Negative connotations (sad, terrible, hate)

Can be subtle:
Positive connotation: copy, replica, reproduction
Negative connotation: fake, knockoff, forgery
Text Semantics

Can we build a theory of how to represent word meaning that accounts for those semantic concepts?

Vector semantics
Basic model for language processing
Handles many of our goals
Text Semantics

Idea 1: Defining meaning by linguistic distribution


Idea 2: Meaning as a point in multidimensional space
Text Semantics

The meaning of a word is its use in the language.

Words are defined by their environments (the words around them).

I'm going to the bank to deposit my paycheck.

The bank of the river was lined with trees.
Text Semantics

3 affective dimensions for a word:
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted

Dimension     High-scoring word   Score    Low-scoring word   Score
Valence       love                1.000    toxic              0.008
              happy               1.000    nightmare          0.005
Arousal       elated              0.960    mellow             0.069
              frenzy              0.965    napping            0.046
Dominance     powerful            0.991    weak               0.045
              leadership          0.983    empty              0.081

A word is a vector in 3-space.


Text Semantics

Defining meaning as a point in space based on distribution


Each word = a vector (not just "good" or "w45")
Similar words are "nearby in semantic space"
We build this space automatically by seeing which words are nearby in text
Text Semantics

We define meaning of a word as a vector.

Called an "embedding" because it's embedded into a space.

The standard way to represent meaning in NLP:


Every modern NLP algorithm uses embeddings as the representation of
word meaning.
Text Semantics

Why do we need embeddings compared to word features?


With words, a feature is a word identity

Feature number 729: "terrible"


Requires exact same word to be in training and test

With embeddings:

Feature is a word vector


The previous word was vector [35,22,17]
Now in the test set we might see a similar vector [34,21,14]
We can generalize to similar but unseen words!
Text Semantics

Bag-of-Words (e.g. tf-idf (embedding) vector)


A common baseline model from Information Retrieval
Sparse vectors
Words are represented by (a simple function of) the counts of nearby words

Word embeddings (e.g. Word2vec embedding vector)


Dense vectors
Representation is created by training a classifier to predict whether a word is
likely to appear nearby
Later we'll discuss extensions called contextual embeddings
Text Semantics

Term-document matrix
Each document is represented by a vector of words
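A sketch of building such a matrix with a count vectorizer, assuming scikit-learn; the four toy documents below are stand-ins, not the Shakespeare plays used on the slides.

```python
# Sketch of a term-document matrix (toy documents, not the Shakespeare counts from the slides).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fool wit fool love",          # comedy-like toy document
        "wit love fool",
        "battle soldier battle king",  # history-like toy document
        "battle king soldier"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # rows = documents, columns = vocabulary words

print(vectorizer.get_feature_names_out())
print(X.toarray())                      # each document is a vector of word counts
```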
Text Semantics
Text Semantics

Vectors are similar for the two comedies.

But comedies are different than the other two:


Comedies have more fools and wit and fewer battles.
Text Semantics

Idea for word meaning: Words can be vectors too

battle is "the kind of word that occurs in Julius Caesar and Henry V"

fool is "the kind of word that occurs in comedies, especially Twelfth Night"
Text Semantics

Two words are similar in meaning if their context vectors are similar.
Text Semantics

Similarity calculation between vectors:

The dot product tends to be high when the two vectors have large values in the same dimensions.

But the dot product favors long vectors: it is higher if a vector is longer (has higher values in many dimensions).
Frequent words (of, the, you) have long vectors (since they occur many times with other words).
So the dot product overly favors frequent words.
Text Semantics

Similarity calculation between vectors (cosine similarity):

-1: vectors point in opposite directions
+1: vectors point in the same direction
 0: vectors are orthogonal

Co-occurrence counts (context words: pie, data, computer):

              pie    data    computer
cherry        442       8           2
digital         5    1683        1670
information     5    3982        3325
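Using the count vectors from this table, a short numpy sketch contrasts the raw dot product with cosine similarity (the length-normalized dot product):

```python
# Count vectors from the table above (contexts: pie, data, computer).
import numpy as np

cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

def cosine(a, b):
    # Dividing by vector lengths removes the dot product's bias toward frequent (long) vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.dot(cherry, information), np.dot(digital, information))  # raw dot products
print(cosine(cherry, information))    # low: cherry and information are dissimilar
print(cosine(digital, information))   # near 1: digital and information are similar
```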
Text Semantics

Frequency is useful: If sugar appears a lot near apricot, that's useful information.

But most frequent words like the, it, or they are not very informative.

How can we balance these two conflicting constraints?


Text Semantics

Two common solutions for word weighting

tf-idf: the tf-idf value for word t in document d:

    tf-idf(t, d) = tf(t, d) × idf(t),   where idf(t) = log( N / df(t) )

Words like "the" or "it" have very low idf.

PMI (Pointwise Mutual Information): do words like "good" appear more often with "great" than we would expect by chance?

    PMI(w1, w2) = log( P(w1, w2) / ( P(w1) P(w2) ) )
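A short numpy sketch of both weighting schemes on toy counts (the matrices below are made up for illustration):

```python
# Toy illustration of tf-idf and (positive) PMI weighting (made-up counts, not course data).
import numpy as np

# tf-idf: rows = terms, columns = documents
counts = np.array([[10, 0, 0, 1],     # a topical word: appears in few documents
                   [ 0, 8, 7, 0],     # another topical word
                   [ 5, 6, 4, 6]])    # a "the"-like word: appears everywhere
N = counts.shape[1]                   # number of documents
df = (counts > 0).sum(axis=1)         # document frequency of each term
tf = np.log10(counts + 1)             # one common tf variant: log-scaled counts
idf = np.log10(N / df)                # terms appearing in every document get idf = 0
print(tf * idf[:, None])              # tf-idf matrix

# PMI from a word-word co-occurrence matrix (symmetric toy counts)
cooc = np.array([[ 0., 30., 2.],
                 [30.,  0., 1.],
                 [ 2.,  1., 0.]])
p_wc = cooc / cooc.sum()
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
print(np.maximum(pmi, 0))             # positive PMI (negative values clipped to 0)
```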
Text Semantics
Text Semantics

Sparse vs. dense vectors

tf-idf (or PMI) vectors


long (length |V|= 10,000 to 50,000)
sparse (most elements are zero)

Alternative: learn vectors


short (length 50-1000)
dense (most elements are non-zero)
Text Semantics

Why dense vectors?

Short vectors may be easier to use as features in deep learning (fewer weights to tune)
Dense vectors may generalize better than explicit counts
Dense vectors may do better at capturing synonymy:
car and automobile are synonyms, but correspond to distinct dimensions in count-based vectors
In practice, they work better
Text Semantics

Common methods for getting short dense vectors

Static embeddings

Word2vec (skipgram, CBOW), GloVe

Singular Value Decomposition (SVD)

A special case of this is called LSA – Latent Semantic Analysis

Dynamic embeddings:

Contextual Embeddings (ELMo, BERT)

Compute distinct embeddings for a word in its context

Separate embeddings for each token of a word


Text Semantics

Static embeddings that you can download (no training needed)

Word2vec (Google, 2013)


https://code.google.com/archive/p/word2vec/

GloVe (Stanford, 2014)


http://nlp.stanford.edu/projects/glove/

fastText (Facebook/Meta, 2015)


https://fasttext.cc
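A sketch of loading and querying pretrained static vectors, assuming gensim and its downloader catalogue (the model name below is one of gensim's pretrained GloVe sets; it downloads on first use):

```python
# Sketch (assumption: gensim is installed; downloads the model on first use).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")     # pretrained GloVe vectors, 100 dimensions

print(wv["car"].shape)                       # (100,)
print(wv.similarity("car", "automobile"))    # high cosine similarity
print(wv.most_similar("coffee", topn=5))     # similar/related words
```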
Text Semantics

Word2Vec: skip-gram with negative sampling

Instead of counting how often each word w occurs near “apricot”,


train a classifier on a binary prediction task:
◦ Is w likely to show up near "apricot"?

We don’t actually care about this task


◦ But we'll take the learned classifier weights as the word embeddings
Text Semantics

Word2Vec: skip-gram with negative sampling

Predict if candidate word c is a "neighbor"


1. Treat the target word t and a neighboring context word c as positive examples
2. Randomly sample other words in the lexicon to get negative examples
3. Use logistic regression to train a classifier to distinguish those two cases
4. Use the learned weights as the embeddings
Text Semantics

Word2Vec: skip-gram with negative sampling

(assuming a +/- 2 word window)

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4
Goal: train a classifier that is given a candidate (word, context) pair
(apricot, jam)
(apricot, aardvark)

And assigns each pair a probability:

P(+|w, c)
P(−|w, c) = 1 − P(+|w, c)

Base the probability on embedding similarity: Sim(w, c) ≈ w · c
To turn this similarity into a probability, use the sigmoid: P(+|w, c) = σ(c · w) = 1 / (1 + exp(−c · w))
Text Semantics

Word2Vec: skip-gram with negative sampling

This is for one context word, but we have lots of context words.

Assume independence and just multiply them:

P(+|w, c1:L) = ∏ i=1..L σ(ci · w)
log P(+|w, c1:L) = ∑ i=1..L log σ(ci · w)
Text Semantics

Word2Vec: skip-gram with negative sampling

Skip-gram classifier: summary


A probabilistic classifier, given
a test target word w
and its context window of L words c1:L
Estimates probability that w occurs in this window based on similarity of w (embeddings) to
c1:L (embeddings).

To compute this, we just need embeddings for all the words.


Text Semantics

Word2Vec: skip-gram with negative sampling

Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4
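A sketch of how such training pairs can be generated for the sentence above (window of ±2; note that real word2vec draws negatives from a weighted unigram distribution, whereas this toy version samples uniformly from a small hypothetical vocabulary):

```python
# Toy generation of skip-gram training pairs (uniform negative sampling; real word2vec
# samples negatives from a weighted unigram distribution over the full vocabulary).
import random

random.seed(0)
tokens = ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch"]
vocab = sorted(set(tokens)) + ["aardvark", "zebra"]   # hypothetical sampling vocabulary
window, k = 2, 2                                      # +/- 2 context window, k negatives per positive

positives, negatives = [], []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j == i:
            continue
        positives.append((target, tokens[j]))                  # (w, c_pos)
        for _ in range(k):
            negatives.append((target, random.choice(vocab)))   # (w, c_neg)

print([p for p in positives if p[0] == "apricot"])
# [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
print(negatives[:4])
```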

Text Semantics

Word2vec: How to Learn Vectors

Given the set of positive and negative training instances, and an initial set of
embedding vectors:
The goal of learning is to adjust those word vectors such that:

Maximize the similarity of the target word, context word pairs (w , cpos) drawn
from the positive data

Minimize the similarity of the (w , cneg) pairs drawn from the negative data.
Text Semantics

Word2vec: How to Learn Vectors

Maximize the similarity of the target


with the actual context words, and
minimize the similarity of the target
with the k negative sampled non-
neighbor words.

Use Stochastic Gradient Descent!


Text Semantics

Intuition of one step of gradient descent

Reminder: at each step,
◦ we move in the reverse direction from the gradient of the loss function,
◦ by an amount proportional to the value of that gradient (weighted by a learning rate).
Text Semantics

The derivatives of the loss function
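The slide shows these as a figure; for reference, a standard formulation of the skip-gram negative-sampling loss and its gradients (following Jurafsky & Martin's Speech and Language Processing, and assumed to match the omitted figure) is:

```latex
L_{CE} = -\Big[\log \sigma(c_{pos}\cdot w) + \sum_{i=1}^{k}\log \sigma(-c_{neg_i}\cdot w)\Big]

\frac{\partial L_{CE}}{\partial c_{pos}} = \big[\sigma(c_{pos}\cdot w)-1\big]\, w

\frac{\partial L_{CE}}{\partial c_{neg_i}} = \sigma(c_{neg_i}\cdot w)\, w

\frac{\partial L_{CE}}{\partial w} = \big[\sigma(c_{pos}\cdot w)-1\big]\, c_{pos} + \sum_{i=1}^{k}\sigma(c_{neg_i}\cdot w)\, c_{neg_i}
```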


Text Semantics

Start with randomly initialized C and W matrices, then incrementally do updates
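A minimal numpy sketch of one such update, using the gradients above; the matrix names W and C follow the slides, while the sizes, learning rate, and indices are illustrative only.

```python
# One stochastic gradient step of skip-gram with negative sampling (toy sizes; a sketch,
# not an actual implementation).
import numpy as np

rng = np.random.default_rng(0)
V, d, k, eta = 20, 8, 3, 0.1             # vocab size, embedding dim, negatives, learning rate
W = rng.normal(scale=0.1, size=(V, d))   # randomly initialized target-word embeddings
C = rng.normal(scale=0.1, size=(V, d))   # randomly initialized context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(t, c_pos, c_negs):
    """Update W and C for one (target, positive context, k negative samples) example."""
    w = W[t]
    g_pos = sigmoid(C[c_pos] @ w) - 1.0              # gradient factor for the positive pair
    g_negs = sigmoid(C[c_negs] @ w)                  # gradient factors for the k negatives
    grad_w = g_pos * C[c_pos] + g_negs @ C[c_negs]   # dL/dw
    C[c_pos]  -= eta * g_pos * w                     # pull c_pos toward w
    C[c_negs] -= eta * g_negs[:, None] * w           # push each c_neg away from w
    W[t]      -= eta * grad_w

sgns_step(t=4, c_pos=7, c_negs=np.array([1, 9, 13]))
print((W + C)[4][:4])   # final representation of word 4: w_i + c_i (first few dimensions)
```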


Text Semantics

Learns two sets of embeddings:


Target embeddings matrix W
Context embedding matrix C

Add them together, representing word i as the vector wi + ci


Text Semantics

Implications of Word Embeddings

The classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson, 1973)

To solve: "apple is to tree as grape is to _____"


Text Semantics

Implications of Word Embeddings

king – man + woman is close to queen


Paris – France + Italy is close to Rome

For a problem a : a* :: b : b*, the parallelogram method computes:

    b̂* = argmin_x distance(x, a* − a + b)
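With pretrained vectors this can be tried directly (assumptions: gensim is installed and the GloVe model below is available through its downloader; results vary with the embedding model):

```python
# Sketch of the parallelogram/analogy method with pretrained vectors (assumption: gensim).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # lowercase vocabulary

# king - man + woman  ~=  queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Paris - France + Italy  ~=  Rome
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```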


Text Semantics
Text Semantics

Implications of Word Embeddings

It only seems to work for frequent words, small distances and certain
relations (relating countries to capitals, or parts of speech), but not
others. (Linzen 2016, Gladkova et al. 2016, Ethayarajh et al. 2019a)
Text Semantics

Implications of Word Embeddings

Train embeddings on different decades of historical text to see meanings shift

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of ACL.
Text Semantics

Implications of Word Embeddings

They reflect cultural bias

Ask “Paris : France :: Tokyo : x”


◦ x = Japan
Ask “father : doctor :: mother : x”
◦ x = nurse
Ask “man : computer programmer :: woman : x”
◦ x = homemaker

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS, pp. 4349-4357. 2016.
Thanks for your participation!
