
CENG7811: Applied Natural Language Processing

Week 2 & 3: Review of NLP Approaches & Text Semantics

Asst. Prof. Cagri Toraman


Computer Engineering Department
[email protected]

17.10.2024
* The Course Slides are subject to CC BY-NC. Either the original work or a derivative work can be shared with appropriate attribution, but only for noncommercial purposes.
Course Project

- Propose or Select a Project Topic (real-world use case)
  Due date: 17.10.24, 23:59
- Find and Propose Baseline SOTA Papers from the Literature
  Due date: 17.10.24, 23:59

- Review, Present, and Implement Baseline


- Write a Final Report in ACL Proceedings Style
- Present a demo
Project Topics:

- Read project_topics.pdf at ODTUClass


Recent Technologies

Machine Learning
Deep Learning

Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Self-supervised Learning
Supervised Learning

Discover patterns in the data that relate data attributes with a target (class) attribute.

Patterns are utilized to predict the values of the target attribute in future data instances.
Supervised Learning
Supervised Learning

• Classification uses an algorithm to accurately assign test data into specific categories.

• Regression uses an algorithm to understand the relationship between dependent and independent variables.
Supervised Learning

• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  • a learned classifier γ: d → c
Supervised Learning

• Input:
  • document d's full text?
    (paragraphs, sentences, words, subwords, characters)
  • document d's other features?
    (text length, author, timestamp, meta-attributes)
• How to represent d in vector space?
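As a concrete illustration, here is a minimal sketch of the whole pipeline: represent each document d as a vector and learn a classifier γ: d → c. It assumes scikit-learn; the toy documents and labels are made up, not course data.

```python
# Minimal sketch (assumption: scikit-learn; toy documents and labels, not course data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = ["great movie, loved it", "terrible plot and acting",
              "wonderful performance", "boring and predictable"]
train_labels = ["pos", "neg", "pos", "neg"]      # hand-labeled classes c1..cm

classifier = make_pipeline(
    TfidfVectorizer(),                 # represent d in vector space
    LogisticRegression(max_iter=1000)  # learn gamma: d -> c
)
classifier.fit(train_docs, train_labels)

print(classifier.predict(["loved the acting"]))  # e.g. ['pos']
```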


Supervised Learning
Supervised Learning

• Sahin, U., Kucukkaya, I. E., Ozcelik, O., & Toraman, C. (2023, September). ARC-NLP at Multimodal Hate Speech Event Detection 2023: Multimodal Methods
Boosted by Ensemble Learning, Syntactical and Entity Features. In Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of
Socio-political Events from Text (pp. 71-78).
Supervised Learning

• Sahin, U., Kucukkaya, I. E., & Toraman, C. (2023). ARC-NLP at PAN 2023: Hierarchical Long Text Classification for Trigger Detection. CLEF Working Notes, 2023.
Supervised Learning

• Any kind of classifier


• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors
•…
Supervised Learning

• LazyPredict Library
https://github.com/shankarpandala/lazypredict
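A usage sketch following the library's README (verify the API against the installed version). LazyPredict expects numeric feature matrices, so text would first need to be vectorized; the dataset here is a placeholder.

```python
# Sketch following the LazyPredict README (placeholder dataset, not course data).
from lazypredict.Supervised import LazyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)      # any numeric feature matrix works
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)   # leaderboard of many classifiers (accuracy, F1, ...)
```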
Unsupervised Learning

Data have no target (class) attribute.

We want to explore the data to find some intrinsic structures.
Unsupervised Learning
Unsupervised Learning

Clustering is a technique for finding similarity groups in data, called clusters.

Association uses different rules to find relationships between variables in a given data set.

Dimensionality reduction is a learning technique used when the number of features (or dimensions) in a given data set is too high.
Unsupervised Learning
Unsupervised Learning

• Some clustering algorithms:

- Partitional clustering (K-means)


- Hierarchical clustering

• A distance (similarity, or dissimilarity) function

• Clustering quality
- Inter-clusters distance ⇒ maximized
- Intra-clusters distance ⇒ minimized

• How to represent d in vector space?

• How to find similarity?
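A minimal K-means sketch tying these pieces together: documents are represented as tf-idf vectors, clustered, and clustering quality is summarized with the silhouette score (which rewards small intra-cluster and large inter-cluster distances). It assumes scikit-learn; the toy documents are made up.

```python
# Minimal sketch (assumption: scikit-learn; toy documents, not course data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = ["cats and dogs", "dogs are pets", "pets need food",
        "stocks fell today", "markets and stocks", "investors sold stocks"]

X = TfidfVectorizer().fit_transform(docs)            # represent d in vector space
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)                                # cluster assignment per document
print(silhouette_score(X, kmeans.labels_))           # higher = tighter, better-separated clusters
```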
Unsupervised Learning
Unsupervised Learning

• Dimensionality Reduction (linear PCA, non-linear t-SNE and UMAP)
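A small sketch of the linear and non-linear options, assuming scikit-learn (UMAP lives in the separate umap-learn package and is omitted here); the data is random toy data.

```python
# Minimal sketch (assumption: scikit-learn; random toy data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(100, 50)                          # 100 points, 50 features

X_pca = PCA(n_components=2).fit_transform(X)         # linear projection
X_tsne = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(X)       # non-linear embedding

print(X_pca.shape, X_tsne.shape)                     # (100, 2) (100, 2)
```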


Semi-supervised Learning
Self-supervised Learning

Need labeled data!
Self-supervised Learning
Text Semantics

Ambiguity

Every fifteen minutes a woman in this country gives birth. Our job is to find this woman, and stop her!
Text Semantics

Expressivity

She gave the book to Tom


She gave Tom the book
Text Semantics

Sparsity

Power Law (Zipf’s Law)


Text Semantics

How to represent the knowledge a human has/needs?

What is the “meaning” of a word or sentence?


Text Semantics

Concepts or word senses


◦ Have a complex many-to-many association with words (homonymy, multiple senses)

Have relations with each other


◦ Synonymy
◦ Antonymy
◦ Similarity
◦ Relatedness
◦ Connotation
Text Semantics

Lemmas and senses


lemma: mouse (N)

senses:
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...

(Modified from the online thesaurus WordNet)

A sense or “concept” is the meaning component of a word


Lemmas can be polysemous (have multiple senses)
Text Semantics

Synonyms have the same meaning in some or all contexts.


◦ filbert / hazelnut
◦ couch / sofa
◦ big / large
◦ automobile / car
◦ vomit / throw up
◦ water / H2O
Text Semantics

Similarity: words with similar meanings. Not synonyms, but sharing some element of meaning:

car, bicycle
cow, horse
Text Semantics

Relatedness
Words can be related in any way, perhaps via a semantic
frame or field

◦ coffee, tea: similar


◦ coffee, cup: related, not similar
Text Semantics

Antonymy: Senses that are opposites with respect to only one feature of meaning
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
Text Semantics

Positive connotations (happy, great, love)


Negative connotations (sad, terrible, hate)

Can be subtle:
Positive connotation: copy, replica, reproduction
Negative connotation: fake, knockoff, forgery
Text Semantics

Can we build a theory of how to represent word meaning that accounts for those semantic concepts?

Vector semantics
Basic model for language processing
Handles many of our goals
Text Semantics

Idea 1: Defining meaning by linguistic distribution


Idea 2: Meaning as a point in multidimensional space
Text Semantics

The meaning of a word is its use in the language.

Words are defined by their environments (the words around them).

I'm going to the bank to deposit my paycheck.

The bank of the river was lined with trees.
Text Semantics

3 affective dimensions for a word:
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted

Dimension     High-scoring word   Score    Low-scoring word   Score
Valence       love                1.000    toxic              0.008
              happy               1.000    nightmare          0.005
Arousal       elated              0.960    mellow             0.069
              frenzy              0.965    napping            0.046
Dominance     powerful            0.991    weak               0.045
              leadership          0.983    empty              0.081

A word is a vector in 3-space.


Text Semantics

Defining meaning as a point in space based on distribution


Each word = a vector (not just "good" or "w45")
Similar words are "nearby in semantic space"
We build this space automatically by seeing which words are nearby in text
Text Semantics

We define meaning of a word as a vector.

Called an "embedding" because it's embedded into a space.

The standard way to represent meaning in NLP:


Every modern NLP algorithm uses embeddings as the representation of
word meaning.
Text Semantics

Why do we need embeddings compared to word features?


With words, a feature is a word identity

Feature number 729: "terrible"


Requires exact same word to be in training and test

With embeddings:

Feature is a word vector


The previous word was vector [35,22,17]
Now in the test set we might see a similar vector [34,21,14]
We can generalize to similar but unseen words!
Text Semantics

Bag-of-Words (e.g. tf-idf (embedding) vector)


A common baseline model from Information Retrieval
Sparse vectors
Words are represented by (a simple function of) the counts of nearby words

Word embeddings (e.g. Word2vec embedding vector)


Dense vectors
Representation is created by training a classifier to predict whether a word is
likely to appear nearby
Later we'll discuss extensions called contextual embeddings
Text Semantics

Term-document matrix
Each document is represented by a vector of words
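A sketch of building such a matrix with a count vectorizer, assuming scikit-learn; the four toy documents below are stand-ins, not the Shakespeare plays used on the slides.

```python
# Sketch of a term-document matrix (toy documents, not the Shakespeare counts from the slides).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fool wit fool love",          # comedy-like toy document
        "wit love fool",
        "battle soldier battle king",  # history-like toy document
        "battle king soldier"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # rows = documents, columns = vocabulary words

print(vectorizer.get_feature_names_out())
print(X.toarray())                      # each document is a vector of word counts
```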
Text Semantics
Text Semantics

Vectors are similar for the two comedies.

But comedies are different than the other two:


Comedies have more fools and wit and fewer battles.
Text Semantics

Idea for word meaning: Words can be vectors too

battle is "the kind of word that occurs in Julius Caesar and Henry V"

fool is "the kind of word that occurs in comedies, especially Twelfth Night"
Text Semantics

Two words are similar in meaning if their context vectors are similar.
Text Semantics

Similarity calculation between vectors:

The dot product tends to be high when the two vectors have large values in the same dimensions.

But the dot product favors long vectors: it is higher if a vector is longer (has higher values in many dimensions).
Frequent words (of, the, you) have long vectors (since they occur many times with other words).
So the dot product overly favors frequent words.
Text Semantics

Similarity calculation between vectors (cosine similarity):

-1: vectors point in opposite directions
+1: vectors point in the same direction
 0: vectors are orthogonal

Co-occurrence counts (context words: pie, data, computer):

              pie    data    computer
cherry        442       8           2
digital         5    1683        1670
information     5    3982        3325
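Using the count vectors from this table, a short numpy sketch contrasts the raw dot product with cosine similarity (the length-normalized dot product):

```python
# Count vectors from the table above (contexts: pie, data, computer).
import numpy as np

cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

def cosine(a, b):
    # Dividing by vector lengths removes the dot product's bias toward frequent (long) vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.dot(cherry, information), np.dot(digital, information))  # raw dot products
print(cosine(cherry, information))    # low: cherry and information are dissimilar
print(cosine(digital, information))   # near 1: digital and information are similar
```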
Text Semantics

Frequency is useful: If sugar appears a lot near apricot, that's useful information.

But most frequent words like the, it, or they are not very informative.

How can we balance these two conflicting constraints?


Text Semantics

Two common solutions for word weighting

tf-idf: the tf-idf value for word t in document d:

    tf-idf(t, d) = tf(t, d) × idf(t),   where idf(t) = log( N / df(t) )

Words like "the" or "it" have very low idf.

PMI (Pointwise Mutual Information): do words like "good" appear more often with "great" than we would expect by chance?

    PMI(w1, w2) = log( P(w1, w2) / ( P(w1) P(w2) ) )
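A short numpy sketch of both weighting schemes on toy counts (the matrices below are made up for illustration):

```python
# Toy illustration of tf-idf and (positive) PMI weighting (made-up counts, not course data).
import numpy as np

# tf-idf: rows = terms, columns = documents
counts = np.array([[10, 0, 0, 1],     # a topical word: appears in few documents
                   [ 0, 8, 7, 0],     # another topical word
                   [ 5, 6, 4, 6]])    # a "the"-like word: appears everywhere
N = counts.shape[1]                   # number of documents
df = (counts > 0).sum(axis=1)         # document frequency of each term
tf = np.log10(counts + 1)             # one common tf variant: log-scaled counts
idf = np.log10(N / df)                # terms appearing in every document get idf = 0
print(tf * idf[:, None])              # tf-idf matrix

# PMI from a word-word co-occurrence matrix (symmetric toy counts)
cooc = np.array([[ 0., 30., 2.],
                 [30.,  0., 1.],
                 [ 2.,  1., 0.]])
p_wc = cooc / cooc.sum()
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
print(np.maximum(pmi, 0))             # positive PMI (negative values clipped to 0)
```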
Text Semantics
Text Semantics

Sparse vs. dense vectors

tf-idf (or PMI) vectors


long (length |V|= 10,000 to 50,000)
sparse (most elements are zero)

Alternative: learn vectors


short (length 50-1000)
dense (most elements are non-zero)
Text Semantics

Why dense vectors?

Short vectors may be easier to use as features in deep learning (fewer weights to tune)
Dense vectors may generalize better than explicit counts
Dense vectors may do better at capturing synonymy:
car and automobile are synonyms, but correspond to distinct dimensions in count-based vectors
In practice, they work better
Text Semantics

Common methods for getting short dense vectors

Static embeddings

Word2vec (skipgram, CBOW), GloVe

Singular Value Decomposition (SVD)

A special case of this is called LSA – Latent Semantic Analysis

Dynamic embeddings:

Contextual Embeddings (ELMo, BERT)

Compute distinct embeddings for a word in its context

Separate embeddings for each token of a word


Text Semantics

Static embeddings that you can download (no training needed)

Word2vec (Google, 2013)


https://code.google.com/archive/p/word2vec/

GloVe (Stanford, 2014)


http://nlp.stanford.edu/projects/glove/

fastText (Facebook/Meta, 2015)


https://fasttext.cc
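A sketch of loading and querying pretrained static vectors, assuming gensim and its downloader catalogue (the model name below is one of gensim's pretrained GloVe sets; it downloads on first use):

```python
# Sketch (assumption: gensim is installed; downloads the model on first use).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")     # pretrained GloVe vectors, 100 dimensions

print(wv["car"].shape)                       # (100,)
print(wv.similarity("car", "automobile"))    # high cosine similarity
print(wv.most_similar("coffee", topn=5))     # similar/related words
```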
Text Semantics

Word2Vec: skip-gram with negative sampling

Instead of counting how often each word w occurs near “apricot”,


train a classifier on a binary prediction task:
◦ Is w likely to show up near "apricot"?

We don’t actually care about this task


◦ But we'll take the learned classifier weights as the word embeddings
Text Semantics

Word2Vec: skip-gram with negative sampling

Predict if candidate word c is a "neighbor"


1. Treat the target word t and a neighboring context word c as positive examples
2. Randomly sample other words in the lexicon to get negative examples
3. Use logistic regression to train a classifier to distinguish those two cases
4. Use the learned weights as the embeddings
Text Semantics

Word2Vec: skip-gram with negative sampling

(assuming a +/- 2 word window)

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4
Goal: train a classifier that is given a candidate (word, context) pair
(apricot, jam)
(apricot, aardvark)

And assigns each pair a probability:

P(+|w, c)
P(−|w, c) = 1 − P(+|w, c)

Base the probability on embedding similarity: Sim(w, c) ≈ w · c
To turn this similarity into a probability, use the sigmoid: P(+|w, c) = σ(c · w) = 1 / (1 + exp(−c · w))
Text Semantics

Word2Vec: skip-gram with negative sampling

This is for one context word, but we have lots of context words.

Assume independence and just multiply them:

P(+|w, c1:L) = ∏ i=1..L σ(ci · w)
log P(+|w, c1:L) = ∑ i=1..L log σ(ci · w)
Text Semantics

Word2Vec: skip-gram with negative sampling

Skip-gram classifier: summary


A probabilistic classifier, given
a test target word w
and its context window of L words c1:L
Estimates probability that w occurs in this window based on similarity of w (embeddings) to
c1:L (embeddings).

To compute this, we just need embeddings for all the words.


Text Semantics

Word2Vec: skip-gram with negative sampling

Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4
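A sketch of how such training pairs can be generated for the sentence above (window of ±2; note that real word2vec draws negatives from a weighted unigram distribution, whereas this toy version samples uniformly from a small hypothetical vocabulary):

```python
# Toy generation of skip-gram training pairs (uniform negative sampling; real word2vec
# samples negatives from a weighted unigram distribution over the full vocabulary).
import random

random.seed(0)
tokens = ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch"]
vocab = sorted(set(tokens)) + ["aardvark", "zebra"]   # hypothetical sampling vocabulary
window, k = 2, 2                                      # +/- 2 context window, k negatives per positive

positives, negatives = [], []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j == i:
            continue
        positives.append((target, tokens[j]))                  # (w, c_pos)
        for _ in range(k):
            negatives.append((target, random.choice(vocab)))   # (w, c_neg)

print([p for p in positives if p[0] == "apricot"])
# [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
print(negatives[:4])
```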

Text Semantics

Word2vec: How to Learn Vectors

Given the set of positive and negative training instances, and an initial set of
embedding vectors:
The goal of learning is to adjust those word vectors such that:

Maximize the similarity of the target word, context word pairs (w , cpos) drawn
from the positive data

Minimize the similarity of the (w , cneg) pairs drawn from the negative data.
Text Semantics

Word2vec: How to Learn Vectors

Maximize the similarity of the target


with the actual context words, and
minimize the similarity of the target
with the k negative sampled non-
neighbor words.

Use Stochastic Gradient Descent!


Text Semantics

Intuition of one step of gradient descent

Reminder: at each step,
◦ we move in the reverse direction from the gradient of the loss function,
◦ by an amount proportional to the value of that gradient (weighted by a learning rate).
Text Semantics

The derivatives of the loss function
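The slide shows these as a figure; for reference, a standard formulation of the skip-gram negative-sampling loss and its gradients (following Jurafsky & Martin's Speech and Language Processing, and assumed to match the omitted figure) is:

```latex
L_{CE} = -\Big[\log \sigma(c_{pos}\cdot w) + \sum_{i=1}^{k}\log \sigma(-c_{neg_i}\cdot w)\Big]

\frac{\partial L_{CE}}{\partial c_{pos}} = \big[\sigma(c_{pos}\cdot w)-1\big]\, w

\frac{\partial L_{CE}}{\partial c_{neg_i}} = \sigma(c_{neg_i}\cdot w)\, w

\frac{\partial L_{CE}}{\partial w} = \big[\sigma(c_{pos}\cdot w)-1\big]\, c_{pos} + \sum_{i=1}^{k}\sigma(c_{neg_i}\cdot w)\, c_{neg_i}
```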


Text Semantics

Start with randomly initialized C and W matrices, then incrementally do updates
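A minimal numpy sketch of one such update, using the gradients above; the matrix names W and C follow the slides, while the sizes, learning rate, and indices are illustrative only.

```python
# One stochastic gradient step of skip-gram with negative sampling (toy sizes; a sketch,
# not an actual implementation).
import numpy as np

rng = np.random.default_rng(0)
V, d, k, eta = 20, 8, 3, 0.1             # vocab size, embedding dim, negatives, learning rate
W = rng.normal(scale=0.1, size=(V, d))   # randomly initialized target-word embeddings
C = rng.normal(scale=0.1, size=(V, d))   # randomly initialized context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(t, c_pos, c_negs):
    """Update W and C for one (target, positive context, k negative samples) example."""
    w = W[t]
    g_pos = sigmoid(C[c_pos] @ w) - 1.0              # gradient factor for the positive pair
    g_negs = sigmoid(C[c_negs] @ w)                  # gradient factors for the k negatives
    grad_w = g_pos * C[c_pos] + g_negs @ C[c_negs]   # dL/dw
    C[c_pos]  -= eta * g_pos * w                     # pull c_pos toward w
    C[c_negs] -= eta * g_negs[:, None] * w           # push each c_neg away from w
    W[t]      -= eta * grad_w

sgns_step(t=4, c_pos=7, c_negs=np.array([1, 9, 13]))
print((W + C)[4][:4])   # final representation of word 4: w_i + c_i (first few dimensions)
```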


Text Semantics

Learns two sets of embeddings:


Target embeddings matrix W
Context embedding matrix C

Add them together, representing word i as the vector wi + ci


Text Semantics

Implications of Word Embeddings

The classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson, 1973)

To solve: "apple is to tree as grape is to _____"


Text Semantics

Implications of Word Embeddings

king – man + woman is close to queen


Paris – France + Italy is close to Rome

For a problem a : a* :: b : b*, the parallelogram method computes:

    b̂* = argmin_x distance(x, a* − a + b)
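With pretrained vectors this can be tried directly (assumptions: gensim is installed and the GloVe model below is available through its downloader; results vary with the embedding model):

```python
# Sketch of the parallelogram/analogy method with pretrained vectors (assumption: gensim).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # lowercase vocabulary

# king - man + woman  ~=  queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Paris - France + Italy  ~=  Rome
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```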


Text Semantics
Text Semantics

Implications of Word Embeddings

It only seems to work for frequent words, small distances and certain
relations (relating countries to capitals, or parts of speech), but not
others. (Linzen 2016, Gladkova et al. 2016, Ethayarajh et al. 2019a)
Text Semantics

Implications of Word Embeddings

Train embeddings on different decades of historical text to see meanings shift

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of ACL.
Text Semantics

Implications of Word Embeddings

They reflect cultural bias

Ask “Paris : France :: Tokyo : x”


◦ x = Japan
Ask “father : doctor :: mother : x”
◦ x = nurse
Ask “man : computer programmer :: woman : x”
◦ x = homemaker

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS, pp. 4349-4357. 2016.
Thanks for your participation!
