Week 2 and 3
17.10.2024
* The Course Slides are subject to CC BY-NC. Either the original work or a derivative work can be shared with appropriate attribution, but only for noncommercial purposes.
Course Project
Machine Learning
Deep Learning
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Self-supervised Learning
Supervised Learning
Discover patterns in the data that relate data attributes with a target (class) attribute.
Patterns are utilized to predict the values of the target attribute in future data instances.
Supervised Learning
• Classification uses an algorithm to accurately assign test data into specific categories.
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), ..., (dm, cm)
• Output:
  • a learned classifier γ: d → c (see the sketch below)
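A toy illustration of this input/output setup, assuming scikit-learn; the documents, labels, and the Naive Bayes choice below are invented for the example, and the pipeline stands in for the learned classifier γ.

```python
# Toy illustration: learn a classifier gamma: d -> c from hand-labeled documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-labeled training set (d1, c1), ..., (dm, cm) -- invented examples.
train_docs = [
    "great match and a thrilling final",
    "stocks fell sharply today",
    "the striker scored twice",
    "the central bank raised rates",
]
train_labels = ["sports", "finance", "sports", "finance"]

# gamma: d -> c, here a bag-of-words Naive Bayes pipeline.
gamma = make_pipeline(CountVectorizer(), MultinomialNB())
gamma.fit(train_docs, train_labels)

print(gamma.predict(["a thrilling match for the striker"]))  # -> ['sports']
```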
Supervised Learning
• Input:
  • document d's full text?
    (paragraphs, sentences, words, subwords, characters)
• Sahin, U., Kucukkaya, I. E., Ozcelik, O., & Toraman, C. (2023, September). ARC-NLP at Multimodal Hate Speech Event Detection 2023: Multimodal Methods
Boosted by Ensemble Learning, Syntactical and Entity Features. In Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of
Socio-political Events from Text (pp. 71-78).
Supervised Learning
• Sahin, U., Kucukkaya, I. E., & Toraman, C. (2023). ARC-NLP at PAN 2023: Hierarchical Long Text Classification for Trigger Detection. CLEF Working Notes, 2023.
Supervised Learning
• LazyPredict Library
https://github.com/shankarpandala/lazypredict
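A brief sketch of the typical LazyPredict workflow (interface as documented in the repository above); the breast-cancer dataset is just a stand-in for your own features and labels.

```python
# Fit many baseline classifiers at once with LazyPredict and compare their scores.
from lazypredict.Supervised import LazyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LazyClassifier(verbose=0, ignore_warnings=True, predictions=False)
models, _ = clf.fit(X_train, X_test, y_train, y_test)  # one score row per model
print(models.head())
```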
Unsupervised Learning
Association uses different rules to find relationships between variables in a given data set.
Dimensionality reduction is a learning technique used when the number of features (or dimensions) in a given data set is too high.
Unsupervised Learning
• Clustering quality (see the sketch below):
- Inter-cluster distance ⇒ maximized
- Intra-cluster distance ⇒ minimized
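A small sketch of these two quantities, assuming scikit-learn and synthetic blob data: intra-cluster distance is measured here as the mean distance from each point to its own centroid, inter-cluster distance as the mean distance between centroids.

```python
# Clustering quality on synthetic data: small intra-cluster, large inter-cluster distances.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Intra-cluster distance: mean distance of each point to its own centroid (want small).
intra = np.mean(np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1))

# Inter-cluster distance: mean pairwise distance between centroids (want large).
c = km.cluster_centers_
inter = np.mean([np.linalg.norm(c[i] - c[j])
                 for i in range(len(c)) for j in range(i + 1, len(c))])

print(f"intra-cluster = {intra:.2f}, inter-cluster = {inter:.2f}")
```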
Need labeled data!
Self-supervised Learning
Text Semantics
Ambiguity
Every fifteen minutes a woman in this country gives birth. Our job is to find this woman, and stop her!
Text Semantics
Expressivity
Sparsity
mouse (N), senses:
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...
(Modified from the online thesaurus WordNet)
Word similarity examples: car / bicycle, cow / horse
Text Semantics
Relatedness
Words can be related in any way, perhaps via a semantic
frame or field
Antonymy: Senses that are opposites with respect to only one feature of meaning
dark/light, short/long, fast/slow, rise/fall, hot/cold, up/down, in/out
Text Semantics
Connotation can be subtle:
Positive connotation: copy, replica, reproduction
Negative connotation: fake, knockoff, forgery
Text Semantics
Can we build a theory of how to represent word meaning that accounts for those semantic concepts?
Vector semantics
Basic model for language processing
Handles many of our goals
Text Semantics
With embeddings:
Term-document matrix
Each document is represented by a vector of word counts
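A minimal sketch of such a matrix with scikit-learn; the documents are toy examples.

```python
# Build a term-document matrix: rows are terms, columns are documents, cells are counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["battle of the kings", "the fool and the clown", "battle trumpets sound"]
vec = CountVectorizer()
doc_term = vec.fit_transform(docs)     # shape: (documents, terms)
term_doc = doc_term.T.toarray()        # transpose to (terms, documents)

for term, counts in zip(vec.get_feature_names_out(), term_doc):
    print(f"{term:10s} {counts}")
```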
Text Semantics
battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"
Text Semantics
Two words are similar in meaning if their context vectors are similar.
Text Semantics
The dot product tends to be high when the two vectors have large values in the same dimensions.
But the dot product favors long vectors: it is higher if a vector is longer (has higher values in many dimensions).
Frequent words (of, the, you) have long vectors, since they occur many times with other words.
So the dot product overly favors frequent words.
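A toy numeric illustration of this bias (all counts invented): with raw dot products the long vector of a frequent word wins, while normalizing by the vector lengths (cosine similarity) removes the effect.

```python
# Raw dot product vs. length-normalized (cosine) similarity on invented count vectors.
import numpy as np

apricot = np.array([1.0, 0.0, 1.0, 2.0])       # rare word: short vector
the     = np.array([40.0, 35.0, 50.0, 60.0])   # frequent word: long vector
sugar   = np.array([2.0, 0.0, 1.0, 3.0])

def cosine(a, b):
    # Dot product divided by the vector lengths.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(sugar @ apricot, sugar @ the)                # 9.0 vs 310.0: dot product favors 'the'
print(cosine(sugar, apricot), cosine(sugar, the))  # ~0.98 vs ~0.88: cosine favors 'apricot'
```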
Text Semantics
Frequency is useful: If sugar appears a lot near apricot, that's useful information.
But the most frequent words, like the, it, or they, are not very informative.
PMI (Pointwise Mutual Information): do words like "good" appear more often with "great" than we would expect by chance?

$\mathrm{PMI}(w_1, w_2) = \log_2 \dfrac{p(w_1, w_2)}{p(w_1)\, p(w_2)}$
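A small sketch of computing this quantity from raw co-occurrence counts; all counts below are invented for illustration.

```python
# Pointwise Mutual Information from invented co-occurrence counts.
import math

total = 10_000            # total word-pair observations
count_good = 300          # count("good")
count_great = 200         # count("great")
count_both = 60           # count("good" co-occurring with "great")

p_good, p_great = count_good / total, count_great / total
p_both = count_both / total

pmi = math.log2(p_both / (p_good * p_great))
print(f"PMI(good, great) = {pmi:.2f}")   # > 0: they co-occur more often than chance
```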
Text Semantics
Short vectors may be easier to use as features in deep learning (fewer weights to tune)
Dense vectors may generalize better than explicit counts
Dense vectors may do better at capturing synonymy:
car and automobile are synonyms, but they are distinct dimensions in sparse count vectors.
In practice, dense embeddings work better.
Text Semantics
Static embeddings
Dynamic embeddings:
This is for one context word, but we have lots of context words.
Text Semantics
Given the set of positive and negative training instances, and an initial set of embedding vectors, the goal of learning is to adjust those word vectors such that we (see the sketch below):
- Maximize the similarity of the target word–context word pairs (w, cpos) drawn from the positive data
- Minimize the similarity of the (w, cneg) pairs drawn from the negative data
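A minimal numeric sketch of that objective for one (target, positive context, negative contexts) triple, assuming the usual skip-gram negative-sampling loss in which similarity is the dot product passed through a sigmoid; all vectors here are random toy values.

```python
# Skip-gram negative-sampling loss for one (target, positive, negatives) triple.
import numpy as np

rng = np.random.default_rng(0)
dim = 5
w      = rng.normal(size=dim)          # target word embedding
c_pos  = rng.normal(size=dim)          # context word from the positive data
c_negs = rng.normal(size=(2, dim))     # sampled negative context words

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Loss is low when w is similar to c_pos and dissimilar to every c_neg.
loss = -np.log(sigmoid(w @ c_pos)) - np.sum(np.log(sigmoid(-(c_negs @ w))))
print(f"SGNS loss for this triple: {loss:.3f}")
```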
Text Semantics
Reminder (gradient descent):
At each step:
- We move in the reverse direction from the gradient of the loss function.
- We move by an amount proportional to the value of this gradient, weighted by a learning rate (see the sketch below).
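A tiny sketch of that update rule; the loss function, its gradient, and the learning rate below are placeholders for illustration.

```python
# One gradient-descent step: move against the gradient, scaled by the learning rate.
import numpy as np

def gradient_step(theta, grad_fn, lr=0.1):
    return theta - lr * grad_fn(theta)

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = np.array([0.0])
for _ in range(50):
    theta = gradient_step(theta, lambda t: 2 * (t - 3))
print(theta)   # converges toward [3.]
```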
Text Semantics
The classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson, 1973)
It only seems to work for frequent words, small distances and certain
relations (relating countries to capitals, or parts of speech), but not
others. (Linzen 2016, Gladkova et al. 2016, Ethayarajh et al. 2019a)
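A sketch of the parallelogram method itself, on hand-made toy vectors (real embeddings would be needed to reproduce the limitations noted above): to answer "man is to king as woman is to ?", compute king - man + woman and take the nearest remaining embedding by cosine similarity.

```python
# Parallelogram analogy on tiny made-up embeddings: king - man + woman ~ queen.
import numpy as np

emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
best = max((word for word in emb if word not in {"king", "man", "woman"}),
           key=lambda word: cosine(emb[word], target))
print(best)   # 'queen'
```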
Text Semantics
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of ACL.
Text Semantics
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS, pp. 4349-4357. 2016.
Thanks for your participation!