L4_CSE256_FA24_WE
Ndapa Nakashole
1
Today
2
Recap: Gradient Descent, Iterative Improvement of W
• We have a cost function J(θ) we want to minimize
• Gradient Descent is an algorithm to minimize J(θ)
• Idea: for the current value of θ, calculate the gradient of J(θ), then take a small step in the direction of the negative gradient. Repeat until convergence.
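A minimal sketch of this loop in numpy; the toy cost function J(θ) = ||θ||² and the learning rate are illustrative assumptions, not from the slides:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, num_steps=100):
    """Repeatedly take a small step in the direction of the negative gradient."""
    theta = theta0.copy()
    for _ in range(num_steps):
        theta -= lr * grad_fn(theta)   # small step along -gradient
    return theta

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2*theta.
theta_star = gradient_descent(lambda th: 2 * th, np.array([3.0, -2.0]))
print(theta_star)  # approaches [0, 0]
```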
33
[Figure: feedforward network over inputs x1-x4 with Layers 1-5]
Convert text to a vector, apply a FFN, pick the label with the highest score:
p = softmax(s)
s = W2 h
h = g(W1 x + b)
x (input)
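A minimal numpy sketch of this forward pass; the layer sizes, random weights, and the choice of ReLU for g are illustrative assumptions:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # x: input feature vector for the text
W1, b = rng.normal(size=(8, 4)), np.zeros(8)
W2 = rng.normal(size=(3, 8))      # 3 output classes (assumed)

h = np.maximum(0, W1 @ x + b)     # h = g(W1 x + b), with g = ReLU (assumed)
s = W2 @ h                        # scores s = W2 h
p = softmax(s)                    # class probabilities
print(p.argmax())                 # pick the label with the highest score
```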
6
Learning rate is a crucial hyperparameter
7
Learning Rate Schedules: change the learning rate over time
Can start with a high learning rate and then decrease it over time to improve convergence
▶ Step: reduce the learning rate at a few fixed points. E.g. for ResNets, multiply the LR by 0.1 after epochs 30, 60, and 90.
▶ Cosine: αt = ½ α0 (1 + cos(tπ/T))
⊙ α0: initial learning rate
⊙ αt: learning rate at epoch t
⊙ T: total number of epochs
8
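A small sketch of both schedules; the milestone defaults follow the ResNet example on this slide, and the function names are mine:

```python
import math

def step_lr(alpha0, epoch, milestones=(30, 60, 90), factor=0.1):
    """Step schedule: multiply the LR by `factor` at each milestone epoch."""
    drops = sum(epoch >= m for m in milestones)
    return alpha0 * (factor ** drops)

def cosine_lr(alpha0, epoch, total_epochs):
    """Cosine schedule: alpha_t = 1/2 * alpha_0 * (1 + cos(t*pi/T))."""
    return 0.5 * alpha0 * (1 + math.cos(epoch * math.pi / total_epochs))

print(step_lr(0.1, 61))          # 0.1 * 0.1 * 0.1 = 0.001
print(cosine_lr(0.1, 45, 90))    # half-way through training: 0.05
```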
Where can I find text classification data?
More project guidance coming closer to the (ungraded) proposal deadline
See: https://huggingface.co/datasets
9
Text Classification: we have looked at
10
Learning Word Representations: Word Embeddings
11
1-hot Vectors
12
Word Embeddings
Interesting result: word meaning can be represented by a vector of real numbers
‣ Want a vector space where similar words have similar embeddings
[Figure: "the movie was great" / "the movie was good"; nearby embeddings: great, good, enjoyable; distant: dog, bad, is]
‣ Goal: come up with a way to produce these embeddings
‣ For each word, want a "medium" dimensional vector (50-300 dims) representing it
13
Distributional Semantics
We will use the idea of distributional semantics to learn word embeddings
kombucha
I had a glass of kombucha at the farmers market
14
Kombucha
15
Word Vectors
▶ We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts, measuring similarity as the vector dot (scalar) product
banking = (0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271)
monetary = (0.413, 0.582, −0.007, 0.247, 0.216, −0.718, 0.147, 0.051)
Note: word vectors are also called (word) embeddings or (neural) word representations
They are a distributed representation
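A small sketch of this dot-product similarity, using the example vectors above:

```python
import numpy as np

banking = np.array([0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271])
monetary = np.array([0.413, 0.582, -0.007, 0.247, 0.216, -0.718, 0.147, 0.051])

# Similarity as the vector dot (scalar) product
print(banking @ monetary)

# Cosine similarity (length-normalized dot product) is also commonly used
print(banking @ monetary / (np.linalg.norm(banking) * np.linalg.norm(monetary)))
```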
16
▶ Word2Vec (Mikolov et al., 2013): simpler and faster than
previous models
⊙ Two algorithms: Skip-gram and Continuous Bag of Words
(CBOW)
Skip-Gram: Learning Word Embeddings (Mikolov et al., 2013)
▶ Input: a corpus of raw text (e.g. all of Wikipedia, news articles, books, etc.). Same as the input to "real" language modeling.
▶ Output: a set of embeddings: a real-valued vector for each word in the vocabulary
19
Word2Vec Overview
Example: center word is "banking", predict context words
Context window size = 2
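A sketch of generating (center, context) training pairs with a window size of 2; the sentence fragment is just an illustrative example:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within the given window size."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "problems turning into banking crises as".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
```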
20
Skip-gram Loss Function
21
Skip-gram: loss function
22
Training Word Vectors: minimize the loss with gradient descent
▶ To train the model: optimize the values of the parameters
▶ Recall: θ represents all parameters: the word vectors and the weights of the neural network, in one big vector
▶ To train the model, we gradually adjust the parameters to minimize the loss
▶ Each word has two vectors (one as center word, one as context word); this works better for training

L(θ) = −(1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

To compute parameter updates we need gradients such as:

∂/∂v_c log [ exp(u_o^⊤ v_c) / Σ_{w=1}^{V} exp(u_w^⊤ v_c) ]
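A sketch of a single term of this loss with a full softmax over a toy vocabulary; the arrays U (context/"outside" vectors) and V_mat (center vectors) are randomly initialized stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 5
U = rng.normal(size=(vocab_size, dim))      # u_w: context ("outside") vectors
V_mat = rng.normal(size=(vocab_size, dim))  # v_w: center vectors

def log_p_context_given_center(o, c):
    """log P(w_o | w_c; theta) with a full softmax over the vocabulary."""
    scores = U @ V_mat[c]                   # u_w^T v_c for every w
    scores -= scores.max()                  # numerical stability
    return scores[o] - np.log(np.exp(scores).sum())

# One term of L(theta): the negative log-likelihood of a context word
print(-log_p_context_given_center(o=3, c=7))
```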
24
Computing the Gradients wrt the center word vector
∂/∂v_c log [ exp(u_o^⊤ v_c) / Σ_{w=1}^{V} exp(u_w^⊤ v_c) ]
  = ∂/∂v_c (u_o^⊤ v_c) − ∂/∂v_c log Σ_{w=1}^{V} exp(u_w^⊤ v_c)

∂/∂v_c (u_o^⊤ v_c) = u_o
25
Computing the gradient wrt the center word vector (2/4)
26
Computing the gradient wrt the center word vector (3/4)
27
Computing the gradient wrt the center word vector (4/4)
Putting it all together
∂/∂v_c log p(o | c) = u_o − Σ_{x=1}^{V} [ exp(u_x^⊤ v_c) / Σ_{w=1}^{V} exp(u_w^⊤ v_c) ] u_x
                    = u_o − Σ_{x=1}^{V} p(x | c) u_x
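A small numpy check of this result, comparing the analytic gradient u_o − Σ_x p(x | c) u_x against a finite-difference estimate; all values here are random and illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 8, 4
U = rng.normal(size=(V, d))     # u_w: context vectors
v_c = rng.normal(size=d)        # v_c: center word vector
o = 2                           # index of the observed context word

def log_p(o, v_c):
    scores = U @ v_c
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o - sum_x p(x|c) u_x
p = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = U[o] - p @ U

# Central finite-difference estimate of the same gradient
eps = 1e-6
numeric = np.array([(log_p(o, v_c + eps * np.eye(d)[i]) - log_p(o, v_c - eps * np.eye(d)[i])) / (2 * eps)
                    for i in range(d)])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```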
29
GloVe Embeddings
30
Key Limitation of Word2Vec: Capturing cooccurrences
inefficiently
31
GloVe: Global Vectors for Word Representation
(Pennington et al., 2014)
Example Corpus:
32
GloVe: Global Vectors for Word Representation
(Pennington et al., 2014)
▶ V: vocabulary size
▶ X_ij: cooccurrence count of word i and word j
[Figure: excerpt from the GloVe paper discussing the exchange symmetry of its Eqn. (6), alongside a plot with y-axis values 0.6-1.0]
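For reference, the GloVe objective from Pennington et al. (2014) fits word-vector dot products to log co-occurrence counts with a weighted least-squares loss:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

Here f(X_ij) is a weighting function that limits the influence of very frequent co-occurrences.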
34
Evaluation Methods for Word Embeddings
Intrinsic:
▶ Directly evaluate the embeddings
▶ Fast to compute
▶ Not clear if really helpful unless correlation to real task is
established
Extrinsic:
▶ Evaluation on a real task (MT, QA, Parsing, Summarization,
etc.)
⊙ For every embedding method, retrain the model on the task
▶ Can take a long time to compute accuracy
35
Intrinsic word vector evaluation
Word Vector Analogies:
a:b :: c:?
man:woman :: king:?
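A sketch of answering such analogies by vector arithmetic (the word whose vector is closest to v_b − v_a + v_c); the embedding dictionary `emb` is a hypothetical stand-in for pre-trained vectors:

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word whose vector is closest (by cosine) to v_b - v_a + v_c."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue                       # exclude the query words themselves
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target) + 1e-9)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Usage with (hypothetical) pre-trained embeddings: analogy(emb, "man", "woman", "king")
```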
37
How can I use Word Embeddings for my task? (PA1)
Extrinsic Evaluation
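One common extrinsic use (e.g. for PA1) is to average pre-trained word vectors into a feature vector for a classifier; a minimal sketch, assuming `embeddings` maps words to numpy vectors of dimensionality `dim`:

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Average the embeddings of in-vocabulary tokens; zeros if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# The resulting vector can serve as the input x to the FFN classifier shown earlier.
```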
38
Hyperparameters in Word Embeddings
▶ Learning rate
▶ Training epochs
39
Limitation of Word Embeddings: conflated word senses
40
Out-of-vocabulary (OOV) words, or rare words: how do we learn
good embeddings for them?
Tokenization
41
What is tokenization?
42
Tokenizer
43
Tokenization: a source of some of the problems in LLMs
44
Problems with Whitespace Tokenization
45
GPT3 Tokenizer
https://platform.openai.com/tokenizer
46
GPT3.5/4 Tokenizer
Groups more white space into a single token, densifies Python code, can attend to more
code in a single pass
47
Big problem with word tokenization
48
Limitations of UNK
We lose information about the word that was replaced by <UNK>
E.g.
50
An alternative: character tokenization
51
Problem with character tokenization
52
Character, Word, and Subword Tokenization
53
2016: subword tokenization
Example from
https://huggingface.co/transformers/tokenizersummary.html
56
Byte Pair encoding
Next, count the frequency of each character pair in the data, and
choose the one that occurs most frequently
57
Byte Pair Encoding
Next, choose the most common pair (ug) and then merge the
characters together into one symbol. Add this new symbol to the
vocabulary. Then, retokenize the data
58
Byte Pair Encoding
59
Byte pair encoding
60
BPE
▶ Initialize vocabulary
with all characters in
the training data
▶ While vocab size <
max vocab size:
⊙ Count frequency
of all character
pairs
⊙ Merge most
frequent pair
⊙ Update vocabulary
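A compact sketch of this training loop; the toy corpus and vocabulary size are illustrative, and real implementations typically also track word frequencies and end-of-word markers:

```python
from collections import Counter

def train_bpe(corpus, max_vocab_size):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(w) for w in corpus.split()]          # start from characters
    vocab = set(ch for w in words for ch in w)          # init vocab with all characters
    merges = []
    while len(vocab) < max_vocab_size:
        pairs = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]             # most frequent pair
        merges.append((a, b))
        vocab.add(a + b)                                # add merged symbol to vocabulary
        # Retokenize: replace every occurrence of the pair with the merged symbol
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return vocab, merges

vocab, merges = train_bpe("hug hug hug pug pun bun hugs", max_vocab_size=15)
print(merges)   # ('u', 'g') is merged first in this toy corpus
```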
61
BPE Tutorial with Implementation
62
BPE impact on Machine Translation of English to German,
OOVs
63
Byte Pair Encoding
▶ GPT-2 uses bytes as the base vocabulary (size 256) and then
applies BPE on top of this sequence (with some rules to
prevent certain types of merges).
⊙ GPT-2 has a vocabulary size of 50,257, which corresponds to
the 256 bytes base tokens, a special end-of-text token and the
symbols learned with 50,000 merges.
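A short sketch of inspecting this vocabulary, assuming the Hugging Face transformers library is installed (the exact token split depends on the learned merges):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(len(tok))                                  # 50257: 256 byte tokens + 50,000 merges + end-of-text
print(tok.tokenize("Byte pair encoding handles rare words"))
print(tok.encode("Byte pair encoding"))          # token ids
```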
64
Other subword encoding schemes
65
Subword Considerations
66
Main takeaway: Tokenization
67
Questions?
68