
CSE 256 | Lecture 4:

Word Embeddings & Tokenization

Ndapa Nakashole

October 10, 2024

1
Today

1 Finish FeedForward Neural Networks and Text Classification


2 Word Embeddings
3 Tokenization

2
Recap: Gradient Descent: Iterative Improvement of W

▶ We have a cost function L(W) that we want to minimize
▶ Gradient Descent is an algorithm to minimize L(W)
▶ Idea: for the current value of W, calculate the gradient of L(W), then take a small step in the direction of the negative gradient. Repeat until convergence.

Gradient Descent Update Rule: Wt+1 = Wt − α ∇L(Wt)

▶ Three ways to compute the gradient


1. Numerically (finite differences)
2. Analytically (write down the gradient by hand)
3. Backpropagation (algorithm for computing the gradient)
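To make the update rule and the first two options concrete, here is a minimal sketch (not from the slides; the quadratic loss and step size are illustrative assumptions) that runs gradient descent with an analytic gradient and checks it against a finite-difference estimate:

```python
import numpy as np

def loss(W):
    # Toy quadratic cost L(W) = ||W - 3||^2 (illustrative, not the lecture's classifier loss)
    return np.sum((W - 3.0) ** 2)

def grad_analytic(W):
    # Gradient written down by hand for the toy loss
    return 2.0 * (W - 3.0)

def grad_numeric(W, eps=1e-5):
    # Finite-difference approximation, useful as a gradient check
    g = np.zeros_like(W)
    for i in range(W.size):
        e = np.zeros_like(W)
        e[i] = eps
        g[i] = (loss(W + e) - loss(W - e)) / (2 * eps)
    return g

W = np.zeros(4)
alpha = 0.1                                # learning rate (step size)
for t in range(100):
    W = W - alpha * grad_analytic(W)       # W_{t+1} = W_t - alpha * grad L(W_t)

print(W)                                   # converges toward the minimizer [3, 3, 3, 3]
print(np.allclose(grad_analytic(W), grad_numeric(W), atol=1e-4))
```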
3
Recap: A feedforward network = running several logistic
regressions at the same time

[Figure: feedforward network with inputs x1, x2, x3, x4 and Layers 1 through 5]

Allows us to re-represent and compose our data multiple times


and to learn a classifier that is highly non-linear in terms of
the original inputs (but, typically, is linear in terms of the pre-final layer representations)
4
Recap: Deep Averaging Networks (PA1)

▶ Deep Averaging Networks: feedforward neural network on


average of word embeddings from input text

Iyyer et al. (2015) 5


Text Classification with a FFN: Summary

Convert text to a vector, apply the FFN, pick the label with the highest score:

p = softmax(s)
s = W2 h
h = g(W1 x + b)
x (input)
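As a rough illustration of this pipeline (a sketch with made-up dimensions, random weights, and tanh as the nonlinearity g; not the PA1 model), the forward pass is a few lines of NumPy:

```python
import numpy as np

def softmax(s):
    s = s - s.max()                  # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_hid, n_labels = 300, 100, 2  # assumed sizes for illustration

W1 = rng.normal(scale=0.1, size=(d_hid, d_in))
b  = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(n_labels, d_hid))

x = rng.normal(size=d_in)            # input vector (e.g., averaged word embeddings)
h = np.tanh(W1 @ x + b)              # h = g(W1 x + b)
s = W2 @ h                           # s = W2 h
p = softmax(s)                       # p = softmax(s)

print(p, p.argmax())                 # predicted label = argmax of p
```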

6
Learning rate is a crucial hyperparameter

▶ Learning rate: Step size in gradient descent. Too small: slow


convergence. Too large: overshoots the minimum

7
Learning Rate Schedules: change the learning rate over time
Can start with a high learning rate and then decrease it over time to improve convergence

▶ Step: Reduce the learning rate at a few fixed points. E.g., for ResNets, multiply the LR by 0.1 after epochs 30, 60, and 90.
▶ Cosine: αt = (1/2) α0 (1 + cos(tπ/T))
⊙ α0: initial learning rate
⊙ αt: learning rate at epoch t
⊙ T: total number of epochs
8
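A minimal sketch of both schedules (the milestones and factor are the ResNet-style example from the slide; the function names are mine):

```python
import math

def step_lr(alpha0, epoch, milestones=(30, 60, 90), gamma=0.1):
    # Multiply the LR by gamma at each milestone epoch reached so far
    return alpha0 * gamma ** sum(epoch >= m for m in milestones)

def cosine_lr(alpha0, epoch, total_epochs):
    # alpha_t = 0.5 * alpha0 * (1 + cos(t * pi / T))
    return 0.5 * alpha0 * (1 + math.cos(epoch * math.pi / total_epochs))

for epoch in (0, 30, 60, 90):
    print(epoch, step_lr(0.1, epoch), round(cosine_lr(0.1, epoch, 90), 4))
```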
Where can I find text classification data?
More project guidance coming closer to the (ungraded) proposal deadline

See: https://huggingface.co/datasets 9
Text Classification: we have looked at

1 The Text Classification Problem


2 Example Applications
3 Models: Linear Classifiers, Feedforward Neural Networks

10
Learning Word Representations: Word Embeddings

11
1-hot Vectors

▶ One-hot vectors map objects/words into fixed-length vectors
▶ These vectors only contain the identity information of the object
▶ They do not contain any semantic information about the words: ⟨x, y⟩ = ⟨z, y⟩ = 0
▶ Bag of Words is a summation of 1-hot vectors, hence it also lacks semantic information
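A quick sketch of why one-hot vectors carry no similarity information (the tiny vocabulary and words are chosen arbitrarily for illustration):

```python
import numpy as np

vocab = ["movie", "film", "dog"]          # assumed tiny vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

x, y, z = one_hot("movie"), one_hot("film"), one_hot("dog")
# Every pair of distinct one-hot vectors has dot product 0,
# so "movie" looks no more similar to "film" than to "dog".
print(x @ y, z @ y)                       # 0.0 0.0
```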

12
Word Embeddings
Interesting result: word meaning can be represented by a vector of real numbers

▶ Want a vector space where similar words have similar embeddings
  e.g., "the movie was great" ≈ "the movie was good"; great ≈ good ≈ enjoyable, far from dog, is, bad
▶ Goal: come up with a way to produce these embeddings
▶ For each word, want a "medium"-dimensional vector (50-300 dims) representing it
13
Distributional Semantics
We will use the idea of distributional semantics to learn word embeddings

kombucha
I had a glass of kombucha at the farmers market

Children or pregnant women should not drink kombucha

▶ Distributional semantics: words that appear in similar


contexts have similar meanings
⊙ Can substitute “kombucha” with “beer” or “wine” in the above
sentences
▶ Idea: just look at a lot of usages of a word, and build up its
meaning from there!
▶ “You shall know a word by the company it keeps” Firth (1957)

14
Kombucha

15
Word Vectors

▶ We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts, measuring similarity as the vector dot (scalar) product

banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
monetary = [0.413, 0.582, −0.007, 0.247, 0.216, −0.718, 0.147, 0.051]

Note: word vectors are also called (word) embeddings or (neural) word representations. They are a distributed representation.
16
▶ Word2Vec (Mikolov et al., 2013): simpler and faster than
previous models
⊙ Two algorithms: Skip-gram and Continuous Bag of Words
(CBOW)

⊙ They are similar, we will focus on Skip-gram

Skip-gram: Learning Word Embeddings (Mikolov et al., 2013)

▶ Input: a corpus of raw text (e.g. all of Wikipedia, news articles, books, etc.), the same as the input to "real" language modeling
▶ Output: a set of embeddings: a real-valued vector for each word in the vocabulary
▶ We are going to learn these by setting up a fake prediction problem: predict a word's context from that word

the dog bit the man

(word = bit, context = dog)
(word = bit, context = the)
18
Example: Center is “into”, predict context words
Context Window Size = 2
Example windows and process for computing P(wt+j | wt):
P(wt−2 | wt), P(wt−1 | wt), P(wt+1 | wt), P(wt+2 | wt)

… problems turning into banking crises as …

outside context words center word outside context words


in window of size 2 at position t in window of size 2

19
Example: Center is "banking", predict context words
Context Window Size = 2

Example windows and process for computing P(wt+j | wt):
P(wt−2 | wt), P(wt−1 | wt), P(wt+1 | wt), P(wt+2 | wt)

… problems turning into banking crises as …

outside context words center word outside context words


in window of size 2 at position t in window of size 2

▶ We do this for every word in the corpus, and learn the embeddings in the process
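A small sketch of how the (center, context) training pairs could be generated with a window of size 2; this is a simplified illustration, not the course's reference implementation:

```python
def skipgram_pairs(tokens, window=2):
    # For each center position t, pair w_t with each w_{t+j}, -window <= j <= window, j != 0
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

sent = "problems turning into banking crises as".split()
for center, context in skipgram_pairs(sent):
    if center == "banking":
        print(center, "->", context)
# prints: banking -> turning, banking -> into, banking -> crises, banking -> as
```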

20
Skip-gram Loss Function

For each position t = 1, . . . , T, predict context words within a window of fixed size m, given center word wt:

Data Likelihood = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(wt+j | wt; θ)

where θ are all the parameters of the model.

We can write the loss as:

L(θ) = − (1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)

21
Skip-gram: loss function

We want to minimize the loss function:

L(θ) = − (1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)

Question: How do we calculate P(wt+j | wt; θ)?
Answer: Use the softmax function!

We will use two vectors per word w:
▶ vw when w is a center word
▶ uw when w is a context word

For a center word c and a context word o:

P(o | c) = exp(uo⊤ vc) / ∑_{w∈V} exp(uw⊤ vc)
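A sketch of this softmax over the vocabulary, with random vectors standing in for the learned u and v (the vocabulary size, dimension, and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                      # assumed vocabulary size and embedding dimension
U  = rng.normal(size=(V, d))      # u_w: context-word vectors (one row per word)
Vc = rng.normal(size=(V, d))      # v_w: center-word vectors

def p_context_given_center(o, c):
    scores = U @ Vc[c]            # u_w^T v_c for every w in the vocabulary
    scores -= scores.max()        # stabilize before exponentiating
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]               # P(o | c)

print(p_context_given_center(o=3, c=5))
```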

22
Training Word Vectors: minimize the loss with gradient descent

▶ To train the model, we optimize the values of the parameters: we gradually adjust the parameters to minimize the loss
▶ Recall: θ represents all model parameters in one long vector: the word vectors and the weights of the neural network
▶ In our case, with d-dimensional vectors and V-many words, we have 2dV parameters
▶ Remember: every word has two vectors; this works better for training
23
Computing the Gradients

L(θ) = − (1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)

P(wt+j | wt; θ) is the softmax function. For a center word c and a context word o:

P(o | c) = exp(uo⊤ vc) / ∑_{w∈V} exp(uw⊤ vc)

We need the gradient:

∂/∂vc log [ exp(uo⊤ vc) / ∑_{w=1}^{V} exp(uw⊤ vc) ]

24
Computing the Gradients wrt the center word vector

∂/∂vc log [ exp(uo⊤ vc) / ∑_{w=1}^{V} exp(uw⊤ vc) ]
= ∂/∂vc log exp(uo⊤ vc) − ∂/∂vc log ∑_{w=1}^{V} exp(uw⊤ vc)

Step 1: Differentiate the First Term

∂/∂vc log exp(uo⊤ vc) = ∂/∂vc (uo⊤ vc) = uo

25
Gradient wrt the center word vector (2/4)

Step 2: Differentiate the Second Term

log ∑_{w=1}^{V} exp(uw⊤ vc)

Use the chain rule; the derivative of log(z) wrt z is 1/z:

∂/∂vc log ∑_{w=1}^{V} exp(uw⊤ vc)
= [1 / ∑_{w=1}^{V} exp(uw⊤ vc)] · ∂/∂vc ∑_{x=1}^{V} exp(ux⊤ vc)

26
Gradient wrt the center word vector (3/4)

Step 2: Differentiate the Second Term (continued)

∂/∂vc ∑_{x=1}^{V} exp(ux⊤ vc)

Move the derivative inside the sum:

∂/∂vc ∑_{x=1}^{V} exp(ux⊤ vc)
= ∑_{x=1}^{V} ∂/∂vc exp(ux⊤ vc)
= ∑_{x=1}^{V} exp(ux⊤ vc) ux

27
Gradient wrt the center word vector (4/4)
Putting it all together

∂/∂vc log p(o | c) = uo − [1 / ∑_{w=1}^{V} exp(uw⊤ vc)] · ∑_{x=1}^{V} exp(ux⊤ vc) ux

Distribute the term across the sum:

= uo − ∑_{x=1}^{V} [exp(ux⊤ vc) / ∑_{w=1}^{V} exp(uw⊤ vc)] ux
= uo − ∑_{x=1}^{V} P(x | c) ux
= observed context vector − expected context vector

Thus the center word is pulled towards words that are observed in its context, and away from those that are not, i.e.
vc(new) = vc(old) + observed − expected
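The same result in code, using random toy vectors (the same assumed setup as the earlier softmax sketch); this is for checking the algebra, not a training loop:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8
U  = rng.normal(size=(V, d))      # context vectors u_w
Vc = rng.normal(size=(V, d))      # center vectors v_w

def grad_log_p_wrt_vc(o, c):
    scores = U @ Vc[c]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()          # P(x | c) for all x
    # d/dv_c log P(o | c) = u_o - sum_x P(x | c) u_x
    return U[o] - probs @ U       # observed - expected context vector

print(grad_log_p_wrt_vc(o=3, c=5))
```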
28
Gradient wrt the context word vector

▶ It is similar to the gradient wrt the center word vectors


▶ At home: derive the gradient wrt the context word vector

29
GloVe Embeddings

30
Key Limitation of Word2Vec: Capturing cooccurrences
inefficiently

▶ Go through each word of the whole corpus

▶ Predict surrounding words of each (window’s center) word

▶ This captures cooccurrence of words one at a time

▶ GloVe: why not capture cooccurrence counts directly?

31
GloVe: Global Vectors for Word Representation (Pennington et al., 2014)
Example Corpus:

I like deep learning .


I like NLP .
I enjoy flying .
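A sketch of building the word-word cooccurrence counts Xij for this corpus; the symmetric window of size 1 is an assumption for illustration:

```python
from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]

X = defaultdict(int)                       # X[(i, j)] = cooccurrence count of words i and j
for line in corpus:
    toks = line.split()
    for t, w in enumerate(toks):
        for j in (t - 1, t + 1):           # window of size 1 on each side
            if 0 <= j < len(toks):
                X[(w, toks[j])] += 1

print(X[("I", "like")], X[("like", "deep")], X[("deep", "learning")])
# 2 1 1
```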

32
GloVe: Global Vectors for Word Representation (Pennington et al., 2014)

GloVe loss function:

L(θ) = (1/2) ∑_{i,j=1}^{V} f(Xij) (ui⊤ vj − log Xij)²

▶ V: vocabulary size
▶ Xij: cooccurrence count of words i and j
▶ ui, vj: word vectors
▶ f(Xij): weighting function to reduce the influence of very frequent or very rare word pairs

[Excerpt from the GloVe paper (derivation with bias terms, Eqn. (7): wi⊤w̃k + bi + b̃k = log Xik) and its Figure 1: the weighting function f with α = 3/4 and cutoff xmax = 100]
33
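A sketch of evaluating this loss for given vectors and counts; the weighting function follows the paper's f(x) = (x/xmax)^α for x < xmax, else 1, with α = 3/4 and xmax = 100, and everything else here is toy data:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    # Weighting function: down-weights rare pairs, caps very frequent ones
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(U, Vv, X):
    # L = 1/2 * sum_ij f(X_ij) * (u_i^T v_j - log X_ij)^2, over observed pairs
    total = 0.0
    for (i, j), count in X.items():
        diff = U[i] @ Vv[j] - np.log(count)
        total += 0.5 * f(count) * diff ** 2
    return total

rng = np.random.default_rng(0)
Vsize, d = 5, 4                              # toy vocabulary size and dimension
U  = rng.normal(scale=0.1, size=(Vsize, d))  # u_i vectors
Vv = rng.normal(scale=0.1, size=(Vsize, d))  # v_j vectors
X = {(0, 1): 12, (1, 0): 12, (2, 3): 3}      # toy cooccurrence counts
print(glove_loss(U, Vv, X))
```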
Evaluation of Word Embeddings: Are my word embeddings any good?

34
Evaluation Methods for Word Embeddings

Intrinsic:
▶ Directly evaluate the embeddings
▶ Fast to compute
▶ Not clear if really helpful unless correlation to real task is
established

Extrinsic:
▶ Evaluation on a real task (MT, QA, Parsing, Summarization,
etc.)
⊙ For every embedding method, retrain the model on the task
▶ Can take a long time to compute accuracy

35
Intrinsic word vector evaluation

Word Vector Analogies:

a:b :: c:?

man:woman :: king:?

▶ Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
▶ Discarding the input words from the search (!)
▶ Problem: What if the information is there but not linear?
36
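A sketch of the analogy test with cosine similarity, discarding the three input words from the search. The embeddings here are random placeholders, so the output is meaningless; with real GloVe vectors the top hit for man:woman :: king:? is typically queen.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "banana"]
E = {w: rng.normal(size=50) for w in vocab}     # placeholder embeddings

def analogy(a, b, c):
    # Solve a:b :: c:? via argmax cosine(x_b - x_a + x_c, x_w), discarding a, b, c
    target = E[b] - E[a] + E[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(target, E[w]))

print(analogy("man", "woman", "king"))
```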
GloVe Visualization

37
How can I use Word Embeddings for my task? (PA1)
Extrinsic Evaluation

▶ Approach 1: Learn embeddings from scratch as parameters


from your data: Often works pretty well

▶ Approach 2: Initialize using pretrained word embeddings,


keep fixed. Faster because no need to update these parameters

▶ Approach 3: initialize using pretrained word embeddings,


fine-tune on task. Often works best

38
Hyperparameters in Word Embeddings

▶ Dimensionality: 50-300 dimensions

▶ Window size: 5-10 words

▶ Initialization of word embeddings

▶ Learning rate

▶ Training epochs

39
Limitation of Word Embeddings: conflated word senses

▶ One word, one vector: conflated word senses (polysemy)

▶ E.g., “bank” in “river bank” vs. “bank account” or “rock” in


“rock music” vs. “rock climbing”, or “apple” in “apple fruit” vs.
“Apple Inc.” or “bass” in “bass guitar” vs. “bass fish”, ...

▶ Solution: contextual embeddings coming later (ELMo,


BERT, GPT-2)

40
Out-of-vocabulary (OOV) words, or rare words: how do we learn
good embeddings for them?

Tokenization

41
What is tokenization?

▶ Token: basic processing unit in NLP models

▶ Tokenization is the process of breaking up text into tokens


(words, or subwords, etc.), which then are converted to IDs
through a look-up table.

⊙ So far in this class... the process has been simple: splitting a


string into words based on spaces and punctuation
⊙ The quick brown fox jumps over the lazy dog
⊙ 11 298 34 567 432 13 11 49 305

This tokenization step requires an external tokenizer to detect word


boundaries!

42
Tokenizer

▶ Tokenizer is independent of the model, it is a separate


module with its own training data

▶ The tokenizer is a translation layer between the raw text


and token IDs
⊙ Encode: raw text → token IDs
⊙ Decode: token IDs → raw text
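A minimal sketch of such an encode/decode layer for the naive whitespace setting shown earlier; the class name and vocabulary are made up for illustration, and real tokenizers are far more involved:

```python
class WordTokenizer:
    def __init__(self, vocab):
        self.id_of = {w: i for i, w in enumerate(vocab)}
        self.word_of = {i: w for w, i in self.id_of.items()}

    def encode(self, text):
        # raw text -> token IDs (naive lowercasing + whitespace splitting)
        return [self.id_of[w] for w in text.lower().split()]

    def decode(self, ids):
        # token IDs -> raw text
        return " ".join(self.word_of[i] for i in ids)

tok = WordTokenizer(["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"])
ids = tok.encode("The quick brown fox jumps over the lazy dog")
print(ids)
print(tok.decode(ids))
# Note: any word not in the vocabulary raises a KeyError here,
# which is exactly the OOV problem discussed later.
```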

43
Tokenization: source of some of the problems in LLMs

▶ Python Processing in GPT-2: affected significantly by


tokenization approach

▶ Math processing: Tokenization can lead to unexpected


results when numbers are split into multiple tokens

▶ Other sources of quirks: punctuation, contractions,


hyphenated words, numbers, etc.

44
Problems with Whitespace Tokenization

Mr.O’Neill thinks that the boys’ stories about San Francisco


aren’t amusing.

▶ Whitespace tokenizer failures


⊙ prize-winning ⇒ prize , - , winning
⊙ 1850s ⇒ 1850 , s
⊙ U.K. ⇒ U , . , K ,
⊙ 2.5 ⇒ 2 , . , 5

▶ Word tokenizers require lots of specialized rules about how to


handle specific inputs
⊙ Check out spaCy’s tokenizers! (https://spacy.io/)

45
GPT3 Tokenizer
https://platform.openai.com/tokenizer

46
GPT3.5/4 Tokenizer
Groups more white space into a single token, densifies Python code, can attend to more
code in a single pass

47
Big problem with word tokenization

Problem: What happens when we encounter a word at test time


that we’ve never seen in our training data?
▶ We can’t assign an index to it! We don’t have a word
embedding for that word!

Solution: replace low-frequency words in training data with a


special <UNK> token, use this token to handle unseen words at
test time too

48
Limitations of UNK
We lose information about the word that was replaced by <UNK>

E.g.

▶ The chapel is sometimes referred to as "Hen Gapel


Lligwy" ("hen" being the Welsh word for "old" and
"capel" meaning "chapel").
▶ The chapel is sometimes referred to as " Hen <UNK>
<UNK> " (" hen " being the Welsh word for " old "
and " <UNK> " meaning " chapel ").

▶ We don’t want to generate UNK when generating text


(imagine ChatGPT doing this)
▶ In languages with productive morphology, lots of long words
are formed by composing smaller pieces (e.g., German, Finnish,
Turkish), so removing rare words can lead to a significant loss of
information
49
Other limitations

Word-level tokenization treats different forms of the same word


(e.g., "open", "opened", "opens", "opening", etc) as separate
types → separate embeddings for each

▶ When words are related to each other, we’d like to share


information between them
▶ Despite these limitations, word-level tokenization was the
approach until recently (circa 2016)

50
An alternative: character tokenization

▶ Small vocabulary, just the number of unique characters in the


training data!

▶ However, we end up with longer input sequences



...
▶ E.g., two sentences might be three tokens apart in a
word model, but 30 tokens apart in a character model, e.g.,
"the children are happy" vs. "t h e", "c h i l d r e n", "a r e", "h a p p y"

51
Problem with character tokenization

▶ Character-level models must discover that words exist and


are delimited by spaces — this information is built into
word-based models

52
Character, Word, and Subword Tokenization

▶ Character: Learning a meaningful context-independent


representation for the letter "t" is much harder than learning a
context-independent representation for the word "today"

▶ Word: Big vocabulary size forces model to use huge


embedding matrix at input & output layer → increased
memory and compute time

▶ Hybrid: between word-level and character-level tokenization:


subword tokenization
⊙ Allows the model to have a reasonable vocabulary size while
being able to learn meaningful context-independent
representations

⊙ Enables the model to process words it has never seen


before, by decomposing them into known subwords

53
2016: subword tokenization

▶ Subword tokenization: break words into multiple word pieces


⊙ Generally want more frequent words to be represented by fewer
tokens, e.g, "the" should be a single token
⊙ Can handle rare words better than word-level tokenization
⊙ Can share parameters between related words (e.g., "open",
"opened", "opens", "opening")
⊙ Reduce vocabulary size → reduce num parameters,
compute+memory

▶ Developed for machine translation by Sennrich et al. (ACL
2016). Later used in BERT, T5, RoBERTa, GPT, etc.

▶ Relies on a simple algorithm called Byte Pair Encoding (Gage,


1994)
⊙ Iteratively merges the most frequent pair of consecutive
characters, until a fixed vocabulary size is reached
54
Work in MT led to strong subword tokenization

MT also led to strong sequence encoders, the attention-based models -


Transformers! 55
Byte Pair Encoding

Form base vocabulary (all characters that occur in the training


data)
word frequency
hug 10
pug 5
pun 12
bun 4
hugs 5
Base vocab: b, g, h, n, p, s, u

Example from
https://huggingface.co/transformers/tokenizer_summary.html 56
Byte Pair Encoding

Next, count the frequency of each character pair in the data, and
choose the one that occurs most frequently

word frequency character pair frequency


h+u+g 10 ug 20
p+u+g 5 pu 17
p+u+n 12 un 16
b+u+n 4 hu 15
h+u+g+s 5 gs 5

57
Byte Pair Encoding

Next, choose the most common pair (ug) and then merge the
characters together into one symbol. Add this new symbol to the
vocabulary. Then, retokenize the data

word frequency character pair frequency


h+ug 10 un 16
p+ug 5 h + ug 15
p+u+n 12 pu 12
b+ u+ n 4 p + ug 5
h+ ug+s 5 ug + s 5
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]

58
Byte Pair Encoding

Keep repeating this process! This time we choose un to merge,


next time we choose h + ug, etc.
word frequency character pair frequency
h + ug 10 un 16
p + ug 5 h + ug 15
p+u+n 12 pu 12
b+u+n 4 p + ug 5
h + ug + s 5 ug + s 5

59
Byte Pair Encoding

Eventually, after a fixed number of merge steps, we stop


word frequency
hug 10
p+ug 5
p+un 12
b+un 4
hug +s 5
new vocab: b, g, h, n, p, s, u, ug, un, hug

60
BPE

▶ Initialize the vocabulary with all characters in the training data
▶ While vocab size < max vocab size:
⊙ Count the frequency of all character pairs
⊙ Merge the most frequent pair
⊙ Update the vocabulary
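A compact character-level sketch of this loop on the toy corpus from the previous slides. It is only an illustration: the tutorial and PA1 work at the byte level, and this is not the reference implementation.

```python
from collections import Counter

def bpe_train(word_freqs, num_merges):
    # word_freqs: {word: frequency}; each word starts as a tuple of characters
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    vocab = sorted({ch for w in corpus for ch in w})
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merges.append(best)
        vocab.append("".join(best))
        # Retokenize: replace every occurrence of the best pair with the merged symbol
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append("".join(best))
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = bpe_train({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, num_merges=3)
print(vocab)   # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```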

61
BPE Tutorial with Implementation

Refer to this tutorial for implementing BPE tokenization. Note that it works on byte pairs, not character pairs, but the idea is the same. For PA1, you need to implement both the encoding and decoding methods.
Tutorial video on tokenization
https://www.youtube.com/watch?v=zduSFxRajkE&t=20s&ab_channel=AndrejKarpathy

62
BPE impact on Machine Translation of English to German,
OOVs

63
Byte Pair Encoding

▶ To avoid <UNK>, all possible characters / symbols need to


be included in the base vocab. This can be a lot if including all
unicode characters (there are ∼ 150 K unicode symbols)!

▶ GPT-2 uses bytes as the base vocabulary (size 256) and then
applies BPE on top of this sequence (with some rules to
prevent certain types of merges).
⊙ GPT-2 has a vocabulary size of 50,257, which corresponds to
the 256 byte-level base tokens, a special end-of-text token, and the
symbols learned with 50,000 merges.

▶ Common vocabulary sizes: 32 K to 64 K tokens

64
Other subword encoding schemes

▶ WordPiece (Schuster et al., ICASSP 2012): merge by


likelihood as measured by language model, not by frequency

▶ SentencePiece (Kudo et al., 2018): can do subword


tokenization without pretokenization (good for languages that
don’t always separate words w/ spaces), although
pretokenization usually improves performance
⊙ It works directly on text stream without notion of individual
words

65
Subword Considerations

▶ Multilingual Models: Subword models are hard to use


multilingually because they will over-segment less common
languages naively (Ács 2019)

▶ Work-around: Upsample less represented languages in the


training data, or use a language-specific tokenizer

66
Main takeaway: Tokenization

▶ All pre-trained models use some kind of subword


tokenization with a tuned vocabulary, usually less than 50k (but
can be around 250k pieces for multilingual models)

▶ Still a number of heuristics are needed ...


⊙ Adjustments like restricting merges across categories are
crucial for optimizing performance and vocabulary usage.

▶ Looking forward: learn tokenization jointly with the model


(already some work on this, but still in early stages)

67
Questions?

68
