02-Transformer Based NLP Applications

The document outlines a lecture on Transformer-based Natural Language Processing (NLP) by Moez Ben Haj Hmida, focusing on attention mechanisms, the Transformer architecture, and subword modeling. It discusses how attention solves the bottleneck problem in sequence-to-sequence models and the motivations behind the design of the Transformer architecture. It also covers pretraining methods that improve NLP applications through effective parameter initialization.

Mastère TICV

Transformer Based NLP Applications
Lecture 2

Moez BEN HAJ HMIDA


[email protected]
January, 2025
Outline

1. Attention

2. Transformers

3. Transformers architecture

4. Subword modeling

5. Pretraining

[Content adapted from CS224N: Natural Language Processing with Deep Learning, The Stanford NLP Group, Stanford]

2 January, 2025
Multi-layer deep encoder-decoder MT Net

[Figure: multi-layer deep encoder-decoder MT network, annotated "Conditioning bottleneck!"]

3 January, 2025
Seq-2-seq: the bottleneck problem

4 January, 2025
Attention
● Attention provides a solution to the bottleneck problem.

● Core idea: on each step of the decoder, use direct


connection to the encoder to focus on a particular part
of the source sequence

5 January, 2025
Sequence-to-sequence with attention
Core idea: on each step of the decoder, use direct connection to the encoder to focus
on a particular part of the source sequence

6 January, 2025
Sequence-to-sequence with attention
Core idea: on each step of the decoder, use direct connection to the encoder to focus
on a particular part of the source sequence

7 January, 2025
Sequence-to-sequence with attention
On this decoder timestep, we're mostly focusing on the first encoder hidden state ("he").

[Figure: a softmax over the attention scores gives the attention distribution]

8 January, 2025
Sequence-to-sequence with attention

9 January, 2025
Sequence-to-sequence with attention

10 January, 2025
Sequence-to-sequence with attention

11 January, 2025
Sequence-to-sequence with attention

12 January, 2025
Sequence-to-sequence with attention

13 January, 2025
Sequence-to-sequence with attention

14 January, 2025
Sequence-to-sequence with attention

15 January, 2025
Attention: in equations
● We have encoder hidden states $h_1, \ldots, h_N \in \mathbb{R}^h$
● On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
● We get the attention scores $e^t$ for this step:

  $e^t = [\,s_t^\top h_1, \ldots, s_t^\top h_N\,] \in \mathbb{R}^N$

● We take softmax to get the attention distribution $\alpha^t$ for this step
  (this is a probability distribution and sums to 1):

  $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$

● We use $\alpha^t$ to take a weighted sum of the encoder hidden states to
  get the attention output $a_t$:

  $a_t = \sum_{i=1}^{N} \alpha_i^t \, h_i \in \mathbb{R}^h$

● Finally, we concatenate the attention output $a_t$ with the decoder
  hidden state $s_t$ and proceed as in the non-attention seq2seq model:

  $[\,a_t ; s_t\,] \in \mathbb{R}^{2h}$
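A minimal NumPy sketch of these equations (not from the lecture; the dimensions and variable names are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max()                          # numerical stability
    e = np.exp(x)
    return e / e.sum()

# Toy dimensions (illustrative only)
N, h = 5, 8                                  # source length, hidden size
H = np.random.randn(N, h)                    # encoder hidden states h_1..h_N
s_t = np.random.randn(h)                     # decoder hidden state at step t

e_t = H @ s_t                                # attention scores e^t (dot products)
alpha_t = softmax(e_t)                       # attention distribution, sums to 1
a_t = alpha_t @ H                            # weighted sum of encoder states
decoder_input = np.concatenate([a_t, s_t])   # [a_t; s_t], used as in plain seq2seq
```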

16 January, 2025
Attention
● Significantly improves NMT performance
● Very useful to allow decoder to focus on certain parts of the
source
● Provides a more “human-like” model of the MT process
● The model can look back at the source sentence while
translating, rather than needing to remember it all
● Solves the bottleneck problem
● Allows decoder to look directly at source; bypass bottleneck

● Attention helps with the vanishing gradient problem


● Provides shortcut to faraway states

17 January, 2025
Attention: a general DL technique
● More general definition of attention:
● Given a set of vector values, and a vector query,
attention is a technique to compute a weighted sum of
the values, dependent on the query.

● We sometimes say that the query attends to the values.


● For example, in the seq2seq + attention model, each
decoder hidden state (query) attends to (pays attention
to) all the encoder hidden states (values).
● Intuition:
● The weighted sum is a selective summary of the
information contained in the values, where the query
determines which values to focus on.
● Attention is a way to obtain a fixed-size representation of
an arbitrary set of representations (the values),
dependent on some other representation (the query).
18 January, 2025
Transformers

19 January, 2025
Attention Is All You Need

20 January, 2025
Scaling Laws: Are Transformers All We Need?
● With Transformers, language modeling performance improves
smoothly as we increase model size, training data, and compute
resources in tandem.
● This power-law relationship has been observed over multiple
orders of magnitude with no sign of slowing.
● If we keep scaling up these models (with no change to the
architecture), could they eventually match or exceed human-level
performance?

Kaplan, Jared et al. “Scaling Laws for Neural Language Models.” ArXiv abs/2001.08361 (2020).

21 January, 2025
Motivation for Transformer Architecture
● The Transformers authors had 3 desiderata when
designing this architecture:
1. Minimize (or at least not increase) computational
complexity per layer.
2. Minimize path length between any pair of words to
facilitate learning of long-range dependencies.
3. Maximize the amount of computation that can be
parallelized.

[Vaswani et al., Attention Is All You Need, 2017]

22 January, 2025
Transformer Motivation
1. Computational Complexity Per Layer
When sequence length (n) << representation dimension (d), complexity per
layer is lower for a Transformer compared to the recurrent models

[Vaswani et al., Attention Is All You Need, 2017]

23 January, 2025
Transformer Motivation
2. Minimize Linear Interaction Distance
● RNNs are unrolled “left-to-right”.
● It encodes linear locality: a useful heuristic
■ Nearby words often affect each other’s
meanings.

● Problem: RNNs take O(sequence length) steps for


distant word pairs to interact.

24 January, 2025
Transformer Motivation
2. Minimize Linear Interaction Distance
● O(sequence length) steps for distant word pairs to interact
means:
● Hard to learn long-distance dependencies (because
gradient problems!)
● Linear order of words is “baked in”; we already know
sequential structure doesn't tell the whole story...

25 January, 2025
Transformer Motivation
3. Maximize Parallelizability
● Forward and backward passes have O(sequence length) non-parallelizable
operations
● GPUs (and TPUs) can perform many independent
computations at once
● But future RNN hidden states can’t be computed in full
before past RNN hidden states have been computed
● Inhibits training on very large datasets!
● Particularly problematic as sequence length increases, as
we can no longer batch many examples together due to
memory limitations

26 January, 2025
(Self) Attention
● Attention treats each word’s representation as a query to
access and incorporate information from a set of values.
● In NMT: attention from the decoder to the encoder in a
recurrent seq-2-seq model.
● Self-attention is encoder-encoder (or decoder-decoder)
attention where each word attends to each other word
within the input (or output).

27 January, 2025
Recurrence vs. Attention

[Figure: an RNN-based Encoder-Decoder Model vs. a Transformer Encoder-Decoder Model with Attention]

For the attention-based model:
● The number of unparallelizable operations does not increase with sequence length.
● Each word interacts with each other word, so the maximum interaction distance is O(1).

28 January, 2025
Transformer Architecture

29 January, 2025
Transformer Architecture

[Vaswani et al., 2017]

30 January, 2025
Intuition for Attention Mechanism
● Consider attention as an approximate hashtable:
● To look up a value, we compare a query against keys in a table.
● In a hashtable
■ Each query (hash) maps to exactly one key-value pair.
● In (self-)attention
■ Each query matches each key to varying degrees.
■ We return a sum of values weighted by the query-key match

[Figure: a query q compared against keys K0–K5 with values V0–V5; in a hashtable the query selects exactly one key-value pair, while in attention it matches every key to some degree]
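A tiny Python sketch of this contrast, with made-up keys, values, and query (the "hash lookup" is simply stood in for by picking the single best-matching key):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4
keys = np.random.randn(6, d)      # K_0..K_5
values = np.random.randn(6, d)    # V_0..V_5
q = np.random.randn(d)            # query

# Hashtable intuition: the query picks exactly one key and returns its value.
exact_hit = int(np.argmax(keys @ q))      # stand-in for an exact hash lookup
hard_result = values[exact_hit]

# Attention: the query matches every key to varying degrees; we return a
# sum of all values weighted by the query-key match.
weights = softmax(keys @ q)
soft_result = weights @ values
```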
31 January, 2025
Encoder: Self-Attention

[Figure: in self-attention, each word's query q is compared against the keys K0–K5 and values V0–V5 of all the words in the same sequence]

32 January, 2025
Encoder: Self-Attention Vectors

33 January, 2025
Feedforward layer
● Apply a feedforward layer to the output of attention,
● providing a non-linear activation.

● Why?
● Self-attention simply performs a re-averaging of the
value vectors.
■ We need a non-linearity to learn richer functions of the input.

● Equation for the feed-forward layer:

FFN(x) = ReLU(x W1 + b1) W2 + b2
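A minimal NumPy sketch of this position-wise feed-forward layer; the toy sizes are illustrative (the original Transformer uses d_model = 512 and d_ff = 2048):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer: FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    return relu(x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32                        # toy sizes
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
x = np.random.randn(5, d_model)              # one vector per position
out = ffn(x, W1, b1, W2, b2)                 # applied independently at each position
```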

34 January, 2025
Residual Connections
● Residual connections are a simple but powerful
technique from computer vision.
● Instead of learning a mapping f(x) directly, a layer
outputs x + f(x), so the "raw" input is added back in.
● Deep networks are surprisingly bad at learning
the identity function.
● Therefore, directly passing the "raw" embeddings to
the next layer can actually be very helpful.
● This prevents the network from "forgetting" or
distorting important information as it is processed
by many layers.

He et al., Deep Residual Learning for Image Recognition, 2016.
35 January, 2025
Layer Normalization
● Problem
● Difficult to train the parameters of a given layer
because its input from the layer beneath keeps
shifting.

● Solution
● Reduce variation by normalizing to zero mean and
a standard deviation of one within each layer.

Ba et al., Layer Normalization, 2016.
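A minimal NumPy sketch of the resulting "Add & Norm" pattern, combining the residual connection from the previous slide with layer normalization; the sublayer stand-in and sizes are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each vector to zero mean / unit variance, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d = 8
gamma, beta = np.ones(d), np.zeros(d)        # learned scale and shift
x = np.random.randn(5, d)                    # one vector per position

def sublayer(x):                             # stand-in for self-attention or FFN
    return x @ np.random.randn(d, d)

# "Add & Norm": residual connection around the sublayer, then LayerNorm.
out = layer_norm(x + sublayer(x), gamma, beta)
```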
36 January, 2025
Scaled Dot Product Attention
● After LayerNorm, the mean and variance of
vector elements is 0 and 1, respectively.
● However, the dot product still tends to take on
extreme values
● For large dimensions of keys (dk), the dot
products grow large in magnitude, pushing
the softmax function into regions where it
has extremely small gradients.
● To counteract this effect, we scale the dot
products by 1/√dk

Updated self-attention equation:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right) V$
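A minimal NumPy sketch of scaled dot-product attention under this equation; shapes and names are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # scale keeps logits moderate
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V

n, d_k, d_v = 5, 16, 16
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)
out = scaled_dot_product_attention(Q, K, V)               # shape (n, d_v)
```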

37 January, 2025
Positional Encoding
● Since self-attention doesn’t build in order
information, we need to encode the order of the
sentence in the keys, queries, and values.
● Consider representing each sequence index $i$ as a
vector $p_i \in \mathbb{R}^d$, for $i = 1, \ldots, T$

● Add the $p_i$ to the inputs: $\tilde{x}_i = x_i + p_i$
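A minimal NumPy sketch, assuming the sinusoidal position vectors used in Vaswani et al. (one common choice; the lecture does not prescribe a specific form for the p_i):

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Sinusoidal position vectors p_1..p_T (assumes an even dimension d)."""
    pos = np.arange(T)[:, None]                   # positions 0..T-1
    dim = np.arange(0, d, 2)[None, :]             # even dimensions
    angles = pos / np.power(10000.0, dim / d)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

T, d = 10, 8
X = np.random.randn(T, d)                         # token embeddings (illustrative)
X = X + sinusoidal_positions(T, d)                # add p_i to each input vector
```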

38 January, 2025
Multi-Headed Self-Attention
● Performs self-attention multiple times in
parallel and combines the results.
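A minimal NumPy sketch of multi-headed self-attention, using the common implementation trick of slicing one large projection into per-head chunks; all matrices and sizes are illustrative:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Run self-attention once per head on a slice of the projected vectors,
    then concatenate the heads and apply the output projection Wo."""
    n, d = X.shape
    d_head = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                 # per-head softmax
        heads.append(w @ V[:, sl])
    return np.concatenate(heads, axis=-1) @ Wo

n, d, n_heads = 5, 16, 4
X = np.random.randn(n, d)
Wq, Wk, Wv, Wo = (np.random.randn(d, d) for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads)
```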

39 January, 2025
Multi-Headed Self-Attention
Improves the performance of the attention layer:
● Expands the model’s ability to focus on
different positions
■ "The animal didn't cross the street
because it was too tired."
■ Which word does "it" refer to?
● Gives the attention layer multiple
representation subspaces
■ Multiple sets of Query/Key/Value
weight matrices
■ Each set is randomly initialized
■ Then, after training, each set is used
to project the input vectors into a
different representation subspace.

https://jalammar.github.io/illustrated-transformer/

40 January, 2025
Multi-Headed Self-Attention

https://jalammar.github.io/illustrated-transformer/

41 January, 2025
Decoder Attention
● When training with a language modeling objective,
the network must not see the next token it is meant
to generate.
⇒ Hide (mask) information about future tokens
from the model.
⇒ Mask out attention to future words by
setting their attention scores to −∞.
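A minimal NumPy sketch of this masking step, with random scores standing in for real query-key dot products:

```python
import numpy as np

n = 5
scores = np.random.randn(n, n)                    # raw attention scores e_ij

# Masked (causal) self-attention: position i may not attend to j > i,
# so those scores are set to -inf before the softmax.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Each row i now puts zero probability on future positions j > i.
```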

42 January, 2025
Encoder-Decoder Attention
● Encoder
● output vectors: $h_1, \ldots, h_T$
● keys: $k_i = K h_i$
● values: $v_i = V h_i$
● Decoder
● input vectors: $z_1, \ldots, z_T$
● queries: $q_i = Q z_i$
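A minimal NumPy sketch of encoder-decoder (cross-)attention under these definitions; the projection matrices and sizes are illustrative:

```python
import numpy as np

T, d = 6, 8
H = np.random.randn(T, d)          # encoder output vectors h_1..h_T
Z = np.random.randn(4, d)          # decoder input vectors z_1..z_4
K_w, V_w, Q_w = (np.random.randn(d, d) for _ in range(3))

K = H @ K_w                        # keys: linear projection of encoder outputs
V = H @ V_w                        # values: linear projection of encoder outputs
Q = Z @ Q_w                        # queries: linear projection of decoder vectors

scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V              # each decoder position reads from the source
```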

43 January, 2025
Decoder final layers
● A linear layer projects the decoder output into a much
longer vector whose length equals the vocabulary size.

● A softmax then turns these scores into a probability
distribution over possible next words.
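A minimal NumPy sketch of these final layers, with a made-up hidden size and vocabulary size (greedy argmax decoding is just one possible way to pick the next word):

```python
import numpy as np

d, vocab_size = 8, 10000
h_last = np.random.randn(d)                  # decoder output at the current position
W_vocab = np.random.randn(d, vocab_size)     # linear projection to vocabulary size

logits = h_last @ W_vocab                    # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax: distribution over next words
next_word = int(np.argmax(probs))            # e.g. greedy decoding
```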

44 January, 2025
Subword modeling

45 January, 2025
LM vocabulary
● We assume a fixed vocab of tens of thousands of
words, built from the training set.
● All novel words seen at test time are mapped to a
single UNK.

46 January, 2025
Subword modeling
● Subword modeling in NLP encompasses a wide range of
methods for reasoning about structure below the word level.
● The dominant modern paradigm is to learn a vocabulary of parts
of words (subword tokens).
● At training and testing time, each word is split into a sequence of
known subwords.
Byte-pair encoding strategy
1. Start with a vocabulary containing only characters and an
“end-of-word” symbol.
2. Using a corpus of text, find the most common adjacent characters
“a,b”; add “ab” as a subword.
3. Replace instances of the character pair with the new subword;
repeat until the desired vocab size is reached (a toy sketch is given below).

● Similar methods, WordPiece and SentencePiece, are used in
pretrained models such as BERT and GPT. [Sennrich et al., 2016; Wu et al., 2016]
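A toy Python sketch of the byte-pair encoding loop described above; the corpus and helper names are made up for illustration:

```python
from collections import Counter

def bpe_learn(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word is a tuple of characters ending with an end-of-word marker.
    words = Counter(tuple(w) + ("</w>",) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most common adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])   # replace pair by new subword
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

merges = bpe_learn(["low", "lower", "lowest", "newer", "wider"], num_merges=10)
```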
47 January, 2025
Subword modeling

48 January, 2025
Train on an NLP task
● Start with pretrained word embeddings (no context)
● Learn how to incorporate context in an LSTM or Transformer
while training on the task.
● Issues to consider
● The training data for our downstream task (e.g. sentiment
analysis) must be sufficient to teach all contextual aspects of
language.
● Most of the parameters in the network are randomly initialized.

49 January, 2025
Pretraining
● All parameters in the network are initialized via pretraining.
● Pretraining methods hide parts of the input from the model,
and train the model to reconstruct those parts.
● This has been exceptionally effective at building strong:
● Representations of language
● Parameter initializations for strong NLP models.
● Probability distributions over language that we can sample from

50 January, 2025
Pretraining through language modeling
● Language Modeling
● Model the probability distribution over words given their
past contexts.
● A large amount of training data is available on the Web.
● Pretraining through language modeling
● Train a neural network to perform language modeling on
a large amount of text.
● Save the network parameters.

51 January, 2025
Pretraining / Finetuning
● Pretraining can improve NLP applications by serving as
parameter initialization.

52 January, 2025
Pretraining encoders
● Encoders get bidirectional context, so we can’t do language
modeling.
● Idea: replace some fraction of words in the input with a
special [MASK] token; predict these words.
● Only add loss terms from words that are “masked out.”
● If $x_M$ is the masked version of $x$, we're learning $p_\theta(x \mid x_M)$. This is
called Masked LM.

[Devlin et al., 2018]


53 January, 2025
BERT: Bidirectional Encoder Representations from Transformers

● Devlin et al., 2018 proposed the “Masked LM” objective and


released the weights of a pretrained Transformer.
Masked LM for BERT:
● Predict a random 15% of (sub)word
tokens.
● Replace input word with [MASK]
80% of the time
● Replace input word with a random
token 10% of the time
● Leave input word unchanged 10%
of the time (but still predict it)

[Devlin et al., 2018]
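A toy Python sketch of this corruption scheme; the function name and the fallback vocabulary are made up for illustration:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None):
    """BERT-style corruption: pick 15% of tokens as prediction targets,
    then replace 80% with [MASK], 10% with a random token, 10% unchanged."""
    vocab = vocab or tokens                       # fallback: sample from the sentence
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < 0.15:
            targets[i] = tok                      # loss is computed only here
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: leave the token unchanged (but still predict it)
    return corrupted, targets

corrupted, targets = mask_tokens("the man went to the store".split())
```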


54 January, 2025
Limitations of pretrained encoders
BERT and other pretrained encoders don’t naturally lead to nice
autoregressive (1-word-at-a-time) generation methods.

55 January, 2025
Pretraining decoders
● When using language model pretrained decoders, we
can ignore that they were trained to model the
probability of the next token.
● We can finetune them by training a softmax classifier
on the last word’s hidden state.
● Gradients backpropagate through the whole network.
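A minimal NumPy sketch of such a finetuning head: a randomly initialized classifier applied to the last hidden state of a (pretend) pretrained decoder; sizes and names are illustrative:

```python
import numpy as np

d, num_classes = 8, 2                    # e.g. binary sentiment classification
h_last = np.random.randn(d)              # pretrained decoder's last hidden state

A = np.random.randn(d, num_classes)      # new, randomly initialized classifier
b = np.zeros(num_classes)

logits = h_last @ A + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # p(class | input); finetuning updates
                                         # A, b and the pretrained decoder weights
```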

56 January, 2025
Finetuning decoders

GPT: [Radford et al., 2018]


57 January, 2025
