02-Transformer Based NLP Applications
1. Attention
2. Transformers
3. Transformers architecture
4. Subword modeling
5. Pretraining
Multi-layer deep encoder-decoder MT network
[Figure: the decoder is conditioned on the encoder's final hidden state, which acts as a bottleneck.]
Seq-2-seq: the bottleneck problem
● The single final encoder hidden state must capture all information about the source sentence before the decoder can start generating.
Attention
● Attention provides a solution to the bottleneck problem.
Sequence-to-sequence with attention
Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
Sequence-to-sequence with attention
On this decoder timestep, we're mostly focusing on the first encoder hidden state ("he").
Attention: in equations
● We have encoder hidden states h_1, …, h_N ∈ ℝ^h
● On timestep t, we have decoder hidden state s_t ∈ ℝ^h
● We get the attention scores e^t for this step:
  e^t = [s_tᵀh_1, …, s_tᵀh_N] ∈ ℝ^N
● We take a softmax to get the attention distribution α^t = softmax(e^t) ∈ ℝ^N
● We use α^t to compute a weighted sum of the encoder hidden states:
  a_t = Σ_i α^t_i h_i ∈ ℝ^h
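A minimal NumPy sketch of these equations, assuming simple dot-product scoring; the function and variable names are illustrative:

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Compute the attention output for one decoder timestep.

    decoder_state:  s_t, shape (h,)
    encoder_states: h_1..h_N stacked as rows, shape (N, h)
    """
    # Attention scores e^t = [s_t . h_1, ..., s_t . h_N]
    scores = encoder_states @ decoder_state             # (N,)
    # Attention distribution alpha^t = softmax(e^t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # (N,)
    # Attention output a_t = sum_i alpha^t_i * h_i
    attn_output = weights @ encoder_states              # (h,)
    return attn_output, weights

# Toy usage: 5 source positions, hidden size 8
rng = np.random.default_rng(0)
h = rng.standard_normal((5, 8))
s = rng.standard_normal(8)
a, alpha = dot_product_attention(s, h)
print(alpha.round(3), a.shape)
```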
Attention
● Significantly improves NMT performance
  ■ Very useful to allow the decoder to focus on certain parts of the source
● Provides a more "human-like" model of the MT process
  ■ The model can look back at the source sentence while translating, rather than needing to remember it all
● Solves the bottleneck problem
  ■ Allows the decoder to look directly at the source, bypassing the bottleneck
Attention: a general DL technique
● More general definition of attention:
● Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
Attention Is All You Need (Vaswani et al., 2017)
Scaling Laws: Are Transformers All We Need?
● With Transformers, language modeling performance improves
smoothly as we increase model size, training data, and compute
resources in tandem.
● This power-law relationship has been observed over multiple
orders of magnitude with no sign of slowing.
● If we keep scaling up these models (with no change to the
architecture), could they eventually match or exceed human-level
performance?
Kaplan, Jared et al. “Scaling Laws for Neural Language Models.” ArXiv abs/2001.08361 (2020).
Motivation for Transformer Architecture
● The Transformer's authors had three desiderata when designing the architecture:
1. Minimize (or at least not increase) computational
complexity per layer.
2. Minimize path length between any pair of words to
facilitate learning of long-range dependencies.
3. Maximize the amount of computation that can be
parallelized.
Transformer Motivation
1. Computational Complexity Per Layer
When the sequence length (n) is much smaller than the representation dimension (d), complexity per layer is lower for a Transformer (self-attention costs O(n²·d) per layer) than for recurrent models (O(n·d²) per layer).
Transformer Motivation
2. Minimize Linear Interaction Distance
● RNNs are unrolled "left-to-right".
● This encodes linear locality, a useful heuristic:
  ■ Nearby words often affect each other's meanings.
Transformer Motivation
2. Minimize Linear Interaction Distance
● O(sequence length) steps for distant word pairs to interact means:
  ■ It is hard to learn long-distance dependencies (because of gradient problems!)
  ■ The linear order of words is "baked in"; we already know sequential structure doesn't tell the whole story...
Transformer Motivation
3. Maximize Parallelizability
● Forward and backward passes have O(sequence length) non-parallelizable operations
  ■ GPUs (and TPUs) can perform many independent computations at once
  ■ But future RNN hidden states can't be computed in full before past RNN hidden states have been computed
  ■ This inhibits training on very large datasets!
  ■ Particularly problematic as sequence length increases, since we can no longer batch many examples together due to memory limitations
(Self) Attention
● Attention treats each word's representation as a query to access and incorporate information from a set of values.
● In NMT: attention from the decoder to the encoder in a recurrent seq-2-seq model.
● Self-attention is encoder-encoder (or decoder-decoder) attention, where each word attends to every other word within the input (or output).
Recurrence vs. Attention
[Figure: an RNN-based encoder-decoder model compared with a Transformer encoder-decoder model with attention.]
● Number of unparallelizable operations does not increase with sequence length.
● Each "word" interacts with every other word, so the maximum interaction distance is O(1).
Transformer Architecture
Intuition for Attention Mechanism
● Consider attention as an approximate hashtable:
● To look up a value, we compare a query against keys in a table.
● In a hashtable
■ Each query (hash) maps to exactly one key-value pair.
● In (self-)attention
■ Each query matches each key to varying degrees.
■ We return a sum of values weighted by the query-key match
[Figure: a hashtable maps the query q to exactly one of the key-value pairs (K0,V0)…(K5,V5); attention matches q against every key and returns a weighted sum of all the values.]
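A small sketch of this contrast with made-up keys, values, and a query; the hashtable side picks the single best-matching key, while the attention side blends all values:

```python
import numpy as np

keys   = np.array([[1., 0.], [0., 1.], [1., 1.]])    # K0..K2
values = np.array([[10., 0.], [0., 10.], [5., 5.]])  # V0..V2
query  = np.array([0.9, 0.1])

# Hashtable-style lookup: the query selects exactly one key-value pair.
exact = values[np.argmax(keys @ query)]

# Attention-style lookup: the query matches every key to some degree,
# and we return the values weighted by those (softmaxed) matches.
scores  = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()
soft    = weights @ values

print(exact)   # one value row
print(soft)    # a blend of all value rows
```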
Encoder: Self-Attention
Encoder: Self-Attention Vectors
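The figure for this slide is not reproduced here; as a stand-in, a minimal single-head sketch where each input vector is projected into query, key, and value vectors with (randomly initialized) matrices, and each position's output is an attention-weighted sum over all positions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # query/key/value vectors per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (T, T) query-key matches
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # each output: weighted sum of values

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.standard_normal((T, d))                  # stand-in input embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (4, 8)
```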
Feedforward layer
● Apply a feedforward layer to the output of attention, providing a non-linear activation.
● Why?
  ■ Self-attention on its own simply performs a re-averaging of the value vectors.
  ■ We need a non-linearity to learn more complex functions of the input.
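A sketch of such a position-wise feedforward layer, assuming a two-layer network with a ReLU non-linearity (dimensions and weights are illustrative):

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: applied to each position's vector independently."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
T, d, d_ff = 4, 8, 32                        # the inner dimension d_ff is typically larger than d
X = rng.standard_normal((T, d))              # e.g. the output of self-attention
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
print(feed_forward(X, W1, b1, W2, b2).shape)  # (4, 8)
```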
Residual Connections
● Residual connections are a simple but powerful technique from computer vision.
● Deep networks are surprisingly bad at learning the identity function.
● Therefore, directly passing the "raw" embeddings to the next layer can actually be very helpful.
● Layer normalization: reduce uninformative variation by normalizing each vector to zero mean and a standard deviation of one within each layer.
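A minimal sketch of the usual pattern, assuming the sublayer output is added back to its input and then layer-normalized (the learned gain and bias of layer norm are omitted, and the sublayer here is a stand-in function):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    """Add the sublayer output to its "raw" input, then normalize."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = residual_block(x, lambda v: v @ rng.standard_normal((8, 8)))
print(out.mean(axis=-1).round(6), out.shape)   # per-position means are ~0
```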
Positional Encoding
● Since self-attention doesn't build in order information, we need to encode the order of the sentence in the keys, queries, and values.
● Consider representing each sequence index as a vector and adding it to the word embedding at that position.
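One common choice, used in the original Transformer paper, is a fixed sinusoidal vector per position; a minimal sketch (the embedding values below are stand-ins):

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Position i, dimension 2k uses sin(i / 10000^(2k/d)); dimension 2k+1 uses cos."""
    positions = np.arange(T)[:, None]                  # (T, 1)
    dims = np.arange(0, d, 2)[None, :]                 # (1, d/2)
    angles = positions / (10000.0 ** (dims / d))       # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the word embeddings so self-attention can use order information.
word_embeddings = np.zeros((6, 16))                    # stand-in embeddings
X = word_embeddings + sinusoidal_positions(6, 16)
print(X.shape)
```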
Multi-Headed Self-Attention
● Performs self-attention multiple times in
parallel and combines the results.
Multi-Headed Self-Attention
Improves the performance of the attention layer:
● Expands the model's ability to focus on different positions
  ■ "The animal didn't cross the street because it was too tired."
  ■ Which word does "it" refer to?
● Gives the attention layer multiple representation subspaces
  ■ Multiple sets of Query/Key/Value weight matrices
  ■ Each set is randomly initialized
  ■ Then, after training, each set is used to project the input into a different representation subspace
https://jalammar.github.io/illustrated-transformer/
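A minimal sketch that runs single-head self-attention once per head, each with its own (randomly initialized) Query/Key/Value matrices, and concatenates the results; the final mixing projection is noted in a comment but omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads):
    """Run self-attention once per head and concatenate the outputs.

    heads: list of (Wq, Wk, Wv) triples, one randomly initialized set per head.
    """
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    # The concatenated heads are usually mixed back with one more linear layer.
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
T, d, n_heads = 5, 16, 4
X = rng.standard_normal((T, d))
heads = [tuple(rng.standard_normal((d, d // n_heads)) for _ in range(3))
         for _ in range(n_heads)]
print(multi_head_self_attention(X, heads).shape)   # (5, 16)
```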
Decoder Attention
● When training as a language model, the network should not know the next token to generate.
⇒ Hide (mask) information about future tokens from the model.
⇒ Mask out attention to future words by setting their attention scores to −∞.
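A sketch of that masking step, assuming a square matrix of raw attention scores where row i is the query at position i; positions j > i get a score of −∞ before the softmax, so position i places zero weight on future tokens:

```python
import numpy as np

def masked_attention_weights(scores):
    """scores: (T, T) raw attention scores; row i = query at position i."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # entries with j > i
    masked = np.where(future, -np.inf, scores)          # hide future tokens
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = masked_attention_weights(rng.standard_normal((4, 4)))
print(w.round(2))   # the upper triangle (future positions) is all zeros
```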
Encoder-Decoder Attention
● Encoder
  ■ output vectors: h_1, …, h_T
  ■ keys: k_i = K h_i
  ■ values: v_i = V h_i
● Decoder
  ■ input vectors: z_1, …, z_T
  ■ queries: q_i = Q z_i
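A minimal sketch matching these definitions: keys and values come from the encoder outputs h, queries from the decoder vectors z (the projection matrices here are random stand-ins for trained weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(H, Z, K, V, Q):
    """H: encoder outputs (T_src, d); Z: decoder vectors (T_tgt, d)."""
    keys, values = H @ K, H @ V        # k_i = K h_i, v_i = V h_i
    queries = Z @ Q                    # q_i = Q z_i
    weights = softmax(queries @ keys.T / np.sqrt(keys.shape[-1]))
    return weights @ values            # each decoder position attends over the source

rng = np.random.default_rng(0)
d = 8
H = rng.standard_normal((6, d))        # source length 6
Z = rng.standard_normal((3, d))        # target length 3
K, V, Q = (rng.standard_normal((d, d)) for _ in range(3))
print(cross_attention(H, Z, K, V, Q).shape)   # (3, 8)
```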
Decoder final layers
● A linear layer projects each decoder output vector into a much longer vector of length vocab size (the logits), followed by a softmax to obtain a probability distribution over the vocabulary.
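A tiny sketch of that projection plus the softmax that follows it (the vocabulary size and the weights are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 1000
decoder_output = rng.standard_normal(d)            # one position's output vector
W_vocab = rng.standard_normal((d, vocab_size))     # projection to logits

logits = decoder_output @ W_vocab                  # length-vocab_size vector
probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # distribution over the vocabulary
print(int(np.argmax(probs)))                       # most likely next token id
```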
Subword modeling
LM vocabulary
● We assume a fixed vocabulary of tens of thousands of words, built from the training set.
● All novel words seen at test time are mapped to a single UNK token.
Subword modeling
● Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level.
● The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
● At training and testing time, each word is split into a sequence of known subwords.
Byte-pair encoding strategy
1. Start with a vocabulary containing only characters and an "end-of-word" symbol.
2. Using a corpus of text, find the most common adjacent characters "a,b"; add "ab" as a subword.
3. Replace instances of the character pair with the new subword; repeat until the desired vocabulary size is reached.
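A minimal sketch of this procedure on a toy corpus; real BPE implementations work on word frequencies and are far more efficient:

```python
from collections import Counter

def byte_pair_encoding(corpus, num_merges):
    """corpus: list of words; returns the learned subword vocabulary."""
    # 1. Start from characters plus an end-of-word symbol.
    words = [list(w) + ["</w>"] for w in corpus]
    vocab = {ch for word in words for ch in word}
    for _ in range(num_merges):
        # 2. Count adjacent symbol pairs and pick the most common one.
        pairs = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        vocab.add(a + b)
        # 3. Replace occurrences of the pair with the new merged subword.
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return vocab

print(sorted(byte_pair_encoding(["low", "lower", "lowest", "low"], num_merges=5)))
```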
Train on an NLP task
● Start with pretrained word embeddings (no context).
● Learn how to incorporate context in an LSTM or Transformer while training on the task.
● Issues with this approach:
  ■ The training data for our downstream task (e.g. sentiment analysis) must be sufficient to teach all contextual aspects of language.
  ■ Most of the parameters in the network are randomly initialized.
Pretraining
● All parameters in the network are initialized via pretraining.
● Pretraining methods hide parts of the input from the model and train the model to reconstruct those parts.
● This has been exceptionally effective at building strong:
  ■ Representations of language
  ■ Parameter initializations for strong NLP models
  ■ Probability distributions over language that we can sample from
Pretraining through language modeling
● Language Modeling
  ■ Model the probability distribution over words given their past contexts.
  ■ A large amount of data is available on the Web.
● Pretraining through language modeling
  ■ Train a neural network to perform language modeling on a large amount of text.
  ■ Save the network parameters.
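A sketch of the training signal: shift the text by one token and score each next token under the model's predicted distribution. The `model_logits` function below is a hypothetical stand-in for any neural LM, not a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50

def model_logits(context_ids):
    """Hypothetical stand-in: any network mapping contexts to next-token logits."""
    return rng.standard_normal((len(context_ids), vocab_size))

token_ids = np.array([3, 17, 42, 8, 25])         # a toy training sequence
inputs, targets = token_ids[:-1], token_ids[1:]  # predict each next token

logits = model_logits(inputs)                                 # (T-1, vocab)
shifted = logits - logits.max(-1, keepdims=True)              # numerical stability
log_probs = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
loss = -log_probs[np.arange(len(targets)), targets].mean()    # cross-entropy
print(loss)   # minimized over a large corpus; then the parameters are saved
```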
Pretraining / Finetuning
● Pretraining can improve NLP applications by serving as
parameter initialization.
Pretraining encoders
● Encoders get bidirectional context, so we can't do standard language modeling.
● Idea: replace some fraction of the words in the input with a special [MASK] token; predict these words.
● Only add loss terms from words that are "masked out."
● If x_M is the masked version of x, we're learning p_θ(x | x_M). This is called Masked LM.
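A sketch of how the masked input and the loss targets can be built; the 15% masking rate follows BERT, and the token ids, [MASK] id, and ignore value are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID, IGNORE = 103, -100             # illustrative special values

token_ids = rng.integers(1000, 2000, size=20)           # x: original tokens
is_masked = rng.random(20) < 0.15                       # pick ~15% of positions

masked_input = np.where(is_masked, MASK_ID, token_ids)  # x_M: what the model sees
labels = np.where(is_masked, token_ids, IGNORE)         # loss only where masked

print(masked_input)
print(labels)   # the model is trained to predict these hidden tokens
```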
Pretraining decoders
● When using language-model-pretrained decoders, we can ignore that they were trained to model the probability of the next token.
● We can finetune them by training a softmax classifier on the last word's hidden state.
● Gradients backpropagate through the whole network.
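A tiny sketch of that classifier head, assuming we have the pretrained decoder's hidden state for the last token (here a random stand-in, as are the classifier weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_classes = 64, 2                       # e.g. binary sentiment

last_hidden_state = rng.standard_normal(d)   # from the pretrained decoder
W, b = rng.standard_normal((d, num_classes)), np.zeros(num_classes)

logits = last_hidden_state @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the classes
print(probs)   # during finetuning, gradients flow into W, b and the whole network
```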
Finetuning decoders