02-Transformer Based NLP Applications
1. Attention
2. Transformers
3. Transformers architecture
4. Subword modeling
5. Pretraining
Multi-layer deep encoder-decoder MT network
[Figure: the decoder is conditioned on the encoder's final hidden state, which acts as a bottleneck.]
Seq-2-seq: the bottleneck problem
● The single final encoder hidden state must capture all information about the source sentence before the decoder can start generating.
Attention
● Attention provides a solution to the bottleneck problem.
Sequence-to-sequence with attention
Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
Sequence-to-sequence with attention
On this decoder timestep, we're mostly focusing on the first encoder hidden state ("he").
Attention: in equations
● We have encoder hidden states h_1, …, h_N ∈ ℝ^h
● On timestep t, we have decoder hidden state s_t ∈ ℝ^h
● We get the attention scores e^t for this step:
  e^t = [s_tᵀh_1, …, s_tᵀh_N] ∈ ℝ^N
● We take a softmax to get the attention distribution α^t = softmax(e^t) ∈ ℝ^N
● We use α^t to compute a weighted sum of the encoder hidden states:
  a_t = Σ_i α^t_i h_i ∈ ℝ^h
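A minimal NumPy sketch of these equations, assuming simple dot-product scoring; the function and variable names are illustrative:

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Compute the attention output for one decoder timestep.

    decoder_state:  s_t, shape (h,)
    encoder_states: h_1..h_N stacked as rows, shape (N, h)
    """
    # Attention scores e^t = [s_t . h_1, ..., s_t . h_N]
    scores = encoder_states @ decoder_state             # (N,)
    # Attention distribution alpha^t = softmax(e^t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # (N,)
    # Attention output a_t = sum_i alpha^t_i * h_i
    attn_output = weights @ encoder_states              # (h,)
    return attn_output, weights

# Toy usage: 5 source positions, hidden size 8
rng = np.random.default_rng(0)
h = rng.standard_normal((5, 8))
s = rng.standard_normal(8)
a, alpha = dot_product_attention(s, h)
print(alpha.round(3), a.shape)
```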
Attention
● Significantly improves NMT performance
  ■ Very useful to allow the decoder to focus on certain parts of the source
● Provides a more "human-like" model of the MT process
  ■ The model can look back at the source sentence while translating, rather than needing to remember it all
● Solves the bottleneck problem
  ■ Allows the decoder to look directly at the source, bypassing the bottleneck
Attention: a general DL technique
● More general definition of attention:
● Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
Attention Is All You Need (Vaswani et al., 2017)
Scaling Laws: Are Transformers All We Need?
● With Transformers, language modeling performance improves
smoothly as we increase model size, training data, and compute
resources in tandem.
● This power-law relationship has been observed over multiple
orders of magnitude with no sign of slowing.
● If we keep scaling up these models (with no change to the
architecture), could they eventually match or exceed human-level
performance?
Kaplan, Jared et al. “Scaling Laws for Neural Language Models.” ArXiv abs/2001.08361 (2020).
Motivation for Transformer Architecture
● The Transformer's authors had three desiderata when designing the architecture:
1. Minimize (or at least not increase) computational
complexity per layer.
2. Minimize path length between any pair of words to
facilitate learning of long-range dependencies.
3. Maximize the amount of computation that can be
parallelized.
Transformer Motivation
1. Computational Complexity Per Layer
When the sequence length (n) is much smaller than the representation dimension (d), complexity per layer is lower for a Transformer (self-attention costs O(n²·d) per layer) than for recurrent models (O(n·d²) per layer).
Transformer Motivation
2. Minimize Linear Interaction Distance
● RNNs are unrolled "left-to-right".
● This encodes linear locality, a useful heuristic:
  ■ Nearby words often affect each other's meanings.
Transformer Motivation
2. Minimize Linear Interaction Distance
● O(sequence length) steps for distant word pairs to interact means:
  ■ It is hard to learn long-distance dependencies (because of gradient problems!)
  ■ The linear order of words is "baked in"; we already know sequential structure doesn't tell the whole story...
Transformer Motivation
3. Maximize Parallelizability
● Forward and backward passes have O(sequence length) non-parallelizable operations
  ■ GPUs (and TPUs) can perform many independent computations at once
  ■ But future RNN hidden states can't be computed in full before past RNN hidden states have been computed
  ■ This inhibits training on very large datasets!
  ■ Particularly problematic as sequence length increases, since we can no longer batch many examples together due to memory limitations
(Self) Attention
● Attention treats each word's representation as a query to access and incorporate information from a set of values.
● In NMT: attention from the decoder to the encoder in a recurrent seq-2-seq model.
● Self-attention is encoder-encoder (or decoder-decoder) attention, where each word attends to every other word within the input (or output).
Recurrence vs. Attention
[Figure: an RNN-based encoder-decoder model compared with a Transformer encoder-decoder model with attention.]
● Number of unparallelizable operations does not increase with sequence length.
● Each "word" interacts with every other word, so the maximum interaction distance is O(1).
Transformer Architecture
Intuition for Attention Mechanism
● Consider attention as an approximate hashtable:
● To look up a value, we compare a query against keys in a table.
● In a hashtable
■ Each query (hash) maps to exactly one key-value pair.
● In (self-)attention
■ Each query matches each key to varying degrees.
■ We return a sum of values weighted by the query-key match
[Figure: a hashtable maps the query q to exactly one of the key-value pairs (K0,V0)…(K5,V5); attention matches q against every key and returns a weighted sum of all the values.]
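A small sketch of this contrast with made-up keys, values, and a query; the hashtable side picks the single best-matching key, while the attention side blends all values:

```python
import numpy as np

keys   = np.array([[1., 0.], [0., 1.], [1., 1.]])    # K0..K2
values = np.array([[10., 0.], [0., 10.], [5., 5.]])  # V0..V2
query  = np.array([0.9, 0.1])

# Hashtable-style lookup: the query selects exactly one key-value pair.
exact = values[np.argmax(keys @ query)]

# Attention-style lookup: the query matches every key to some degree,
# and we return the values weighted by those (softmaxed) matches.
scores  = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()
soft    = weights @ values

print(exact)   # one value row
print(soft)    # a blend of all value rows
```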
Encoder: Self-Attention
Encoder: Self-Attention Vectors
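The figure for this slide is not reproduced here; as a stand-in, a minimal single-head sketch where each input vector is projected into query, key, and value vectors with (randomly initialized) matrices, and each position's output is an attention-weighted sum over all positions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # query/key/value vectors per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (T, T) query-key matches
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # each output: weighted sum of values

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.standard_normal((T, d))                  # stand-in input embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (4, 8)
```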
Feedforward layer
● Apply a feedforward layer to the output of attention, providing a non-linear activation.
● Why?
  ■ Self-attention on its own simply performs a re-averaging of the value vectors.
  ■ We need a non-linearity to learn more complex functions of the input.
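A sketch of such a position-wise feedforward layer, assuming a two-layer network with a ReLU non-linearity (dimensions and weights are illustrative):

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: applied to each position's vector independently."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
T, d, d_ff = 4, 8, 32                        # the inner dimension d_ff is typically larger than d
X = rng.standard_normal((T, d))              # e.g. the output of self-attention
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
print(feed_forward(X, W1, b1, W2, b2).shape)  # (4, 8)
```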
Residual Connections
● Residual connections are a simple but powerful technique from computer vision.
● Deep networks are surprisingly bad at learning the identity function.
● Therefore, directly passing the "raw" embeddings to the next layer can actually be very helpful.
● Layer normalization: reduce uninformative variation by normalizing each vector to zero mean and a standard deviation of one within each layer.
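A minimal sketch of the usual pattern, assuming the sublayer output is added back to its input and then layer-normalized (the learned gain and bias of layer norm are omitted, and the sublayer here is a stand-in function):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    """Add the sublayer output to its "raw" input, then normalize."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = residual_block(x, lambda v: v @ rng.standard_normal((8, 8)))
print(out.mean(axis=-1).round(6), out.shape)   # per-position means are ~0
```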
Positional Encoding
● Since self-attention doesn't build in order information, we need to encode the order of the sentence in the keys, queries, and values.
● Consider representing each sequence index as a vector and adding it to the word embedding at that position.
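One common choice, used in the original Transformer paper, is a fixed sinusoidal vector per position; a minimal sketch (the embedding values below are stand-ins):

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Position i, dimension 2k uses sin(i / 10000^(2k/d)); dimension 2k+1 uses cos."""
    positions = np.arange(T)[:, None]                  # (T, 1)
    dims = np.arange(0, d, 2)[None, :]                 # (1, d/2)
    angles = positions / (10000.0 ** (dims / d))       # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the word embeddings so self-attention can use order information.
word_embeddings = np.zeros((6, 16))                    # stand-in embeddings
X = word_embeddings + sinusoidal_positions(6, 16)
print(X.shape)
```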
Multi-Headed Self-Attention
● Performs self-attention multiple times in
parallel and combines the results.
Multi-Headed Self-Attention
Improves the performance of the attention layer:
● Expands the model's ability to focus on different positions
  ■ "The animal didn't cross the street because it was too tired."
  ■ Which word does "it" refer to?
● Gives the attention layer multiple representation subspaces
  ■ Multiple sets of Query/Key/Value weight matrices
  ■ Each set is randomly initialized
  ■ Then, after training, each set is used to project the input into a different representation subspace
https://jalammar.github.io/illustrated-transformer/
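A minimal sketch that runs single-head self-attention once per head, each with its own (randomly initialized) Query/Key/Value matrices, and concatenates the results; the final mixing projection is noted in a comment but omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads):
    """Run self-attention once per head and concatenate the outputs.

    heads: list of (Wq, Wk, Wv) triples, one randomly initialized set per head.
    """
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    # The concatenated heads are usually mixed back with one more linear layer.
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
T, d, n_heads = 5, 16, 4
X = rng.standard_normal((T, d))
heads = [tuple(rng.standard_normal((d, d // n_heads)) for _ in range(3))
         for _ in range(n_heads)]
print(multi_head_self_attention(X, heads).shape)   # (5, 16)
```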
Decoder Attention
● When training as a language model, the network should not know the next token to generate.
⇒ Hide (mask) information about future tokens from the model.
⇒ Mask out attention to future words by setting their attention scores to −∞.
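A sketch of that masking step, assuming a square matrix of raw attention scores where row i is the query at position i; positions j > i get a score of −∞ before the softmax, so position i places zero weight on future tokens:

```python
import numpy as np

def masked_attention_weights(scores):
    """scores: (T, T) raw attention scores; row i = query at position i."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # entries with j > i
    masked = np.where(future, -np.inf, scores)          # hide future tokens
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = masked_attention_weights(rng.standard_normal((4, 4)))
print(w.round(2))   # the upper triangle (future positions) is all zeros
```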
Encoder-Decoder Attention
● Encoder
  ■ output vectors: h_1, …, h_T
  ■ keys: k_i = K h_i
  ■ values: v_i = V h_i
● Decoder
  ■ input vectors: z_1, …, z_T
  ■ queries: q_i = Q z_i
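A minimal sketch matching these definitions: keys and values come from the encoder outputs h, queries from the decoder vectors z (the projection matrices here are random stand-ins for trained weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(H, Z, K, V, Q):
    """H: encoder outputs (T_src, d); Z: decoder vectors (T_tgt, d)."""
    keys, values = H @ K, H @ V        # k_i = K h_i, v_i = V h_i
    queries = Z @ Q                    # q_i = Q z_i
    weights = softmax(queries @ keys.T / np.sqrt(keys.shape[-1]))
    return weights @ values            # each decoder position attends over the source

rng = np.random.default_rng(0)
d = 8
H = rng.standard_normal((6, d))        # source length 6
Z = rng.standard_normal((3, d))        # target length 3
K, V, Q = (rng.standard_normal((d, d)) for _ in range(3))
print(cross_attention(H, Z, K, V, Q).shape)   # (3, 8)
```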
Decoder final layers
● A linear layer projects each decoder output vector into a much longer vector of length vocab size (the logits), followed by a softmax to obtain a probability distribution over the vocabulary.
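A tiny sketch of that projection plus the softmax that follows it (the vocabulary size and the weights are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 1000
decoder_output = rng.standard_normal(d)            # one position's output vector
W_vocab = rng.standard_normal((d, vocab_size))     # projection to logits

logits = decoder_output @ W_vocab                  # length-vocab_size vector
probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # distribution over the vocabulary
print(int(np.argmax(probs)))                       # most likely next token id
```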
Subword modeling
LM vocabulary
● We assume a fixed vocabulary of tens of thousands of words, built from the training set.
● All novel words seen at test time are mapped to a single UNK token.
Subword modeling
● Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level.
● The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
● At training and testing time, each word is split into a sequence of known subwords.
Byte-pair encoding strategy
1. Start with a vocabulary containing only characters and an "end-of-word" symbol.
2. Using a corpus of text, find the most common adjacent characters "a,b"; add "ab" as a subword.
3. Replace instances of the character pair with the new subword; repeat until the desired vocabulary size is reached.
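A minimal sketch of this procedure on a toy corpus; real BPE implementations work on word frequencies and are far more efficient:

```python
from collections import Counter

def byte_pair_encoding(corpus, num_merges):
    """corpus: list of words; returns the learned subword vocabulary."""
    # 1. Start from characters plus an end-of-word symbol.
    words = [list(w) + ["</w>"] for w in corpus]
    vocab = {ch for word in words for ch in word}
    for _ in range(num_merges):
        # 2. Count adjacent symbol pairs and pick the most common one.
        pairs = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        vocab.add(a + b)
        # 3. Replace occurrences of the pair with the new merged subword.
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return vocab

print(sorted(byte_pair_encoding(["low", "lower", "lowest", "low"], num_merges=5)))
```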
Train on an NLP task
● Start with pretrained word embeddings (no context).
● Learn how to incorporate context in an LSTM or Transformer while training on the task.
● Issues with this approach:
  ■ The training data for our downstream task (e.g. sentiment analysis) must be sufficient to teach all contextual aspects of language.
  ■ Most of the parameters in the network are randomly initialized.
Pretraining
● All parameters in the network are initialized via pretraining.
● Pretraining methods hide parts of the input from the model and train the model to reconstruct those parts.
● This has been exceptionally effective at building strong:
  ■ Representations of language
  ■ Parameter initializations for strong NLP models
  ■ Probability distributions over language that we can sample from
Pretraining through language modeling
● Language Modeling
  ■ Model the probability distribution over words given their past contexts.
  ■ A large amount of data is available on the Web.
● Pretraining through language modeling
  ■ Train a neural network to perform language modeling on a large amount of text.
  ■ Save the network parameters.
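A sketch of the training signal: shift the text by one token and score each next token under the model's predicted distribution. The `model_logits` function below is a hypothetical stand-in for any neural LM, not a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50

def model_logits(context_ids):
    """Hypothetical stand-in: any network mapping contexts to next-token logits."""
    return rng.standard_normal((len(context_ids), vocab_size))

token_ids = np.array([3, 17, 42, 8, 25])         # a toy training sequence
inputs, targets = token_ids[:-1], token_ids[1:]  # predict each next token

logits = model_logits(inputs)                                 # (T-1, vocab)
shifted = logits - logits.max(-1, keepdims=True)              # numerical stability
log_probs = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
loss = -log_probs[np.arange(len(targets)), targets].mean()    # cross-entropy
print(loss)   # minimized over a large corpus; then the parameters are saved
```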
Pretraining / Finetuning
● Pretraining can improve NLP applications by serving as
parameter initialization.
Pretraining encoders
● Encoders get bidirectional context, so we can't do standard language modeling.
● Idea: replace some fraction of the words in the input with a special [MASK] token; predict these words.
● Only add loss terms from words that are "masked out."
● If x_M is the masked version of x, we're learning p_θ(x | x_M). This is called Masked LM.
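A sketch of how the masked input and the loss targets can be built; the 15% masking rate follows BERT, and the token ids, [MASK] id, and ignore value are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID, IGNORE = 103, -100             # illustrative special values

token_ids = rng.integers(1000, 2000, size=20)           # x: original tokens
is_masked = rng.random(20) < 0.15                       # pick ~15% of positions

masked_input = np.where(is_masked, MASK_ID, token_ids)  # x_M: what the model sees
labels = np.where(is_masked, token_ids, IGNORE)         # loss only where masked

print(masked_input)
print(labels)   # the model is trained to predict these hidden tokens
```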
Pretraining decoders
● When using language-model-pretrained decoders, we can ignore that they were trained to model the probability of the next token.
● We can finetune them by training a softmax classifier on the last word's hidden state.
● Gradients backpropagate through the whole network.
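A tiny sketch of that classifier head, assuming we have the pretrained decoder's hidden state for the last token (here a random stand-in, as are the classifier weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_classes = 64, 2                       # e.g. binary sentiment

last_hidden_state = rng.standard_normal(d)   # from the pretrained decoder
W, b = rng.standard_normal((d, num_classes)), np.zeros(num_classes)

logits = last_hidden_state @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the classes
print(probs)   # during finetuning, gradients flow into W, b and the whole network
```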
Finetuning decoders