3.1 Language Models and Attention
1 - Language Models and Attention
Generative AI Teaching Kit
The NVIDIA Deep Learning Institute Generative AI Teaching Kit is licensed by NVIDIA and Dartmouth College under the
Creative Commons Attribution-NonCommercial 4.0 International License.
This lecture
Language Models in Deep Learning
Deep Language Modeling
Challenges of Deep Learning Sequences
One main issue that standard feedforward neural networks run into when attempting to model language is time dependency: language arrives as an ordered sequence, and a fixed-size feedforward network has no built-in way to account for what came earlier in that sequence.
Recurrent Neural Networks
Unlike traditional feedforward networks, RNNs are designed to recognize patterns in sequences of
data, such as time-series, text, speech, or videos, by maintaining a hidden state that captures
information about previous inputs in the sequence.
Key Characteristics of RNNs:
Sequential Processing:
RNNs process input one step at a time, making them well-suited for tasks where data has a
temporal or sequential structure.
Recurrent Connections:
At each time step, the hidden state of the network is updated based on the current input and the
hidden state from the previous time step. This allows the network to "remember" information from
earlier in the sequence (the update rule is written out after this list).
Shared Weights:
The weights used for processing are shared across time steps, which reduces the number of
parameters and helps capture temporal dependencies.
Memory:
The hidden state serves as a form of memory, enabling the network to use information from earlier
inputs to influence later outputs.
Training via Backpropagation Through Time (BPTT):
To train the network, gradients are propagated backward through all time steps of the sequence.
However, this can lead to issues like vanishing or exploding gradients.
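For reference, the recurrent update described above is usually written as the standard "vanilla" RNN formula (the weight names here are illustrative):

h_t = \tanh(W_x x_t + W_h h_{t-1} + b)

where x_t is the input at step t, h_{t-1} is the previous hidden state, and W_x, W_h, and b are the shared weights reused at every time step. BPTT unrolls this recurrence over the whole sequence and backpropagates through each repeated application of the same weights.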
Recurrent Neural Networks – Code Differences
Key Differences
§ ANN: Processes inputs without considering temporal relationships; uses a simple Linear layer.
§ RNN: Processes sequences with recurrent connections; uses an RNN layer and captures sequential
dependencies.
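A minimal PyTorch sketch of this difference (layer sizes and module names are illustrative, not the kit's original code):

import torch
import torch.nn as nn

# ANN: a plain feedforward stack maps each input independently;
# there is no notion of order or of previous inputs.
class SimpleANN(nn.Module):
    def __init__(self, input_size=16, hidden_size=32, output_size=8):
        super().__init__()
        self.fc = nn.Linear(input_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x):                    # x: (batch, input_size)
        return self.out(torch.relu(self.fc(x)))

# RNN: an nn.RNN layer carries a hidden state across time steps,
# so each step's output depends on everything seen before it.
class SimpleRNN(nn.Module):
    def __init__(self, input_size=16, hidden_size=32, output_size=8):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x):                    # x: (batch, seq_len, input_size)
        h_all, h_last = self.rnn(x)          # hidden state at every step, and the final one
        return self.out(h_last.squeeze(0))   # predict from the final hidden state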
Long Short-Term Memory Models (LSTMs)
Solution: LSTMs
§ Introduce a "cell state" to selectively retain or discard information.
§ Effectively capture long-range dependencies in sequences (a minimal usage sketch follows).
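A minimal sketch of using PyTorch's nn.LSTM (tensor sizes are illustrative); note that the layer returns the cell state alongside the hidden state:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 50, 16)        # (batch, seq_len, features)

# h_n is the final hidden state; c_n is the cell state that the gates
# selectively write to, erase from, and read out of at each step.
output, (h_n, c_n) = lstm(x)
print(output.shape, h_n.shape, c_n.shape)
# torch.Size([4, 50, 32]) torch.Size([1, 4, 32]) torch.Size([1, 4, 32])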
Limitations of RNNs and LSTMs in Language Modeling
Sequential Processing Bottleneck
§ RNNs and LSTMs process inputs step-by-step, making training and inference slow
for long sequences.
Vanishing Gradients
§ Gradients diminish as they backpropagate through many time steps, limiting the
network’s ability to learn relationships across long sequences.
Challenges with Long-Range Dependencies
§ Even with LSTMs, retaining information from distant parts of a sequence is difficult,
leading to a loss of context over time.
Focus on Local Context
§ RNNs and LSTMs prioritize immediate neighboring words but struggle to model
relationships across the entire input effectively.
The solution?
A new mechanism that would enable parallel processing, dynamically focus on relevant parts of the sequence, and
capture long-range dependencies, without the limitations of step-by-step computation or vanishing gradients…
Evolution of attention mechanisms pre-transformers
Adding Attention
What is Attention?
Attention is a mechanism that enables a model to focus on the most relevant parts of
the input while making predictions.
Instead of treating all input information equally, it assigns varying levels of
"importance" (weights) to different parts based on the task.
Key Idea:
When processing sequences, attention computes a weighted sum of the input
elements, where the weights represent how much "attention" each element deserves.
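In symbols (a generic formulation; the exact scoring function varies by method):

\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \qquad c = \sum_i \alpha_i h_i

where e_i is a relevance score for input element h_i, the softmax turns the scores into weights \alpha_i that sum to 1, and c is the attention-weighted summary the model uses downstream.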
Early Attention Attempts – Bahdanau et al., 2014
Bahdanau et al. introduced a mechanism that dynamically focuses on specific parts of the input sequence while generating each
element of the output sequence.
§ The encoder produces a set of context vectors (hidden states) for each input token.
§ For each output token, the decoder calculates an attention score for each input token based on its relevance to the current
decoding step.
§ The scores are normalized (using SoftMax) to produce attention weights, which are used to compute a weighted sum of
encoder hidden states (context vector).
§ This context vector is then used by the decoder to produce the next output token.
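A minimal PyTorch sketch of this additive (Bahdanau-style) scoring and weighting, written for single unbatched vectors for clarity (layer names are illustrative):

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Bahdanau-style score: score(s, h_i) = v^T tanh(W_s s + W_h h_i)
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (dec_dim,) previous decoder state
        # enc_states: (src_len, enc_dim) encoder hidden states
        scores = self.v(torch.tanh(self.W_s(s_prev) + self.W_h(enc_states)))  # (src_len, 1)
        weights = torch.softmax(scores, dim=0)        # attention weights over source tokens
        context = (weights * enc_states).sum(dim=0)   # weighted sum -> context vector
        return context, weights.squeeze(-1)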
Improved Attention – Luong et al., 2015
Introduced in the paper "Effective Approaches to Attention-based Neural Machine Translation" by
Minh-Thang Luong et al., this work focused on improving the computational efficiency and
flexibility of attention mechanisms.
Multiplicative Attention (Dot-Product Scoring):
§ Replaced Bahdanau's additive scoring function with simpler multiplicative scoring functions (a plain dot product, or a
bilinear "general" form) to calculate attention scores.
§ Resulted in faster computations while maintaining strong performance.
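In symbols, the multiplicative scoring variants from the Luong et al. paper are

score(h_t, \bar{h}_s) = h_t^\top \bar{h}_s \quad \text{(dot)}, \qquad score(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s \quad \text{(general)}

where h_t is the current decoder state and \bar{h}_s an encoder state. Because these reduce scoring to matrix multiplications, they parallelize well compared with the additive form.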
Early Self-Attention – Lin et al., 2017
§ Multi-Aspect Representations:
Captures diverse aspects of a sentence by generating multiple
attention vectors (hops), enabling richer embeddings.
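For reference, the multi-hop formulation from Lin et al.'s "A Structured Self-Attentive Sentence Embedding" can be summarized as

A = \mathrm{softmax}(W_{s2} \tanh(W_{s1} H^\top)), \qquad M = A H

where H holds the LSTM hidden states of the sentence, each row of A is one attention "hop" over the tokens, and the rows of M are the resulting multi-aspect sentence embeddings.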
Remaining Limitations with Attention Methods
Despite the novel innovations of attention methods, these approaches still suffered from some
general limitations, preventing their widespread use.
Dependence on Recurrence
§ Attention mechanisms (e.g., Bahdanau, Luong, Lin et al.) were tightly integrated with RNNs/LSTMs, which
process sequences sequentially.
§ Gradient Issues: The reliance on recurrence also made them susceptible to vanishing or exploding gradients,
limiting their ability to model very long dependencies.
Lack of Scalability
§ RNN/LSTM-based models with attention were computationally expensive and struggled with large datasets or
sequences.
§ Memory Usage: Maintaining hidden states for long sequences was resource-intensive.
Inefficient Training
§ Training LSTM-based models with attention was slow because of sequential dependencies and the need to
process data step-by-step.
The Self-Attention Mechanism
Attention Is All You Need – Vaswani et al., 2017
One of the most influential papers in machine learning to date, "Attention Is All You Need" introduced a new way of using
attention and completely removed the reliance on recurrence for processing sequences. In the next lesson we will dive deeper
into the Transformer model; for now we will focus on how self-attention is presented in that paper.
Key Concept:
Every token in the sequence can attend to every other token, including itself, to understand its relationship and importance
in the context of the entire sequence.
E.g., in the sentence "The cat chased the mouse," self-attention helps the model understand that "the mouse" is what
"the cat" chased, by focusing on the semantic relationship between "chased" and "the mouse."
How Self-Attention Mechanisms Work
The implementation of self-attention can vary, but the essence is to:
§ Compare: Each token is compared to every other token in the sequence to compute relevance scores.
What this enables:
Global Context
Captures relationships between tokens across the entire sequence, not just local neighbors.
Dynamic Focus
The model decides what to focus on for each token, rather than relying on fixed patterns (e.g., sliding windows in convolution).
Flexibility
Works for variable-length sequences and tasks requiring both short- and long-range dependencies.
Queries, Keys, and Values
In the Attention Is All You Need paper, a novel algorithm based on Queries, Keys, and Values is presented
to handle the attention calculations:
QKV allows each token to decide:
§ What it wants to know (Query).
§ What information it can provide (Key).
§ The actual data it contributes to the result (Value).
Step 1: Create Q, K, V
Each input token (e.g., a word embedding) is linearly transformed into three vectors: Query (Q), Key (K), and Value (V).
The remaining steps (scoring, normalization, and weighted summation) are sketched below.
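A minimal PyTorch sketch of the full computation from the paper, scaled dot-product attention for a single head (dimensions are illustrative):

import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        # Step 1: learned linear maps turn each token embedding into Q, K, V.
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x):                          # x: (batch, seq_len, embed_dim)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        # Every token's query is compared against every token's key,
        # scaled by sqrt(d_k) as in the paper.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (batch, seq_len, seq_len)
        # Softmax turns scores into attention weights; each token's output
        # is then a weighted sum of the value vectors.
        weights = torch.softmax(scores, dim=-1)
        return weights @ V                          # (batch, seq_len, embed_dim)

# Usage: five tokens attending to one another, including themselves.
attn = SingleHeadSelfAttention(embed_dim=64)
tokens = torch.randn(1, 5, 64)
out = attn(tokens)                                  # shape: (1, 5, 64)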
Thank you!