3.1 Language Models and Attention
1 - Language Models and Attention
Generative AI Teaching Kit
The NVIDIA Deep Learning Institute Generative AI Teaching Kit is licensed by NVIDIA and Dartmouth College under the
Creative Commons Attribution-NonCommercial 4.0 International License.
This lecture
Language Models in Deep Learning
Deep Language Modeling
Challenges of Deep Learning Sequences
One main issue that standard feedforward neural networks run into when attempting to model language is time dependency: language arrives as an ordered sequence, and a fixed-size feedforward network has no built-in way to account for what came earlier in that sequence.
Recurrent Neural Networks
Unlike traditional feedforward networks, RNNs are designed to recognize patterns in sequences of
data, such as time-series, text, speech, or videos, by maintaining a hidden state that captures
information about previous inputs in the sequence.
Key Characteristics of RNNs:
Sequential Processing:
RNNs process input one step at a time, making them well-suited for tasks where data has a
temporal or sequential structure.
Recurrent Connections:
At each time step, the hidden state of the network is updated based on the current input and the
hidden state from the previous time step. This allows the network to "remember" information from
earlier in the sequence (the update rule is written out after this list).
Shared Weights:
The weights used for processing are shared across time steps, which reduces the number of
parameters and helps capture temporal dependencies.
Memory:
The hidden state serves as a form of memory, enabling the network to use information from earlier
inputs to influence later outputs.
Training via Backpropagation Through Time (BPTT):
To train the network, gradients are propagated backward through all time steps of the sequence.
However, this can lead to issues like vanishing or exploding gradients.
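For reference, the recurrent update described above is usually written as the standard "vanilla" RNN formula (the weight names here are illustrative):

h_t = \tanh(W_x x_t + W_h h_{t-1} + b)

where x_t is the input at step t, h_{t-1} is the previous hidden state, and W_x, W_h, and b are the shared weights reused at every time step. BPTT unrolls this recurrence over the whole sequence and backpropagates through each repeated application of the same weights.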
Recurrent Neural Networks – Code Differences
Key Differences
§ ANN: Processes inputs without considering temporal relationships; uses a simple Linear layer.
§ RNN: Processes sequences with recurrent connections; uses an RNN layer and captures sequential
dependencies.
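A minimal PyTorch sketch of this difference (layer sizes and module names are illustrative, not the kit's original code):

import torch
import torch.nn as nn

# ANN: a plain feedforward stack maps each input independently;
# there is no notion of order or of previous inputs.
class SimpleANN(nn.Module):
    def __init__(self, input_size=16, hidden_size=32, output_size=8):
        super().__init__()
        self.fc = nn.Linear(input_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x):                    # x: (batch, input_size)
        return self.out(torch.relu(self.fc(x)))

# RNN: an nn.RNN layer carries a hidden state across time steps,
# so each step's output depends on everything seen before it.
class SimpleRNN(nn.Module):
    def __init__(self, input_size=16, hidden_size=32, output_size=8):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x):                    # x: (batch, seq_len, input_size)
        h_all, h_last = self.rnn(x)          # hidden state at every step, and the final one
        return self.out(h_last.squeeze(0))   # predict from the final hidden state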
Long Short-Term Memory Models (LSTMs)
Solution: LSTMs
§ Introduce a "cell state" to selectively retain or discard information.
§ Effectively capture long-range dependencies in sequences (a minimal usage sketch follows).
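A minimal sketch of using PyTorch's nn.LSTM (tensor sizes are illustrative); note that the layer returns the cell state alongside the hidden state:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 50, 16)        # (batch, seq_len, features)

# h_n is the final hidden state; c_n is the cell state that the gates
# selectively write to, erase from, and read out of at each step.
output, (h_n, c_n) = lstm(x)
print(output.shape, h_n.shape, c_n.shape)
# torch.Size([4, 50, 32]) torch.Size([1, 4, 32]) torch.Size([1, 4, 32])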
Limitations of RNNs and LSTMs in Language Modeling
Sequential Processing Bottleneck
§ RNNs and LSTMs process inputs step-by-step, making training and inference slow
for long sequences.
Vanishing Gradients
§ Gradients diminish as they backpropagate through many time steps, limiting the
network’s ability to learn relationships across long sequences.
Challenges with Long-Range Dependencies
§ Even with LSTMs, retaining information from distant parts of a sequence is difficult,
leading to a loss of context over time.
Focus on Local Context
§ RNNs and LSTMs prioritize immediate neighboring words but struggle to model
relationships across the entire input effectively.
The solution?
A new mechanism that would enable parallel processing, dynamically focus on relevant parts of the sequence, and
capture long-range dependencies, without the limitations of step-by-step computation or vanishing gradients…
Evolution of attention mechanisms pre-transformers
Adding Attention
What is Attention?
Attention is a mechanism that enables a model to focus on the most relevant parts of
the input while making predictions.
Instead of treating all input information equally, it assigns varying levels of
"importance" (weights) to different parts based on the task.
Key Idea:
When processing sequences, attention computes a weighted sum of the input
elements, where the weights represent how much "attention" each element deserves.
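In symbols (a generic formulation; the exact scoring function varies by method):

\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \qquad c = \sum_i \alpha_i h_i

where e_i is a relevance score for input element h_i, the softmax turns the scores into weights \alpha_i that sum to 1, and c is the attention-weighted summary the model uses downstream.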
Early Attention Attempts – Bahdanau et al., 2014
Bahdanau et al. introduced a mechanism that dynamically focuses on specific parts of the input sequence while generating each
element of the output sequence.
§ The encoder produces a set of context vectors (hidden states) for each input token.
§ For each output token, the decoder calculates an attention score for each input token based on its relevance to the current
decoding step.
§ The scores are normalized (using SoftMax) to produce attention weights, which are used to compute a weighted sum of
encoder hidden states (context vector).
§ This context vector is then used by the decoder to produce the next output token.
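A minimal PyTorch sketch of this additive (Bahdanau-style) scoring and weighting, written for single unbatched vectors for clarity (layer names are illustrative):

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Bahdanau-style score: score(s, h_i) = v^T tanh(W_s s + W_h h_i)
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (dec_dim,) previous decoder state
        # enc_states: (src_len, enc_dim) encoder hidden states
        scores = self.v(torch.tanh(self.W_s(s_prev) + self.W_h(enc_states)))  # (src_len, 1)
        weights = torch.softmax(scores, dim=0)        # attention weights over source tokens
        context = (weights * enc_states).sum(dim=0)   # weighted sum -> context vector
        return context, weights.squeeze(-1)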
Improved Attention – Luong et al., 2015
Introduced in the paper "Effective Approaches to Attention-based Neural Machine Translation" by
Minh-Thang Luong et al., this work focused on improving the computational efficiency and
flexibility of attention mechanisms.
Multiplicative Attention (Dot-Product Scoring):
§ Replaced Bahdanau's additive scoring function with simpler multiplicative scoring functions (a plain dot product, or a
bilinear "general" form) to calculate attention scores.
§ Resulted in faster computations while maintaining strong performance.
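In symbols, the multiplicative scoring variants from the Luong et al. paper are

score(h_t, \bar{h}_s) = h_t^\top \bar{h}_s \quad \text{(dot)}, \qquad score(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s \quad \text{(general)}

where h_t is the current decoder state and \bar{h}_s an encoder state. Because these reduce scoring to matrix multiplications, they parallelize well compared with the additive form.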
Early Self-Attention – Lin et al., 2017
§ Multi-Aspect Representations:
Captures diverse aspects of a sentence by generating multiple
attention vectors (hops), enabling richer embeddings.
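For reference, the multi-hop formulation from Lin et al.'s "A Structured Self-Attentive Sentence Embedding" can be summarized as

A = \mathrm{softmax}(W_{s2} \tanh(W_{s1} H^\top)), \qquad M = A H

where H holds the LSTM hidden states of the sentence, each row of A is one attention "hop" over the tokens, and the rows of M are the resulting multi-aspect sentence embeddings.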
Remaining Limitations with Attention Methods
Despite the novel innovations of attention methods, these approaches still suffered from some
general limitations, preventing their widespread use.
Dependence on Recurrence
§ Attention mechanisms (e.g., Bahdanau, Luong, Lin et al.) were tightly integrated with RNNs/LSTMs, which
process sequences sequentially.
§ Gradient Issues: The reliance on recurrence also made them susceptible to vanishing or exploding gradients,
limiting their ability to model very long dependencies.
Lack of Scalability
§ RNN/LSTM-based models with attention were computationally expensive and struggled with large datasets or
sequences.
§ Memory Usage: Maintaining hidden states for long sequences was resource-intensive.
Inefficient Training
§ Training LSTM-based models with attention was slow because of sequential dependencies and the need to
process data step-by-step.
The Self-Attention Mechanism
Attention Is All You Need – Vaswani et al., 2017
One of the most influential papers in machine learning to date, "Attention Is All You Need" introduced a new way of using
attention and completely removed the reliance on recurrence for processing sequences. In the next lesson we will dive deeper
into the Transformer model; for now we will focus on how self-attention is presented in that paper.
Key Concept:
Every token in the sequence can attend to every other token, including itself, to understand its relationship and importance
in the context of the entire sequence.
E.g., in the sentence "The cat chased the mouse," self-attention helps the model understand that "the mouse" is what
"the cat" chased, by focusing on the semantic relationship between "chased" and "the mouse."
How Self-Attention Mechanisms Work
The implementation of self-attention can vary, but the essence is to:
§ Compare: Each token is compared to every other token in the sequence to compute relevance scores.
What this enables:
Global Context
Captures relationships between tokens across the entire sequence, not just local neighbors.
Dynamic Focus
The model decides what to focus on for each token, rather than relying on fixed patterns (e.g., sliding windows in convolution).
Flexibility
Works for variable-length sequences and tasks requiring both short- and long-range dependencies.
Queries, Keys, and Values
In the Attention Is All You Need paper, a novel algorithm based on Queries, Keys, and Values is presented
to handle the attention calculations:
QKV allows each token to decide:
§ What it wants to know (Query).
§ What information it can provide (Key).
§ The actual data it contributes to the result (Value).
Step 1: Create Q, K, V
Each input token (e.g., a word embedding) is linearly transformed into three vectors: Query (Q), Key (K), and Value (V).
The remaining steps (scoring, normalization, and weighted summation) are sketched below.
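A minimal PyTorch sketch of the full computation from the paper, scaled dot-product attention for a single head (dimensions are illustrative):

import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        # Step 1: learned linear maps turn each token embedding into Q, K, V.
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x):                          # x: (batch, seq_len, embed_dim)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        # Every token's query is compared against every token's key,
        # scaled by sqrt(d_k) as in the paper.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (batch, seq_len, seq_len)
        # Softmax turns scores into attention weights; each token's output
        # is then a weighted sum of the value vectors.
        weights = torch.softmax(scores, dim=-1)
        return weights @ V                          # (batch, seq_len, embed_dim)

# Usage: five tokens attending to one another, including themselves.
attn = SingleHeadSelfAttention(embed_dim=64)
tokens = torch.randn(1, 5, 64)
out = attn(tokens)                                  # shape: (1, 5, 64)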
Thank you!