Lecture 26
$t_1, t_2, t_3$ denote the output ("target") token embeddings (assuming the output sequence length is also 3, and $t_s$ is a special start-of-sentence (SOS) token)
Superscript $(\ell)$ denotes the layer $\ell$. At layer 0, we have the original token embeddings (with added positional encodings)
In the decoder's output layer, at each step, we predict the most likely output token for that step ($\hat{t}_e$ denotes the special end-of-sentence (EOS) token)
In the decoder's input layer, tokens in the output sequence are shown shifted right by one position (because, in the output layer, the next-token prediction depends on the previously predicted token in the output sequence)
The feedforward (FF) and linear layers are applied position-wise (for each token separately)
[Figure: Overall Transformer Architecture]
An Encoder Block (N such blocks): the token embeddings $s_1^{(\ell-1)}, s_2^{(\ell-1)}, s_3^{(\ell-1)}$ pass through Self-Attention Layer → Layer Normalization → FF → Layer Normalization to produce $s_1^{(\ell)}, s_2^{(\ell)}, s_3^{(\ell)}$
A Decoder Block (N such blocks): the token embeddings $t_s^{(\ell-1)}, t_1^{(\ell-1)}, t_2^{(\ell-1)}, t_3^{(\ell-1)}$ pass through Masked Self-Attention Layer → Layer Normalization → Cross-Attention Layer (connected with the corresponding encoder block) → Layer Normalization → FF → Layer Normalization to produce $t_s^{(\ell)}, t_1^{(\ell)}, t_2^{(\ell)}, t_3^{(\ell)}$
Decoder's output layer: at each position $m$, a Linear layer followed by a Softmax is applied to the last decoder block's embedding $t_m^{(N)}$, giving the predicted tokens $\hat{t}_1, \hat{t}_2, \hat{t}_3, \hat{t}_e$ via
$\hat{t}_m = \arg\max_{i=1,\dots,V} \big[\mathrm{softmax}(\boldsymbol{W} t_m^{(N)})\big]_i$
The Linear layer uses a weight matrix $\boldsymbol{W}$ of size $V \times D$, where $V$ is the vocab size and $D$ is the dimensionality of the last decoder block embeddings
Note the one position (towards right) shift between the decoder's input vs output
Each FF (feed-forward) in encoder and decoder blocks is usually a linear layer + ReLU nonlinearity + another linear layer
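As a rough illustration of the position-wise FF block and the decoder's output layer described above, here is a minimal PyTorch sketch; the class names, dimensions, and the choice of nn.Linear/nn.Sequential are my own, not from the slides:

```python
import torch
import torch.nn as nn

class PositionwiseFF(nn.Module):
    """FF block inside each encoder/decoder block: linear layer + ReLU
    nonlinearity + another linear layer, applied to each position separately."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):              # x: (batch, seq_len, d_model)
        return self.net(x)             # acts on each token independently

class DecoderOutputLayer(nn.Module):
    """Decoder's output layer: a V x D linear map followed by a softmax,
    giving a distribution over the vocabulary at each position."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)   # weight matrix W of size V x D

    def forward(self, t):              # t: last decoder block's embeddings (batch, seq_len, d_model)
        probs = torch.softmax(self.proj(t), dim=-1)  # (batch, seq_len, V)
        return probs.argmax(dim=-1)                  # predicted token index at each position
```

Both modules act on each position independently, which is exactly what "applied position-wise" means above.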
Layer Normalization
Normalization helps improve training and performance overall
Unlike batch normalization (BN), which we already saw, layer normalization (LN) normalizes each input $\boldsymbol{x}_n$ across its dimensions (not across all minibatch examples)
LN is commonly used for sequence data models (e.g., RNNs and transformers) where BN is difficult to apply
LN is also useful when batch sizes are small (or equal to 1), where BN statistics (mean/var) aren't reliable
After the LN operation, we apply another transformation defined by another set of learnable weights (just like we did in BN using $\gamma$ and $\beta$)
Figure source: Dive into Deep Learning (Zhang et al, 2023)
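A minimal sketch of the LN computation for a batch of inputs; the eps value and the names gamma/beta for the learnable scale and shift are assumptions, mirroring the standard formulation:

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: normalize each example x (shape (..., d))
    across its own d dimensions, then apply a learnable scale and shift."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)   # zero mean, unit variance per example
    return gamma * x_hat + beta                  # learnable re-scaling (like gamma/beta in BN)

# Usage: normalize a batch of 4 token embeddings of dimension 8
x = torch.randn(4, 8)
gamma, beta = torch.ones(8), torch.zeros(8)
out = layer_norm(x, gamma, beta)
```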
Unsupervised Pre-training
Self-supervised learning is a powerful idea for learning good representations without supervision
Self-supervised learning will help us learn a good encoder (feature representation)
The idea: hide part of the input and predict it using the remaining parts
Models like BERT, GPT are usually pre-trained using self-supervised learning
Then we can finetune them further for a given task using labelled data for that task
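A rough sketch of the "hide part of the input and predict it" idea in code (a BERT-style masked prediction setup; the model sizes, the 15% masking rate, and the use of PyTorch's TransformerEncoder are illustrative assumptions, not the slides' recipe):

```python
import torch
import torch.nn as nn

# Toy setup: an encoder over token embeddings and a prediction head
# that tries to recover the hidden (masked) tokens from the rest.
vocab_size, d_model, seq_len, batch = 1000, 64, 16, 8

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, vocab_size)         # predicts the identity of each masked token

tokens = torch.randint(0, vocab_size, (batch, seq_len))
mask = torch.rand(batch, seq_len) < 0.15      # hide ~15% of the input positions
inputs = tokens.clone()
inputs[mask] = 0                              # token id 0 plays the role of a [MASK] token here

logits = head(encoder(embed(inputs)))         # (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # loss only on hidden positions
loss.backward()
```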
Auto-encoders
A special type of self-supervised learning: the whole input is predicted by first compressing it and then uncompressing it
$\hat{x} = g(f(x))$
A probabilistic variant, called the variational auto-encoder (VAE), has a decoder that can generate new data (a standard AE's decoder can't generate "new" data)
$f$ (the encoder) learns representations of the input, and $g$ (the decoder) reconstructs it; the compression ensures that we don't learn an identity mapping from $x$ to $\hat{x}$
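A minimal sketch of $\hat{x} = g(f(x))$ as code; the layer sizes and the MSE reconstruction loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """x_hat = g(f(x)): encoder f compresses the input to a low-dimensional
    code (so the model can't simply copy the input), decoder g reconstructs
    the input from that code."""
    def __init__(self, d_in=784, d_code=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_code))  # encoder f
        self.g = nn.Sequential(nn.Linear(d_code, 128), nn.ReLU(), nn.Linear(128, d_in))  # decoder g

    def forward(self, x):
        return self.g(self.f(x))

# Training minimizes the reconstruction error ||x - x_hat||^2
model = AutoEncoder()
x = torch.randn(16, 784)
loss = ((x - model(x)) ** 2).mean()
```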
Convolution-less Models for Images: ViT
Transformers can be used for images as well#. For image classification, it looks like this
Treat image patches as tokens of a sequence
Also use the position information
Early work showed ViT can outperform CNNs given very large amounts of training data
However, recent work* has shown that good old CNNs still rule! ViT and CNN perform comparably at scale, i.e., when both are given large amounts of compute and training data
# An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al, 2020)
* ConvNets Match Vision Transformers at Scale (Smith et al, 2023)
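A rough sketch of how "treat image patches as tokens" might look in code; the 16x16 patch size, class token, and positional embeddings follow the ViT paper's general recipe, but the exact dimensions and names here are my own simplifications:

```python
import torch
import torch.nn as nn

img_size, patch, d_model = 224, 16, 128
num_patches = (img_size // patch) ** 2                       # 14*14 = 196 patch "tokens"

proj = nn.Linear(3 * patch * patch, d_model)                 # flattened patch -> token embedding
pos = nn.Parameter(torch.zeros(1, num_patches + 1, d_model)) # learnable position information
cls = nn.Parameter(torch.zeros(1, 1, d_model))               # class token used for classification

x = torch.randn(8, 3, img_size, img_size)                    # a batch of images
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)  # (8, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(8, num_patches, -1)
tokens = proj(patches)                                       # (8, 196, d_model)
tokens = torch.cat([cls.expand(8, -1, -1), tokens], dim=1) + pos
# `tokens` is now a sequence that can be fed to a standard transformer encoder
```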
Convolution-less Models for Images: MLP-mixer
Many MLPs can be mixed to construct more powerful deep models (“MLP-mixer”)
'T' in the figure stands for Transpose
MLP-Mixer: An all-MLP Architecture for Vision (Tolstikhin et al, 2021)
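A rough sketch of one mixer block, showing where the transpose ('T') comes in; the layer sizes and the use of GELU/LayerNorm are assumptions based on the MLP-Mixer paper, simplified here:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block (sketch): a token-mixing MLP applied across the
    patch/token dimension (hence the transpose) followed by a channel-mixing
    MLP applied across the feature dimension, each with a residual connection."""
    def __init__(self, num_tokens, d_model, d_hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.token_mlp = nn.Sequential(nn.Linear(num_tokens, d_hidden), nn.GELU(), nn.Linear(d_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(d_model)
        self.channel_mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

    def forward(self, x):                         # x: (batch, num_tokens, d_model)
        y = self.norm1(x).transpose(1, 2)         # transpose so the MLP mixes across tokens
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))   # mix across channels (features)
        return x
```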
Bias-Variance Trade-off
Assume $\mathcal{F}$ to be a class of models (e.g., linear classifiers with some pre-defined features)
Suppose we've learned a model $\hat{f} \in \mathcal{F}$ using some (finite amount of) training data
We can decompose the test error of $\hat{f}$ as follows (written out below)
We can reduce the bias by making the class $\mathcal{F}$ richer, e.g., by going from linear models to deep nets or by adding more features
Making $\mathcal{F}$ richer will also cause the estimation error to increase. Reason: we are now learning a more complex model using the same amount of training data
Here $f^*$ is the best possible model in $\mathcal{F}$, assuming an infinite amount of training data
Approximation error: Error of $f^*$ because of the model class $\mathcal{F}$ being too simple
Also known as "bias" (high if the model is simple)
Estimation error: Error of $\hat{f}$ (relative to $f^*$) because we only had finite training data
Also known as "variance" (high if the model is complex)
Because we can’t keep both low, this is known as the bias-variance trade-off
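One schematic way to write the decomposition described above (my notation, consistent with the definitions on this slide: $\hat{f}$ is the learned model, $f^*$ the best model in $\mathcal{F}$, and $\mathrm{err}(\cdot)$ the test error):

$\underbrace{\mathrm{err}(\hat{f})}_{\text{test error}} \;=\; \underbrace{\mathrm{err}(f^*)}_{\text{approximation error ("bias")}} \;+\; \underbrace{\mathrm{err}(\hat{f}) - \mathrm{err}(f^*)}_{\text{estimation error ("variance")}}$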
Bias-Variance Trade-off
The bias-variance trade-off tells us how training/test losses vary as we increase model complexity
Deep Neural Nets and Bias-Variance Trade-off
Bias-variance trade-off doesn’t explain well why deep neural networks work so well
They have very large model complexity (a massive number of parameters; massively "overparametrized")
Despite being massively overparametrized, deep neural nets still work well because
Implicit regularization: SGD has noise (randomly chosen minibatches) which performs regularization
These networks have many local minima and all of them are roughly equally good
SGD on overparametrized models usually converges to “flat” minima (less chance of overfitting)
A flat minimum: such a solution is less likely to be an overfitted solution because other nearby solutions are also reasonably good
A sharp minimum: such minima are not good because they might represent an overfitted solution; SGD, because of its "noise", can escape such sharp minima
Common Types of Layers used in Deep Learning
Linear layer: Have the form $\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$ (used in fully connected networks like MLP and also in some parts of other types of models like CNN, RNN, transformers, etc)
Nonlinearity: Activation functions (sigmoid, tanh, ReLU, etc)
Essential for any deep neural network (without them, deep nets can’t learn nonlinear functions)
Convolutional layer: Have the form $\boldsymbol{W} * \boldsymbol{x}$ (here * denotes the conv operation)
Usually used in conjunction with pooling layers (e.g., max pooling, average pooling)
Residual or skip connections: Help when learning very deep networks (e.g., ResNets,
transformers, etc) by avoiding vanishing/exploding gradients
Normalization layer such as batch normalization and layer normalization
Dropout layer: Helps to regularize the network
Recurrent layer: Used in sequential data models such as RNNs and variants
Attention layer: Used in encoder-decoder models like transformers (also in some RNN
variants)
Multiplicative layer: Have a form that combines the two parts of the input multiplicatively (used when each input has two parts, say $\boldsymbol{x}$ and $\boldsymbol{z}$)
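A toy sketch combining several of these layer types in one module; the specific sizes and the particular combination are illustrative, not from the slides:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Illustrates several common layer types: a convolutional layer with max
    pooling, a linear layer with a ReLU nonlinearity, dropout for regularization,
    layer normalization, and a residual (skip) connection."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # convolutional layer
        self.pool = nn.MaxPool2d(2)                             # pooling layer
        self.fc = nn.Linear(8 * 16 * 16, 128)                   # linear layer (W x + b)
        self.norm = nn.LayerNorm(128)                           # normalization layer
        self.drop = nn.Dropout(0.1)                             # dropout layer
        self.resid = nn.Linear(128, 128)                        # used inside a skip connection

    def forward(self, x):                                       # x: (batch, 3, 32, 32)
        h = self.pool(torch.relu(self.conv(x)))                 # conv + ReLU + pooling
        h = torch.relu(self.fc(h.flatten(1)))                   # linear + nonlinearity
        h = self.norm(self.drop(h))                             # dropout + layer norm
        return h + self.resid(h)                                # residual / skip connection

out = ToyBlock()(torch.randn(4, 3, 32, 32))                     # output shape: (4, 128)
```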
Popular Deep Learning Architectures
MLP: Feedforward fully connected network
Not preferred when inputs have spatial/sequential structures (e.g., image, text)
Some variants of MLP (e.g., MLP-mixer) perform very well on such data as well
CNN: Feedforward but NOT fully connected (though the last few layers, especially the output layer, are)
RNNs: Not feedforward (the hidden state of one timestep connects with that of the next)
Graph Neural Networks: Used when inputs are graphs (e.g., molecules)