Lecture 26


Deep Neural Networks: Assorted Topics and Some Recent Advances

CS771: Introduction to Machine Learning


Piyush Rai
Transformers (recap)

 In the encoder and decoder, the blue arrows denote skip/residual connections
 $s_1, s_2, s_3$ denote the input ("source") token embeddings (assuming the input sequence length is 3)
 $t_s, t_1, t_2, t_3$ denote the output ("target") token embeddings (assuming the output sequence length is also 3; $t_s$ is a special start-of-sentence (SOS) token)
 Superscript $(\ell)$ denotes the layer $\ell$. At layer 0, we have the original token embeddings (with added positional encodings)
 In the decoder's output layer, at each step we predict the most likely output token for that step ($\hat{t}_e$ denotes the special end-of-sentence (EOS) token)
 In the decoder's input layer, the tokens of the output sequence are shown shifted right by one position (because, in the output layer, the next-token prediction depends on the previously predicted token in the output sequence)
 The feedforward (FF) and linear layers are applied position-wise (to each token separately)
 Decoder output prediction at position $m$: $\hat{t}_m = \arg\max_{i=1,\dots,V} \mathrm{softmax}(\boldsymbol{W} t_m^{(N)})_i$, with weight matrix $\boldsymbol{W}$ of size $V \times d$, where $V$ is the vocabulary size and $d$ is the dimensionality of the last decoder block's embeddings
 Each FF (feed-forward) layer in the encoder and decoder blocks is usually a linear layer + ReLU nonlinearity + another linear layer

[Figure: overall Transformer architecture. An encoder block (N such blocks) stacks a self-attention layer, layer normalization, and position-wise FF layers. A decoder block (N such blocks) stacks a masked self-attention layer, layer normalization, a cross-attention layer connected with the corresponding encoder block, and FF layers. The decoder's output layer applies a position-wise linear layer followed by a softmax to produce $\hat{t}_1, \hat{t}_2, \hat{t}_3, \hat{t}_e$. Note the one-position (rightward) shift between the decoder's input and output.]
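As a rough illustration of the prediction step at the decoder's output layer, here is a minimal sketch (made-up sizes, not the course's code) of computing $\hat{t}_m = \arg\max_i \mathrm{softmax}(\boldsymbol{W} t_m^{(N)})_i$:

```python
import numpy as np

# Minimal sketch (made-up sizes, not the course's code): greedy prediction of the
# most likely token at one decoder position, given the last decoder block's
# embedding t_m (dimension d) and the output weight matrix W (V x d).

def softmax(z):
    z = z - z.max()                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_token(W, t_m):
    """Return argmax_i softmax(W t_m)_i, i.e., the index of the most likely token."""
    probs = softmax(W @ t_m)         # shape (V,)
    return int(np.argmax(probs))

rng = np.random.default_rng(0)
V, d = 10, 4                         # hypothetical vocabulary size and embedding dim
W = rng.normal(size=(V, d))
t_m = rng.normal(size=d)
print(predict_token(W, t_m))
```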
Layer Normalization

 Normalization helps improve training and performance overall
 Unlike batch normalization (BN), which we already saw, layer normalization (LN) normalizes each input vector across its own dimensions (not across all minibatch examples)
   LN is commonly used for sequence-data models (e.g., RNNs and transformers), where BN is difficult to apply
   LN is also useful when batch sizes are small (or equal to 1), where BN statistics (mean/var) aren't reliable
 After the LN operation, we apply another transformation defined by another set of learnable weights (just like we did in BN using $\gamma$ and $\beta$)
 For an MLP, the LN operation would look like this:
   $\boldsymbol{h}_n^{(2)} = [0.1,\ 0.8,\ 0.3] \;\xrightarrow{\text{LN}}\; [-1.02,\ 1.36,\ -0.34]$ (after LN, $\boldsymbol{h}_n^{(2)}$ has zero mean and unit std-dev along its dimensions)
   $\boldsymbol{h}_n^{(1)} = [0.5,\ 0.9,\ 0.7] \;\xrightarrow{\text{LN}}\; [-1.22,\ 1.22,\ 0.0]$ (after LN, $\boldsymbol{h}_n^{(1)}$ has zero mean and unit std-dev along its dimensions)
 Unlike in an MLP, in RNNs/transformers each input is a sequence
   We have a sequence of token embedding vectors for each input
   LN simply normalizes each token embedding vector as above (a small numerical sketch follows below)
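A minimal sketch of the LN computation for a single vector, using the example values above ($\gamma$ and $\beta$ default to the identity transformation here):

```python
import numpy as np

# Minimal sketch: layer-normalize one vector h (one example, or one token
# embedding) across its own dimensions. gamma/beta are the learnable scale and
# shift applied after normalization; eps avoids division by zero.

def layer_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    mean = h.mean()
    std = h.std()                    # statistics over h's own dimensions only
    return gamma * (h - mean) / (std + eps) + beta

print(np.round(layer_norm(np.array([0.1, 0.8, 0.3])), 2))   # [-1.02  1.36 -0.34]
print(np.round(layer_norm(np.array([0.5, 0.9, 0.7])), 2))   # [-1.22  1.22  0.  ]
```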
Residual/Skip Connections

 Transformers (and many other modern deep nets) contain a very large number of layers
 In general, just stacking lots of layers doesn't necessarily help a deep learning model
   Vanishing/exploding gradients may make learning difficult
 Skip connections, or "residual connections", help if we want very deep networks
   This idea was popularized by "Residual Networks"* (ResNets), which can have hundreds of layers
 Basic idea: don't force a group of layers to learn everything about a mapping
   Add a "residual branch" or "short-cut" connection from the layers' input to their output, so the layers only have to learn the "residual", which reduces their burden
   We may need to perform an additional projection/adjustment so that the sizes of the input and of the layers' output match
 A minimal sketch of such a block is shown below

*Deep Residual Learning for Image Recognition (He et al, 2015)
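A minimal sketch (illustrative, not from the slides) of a two-layer residual block where the stacked layers learn only the residual and the short-cut carries the input:

```python
import numpy as np

# Minimal sketch (illustrative, not from the slides): a two-layer residual block.
# The stacked layers compute the residual F(x); the block outputs F(x) + x.
# If F(x) and x have different sizes, W_proj adjusts x to match.

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, b1, W2, b2, W_proj=None):
    f = W2 @ relu(W1 @ x + b1) + b2               # the residual F(x)
    shortcut = x if W_proj is None else W_proj @ x
    return f + shortcut                           # skip connection adds x back

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1, b1 = rng.normal(size=(8, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
print(residual_block(x, W1, b1, W2, b2).shape)    # (8,)
```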
Transformers have many variants

 The standard transformer architecture is an encoder-decoder model
 Some models use just the encoder or just the decoder of the transformer
 BERT (Bidirectional Encoder Representations from Transformers): a transformer which contains only the encoder
   Basic BERT can be learned to encode token sequences
   Trained unsupervisedly using a missing-token prediction objective (BERT tries to predict the missing/masked tokens)
   This encoder can be used for other tasks by fine-tuning
 GPT (Generative Pretrained Transformer): a transformer which contains only the decoder (also, no cross-attention, since there is no encoder)
   Basic GPT can be used to generate token sequences similar to its training data
   Pre-trained using a next-token prediction objective (its input just starts with a start-of-sentence token)
 A sketch of the attention masks that distinguish the two is shown below
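As a simplified illustration (an assumed way to sketch it, not taken from the slides), the key structural difference shows up in the self-attention masks: the encoder-only BERT attends bidirectionally, while the decoder-only GPT uses a causal mask suited to next-token prediction:

```python
import numpy as np

# Simplified sketch: attention masks for an encoder-only (BERT-style) vs a
# decoder-only (GPT-style) transformer. mask[i, j] = 1 means position i is
# allowed to attend to position j.

def bidirectional_mask(seq_len):
    # BERT-style encoder: every token may attend to every other token
    return np.ones((seq_len, seq_len))

def causal_mask(seq_len):
    # GPT-style (masked self-attention): token i may only attend to j <= i,
    # which is what makes next-token prediction well defined
    return np.tril(np.ones((seq_len, seq_len)))

print(bidirectional_mask(4))
print(causal_mask(4))
```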
Fine-tuning and Transfer Learning

 Deep neural networks trained on one dataset can be reused for another dataset
 It is like transferring the knowledge of one learning task to another learning task
 This is typically done by "freezing" most of the lower layers and updating only the output layer (or the top few layers); this is known as "fine-tuning"
   The initial model with frozen layers is called the "pre-trained" model and the updated model is called the "fine-tuned" model
 Example: BERT (pre-trained in an unsupervised manner) can be fine-tuned for a sentence classification task by adding a fully connected MLP to predict the class label of a sentence
 This example is for an MLP-like architecture, but fine-tuning can be done for other architectures as well, such as RNNs, CNNs, transformers, etc.

Figure source: Dive into Deep Learning (Zhang et al, 2023)
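A minimal sketch (a small MLP stands in for a pre-trained model such as BERT; names and sizes are made up) of freezing the pre-trained layers and fine-tuning only a newly added output head:

```python
import torch
import torch.nn as nn

# Minimal sketch (a small MLP stands in for the pre-trained model; sizes are
# made up): freeze the lower layers and fine-tune only a new output head.

pretrained_body = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
)
for p in pretrained_body.parameters():
    p.requires_grad = False                      # "freeze" the pre-trained layers

new_head = nn.Linear(32, 3)                      # new output layer for a 3-class task
model = nn.Sequential(pretrained_body, new_head)

# Only the unfrozen parameters (the new head) are given to the optimizer
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)
```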
Unsupervised Pre-training

 Self-supervised learning is a powerful idea for learning good representations unsupervisedly
   Hide part of the input and predict it using the remaining parts
   This helps us learn a good encoder (feature representation)
 Self-supervised learning is key to unsupervised pre-training of deep learning models
 Such pre-trained models can be fine-tuned for any new task given labelled data
 Models like BERT and GPT are usually pre-trained using self-supervised learning
   Then we can fine-tune them further for a given task, using labelled data for that task
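A minimal sketch (illustrative only; MASK_ID and the example sequence are hypothetical) of turning an unlabelled token sequence into a self-supervised training example by hiding one token and using it as the prediction target:

```python
import numpy as np

# Minimal sketch (illustrative): build a self-supervised training pair by hiding
# one token of an unlabelled sequence; the hidden token is the prediction target.
# MASK_ID is a hypothetical special token id.

MASK_ID = 0

def make_masked_example(token_ids, rng):
    token_ids = np.asarray(token_ids)
    pos = rng.integers(len(token_ids))       # pick a position to hide
    target = int(token_ids[pos])             # the model must predict this token
    masked = token_ids.copy()
    masked[pos] = MASK_ID                    # hide it in the input
    return masked, pos, target

rng = np.random.default_rng(0)
print(make_masked_example([5, 9, 2, 7], rng))
```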
Auto-encoders

 Auto-encoders (AE) are used for unsupervised feature learning
   An AE is a special type of self-supervised learning: the whole input is predicted by first compressing it and then uncompressing it
 An AE consists of an encoder $f$ and a decoder $g$, with $z = f(x)$ and $\hat{x} = g(f(x))$
   $f$ and $g$ can be deep neural networks (MLP, RNN, CNN, etc.)
 Goal: learn $f$ and $g$ such that the reconstruction error between $x$ and $\hat{x} = g(f(x))$ is small
 Note: usually only the encoder is of use after the AE has been trained (unless we want to use the decoder for reconstructing the inputs later)
 The dimensionality of $z$ can be chosen to be smaller or larger than that of $x$
   Sometimes we want it smaller, if using the AE for dimensionality reduction
   Sometimes we want it larger, for "overcomplete" feature representations of the input; in such cases, we need to impose additional constraints on $f$ and $g$ so that we don't learn an identity mapping from $x$ to $\hat{x}$
 If using a prior on $z$, we get a probabilistic latent variable model called a variational auto-encoder (VAE)
   A VAE can also generate synthetic data using its decoder (a standard AE's decoder can't generate "new" data)
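A minimal sketch (illustrative sizes) of training an auto-encoder so that the reconstruction $\hat{x} = g(f(x))$ stays close to $x$:

```python
import torch
import torch.nn as nn

# Minimal sketch (illustrative sizes): an auto-encoder with encoder f and decoder g,
# trained to make the reconstruction x_hat = g(f(x)) close to x.

f = nn.Sequential(nn.Linear(20, 5), nn.ReLU())   # encoder: x -> z (here dim(z) < dim(x))
g = nn.Linear(5, 20)                             # decoder: z -> x_hat

opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()                           # reconstruction error

x = torch.randn(64, 20)                          # a minibatch of unlabelled inputs
for _ in range(100):
    x_hat = g(f(x))
    loss = loss_fn(x_hat, x)
    opt.zero_grad()
    loss.backward()
    opt.step()

z = f(x)                                         # learned features for downstream use
```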
Convolution-less Models for Images: ViT

 Transformers can be used for images as well#. For image classification, it looks like this:
   Treat image patches as tokens of a sequence (also use the position information); see the sketch below
   Only the encoder part of the transformer is needed
   On the output side, we just need an MLP with a softmax output
 Early work showed ViT can outperform CNNs given a very large amount of training data
 However, recent work* has shown that good old CNNs still rule! ViT and CNNs perform comparably at scale, i.e., when both are given a large amount of compute and training data

# An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al, 2020)
* ConvNets Match Vision Transformers at Scale (Smith et al, 2023)
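A minimal sketch (assuming a single-channel image whose side is a multiple of the patch size) of turning an image into a sequence of flattened patch tokens plus their positions:

```python
import numpy as np

# Minimal sketch (assumes a single-channel image whose side is a multiple of the
# patch size): split an image into non-overlapping patches and flatten each patch
# into a "token" vector; keep the patch index as its position information.

def patchify(image, patch_size):
    """image: (H, W) array -> (num_patches, patch_size * patch_size) array."""
    H, W = image.shape
    patches = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size].reshape(-1))
    return np.stack(patches)

image = np.arange(16 * 16, dtype=float).reshape(16, 16)
tokens = patchify(image, patch_size=4)      # 16 patch tokens, each of dimension 16
positions = np.arange(len(tokens))          # position information for each token
print(tokens.shape, positions[:5])
```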
Convolution-less Models for Images: MLP-mixer

 Many MLPs can be mixed to construct more powerful deep models ("MLP-mixer")
 In the figure, 'T' stands for transpose

MLP-Mixer: An all-MLP Architecture for Vision (Tolstikhin et al, 2021)
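As an informal sketch (an assumption about the mechanism, not the paper's exact architecture): the transpose lets one MLP mix information across patch tokens while another MLP mixes across channels:

```python
import numpy as np

# Informal sketch (not the paper's exact architecture): one mixer-style layer.
# X has shape (num_tokens, num_channels). The transpose lets the first MLP mix
# across tokens; the second MLP mixes across channels. Residual connections added.

def relu(z):
    return np.maximum(z, 0.0)

def mlp(X, W1, W2):
    return relu(X @ W1) @ W2

def mixer_layer(X, tok_W1, tok_W2, ch_W1, ch_W2):
    X = X + mlp(X.T, tok_W1, tok_W2).T   # token-mixing MLP (applied to X transposed)
    X = X + mlp(X, ch_W1, ch_W2)         # channel-mixing MLP
    return X

rng = np.random.default_rng(0)
T, C, H = 16, 8, 32                      # tokens, channels, hidden width (made up)
X = rng.normal(size=(T, C))
out = mixer_layer(X,
                  rng.normal(size=(T, H)), rng.normal(size=(H, T)),
                  rng.normal(size=(C, H)), rng.normal(size=(H, C)))
print(out.shape)                         # (16, 8)
```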
Bias-Variance Trade-off

 Assume $\mathcal{F}$ to be a class of models (e.g., linear classifiers with some pre-defined features)
 Suppose we've learned a model $\hat{f} \in \mathcal{F}$ using some (finite amount of) training data
 We can decompose the test error of $\hat{f}$ into an approximation error and an estimation error, as follows
 Here $f^*$ is the best possible model in $\mathcal{F}$, assuming an infinite amount of training data
 Approximation error: error of $f^*$ because the model class $\mathcal{F}$ is too simple
   Also known as "bias" (high if the model class is simple)
   We can reduce the bias by making the class $\mathcal{F}$ richer, e.g., going from linear models to deep nets, or by adding more features
 Estimation error: error of $\hat{f}$ (relative to $f^*$) because we only had finite training data
   Also known as "variance" (high if the model class is complex)
   Making $\mathcal{F}$ richer will also cause the estimation error to increase (reason: we are now learning a more complex model using the same amount of training data)
 Because we can't keep both low, this is known as the bias-variance trade-off
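In generic notation (a sketch of the decomposition described above, with $\mathrm{err}(\cdot)$ denoting test error):

```latex
\underbrace{\mathrm{err}(\hat{f})}_{\text{test error}}
\;=\;
\underbrace{\mathrm{err}(f^{*})}_{\text{approximation error (bias)}}
\;+\;
\underbrace{\mathrm{err}(\hat{f}) - \mathrm{err}(f^{*})}_{\text{estimation error (variance)}}
```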
12
Bias-Variance Trade-off
 Bias-variance trade-off implies how training/test losses vary as we increase model
complexity

CS771: Intro to ML
Deep Neural Nets and the Bias-Variance Trade-off

 The bias-variance trade-off doesn't explain well why deep neural networks work so well
   They have very large model complexity (a massive number of parameters; they are massively "overparametrized")
 Despite being massively overparametrized, deep neural nets still work well because of:
   Implicit regularization: SGD has noise (randomly chosen minibatches), which performs regularization
     These networks have many local minima, and all of them are roughly equally good
     SGD on overparametrized models usually converges to "flat" minima (less chance of overfitting): a solution at a flat minimum is less likely to be overfitted because other nearby solutions are also reasonably good, whereas sharp minima are not good because they might represent an overfitted solution; SGD, because of its "noise", can escape such sharp minima
   Learning of good features from the raw data
   Ensemble-like effects (a deep neural net is akin to an ensemble of many simpler models)
   Training on very large datasets
Double Descent Phenomenon

 Overparametrized deep neural networks exhibit a "double descent" phenomenon
 The bias-variance trade-off is seen only in the underparametrized regime
 Beyond a point (in the overparametrized regime), the test error starts decreasing once again, even as the model gets more and more complex

Figure source: "A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning" (Dar et al, 2021)

Deep Neural Networks: A Summary

Common Types of Layers used in Deep Learning

 Linear layer: has the form $\boldsymbol{h} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$ (used in fully connected networks like MLPs, and also in some parts of other types of models like CNNs, RNNs, transformers, etc.)
 Nonlinearity: activation functions (sigmoid, tanh, ReLU, etc.)
   Essential for any deep neural network (without them, deep nets can't learn nonlinear functions)
 Convolutional layer: has the form $\boldsymbol{h} = \boldsymbol{W} * \boldsymbol{x}$ (here $*$ denotes the convolution operation)
   Usually used in conjunction with pooling layers (e.g., max pooling, average pooling)
 Residual or skip connections: help when learning very deep networks (e.g., ResNets, transformers, etc.) by avoiding vanishing/exploding gradients
 Normalization layers, such as batch normalization and layer normalization
 Dropout layer: helps to regularize the network
 Recurrent layer: used in sequential-data models such as RNNs and variants
 Attention layer: used in encoder-decoder models like transformers (also in some RNN variants)
 Multiplicative layer: combines the two parts $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ of an input multiplicatively, e.g., via an element-wise product (used when each input has two parts)
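A minimal sketch (made-up shapes; the element-wise product is one common choice, assumed here) of the linear and multiplicative layer forms listed above:

```python
import numpy as np

# Minimal sketch (made-up shapes): the linear layer form h = Wx + b, and one
# common (assumed) form of a multiplicative layer, the element-wise product of
# the two parts of the input.

def linear_layer(W, b, x):
    return W @ x + b                       # h = Wx + b

def multiplicative_layer(x1, x2):
    return x1 * x2                         # element-wise product of the two parts

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
x = rng.normal(size=4)
print(linear_layer(W, b, x))
print(multiplicative_layer(rng.normal(size=3), rng.normal(size=3)))
```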
Popular Deep Learning Architectures

 MLP: feedforward, fully connected network
   Not preferred when inputs have spatial/sequential structure (e.g., images, text)
   Some MLP variants (e.g., MLP-mixer) perform very well on such data as well
 CNN: feedforward but NOT fully connected (though the last few layers, especially the output layer, usually are)
 RNNs: not feedforward (the hidden state of one timestep connects with that of the next)
 Transformers: very powerful models for sequential data
   Unlike RNNs, they can process the inputs in parallel. They also use (self-)attention to better capture long-range dependencies and context in the input sequence
 Graph Neural Networks: used when inputs are graphs (e.g., molecules)