Lecture 26
$t_1, t_2, t_3$ denote the output ("target") token embeddings (assuming the output sequence length is also 3, and $t_s$ is a special start-of-sentence (SOS) token)
Superscript $(\ell)$ denotes the layer $\ell$. At layer 0, we have the original token embeddings (with added positional encodings)
In the decoder's output layer, at each step, we predict the most likely output token for that step ($\hat{t}_e$ denotes the special end-of-sentence (EOS) token)
In the decoder's input layer, tokens in the output sequence are shown shifted right by one position (because, in the output layer, the next-token prediction depends on the previously predicted token in the output sequence)
The feedforward (FF) and linear layers are applied position-wise (for each token separately)
[Figure: Overall Transformer Architecture]
An Encoder Block (N such blocks): the token embeddings $s_1^{(\ell-1)}, s_2^{(\ell-1)}, s_3^{(\ell-1)}$ pass through Self-Attention Layer → Layer Normalization → FF → Layer Normalization to produce $s_1^{(\ell)}, s_2^{(\ell)}, s_3^{(\ell)}$
A Decoder Block (N such blocks): the token embeddings $t_s^{(\ell-1)}, t_1^{(\ell-1)}, t_2^{(\ell-1)}, t_3^{(\ell-1)}$ pass through Masked Self-Attention Layer → Layer Normalization → Cross-Attention Layer (connected with the corresponding encoder block) → Layer Normalization → FF → Layer Normalization to produce $t_s^{(\ell)}, t_1^{(\ell)}, t_2^{(\ell)}, t_3^{(\ell)}$
Decoder's output layer: at each position $m$, a Linear layer followed by a Softmax is applied to the last decoder block's embedding $t_m^{(N)}$, giving the predicted tokens $\hat{t}_1, \hat{t}_2, \hat{t}_3, \hat{t}_e$ via
$\hat{t}_m = \arg\max_{i=1,\dots,V} \big[\mathrm{softmax}(\boldsymbol{W} t_m^{(N)})\big]_i$
The Linear layer uses a weight matrix $\boldsymbol{W}$ of size $V \times D$, where $V$ is the vocab size and $D$ is the dimensionality of the last decoder block embeddings
Note the one position (towards right) shift between the decoder's input vs output
Each FF (feed-forward) in encoder and decoder blocks is usually a linear layer + ReLU nonlinearity + another linear layer
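As a rough illustration of the position-wise FF block and the decoder's output layer described above, here is a minimal PyTorch sketch; the class names, dimensions, and the choice of nn.Linear/nn.Sequential are my own, not from the slides:

```python
import torch
import torch.nn as nn

class PositionwiseFF(nn.Module):
    """FF block inside each encoder/decoder block: linear layer + ReLU
    nonlinearity + another linear layer, applied to each position separately."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):              # x: (batch, seq_len, d_model)
        return self.net(x)             # acts on each token independently

class DecoderOutputLayer(nn.Module):
    """Decoder's output layer: a V x D linear map followed by a softmax,
    giving a distribution over the vocabulary at each position."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)   # weight matrix W of size V x D

    def forward(self, t):              # t: last decoder block's embeddings (batch, seq_len, d_model)
        probs = torch.softmax(self.proj(t), dim=-1)  # (batch, seq_len, V)
        return probs.argmax(dim=-1)                  # predicted token index at each position
```

Both modules act on each position independently, which is exactly what "applied position-wise" means above.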
Layer Normalization
Normalization helps improve training and performance overall
Unlike batch normalization (BN), which we already saw, layer normalization (LN) normalizes each input $\boldsymbol{x}_n$ across its dimensions (not across all minibatch examples)
LN is commonly used for sequence data models (e.g., RNNs and transformers) where BN is difficult to apply
LN is also useful when batch sizes are small (or equal to 1), where BN statistics (mean/var) aren't reliable
After the LN operation, we apply another transformation defined by another set of learnable weights (just like we did in BN using $\gamma$ and $\beta$)
Figure source: Dive into Deep Learning (Zhang et al, 2023)
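A minimal sketch of the LN computation for a batch of inputs; the eps value and the names gamma/beta for the learnable scale and shift are assumptions, mirroring the standard formulation:

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: normalize each example x (shape (..., d))
    across its own d dimensions, then apply a learnable scale and shift."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)   # zero mean, unit variance per example
    return gamma * x_hat + beta                  # learnable re-scaling (like gamma/beta in BN)

# Usage: normalize a batch of 4 token embeddings of dimension 8
x = torch.randn(4, 8)
gamma, beta = torch.ones(8), torch.zeros(8)
out = layer_norm(x, gamma, beta)
```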
Unsupervised Pre-training
Self-supervised learning is a powerful idea for learning good representations without supervision
Self-supervised learning will help us learn a good encoder (feature representation)
The idea: hide part of the input and predict it using the remaining parts
Models like BERT, GPT are usually pre-trained using self-supervised learning
Then we can finetune them further for a given task using labelled data for that task
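A rough sketch of the "hide part of the input and predict it" idea in code (a BERT-style masked prediction setup; the model sizes, the 15% masking rate, and the use of PyTorch's TransformerEncoder are illustrative assumptions, not the slides' recipe):

```python
import torch
import torch.nn as nn

# Toy setup: an encoder over token embeddings and a prediction head
# that tries to recover the hidden (masked) tokens from the rest.
vocab_size, d_model, seq_len, batch = 1000, 64, 16, 8

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, vocab_size)         # predicts the identity of each masked token

tokens = torch.randint(0, vocab_size, (batch, seq_len))
mask = torch.rand(batch, seq_len) < 0.15      # hide ~15% of the input positions
inputs = tokens.clone()
inputs[mask] = 0                              # token id 0 plays the role of a [MASK] token here

logits = head(encoder(embed(inputs)))         # (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # loss only on hidden positions
loss.backward()
```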
Auto-encoders
A special type of self-supervised learning: the whole input is predicted by first compressing it and then uncompressing it
$\hat{x} = g(f(x))$
A probabilistic variant, called the variational auto-encoder (VAE), has a decoder that can generate new data (a standard AE's decoder can't generate "new" data)
$f$ (the encoder) learns representations of the input, and $g$ (the decoder) reconstructs it; the compression ensures that we don't learn an identity mapping from $x$ to $\hat{x}$
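A minimal sketch of $\hat{x} = g(f(x))$ as code; the layer sizes and the MSE reconstruction loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """x_hat = g(f(x)): encoder f compresses the input to a low-dimensional
    code (so the model can't simply copy the input), decoder g reconstructs
    the input from that code."""
    def __init__(self, d_in=784, d_code=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_code))  # encoder f
        self.g = nn.Sequential(nn.Linear(d_code, 128), nn.ReLU(), nn.Linear(128, d_in))  # decoder g

    def forward(self, x):
        return self.g(self.f(x))

# Training minimizes the reconstruction error ||x - x_hat||^2
model = AutoEncoder()
x = torch.randn(16, 784)
loss = ((x - model(x)) ** 2).mean()
```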
Convolution-less Models for Images: ViT
Transformers can be used for images as well#. For image classification, it looks like this
Treat image patches as tokens of a sequence
Also use the position information
Early work showed ViT can outperform CNNs given very large amounts of training data
However, recent work* has shown that good old CNNs still rule! ViT and CNN perform comparably at scale, i.e., when both are given large amounts of compute and training data
# An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al, 2020)
* ConvNets Match Vision Transformers at Scale (Smith et al, 2023)
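A rough sketch of how "treat image patches as tokens" might look in code; the 16x16 patch size, class token, and positional embeddings follow the ViT paper's general recipe, but the exact dimensions and names here are my own simplifications:

```python
import torch
import torch.nn as nn

img_size, patch, d_model = 224, 16, 128
num_patches = (img_size // patch) ** 2                       # 14*14 = 196 patch "tokens"

proj = nn.Linear(3 * patch * patch, d_model)                 # flattened patch -> token embedding
pos = nn.Parameter(torch.zeros(1, num_patches + 1, d_model)) # learnable position information
cls = nn.Parameter(torch.zeros(1, 1, d_model))               # class token used for classification

x = torch.randn(8, 3, img_size, img_size)                    # a batch of images
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)  # (8, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(8, num_patches, -1)
tokens = proj(patches)                                       # (8, 196, d_model)
tokens = torch.cat([cls.expand(8, -1, -1), tokens], dim=1) + pos
# `tokens` is now a sequence that can be fed to a standard transformer encoder
```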
Convolution-less Models for Images: MLP-mixer
Many MLPs can be mixed to construct more powerful deep models (“MLP-mixer”)
'T' in the figure stands for Transpose
MLP-Mixer: An all-MLP Architecture for Vision (Tolstikhin et al, 2021)
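A rough sketch of one mixer block, showing where the transpose ('T') comes in; the layer sizes and the use of GELU/LayerNorm are assumptions based on the MLP-Mixer paper, simplified here:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block (sketch): a token-mixing MLP applied across the
    patch/token dimension (hence the transpose) followed by a channel-mixing
    MLP applied across the feature dimension, each with a residual connection."""
    def __init__(self, num_tokens, d_model, d_hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.token_mlp = nn.Sequential(nn.Linear(num_tokens, d_hidden), nn.GELU(), nn.Linear(d_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(d_model)
        self.channel_mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

    def forward(self, x):                         # x: (batch, num_tokens, d_model)
        y = self.norm1(x).transpose(1, 2)         # transpose so the MLP mixes across tokens
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))   # mix across channels (features)
        return x
```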
Bias-Variance Trade-off
Assume $\mathcal{F}$ to be a class of models (e.g., linear classifiers with some pre-defined features)
Suppose we've learned a model $\hat{f} \in \mathcal{F}$ using some (finite amount of) training data
We can decompose the test error of $\hat{f}$ as follows (written out below)
We can reduce the bias by making the class $\mathcal{F}$ richer, e.g., by going from linear models to deep nets or by adding more features
Making $\mathcal{F}$ richer will also cause the estimation error to increase. Reason: we are now learning a more complex model using the same amount of training data
Here $f^*$ is the best possible model in $\mathcal{F}$, assuming an infinite amount of training data
Approximation error: Error of $f^*$ because of the model class $\mathcal{F}$ being too simple
Also known as "bias" (high if the model is simple)
Estimation error: Error of $\hat{f}$ (relative to $f^*$) because we only had finite training data
Also known as "variance" (high if the model is complex)
Because we can’t keep both low, this is known as the bias-variance trade-off
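One schematic way to write the decomposition described above (my notation, consistent with the definitions on this slide: $\hat{f}$ is the learned model, $f^*$ the best model in $\mathcal{F}$, and $\mathrm{err}(\cdot)$ the test error):

$\underbrace{\mathrm{err}(\hat{f})}_{\text{test error}} \;=\; \underbrace{\mathrm{err}(f^*)}_{\text{approximation error ("bias")}} \;+\; \underbrace{\mathrm{err}(\hat{f}) - \mathrm{err}(f^*)}_{\text{estimation error ("variance")}}$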
Bias-Variance Trade-off
The bias-variance trade-off tells us how training/test losses vary as we increase model complexity
Deep Neural Nets and Bias-Variance Trade-off
Bias-variance trade-off doesn’t explain well why deep neural networks work so well
They have very large model complexity (a massive number of parameters; massively "overparametrized")
Despite being massively overparametrized, deep neural nets still work well because
Implicit regularization: SGD has noise (randomly chosen minibatches) which performs regularization
These networks have many local minima and all of them are roughly equally good
SGD on overparametrized models usually converges to “flat” minima (less chance of overfitting)
A flat minimum: such a solution is less likely to be an overfitted solution because other nearby solutions are also reasonably good
A sharp minimum: such minima are not good because they might represent an overfitted solution; SGD, because of its "noise", can escape such sharp minima
Common Types of Layers used in Deep Learning
Linear layer: Have the form $\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$ (used in fully connected networks like MLP and also in some parts of other types of models like CNN, RNN, transformers, etc)
Nonlinearity: Activation functions (sigmoid, tanh, ReLU, etc)
Essential for any deep neural network (without them, deep nets can’t learn nonlinear functions)
Convolutional layer: Have the form $\boldsymbol{W} * \boldsymbol{x}$ (here * denotes the conv operation)
Usually used in conjunction with pooling layers (e.g., max pooling, average pooling)
Residual or skip connections: Help when learning very deep networks (e.g., ResNets,
transformers, etc) by avoiding vanishing/exploding gradients
Normalization layer such as batch normalization and layer normalization
Dropout layer: Helps to regularize the network
Recurrent layer: Used in sequential data models such as RNNs and variants
Attention layer: Used in encoder-decoder models like transformers (also in some RNN
variants)
Multiplicative layer: Have a form that combines the two parts of the input multiplicatively (used when each input has two parts, say $\boldsymbol{x}$ and $\boldsymbol{z}$)
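A toy sketch combining several of these layer types in one module; the specific sizes and the particular combination are illustrative, not from the slides:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Illustrates several common layer types: a convolutional layer with max
    pooling, a linear layer with a ReLU nonlinearity, dropout for regularization,
    layer normalization, and a residual (skip) connection."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # convolutional layer
        self.pool = nn.MaxPool2d(2)                             # pooling layer
        self.fc = nn.Linear(8 * 16 * 16, 128)                   # linear layer (W x + b)
        self.norm = nn.LayerNorm(128)                           # normalization layer
        self.drop = nn.Dropout(0.1)                             # dropout layer
        self.resid = nn.Linear(128, 128)                        # used inside a skip connection

    def forward(self, x):                                       # x: (batch, 3, 32, 32)
        h = self.pool(torch.relu(self.conv(x)))                 # conv + ReLU + pooling
        h = torch.relu(self.fc(h.flatten(1)))                   # linear + nonlinearity
        h = self.norm(self.drop(h))                             # dropout + layer norm
        return h + self.resid(h)                                # residual / skip connection

out = ToyBlock()(torch.randn(4, 3, 32, 32))                     # output shape: (4, 128)
```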
Popular Deep Learning Architectures
MLP: Feedforward fully connected network
Not preferred when inputs have spatial/sequential structures (e.g., image, text)
Some variants of MLP (e.g., MLP-mixer) perform very well on such data as well
CNN: Feedforward but NOT fully connected (though the last few layers, especially the output layer, are)
RNNs: Not feedforward (the hidden state of one timestep connects with that of the next)
Graph Neural Networks: Used when inputs are graphs (e.g., molecules)