ML Modelling - part 1

The document provides an overview of various activation functions used in neural networks, including Softmax for single-label classification, Sigmoid for multi-label classification, and ReLU for internal layers. It discusses different architectures such as CNNs and RNNs, their applications, and modern NLP techniques like BERT and GPT, alongside transfer learning strategies. Additionally, it covers model evaluation metrics, regularization techniques, and methods to address issues like vanishing gradients and overfitting in machine learning models.


Activation Functions (given the input, what should my output be.. the mapping of input to output in the NN layers)

- Softmax : multi-class classification with a single label (identifying 1 object in an image)
- Sigmoid : multi-class classification with multiple labels (identifying all the objects in the image)
- TanH : RNNs
- ReLU : all other cases like internal layers of NNs etc. (Leaky ReLU for a small improvement, PReLU - parametric ReLU where the negative slope is learned via backpropagation, Swish for really deep NNs)

Softmax activation is used in the output layer when we want to predict only one class among several classes: it produces a probability distribution over the outputs so that we can choose the most probable one.
On the other hand, sigmoid is used when we want to predict all classes present. The sigmoid outputs a probability for each class independently of the others. ReLU activation is not used in the output layer because it outputs a value between 0 and infinity. Tanh is also not used there, as it outputs a value between -1 and 1.
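
To make the softmax vs. sigmoid distinction concrete, here is a minimal NumPy sketch (the logit values are made up for illustration, not taken from the notes): softmax produces one distribution that sums to 1, while sigmoid scores each class independently.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for 3 classes

# Softmax: single-label case -- probabilities sum to 1, pick the argmax
softmax = np.exp(logits) / np.sum(np.exp(logits))
print(softmax, softmax.sum())        # e.g. [0.66 0.24 0.10], sums to 1.0

# Sigmoid: multi-label case -- each class scored independently in (0, 1)
sigmoid = 1 / (1 + np.exp(-logits))
print(sigmoid)                       # each value judged against its own threshold
```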

CNNs
- Detects edges/patterns in the data
- MaxPooling2D - consolidates the 2D layer by taking the max value of each block
- Typical stack: Conv2D > MaxPooling2D > Dropout > Flatten > Dense > Dropout > Softmax (see the sketch after this list)
- Resource intensive
- Hyperparameters : Kernel size, Number of layers, Pooling, Choice of Optimizer
- LeNet-5 : Handwriting recognition
- AlexNet : Image classification
- GoogLeNet : Deeper while keeping good performance.. Inception modules (groups of CNN layers)
- ResNet (Residual Network) : Optimizes performance with "skip" connections..
ResNet-50 is widely used on AWS
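
A minimal Keras sketch of the Conv2D > MaxPooling2D > Dropout > Flatten > Dense > Dropout > Softmax stack listed above; the input shape, filter counts and class count are illustrative assumptions (e.g. 28x28 grayscale images, 10 classes), not values from the notes.

```python
from tensorflow.keras import layers, models

# Illustrative CNN following the stack in the notes (all sizes are assumptions)
model = models.Sequential([
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(pool_size=(2, 2)),   # consolidate by taking the max of each 2x2 block
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),  # single-label output
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```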

RNNs
- Time series or any kind of sequential data
- Works with feedback from the last result of the same neuron or group of neurons in the
layer.. the mechanism is like an Informatica persistent variable.. called a Memory Cell
- Sequence to sequence : Time series in, time series out
- Sequence to vector : Sentence to sentiment
- Vector to sequence : Image to captions
- Encoder > Decoder : Sequence to vector to sequence.. e.g. language translation
- Training uses backpropagation through time, which brings additional hyperparameters to
consider.. e.g. truncated backpropagation (how many time steps to unroll)
- RNN default behaviour : recent inputs have more influence
- LSTM : Long Short-Term Memory cell.. helps earlier inputs keep their influence..
Mitigates the problem of items losing their weight over time. In NLP applications,
words in a sentence may be significant regardless of their position.. e.g.
completing words in an incomplete sentence - so preserving context
- GRU : Simplified LSTM variant.. more popular in practice
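
A minimal Keras sketch of a sequence-to-vector RNN (e.g. sentence to sentiment); the vocabulary size, embedding size and unit count are illustrative assumptions. Swapping LSTM for GRU is a one-line change.

```python
from tensorflow.keras import layers, models

# Sequence-to-vector model: token ids in, single sentiment score out (sizes are assumptions)
model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),  # assumed 10k-word vocabulary
    layers.LSTM(64),                        # memory cell carries context across time steps; layers.GRU(64) also works
    layers.Dense(1, activation="sigmoid"),  # sentiment probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```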

Modern NLP/BERT/GPT
- Self attention mechanisms.. ability to provide context
- Processes all input data at once, UNLIKE RNNs which process it word by word
- DistilBERT : Reduces model size by 40%
- BERT : Bidirectional Encoder Representations from Transformers
- GPT : Generative Pre-trained Transformer
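
As a quick illustration of using such pre-trained transformers off the shelf, a minimal Hugging Face transformers sketch (assumes the transformers library is installed; the model names and example sentences are illustrative):

```python
from transformers import pipeline

# DistilBERT filling in a masked word -- an example of the "context" idea above
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
print(unmasker("Machine learning models are trained on [MASK] data."))

# GPT-style models generate text left-to-right instead
generator = pipeline("text-generation", model="gpt2")
print(generator("Transfer learning is useful because", max_length=30))
```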

Transfer Learning
- Approach 1 : Take pre-trained models and use + fine-tune for the use case : model zoos such as
Hugging Face
- Integrated with SageMaker via Hugging Face Deep Learning Containers (DLC)
- Example: take an HF DLC for BERT (which is trained on BookCorpus & Wikipedia)..
tokenize your own additional data in the same format... and further train with a LOW
LEARNING RATE to avoid deviating too much
- Approach 2 : Add new trainable layers on top of a frozen model (see the sketch after this list)
- Approach 3 : Re-train from scratch
- Approach 4 : Use it AS-IS
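
A minimal Keras sketch of Approach 2 (freeze a pre-trained base and add new trainable layers on top); the choice of ResNet50 as the base, the pooling mode and the 5-class head are illustrative assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Approach 2: frozen pre-trained base + new trainable head (base model and sizes are assumptions)
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                      # freeze the pre-trained weights

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),   # new trainable layers
    layers.Dense(5, activation="softmax"),  # e.g. 5 custom classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```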

Deep learning on EC2/EMR


- EMR supports Apache MXNet & GPU instance types
- Examples.. P3 (Tesla V100 GPUs), P2 (16x K80 GPUs), G3 (4x M60 GPUs), G5g (AWS
Graviton2 / NVIDIA T4G Tensor Core), P4d (A100 UltraClusters for supercomputing)..
OR use Deep Learning AMIs

*Tuning NNs
- Small batch sizes tend to not get stuck in local minima
- Large batch sizes can converge on the wrong solution at random
- Large learning rates can overshoot the correct solution
- Small learning rates increase training time
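
A minimal sketch of where these two knobs live in Keras; the model, data and the specific values (learning rate 1e-3, batch size 32) are illustrative assumptions, not recommendations from the notes.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

# Tiny model and random data, just to show where batch size and learning rate are set
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, size=1000)
model = models.Sequential([layers.Dense(16, activation="relu", input_shape=(20,)),
                           layers.Dense(1, activation="sigmoid")])

# Small learning rate = slower but steadier; large can overshoot the correct solution.
# Small batch size = noisier updates that can escape local minima; large can converge on a poor solution.
model.compile(optimizer=Adam(learning_rate=1e-3), loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=5, validation_split=0.2)
```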

Choosing activation functions (https://datascience.aero/aviation-function-deep-learning/)
- Conventional activation functions are mostly used in the following scenarios:
Softmax is used for multi-classification in logistic regression models.. Sigmoid and
tanh are used for binary classification in logistic regression models
- ReLU (Rectified Linear Units) is a very simple and efficient activation function that
has become very popular recently, especially in CNNs. It avoids and rectifies the
vanishing gradient problem, which mostly explains why it's used in almost all deep
learning problems nowadays. The gradient of ReLU is: f'(x) = 0 for x <= 0, 1 for x > 0...
the problem in this case is the dying ReLU, where a neuron enters the zero region and
stays there, never able to recover
- Leaky ReLU and PReLU (Parametric ReLU) introduce a few modifications to ReLU.
Both alternatives provide a small slope for negative ranges: static for Leaky ReLU,
and learned as a parameter for PReLU. This way, the "dying" ReLU problem is completely
solved as there are no zero-slope parts.
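
A tiny NumPy sketch of these functions and their gradients; the 0.01 leak and the PReLU parameter value are illustrative (in practice the PReLU slope is learned during training).

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])

relu       = np.maximum(0, x)               # output is 0 for x <= 0
relu_grad  = (x > 0).astype(float)          # gradient 0 for x <= 0, 1 for x > 0 -> "dying ReLU" region

leaky_relu = np.where(x > 0, x, 0.01 * x)   # small fixed slope (0.01) for negative inputs
alpha      = 0.1                            # PReLU: this slope is a learned parameter
prelu      = np.where(x > 0, x, alpha * x)  # no zero-slope region, so neurons can recover
```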

Regularisation
- Divide data into 3 : Training, Evaluation (to calculate accuracy after each epoch),
Testing (never seen data to be used after training completion)
- Dropout : Drop some neurons at each epoch. Dropout layers force the network to
spread out its learning throughout the network, and can prevent overfitting
resulting from learning concentrating in one spot. Early stopping is another
valid technique.
- Early Stopping : When, after a particular epoch, TRAINING accuracy keeps getting better
but validation accuracy is oscillating or not improving much at all
- L1 : Penalizes the sum of absolute weights, makes some features go to zero i.e. DROPS FEATURES,
computationally inefficient but the sparse output can make up for it.. best
applied to reduce dimensionality
- L2 : Penalizes the sum of SQUARES of weights, ALL features are used but some are weighted less,
computationally efficient, dense output.. best if ALL FEATURES are important
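
A minimal Keras sketch combining these ideas; the dropout rate, L2 strength, patience and layer sizes are illustrative assumptions (swap regularizers.l1(...) in if you want the feature-dropping behaviour described above).

```python
from tensorflow.keras import layers, models, regularizers, callbacks

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,),
                 kernel_regularizer=regularizers.l2(0.01)),  # L2: penalize squared weights
    layers.Dropout(0.5),                  # drop half the neurons each update to spread out learning
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: halt when validation loss stops improving
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, callbacks=[early_stop])
# (X_train / y_train / X_val / y_val are assumed to exist -- hypothetical data)
```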

Vanishing Gradient Issues


- At local minima, the slope becomes zero.. so training slows down and can even introduce
numerical errors.. a major problem with RNNs if these vanishing gradients propagate
to deeper layers
- Batch normalization overcomes vanishing and exploding gradient problems.
- ReLU does not have the vanishing gradient problem.
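
A minimal sketch of where batch normalization sits in a Keras model (layer sizes are illustrative); it re-centers and re-scales activations between layers, which helps keep gradients from vanishing or exploding.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, input_shape=(20,)),
    layers.BatchNormalization(),      # normalize activations before the nonlinearity
    layers.Activation("relu"),
    layers.Dense(1, activation="sigmoid"),
])
```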

Opposite problem - Exploding gradients.. very high slope


- Solutions...
- Multi level Hierarchy : Break down into sub-networks and train individually
- LSTM
- Residual networks (ResNet)
- Ensemble of smaller networks
Ensemble learning
- Combines the predictions from multiple neural network models to reduce the
variance of predictions and reduce generalization error.
- Techniques for ensemble learning can be grouped by the element that is varied,
such as training data, the model, and how predictions are combined.

Residual networks
- Beneficial with deeper networks only.. no performance gain on shallow networks
- Very deep neural networks (plain networks) are not practical to implement as they
are hard to train due to vanishing gradients.
- The skip-connections help to address the Vanishing Gradient problem.
- They also make it easy for a ResNet block to learn an identity function.
- There are two main types of ResNets blocks: The identity block and the
convolutional block.
- Very deep Residual Networks are built by stacking these blocks together.
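
A minimal sketch of an identity block with a skip connection, written with the Keras functional API; the filter count and input shape are illustrative assumptions, and it assumes the input already has the same number of channels (the convolutional block variant would project the shortcut instead).

```python
from tensorflow.keras import layers

def identity_block(x, filters=64):
    """Identity ResNet-style block: output = relu(F(x) + x)."""
    shortcut = x                                             # the "skip" connection
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                          # add the input back in
    return layers.Activation("relu")(y)

# Example usage (hypothetical input shape):
# inputs = layers.Input(shape=(32, 32, 64)); outputs = identity_block(inputs)
```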

Confusion matrix
- Accuracy can't be the only measure
- The diagonal represents how good the predictions are
- Actual and predicted labels can be swapped across columns vs rows
- Sometimes rows and columns show actual and predicted class frequencies

Refer to this - it came from the AWS website directly


Accuracy = (TP+TN) / (TP+FP+FN+TN)
Precision = TP/(TP+FP)
Recall or True positive rate (TPR) = TP/(TP+FN)
False positive rate (FPR) = FP/(FP+TN)
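
A tiny sketch computing these metrics from raw counts (the TP/FP/FN/TN values are made up for illustration); the F1 line anticipates the F1 section further down.

```python
# Made-up counts from a hypothetical confusion matrix
TP, FP, FN, TN = 80, 10, 20, 90

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)            # true positive rate / sensitivity
fpr       = FP / (FP + TN)            # false positive rate = 1 - specificity
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, fpr, f1)
```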

Depending on your business problem, you might be more interested in a model that
performs well for a specific subset of these metrics. For example, two business
applications might have very different requirements for their ML model: One
application might need to be extremely sure about the positive predictions actually
being positive (high precision), and be able to afford to misclassify some positive
examples as negative (moderate recall).
Another application might need to correctly predict as many positive examples as
possible (high recall), and will accept some negative examples being misclassified
as positive (moderate precision). Amazon ML allows you to choose a score cut-off
that corresponds to a particular value of any of the preceding advanced metrics. It
also shows the tradeoffs incurred with optimizing for any one metric. For example,
if you select a cut-off that corresponds to a high precision, you typically will have to
trade that off with a lower recall.
Recall
- TP/(TP+FN)
- Also called sensitivity, True positive rate, completeness
- Best choice in fraud detection because we care about false negatives.. meaning
something was fraudulent and we said not-a-fraud
- Recall is appropriate when you care most about "false negatives", which in this
case is incorrectly identifying fraudulent transactions as non-fraudulent.
- Recall is an important metric in situations where classifications are highly
imbalanced, and the positive case is rare. Accuracy tends to be misleading in these
cases.

Precision
- TP/(TP+FP)
- Also called Correct positives
- Measure of relevancy
- Best choice for medical testing or drug testing.. since a "false positive" may
cause someone emotional pain if they are diagnosed with a disease or as a drug user
when it is not true

Specificity
- TN/(TN+FP)
- True negative rate

ROC Curve - Plots Sensitivity (Recall, the true positive rate) on the Y axis.. and the
false positive rate (1-Specificity) on the X axis (some analyses plot Precision against
Recall instead).. Used when comparing multiple confusion matrices (for example, the same
logistic regression model with different thresholds to classify obese or not obese, to
decide which threshold is best.. or comparing multiple models for their performance)..
This basically tells us how good the model is overall (instead of just looking at only
recall or only precision)..
If the Y axis value = the X axis value (true positive rate = false positive rate) then
the model is doing just average.. anything above the line (from 0,0 to 1,1) is a better
model.. anything below it is a poor model..

AUC - area below ROC curve... For comparing multiple models.. say if 0.9 for
regression and 0.75 for random forest.. then you would go for regression model..
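
A minimal scikit-learn sketch of computing the ROC curve and AUC; the labels and predicted scores are made-up values, not from the notes.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up ground truth and predicted probabilities
y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
print(auc)   # 0.5 = no better than chance, 1.0 = perfect ranking
```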

F1 score
- Harmonic mean of Precision & Recall
- 2TP/(2TP+FP+FN)
- 2 * ( (precision * recall) / (precision + recall) )
- In Amazon ML, the macro-average F1 score is used to evaluate the predictive
accuracy of a multiclass model.

RMSE (root mean squared error)


- Accuracy measurement for regression
- Unlike classification accuracy, which only cares about right and wrong answers, RMSE measures how far off the predictions are
- For regression tasks, Amazon ML uses the industry-standard root mean square
error (RMSE) metric. It is a distance measure between the predicted numeric target
and the actual numeric answer (ground truth). The smaller the value of the RMSE,
the better the predictive accuracy of the model. A model with perfectly correct
predictions would have an RMSE of 0. The formula over N evaluation records is
sketched after this list.
- Baseline RMSE - Amazon ML provides a baseline metric for regression models. It
is the RMSE for a hypothetical regression model that would always predict the
mean of the target as the answer. For example, if you were predicting the age of a
house buyer and the mean age for the observations in your training data was 35,
the baseline model would always predict the answer as 35. You would compare
your ML model against this baseline to validate if your ML model is better than a
ML model that predicts this constant answer.
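
A minimal sketch of the RMSE over N records referenced above, plus the baseline model that always predicts the mean of the target; the numbers are made up for illustration.

```python
import numpy as np

# Made-up ground truth and predictions for N = 5 records
actual    = np.array([30, 42, 35, 50, 28])
predicted = np.array([33, 40, 37, 46, 30])

# RMSE = sqrt( (1/N) * sum( (predicted - actual)^2 ) )
rmse = np.sqrt(np.mean((predicted - actual) ** 2))

# Baseline: a hypothetical model that always predicts the mean of the target
baseline_pred = np.full_like(actual, actual.mean(), dtype=float)
baseline_rmse = np.sqrt(np.mean((baseline_pred - actual) ** 2))

print(rmse, baseline_rmse)   # your model should beat the baseline (lower RMSE)
```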

Residuals for Regression Model


- It is common practice to review the residuals for regression problems. A residual
for an observation in the evaluation data is the difference between the true target
and the predicted target. Residuals represent the portion of the target that the
model is unable to predict.
- A positive residual indicates that the model is underestimating the target (the
actual target is larger than the predicted target).
- A negative residual indicates an overestimation (the actual target is smaller than
the predicted target).
- The histogram of the residuals on the evaluation data when distributed in a bell
shape and centered at zero indicates that the model makes mistakes in a random
manner and does not systematically over or under predict any particular range of
target values.
- If the residuals do not form a zero-centered bell shape, there is some structure in
the model's prediction error. Adding more variables to the model might help the
model capture the pattern that is not captured by the current model.
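
A tiny sketch of computing residuals (reusing made-up values); a histogram of these should be roughly bell-shaped and centered at zero for a model that errs randomly.

```python
import numpy as np

actual    = np.array([30, 42, 35, 50, 28])
predicted = np.array([33, 40, 37, 46, 30])

residuals = actual - predicted   # positive = model underestimates, negative = overestimates
print(residuals)                 # inspect, or plot a histogram to check for a zero-centered bell shape
```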

Cross-Validation
Cross-validation is a technique for evaluating ML models by training several ML
models on subsets of the available input data and evaluating them on the
complementary subset of the data. Use cross-validation to detect overfitting, i.e.,
failing to generalize a pattern. In Amazon ML, you can use the k-fold cross-
validation method to perform cross-validation. In k-fold cross-validation, you split
the input data into k subsets of data (also known as folds). You train an ML model
on all but one (k-1) of the subsets, and then evaluate the model on the subset that
was not used for training. This process is repeated k times, with a different subset
reserved for evaluation (and excluded from training) each time.
Stratified K-Fold cross-validation further ensures that the data within each fold
uniformly represents each class
BALANCED Dataset - K-fold cross validation works
UN-BALANCED Dataset - Stratified K-fold cross validation
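
A minimal scikit-learn sketch of k-fold vs. stratified k-fold; the random data, fold count and random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)   # class labels (imbalance is what stratification protects against)

kf  = KFold(n_splits=5, shuffle=True, random_state=42)            # balanced datasets
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # unbalanced datasets

for train_idx, test_idx in skf.split(X, y):   # each fold keeps the class proportions of y
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # train on (X_train, y_train), evaluate on (X_test, y_test)
```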

Ensemble methods
Bagging - Random sampling with replacement from the original dataset..
Boosting - Start with equal weights for all observations.. then keep updating the weights
Random forest uses Bagging - many training datasets sampled from the same dataset feeding
into multiple decision trees.. each tree votes on the result to arrive at the final result...
so this voting happens in parallel
XGBoost - Runs sequentially.. the boosting method assigns equal weights to each
observation in the training data and runs the model.. then revises the weights and
re-runs the model again..
Which one to pick - for accuracy = Boosting, but it can cause overfitting.. to avoid
overfitting and to run in parallel - use Bagging
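
A minimal scikit-learn sketch contrasting the two families on random data (the data and estimator counts are illustrative; XGBoost itself would be a drop-in alternative for the boosting model via its sklearn-style API).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

# Bagging: many trees trained in parallel on bootstrap samples, majority vote
bagging = RandomForestClassifier(n_estimators=100).fit(X, y)

# Boosting: trees trained sequentially, each one correcting the previous errors
boosting = GradientBoostingClassifier(n_estimators=100).fit(X, y)
```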

SOME NOTES YET TO BE CAPTURED FOR Last 2 lectures
