ML Modelling - part 1

The document provides an overview of various activation functions used in neural networks, including Softmax for single-label classification, Sigmoid for multi-label classification, and ReLU for internal layers. It discusses different architectures such as CNNs and RNNs, their applications, and modern NLP techniques like BERT and GPT, alongside transfer learning strategies. Additionally, it covers model evaluation metrics, regularization techniques, and methods to address issues like vanishing gradients and overfitting in machine learning models.


Activation Functions (given the input, what should my output be.. the mapping of input to output in the NN layers)

- Softmax : multi-class classification with a single label (identifying 1 object in an image)
- Sigmoid : multi-class classification with multiple labels (identifying all the objects in the image)
- TanH : RNNs
- ReLU : all other cases like internal layers of NNs etc. (Leaky ReLU for a small improvement, PReLU - parametric ReLU where the negative slope is learned via backpropagation, Swish for really deep NNs)

Softmax activation is used in the output layer when we want to predict only one class among several classes: it produces a probability distribution over the outputs so that we can choose the most probable one.
On the other hand, sigmoid is used when we want to predict all classes present. The sigmoid outputs a probability for each class independently of the others. ReLU activation is not used in the output layer because it outputs a value between 0 and infinity. Tanh is also not used there, as it outputs a value between -1 and 1.
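
To make the softmax vs. sigmoid distinction concrete, here is a minimal NumPy sketch (the logit values are made up for illustration, not taken from the notes): softmax produces one distribution that sums to 1, while sigmoid scores each class independently.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for 3 classes

# Softmax: single-label case -- probabilities sum to 1, pick the argmax
softmax = np.exp(logits) / np.sum(np.exp(logits))
print(softmax, softmax.sum())        # e.g. [0.66 0.24 0.10], sums to 1.0

# Sigmoid: multi-label case -- each class scored independently in (0, 1)
sigmoid = 1 / (1 + np.exp(-logits))
print(sigmoid)                       # each value judged against its own threshold
```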

CNNs
- Detects edges/patterns in the data
- MaxPooling2D - consolidates the 2D layer by taking the max value of each block
- Typical stack: Conv2D > MaxPooling2D > Dropout > Flatten > Dense > Dropout > Softmax (see the sketch after this list)
- Resource intensive
- Hyperparameters : Kernel size, Number of layers, Pooling, Choice of Optimizer
- LeNet-5 : Handwriting recognition
- AlexNet : Image classification
- GoogLeNet : Deeper while keeping good performance.. Inception modules (groups of CNN layers)
- ResNet (Residual Network) : Optimizes performance with "skip" connections..
ResNet-50 is widely used on AWS
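
A minimal Keras sketch of the Conv2D > MaxPooling2D > Dropout > Flatten > Dense > Dropout > Softmax stack listed above; the input shape, filter counts and class count are illustrative assumptions (e.g. 28x28 grayscale images, 10 classes), not values from the notes.

```python
from tensorflow.keras import layers, models

# Illustrative CNN following the stack in the notes (all sizes are assumptions)
model = models.Sequential([
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(pool_size=(2, 2)),   # consolidate by taking the max of each 2x2 block
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),  # single-label output
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```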

RNNs
- Time series or any kind of sequential data
- Works with feedback from the last result of the same neuron or group of neurons in the
layer.. the mechanism is like an Informatica persistent variable.. called a Memory Cell
- Sequence to sequence : Time series in, time series out
- Sequence to vector : Sentence to sentiment
- Vector to sequence : Image to captions
- Encoder > Decoder : Sequence to vector to sequence.. e.g. language translation
- Training uses backpropagation through time, which brings additional hyperparameters to
consider.. e.g. truncated backpropagation (how many time steps to unroll)
- RNN default behaviour : recent inputs have more influence
- LSTM : Long Short-Term Memory cell.. helps earlier inputs keep their influence..
Mitigates the problem of items losing their weight over time. In NLP applications,
words in a sentence may be significant regardless of their position.. e.g.
completing words in an incomplete sentence - so preserving context
- GRU : Simplified LSTM variant.. more popular in practice
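
A minimal Keras sketch of a sequence-to-vector RNN (e.g. sentence to sentiment); the vocabulary size, embedding size and unit count are illustrative assumptions. Swapping LSTM for GRU is a one-line change.

```python
from tensorflow.keras import layers, models

# Sequence-to-vector model: token ids in, single sentiment score out (sizes are assumptions)
model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),  # assumed 10k-word vocabulary
    layers.LSTM(64),                        # memory cell carries context across time steps; layers.GRU(64) also works
    layers.Dense(1, activation="sigmoid"),  # sentiment probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```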

Modern NLP/BERT/GPT
- Self attention mechanisms.. ability to provide context
- Processes all input data at once, UNLIKE RNNs which process it word by word
- DistilBERT : Reduces model size by 40%
- BERT : Bidirectional Encoder Representations from Transformers
- GPT : Generative Pre-trained Transformer
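
As a quick illustration of using such pre-trained transformers off the shelf, a minimal Hugging Face transformers sketch (assumes the transformers library is installed; the model names and example sentences are illustrative):

```python
from transformers import pipeline

# DistilBERT filling in a masked word -- an example of the "context" idea above
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
print(unmasker("Machine learning models are trained on [MASK] data."))

# GPT-style models generate text left-to-right instead
generator = pipeline("text-generation", model="gpt2")
print(generator("Transfer learning is useful because", max_length=30))
```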

Transfer Learning
- Approach 1 : Take pre-trained models and use + fine-tune for the use case : model zoos such as
Hugging Face
- Integrated with SageMaker via Hugging Face Deep Learning Containers (DLC)
- Example: take an HF DLC for BERT (which is trained on BookCorpus & Wikipedia)..
tokenize your own additional data in the same format... and further train with a LOW
LEARNING RATE to avoid deviating too much
- Approach 2 : Add new trainable layers on top of a frozen model (see the sketch after this list)
- Approach 3 : Re-train from scratch
- Approach 4 : Use it AS-IS
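
A minimal Keras sketch of Approach 2 (freeze a pre-trained base and add new trainable layers on top); the choice of ResNet50 as the base, the pooling mode and the 5-class head are illustrative assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Approach 2: frozen pre-trained base + new trainable head (base model and sizes are assumptions)
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                      # freeze the pre-trained weights

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),   # new trainable layers
    layers.Dense(5, activation="softmax"),  # e.g. 5 custom classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```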

Deep learning on EC2/EMR


- EMR supports Apache MXNet & GPU instance types
- Examples.. P3 (Tesla V100 GPUs), P2 (16x K80 GPUs), G3 (4x M60 GPUs), G5g (AWS
Graviton2 / NVIDIA T4G Tensor Core), P4d (A100 UltraClusters for supercomputing)..
OR use Deep Learning AMIs

*Tuning NNs
- Small batch sizes tend to not get stuck in local minima
- Large batch sizes can converge on the wrong solution at random
- Large learning rates can overshoot the correct solution
- Small learning rates increase training time
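
A minimal sketch of where these two knobs live in Keras; the model, data and the specific values (learning rate 1e-3, batch size 32) are illustrative assumptions, not recommendations from the notes.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

# Tiny model and random data, just to show where batch size and learning rate are set
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, size=1000)
model = models.Sequential([layers.Dense(16, activation="relu", input_shape=(20,)),
                           layers.Dense(1, activation="sigmoid")])

# Small learning rate = slower but steadier; large can overshoot the correct solution.
# Small batch size = noisier updates that can escape local minima; large can converge on a poor solution.
model.compile(optimizer=Adam(learning_rate=1e-3), loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=5, validation_split=0.2)
```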

Choosing activation functions (https://datascience.aero/aviation-function-deep-learning/)
- Conventional activation functions are mostly used in the following scenarios:
Softmax is used for multi-classification in logistic regression models.. Sigmoid and
tanh are used for binary classification in logistic regression models
- ReLU (Rectified Linear Units) is a very simple and efficient activation function that
has become very popular recently, especially in CNNs. It avoids and rectifies the
vanishing gradient problem, which mostly explains why it's used in almost all deep
learning problems nowadays. The gradient of ReLU is: f'(x) = 0 for x <= 0, 1 for x > 0...
the problem in this case is the dying ReLU, where a neuron enters the zero region and
stays there, never able to recover
- Leaky ReLU and PReLU (Parametric ReLU) introduce a few modifications to ReLU.
Both alternatives provide a small slope for negative ranges: static for Leaky ReLU,
and learned as a parameter for PReLU. This way, the "dying" ReLU problem is completely
solved as there are no zero-slope parts.
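
A tiny NumPy sketch of these functions and their gradients; the 0.01 leak and the PReLU parameter value are illustrative (in practice the PReLU slope is learned during training).

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])

relu       = np.maximum(0, x)               # output is 0 for x <= 0
relu_grad  = (x > 0).astype(float)          # gradient 0 for x <= 0, 1 for x > 0 -> "dying ReLU" region

leaky_relu = np.where(x > 0, x, 0.01 * x)   # small fixed slope (0.01) for negative inputs
alpha      = 0.1                            # PReLU: this slope is a learned parameter
prelu      = np.where(x > 0, x, alpha * x)  # no zero-slope region, so neurons can recover
```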

Regularisation
- Divide data into 3 : Training, Evaluation (to calculate accuracy after each epoch),
Testing (never seen data to be used after training completion)
- Dropout : Drop some neurons at each epoch. Dropout layers force the network to
spread out its learning throughout the network, and can prevent overfitting
resulting from learning concentrating in one spot. Early stopping is another
valid technique.
- Early Stopping : When, after a particular epoch, TRAINING accuracy keeps getting better
but validation accuracy is oscillating or not improving much at all
- L1 : Penalizes the sum of absolute weights, makes some features go to zero i.e. DROPS FEATURES,
computationally inefficient but the sparse output can make up for it.. best
applied to reduce dimensionality
- L2 : Penalizes the sum of SQUARES of weights, ALL features are used but some are weighted less,
computationally efficient, dense output.. best if ALL FEATURES are important
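
A minimal Keras sketch combining these ideas; the dropout rate, L2 strength, patience and layer sizes are illustrative assumptions (swap regularizers.l1(...) in if you want the feature-dropping behaviour described above).

```python
from tensorflow.keras import layers, models, regularizers, callbacks

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,),
                 kernel_regularizer=regularizers.l2(0.01)),  # L2: penalize squared weights
    layers.Dropout(0.5),                  # drop half the neurons each update to spread out learning
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: halt when validation loss stops improving
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, callbacks=[early_stop])
# (X_train / y_train / X_val / y_val are assumed to exist -- hypothetical data)
```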

Vanishing Gradient Issues


- At local minima, the slope becomes zero.. so training slows down and can even introduce
numerical errors.. a major problem with RNNs if these vanishing gradients propagate
to deeper layers
- Batch normalization overcomes vanishing and exploding gradient problems.
- ReLU does not have the vanishing gradient problem.
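
A minimal sketch of where batch normalization sits in a Keras model (layer sizes are illustrative); it re-centers and re-scales activations between layers, which helps keep gradients from vanishing or exploding.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, input_shape=(20,)),
    layers.BatchNormalization(),      # normalize activations before the nonlinearity
    layers.Activation("relu"),
    layers.Dense(1, activation="sigmoid"),
])
```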

Opposite problem - Exploding gradients.. very high slope


- Solutions...
- Multi level Hierarchy : Break down into sub-networks and train individually
- LSTM
- Residual networks (ResNet)
- Ensemble of smaller networks
Ensemble learning
- Combines the predictions from multiple neural network models to reduce the
variance of predictions and reduce generalization error.
- Techniques for ensemble learning can be grouped by the element that is varied,
such as training data, the model, and how predictions are combined.

Residual networks
- Beneficial with deeper networks only.. no performance gain on shallow networks
- Very deep neural networks (plain networks) are not practical to implement as they
are hard to train due to vanishing gradients.
- The skip-connections help to address the Vanishing Gradient problem.
- They also make it easy for a ResNet block to learn an identity function.
- There are two main types of ResNets blocks: The identity block and the
convolutional block.
- Very deep Residual Networks are built by stacking these blocks together.
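
A minimal sketch of an identity block with a skip connection, written with the Keras functional API; the filter count and input shape are illustrative assumptions, and it assumes the input already has the same number of channels (the convolutional block variant would project the shortcut instead).

```python
from tensorflow.keras import layers

def identity_block(x, filters=64):
    """Identity ResNet-style block: output = relu(F(x) + x)."""
    shortcut = x                                             # the "skip" connection
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                          # add the input back in
    return layers.Activation("relu")(y)

# Example usage (hypothetical input shape):
# inputs = layers.Input(shape=(32, 32, 64)); outputs = identity_block(inputs)
```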

Confusion matrix
- Accuracy can't be the only measure
- The diagonal represents how good the predictions are
- Actual and predicted labels can be swapped across columns vs rows
- Sometimes rows and columns show actual and predicted class frequencies

Refer to this - it came from the AWS website directly


Accuracy = (TP+TN) / (TP+FP+FN+TN)
Precision = TP/(TP+FP)
Recall or True positive rate (TPR) = TP/(TP+FN)
False positive rate (FPR) = FP/(FP+TN)
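
A tiny sketch computing these metrics from raw counts (the TP/FP/FN/TN values are made up for illustration); the F1 line anticipates the F1 section further down.

```python
# Made-up counts from a hypothetical confusion matrix
TP, FP, FN, TN = 80, 10, 20, 90

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)            # true positive rate / sensitivity
fpr       = FP / (FP + TN)            # false positive rate = 1 - specificity
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, fpr, f1)
```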

Depending on your business problem, you might be more interested in a model that
performs well for a specific subset of these metrics. For example, two business
applications might have very different requirements for their ML model: One
application might need to be extremely sure about the positive predictions actually
being positive (high precision), and be able to afford to misclassify some positive
examples as negative (moderate recall).
Another application might need to correctly predict as many positive examples as
possible (high recall), and will accept some negative examples being misclassified
as positive (moderate precision). Amazon ML allows you to choose a score cut-off
that corresponds to a particular value of any of the preceding advanced metrics. It
also shows the tradeoffs incurred with optimizing for any one metric. For example,
if you select a cut-off that corresponds to a high precision, you typically will have to
trade that off with a lower recall.
Recall
- TP/(TP+FN)
- Also called sensitivity, True positive rate, completeness
- Best choice in fraud detection because we care about false negatives.. meaning
something was fraudulent and we said not-a-fraud
- Recall is appropriate when you care most about "false negatives", which in this
case is incorrectly identifying fraudulent transactions as non-fraudulent.
- Recall is an important metric in situations where classifications are highly
imbalanced, and the positive case is rare. Accuracy tends to be misleading in these
cases.

Precision
- TP/(TP+FP)
- Also called Correct positives
- Measure of relevancy
- Best choice for medical testing or drug testing.. since a "false positive" may
cause someone emotional pain if they are diagnosed with a disease or as a drug user
when it is not true

Specificity
- TN/(TN+FP)
- True negative rate

ROC Curve - Plots Sensitivity (Recall, the true positive rate) on the Y axis.. and the
false positive rate (1-Specificity) on the X axis (some analyses plot Precision against
Recall instead).. Used when comparing multiple confusion matrices (for example, the same
logistic regression model with different thresholds to classify obese or not obese, to
decide which threshold is best.. or comparing multiple models for their performance)..
This basically tells us how good the model is overall (instead of just looking at only
recall or only precision)..
If the Y axis value = the X axis value (true positive rate = false positive rate) then
the model is doing just average.. anything above the line (from 0,0 to 1,1) is a better
model.. anything below it is a poor model..

AUC - area below ROC curve... For comparing multiple models.. say if 0.9 for
regression and 0.75 for random forest.. then you would go for regression model..
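
A minimal scikit-learn sketch of computing the ROC curve and AUC; the labels and predicted scores are made-up values, not from the notes.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up ground truth and predicted probabilities
y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
print(auc)   # 0.5 = no better than chance, 1.0 = perfect ranking
```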

F1 score
- Harmonic mean of Precision & Recall
- 2TP/(2TP+FP+FN)
- 2 * ( (precision * recall) / (precision + recall) )
- In Amazon ML, the macro-average F1 score is used to evaluate the predictive
accuracy of a multiclass model.

RMSE (root mean squared error)


- Accuracy measurement for regression
- Unlike classification accuracy, which only cares about right and wrong answers, RMSE measures how far off the predictions are
- For regression tasks, Amazon ML uses the industry-standard root mean square
error (RMSE) metric. It is a distance measure between the predicted numeric target
and the actual numeric answer (ground truth). The smaller the value of the RMSE,
the better the predictive accuracy of the model. A model with perfectly correct
predictions would have an RMSE of 0. The formula over N evaluation records is
sketched after this list.
- Baseline RMSE - Amazon ML provides a baseline metric for regression models. It
is the RMSE for a hypothetical regression model that would always predict the
mean of the target as the answer. For example, if you were predicting the age of a
house buyer and the mean age for the observations in your training data was 35,
the baseline model would always predict the answer as 35. You would compare
your ML model against this baseline to validate if your ML model is better than a
ML model that predicts this constant answer.
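
A minimal sketch of the RMSE over N records referenced above, plus the baseline model that always predicts the mean of the target; the numbers are made up for illustration.

```python
import numpy as np

# Made-up ground truth and predictions for N = 5 records
actual    = np.array([30, 42, 35, 50, 28])
predicted = np.array([33, 40, 37, 46, 30])

# RMSE = sqrt( (1/N) * sum( (predicted - actual)^2 ) )
rmse = np.sqrt(np.mean((predicted - actual) ** 2))

# Baseline: a hypothetical model that always predicts the mean of the target
baseline_pred = np.full_like(actual, actual.mean(), dtype=float)
baseline_rmse = np.sqrt(np.mean((baseline_pred - actual) ** 2))

print(rmse, baseline_rmse)   # your model should beat the baseline (lower RMSE)
```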

Residuals for Regression Model


- It is common practice to review the residuals for regression problems. A residual
for an observation in the evaluation data is the difference between the true target
and the predicted target. Residuals represent the portion of the target that the
model is unable to predict.
- A positive residual indicates that the model is underestimating the target (the
actual target is larger than the predicted target).
- A negative residual indicates an overestimation (the actual target is smaller than
the predicted target).
- The histogram of the residuals on the evaluation data when distributed in a bell
shape and centered at zero indicates that the model makes mistakes in a random
manner and does not systematically over or under predict any particular range of
target values.
- If the residuals do not form a zero-centered bell shape, there is some structure in
the model's prediction error. Adding more variables to the model might help the
model capture the pattern that is not captured by the current model.
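
A tiny sketch of computing residuals (reusing made-up values); a histogram of these should be roughly bell-shaped and centered at zero for a model that errs randomly.

```python
import numpy as np

actual    = np.array([30, 42, 35, 50, 28])
predicted = np.array([33, 40, 37, 46, 30])

residuals = actual - predicted   # positive = model underestimates, negative = overestimates
print(residuals)                 # inspect, or plot a histogram to check for a zero-centered bell shape
```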

Cross-Validation
Cross-validation is a technique for evaluating ML models by training several ML
models on subsets of the available input data and evaluating them on the
complementary subset of the data. Use cross-validation to detect overfitting, i.e.,
failing to generalize a pattern. In Amazon ML, you can use the k-fold cross-
validation method to perform cross-validation. In k-fold cross-validation, you split
the input data into k subsets of data (also known as folds). You train an ML model
on all but one (k-1) of the subsets, and then evaluate the model on the subset that
was not used for training. This process is repeated k times, with a different subset
reserved for evaluation (and excluded from training) each time.
Stratified K-Fold cross-validation further ensures that the data within each fold
uniformly represents each class
BALANCED Dataset - K-fold cross validation works
UN-BALANCED Dataset - Stratified K-fold cross validation
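
A minimal scikit-learn sketch of k-fold vs. stratified k-fold; the random data, fold count and random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)   # class labels (imbalance is what stratification protects against)

kf  = KFold(n_splits=5, shuffle=True, random_state=42)            # balanced datasets
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # unbalanced datasets

for train_idx, test_idx in skf.split(X, y):   # each fold keeps the class proportions of y
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # train on (X_train, y_train), evaluate on (X_test, y_test)
```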

Ensemble methods
Bagging - Random sampling with replacement from the original dataset..
Boosting - Start with equal weights for all observations.. then keep updating the weights
Random forest uses Bagging - many training datasets sampled from the same dataset feeding
into multiple decision trees.. each tree votes on the result to arrive at the final result...
so this voting happens in parallel
XGBoost - Runs sequentially.. the boosting method assigns equal weights to each
observation in the training data and runs the model.. then revises the weights and
re-runs the model again..
Which one to pick - for accuracy = Boosting, but it can cause overfitting.. to avoid
overfitting and to run in parallel - use Bagging
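
A minimal scikit-learn sketch contrasting the two families on random data (the data and estimator counts are illustrative; XGBoost itself would be a drop-in alternative for the boosting model via its sklearn-style API).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

# Bagging: many trees trained in parallel on bootstrap samples, majority vote
bagging = RandomForestClassifier(n_estimators=100).fit(X, y)

# Boosting: trees trained sequentially, each one correcting the previous errors
boosting = GradientBoostingClassifier(n_estimators=100).fit(X, y)
```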

SOME NOTES YET TO BE CAPTURED FOR Last 2 lectures
