ML Modelling - part 1
mapping of input
to output in the NN layers)
Softmax activation is used in the output layer when we want to predict exactly one
class among several (multi-class classification). It turns the outputs into a
probability distribution that sums to 1, so that we can choose the most probable class.
Sigmoid, on the other hand, is used when several classes can be present at once
(multi-label classification). It outputs a probability for each class independently
of the others. ReLU is not used in the output layer of a classifier because its
output ranges from “0” to “inf”, which is not a probability. Tanh is not used either
because its output ranges from “-1” to “1”.
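To make the softmax vs sigmoid difference concrete, here is a minimal sketch in plain Python. The scores in `logits` and the 0.5 cut-off are made-up values for illustration, not something taken from these notes.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    # Each score is squashed on its own, so the outputs need not sum to 1.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

logits = [2.0, 1.0, -1.0]   # hypothetical output-layer scores for 3 classes
probs = softmax(logits)     # one distribution over all classes
multi = sigmoid(logits)     # one independent probability per class

# Single-label choice: pick the most probable class.
predicted_class = probs.index(max(probs))
# Multi-label choice: keep every class above an assumed 0.5 cut-off.
present = [i for i, p in enumerate(multi) if p > 0.5]
```

With softmax the three values compete for a total of 1, while the sigmoid values are judged one by one and can add up to more than 1.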
CNNs
- Detect edges/patterns in the data
- MaxPooling2D - downsamples a 2D layer by taking the max value of each block
- Typical usage : Conv2D > MaxPooling2D > Dropout > Flatten > Dense > Dropout > Softmax
- Resource intensive
- Hyperparameters : Kernel size, Number of layers, Pooling,
Choice of Optimizer
- LeNet-5 : Handwriting recognition
- AlexNet : Image classification
- GoogLeNet : Deeper, but with better performance.. uses Inception modules (groups
of convolution layers)
- ResNet (Residual Network) : Maintains performance in very deep networks with
"skip" connections
ResNet50 is widely used on AWS
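The MaxPooling2D idea from the list above can be sketched in plain Python. This is only an illustration of the block-wise max, assuming non-overlapping 2x2 blocks and a made-up `feature_map`; it is not the Keras implementation.

```python
def max_pool_2d(grid, size=2):
    """Downsample a 2D grid by keeping the max of each size x size block."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            block = [grid[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map produced by a convolution layer.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2d(feature_map)   # 4x4 input becomes 2x2 output
```

Each 2x2 block collapses to its largest value, so the strongest activation in each region survives while the layer shrinks to a quarter of its size.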
RNNs
- Used for time series or any kind of sequential data
- Each neuron (or group of neurons) in the layer feeds its previous output back in
as an extra input.. Mechanism is like an Informatica persistent variable.. called a
Memory Cell
- Sequence to sequence : Time series in, time series out
- Sequence to vector : sentence to sentiment
- Vector to sequence : image to caption
- Encoder > Decoder : Sequence to vector to sequence.. e.g. language translation
- Additional hyperparameter for RNN : training uses backpropagation through time,
so the number of time steps matters as well.. truncated backpropagation through
time limits how many steps are unrolled
- RNN default behaviour : recent inputs have more influence than older ones
- LSTM : Long short-term memory cell.. Keeps separate short-term and long-term
state so older inputs do not lose their influence over time. In NLP applications,
words in a sentence may be significant regardless of their position.. e.g.
completing words in an incomplete sentence - so preserving context
- GRU : Gated Recurrent Unit.. simplified LSTM version.. also popular in practice
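The feedback loop and the "recent inputs have more influence" behaviour listed above can be sketched with a single recurrent neuron. The weights `w_x` and `w_h` and the input sequences are assumed values chosen for illustration; a real RNN layer learns its weights and has many units.

```python
import math

def run_rnn(inputs, w_x=1.0, w_h=0.5):
    """One recurrent neuron: the previous output is fed back as an input."""
    h = 0.0                  # the memory cell starts empty
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)   # new state mixes input and memory
        states.append(h)
    return states

# The same single spike, placed early or late in the sequence.
early = run_rnn([1.0, 0.0, 0.0, 0.0])
late = run_rnn([0.0, 0.0, 0.0, 1.0])

# Sequence to sequence uses every state; sequence to vector uses the last one.
early_final, late_final = early[-1], late[-1]
```

The early spike has faded by the last step while the late spike still dominates the final state, which is the weakness that LSTM and GRU cells are designed to reduce.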
Modern NLP/BERT/GPT
- Self-attention mechanisms.. weigh every token against every other token to
provide context
- Process all input data at once (in parallel) UNLIKE an RNN, which processes it
word by word
- DistilBERT : Distilled version of BERT.. reduces model size by about 40%
- BERT : Bidirectional Encoder Representations from Transformers
- GPT : Generative Pre-trained Transformer
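A rough sketch of the self-attention idea follows, using scaled dot-product attention in plain Python. It is a simplification made for illustration: the token vectors are invented, and the queries, keys and values are the raw token vectors, whereas real Transformers apply learned projections and multiple heads.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product attention with Q = K = V = tokens (an assumption)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Score this token against every token in the sequence at once.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted mix of all token vectors (the context).
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical 2-d embeddings
outputs, weights = self_attention(tokens)
```

Every token is compared with every other token in a single pass, with no step-by-step state carried along as in an RNN.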
Transfer Learning
- Approach 1
- Take pre-trained models and fine-tune them for your use case : Model zoos such as
Hugging Face
- Integrated with SageMaker via Hugging Face Deep Learning Containers (DLC)
- Example : take the HF DLC for BERT (which is trained on BookCorpus & Wikipedia)..
tokenize your own additional data in the same format... and train further with a LOW
LEARNING RATE to avoid too much deviation from the pre-trained weights
- Approach 2 : Add new trainable layers on top of the frozen model
- Approach 3 : Re-train from scratch
- Approach 4 : Use it AS-IS
*Tuning NNs
- Small batch sizes tend to not get stuck in local minima
- Large batch sizes can converge on the wrong solution at random
- Large learning rates can overshoot the correct solution
- Small learning rates increase training time
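The learning-rate points above can be checked on a toy problem. This sketch runs plain gradient descent on f(x) = x^2, whose minimum is at 0. The function, starting point and step counts are assumptions for illustration, and the batch-size effects are not modelled here.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimise f(x) = x^2 with a fixed learning rate."""
    x = start
    for _ in range(steps):
        grad = 2 * x        # derivative of x^2
        x = x - lr * grad
    return x

too_large = gradient_descent(lr=1.1)     # overshoots on every step and diverges
reasonable = gradient_descent(lr=0.1)    # ends close to the minimum at 0
too_small = gradient_descent(lr=0.001)   # has barely moved after 20 steps
```

With the large rate each update jumps past the minimum and lands further away than before, while the small rate would need far more steps to arrive.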
Regularisation
- Divide data into 3 : Training, Validation (to calculate accuracy after each epoch),
Testing (never seen data to be used after training completion)
- Dropout : Randomly drop some neurons at each training step. Dropout layers force
the network to spread out its learning throughout the network, and can prevent
overfitting resulting from learning concentrating in one spot. Early stopping is
another way to prevent overfitting.
- Early Stopping : Stop training when, after a particular epoch, TRAINING accuracy
is still getting better but Validation accuracy is oscillating or not improving much at all
- L1 : Sum of ABSOLUTE VALUES of weights, makes some weights go to zero i.e. DROP
FEATURES, Computationally inefficient but Sparse output can make up for it.. Best
to be applied to reduce dimensionality
- L2 : Sum of SQUARE of weights, ALL features used but some are weighted less,
Computationally efficient, Dense output.. best if ALL FEATURES are important
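The two penalty terms can be written out directly. This is a minimal sketch: `lam` (the regularisation strength) and the weight values are assumptions for illustration, and in training the penalty is added to the model's loss.

```python
def l1_penalty(weights, lam):
    # L1: lambda times the sum of absolute values of the weights.
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # L2: lambda times the sum of squared weights.
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 3.0]   # hypothetical model weights
lam = 0.1                         # hypothetical regularisation strength

l1 = l1_penalty(weights, lam)     # 0.1 * 5.5
l2 = l2_penalty(weights, lam)     # 0.1 * 13.25
```

Because L2 squares each weight, large weights dominate its penalty and get shrunk hardest, while the L1 penalty grows linearly, which is what lets it push small weights all the way to zero.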
Residual networks
- Beneficial with deeper networks only.. no performance gain on shallow networks
- Very deep plain networks (without skip connections) are not practical to
implement as they are hard to train due to vanishing gradients.
- The skip-connections help to address the Vanishing Gradient problem.
- They also make it easy for a ResNet block to learn an identity function.
- There are two main types of ResNets blocks: The identity block and the
convolutional block.
- Very deep Residual Networks are built by stacking these blocks together.
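The skip connection and the "easy to learn an identity function" point above can be sketched as follows. The block computes F(x) + x. The helper layers `zero_layer` and `small_layer` are hypothetical stand-ins for the convolution layers inside a real ResNet block.

```python
def residual_block(x, layer_fn):
    """Output = F(x) + x, where the '+ x' is the skip connection."""
    fx = layer_fn(x)
    return [a + b for a, b in zip(fx, x)]

def zero_layer(x):
    # Stand-in for layers whose weights have all been driven to zero.
    return [0.0 for _ in x]

def small_layer(x):
    # Stand-in for layers that learned only a small correction.
    return [0.1 * v for v in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_layer)    # input passes through unchanged
adjusted_out = residual_block(x, small_layer)   # input plus a small correction
```

If the inner layers contribute nothing, the block still passes its input through unchanged, so adding more blocks does not have to hurt the network.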
Confusion matrix
- Accuracy can’t be the only measure
- Diagonal holds the correct predictions, so it shows how good the model is
- Actual and predicted labels can be swapped between columns and rows, so check
the axis labels
- Sometimes row and column totals are added to show actual and predicted class
frequencies
Depending on your business problem, you might be more interested in a model that
performs well for a specific subset of these metrics. For example, two business
applications might have very different requirements for their ML model: One
application might need to be extremely sure about the positive predictions actually
being positive (high precision), and be able to afford to misclassify some positive
examples as negative (moderate recall).
Another application might need to correctly predict as many positive examples as
possible (high recall), and will accept some negative examples being misclassified
as positive (moderate precision). Amazon ML allows you to choose a score cut-off
that corresponds to a particular value of any of the preceding advanced metrics. It
also shows the tradeoffs incurred with optimizing for any one metric. For example,
if you select a cut-off that corresponds to a high precision, you typically will have to
trade that off with a lower recall.
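The cut-off trade-off described above can be illustrated with a small made-up example. The `scores`, `labels` and the two cut-off values are invented for illustration; this is not how Amazon ML computes its metrics internally.

```python
def precision_recall(scores, labels, cutoff):
    """Precision and recall when scores >= cutoff are predicted positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores with the true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0]

high_cutoff = precision_recall(scores, labels, 0.8)    # strict: few positives
low_cutoff = precision_recall(scores, labels, 0.35)    # lenient: many positives
```

On this data the strict cut-off gives perfect precision but finds only half of the positives, while the lenient one finds all of them at the cost of one false positive.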
Recall
- TP/(TP+FN)
- Also called sensitivity, True positive rate, completeness
- Best choice in fraud detection because we care about false negative.. meaning
something was fraud and we said not-a-fraud
- Recall is appropriate when you care most about "false negatives", which in this
case is incorrectly identifying fraudulent transactions as non-fraudulent.
- Recall is an important metric in situations where classifications are highly
imbalanced, and the positive case is rare. Accuracy tends to be misleading in these
cases.
Precision
- TP/(TP+FP)
- Also called correct positives or positive predictive value
- Measure of relevancy
- Best choice for medical testing or drug testing.. since a "false positive" may
cause someone emotional pain if they are diagnosed with a disease or labelled a
drug addict when it is not true
Specificity
- TN/(TN+FP)
- True negative rate
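AUC - area under the ROC curve (true positive rate vs false positive rate across all
cut-offs).. 0.5 is no better than random, 1.0 is perfect.. For comparing multiple
models.. say if 0.9 for logistic regression and 0.75 for random forest.. then you
would go for the logistic regression model..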
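One way to compute AUC without drawing the curve uses its ranking interpretation: AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below counts those pairs directly. The two sets of model scores are invented for illustration.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]   # hypothetical scores, mostly well ranked
model_b = [0.6, 0.4, 0.7, 0.5, 0.8, 0.1]   # hypothetical scores, poorly ranked

auc_a = auc(model_a, labels)
auc_b = auc(model_b, labels)
best = "model_a" if auc_a > auc_b else "model_b"
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch but too slow for large datasets.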
F1 score
- Harmonic mean of Precision & Recall
- 2TP/(2TP+FP+FN)
- 2 * ( (precision * recall) / (precision + recall) )
- In Amazon ML, the macro-average F1 score is used to evaluate the predictive
accuracy of a multiclass model.
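The formulas for recall, precision, specificity and F1 can be checked against each other with a small helper. The counts passed in below are invented to mimic an imbalanced problem such as fraud detection.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics from the four cells of a binary confusion matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }

# Hypothetical counts: 60 real positives out of 1000 examples.
m = classification_metrics(tp=40, fp=10, fn=20, tn=930)

# The harmonic-mean form of F1 gives the same value as the count form.
f1_harmonic = 2 * (m["precision"] * m["recall"]) / (m["precision"] + m["recall"])
```

Accuracy comes out at 97% here even though a third of the positives are missed, which is why recall and F1 are more informative when the positive class is rare.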
Cross-Validation
Cross-validation is a technique for evaluating ML models by training several ML
models on subsets of the available input data and evaluating them on the
complementary subset of the data. Use cross-validation to detect overfitting, i.e.,
failing to generalize a pattern. In Amazon ML, you can use the k-fold cross-
validation method to perform cross-validation. In k-fold cross-validation, you split
the input data into k subsets of data (also known as folds). You train an ML model
on all but one (k-1) of the subsets, and then evaluate the model on the subset that
was not used for training. This process is repeated k times, with a different subset
reserved for evaluation (and excluded from training) each time.
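The k-fold procedure described above can be sketched as index bookkeeping. This is a minimal illustration that does not shuffle the data first, which a real run usually would. The function name and the sample count are assumptions.

```python
def k_fold_splits(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                 # held-out fold
        train_idx = indices[:start] + indices[start + size:]   # the other k-1 folds
        splits.append((train_idx, test_idx))
        start += size
    return splits

splits = k_fold_splits(n_samples=10, k=5)
for train_idx, test_idx in splits:
    # Train a model on train_idx and evaluate it on test_idx here.
    pass
```

Every sample lands in the evaluation fold exactly once, and the k evaluation scores are typically averaged into one estimate.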
Stratified K-Fold cross validation further ensures that each fold preserves the
class proportions of the full dataset
BALANCED Dataset - K-fold cross validation works
UN-BALANCED Dataset - use Stratified K-fold cross validation
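A simple way to picture stratification is to deal each class's examples across the folds like cards. The sketch below does that. The label list and the round-robin scheme are assumptions for illustration and not the exact algorithm any particular library uses.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds while keeping class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # Deal this class's indices round-robin across the folds.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

labels = [0] * 16 + [1] * 4      # hypothetical unbalanced data: 80% / 20%
folds = stratified_folds(labels, k=4)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold ends up with the same 80/20 mix as the full dataset, whereas a plain unshuffled split of this data would put all of the rare class into the last fold.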
Ensemble methods
Bagging - Random sampling with replacement from the original dataset.. each model
is trained independently on its own sample
Boosting - Start with equal weights for all observations.. then keep updating the
weights so that later models focus on what earlier models got wrong
Random forest uses Bagging - Many training datasets sampled from the same dataset
feed into multiple decision trees.. Each tree votes on the result to arrive at the
final result... so the trees can be trained in parallel
XGBoost - Runs sequentially.. A boosting method assigns equal weights to each
observation in the training data and runs the model.. then revises the weights and
re-runs the model again..
Which one to pick - For accuracy = Boosting, but it can cause overfitting.. To avoid
overfitting and to train in parallel = Bagging
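The two building blocks of bagging, bootstrap sampling and voting, can be sketched with the standard library. The data, the seed and the three example predictions are invented for illustration, and training the decision trees themselves is left out.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) items WITH replacement, so duplicates are allowed."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Final answer = the class predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed so the run is repeatable
data = list(range(10))            # hypothetical row ids of the training set

# One bootstrap sample per tree; the trees could be trained in parallel.
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# Hypothetical predictions from the three trees for one transaction.
final = majority_vote(["fraud", "ok", "fraud"])
```

Each sample is independent of the others, which is what allows bagging to train its models in parallel, while boosting has to wait for the previous round's weights.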