Module 2 part2

The document discusses early stopping as a regularization technique to prevent overfitting in machine learning models by monitoring training and validation errors. It emphasizes the importance of data augmentation, which artificially increases dataset size through transformations, enhancing model performance and reducing operational costs. The document also outlines various methods and benefits of data augmentation, particularly in image classification and natural language processing, while addressing challenges and use cases in healthcare.

Early Stopping

Regularization refers to techniques in which the learning algorithm is modified to reduce overfitting. This may incur a slightly higher bias but leads to lower variance compared to a non-regularized model, i.e. it increases the generalization ability of the trained model.
● In a general learning setup, the dataset is divided into a training set and a test set.
● After each epoch, the parameters are updated according to what the algorithm has learned from the training data.
● Finally, the trained model is evaluated on the test set.
Generally, the training set error will be lower than the test set error. This is because of overfitting: the algorithm memorizes the training data and produces the right results on the training set. The model becomes highly specific to the training set and fails to produce accurate results on other datasets, including the test set. Regularization techniques are used in such situations to reduce overfitting and improve the performance of the model on unseen data. Early stopping is a popular regularization technique due to its simplicity and effectiveness.

For regularization by early stopping, the dataset is divided into training, validation and test sets. The algorithm is trained on the training set, and the point at which to stop training is determined from the validation set. Both the training error and the validation error are monitored. The training error steadily decreases, while the validation error decreases only up to a point, after which it starts to increase. This happens because, beyond that point, the learning model starts to overfit the training data, so the training error keeps falling while the validation error rises. A model with better validation error can therefore be obtained by keeping the parameters that give the lowest validation error. Each time the error on the validation set decreases, a copy of the model parameters is stored. When the training algorithm terminates, the parameters that gave the lowest validation error are returned, rather than the most recently updated parameters.
In regularization by early stopping, we stop training the model when its performance on the validation set starts getting worse: increasing loss, decreasing accuracy, or a worsening scoring metric. If the errors on the training dataset and the validation dataset are plotted together, both decrease with the number of iterations until the point where the model starts to overfit. After this point, the training error still decreases but the validation error increases. Even if training is continued past this point, early stopping returns the set of parameters that were in use at that point, so it is equivalent to stopping training there. The final parameters returned therefore give the model low variance and better generalization: the model at the time training is stopped has better generalization performance than the model with the least training error. Early stopping can be thought of as implicit regularization, in contrast to explicit regularization via weight decay. The method is also efficient: it does not demand additional training data, which is not always available, and because training is halted early it requires less training time than other regularization methods. Repeating the early stopping process many times, however, may result in the model overfitting the validation dataset, just as overfitting occurs on the training data.
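
A minimal runnable sketch of this procedure is given below, using scikit-learn's SGDClassifier on a made-up toy dataset (the specific model, data and patience value are illustrative assumptions, not part of the original text): a copy of the parameters with the lowest validation error is kept, and training stops once the validation error has not improved for a fixed number of epochs.

import copy
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Toy data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=1000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
best_val_error, best_model = np.inf, None
patience, bad_epochs = 5, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train, classes=[0, 1])   # one pass over the training data
    val_error = 1.0 - model.score(X_val, y_val)           # validation error after this epoch
    if val_error < best_val_error:
        best_val_error = val_error
        best_model = copy.deepcopy(model)                  # store a copy of the best parameters
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                         # validation error stopped improving
            break

model = best_model   # return the parameters with the lowest validation error, not the last ones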
Data Augmentation
Data augmentation is a set of techniques to artificially increase the amount of data by
generating new data points from existing data. This includes making small changes to
data or using deep learning models to generate new data points.

Why is it important now?


Data augmentation helps improve the performance and outcomes of machine learning models by adding new and varied examples to the training data. If the dataset used to train a machine learning model is rich and sufficient, the model performs better and more accurately.

For machine learning models, collecting and labeling data can be an exhausting and costly process. Transforming existing datasets with data augmentation techniques allows companies to reduce these operational costs.

Data cleaning is a necessary step for building high-accuracy models. However, if cleaning reduces how representative the data is, the model cannot provide good predictions for real-world inputs. Data augmentation techniques can make machine learning models more robust by creating variations that the model may encounter in the real world.

How does it work?


(Figure: a data augmentation pipeline built from transformation functions (TF); source: The Stanford AI Lab Blog)

For image classification and segmentation

For data augmentation, making simple alterations to visual data is popular. In addition, generative adversarial networks (GANs) are used to create new synthetic data. Classic image processing operations for data augmentation (a few of which are sketched in code below) are:
● padding
● random rotation
● re-scaling
● vertical and horizontal flipping
● translation (the image is moved along the X or Y direction)
● cropping
● zooming
● darkening & brightening/color modification
● grayscaling
● changing contrast
● adding noise
● random erasing
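A minimal NumPy sketch of a few of these classic transformations (flipping, translation, noise injection and random erasing), assuming images are arrays of shape (height, width, channels) with pixel values in [0, 1]:

import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    return img[:, ::-1, :]                               # mirror along the width axis

def translate(img, dx, dy):
    return np.roll(img, shift=(dy, dx), axis=(0, 1))     # wrap-around shift in X/Y

def add_gaussian_noise(img, std=0.05):
    noisy = img + rng.normal(0.0, std, img.shape)
    return np.clip(noisy, 0.0, 1.0)

def random_erase(img, size=16):
    h, w = img.shape[:2]
    y = rng.integers(0, h - size)
    x = rng.integers(0, w - size)
    out = img.copy()
    out[y:y + size, x:x + size, :] = 0.0                 # blank out a random patch
    return out

# Example: build several augmented copies of one (random) image
image = rng.random((64, 64, 3))
augmented = [horizontal_flip(image),
             translate(image, dx=5, dy=-3),
             add_gaussian_noise(image),
             random_erase(image)]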

Advanced models for data augmentation are:

● Adversarial training / adversarial machine learning: generates adversarial examples that disrupt a machine learning model and injects them into the dataset used for training (a minimal example follows this list).
● Generative adversarial networks (GANs): GAN algorithms can learn patterns from input datasets and automatically create new examples that resemble the training data.
● Neural style transfer: neural style transfer models can blend a content image and a style image, separating style from content.
● Reinforcement learning: reinforcement learning models train software agents to attain their goals and make decisions in a virtual environment.
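
To make the adversarial-training idea concrete, here is a minimal sketch that crafts adversarial examples with the fast gradient sign method (FGSM) for a simple logistic-regression model; the weights and data are made up for illustration, and FGSM is only one of several ways to generate such examples:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_adversarial_examples(X, y, w, b, eps=0.1):
    # For logistic regression, the gradient of the cross-entropy loss
    # with respect to the input x is (sigmoid(w.x + b) - y) * w.
    p = sigmoid(X @ w + b)                        # predicted probabilities
    grad_x = (p - y)[:, None] * w[None, :]        # dL/dx for each example
    return X + eps * np.sign(grad_x)              # FGSM perturbation in the loss-increasing direction

# Toy data and toy model parameters, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
w = rng.normal(size=3)
b = 0.0

X_adv = fgsm_adversarial_examples(X, y, w, b)
X_train_augmented = np.vstack([X, X_adv])         # inject adversarial examples into the training set
y_train_augmented = np.concatenate([y, y])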

Popular open-source Python packages for data augmentation in computer vision are Keras' ImageDataGenerator, scikit-image (skimage) and OpenCV.
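
A brief usage sketch of Keras' ImageDataGenerator follows (this assumes TensorFlow 2.x is installed; the dummy images and labels are made up for illustration, and newer Keras releases recommend preprocessing layers over this generator API):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Generator that applies random rotations, shifts, zooms and flips on the fly
datagen = ImageDataGenerator(
    rotation_range=20,          # random rotation up to 20 degrees
    width_shift_range=0.1,      # horizontal translation (fraction of width)
    height_shift_range=0.1,     # vertical translation (fraction of height)
    zoom_range=0.1,
    horizontal_flip=True,
)

# Dummy batch of images and labels just to show the flow() API
x = np.random.rand(32, 64, 64, 3).astype("float32")
y = np.random.randint(0, 2, size=(32,))

# Each call yields a freshly augmented batch; this is typically passed to model.fit()
x_batch, y_batch = next(datagen.flow(x, y, batch_size=8))
print(x_batch.shape)   # (8, 64, 64, 3)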

For natural language processing (NLP)

Data augmentation is not as popular in the NLP domain as in computer vision. Augmenting text data is difficult due to the complexity of language. Common methods for data augmentation in NLP are:
● Easy Data Augmentation (EDA) operations: synonym replacement, word insertion, word swap and word deletion (some of these are sketched in code below)
● Back translation: translating text into another language and then re-translating it back into the original language
● Contextualized word embeddings
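
A minimal sketch of three EDA operations (synonym replacement, random word swap and random word deletion) is shown below; the synonym table is a made-up toy example, whereas real EDA implementations typically draw synonyms from WordNet:

import random

random.seed(0)

# Toy synonym table purely for illustration (real EDA uses WordNet)
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}

def synonym_replacement(words, n=1):
    out = words[:]
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_swap(words, n=1):
    out = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)   # swap two random positions
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p=0.2):
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]          # never delete every word

sentence = "the quick brown fox is happy".split()
print(synonym_replacement(sentence))
print(random_swap(sentence))
print(random_deletion(sentence))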

What are the benefits of data augmentation?


Benefits of data augmentation include:

● Improving model prediction accuracy
○ adding more training data to the models
○ preventing data scarcity for better models
○ reducing overfitting (a modeling error in which a function corresponds too closely to a limited set of data points) and creating variability in the data
○ increasing the generalization ability of the models
○ helping resolve class imbalance issues in classification
● Reducing the cost of collecting and labeling data
● Enabling rare event prediction
● Preventing data privacy problems

What are the challenges of data augmentation?

● Companies need to build evaluation systems for the quality of augmented datasets. As the use of data augmentation methods increases, assessing the quality of their output will be required.
● The data augmentation domain needs new research and studies to create new/synthetic data for advanced applications. For example, generating high-resolution images using GANs can be challenging.
● If a real dataset contains biases, data augmented from it will contain those biases too, so identifying an optimal data augmentation strategy is important.
What are use cases/examples of data augmentation?
Image recognition and NLP models generally use data augmentation methods. The medical imaging domain also uses data augmentation to apply transformations to images and add diversity to the datasets. The reasons for the interest in data augmentation in healthcare are:

● Datasets of medical images are small
● Sharing data is not easy due to patient data privacy regulations
● Only a few patients' data can be used as training data for the diagnosis of rare diseases

Example studies in this field include:

● Brain tumor segmentation


● Differential data augmentation for medical imaging
● An automated data augmentation method for synthesizing labeled medical
images

● Semi-supervised task-driven data augmentation for medical image
segmentation
