The document discusses normalization and standardization techniques used in machine learning to adjust feature scales, ensuring equal contribution to models. It also covers overfitting and underfitting, which describe how well a model generalizes to new data, along with their symptoms, causes, and prevention strategies. Normalization scales data to a specific range, while standardization transforms data to have a mean of 0 and a standard deviation of 1.
Normalization and Standardization
• Both Normalization and Standardization are techniques used to adjust the scale of features in a dataset
• They are crucial in machine learning to ensure that all features contribute equally to the model and prevent any feature from dominating due to its scale
Normalization
• Normalization (also called Min-Max Scaling) is the process of transforming features such that they lie within a specific range, typically [0, 1] or [-1, 1]
• This is done by scaling the data to a fixed range based on the minimum and maximum values of the feature
• Formula:
  x′ = (x − min(x)) / (max(x) − min(x))
  where x is the original value, min(x) is the minimum value, and max(x) is the maximum value in the dataset
• Usage: algorithms like k-Nearest Neighbors (k-NN) and Neural Networks, which are sensitive to the scale of features

Normalization Example

SL | Value | (x − min(x)) / (max(x) − min(x)) | Normalized Value
 1 |  10   | (10 − 10) / (50 − 10)            | 0.00
 2 |  20   | (20 − 10) / (50 − 10)            | 0.25
 3 |  30   | (30 − 10) / (50 − 10)            | 0.50
 4 |  40   | (40 − 10) / (50 − 10)            | 0.75
 5 |  50   | (50 − 10) / (50 − 10)            | 1.00
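As a quick check of the arithmetic in the table above, here is a minimal sketch of min-max normalization in NumPy (the values are taken from the example; variable names are illustrative):

```python
# Minimal sketch: min-max normalization of the example values with NumPy.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# x' = (x - min(x)) / (max(x) - min(x))
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]
```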
Standardization
• Standardization (also known as Z-Score Scaling) transforms data to have a mean of 0 and a standard deviation of 1
• It centers the data and scales it based on the standard deviation
• Formula:
  x′ = (x − μ) / σ
  where μ is the mean and σ is the standard deviation of the dataset
• Usage: algorithms like Support Vector Machines (SVM), Logistic Regression, and Principal Component Analysis (PCA), which assume a normal distribution or work better with data centered around 0
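For comparison with the normalization sketch above, here is a minimal z-score standardization of the same illustrative values in NumPy:

```python
# Minimal sketch: z-score standardization of the same example values.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# x' = (x - mu) / sigma, using the population standard deviation (ddof=0)
x_std = (x - x.mean()) / x.std()

print(x_std.mean())  # ~0.0
print(x_std.std())   # 1.0
```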
Overfitting and Underfitting
• Overfitting and Underfitting are concepts in machine learning that describe how well a model generalizes to new data
• They are often indicators of how effectively a model has learned patterns from the training data
Overfitting
• Overfitting occurs when a model learns not only the underlying patterns in the training data but also the noise and details that do not generalize to unseen data
• Symptoms
  – High accuracy on training data
  – Poor performance on validation or test data
• Causes
  – Model is too complex (e.g., too many parameters or layers)
  – Insufficient training data
  – Training for too many epochs without regularization
• Prevention (see the sketch after this list)
  – Use regularization techniques
  – Reduce the model's complexity
  – Use more training data or data augmentation
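A minimal sketch of the first prevention strategy, assuming scikit-learn is available: the same high-degree polynomial model is fit with and without L2 regularization (Ridge). The dataset and hyperparameters here are illustrative, not from the slides.

```python
# Minimal sketch: curbing overfitting with L2 regularization (Ridge).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)

# A degree-15 polynomial has enough parameters to memorize 20 noisy points.
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)

# The Ridge penalty shrinks the coefficients, so the fitted curve follows
# the underlying sine pattern instead of the noise in the training points.
regularized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3)).fit(X, y)
```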
Underfitting
• Underfitting occurs when a model is too simple to capture the underlying patterns in the data
• Symptoms
  – Poor performance on both training and validation/test data
  – Model fails to capture the complexity of the data
• Causes
  – Model is too simple
  – Insufficient training time
  – Features used in the model are not relevant or sufficient
• Prevention (see the sketch after this list)
  – Use a more complex model
  – Train the model for more epochs
  – Provide better or more features to the model
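A minimal sketch of the first and third prevention strategies, again assuming scikit-learn: a plain linear model underfits a quadratic relationship, while adding polynomial features gives it enough capacity (the data is illustrative).

```python
# Minimal sketch: fixing underfitting by adding model capacity/features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2  # a quadratic relationship

# A straight line cannot represent a parabola: low R^2 even on training data.
underfit = LinearRegression().fit(X, y)
print(underfit.score(X, y))  # close to 0

# Adding squared features lets the linear model capture the pattern.
better = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)
print(better.score(X, y))    # ~1.0
```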
Differences
Aspect                       | Overfitting   | Underfitting
-----------------------------|---------------|--------------
Performance on training data | High accuracy | Low accuracy
Performance on test data     | Poor          | Poor
Model complexity             | Too complex   | Too simple
Generalization               | Poor          | Poor