ML Unit-1
Artificial Intelligence (AI)
Definition
AI refers to the simulation of human intelligence by machines, enabling them to perform tasks
like reasoning, learning, problem-solving, and decision-making. It serves as an umbrella term
encompassing fields like machine learning, natural language processing, robotics, and
computer vision.
Advantages
1. Automation: Handles repetitive tasks efficiently.
2. Enhanced Decision-Making: Analyzes large datasets for insights.
3. Scalability: Operates on a large scale without human intervention.
4. Versatility: Applicable in healthcare, finance, transportation, and more.
Drawbacks
1. High Costs: Development and maintenance are expensive.
2. Limited Generalization: AI models often excel in narrow tasks but lack flexibility.
3. Ethical Concerns: Raises privacy, security, and bias issues.
4. Job Displacement: Automation may lead to unemployment in certain sectors.
Features
1. Learning: Adapts and improves over time through data.
2. Reasoning: Solves problems and makes decisions.
3. Self-Correction: Learns from errors to improve performance.
4. Perception: Interprets data from sensors or the environment (e.g., images, sounds).
Applications
1. Healthcare: Disease diagnosis, drug discovery, and robotic surgeries.
2. Finance: Fraud detection, algorithmic trading, and credit scoring.
3. Retail: Inventory management, personalized recommendations, and chatbots.
4. Transportation: Autonomous vehicles and traffic management.
Machine Learning (ML)
Definition
Machine learning (ML) is a discipline of artificial intelligence (AI) that provides machines with
the ability to automatically learn from data and past experiences while identifying patterns to
make predictions with minimal human intervention.
Advantages
1. Adaptability: Learns and improves as new data is provided.
2. Predictive Power: Analyzes data for future outcomes.
3. Wide Applications: Fraud detection, recommendation systems, and predictive
maintenance.
4. Efficiency: Automates data analysis tasks.
Drawbacks
1. Data Dependency: Requires large, high-quality datasets.
2. Complexity: Some algorithms are computationally expensive.
3. Overfitting: Models may memorize data instead of generalizing.
4. Opacity: Certain models (e.g., neural networks) lack interpretability.
Features
1. Data-Driven: Learns from structured or unstructured data.
2. Algorithm-Based: Uses methods like regression, clustering, and decision trees.
3. Self-Improving: Enhances performance over time with feedback.
4. Automation: Reduces manual effort in decision-making tasks.
Applications
1. Recommendation Systems: Netflix, Amazon, and Spotify.
2. Fraud Detection: Identifying anomalies in transactions.
3. Customer Segmentation: Behavioral analysis for targeted marketing.
4. Healthcare: Risk prediction and personalized medicine.
Deep Learning (DL)
Definition
DL is a specialized subset of ML that uses artificial neural networks with many layers (deep
networks) to process data. It excels at recognizing patterns in unstructured data like images,
text, and audio.
Advantages
1. Accuracy: Delivers state-of-the-art results in complex tasks.
2. Unstructured Data Processing: Handles images, videos, text, and audio effectively.
3. Automated Feature Extraction: Learns features directly from raw data.
4. Scalability: Performs well with large datasets and advanced hardware.
Drawbacks
1. Data Hunger: Requires vast amounts of labeled data.
2. Computationally Intensive: Needs high-end hardware like GPUs.
3. Interpretability Issues: Often functions as a "black box," making results hard to
explain.
4. Overfitting Risk: May struggle to generalize to unseen data.
Features
1. Hierarchical Learning: Extracts features at multiple levels (low-level edges to high-
level concepts).
2. End-to-End Learning: Automates the entire learning process from input to output.
3. Complex Data Handling: Excels with unstructured data like images and text.
4. High Accuracy: Achieves better results in vision, language, and audio tasks.
Applications
1. Image Recognition: Facial recognition, medical imaging, and object detection.
2. Natural Language Processing (NLP): Chatbots, language translation, and sentiment
analysis.
3. Speech Recognition: Virtual assistants like Siri and Alexa.
4. Autonomous Vehicles: Navigating self-driving cars with object detection and decision-
making.
Example: Role of AI, ML, and DL in Self-Driving Cars
1. Artificial Intelligence (AI) in Self-Driving Cars
• How It Works:
o AI systems integrate data from various sensors (LiDAR, cameras, radar, GPS)
to understand the vehicle's environment.
o AI combines predefined rules with learning algorithms to make real-time
decisions like when to stop, when to accelerate, and how to avoid collisions.
o AI's reasoning algorithms simulate human thought processes and follow traffic
laws.
• Example:
o Waymo, a self-driving car company, uses AI to integrate information from
sensors and plan safe routes through traffic, managing complex tasks such as
stopping at traffic lights, avoiding pedestrians, and handling intersections.
• Strength:
o AI allows the car to make intelligent decisions based on sensor data, traffic
conditions, and predefined rules.
• Limitation:
o AI alone is often not sufficient for handling the complexity of real-time
decision-making. It requires the integration of ML and DL for better
performance.
2. Machine Learning (ML) in Self-Driving Cars
ML Role: ML algorithms enable the car to improve its driving behavior over time by learning
from data, experience, and feedback. ML focuses on predicting the future actions of objects
(like pedestrians and other vehicles) and adapting to changing environments.
• How It Works:
o Reinforcement Learning: Cars learn optimal driving strategies by interacting
with the environment and receiving feedback on their actions (e.g., avoid
collisions, follow speed limits).
• Example:
o Self-driving trucks use ML to adapt their speed and route based on live traffic
conditions and road closures.
• Strength:
• Limitation:
3. Deep Learning (DL) in Self-Driving Cars
DL Role: DL, particularly Convolutional Neural Networks (CNNs), is essential for tasks like
object detection, image segmentation, and lane recognition. DL models process unstructured
data (images, videos) directly from cameras and other sensors to identify objects like
pedestrians, vehicles, traffic signs, and road conditions.
• How It Works:
o End-to-End Learning: DL models can learn to drive the car from raw sensor
data (e.g., camera images) to control the car's steering, acceleration, and
braking, all in one step.
o Fusion of Sensor Data: DL models combine data from cameras, LiDAR, and
radar to create a 360-degree view of the environment, enabling real-time
decision-making.
• Example:
o Waymo and Tesla both use deep learning to enable their vehicles to detect and
classify objects, understand the road, and make driving decisions.
• Strength:
• Limitation:
o DL models require high computational power (often GPUs) for training and
inference, which makes them resource-intensive.
AI LAYERS
Generative AI:
Generative AI is a type of AI that can create new content including text, code, images and
music. Generative AI models are trained on large datasets of existing content, learning to
identify patterns in data and using those patterns to generate new content.
LLMs are a type of generative AI model trained on massive datasets of text and code. LLMs
can generate text, translate languages, write different kinds of creative content and answer your
questions in an informative way.
GPT-4 and ChatGPT are two well-known examples. GPT-4 is an LLM developed by OpenAI,
while ChatGPT (also developed by OpenAI) is a conversational application built on GPT
models and designed specifically for chatbot use.
2. Types of Machine Learning Systems
Machine learning (ML) is a discipline of artificial intelligence (AI) that provides machines with
the ability to automatically learn from data and past experiences while identifying patterns to
make predictions with minimal human intervention.
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
(Figure: Types of ML)
1. Supervised Machine Learning
Supervised learning trains a model on labelled data, where each input example is paired with
its correct output, so that the model can predict the output for new, unseen inputs.
Example: Consider a scenario where you have to build an image classifier to differentiate
between cats and dogs. If you feed labelled images of dogs and cats to the algorithm, the
machine will learn to classify a dog or a cat from these labelled images. When we input new
dog or cat images that it has never seen before, it will use what it has learned to predict whether
each image is a dog or a cat. There are two main categories of supervised learning, mentioned
below:
• Classification
• Regression
Classification
Classification deals with predicting categorical target variables, which represent discrete
classes or labels. For instance, classifying emails as spam or not spam, or predicting whether a
patient has a high risk of heart disease. Classification algorithms learn to map the input features
to one of the predefined classes.
Here are some classification algorithms:
• Logistic Regression
• Support Vector Machine
• Random Forest
• Decision Tree
• K-Nearest Neighbors (KNN)
• Naive Bayes
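A minimal classification sketch is shown below (assuming scikit-learn is installed; the synthetic dataset and the logistic-regression choice are illustrative, not prescribed by this unit):

```python
# Minimal supervised classification sketch (assumes scikit-learn is available).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labelled data: X holds input features, y holds the known class labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = LogisticRegression().fit(X_train, y_train)      # learn the mapping from X to y
print("Predicted classes:", clf.predict(X_test[:5]))  # labels for unseen inputs
print("Test accuracy:", clf.score(X_test, y_test))    # fraction of correct predictions
```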
Regression
Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms learn to
map the input features to a continuous numerical value.
Here are some regression algorithms:
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Decision tree
• Random Forest
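A minimal regression sketch along the same lines (assuming scikit-learn; the house sizes and prices are made-up values used only for illustration):

```python
# Minimal supervised regression sketch (assumes scikit-learn is available).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square feet and its price (a continuous target).
size = np.array([[800], [1000], [1200], [1500], [1800]])
price = np.array([120_000, 150_000, 178_000, 220_000, 265_000])

reg = LinearRegression().fit(size, price)   # fit a line: price ≈ w * size + b
print(reg.predict([[1300]]))                # predicted price for an unseen 1300 sq ft house
```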
Advantages of Supervised Machine Learning
• Supervised Learning models can have high accuracy as they are trained on labelled
data.
• The process of decision-making in supervised learning models is often interpretable.
• Pre-trained supervised models can often be reused, which saves time and resources
compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
• It is limited to the patterns present in the training data and may struggle with unseen
or unexpected patterns.
• It can be time-consuming and costly, as it relies on labeled data.
• It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
• Image classification: Identify objects, faces, and other features in images.
• Natural language processing: Extract information from text, such as sentiment,
entities, and relationships.
• Speech recognition: Convert spoken language into text.
• Recommendation systems: Make personalized recommendations to users.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
• Medical diagnosis: Detect diseases and other medical conditions.
• Fraud detection: Identify fraudulent transactions.
• Autonomous vehicles: Recognize and respond to objects in the environment.
• Email spam detection: Classify emails as spam or not spam.
• Quality control in manufacturing: Inspect products for defects.
• Credit scoring: Assess the risk of a borrower defaulting on a loan.
• Gaming: Recognize characters, analyze player behavior, and create NPCs.
• Customer support: Automate customer support tasks.
• Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
• Sports analytics: Analyze player performance, make game predictions, and optimize
strategies.
2. Unsupervised Machine Learning
Unsupervised learning finds structure in data that has no labels.
Example: Consider a dataset containing information about the purchases you made from a
shop. Through clustering, the algorithm can group customers with similar purchasing
behaviour, revealing customer segments without any predefined labels. This kind of
information can help businesses target customers and identify outliers.
There are two main categories of unsupervised learning that are mentioned below:
• Clustering
• Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for labeled
examples.
Here are some clustering algorithms:
• K-Means Clustering algorithm
• DBSCAN Algorithm
• Principal Component Analysis (strictly a dimensionality-reduction technique, often used alongside clustering)
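A minimal clustering sketch (assuming scikit-learn; the blob dataset and the choice of three clusters are illustrative):

```python
# Minimal unsupervised clustering sketch (assumes scikit-learn is available).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data: only input features, no target column.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])        # cluster index assigned to each of the first 10 points
print(kmeans.cluster_centers_)    # coordinates of the discovered cluster centres
```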
Association
Association rule learning is a technique for discovering relationships between items in a
dataset. It identifies rules that indicate the presence of one item implies the presence of another
item with a specific probability.
Here are some association rule learning algorithms:
• Apriori Algorithm
• FP-growth Algorithm
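A rough association-rule sketch follows, assuming the third-party mlxtend package is installed (the tiny basket data and thresholds are hypothetical, and the exact API may vary slightly between mlxtend versions):

```python
# Minimal association-rule mining sketch (assumes the mlxtend package is installed).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: True means the item was in the basket (hypothetical data).
baskets = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 0]],
    columns=["bread", "butter", "milk"],
).astype(bool)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)               # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)   # e.g. bread -> butter
print(rules[["antecedents", "consequents", "support", "confidence"]])
```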
Advantages of Unsupervised Machine Learning
• It helps to discover hidden patterns and various relationships between the data.
• Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
• It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
• Without using labels, it may be difficult to predict the quality of the model’s output.
• Cluster Interpretability may not be clear and may not have meaningful interpretations.
• Additional techniques, such as autoencoders and dimensionality reduction, are often
needed to extract meaningful features from raw data.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
• Clustering: Group similar data points into clusters.
• Anomaly detection: Identify outliers or anomalies in data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
• Recommendation systems: Suggest products, movies, or content to users based on
their historical behavior or preferences.
• Topic modeling: Discover latent topics within a collection of documents.
• Density estimation: Estimate the probability density function of data.
• Image and video compression: Reduce the amount of storage required for multimedia
content.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
• Market basket analysis: Discover associations between products.
• Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
• Image segmentation: Segment images into meaningful regions.
• Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
• Customer behavior analysis: Uncover patterns and insights for better marketing and
product recommendations.
• Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
• Exploratory data analysis (EDA): Explore data and gain insights before defining
specific tasks.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and
unsupervised learning: it uses both labelled and unlabelled data. It is particularly useful when
obtaining labelled data is costly, time-consuming, or resource-intensive, or when labelling
requires specialist skills and resources.
We use these techniques when only a small portion of the data is labelled and the remaining,
larger portion is unlabelled. Unsupervised techniques can be used to predict labels for the
unlabelled portion, and those labels are then fed to supervised techniques. This approach is
especially common with image datasets, where most images are not labelled.
Example: Consider building a language-translation model; obtaining labelled translations for
every sentence pair can be resource-intensive. Semi-supervised learning allows the model to
learn from both labelled and unlabelled sentence pairs, making it more accurate. This technique
has led to significant improvements in the quality of machine translation services.
Types of Semi-Supervised Learning Methods
There are a number of different semi-supervised learning methods each with its own
characteristics. Some of the most common ones include:
• Graph-based semi-supervised learning: This approach uses a graph to represent the
relationships between the data points. The graph is then used to propagate labels from
the labeled data points to the unlabeled data points.
• Label propagation: This approach iteratively propagates labels from the labeled data
points to the unlabeled data points, based on the similarities between the data points
(see the sketch after this list).
• Co-training: This approach trains two different machine learning models on different
views (feature subsets) of the data. Each model then labels unlabeled examples for the
other.
• Self-training: This approach trains a machine learning model on the labeled data and
then uses the model to predict labels for the unlabeled data. The model is then retrained
on the labeled data and the predicted labels for the unlabeled data.
• Generative adversarial networks (GANs): GANs are a type of deep learning
algorithm that can be used to generate synthetic data. GANs can be used to generate
unlabeled data for semi-supervised learning by training two neural networks, a
generator and a discriminator.
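A minimal sketch of the label propagation idea referenced above (assuming scikit-learn; the synthetic data and the roughly 10% labelling rate are illustrative):

```python
# Minimal semi-supervised sketch using label propagation (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Pretend only ~10% of labels are known; scikit-learn marks unlabelled points with -1.
rng = np.random.default_rng(0)
unlabelled = rng.random(len(y)) > 0.1
y_partial = np.copy(y)
y_partial[unlabelled] = -1

model = LabelPropagation().fit(X, y_partial)   # labels spread to similar unlabelled points
accuracy = (model.transduction_[unlabelled] == y[unlabelled]).mean()
print("Accuracy on the originally unlabelled points:", accuracy)
```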
Advantages of Semi-Supervised Machine Learning
• It leads to better generalization as compared to supervised learning, as it takes both
labeled and unlabeled data.
• Can be applied to a wide range of data.
Disadvantages of Semi-Supervised Machine Learning
• Semi-supervised methods can be more complex to implement compared to other
approaches.
• It still requires some labeled data that might not always be available or easy to obtain.
• Noisy or unrepresentative unlabeled data can negatively impact model performance.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
• Image Classification and Object Recognition: Improve the accuracy of models by
combining a small set of labeled images with a larger set of unlabeled images.
• Natural Language Processing (NLP): Enhance the performance of language models
and classifiers by combining a small set of labeled text data with a vast amount of
unlabeled text.
• Speech Recognition: Improve the accuracy of speech recognition by leveraging a
limited amount of transcribed speech data and a more extensive set of unlabeled audio.
• Recommendation Systems: Improve the accuracy of personalized recommendations
by supplementing a sparse set of user-item interactions (labeled data) with a wealth of
unlabeled user behavior data.
• Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a
small set of labeled medical images alongside a larger set of unlabeled images.
4. Reinforcement Machine Learning
Reinforcement learning trains an agent to make a sequence of decisions by interacting with an
environment: the agent receives rewards or penalties for its actions and learns a policy that
maximizes cumulative reward.
Disadvantages of Reinforcement Machine Learning
• Training Reinforcement Learning agents can be computationally expensive and time-
consuming.
• Reinforcement learning is not preferable for solving simple problems.
• It needs a lot of data and a lot of computation, which makes it impractical and costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
• Game Playing: RL can teach agents to play games, even complex ones.
• Robotics: RL can teach robots to perform tasks autonomously.
• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
• Recommendation Systems: RL can enhance recommendation algorithms by learning
user preferences.
• Healthcare: RL can be used to optimize treatment plans and drug discovery.
• Natural Language Processing (NLP): RL can be used in dialogue systems and
chatbots.
• Finance and Trading: RL can be used for algorithmic trading.
• Supply Chain and Inventory Management: RL can be used to optimize supply chain
operations.
• Energy Management: RL can be used to optimize energy consumption.
• Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.
• Adaptive Personal Assistants: RL can be used to improve personal assistants.
• Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create
immersive and interactive experiences.
• Industrial Control: RL can be used to optimize industrial processes.
• Education: RL can be used to create adaptive learning systems.
• Agriculture: RL can be used to optimize agricultural operations.
3. Main Challenges of Machine Learning
• Cross-Validation: Employ cross-validation methods to test the model’s generalization
capabilities across different subsets of the data.
Addressing non-representative data is essential for ensuring that models can make accurate
predictions in real-world scenarios.
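As a minimal sketch of the cross-validation step mentioned above (assuming scikit-learn; the iris dataset and logistic regression are illustrative choices):

```python
# Minimal cross-validation sketch (assumes scikit-learn is available).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is held out once while the model trains on the rest.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```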
Solutions:
• Automated Monitoring: Implement monitoring systems to detect when a model’s
performance starts to decline.
• Scheduled Retraining: Regularly retrain models using new data to keep them up
to date.
Effective monitoring and maintenance strategies are critical for ensuring that machine learning
models remain accurate over time.
6. Data Bias
Data bias occurs when the training data used to build a model is not representative of the
broader population, leading to biased predictions. This can result in models that discriminate
against certain groups or fail to generalize to all users.
Examples:
• Gender Bias in Hiring Models: Algorithms trained on biased hiring data may favor
one gender over another, perpetuating inequalities.
• Facial Recognition: Systems trained predominantly on lighter-skinned individuals
often fail to accurately identify people with darker skin tones.
Detecting and Reducing Bias:
• Bias Detection Tools: Tools like IBM AI Fairness 360 can help identify and reduce
bias in machine learning models.
• Diverse Training Data: Ensuring that the training dataset includes diverse examples
can help mitigate bias.
Addressing data bias is critical for building fair and equitable machine learning models,
especially in industries like healthcare, finance, and criminal justice.
7. Lack of Explainability
Many machine learning models, especially deep learning models, are often described as
“black boxes” due to the difficulty in understanding how they make decisions. This lack of
explainability presents challenges in industries where transparency is crucial, such
as healthcare and finance.
Consequences:
• Regulatory Compliance: In some industries, regulations require that models provide
clear explanations for their decisions. Lack of explainability can hinder the adoption of
machine learning in these fields.
• Trust: Without understanding how a model arrives at a decision, stakeholders may be
reluctant to trust its predictions.
Methods to Improve Explainability:
• LIME (Local Interpretable Model-agnostic Explanations): LIME explains
individual predictions by approximating the model locally.
• SHAP (SHapley Additive exPlanations): SHAP values provide insights into how each
feature contributes to a prediction.
Improving explainability is essential for increasing trust in machine learning models and
ensuring compliance with industry regulations.
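A rough explainability sketch (assuming the third-party shap package and scikit-learn are installed; the dataset and random-forest model are illustrative choices, not anything mandated here):

```python
# Minimal SHAP sketch: per-feature contributions to individual predictions
# (assumes the shap package and scikit-learn are installed).
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)            # explainer specialised for tree ensembles
shap_values = explainer.shap_values(X.iloc[:5])  # contributions for the first 5 predictions
print(shap_values)
```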
• Model Scaling: Adapting models to handle larger datasets or real-time applications can
be difficult.
Solutions:
• Automated Machine Learning (AutoML): AutoML platforms automate many of the
tasks involved in building machine learning models, reducing the complexity of the
process.
• Pipeline Automation: Automating data pipelines can streamline the process of moving
from data collection to model deployment.
Simplifying the machine learning workflow through automation tools can help overcome the
complexity of the process.
Reducing irrelevant features improves model accuracy and efficiency, leading to better results
and lower computational costs.
4. Statistical Learning: Introduction, Supervised & Unsupervised Learning
Introduction to Statistical Learning: Statistical learning is a branch of machine learning and
statistics that focuses on understanding and analyzing data. It involves developing models to
predict or explain a set of outcomes based on input data. The main objective of statistical
learning is to uncover patterns, make predictions, or infer relationships within data using
mathematical and computational techniques.
Statistical learning includes both supervised and unsupervised learning, which differ in how
the data is used to train models.
Data Exploration: Descriptive statistics and data visualization are crucial for understanding
the underlying structure of the data in unsupervised learning. This exploration helps identify
patterns and outliers.
Feature Engineering: In unsupervised learning, feature engineering involves creating new
features or representations from the original data. Statistical techniques can be used to create
meaningful features that capture important information.
Anomaly Detection: Detecting anomalies in data involves comparing data points
to statistical distributions or defining threshold values. Deviations from the expected statistical
patterns can indicate anomalies.
Different statistics we encounter while working with machine learning
When working with machine learning, various risk statistics need to be considered to assess
the performance, reliability, and potential pitfalls of our models. These risk statistics provide
insights into different aspects of model behavior and help in making informed decisions. In
statistical learning, loss refers to the error between the predicted values of a model and the true
values. It is used to evaluate how well a model fits the data.
1. Training Loss:
• The error calculated on the training dataset, which the model uses to learn.
• Goal: Minimize the training loss during model training.
• Observation: A low training loss indicates the model has learned the patterns in the
training data well.
2. Test Loss:
• The error calculated on the test dataset, which is unseen by the model during
training.
• Goal: Evaluate the generalization ability of the model on new, unseen data.
• Observation: A low test loss means the model generalizes well and can make
accurate predictions on unseen data.
Key Concept:
• Overfitting: Overfitting occurs when a model learns too much from the
training data, including noise, outliers, or irrelevant patterns, making it overly
complex. As a result, the model performs very well on the training data but fails
to generalize to new, unseen data (test data).
• Underfitting: Underfitting happens when a model is too simple to capture the
underlying patterns in the data. It fails to learn enough from the training data
and cannot make accurate predictions on either the training or test data.
3. Accuracy:
• Definition: The ratio of correctly predicted instances to the total number of instances in
classification tasks.
• Importance: Provides an overall measure of classification performance.
• Considerations: Can be misleading when classes are imbalanced; may not account for
varying costs of misclassifications.
4. Precision and Recall:
• Precision: The ratio of true positive predictions to the total positive predictions made
by the model.
• Recall: The ratio of true positive predictions to the total actual positives in the dataset.
• Importance: Important for imbalanced classes; precision focuses on the accuracy of
positive predictions, while recall focuses on the ability to find all positives.
5. F1-Score:
• Definition: The harmonic mean of precision and recall.
• Importance: Provides a balance between precision and recall.
• Interpretation: Useful when there's a trade-off between false positives and false
negatives.
6. Confusion Matrix:
• Definition: Provides detailed insights into classification performance.
• Applications: Used to calculate various classification metrics like accuracy, precision,
recall, and F1-score.
7. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE):
• MSE: The average of the squared differences between predicted and actual values in
regression tasks.
• RMSE: The square root of MSE.
• Importance: Measure the accuracy of regression models.
• Interpretation: Lower values indicate better fit; sensitive to outliers.
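A minimal sketch computing the metrics above (assuming scikit-learn; the true and predicted values are made-up illustrations):

```python
# Minimal evaluation-metrics sketch (assumes scikit-learn is available).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, mean_squared_error)

# Classification metrics on hypothetical true vs. predicted labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

# Regression metrics on hypothetical continuous predictions.
y_true_r = [3.0, 5.0, 2.5, 7.0]
y_pred_r = [2.8, 5.3, 2.9, 6.5]
mse = mean_squared_error(y_true_r, y_pred_r)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
```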
8. Bias and Variance:
• Bias: The difference between the model's predictions and the true values; high bias
leads to underfitting.
• Variance: The model's sensitivity to small changes in the training data; high variance
leads to overfitting.
• Bias-Variance Trade-off: Balancing bias and variance is crucial for optimal model
performance.
9. Cross-Validation Results:
• K-Fold Cross-Validation: Assessing model performance across different data splits to
ensure generalization.
• Importance: Helps detect overfitting, evaluate model stability, and make informed
model choices.
10. Learning Curves:
• Definition: Plots of training and testing loss (or other metrics) against the number of
training examples.
• Importance: Visualizes how the model's performance changes with data size; helps
identify underfitting and overfitting.
Ways to Prevent Overfitting
Although overfitting is an error that reduces the performance of a machine learning model, it
can be prevented in several ways. Using a linear model helps avoid overfitting, but many
real-world problems are non-linear, so it is important to prevent overfitting in more flexible
models as well. Below are several techniques that can be used to prevent overfitting:
• Early Stopping
• Train with more data
• Feature Selection
• Cross-Validation
• Data Augmentation
• Regularization
Techniques to Reduce Underfitting
• Increase model complexity.
• Increase the number of features by performing feature engineering.
• Remove noise from the data.
• Increase the number of epochs or the duration of training to get better results.
Supervised vs. Unsupervised Learning

| Aspect | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Definition | Learning from labeled data to predict or classify outcomes. | Learning from unlabeled data to find patterns or structures. |
| Data Requirement | Requires labeled data (input + corresponding output). | Uses only unlabeled data (input features). |
| Goal | Map inputs to known outputs for prediction or classification. | Discover hidden patterns, clusters, or structures in the data. |
| Examples | Predicting house prices (Regression); Spam detection (Classification) | Customer segmentation (Clustering); Dimensionality reduction (e.g., PCA) |
| Algorithms | Linear regression, Logistic regression, Random forests, Neural networks | K-means clustering, Hierarchical clustering, PCA, Autoencoders |
| Evaluation | Measured using metrics like accuracy, precision, recall, etc. | Evaluated using metrics like cohesion (e.g., silhouette score) or explained variance. |
| Applications | Fraud detection; Medical diagnosis; Stock price prediction | Market segmentation; Anomaly detection; Data visualization |
| Strength | Highly accurate for prediction tasks with sufficient labeled data. | Effective for exploring and grouping data without labels. |
| Limitations | Requires a large amount of labeled data, which can be expensive to obtain. | Results can be less interpretable and harder to validate. |
5. Training and Test Loss, Tradeoffs in Statistical Learning
In statistical learning, loss refers to the error between the predicted values of a model and the
true values. It is used to evaluate how well a model fits the data.
Training Loss:
• The error calculated on the training dataset, which the model uses to learn.
• Goal: Minimize the training loss during model training.
• Observation: A low training loss indicates the model has learned the patterns in the
training data well.
Test Loss:
• The error calculated on the test dataset, which is unseen by the model during
training.
• Goal: Evaluate the generalization ability of the model on new, unseen data.
• Observation: A low test loss means the model generalizes well and can make
accurate predictions on unseen data.
Key Concept:
• Overfitting: Overfitting occurs when a model learns too much from the training data,
including noise, outliers, or irrelevant patterns, making it overly complex. As a result,
the model performs very well on the training data but fails to generalize to new, unseen
data (test data).
• Underfitting: Underfitting happens when a model is too simple to capture the
underlying patterns in the data. It fails to learn enough from the training data and cannot
make accurate predictions on either the training or test data.
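A minimal sketch of comparing training and test loss to spot these problems (assuming scikit-learn; the synthetic data and the unpruned decision tree are illustrative, the tree being deliberately prone to overfitting):

```python
# Minimal sketch: compare training loss and test loss (assumes scikit-learn).
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_loss = mean_squared_error(y_train, model.predict(X_train))
test_loss = mean_squared_error(y_test, model.predict(X_test))

# Training loss far below test loss suggests overfitting;
# both losses being high suggests underfitting.
print("Training loss:", train_loss)
print("Test loss    :", test_loss)
```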
In machine learning, an error is a measure of how accurately an algorithm can make predictions
for the previously unknown dataset. There are mainly two types of errors in machine learning,
which are:
• Reducible errors: These errors can be reduced to improve the model accuracy. Such
errors can further be classified into bias and Variance.
• Irreducible errors: These errors will always be present in the model regardless of
which algorithm is used. They are caused by unknown variables and inherent noise in
the data, and cannot be reduced.
Bias:
Bias is the difference between the predictions made by the model and the actual (expected)
values; this systematic difference is known as bias error, or error due to bias.
Low Bias: The model makes fewer assumptions and is flexible enough to capture complex
patterns in the data.
High Bias: The model makes overly simplistic assumptions about the data and fails to
capture its complexity.
| Aspect | High Bias | Low Bias |
| --- | --- | --- |
| Model Simplicity | Very simple (e.g., linear regression for non-linear data). | Flexible and complex (e.g., deep neural networks). |
| Fit on Training Data | Poor (underfits the data). | Excellent (fits the training data well). |
| Training Error | High. | Low. |
| Test Error | High, due to underfitting. | Depends on variance (may be low or high). |
How to Reduce Bias:
1. Increase Model Complexity:
• Use models capable of capturing more complex relationships (e.g., decision
trees, neural networks).
2. Add Features:
• Add more relevant input features to provide the model with additional
information.
3. Reduce Regularization:
• Regularization prevents overfitting but can lead to underfitting if too strong.
4. Try Non-Linear Models:
• Use algorithms like polynomial regression, kernel SVMs, or ensemble methods
to handle non-linear data.
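A minimal sketch of the first and last points above (assuming scikit-learn; the quadratic toy data is illustrative): a plain linear model underfits the curved data, while adding polynomial features reduces the bias.

```python
# Minimal sketch: reduce bias by increasing model complexity (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Non-linear (quadratic) data: a straight line cannot capture it (high bias).
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.RandomState(0).normal(scale=0.5, size=100)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("Linear model R^2    :", linear.score(X, y))   # low: the model is too simple
print("Polynomial model R^2:", poly.score(X, y))     # higher: added features reduce bias
```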
Variance
Variance refers to the error caused by the model’s sensitivity to small fluctuations in the training
data. It indicates how much the model’s predictions change if the training data changes.
Low Variance
A model with low variance is stable, meaning its predictions don't change drastically when
trained on different subsets of the data. It captures the general patterns in the data without being
overly influenced by small variations or noise.
High Variance
A model with high variance is highly sensitive to small changes in the training data. It
"memorizes" the details of the training data, including noise, which may not be present in new,
unseen data. This can lead to overfitting, where the model performs very well on the training
set but poorly on the test set.
| Aspect | Low Variance | High Variance |
| --- | --- | --- |
| Model Sensitivity | Not sensitive to small fluctuations in training data. | Sensitive to small changes in training data. |
| Fit on Training Data | Fit is more general and does not capture noise. | Fit is too specific to the training data, possibly capturing noise. |
| Training Error | Low, but might be due to underfitting. | Low (because the model fits the training data very well). |
1. Low-Bias, Low-Variance:
The combination of low bias and low variance represents an ideal machine learning model.
However, it is practically very difficult to achieve.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are
inconsistent but accurate on average. This case occurs when the model learns with a
large number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are
consistent but inaccurate on average. This case occurs when the model does not learn
well from the training dataset or uses a small number of parameters. It leads
to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate on
average.
Bias-Variance Trade-Off
While building a machine learning model, it is important to manage bias and variance in order
to avoid overfitting and underfitting. If the model is very simple, with few parameters, it may
have low variance and high bias; if the model has a large number of parameters, it will tend to
have high variance and low bias. The required balance between these two sources of error is
known as the Bias-Variance trade-off.
For accurate predictions, an algorithm needs both low variance and low bias. But this is
difficult to achieve because the two are related:
o If we decrease the variance, the bias tends to increase.
o If we decrease the bias, the variance tends to increase.
The Bias-Variance trade-off is a central issue in supervised learning. Ideally, we want a model
that accurately captures the regularities in the training data and simultaneously generalizes well
to unseen data. Unfortunately, doing both at once is hard: a high-variance algorithm may
perform well on the training data but overfit noisy data, whereas a high-bias algorithm produces
a much simpler model that may miss important regularities in the data. So we need to find a
sweet spot between bias and variance to build an optimal model; the Bias-Variance trade-off is
about finding this balance between the bias and variance errors.
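For squared-error loss, this trade-off is usually summarised by the standard decomposition of the expected prediction error, where $\hat{f}(x)$ is the learned model and $\sigma^2$ is the irreducible noise:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$

Simple models tend to have a large bias term and flexible models a large variance term; the optimal model minimises their sum.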
6. Estimating Risk Statistics, Sampling distribution of an estimator,
Empirical Risk Minimization.
Risk statistics in machine learning involve assessing the performance of a model based on a
defined loss function. The risk is a measure of how far off a model's predictions are from the
true target values. The goal is to minimize this risk to ensure the model generalizes well to
unseen data.
Common risks encountered when applying machine learning include:
1. Poor Data
2. Overfitting
3. Biased Data
4. Lack of strategy and experience
5. Security Risks
6. Data privacy and confidentiality
7. Third-party risks
8. Regulatory challenges
3. Empirical Risk:
• Since the true distribution $P(X, Y)$ is unknown, we approximate the true risk
using a finite dataset $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$.
• Empirical risk is the average loss over the sample data:
$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(X_i), Y_i\big)$
An estimator is a rule or formula for estimating a parameter of the population from sample
data. For example:
• Sample mean estimates the population mean.
• Model f estimates the true function f∗.
The sampling distribution of an estimator describes how the values of an estimator (e.g.,
sample mean) vary across different random samples.
Key Concepts
1. Estimator θ^:
• A function of the sample data used to estimate an unknown population
parameter θ.
2. Sampling Distribution:
• The distribution of an estimator's values when computed on different random
samples of size n from the same population.
3. Example: Sample Mean
• If $X_1, X_2, \ldots, X_n$ are samples from a population with mean $\mu$ and variance $\sigma^2$:
• The sample mean is given by $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$; its sampling distribution has mean $\mu$ and variance $\sigma^2 / n$.
In statistics, the sampling distribution is the probability distribution of a given statistic
computed from a random sample. It provides a general basis for statistical inference. An
estimator is the general mathematical rule used to calculate a sample statistic, and an estimate
is the result of that estimation.
The sampling distribution of an estimator depends on the sample size, so the effect of changing
the sample size has to be determined. An estimate that is a single numerical value is called a
point estimate. There are various estimators, such as the sample mean, sample standard
deviation, proportion, variance, and range.
Sampling distribution of the mean: Its mean is the population mean from which the samples
are drawn. For all sample sizes, it is likely to be normal if the population distribution is normal,
and its mean equals the population mean. The standard deviation of the sampling distribution
of the mean is:
$\sigma_M = \dfrac{\sigma}{\sqrt{n}}$
where $\sigma_M$ is the standard deviation of the sampling distribution of the mean, $\sigma$ is the
population standard deviation, and $n$ is the sample size.
As the size of the sample increases, the spread of the sampling distribution of the mean
decreases. But the mean of the distribution remains the same and it is not affected by the sample
size.
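A small simulation illustrating this (assuming NumPy; the population mean of 50, standard deviation of 10, and sample size of 25 are arbitrary illustrative values):

```python
# Minimal simulation of the sampling distribution of the mean (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
sigma = 10.0   # population standard deviation
n = 25         # sample size

# Draw 10,000 samples of size n and record each sample mean.
sample_means = rng.normal(loc=50.0, scale=sigma, size=(10_000, n)).mean(axis=1)

print("Std of sample means:", sample_means.std())   # empirical spread of the means
print("sigma / sqrt(n)    :", sigma / np.sqrt(n))   # theoretical value: 2.0
```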
The standard deviation of the sampling distribution of the standard deviation is the standard
error of the standard deviation. It is approximately:
$\sigma_S \approx \dfrac{\sigma}{\sqrt{2(n-1)}}$
Here, $\sigma_S$ is the standard deviation of the sampling distribution of the standard deviation. This
distribution is positively skewed for small $n$ but becomes approximately normal for sample
sizes greater than 30.
• The loss function: it can cause trouble if it produces a very high loss under certain
conditions.
L2 regularization is an example of (regularized) empirical risk minimization.
L2 Regularization
To handle the problem of overfitting, we use regularization techniques. A regression problem
using L2 regularization is also known as ridge regression. In ridge regression, insignificant
predictors are penalized: the method shrinks the coefficients, which helps when the independent
variables are highly correlated. Ridge regression adds the "squared magnitude" of the
coefficients, i.e., the sum of squares of the weights of all features, as the penalty term to the
loss function.
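In symbols, a sketch of the ridge objective described above, where $\hat{y}_i$ is the model's prediction, $w_j$ are the feature weights, and $\lambda$ controls the strength of the penalty:

$$L(\mathbf{w}) = \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2 + \lambda \sum_{j=1}^{p} w_j^2$$

A larger $\lambda$ shrinks the weights more strongly, trading a little extra bias for lower variance.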