
Artificial Intelligence and Data Science (AI & DS)


Machine Learning [B20AD3201]


Syllabus
Unit-1

Introduction- Artificial Intelligence, Machine Learning, Deep Learning, Types of Machine Learning Systems, Main Challenges of Machine Learning. Statistical Learning: Introduction, Supervised and Unsupervised Learning, Training and Test Loss, Tradeoffs in Statistical Learning, Estimating Risk Statistics, Sampling distribution of an estimator, Empirical Risk Minimization.

Unit-2

Supervised Learning (Regression/Classification): Basic Methods: Distance-based Methods, Nearest Neighbours, Decision Trees, Naive Bayes, Linear Models: Linear Regression, Logistic Regression, Generalized Linear Models, Support Vector Machines, Binary Classification: Multiclass/Structured outputs, MNIST, Ranking.

Unit-3

Ensemble Learning and Random Forests: Introduction, Voting Classifiers, Bagging and Pasting, Random Forests, Boosting, Stacking. Support Vector Machine: Linear SVM Classification, Nonlinear SVM Classification, SVM Regression, Naïve Bayes Classifiers.

Unit-4

Unsupervised Learning Techniques: Clustering, K-Means, Limits of K-Means, Using Clustering for Image Segmentation, Using Clustering for Preprocessing, Using Clustering for Semi-Supervised Learning, DBSCAN, Gaussian Mixtures. Dimensionality Reduction: The Curse of Dimensionality, Main Approaches for Dimensionality Reduction, PCA, Using Scikit-Learn, Randomized PCA, Kernel PCA.

Unit-5

Neural Networks and Deep Learning: Introduction to Artificial Neural Networks with Keras, Implementing MLPs with Keras, Installing TensorFlow 2, Loading and Preprocessing Data with TensorFlow.

Unit-1

Introduction- Artificial Intelligence, Machine Learning, Deep Learning, Types of Machine Learning Systems, Main Challenges of Machine Learning. Statistical Learning: Introduction, Supervised and Unsupervised Learning, Training and Test Loss, Tradeoffs in Statistical Learning, Estimating Risk Statistics, Sampling distribution of an estimator, Empirical Risk Minimization.

1. Introduction- Artificial Intelligence, Machine Learning, Deep Learning

Artificial Intelligence (AI)
Definition
AI refers to the simulation of human intelligence by machines, enabling them to perform tasks
like reasoning, learning, problem-solving, and decision-making. It serves as an umbrella term
encompassing fields like machine learning, natural language processing, robotics, and
computer vision.

Advantages
1. Automation: Handles repetitive tasks efficiently.
2. Enhanced Decision-Making: Analyzes large datasets for insights.
3. Scalability: Operates on a large scale without human intervention.
4. Versatility: Applicable in healthcare, finance, transportation, and more.
Drawbacks
1. High Costs: Development and maintenance are expensive.
2. Limited Generalization: AI models often excel in narrow tasks but lack flexibility.
3. Ethical Concerns: Raises privacy, security, and bias issues.
4. Job Displacement: Automation may lead to unemployment in certain sectors.
Features
1. Learning: Adapts and improves over time through data.
2. Reasoning: Solves problems and makes decisions.
3. Self-Correction: Learns from errors to improve performance.
4. Perception: Interprets data from sensors or the environment (e.g., images, sounds).
Applications
1. Healthcare: Disease diagnosis, drug discovery, and robotic surgeries.
2. Finance: Fraud detection, algorithmic trading, and credit scoring.
3. Retail: Inventory management, personalized recommendations, and chatbots.
4. Transportation: Autonomous vehicles and traffic management.

Machine Learning (ML)
Definition
Machine learning (ML) is a discipline of artificial intelligence (AI) that provides machines with
the ability to automatically learn from data and past experiences while identifying patterns to
make predictions with minimal human intervention.
Advantages
1. Adaptability: Learns and improves as new data is provided.
2. Predictive Power: Analyzes data for future outcomes.
3. Wide Applications: Fraud detection, recommendation systems, and predictive
maintenance.
4. Efficiency: Automates data analysis tasks.
Drawbacks
1. Data Dependency: Requires large, high-quality datasets.
2. Complexity: Some algorithms are computationally expensive.
3. Overfitting: Models may memorize data instead of generalizing.
4. Opacity: Certain models (e.g., neural networks) lack interpretability.
Features
1. Data-Driven: Learns from structured or unstructured data.
2. Algorithm-Based: Uses methods like regression, clustering, and decision trees.
3. Self-Improving: Enhances performance over time with feedback.
4. Automation: Reduces manual effort in decision-making tasks.
Applications
1. Recommendation Systems: Netflix, Amazon, and Spotify.
2. Fraud Detection: Identifying anomalies in transactions.
3. Customer Segmentation: Behavioral analysis for targeted marketing.
4. Healthcare: Risk prediction and personalized medicine.

Deep Learning (DL)
Definition
DL is a specialized subset of ML that uses artificial neural networks with many layers (deep
networks) to process data. It excels at recognizing patterns in unstructured data like images,
text, and audio.
Advantages
1. Accuracy: Delivers state-of-the-art results in complex tasks.
2. Unstructured Data Processing: Handles images, videos, text, and audio effectively.
3. Automated Feature Extraction: Learns features directly from raw data.
4. Scalability: Performs well with large datasets and advanced hardware.
Drawbacks
1. Data Hunger: Requires vast amounts of labeled data.
2. Computationally Intensive: Needs high-end hardware like GPUs.
3. Interpretability Issues: Often functions as a "black box," making results hard to
explain.
4. Overfitting Risk: May struggle to generalize to unseen data.
Features
1. Hierarchical Learning: Extracts features at multiple levels (low-level edges to high-level concepts).
2. End-to-End Learning: Automates the entire learning process from input to output.
3. Complex Data Handling: Excels with unstructured data like images and text.
4. High Accuracy: Achieves better results in vision, language, and audio tasks.
Applications
1. Image Recognition: Facial recognition, medical imaging, and object detection.
2. Natural Language Processing (NLP): Chatbots, language translation, and sentiment
analysis.
3. Speech Recognition: Virtual assistants like Siri and Alexa.
4. Autonomous Vehicles: Navigating self-driving cars with object detection and decision-making.

Example: Role of AI, ML, and DL in Self-Driving Cars

1. Artificial Intelligence (AI) in Self-Driving Cars

AI Role: AI encompasses the broad decision-making framework for autonomous vehicles. It allows the car to simulate human-like reasoning and problem-solving for tasks like route planning, behavior prediction, and emergency response.

• How It Works:
o AI systems integrate data from various sensors (LiDAR, cameras, radar, GPS)
to understand the vehicle's environment.
o AI combines predefined rules with learning algorithms to make real-time
decisions like when to stop, when to accelerate, and how to avoid collisions.
o AI's reasoning algorithms simulate human thought processes and follow traffic
laws.
• Example:
o Waymo, a self-driving car company, uses AI to integrate information from
sensors and plan safe routes through traffic, managing complex tasks such as
stopping at traffic lights, avoiding pedestrians, and handling intersections.

• Strength:
o AI allows the car to make intelligent decisions based on sensor data, traffic
conditions, and predefined rules.
• Limitation:
o AI alone is often not sufficient for handling the complexity of real-time
decision-making. It requires the integration of ML and DL for better
performance.
2. Machine Learning (ML) in Self-Driving Cars

ML Role: ML algorithms enable the car to improve its driving behavior over time by learning
from data, experience, and feedback. ML focuses on predicting the future actions of objects
(like pedestrians and other vehicles) and adapting to changing environments.

• How It Works:
o Supervised Learning: ML models are trained on large labeled datasets (e.g., traffic signs, road markings) to classify objects, predict vehicle trajectories, and understand road conditions.

o Reinforcement Learning: Cars learn optimal driving strategies by interacting
with the environment and receiving feedback on their actions (e.g., avoid
collisions, follow speed limits).

• Example:
o Tesla’s Autopilot uses ML to predict the motion of nearby vehicles, ensuring safe lane changes and adaptive cruise control.
o Self-driving trucks use ML to adapt their speed and route based on live traffic conditions and road closures.

• Strength:
o ML allows self-driving cars to continuously learn and adapt to new driving scenarios without needing explicit reprogramming.

• Limitation:
o ML models require significant amounts of labeled training data and may struggle with edge cases or unseen environments.

3. Deep Learning (DL) in Self-Driving Cars

DL Role: DL, particularly Convolutional Neural Networks (CNNs), is essential for tasks like
object detection, image segmentation, and lane recognition. DL models process unstructured
data (images, videos) directly from cameras and other sensors to identify objects like
pedestrians, vehicles, traffic signs, and road conditions.

• How It Works:

o Convolutional Neural Networks (CNNs): DL uses CNNs to analyze image data from cameras to detect obstacles, read road signs, and recognize lane markings.

o End-to-End Learning: DL models can learn to drive the car from raw sensor
data (e.g., camera images) to control the car's steering, acceleration, and
braking, all in one step.

o Fusion of Sensor Data: DL models combine data from cameras, LiDAR, and
radar to create a 360-degree view of the environment, enabling real-time
decision-making.

• Example:
o Waymo and Tesla both use deep learning to enable their vehicles to detect and classify objects, understand the road, and make driving decisions.
o Mobileye uses DL to help vehicles avoid collisions by identifying pedestrians, cyclists, and other obstacles.

• Strength:
o DL excels at recognizing patterns in complex, unstructured data, such as images and video, with superior accuracy in real-world environments.

• Limitation:
o DL models require high computational power (often GPUs) for training and inference, which makes them resource-intensive.

Comparisons Between AI, ML, and DL

Definition:
• AI: The broad concept of machines simulating human intelligence.
• ML: A subset of AI focused on systems learning from data.
• DL: A subset of ML using multi-layered neural networks to analyze complex data.

Scope:
• AI: Broad; encompasses ML, DL, expert systems, robotics, etc.
• ML: Focused on predictive models and pattern recognition.
• DL: Specialized in processing unstructured data like images, text, and audio.

Human Involvement:
• AI: Can include rule-based programming (manual input).
• ML: Requires humans to select features and tune models.
• DL: Minimizes human input by automating feature extraction.

Computational Power:
• AI: Generally low to moderate.
• ML: Moderate, depending on the algorithm used.
• DL: High, often requiring GPUs/TPUs for training.

Performance:
• AI: Limited in handling large-scale problems autonomously.
• ML: Performs well in structured and moderately complex tasks.
• DL: Outperforms ML in complex tasks like image recognition or natural language processing.

AI Layers

Generative AI:

Generative AI is a type of AI that can create new content including text, code, images and
music. Generative AI models are trained on large datasets of existing content, learning to
identify patterns in data and using those patterns to generate new content.

Large Language Models (LLMs):

LLMs are a type of generative AI model trained on massive datasets of text and code. LLMs
can generate text, translate languages, write different kinds of creative content and answer your
questions in an informative way.

Generative Pre-trained Transformers (GPTs):


GPTs are a type of LLM that uses a transformer architecture. Transformers are a neural
network architecture well-suited for natural language processing tasks.

GPT-4 and ChatGPT:

GPT-4 and ChatGPT are two examples built on GPT models. GPT-4 is an LLM developed by OpenAI, while ChatGPT is a conversational service (also from OpenAI) built on GPT models and specifically designed for chatbot applications.

2. Types of Machine Learning Systems
Machine learning (ML) is a discipline of artificial intelligence (AI) that provides machines with
the ability to automatically learn from data and past experiences while identifying patterns to
make predictions with minimal human intervention.

Types of Machine Learning

1. Supervised learning

2. Unsupervised learning

3. Semi-supervised learning

4. Reinforcement learning

[Figure: Types of ML]

1. Supervised Machine Learning


Supervised learning is a technique in which a model is trained on a labeled dataset. Labeled datasets contain both input and output parameters, and supervised learning algorithms learn to map inputs to the correct outputs. Both the training and validation datasets are labeled.

Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed a dataset of labeled dog and cat images to the algorithm, the machine learns to distinguish dogs from cats from these labeled images. When we then input a new dog or cat image it has never seen before, it uses what it has learned to predict whether the image shows a dog or a cat. There are two main categories of supervised learning, described below:
• Classification
• Regression

Classification
Classification deals with predicting categorical target variables, which represent discrete
classes or labels. For instance, classifying emails as spam or not spam, or predicting whether a
patient has a high risk of heart disease. Classification algorithms learn to map the input features
to one of the predefined classes.
Here are some classification algorithms:
• Logistic Regression
• Support Vector Machine
• Random Forest
• Decision Tree
• K-Nearest Neighbors (KNN)
• Naive Bayes

Regression
Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms learn to
map the input features to a continuous numerical value.
Here are some regression algorithms (a combined classification/regression sketch follows this list):
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Decision tree
• Random Forest
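
As a concrete illustration of the two task types, here is a minimal sketch, assuming scikit-learn is available; the synthetic datasets (make_classification, make_regression) are placeholders for real labeled data.

```python
# Minimal sketch: one classifier and one regressor from the lists above.
# Assumes scikit-learn; the synthetic datasets are placeholders.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: predict a discrete class label.
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression().fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: predict a continuous numerical value.
Xr, yr = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```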

Advantages of Supervised Machine Learning
• Supervised learning models can achieve high accuracy because they are trained on labeled data.
• The decision-making process of supervised learning models is often interpretable.
• Pre-trained supervised models can be reused, saving time and resources compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
• It may struggle with unseen or unexpected patterns that are not present in the training data.
• It can be time-consuming and costly because it relies on labeled data.
• It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
• Image classification: Identify objects, faces, and other features in images.
• Natural language processing: Extract information from text, such as sentiment,
entities, and relationships.
• Speech recognition: Convert spoken language into text.
• Recommendation systems: Make personalized recommendations to users.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
• Medical diagnosis: Detect diseases and other medical conditions.
• Fraud detection: Identify fraudulent transactions.
• Autonomous vehicles: Recognize and respond to objects in the environment.
• Email spam detection: Classify emails as spam or not spam.
• Quality control in manufacturing: Inspect products for defects.
• Credit scoring: Assess the risk of a borrower defaulting on a loan.
• Gaming: Recognize characters, analyze player behavior, and create NPCs.
• Customer support: Automate customer support tasks.
• Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
• Sports analytics: Analyze player performance, make game predictions, and optimize
strategies.

2. Unsupervised Machine Learning

Unsupervised learning is a type of machine learning technique in which an algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning, unsupervised learning doesn’t involve providing the algorithm with labeled target outputs. The primary goal of unsupervised learning is often to discover hidden patterns, similarities, or clusters within the data, which can then be used for various purposes, such as data exploration, visualization, dimensionality reduction, and more.

Example: Consider that you have a dataset containing information about purchases made from a shop. Through clustering, the algorithm can group customers with similar purchasing behavior without predefined labels, revealing potential customer segments. This information can help businesses target customers as well as identify outliers.

There are two main categories of unsupervised learning that are mentioned below:

• Clustering
• Association

Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for labeled
examples.
Here are some clustering algorithms (a short K-Means sketch follows this list):
• K-Means Clustering algorithm
• DBSCAN Algorithm
• Hierarchical Clustering
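
As a concrete illustration, here is a minimal K-Means sketch, assuming scikit-learn and NumPy are available; the two synthetic point clouds are placeholders for real unlabeled data.

```python
# Minimal sketch: grouping unlabeled points with k-means (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic "customer" groups; no labels are given to the algorithm.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])        # cluster assignment per point
print(kmeans.cluster_centers_)   # the two discovered group centers
```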

Association
Association rule learning is a technique for discovering relationships between items in a
dataset. It identifies rules that indicate the presence of one item implies the presence of another
item with a specific probability.
Here are some association rule learning algorithms:
• Apriori Algorithm
• FP-growth Algorithm
Advantages of Unsupervised Machine Learning
• It helps to discover hidden patterns and various relationships between the data.
• Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
• It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
• Without labels, it may be difficult to assess the quality of the model’s output.
• Clusters may not be interpretable and may lack meaningful interpretations.
• Extracting meaningful features from raw data often requires additional techniques such as autoencoders and dimensionality reduction.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
• Clustering: Group similar data points into clusters.
• Anomaly detection: Identify outliers or anomalies in data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
• Recommendation systems: Suggest products, movies, or content to users based on
their historical behavior or preferences.
• Topic modeling: Discover latent topics within a collection of documents.
• Density estimation: Estimate the probability density function of data.
• Image and video compression: Reduce the amount of storage required for multimedia
content.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
• Market basket analysis: Discover associations between products.
• Genomic data analysis: Identify patterns or group genes with similar expression
profiles.

Department of Information Technology, SRKREC(A)


Artificial Intelligence and Data Science (AI & DS)

Home
• Image segmentation: Segment images into meaningful regions.
• Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
• Customer behavior analysis: Uncover patterns and insights for better marketing and
product recommendations.
• Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
• Exploratory data analysis (EDA): Explore data and gain insights before defining
specific tasks.

3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised learning: it uses both labeled and unlabeled data. It is particularly useful when obtaining labeled data is costly, time-consuming, or resource-intensive, for example when labeling requires specialist skills. A common strategy is to use unsupervised techniques to predict labels for the unlabeled portion and then feed these labels to supervised techniques. This is especially applicable to image datasets, where usually not all images are labeled.

[Figure: Semi-Supervised Learning]

Example: Consider building a language-translation model: obtaining labeled translations for every sentence pair is resource-intensive. Semi-supervised learning allows the model to learn from both labeled and unlabeled sentence pairs, making it more accurate. This technique has led to significant improvements in the quality of machine-translation services.
Types of Semi-Supervised Learning Methods
There are a number of different semi-supervised learning methods each with its own
characteristics. Some of the most common ones include:
• Graph-based semi-supervised learning: This approach uses a graph to represent the
relationships between the data points. The graph is then used to propagate labels from
the labeled data points to the unlabeled data points.
• Label propagation: This approach iteratively propagates labels from the labeled data
points to the unlabeled data points, based on the similarities between the data points.
• Co-training: This approach trains two different machine learning models on different views (feature subsets) of the labeled data. Each model then labels unlabeled examples for the other.
• Self-training: This approach trains a machine learning model on the labeled data and then uses the model to predict labels for the unlabeled data. The model is then retrained on the labeled data plus the predicted labels for the unlabeled data. (A self-training sketch follows this list.)
• Generative adversarial networks (GANs): GANs are a type of deep learning
algorithm that can be used to generate synthetic data. GANs can be used to generate
unlabeled data for semi-supervised learning by training two neural networks, a
generator and a discriminator.
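
To make the self-training idea concrete, here is a minimal sketch, assuming scikit-learn (version 0.24 or later, which provides SelfTrainingClassifier); the synthetic dataset and the choice of an SVC base estimator are illustrative assumptions.

```python
# Minimal self-training sketch (scikit-learn assumed).
# Unlabeled samples are marked with the label -1.
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
y_partial = y.copy()
y_partial[30:] = -1  # keep only the first 30 labels; the rest are "unlabeled"

base = SVC(probability=True)  # base estimator must expose predict_proba
model = SelfTrainingClassifier(base).fit(X, y_partial)
print("accuracy against all true labels:", model.score(X, y))
```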
Advantages of Semi-Supervised Machine Learning
• It can generalize better than purely supervised learning because it exploits both labeled and unlabeled data.
• Can be applied to a wide range of data.
Disadvantages of Semi-Supervised Machine Learning
• Semi-supervised methods can be more complex to implement compared to other approaches.
• It still requires some labeled data that might not always be available or easy to obtain.
• Noisy or unrepresentative unlabeled data can degrade model performance.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:

• Image Classification and Object Recognition: Improve the accuracy of models by
combining a small set of labeled images with a larger set of unlabeled images.
• Natural Language Processing (NLP): Enhance the performance of language models
and classifiers by combining a small set of labeled text data with a vast amount of
unlabeled text.
• Speech Recognition: Improve the accuracy of speech recognition by leveraging a
limited amount of transcribed speech data and a more extensive set of unlabeled audio.
• Recommendation Systems: Improve the accuracy of personalized recommendations
by supplementing a sparse set of user-item interactions (labeled data) with a wealth of
unlabeled user behavior data.
• Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a
small set of labeled medical images alongside a larger set of unlabeled images.

4. Reinforcement Machine Learning

A reinforcement learning algorithm learns by interacting with an environment: it produces actions and discovers, through trial and error (often with delayed feedback), which actions work. The model keeps improving its performance using reward feedback to learn a behavior or pattern. These algorithms are typically tailored to a specific problem, e.g., Google's self-driving car, or AlphaGo, where an agent competes with humans and even with itself to become a better and better Go player. Each interaction generates new experience that is added to the training data, so the more the agent learns, the better trained and more experienced it becomes.
Here are some of most common reinforcement learning algorithms:
• Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which maps states to actions. The Q-function estimates the expected reward of taking a particular action in a given state. (A minimal tabular sketch follows this list.)
• SARSA (State-Action-Reward-State-Action): SARSA is another model-free RL algorithm that learns a Q-function. However, unlike Q-learning, SARSA updates the Q-function for the action that was actually taken, rather than the optimal action.
• Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning.
Deep Q-learning uses a neural network to represent the Q-function, which allows it to
learn complex relationships between states and actions.
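
To make the Q-learning update concrete, here is a minimal tabular sketch. The tiny one-dimensional "corridor" environment (states 0–4, actions left/right, reward +1 at the goal) is a hypothetical setup chosen only for illustration.

```python
# Minimal tabular Q-learning sketch on a hypothetical 1-D "corridor":
# states 0..4, actions 0=left / 1=right, reward +1 for reaching state 4.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(300):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy choice; pick randomly while Q-values are still tied
        if rng.random() < epsilon or Q[s].max() == Q[s].min():
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)  # learned values should favour action 1 ("right") in every state
```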


[Figure: Reinforcement Machine Learning]


Example: Consider that you are training an AI agent to play a game like chess. The agent explores different moves and receives positive or negative feedback based on the outcome. Reinforcement learning also finds applications in robots and other agents that learn to perform tasks by interacting with their surroundings.
Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
Positive reinforcement
• Rewards the agent for taking a desired action.
• Encourages the agent to repeat the behavior.
• Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct
answer.
Negative reinforcement
• Removes an undesirable stimulus to encourage a desired behavior.
• Encourages the agent to repeat the behavior that makes the unpleasant stimulus go away.
• Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by completing a task.
Advantages of Reinforcement Machine Learning
• It supports autonomous decision-making and is well suited to tasks that require a sequence of decisions, such as robotics and game playing.
• This technique is preferred for achieving long-term results that are otherwise very difficult to achieve.
• It can solve complex problems that cannot be solved by conventional techniques.

Disadvantages of Reinforcement Machine Learning
• Training reinforcement learning agents can be computationally expensive and time-consuming.
• Reinforcement learning is not preferable for solving simple problems.
• It needs a lot of data and a lot of computation, which can make it impractical and costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
• Game Playing: RL can teach agents to play games, even complex ones.
• Robotics: RL can teach robots to perform tasks autonomously.
• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
• Recommendation Systems: RL can enhance recommendation algorithms by learning
user preferences.
• Healthcare: RL can be used to optimize treatment plans and drug discovery.
• Natural Language Processing (NLP): RL can be used in dialogue systems and
chatbots.
• Finance and Trading: RL can be used for algorithmic trading.
• Supply Chain and Inventory Management: RL can be used to optimize supply chain
operations.
• Energy Management: RL can be used to optimize energy consumption.
• Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.
• Adaptive Personal Assistants: RL can be used to improve personal assistants.
• Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create
immersive and interactive experiences.
• Industrial Control: RL can be used to optimize industrial processes.
• Education: RL can be used to create adaptive learning systems.
• Agriculture: RL can be used to optimize agricultural operations.

3. Main Challenges of Machine Learning

1. Inadequate Training Data


2. Poor Quality of Data
3. Non-Representative Training Data
4. Overfitting and Underfitting
5. Monitoring and Maintenance
6. Data Bias
7. Lack of Explainability
8. Lack of Skilled Resources
9. Process Complexity of Machine Learning
10. Slow Implementations and Results
11. Irrelevant Features
12. Getting Bad Recommendations

1. Inadequate Training Data


One of the primary challenges in machine learning is the availability of adequate training
data. Machine learning models require large amounts of high-quality data to learn effectively.
However, in many domains, obtaining such data is difficult due to factors like privacy
concerns, costs of data collection, and data sparsity.
When the training dataset is too small, models can struggle to capture meaningful patterns,
resulting in poor performance on unseen data. This problem becomes particularly pronounced
in fields like healthcare, where collecting large, diverse datasets is challenging.
Solutions:
• Data Augmentation: Techniques such as data augmentation, which artificially
increases the size of the dataset by modifying existing data, can help mitigate the
problem of limited data.
• Synthetic Data Generation: Tools like GANs (Generative Adversarial
Networks) can generate synthetic data to expand training datasets.
• Transfer Learning: Transfer learning allows models to leverage knowledge from other
related tasks, reducing the need for large amounts of data.
Addressing the challenge of inadequate training data is essential for building robust and
accurate machine learning models.
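
As a small illustration of data augmentation, the following sketch (NumPy only; the random "images" are placeholders) triples a dataset by adding flipped and noise-perturbed copies.

```python
# Minimal data-augmentation sketch for image arrays: each original image
# yields a horizontally flipped copy and a noisy copy.
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((10, 28, 28))  # placeholder "dataset" of 10 images

flipped = images[:, :, ::-1]                         # horizontal flips
noisy = images + rng.normal(0, 0.05, images.shape)   # small Gaussian noise

augmented = np.concatenate([images, flipped, noisy])
print(augmented.shape)  # (30, 28, 28): three times the original data
```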


2. Poor Quality of Data


The quality of data directly impacts the performance of machine learning models. Poor-quality
data, which may be incomplete, noisy, or inconsistent, can lead to inaccurate predictions and
flawed outcomes. Data preprocessing is a crucial step to ensure that data is clean and ready
for analysis.
Common Issues in Data Quality:
• Missing Values: Gaps in data can cause models to make incorrect predictions.
• Outliers: Extreme values can skew the model’s understanding of normal behavior.
• Noisy Data: Unreliable or incorrect data points can reduce the accuracy of the model.
Best Practices for Data Quality:
• Data Cleaning: Techniques like imputation (filling missing values) and outlier
detection are essential for improving data quality.
• Normalization and Scaling: Ensuring that data is on a consistent scale can improve
the model’s ability to learn patterns.
• Feature Engineering: Creating new features from existing data can provide the model
with more meaningful information.
Ensuring high-quality data through proper preprocessing steps is key to improving model performance; a small preprocessing sketch follows.
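
Here is a minimal preprocessing sketch, assuming scikit-learn; it chains imputation and scaling in one pipeline on a toy array with missing values.

```python
# Minimal preprocessing sketch: fill missing values, then scale features.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

preprocess = make_pipeline(
    SimpleImputer(strategy="mean"),  # imputation: fill gaps with column means
    StandardScaler(),                # scaling: zero mean, unit variance
)
print(preprocess.fit_transform(X))
```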

3. Non-Representative Training Data


Non-representative training data occurs when the training dataset does not accurately reflect
the real-world distribution of data. This can result in models that perform well on the training
data but fail to generalize to new, unseen data.
Consequences:
• Poor Generalization: Models trained on biased or unrepresentative data may perform
well in controlled environments but poorly in real-world applications.
• Bias in Predictions: If the training data is not representative, the model’s predictions
will be biased toward certain outcomes, potentially leading to unfair or inaccurate
results.
Solutions:
• Data Sampling: Use stratified sampling techniques to ensure the training dataset
accurately reflects the distribution of the target population.

• Cross-Validation: Employ cross-validation methods to test the model’s generalization
capabilities across different subsets of the data.
Addressing non-representative data is essential for ensuring that models can make accurate predictions in real-world scenarios; a stratified-split sketch follows.
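
Here is a minimal stratified-split sketch, assuming scikit-learn; the imbalanced synthetic dataset is a placeholder, and stratify=y keeps the class ratio identical in both splits.

```python
# Minimal sketch: a stratified split preserves class proportions of an
# imbalanced dataset in both the train and test sets (scikit-learn assumed).
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(Counter(y_train), Counter(y_test))  # ~90/10 ratio preserved in both
```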

4. Overfitting and Underfitting


Overfitting occurs when a machine learning model becomes too complex and fits the noise in
the training data rather than the underlying patterns. This results in poor generalization to new
data. Underfitting, on the other hand, occurs when a model is too simple to capture the
underlying patterns in the data.
Causes:
• Overfitting: Caused by models with too many parameters or when there is insufficient
regularization.
• Underfitting: Occurs when the model is too simple or lacks the capacity to capture
complex patterns.
Strategies to Address Overfitting and Underfitting:
• Cross-Validation: Regularly test models on unseen data during training to prevent
overfitting.
• Regularization Techniques: Methods like L1 and L2 regularization can prevent the
model from becoming too complex.
• Early Stopping: Stop the training process when the model’s performance on a
validation set starts to degrade, preventing overfitting.
Balancing model complexity is essential to avoid both overfitting and underfitting, ensuring optimal model performance. A small regularization sketch follows.
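
Here is a minimal regularization sketch, assuming scikit-learn; alpha is the penalty strength, and the synthetic regression problem (only 5 of 20 features informative) is illustrative.

```python
# Minimal sketch of L2 (Ridge) and L1 (Lasso) regularization; the penalty
# limits model complexity and thereby helps prevent overfitting.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives many coefficients to zero
print("non-zero Lasso coefficients:", (lasso.coef_ != 0).sum(), "of 20")
```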

5. Monitoring and Maintenance


Once a machine learning model is deployed, continuous monitoring is essential to ensure that
it remains accurate and relevant. As the data landscape changes, models may begin to drift from
their original performance levels.
Challenges:
• Model Drift: Over time, changes in the data distribution can lead to model performance
degradation, a phenomenon known as model drift.
• Retraining Needs: Models require periodic updates and retraining to ensure they
continue to deliver accurate predictions as new data becomes available.

Solutions:
• Automated Monitoring: Implement monitoring systems to detect when a model’s
performance starts to decline.
• Scheduled Retraining: Regularly retrain models using new data to keep them up
to date.
Effective monitoring and maintenance strategies are critical for ensuring that machine learning
models remain accurate over time.
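
As a sketch of the monitoring idea (illustrative only; the function name and threshold are assumptions, not a standard API), a deployed system might compare recent accuracy against a deployment-time baseline:

```python
# Illustrative sketch: flag the model for retraining when its rolling
# accuracy drops well below the accuracy measured at deployment time.
def needs_retraining(recent_accuracies, baseline_accuracy, tolerance=0.05):
    """Return True if average recent accuracy fell below baseline - tolerance."""
    rolling = sum(recent_accuracies) / len(recent_accuracies)
    return rolling < baseline_accuracy - tolerance

print(needs_retraining([0.78, 0.75, 0.74], baseline_accuracy=0.85))  # True
```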

6. Data Bias
Data bias occurs when the training data used to build a model is not representative of the
broader population, leading to biased predictions. This can result in models that discriminate
against certain groups or fail to generalize to all users.
Examples:
• Gender Bias in Hiring Models: Algorithms trained on biased hiring data may favor
one gender over another, perpetuating inequalities.
• Facial Recognition: Systems trained predominantly on lighter-skinned individuals
often fail to accurately identify people with darker skin tones.
Detecting and Reducing Bias:
• Bias Detection Tools: Tools like IBM AI Fairness 360 can help identify and reduce
bias in machine learning models.
• Diverse Training Data: Ensuring that the training dataset includes diverse examples
can help mitigate bias.
Addressing data bias is critical for building fair and equitable machine learning models,
especially in industries like healthcare, finance, and criminal justice.

7. Lack of Explainability
Many machine learning models, especially deep learning models, are often described as
“black boxes” due to the difficulty in understanding how they make decisions. This lack of
explainability presents challenges in industries where transparency is crucial, such
as healthcare and finance.
Consequences:
• Regulatory Compliance: In some industries, regulations require that models provide
clear explanations for their decisions. Lack of explainability can hinder the adoption of
machine learning in these fields.

• Trust: Without understanding how a model arrives at a decision, stakeholders may be
reluctant to trust its predictions.
Methods to Improve Explainability:
• LIME (Local Interpretable Model-agnostic Explanations): LIME explains
individual predictions by approximating the model locally.
• SHAP (SHapley Additive exPlanations): SHAP values provide insights into how each
feature contributes to a prediction.
Improving explainability is essential for increasing trust in machine learning models and
ensuring compliance with industry regulations.

8. Lack of Skilled Resources


The demand for skilled machine learning professionals far exceeds the available supply,
creating a skills gap that slows the adoption of machine learning technologies.
Impact:
• Delayed Adoption: Organizations may struggle to implement machine learning
solutions due to a lack of qualified personnel.
• Increased Costs: The scarcity of skilled professionals drives up salaries, making it
costly for organizations to hire and retain talent.
Solutions:
• Education and Training: Companies can invest in training programs and partnerships
with universities to upskill their current workforce.
• Collaborations: Partnering with data science institutes and offering internships can
help build a pipeline of talent.
Closing the skills gap is crucial for accelerating the adoption of machine learning technologies
across industries.

9. Process Complexity of Machine Learning


The development and deployment of machine learning models can be complex, requiring
expertise in data preprocessing, model selection, and hyperparameter tuning. Scaling these
processes for larger datasets or diverse use cases adds to the challenge.
Challenges:
• Data Preparation: Preprocessing large, complex datasets requires significant time and
effort.

• Model Scaling: Adapting models to handle larger datasets or real-time applications can
be difficult.
Solutions:
• Automated Machine Learning (AutoML): AutoML platforms automate many of the
tasks involved in building machine learning models, reducing the complexity of the
process.
• Pipeline Automation: Automating data pipelines can streamline the process of moving
from data collection to model deployment.
Simplifying the machine learning workflow through automation tools can help overcome the
complexity of the process.

10. Slow Implementations and Results


Implementing machine learning models and obtaining actionable results can be a slow process,
particularly for complex algorithms or large datasets.
Causes:
• Data Processing Delays: Preprocessing large datasets can take significant time.
• Complexity of Algorithms: Models like deep learning often require large amounts of
computational resources, leading to delays.
Solutions:
• Parallel Computing: Using distributed computing frameworks like Apache
Spark can speed up data processing and model training.
• Simplified Models: In some cases, simpler models can deliver faster results without
sacrificing accuracy.
Streamlining the model-building process and optimizing algorithms for efficiency can help
reduce the time it takes to implement machine learning solutions.

11. Irrelevant Features


Irrelevant or redundant features in the training data can negatively impact model performance.
These features add noise, increase computational costs, and may lead to overfitting.
Solutions:
• Feature Selection: Techniques like Principal Component Analysis (PCA) and Lasso
regression help reduce the number of features by selecting the most relevant ones.
• Domain Knowledge: Leveraging domain expertise can help identify which features
are likely to be relevant and which can be discarded.

Reducing irrelevant features improves model accuracy and efficiency, leading to better results
and lower computational costs.
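
Here is a minimal feature-selection sketch, assuming scikit-learn; SelectFromModel keeps only the features whose Lasso coefficients remain non-zero, and the synthetic dataset (5 informative features out of 30) is illustrative.

```python
# Minimal feature-selection sketch: Lasso zeroes out coefficients of
# irrelevant features, and SelectFromModel keeps the rest.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print("features kept:", selector.get_support().sum(), "of 30")
```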

12. Getting Bad Recommendations


Recommendation systems are widely used in platforms like e-commerce and streaming
services. However, these systems can provide bad recommendations due to data
inaccuracies, user behavior changes, or poorly designed algorithms.
Consequences:
• User Dissatisfaction: Poor recommendations can lead to a negative user experience,
reducing engagement and customer retention.
• Loss of Revenue: Inaccurate recommendations can impact business outcomes by
driving users away from the platform.
Solutions:
• Collaborative Filtering: Collaborative filtering techniques analyze user behavior to
provide more personalized recommendations.
• Reinforcement Learning: Reinforcement learning allows recommendation systems to
adapt and improve over time by learning from user feedback.
Improving recommendation systems with advanced algorithms can enhance user experience
and drive better business outcomes.

4. Statistical Learning: Introduction, Supervised & Unsupervised Learning
Introduction to Statistical Learning: Statistical learning is a branch of machine learning and
statistics that focuses on understanding and analyzing data. It involves developing models to
predict or explain a set of outcomes based on input data. The main objective of statistical
learning is to uncover patterns, make predictions, or infer relationships within data using
mathematical and computational techniques.
Statistical learning includes both supervised and unsupervised learning, which differ in how
the data is used to train models.

Importance of Statistics in Supervised Learning:


Data Analysis: Before applying supervised learning algorithms, it's essential to understand the characteristics of the data. Descriptive statistics help summarize these characteristics.
Feature Selection: Statistical techniques can help identify relevant features (variables) for
predictive modelling. Feature selection ensures that only meaningful and informative variables
are used, which can lead to more accurate models and reduce the risk of overfitting.
Model Building: Statistics underpin the algorithmic approaches used in supervised learning,
such as linear regression, decision trees, and support vector machines. These algorithms
leverage statistical principles to learn relationships between input features and output labels.
Model Evaluation: Metrics like accuracy, precision, recall, F1-score, and confusion
matrix are essential for evaluating the performance of classification models. For regression
models, statistics such as Mean Squared Error (MSE) and R-squared provide insights into how
well the model fits the data.
Cross-Validation: Statistical techniques like k-fold cross-validation are used to assess a
model's generalization performance. This involves splitting the data into subsets for training
and validation, providing a robust estimation of model performance.
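
Here is a minimal k-fold cross-validation sketch, assuming scikit-learn; the synthetic dataset and logistic-regression model are placeholders.

```python
# Minimal k-fold cross-validation sketch: the data is split into k=5 folds,
# and each fold serves once as the validation set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```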

Importance of Statistics in Unsupervised Learning:


Clustering: Unsupervised learning techniques like clustering aim to group similar data points
together. Statistical methods, such as distance metrics and clustering validity indices, help
assess the quality of clustering results.
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) use
statistical concepts to transform high-dimensional data into lower-dimensional representations
while preserving its variability.
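
Here is a minimal PCA sketch, assuming scikit-learn; it projects placeholder 10-dimensional data onto the two directions of highest variance.

```python
# Minimal PCA sketch: reduce 10-dimensional data to 2 dimensions while
# keeping as much variance as possible.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=200, n_features=10, random_state=0)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape, pca.explained_variance_ratio_)  # variance kept per component
```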

Data Exploration: Descriptive statistics and data visualization are crucial for understanding
the underlying structure of the data in unsupervised learning. This exploration helps identify
patterns and outliers.
Feature Engineering: In unsupervised learning, feature engineering involves creating new
features or representations from the original data. Statistical techniques can be used to create
meaningful features that capture important information.
Anomaly Detection: Detecting anomalies in data involves comparing data points
to statistical distributions or defining threshold values. Deviations from the expected statistical
patterns can indicate anomalies.

Different statistics we encounter while working with machine learning
When working with machine learning, various risk statistics need to be considered to assess
the performance, reliability, and potential pitfalls of our models. These risk statistics provide
insights into different aspects of model behavior and help in making informed decisions. In
statistical learning, loss refers to the error between the predicted values of a model and the true
values. It is used to evaluate how well a model fits the data.
1. Training Loss:
• The error calculated on the training dataset, which the model uses to learn.
• Goal: Minimize the training loss during model training.
• Observation: A low training loss indicates the model has learned the patterns in the
training data well.
2. Test Loss:
• The error calculated on the test dataset, which is unseen by the model during
training.
• Goal: Evaluate the generalization ability of the model on new, unseen data.
• Observation: A low test loss means the model generalizes well and can make
accurate predictions on unseen data.
Key Concept:
• Overfitting: Overfitting occurs when a model learns too much from the
training data, including noise, outliers, or irrelevant patterns, making it overly
complex. As a result, the model performs very well on the training data but fails
to generalize to new, unseen data (test data).

• Underfitting: Underfitting happens when a model is too simple to capture the
underlying patterns in the data. It fails to learn enough from the training data
and cannot make accurate predictions on either the training or test data.
3. Accuracy:
• Definition: The ratio of correctly predicted instances to the total number of instances in
classification tasks.
• Importance: Provides an overall measure of classification performance.
• Considerations: Can be misleading when classes are imbalanced; may not account for
varying costs of misclassifications.
4. Precision and Recall:
• Precision: The ratio of true positive predictions to the total positive predictions made
by the model.
• Recall: The ratio of true positive predictions to the total actual positives in the dataset.
• Importance: Important for imbalanced classes; precision focuses on the accuracy of
positive predictions, while recall focuses on the ability to find all positives.
5. F1-Score:
• Definition: The harmonic mean of precision and recall.
• Importance: Provides a balance between precision and recall.
• Interpretation: Useful when there's a trade-off between false positives and false
negatives.
6. Confusion Matrix:
• Definition: Provides detailed insights into classification performance.
• Applications: Used to calculate various classification metrics like accuracy, precision,
recall, and F1-score.
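
Here is a minimal sketch computing the classification metrics above, assuming scikit-learn; the tiny label vectors are made-up examples.

```python
# Minimal sketch of the classification metrics discussed above.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```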
7. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE):
• MSE: The average of the squared differences between predicted and actual values in
regression tasks.
• RMSE: The square root of MSE.
• Importance: Measure the accuracy of regression models.
• Interpretation: Lower values indicate better fit; sensitive to outliers.
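
Here is a minimal worked example of MSE and RMSE, using NumPy only; the three predictions are made-up numbers.

```python
# Minimal sketch: computing MSE and RMSE by hand.
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean of squared differences
rmse = np.sqrt(mse)                    # same units as the target
print(mse, rmse)                       # 0.5, ~0.707
```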

8. Bias and Variance:

• Bias: The difference between the model's predictions and the true values; high bias leads to underfitting.
• Variance: The model's sensitivity to small changes in the training data; high variance
leads to overfitting.
• Bias-Variance Trade-off: Balancing bias and variance is crucial for optimal model
performance.
9. Cross-Validation Results:
• K-Fold Cross-Validation: Assessing model performance across different data splits to
ensure generalization.
• Importance: Helps detect overfitting, evaluate model stability, and make informed
model choices.
10. Learning Curves:
• Definition: Plots of training and testing loss (or other metrics) against the number of
training examples.
• Importance: Visualizes how the model's performance changes with data size; helps
identify underfitting and overfitting.
Ways to Prevent Overfitting
Although overfitting is an error that reduces a model's performance, it can be prevented in several ways. Using a linear model can help avoid overfitting, but many real-world problems are non-linear, so it is important to guard against overfitting in more flexible models too. Below are several ways to prevent overfitting (an early-stopping sketch follows this list):
• Early Stopping
• Train with more data
• Feature Selection
• Cross-Validation
• Data Augmentation
• Regularization
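
Here is a minimal early-stopping sketch, assuming TensorFlow 2/Keras (as introduced in Unit-5); the random data, network shape, and patience value are illustrative assumptions.

```python
# Minimal early-stopping sketch with Keras: training halts once validation
# loss stops improving, which guards against overfitting.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.random((200, 10)).astype("float32")     # placeholder features
y = rng.integers(0, 2, 200).astype("float32")   # placeholder labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop when validation loss has not improved for 3 consecutive epochs.
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                        restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[stop], verbose=0)
```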
Techniques to Reduce Underfitting
• Increase model complexity.
• Increase the number of features by performing feature engineering.
• Remove noise from the data.
• Increase the number of epochs or the duration of training to get better results.


Supervised vs. Unsupervised Learning

Definition:
• Supervised: Learning from labeled data to predict or classify outcomes.
• Unsupervised: Learning from unlabeled data to find patterns or structures.

Data Requirement:
• Supervised: Requires labeled data (input + corresponding output).
• Unsupervised: Uses only unlabeled data (input features).

Goal:
• Supervised: Map inputs to known outputs for prediction or classification.
• Unsupervised: Discover hidden patterns, clusters, or structures in the data.

Examples:
• Supervised: Predicting house prices (Regression); spam detection (Classification).
• Unsupervised: Customer segmentation (Clustering); dimensionality reduction (e.g., PCA).

Algorithms:
• Supervised: Linear regression, logistic regression, random forests, neural networks.
• Unsupervised: K-means clustering, hierarchical clustering, PCA, autoencoders.

Evaluation:
• Supervised: Measured using metrics like accuracy, precision, recall, etc.
• Unsupervised: Evaluated using metrics like cohesion (e.g., silhouette score) or explained variance.

Applications:
• Supervised: Fraud detection, medical diagnosis, stock price prediction.
• Unsupervised: Market segmentation, anomaly detection, data visualization.

Strength:
• Supervised: Highly accurate for prediction tasks with sufficient labeled data.
• Unsupervised: Effective for exploring and grouping data without labels.

Limitations:
• Supervised: Requires a large amount of labeled data, which can be expensive to obtain.
• Unsupervised: Results can be less interpretable and harder to validate.

5. Training and Test Loss, Tradeoffs in Statistical Learning
In statistical learning, loss refers to the error between the predicted values of a model and the
true values. It is used to evaluate how well a model fits the data.
Training Loss:
• The error calculated on the training dataset, which the model uses to learn.
• Goal: Minimize the training loss during model training.
• Observation: A low training loss indicates the model has learned the patterns in the
training data well.
Test Loss:
• The error calculated on the test dataset, which is unseen by the model during
training.
• Goal: Evaluate the generalization ability of the model on new, unseen data.
• Observation: A low test loss means the model generalizes well and can make
accurate predictions on unseen data.
Key Concept:
• Overfitting: Overfitting occurs when a model learns too much from the training data,
including noise, outliers, or irrelevant patterns, making it overly complex. As a result,
the model performs very well on the training data but fails to generalize to new, unseen
data (test data).
• Underfitting: Underfitting happens when a model is too simple to capture the
underlying patterns in the data. It fails to learn enough from the training data and cannot
make accurate predictions on either the training or test data.
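
To see the overfitting pattern numerically, here is a minimal sketch, assuming scikit-learn and NumPy; the sine-plus-noise data and the polynomial degrees (1, 4, 15) are illustrative. The degree-15 model drives training loss down while test loss rises.

```python
# Minimal sketch: training vs. test loss as model complexity increases.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 40)
X_train, X_test, y_train, y_test = X[:30], X[30:], y[:30], y[30:]

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_loss = mean_squared_error(y_train, model.predict(X_train))
    test_loss = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: train MSE {train_loss:.3f}, test MSE {test_loss:.3f}")
```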

Bias and Variance in Machine Learning

If a machine learning model is not accurate, it can make prediction errors, and these prediction errors are usually known as bias and variance. In machine learning, these errors will always be present, as there is always a slight difference between the model's predictions and the actual values. The main aim of ML/data science analysts is to reduce these errors in order to get more accurate results.


In machine learning, an error is a measure of how accurately an algorithm can make predictions
for the previously unknown dataset. There are mainly two types of errors in machine learning,
which are:
• Reducible errors: These errors can be reduced to improve the model accuracy. Such
errors can further be classified into bias and Variance.

• Irreducible errors: These errors will always be present in the model regardless of which algorithm is used. They are caused by unknown variables whose effect on the output cannot be eliminated.

Bias:
While making predictions, a difference occurs between the values predicted by the model and the actual (expected) values; this difference is known as bias error, or error due to bias.
Low Bias: The model makes fewer assumptions and is flexible enough to capture complex
patterns in the data.
High Bias: The model makes overly simplistic assumptions about the data and fails to
capture its complexity.

Aspect: High Bias vs. Low Bias

Model Simplicity
  High Bias: Very simple (e.g., linear regression for non-linear data).
  Low Bias: Flexible and complex (e.g., deep neural networks).

Fit on Training Data
  High Bias: Poor (underfits the data).
  Low Bias: Excellent (fits the training data well).

Training Error
  High Bias: High.
  Low Bias: Low.

Test Error
  High Bias: High, due to underfitting.
  Low Bias: Depends on variance (may be low or high).
How to Reduce Bias:
1. Increase Model Complexity:
• Use models capable of capturing more complex relationships (e.g., decision
trees, neural networks).
2. Add Features:
• Add more relevant input features to provide the model with additional
information.
3. Reduce Regularization:
• Regularization prevents overfitting but can lead to underfitting if too strong.
4. Try Non-Linear Models:
• Use algorithms like polynomial regression, kernel SVMs, or ensemble methods
to handle non-linear data.

Variance
Variance refers to the error caused by the model’s sensitivity to small fluctuations in the training
data. It indicates how much the model’s predictions change if the training data changes.
Low Variance
A model with low variance is stable, meaning its predictions don't change drastically when
trained on different subsets of the data. It captures the general patterns in the data without being
overly influenced by small variations or noise.
High Variance
A model with high variance is highly sensitive to small changes in the training data. It
"memorizes" the details of the training data, including noise, which may not be present in new,
unseen data. This can lead to overfitting, where the model performs very well on the training
set but poorly on the test set.

Aspect: Low Variance vs. High Variance

Model Sensitivity
  Low Variance: Not sensitive to small fluctuations in training data.
  High Variance: Sensitive to small changes in training data.

Fit on Training Data
  Low Variance: Fit is more general and does not capture noise.
  High Variance: Fit is too specific to the training data, possibly capturing noise.

Training Error
  Low Variance: Low, but might be due to underfitting.
  High Variance: Low (because the model fits the training data very well).

Test Error
  Low Variance: Low, as the model generalizes well to new data.
  High Variance: High, as the model fails to generalize (overfitting).

Model Complexity
  Low Variance: Simple models; may miss some details of the data.
  High Variance: Complex models; can capture intricate patterns.

Examples
  Low Variance: Linear regression, shallow decision trees, logistic regression.
  High Variance: Deep decision trees, deep neural networks, k-NN with low k.

How to Reduce Variance:

• Early Stopping
• Train with more data
• Feature Selection
• Cross-Validation
• Data Augmentation
• Regularization
Different Combinations of Bias-Variance
There are four possible combinations of bias and variance:

1. Low-Bias, Low-Variance:
The combination of low bias and low variance represents an ideal machine learning model. However, it is practically very difficult to achieve.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters, which leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses too few parameters. It leads to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate on average.

How to Identify High Variance or High Bias?

High variance can be identified if the model has:
• Low training error and high test error.
High bias can be identified if the model has:
• High training error, with the test error almost equal to the training error.

Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very simple
with fewer parameters, it may have low variance and high bias. Whereas, if the model has a
large number of parameters, it will have high variance and low bias. So, it is required to make
a balance between bias and variance errors, and this balance between the bias error and variance
error is known as the Bias-Variance trade-off.

For accurate predictions, an algorithm needs both low variance and low bias. But this is generally not achievable, because bias and variance are related to each other:
o If we decrease the variance, the bias will increase.
o If we decrease the bias, the variance will increase.
The bias-variance trade-off is a central issue in supervised learning. Ideally, we want a model that accurately captures the regularities in the training data and simultaneously generalizes well to unseen data. Unfortunately, it is typically impossible to do both at once: a high-variance algorithm may perform well on the training data but tends to overfit noisy data, whereas a high-bias algorithm produces a much simpler model that may not capture the important regularities in the data. So, we need to find a sweet spot between bias and variance to build an optimal model; the bias-variance trade-off is about finding this balance between the two sources of error.

6. Estimating Risk Statistics, Sampling distribution of an estimator,
Empirical Risk Minimization.

1. Estimating Risk Statistics

Risk statistics in machine learning involve assessing the performance of a model based on a
defined loss function. The risk is a measure of how far off a model's predictions are from the
true target values. The goal is to minimize this risk to ensure the model generalizes well to
unseen data.

Below are a few risks associated with Machine Learning:

1. Poor Data
2. Overfitting
3. Biased Data
4. Lack of strategy and experience
5. Security Risks
6. Data privacy and confidentiality
7. Third-party risks
8. Regulatory challenges

Key Definitions in Risk Estimation


1. Loss Function L(f(X),Y):
• A function that quantifies the error or loss between the model's prediction f(X)
and the actual target Y.
• Common loss functions:
▪ Mean Squared Error (MSE): For regression tasks.
▪ Cross-Entropy Loss: For classification tasks.
▪ Hinge Loss: Used in Support Vector Machines (SVM).
2. True Risk (Expected Risk):
• True risk is the expected value of the loss over the true data distribution P(X,Y):
R(f) = E_(X,Y)∼P [L(f(X), Y)] = ∫ L(f(X), Y) dP(X, Y)
• R(f): True risk (unobservable).
• P(X,Y): Joint probability distribution of input X and output Y.
• True risk is the ideal quantity we want to minimize.

3. Empirical Risk:
• Since the true distribution P(X,Y) is unknown, we approximate the true risk using a finite dataset D = {(X1,Y1), …, (Xn,Yn)}.
• Empirical risk is the average loss over the sample data:
R^(f) = (1/n) Σᵢ₌₁ⁿ L(f(Xi), Yi)
• R^(f): Empirical risk.
• n: Number of samples.
• Empirical risk serves as an estimate for the true risk.

Why Do We Estimate Risk?


• To evaluate how well a model performs on training or unseen data.
• To compare models and select the one with the best generalization ability.
• To guide optimization: Minimize the risk to train better models.
Generalization Error: The difference between the true risk and the empirical risk is called the generalization error: Generalization Error = R(f) − R^(f)

2. Sampling distribution of an estimator, Empirical Risk Minimization.

An estimator is a rule or formula for estimating a parameter of the population from sample
data. For example:
• Sample mean estimates the population mean.
• Model f estimates the true function f∗.
The sampling distribution of an estimator describes how the values of an estimator (e.g.,
sample mean) vary across different random samples.
Key Concepts
1. Estimator θ^:
• A function of the sample data used to estimate an unknown population
parameter θ.
2. Sampling Distribution:
• The distribution of an estimator's values when computed on different random
samples of size n from the same population.
3. Example: Sample Mean
• If X1, X2, …, Xn are samples from a population with mean μ and variance σ²:

• The sample mean is given by:
X̄ = (X1 + X2 + … + Xn)/n = (1/n) Σᵢ₌₁ⁿ Xi
• The sampling distribution of X̄ is approximately normal for large n (by the Central Limit Theorem):
X̄ ∼ N(μ, σ²/n)
• Mean of the sampling distribution = μ.
• Variance of the sampling distribution = σ²/n.
4. Standard Error (SE):
• The standard deviation of the sampling distribution:
SE(X̄) = σ/√n
In statistics, the sampling distribution is the probability distribution of a given statistic estimated on the basis of a random sample. It provides the foundation for statistical inference. An estimator is the mathematical rule used to calculate a sample statistic; an estimate is the result of the estimation.
The sampling distribution of an estimator depends on the sample size, so the effect of changing the sample size has to be determined. An estimate has a single numerical value, and hence such estimates are called point estimates. There are various estimators, such as the sample mean, sample standard deviation, proportion, variance, and range.
Sampling distribution of the mean: It is the distribution of sample means computed from samples drawn from the population. For all sample sizes, it is likely to be normal if the population distribution is normal. The population mean is equal to the mean of the sampling distribution of the mean. The sampling distribution of the mean has the following standard deviation:
σM = σ/√n
where σM is the standard deviation of the sampling distribution of the mean, σ is the population standard deviation, and n is the sample size.
As the size of the sample increases, the spread of the sampling distribution of the mean
decreases. But the mean of the distribution remains the same and it is not affected by the sample
size.

The standard deviation of the sampling distribution of the standard deviation is the standard error of the standard deviation. For samples from a normal population, it is approximately:
σS ≈ σ/√(2n)
Here, σS is the standard error of the standard deviation. The sampling distribution of the standard deviation is positively skewed for small n, but it becomes approximately normal for sample sizes greater than 30.

3. Empirical Risk Minimization (ERM)


Empirical Risk Minimization (ERM) is a fundamental principle in machine learning. It
involves selecting a model that minimizes the empirical risk over a given training dataset.
The Empirical Risk Minimization (ERM) principle is a learning paradigm that consists of selecting the model with the minimal average error over the training set. This so-called training error can be seen as an estimate of the risk (due to the law of large numbers), hence the alternative name of empirical risk.
By minimizing the empirical risk, we hope to obtain a model with a low value of the true risk. The larger the training set, the closer the empirical risk is to the true risk.
If we were to apply the ERM principle without more care, the model would simply memorize the training data ("learn by heart"), which we know is bad. This issue is more generally related to the overfitting phenomenon, which can be avoided by restricting the space of possible models when searching for the one with minimal error. The most severe and yet common restriction is encountered in the contexts of linear classification or linear regression. Another approach consists of controlling the complexity of the model by regularization.
While building a machine learning model, we choose a function that reduces the difference between the actual and the predicted output, i.e., the empirical risk. We minimize the empirical risk as a proxy for minimizing the true risk, hoping that the empirical risk is close to the true risk.
Empirical risk minimization depends on four factors:
• The size of the dataset - the more data we get, the more the empirical risk approaches
the true risk.
• The complexity of the true distribution - if the underlying distribution is too complex,
we might need more data to get a good approximation of it.
• The class of functions we consider - if the function class is too large or too complex, the gap between the empirical risk and the true risk (the estimation error) can be very high.

• The loss function - it can cause trouble if it yields extremely high loss values under certain conditions.
L2 regularization is an example of (regularized) empirical risk minimization.
L2 Regularization
To handle the problem of overfitting, we use regularization techniques. A regression problem using L2 regularization is also known as ridge regression. In ridge regression, insignificant predictors are penalized. This method shrinks the coefficients to deal with highly correlated independent variables. Ridge regression adds the "squared magnitude" of the coefficients, i.e., the sum of the squares of the weights of all features, as a penalty term to the loss function:

Loss = Σᵢ (yi − ŷi)² + λ Σⱼ wj²

Here, λ is the regularization parameter.
