Unit-1-MLF-1

The document provides an overview of Machine Learning fundamentals, including key concepts such as supervised and unsupervised learning, model selection, and the importance of machine learning in various industries. It discusses essential terminology like data sets, features, and labels, as well as techniques for handling categorical data such as one-hot encoding. Additionally, it covers challenges like underfitting and overfitting, the bias-variance trade-off, and the model selection process.

Uploaded by VANSHIKA GADA
© All Rights Reserved

Machine Learning

Fundamentals
B.Tech (CSBS), B.Tech (CSDS-311)
V Sem

Dr. Shubha Puthran


Mumbai Campus
Machine Learning Fundamentals:
Terminology, Supervised and Unsupervised Learning, Underfitting,
Overfitting, Bias, Variance, Trade-off, Model Selection, Applications
Introduction
• Machine learning is a subset of artificial
intelligence that focuses on developing algorithms
and models that enable computers to learn and
make predictions or decisions without being
explicitly programmed
• Study of algorithms and statistical models that
allow computer systems to improve their
performance on a specific task through iterative
learning from data
Why is machine learning necessary?

• Learning is a hallmark of intelligence; many would argue that a
system that cannot learn is not intelligent.
• Without learning, everything is new; a system that cannot learn
is not efficient because it rederives each solution and
repeatedly makes the same mistakes.
Different Varieties of Machine Learning
• Concept Learning
• Clustering Algorithms
• Genetic Algorithms
• Reinforcement Learning
• Case-based Learning
• Discovery Systems
• Knowledge capture
Importance of machine learning in today's world
•Machine learning has revolutionized various industries and
sectors, including healthcare, finance, e-commerce, transportation,
and more.

•It enables us to extract valuable insights, for example predicting
customer behavior, detecting fraud, diagnosing diseases,
autonomous driving, and personalizing user experiences.

•In an era of data explosion, machine learning provides the tools to
harness the power of data and uncover hidden patterns, making it
a crucial technology for businesses and organizations to stay
competitive and innovative.

•Machine learning is transforming the way we live, work, and
interact with technology, and its impact will only continue to
grow in the future.
Key Terminology
Data Set
• A data set refers to a collection of individual
data points or examples used for training,
testing, and evaluation in machine learning
• It can include various types of data, such as
numerical, categorical, textual, or image data,
depending on the specific problem being
addressed
Feature:
• In machine learning, a feature refers to an individual measurable
property or characteristic of a data point or object
• Features are used as input variables or dimensions in machine
learning algorithms to make predictions or classifications
• Examples of features could include numerical values like age or
income, categorical variables like gender or location, or textual data
like a product description
Label:
• A label, also known as a target or output variable, is the value or
class that we want to predict or classify in a supervised learning
problem.
• In a labeled data set, each data point is associated with a
corresponding label that represents the ground truth or desired
outcome.
• For example, in a spam email classification task, the label can be
binary, indicating whether the email is spam (1) or not spam (0).
Titanic dataset Features and Label
One hot encoding
• These datasets often contain both categorical and numerical
columns. However, many machine learning models cannot work with
categorical data directly, so it must be converted into numerical
data before it can be fitted to the model.
• For example, suppose a dataset has a Gender column with
categorical values like Male and Female. These labels have no
specific order of preference, but because they are strings, a
machine learning model may misinterpret them as having some sort
of hierarchy.
One Hot Encoding
• One hot encoding is a technique that we use to represent categorical
variables as numerical values in a machine learning model.
The advantages of using one hot encoding include:
1.It allows the use of categorical variables in models that require numerical
input.
2.It can improve model performance by providing more information to the
model about the categorical variable.
3.It helps avoid spurious ordinality: encoding categories as integers
implies an ordering (e.g. "small" = 1, "medium" = 2, "large" = 3) that
the model may exploit even when no such order is intended.
• The disadvantages of using one hot encoding include:
1.It can lead to increased dimensionality, as a separate column is created for
each category in the variable. This can make the model more complex and
slow to train.
2.It can lead to sparse data, as most observations will have a value of 0 in
most of the one-hot encoded columns.
3.It can lead to overfitting, especially if there are many categories in the
variable and the sample size is relatively small.
In summary, one-hot encoding is a powerful technique for treating
categorical data, but it can lead to increased dimensionality, sparsity,
and overfitting. It is important to use it cautiously and to consider
other methods such as ordinal encoding or binary encoding.
One hot Encoding

Fruit     Categorical value of fruit    Price
apple     1                             5
mango     2                             10
apple     1                             15
orange    3                             20
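The encoding in the table above can be sketched in code. This is a minimal illustration assuming pandas is available; `get_dummies` replaces the Fruit column with one binary column per category:

```python
import pandas as pd

# The fruit table from the slide above
df = pd.DataFrame({
    "Fruit": ["apple", "mango", "apple", "orange"],
    "Price": [5, 10, 15, 20],
})

# One column per category; 1 marks the category present in that row
encoded = pd.get_dummies(df, columns=["Fruit"], dtype=int)
print(encoded)
```

Note how each row now has exactly one 1 among the Fruit_* columns, so no artificial ordering between apple, mango, and orange is introduced.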
Supervised learning and its characteristics:
• Supervised learning is a type of machine learning
where the model is trained on labeled examples,
consisting of input-output pairs
• The goal of supervised learning is to learn a
mapping or relationship between the input
features and their corresponding output labels
• During the training phase, the model learns from
the labeled data by adjusting its parameters to
minimize the difference between the predicted
outputs and the true labels
• The trained model can then make predictions or
classify new, unseen data based on the learned
patterns
Supervised Learning Examples
• Predicting House Prices:
• Given features like area, number of rooms,
location, etc., predict the price of a house
• The labeled data would consist of houses with
their corresponding prices
• Email Spam Classification:
• Given the content and features of an email,
classify it as either spam or non-spam
• The labeled data would consist of emails marked
as spam or non-spam
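The house-price example above can be sketched as a small regression. This assumes scikit-learn is available, and the areas, room counts, and prices are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical labeled data: [area in sq ft, number of rooms] -> price
X = np.array([[800, 2], [1000, 3], [1200, 3], [1500, 4]])
y = np.array([160_000, 200_000, 240_000, 300_000])

# Training adjusts the model's parameters to minimize the difference
# between predicted outputs and the true labels
model = LinearRegression().fit(X, y)

# The trained model can then predict the price of a new, unseen house
price = model.predict(np.array([[1100, 3]]))[0]
print(round(price))
```

Here the toy prices follow price = 200 × area exactly, so the fitted line recovers that relationship and predicts about 220,000 for an 1100 sq ft house.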
Unsupervised Learning

What is it?
• Unsupervised learning is a type of machine learning where the
model is trained on unlabeled examples without any specific
output or target variable
• The goal of unsupervised learning is to discover patterns,
relationships, or structures in the data
• The model learns to identify clusters, similarities, or anomalies
in the data without any prior knowledge of the desired outcomes

Applications:
• Customer Segmentation: given a dataset of customer attributes and
behaviors, identify distinct groups or segments of customers with
similar characteristics
• The model would cluster the customers based on their similarities
and help businesses tailor their marketing strategies accordingly
Unsupervised Learning
Topic Modeling in Documents:
• Given a large collection of documents, discover the underlying topics
and themes within the text
• The model would identify common patterns, keywords, and
relationships among the documents to categorize them into different
topics.
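The customer-segmentation application can be sketched with k-means clustering. This is a minimal sketch assuming scikit-learn is available; the customer numbers are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer attributes: [monthly spend, visits per month]
customers = np.array([
    [200, 2], [220, 3], [210, 2],        # occasional shoppers
    [1500, 12], [1600, 14], [1550, 13],  # frequent high spenders
])

# No labels are given; k-means groups the customers purely by similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)
```

The algorithm recovers the two spending segments on its own, which is exactly the "discover structure without labels" idea described above.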
Underfitting:

• Underfitting occurs when a machine learning model is unable to
capture the underlying patterns or relationships in the data,
resulting in poor performance and low accuracy.
• It usually happens when the model is too simple or lacks the
complexity to represent the true underlying function.
• Underfitting can be avoided by increasing model complexity,
adding more informative features, or training for longer.

Causes and consequences of underfitting:
• Insufficient model complexity: The model may not have enough
parameters or flexibility to capture the complexity of the data
• Limited training data: When the training data is limited or not
representative of the entire population, the model may not learn
enough to generalize well
• Consequences: Underfitting leads to high bias, meaning the model
oversimplifies the problem and fails to capture important patterns
or variations in the data
Reasons for Underfitting:
1. High bias and low variance.
2. The size of the training dataset is not large enough.
3. The model is too simple.
4. The training data is not cleaned and contains noise.

Techniques to Reduce Underfitting:
1. Increase model complexity.
2. Increase the number of features by performing feature
engineering.
3. Remove noise from the data.
4. Increase the number of epochs or the duration of training to get
better results.
Overfitting:
• Overfitting occurs when a machine learning model
learns the training data too well, including noise
and irrelevant details, resulting in poor
generalization to new, unseen data
• The model essentially memorizes the training
examples instead of learning the underlying
patterns
Causes and consequences of overfitting:
• Excessive model complexity: The model may have
too many parameters or degrees of freedom,
allowing it to fit the noise or outliers in the training
data
• Lack of regularization: Without proper
regularization techniques, the model can become
overly sensitive to the training data
• Consequences: Overfitting leads to high variance,
meaning the model is too sensitive to small
fluctuations in the training data, making it perform
poorly on new data.
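Underfitting and overfitting can both be seen in one small experiment. This is a sketch with synthetic data (noisy samples of a sine curve, invented for illustration), fitting polynomials of increasing degree with NumPy:

```python
import numpy as np

# Noisy training samples of a sine curve; a dense noise-free test grid
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)  # the true underlying function

def errors(degree):
    """Mean squared error on training and test data for a polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

# Degree 1 underfits (too simple), degree 15 overfits (memorizes noise),
# degree 3 sits between the two
for d in (1, 3, 15):
    print(d, errors(d))
```

Training error always drops as the degree grows, but the straight line generalizes worst: it is too simple to capture the sine shape, while the high-degree polynomial chases the noise.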
Bias and Variance
Bias and Variance (continued)
Variance and its effects
Bias-Variance Tradeoff
Bias-Variance Trade-off
• A model with high bias makes strong assumptions about the
underlying data, often leading to underfitting. Underfitting occurs
when a model is too simple to capture the true patterns in the data,
resulting in poor performance and low accuracy.
• Variance, on the other hand, refers to the error introduced by the
model's sensitivity to fluctuations in the training data.
• A model with high variance is overly complex and captures noise or
random fluctuations in the training data, leading to overfitting.
• Overfitting occurs when a model fits the training data extremely
well but fails to generalize to new, unseen data, resulting in poor
performance and high error on test data.
• The bias-variance trade-off arises from the fact that decreasing bias
often increases variance and vice versa.
• A complex model can better fit the training data, reducing bias but
increasing variance.
Bias-Variance Trade-off
• To achieve this balance, various techniques can be employed, such
as:
1.Cross-validation: Splitting the data into multiple train-test splits and
using techniques like k-fold cross-validation can help estimate
model performance on unseen data and select the model with the
best trade-off.
2.Feature selection: Choosing the most relevant features and reducing
the dimensionality of the data can help reduce model complexity and
variance.
3.Ensemble methods: Combining predictions from multiple models,
such as bagging, boosting, or stacking, can help reduce variance by
averaging out individual model errors.
4.Model selection: Choosing a model that naturally balances bias and
variance, such as decision trees with limited depth or support vector
machines with appropriate kernel parameters.
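Technique 1 above, cross-validation, can be sketched as follows. This assumes scikit-learn is available and uses synthetic data invented for illustration; the tree depths stand in for "simple" versus "complex" models:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 200)

def cv_score(depth):
    """Mean 5-fold cross-validation R^2 for a tree of the given depth."""
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    return cross_val_score(model, X, y, cv=5).mean()

# A depth-1 stump has high bias; deeper trees trade bias for variance
for depth in (1, 4, 20):
    print(depth, round(cv_score(depth), 3))
```

Because each score is computed on held-out folds, it estimates performance on unseen data, which is what lets us compare the depths fairly rather than rewarding whichever model memorized the training set best.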
Model Selection:

The model selection process typically involves the following steps:
1.Define the problem: Clearly understand the problem you are trying to solve,
including the type of task (classification, regression, etc.) and the evaluation metric
to assess model performance (accuracy, precision, recall, etc.).
2.Prepare the data: Preprocess and prepare your dataset by handling missing
values, encoding categorical variables, normalizing or standardizing features, and
splitting the data into training, validation, and test sets.
3.Choose a set of candidate models: Select a set of models that are suitable for
the problem at hand. This can include different algorithms (e.g., decision trees,
support vector machines, neural networks) or variations of the same algorithm
with different hyperparameters.
4.Train the models: Train each model on the training data using an appropriate training algorithm or
optimization procedure. The models learn from the input data and adjust their parameters to minimize a
specific loss or error function.
5.Evaluate the models: Assess the performance of each trained model on the validation set. This involves
computing the evaluation metric(s) defined earlier. This step helps estimate how well the models generalize
to unseen data.
6.Select the best model: Compare the performance of the models on the validation set and select the one
that performs the best according to the chosen evaluation metric. This model is typically expected to have a
good trade-off between bias and variance.
7.Validate the selected model: After choosing the best model, it is essential to validate its performance on
the test set, which serves as a final unbiased evaluation. This step ensures that the selected model performs
well on unseen data and is not overfitted to the validation set.
It is worth noting that model selection is an iterative process, and it may require adjusting hyperparameters,
exploring different architectures, or trying different feature engineering techniques to improve model
performance.
Additionally, techniques like cross-validation can be employed to automate and optimize the model
selection process.
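The steps above can be compressed into a short sketch. This assumes scikit-learn is available and uses a synthetic classification dataset; the candidate models are one algorithm with several hyperparameter settings, compared automatically via cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 2: prepare data and hold out a final, untouched test set
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Steps 3-6: candidate models (one algorithm, several hyperparameter
# settings), trained and compared with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8, None]}, cv=5)
search.fit(X_train, y_train)
print("best:", search.best_params_)

# Step 7: validate the selected model on the test set,
# a final unbiased check that it was not tuned to the validation folds
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

GridSearchCV automates steps 3 through 6, while the held-out test set implements step 7: it is never seen during tuning, so the final score is an honest estimate of generalization.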
Applications
• Ask students to identify applications through research papers in the
lab (Experiment 1: Task 1)
• Discuss their research papers in class
• Padlet can be used for understanding the concepts of Unit-1
