ML Unit 1
By
Dr.V.Srilakshmi
Associate Professor,
CSE, GRIET
Unit-1 Content
► Introduction: Introduction to Machine Learning, Supervised learning, Unsupervised learning, Reinforcement learning, Deep learning.
► Feature Selection: Filter, Wrapper, Embedded methods.
(OR)
2. Machine learning: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E – by Tom Mitchell in 1998
List of reasons why Machine Learning is so important:
► Traditional Programming: Data and a program are run on the computer to produce the output.
► Machine Learning: Data and the output are run on the computer to create a program.
Relation between Data Science, Machine Learning, Deep Learning & Artificial Intelligence:
► Machine Learning (ML): Algorithms that learn from structured data to
predict outputs and discover patterns in that data.
► ML is an application or subset of AI.
► The major aim of ML is to allow systems to learn by themselves through experience, without any kind of human intervention or assistance.
► Ex: We use machine learning in our day-to-day life when we use services like recommendation systems on Netflix, YouTube, and Spotify; search engines like Google and Yahoo; and voice assistants like Google Home and Amazon Alexa.
(structured data)
► In Machine Learning we train the algorithm by providing it with a lot of
data and allowing it to learn more about the processed information.
Relation between Data Science, Machine learning & Artificial Intelligence:
► Deep Learning (DL): Algorithms based on highly complex neural networks that
mimic the way a human brain works to detect patterns in large unstructured data
sets.
► Deep learning is the evolution of machine learning and neural networks, which uses
advanced computer programming and training to understand complex patterns
hidden in large data sets.
► DL is about understanding how the human brain works in different situations and
then trying to recreate its behaviour.
► Deep learning is used to complete complex tasks and train models using
unstructured data.
► Ex: Deep learning is commonly used in image classification tasks like facial
recognition. Although machine learning models can also identify faces, deep
learning models are more accurate.
► In this case, the model takes the unstructured data (images of faces) and extracts the distinguishing features on its own.
Two major advantages of DL over ML:
1. Feature Extraction
► Machine learning algorithms such as Naive Bayes, Logistic Regression, SVM, etc., are termed "flat algorithms". By flat, we mean that these algorithms require a pre-processing phase (known as Feature Extraction, which is quite complicated and computationally expensive) before being applied to data such as images, text, or CSV files.
► For instance, if we want to determine whether a particular image is of a cat or dog
using the ML model.
► We have to manually extract features from the image such as size, color, shape,
etc., and then give these features to the ML model to identify whether the image is
of a dog or cat.
► However, DL models do not need any feature extraction pre-processing step and are
capable of classifying data into different classes and categories themselves. That is,
in the case of identification of cat or dog in the image, we do not need to extract
features from the image and give it to the DL model. But, the image can be given as
the direct input to the DL model whose job is then to classify it without human
intervention.
Two major advantages of DL over ML:
2. Big Data
► With technology and the ever-increasing use of the web, it is estimated that every
second 1.7MB of data is generated by every person on the planet Earth.
Therefore, analyzing and learning from data is of utmost importance.
► Deep Learning is seen as a rocket whose fuel is data.
► The accuracy of ML models stops increasing with an increasing amount of data after
a point while the accuracy of the DL model keeps on increasing with increasing data.
All the technologies at a glance
ML Tools
Step 1 – Gathering the Data
Data:
► Data: It can be any unprocessed fact, value, text, sound, or picture that is not being
interpreted and analyzed.
► Data is the most important part of all Data Analytics, Machine Learning, Artificial
Intelligence.
► Without data, we can't train any model, and all modern research and automation would go in vain. Big enterprises spend lots of money just to gather as much data as possible.
► Data is typically divided into two types: labeled and unlabeled. Labeled data includes a label or target variable that the model (supervised learning) is trying to predict, whereas unlabeled data does not include a label or target variable (unsupervised learning).
► A labeled dataset is one where you already know the target answer.
► The data used in machine learning is typically numerical or categorical.
• Numerical data includes values that can be ordered and measured, such as age or income (Regression, if the target variable is numerical).
• Categorical/Nominal data includes values that represent categories, such as gender or type of fruit (Classification, if the target variable is categorical).
Types of Data
• Nominal data
• Ordinal data
• Discrete data
• Continuous data
What is a Data set ?
A data set is an organized collection of
data. They are generally associated with a
unique body of work and typically cover
one topic at a time.
Each data set has one output variable and
one/more input variables.
Rows are also called instances, observations, records, samples, or objects.
Columns are also called features, attributes, variables, fields, characteristics, or predictors.
• Independent variables: input variables / predictor variables
• Dependent variables: output variable / target variable / response variable
Types of datasets:
► 1.Data set consists of only numerical attributes
► 2.Data set consists of only categorical attributes
► 3.Data set consists of both numerical and categorical attributes
► Dataset 1 (only numerical attributes):
age   income   height   weight
20    12000    6.3      30
40    15000    5.2      70
35    20000    5.6      65

► Dataset 2 (only categorical attributes):
age      credit rating   student
youth    fair            yes
youth    good            no
senior   excellent       yes

► Dataset 3 (numerical and categorical attributes):
age      income   student
youth    12000    yes
senior   15000    no
middle   20000    yes
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data
sets and summarize their main characteristics, often employing data visualization methods.
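A minimal EDA sketch with pandas, summarizing a dataset's main characteristics; the file name and its columns are hypothetical placeholders:

import pandas as pd

df = pd.read_csv("housing.csv")   # hypothetical dataset file
print(df.shape)                   # number of rows and columns
print(df.dtypes)                  # which columns are numerical vs categorical
print(df.describe())              # summary statistics for numeric columns
print(df.isnull().sum())          # missing values per column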
Data Preparation
► Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data Preprocessing is a technique that is used to convert the raw data into a clean
data set. In other words, whenever the data is gathered from different sources it is collected in
raw format which is not feasible for the analysis.
► Data preparation is also known as data "pre-processing," "data wrangling," "data cleaning," and "feature engineering." It is a later stage of the machine learning workflow, coming after data gathering.
► Few essential tasks when working with data in the data preparation step.
• Data cleaning: This task includes the identification of errors and making corrections or improvements to those errors.
• Feature Selection: We need to identify the most important or relevant input data variables for
the model.
• Data Transformation: Data transformation involves converting raw data into a format well suited for the model.
• Dimensionality Reduction: The dimensionality reduction process involves converting higher
dimensions into lower dimension features without changing the information
The four stages of data preprocessing
► There are four stages of data processing: cleaning, integration, reduction, and transformation.
1. Data cleaning:
It is the process of cleaning datasets by accounting for missing values, removing outliers, correcting
inconsistent data points, and smoothing noisy data. In essence, the motive behind data cleaning is to offer
complete and accurate samples for machine learning models.
• Missing values
• Noisy data
i) Missing values:
► The problem of missing data values is quite common. It may happen during data collection or due
to some specific data validation rule. In such cases, you need to collect additional data samples or
look for additional datasets.
► The issue of missing values can also arise when you concatenate two or more datasets to form a
bigger dataset. If not all fields are present in both datasets, it’s better to delete such fields before
merging.
► Here are some ways to account for missing data:
• Manually fill in the missing values. This can be a tedious and time-consuming approach and is
not recommended for large datasets.
• Make use of a standard value to replace the missing data value. You can use a global constant like
“unknown” or “N/A” to replace the missing value. Although a straightforward approach, it isn’t
foolproof.
• Fill the missing value with the most probable value. To predict the probable value, you can use
algorithms like logistic regression or decision trees.
• Use a central tendency to replace the missing value. Central tendency is the tendency of a value to
cluster around its mean, mode, or median.
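A minimal sketch of the central-tendency approach above, using scikit-learn's SimpleImputer; the small array of values is purely illustrative:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 60000.0],
              [35.0, np.nan]])

# Replace each missing value with the mean of its column
# ("median" or "most_frequent" are other valid strategies)
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)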
ii) Noisy data
► A large amount of meaningless data is called noise. More precisely, it’s the random variance in a
measured variable or data having incorrect attribute values. Noise includes duplicate or
semi-duplicates of data points, data segments of no value for a specific research process, or unwanted
information fields.
► For example, if you need to predict whether a person can drive, information about their hair
color, height, or weight will be irrelevant.
► An outlier can be treated as noise, although some consider it a valid data point. Suppose you’re
training an algorithm to detect tortoises in pictures. The image dataset may contain images of
turtles wrongly labeled as tortoises. This can be considered noise.
► However, there can be a tortoise’s image that looks more like a turtle than a tortoise. That sample can
be considered an outlier and not necessarily noise. This is because we want to teach the algorithm all
possible ways to detect tortoises, and so, deviation from the group is essential.
► For numeric values, you can use a scatter plot or box plot to identify outliers.
► The following are some methods used to solve the problem of noise:
• Regression: Regression analysis can help determine the variables that have an impact. This will
enable you to work with only the essential features instead of analyzing large volumes of data.
Both linear regression and multiple linear regression can be used for smoothing the data.
• Binning: Binning methods can be used for a collection of sorted data. They smoothen a sorted value
by looking at the values around it. The sorted values are then divided into “bins,” which means
sorting data into smaller segments of the same size. There are different techniques for binning,
including smoothing by bin means and smoothing by bin medians.
• Clustering: Clustering algorithms such as k-means clustering can be used to group data and
detect outliers in the process.
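A minimal sketch of the binning approach above ("smoothing by bin means"): sorted values are split into equal-size bins and each value is replaced by its bin's mean. The values are illustrative.

import numpy as np

values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(values, 4)   # four bins of equal size

# Replace every value by the mean of the bin it falls into
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)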
2. Data integration
Data integration is the data analysis task that combines data from multiple sources into a coherent data store. These sources may include multiple databases. How can the data be matched up? A data analyst may find Customer_ID in one database and cust_id in another; how can he be sure that these two belong to the same entity? Databases and data warehouses have metadata (data about data), which helps in avoiding such errors.
Since data is collected from various sources, data integration is a crucial part of data preparation.
Integration may lead to several inconsistent and redundant data points, ultimately leading to models
with inferior accuracy.
► Here are some approaches to integrate data:
• Data consolidation: Data is physically brought together and stored in a single place. Having all data
in one place increases efficiency and productivity. This step typically involves using data warehouse
software.
• Data virtualization: In this approach, an interface provides a unified and real-time view of data from
multiple sources. In other words, data can be viewed from a single point of view.
• Data propagation: Involves copying data from one location to another with the help of specific applications.
3. Data reduction
► As the name suggests, data reduction is used to reduce the amount of data and thereby reduce the
costs associated with data mining or data analysis.
► It offers a condensed representation of the dataset. Although this step reduces the volume, it maintains
the integrity of the original data. This data preprocessing step is especially crucial when working with
big data as the amount of data involved would be gigantic.
► The following are some techniques used for data reduction.
► Dimensionality reduction, also known as dimension reduction, reduces the number of features or
input variables in a dataset.
► The number of features or input variables of a dataset is called its dimensionality. The higher the
number of features, the more troublesome it is to visualize the training dataset and create a
predictive model.
► In some cases, most of these attributes are correlated, hence redundant; therefore,
dimensionality reduction algorithms can be used to reduce the number of random variables and
obtain a set of principal variables.
3. Data reduction
► There are two segments of dimensionality reduction: feature selection and feature extraction.
i. Feature selection (selecting a subset of the variables)--try to find a subset of the original set of features.
This allows us to get a smaller subset that can be used to visualize the problem using data modeling
ii. Feature extraction (extracting new variables from the data)---reduces the data in a high-dimensional
space to a lower-dimensional space, or in other words, space with a lesser number of dimensions.
► The following are some ways to perform dimensionality reduction:
• Principal component analysis (PCA): A statistical technique used to extract a new set of variables from a
large set of variables. The newly extracted variables are called principal components. This method works only
for features with numerical values.
• High correlation filter: A technique used to find highly correlated features and remove them; otherwise, a
pair of highly correlated variables can increase the multicollinearity in the dataset.
• Missing values ratio: This method removes attributes having missing values more than a specified threshold.
• Low variance filter: Involves removing normalized attributes having variance less than a threshold value as
minor changes in data translate to less information.
• Random forest: This technique is used to assess the importance of each feature in a dataset, allowing us to
keep just the top most important features.
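A minimal sketch of the PCA approach listed above, reducing a numeric dataset to two principal components; the Iris dataset is used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component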
4. Data Transformation
► Data transformation is the process of converting data from one format to another. In essence, it involves methods for
transforming data into appropriate formats that the computer can learn efficiently from.
► For example, the speed units can be miles per hour, meters per second, or kilometers per hour. Therefore a dataset may
store values of the speed of a car in different units as such. Before feeding this data to an algorithm, we need to
transform the data into the same unit.
► The following are some strategies for data transformation.
► Smoothing
► This statistical approach is used to remove noise from the data with the help of algorithms. It helps highlight the most
valuable features in a dataset and predict patterns. It also involves eliminating outliers from the dataset to make the
patterns more visible.
► Aggregation
► Aggregation refers to pooling data from multiple sources and presenting it in a unified format for data mining or
analysis. Aggregating data from various sources to increase the number of data points is essential as only then the ML
model will have enough examples to learn from.
► Discretization
► Discretization involves converting continuous data into sets of smaller intervals. For example, it’s more efficient to
place people in categories such as “teen,” “young adult,” “middle age,” or “senior” than using continuous age values.
► Generalization
► Generalization involves converting low-level data features into high-level data features. For instance, categorical
attributes such as home address can be generalized to higher-level definitions such as city or state.
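A minimal sketch of the discretization strategy above, using pandas.cut; the bin edges and ages are illustrative:

import pandas as pd

ages = pd.Series([15, 22, 34, 47, 68])
categories = pd.cut(ages,
                    bins=[0, 19, 35, 60, 120],
                    labels=["teen", "young adult", "middle age", "senior"])
print(categories)   # each age mapped to its interval label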
4. Data Transformation
► Normalization
► Normalization refers to the process of converting all data variables into a specific range. In other words, it’s
used to scale the values of an attribute so that it falls within a smaller range, for example, 0 to 1. Decimal
scaling, min-max normalization, and z-score normalization are some methods of data normalization.
► Feature construction
► Feature construction involves constructing new features from the given set of features. This method simplifies
the original dataset and makes it easier to analyze, mine, or visualize the data.
► Concept hierarchy generation
► Concept hierarchy generation lets you create a hierarchy between features, although it isn’t specified. For
example, if you have a house address dataset containing data about the street, city, state, and country, this
method can be used to organize the data in hierarchical forms.
► Accurate data, accurate results
► Machine learning algorithms are like kids. They have little to no understanding of what’s favorable or
unfavorable. Like how kids start repeating foul language picked up from adults, inaccurate or inconsistent data
easily influences ML models. The key is to feed them high-quality, accurate data, for which data
preprocessing is an essential step.
Data Preprocessing
Step 3 – Choosing the Learning Model
Types of Machine Learning
[Diagram: taxonomy of machine learning algorithms. Supervised learning: Decision Trees, KNN, Simple Linear Regression, Multiple Linear Regression, Multinomial Logistic Regression. Unsupervised learning: K-Means, K-Modes, Divisive clustering. Reinforcement learning: Markov Decision Process. Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks.]
Step 4 – Training the Model
Training set & Test Set
Training the Model
► The dataset split ratio mainly depends on two things: first, the total number of samples (instances/rows) in your data, and second, the actual model you are training.
► Train/Validation/Test splitting is a method to measure the accuracy of your model.
► We can split the data set into three sets: a training set, a validation set, and a testing set.
► 70%/80% for training, and 30%/20% for testing.(it depends on the given data)
► Train the model means create the model.
► Test the model means test the accuracy of the model.
► The fundamental purpose of splitting the dataset is to assess how effective the trained model will be in generalizing to new data.
► This split can be achieved by using train_test_split function of scikit-learn.
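A minimal sketch of such a split using scikit-learn's train_test_split; the Iris dataset stands in for any feature matrix X and target vector y:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% of the samples go to training, 30% to testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)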
Training the Model
► Training dataset: The sample of data used to fit the model. The actual
dataset that we use to train the model (weights and biases in the case of
a Neural Network). The model sees and learns from this data.
► This is the actual dataset from which a model trains, i.e., the model sees and learns from this data to predict the outcome or to make the right decisions.
► Most of the training data is collected from several resources and
then preprocessed and organized to provide proper performance of the model.
► The type of training data hugely determines the ability of the model to generalize, i.e., the better the quality and diversity of the training data, the better the performance of the model will be.
► This data is more than 60% of the total data available for the project.
Training the Model
► Test dataset : The sample of data used to provide an unbiased evaluation of
a final model fit on the training dataset.
► This dataset is independent of the training set but has a somewhat similar type
of probability distribution of classes and is used as a benchmark to evaluate
the model, used only after the training of the model is complete.
► Testing set is usually a properly organized dataset having all kinds of data for
scenarios that the model would probably be facing when used in the real world.
Often the validation and testing sets are combined and used as a single testing set, which is not considered good practice.
► If the accuracy of the model on the training data is greater than that on the testing data, then the model is said to be overfitting.
► This data is approximately 20-25% of the total data available for the project.
Training the Model
► Validation dataset: The sample of data used to provide an unbiased
evaluation of a model fit on the training dataset while tuning model
hyperparameters. The evaluation becomes more biased as skill on the
validation dataset is incorporated into the model configuration.
► The validation set is used to fine-tune the hyperparameters of the model and is
considered a part of the training of the model.
► The model only sees this data for evaluation but does not learn from this data,
providing an objective unbiased evaluation of the model.
► The validation dataset can also be utilized for regularization, by interrupting (early-stopping) training of the model when the loss on the validation dataset becomes greater than the loss on the training dataset, i.e., reducing bias and variance. This data is approximately 10-15% of the total data available for the project, but this can change depending upon the number of hyperparameters, i.e., if the model has many hyperparameters, then using a larger validation set will give better results.
Step 5 – Performance Evaluation
Performance metrics
► Evaluating the performance of a Machine learning model is one of the important steps while building an
effective ML model. To evaluate the performance or quality of the model, different metrics are used,
and these metrics are known as performance metrics or evaluation metrics.
► These performance metrics help us understand how well our model has performed for the given data. In
this way, we can improve the model's performance by tuning the hyper-parameters. Each ML model aims
to generalize well on unseen/new data, and performance metrics help determine how well the model
generalizes on the new dataset.
Performance metrics
► In machine learning, each task or problem is divided into classification and Regression. Not all
metrics can be used for all types of problems; hence, it is important to know and understand which
metrics should be used. Different evaluation metrics are used for both Regression and Classification
tasks. In this topic, we will discuss metrics used for classification and regression tasks.
Performance Metrics for Classification
► In a classification problem, the category or classes of data is identified based on training data. The
model learns from the given dataset and then classifies the new data into classes or groups based
on the training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam,
etc. To evaluate the performance of a classification model, different metrics are used, and some
of them are as follows:
1. Accuracy - it can be determined as the ratio of the number of correct predictions to the total number of predictions.
2. Confusion Matrix
3. Precision
4. Recall
5. F-Score
6. AUC(Area Under the Curve)-ROC
Performance metrics
1. Accuracy - It can be determined as the ratio of the number of correct predictions to the total number of predictions.
2. Confusion Matrix:
► A confusion matrix is a tabular representation of prediction outcomes of any binary classifier, which is
used to describe the performance of the classification model on a set of test data when true values are
known.
► The confusion matrix is simple to implement, but the terminologies used in this matrix might be confusing
for beginners.
► A typical confusion matrix for a binary classifier looks like the below image(However, it can be extended
to use for classifiers with more than two classes).
Performance metrics
► Accuracy for the matrix can be calculated by taking the sum of the values lying across the "main diagonal" and dividing by the total number of samples, i.e.
► Accuracy = (True Positives + True Negatives) / Total Number of Samples
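A minimal sketch of both metrics with scikit-learn, using illustrative true and predicted labels:

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total = 6/8 = 0.75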
Performance Metrics for Regression
2. Mean Squared Error (MSE) -- It measures the average of the squared differences between the values predicted by the model and the actual values.
3. R² Score -- R-squared, also known as the Coefficient of Determination, is another popular metric used for regression model evaluation. The R-squared metric enables us to compare our model with a constant baseline to determine the performance of the model. To select the constant baseline, we take the mean of the data and draw the line at the mean.
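In standard form, with y_i the actual values, \hat{y}_i the predicted values, and \bar{y} their mean:

MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}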
Performance metrics
4. Adjusted R2
► Adjusted R-squared, as the name suggests, is the improved version of R-squared. R-squared has the limitation that its score improves as more terms are added to the model, even when the model is not actually improving, which may mislead data scientists.
► To overcome the issue of R square, adjusted R squared is used, which will always show a lower value than R². It is
because it adjusts the values of increasing predictors and only shows improvement if there is a real improvement.
► We can calculate the adjusted R squared as follows:
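In its standard form, with n the number of samples and k the number of independent variables (predictors):

R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}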
Types of Machine Learning
► There are primarily three types of machine learning: Supervised, Unsupervised,
and Reinforcement Learning.
• Supervised machine learning: The user supervises the machine while training it to work on its own. This requires labeled training data.
• Unsupervised learning: There is training data, but it is not labeled.
• Reinforcement learning: The system learns on its own.
1.Supervised Learning
► Supervised learning is a type of machine learning that uses labeled data to train machine
learning models. In labeled data, the output is already known. The model just needs to map
the inputs to the respective outputs.
► An example of supervised learning is to train a system that identifies the image of an
animal.
► Supervised learning algorithms take labeled inputs and map them to the known outputs,
which means you already know the target variable.
► Supervised Learning methods need external supervision to train machine learning models.
Hence, the name supervised. They need guidance and additional information to return the
desired result.
► First, you have to provide a data set that contains pictures of a kind of fruit, e.g., apples.
► Then, provide another data set that lets the model know that these are pictures of apples.
This completes the training phase.
► Next, provide a new set of data that only contains pictures of apples. At this point, the system can recognize what the fruit is and will remember it.
1. Supervised Learning
► Supervised learning algorithms are generally used for solving
classification and regression problems.
• Classification -- Predicts a class label (categorical output)
• Regression -- Predicts a numerical quantity (continuous output)
► Classification: Classification is used when the output variable is
categorical i.e. with 2 or more classes. For example, yes or no,
male or female, true or false, etc.
► In order to predict whether a mail is spam or not, we need to first
teach the machine what a spam mail is. This is done based on a
lot of spam filters - reviewing the content of the mail, reviewing
the mail header, and then searching if it contains any false
information.
► All of these features are used to score the mail and give it a spam score. The lower the total spam score of the email, the more likely that it is not spam.
► Based on the content, label, and the spam score of the new
incoming mail, the algorithm decides whether it should land in
the inbox or spam folder.
1.Supervised Learning
► Regression:
► Regression is used when the output variable is a real or continuous
value. In this case, there is a relationship between two or more
variables i.e., a change in one variable is associated with a change in
the other variable. For example, salary based on work experience or
weight based on height, etc.
► Let’s consider two variables - humidity and temperature. Here,
‘temperature’ is the independent variable and ‘humidity' is the
dependent variable. If the temperature increases, then the humidity
decreases.
► These two variables are fed to the model and the machine learns the
relationship between them. After the machine is trained, it can easily
predict the humidity based on the given temperature.
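A minimal sketch of this temperature-humidity example with scikit-learn's LinearRegression; the numbers are purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression

temperature = np.array([[20], [25], [30], [35], [40]])   # independent variable
humidity = np.array([80, 72, 65, 55, 48])                # dependent variable

model = LinearRegression().fit(temperature, humidity)
print(model.predict([[28]]))   # predicted humidity at 28 degrees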
Note: Real-Life Applications of Supervised Learning
Risk Assessment-to assess the risk in financial services or insurance
domains
Image Classification--Facebook can recognize your friend in a picture from
an album of tagged photos.
Fraud Detection--To identify whether the transactions made by the user are
authentic or not.
2.Unsupervised Learning
► Unsupervised learning is a type of machine learning that uses unlabeled data to train machines.
► Unlabeled data doesn’t have a fixed output variable.
► The model learns from the data, discovers the patterns and features in the data, and returns the
output.
► Consider a cluttered dataset: a collection of pictures of different spoons.
► Feed this data to the model, and the model analyzes it to recognize any patterns.
► The machine categorizes the photos into two types, as shown in the image, based on their
similarities.
► Flipkart uses this model to find and recommend products that are well suited for you.
2.Unsupervised Learning
► Depicted below is an example of an unsupervised learning technique that uses the images of
vehicles to classify if it’s a bus or a truck.
► The model learns by identifying the parts of a vehicle, such as the length and width of the vehicle, the front and rear end covers, the roof hoods, the types of wheels used, etc.
► Based on these features, the model classifies if the vehicle is a bus or a truck.
2.Unsupervised Learning
► Unsupervised learning finds patterns and understands the trends
in the data to discover the output. So, the model tries to label the
data based on the features of the input data.
► The training process used in unsupervised learning techniques
does not need any supervision to build models. They learn on
their own and predict the output.
► Unsupervised learning can be further grouped into types:
1. Clustering
2. Association
2.Unsupervised Learning
► 1. Clustering: Clustering is the method of dividing objects into clusters such that objects within the same cluster are similar to each other and dissimilar to objects belonging to other clusters. For example, finding out which customers made similar product purchases.
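A minimal sketch of clustering with k-means; the two customer features (e.g. monthly purchases and monthly spend) and their values are hypothetical:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[5, 100], [6, 120], [50, 900], [55, 950], [7, 110], [52, 870]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assigned to each customer
print(kmeans.cluster_centers_)   # centre of each cluster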
2.Unsupervised Learning
► Suppose a telecom company wants to reduce its customer
churn rate by providing personalized call and data plans. The
behavior of the customers is studied and the model segments
the customers with similar traits. Several strategies are adopted
to minimize churn rate and maximize profit through suitable
promotions and campaigns.
► 2. Association:
► Association is a rule-based machine learning method for discovering the probability of the co-occurrence of items in a collection. For example, finding out which products were purchased together.
2.Unsupervised Learning
► Let’s say that a customer goes to a supermarket and buys bread, milk, fruits,
and wheat. Another customer comes and buys bread, milk, rice, and butter.
Now, when another customer comes, it is highly likely that if he buys bread,
he will buy milk too. Hence, a relationship is established based on customer
behavior and recommendations are made.
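A minimal sketch of the co-occurrence idea behind association rules, counting item pairs across baskets like those described above (pure Python; the transactions are illustrative):

from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk", "fruits", "wheat"},
    {"bread", "milk", "rice", "butter"},
    {"bread", "milk"},
]

# Count how often each pair of items appears together in a basket
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # ('bread', 'milk') appears in all 3 baskets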
2.Unsupervised Learning
► Real-Life Applications of Unsupervised Learning:
• Market Basket Analysis: It is a machine learning model based on the algorithm that
if you buy a certain group of items, you are less or more likely to buy another
group of items.
• Semantic Clustering: Semantically similar words share a similar context. People
post their queries on websites in their own ways. Semantic clustering groups all
these responses with the same meaning in a cluster to ensure that the customer
finds the information they want quickly and easily. It plays an important role in
information retrieval, good browsing experience, and comprehension.
• Delivery Store Optimization: Machine learning models are used to predict demand and keep up with supply. They are also used to open stores where demand is higher and to optimize routes for more efficient deliveries according to past data and behavior.
• Identifying Accident Prone Areas: Unsupervised machine learning models can be
used to identify accident-prone areas and introduce safety measures based on the
intensity of those accidents.
Difference between Supervised and Unsupervised
Learning:
1. Supervised: The data used in supervised learning is labeled; the system learns from the labeled data and makes future predictions. Unsupervised: This algorithm does not require any labeled data because its job is to look for patterns in the input data and organize it.
2. Supervised: We get feedback; once you receive the output, the system remembers it and uses it for the next operation. Unsupervised: That does not happen with unsupervised learning.
•Action(): Actions are the moves taken by an agent within the environment.
•State(): State is a situation returned by the environment after each action taken by the agent.
•Reward(): A feedback returned to the agent from the environment to evaluate the action of the agent.
•Policy(): Policy is a strategy applied by the agent for the next action based on the current state.
•Value(): It is the expected long-term return with the discount factor, as opposed to the short-term reward.
•Q-value(): It is mostly similar to the value, but it takes one additional parameter as a current action (a).
Reinforcement Learning:
► Key Features of Reinforcement Learning
• In RL, the agent is not instructed about the environment and what actions need to be taken.
• It is based on a trial-and-error (hit and trial) process.
• The agent takes the next action and changes states according to the feedback of the previous action.
• The agent may get a delayed reward.
• The environment is stochastic, and the agent needs to explore it to get the maximum positive rewards.
► Applications of Reinforcement Learning
Difference between Supervised, Unsupervised and Reinforcement
Learning:
1. Supervised: Data provided is labeled data, with output values specified. Unsupervised: Data provided is unlabeled data; the output values are not specified, and the machine makes its own predictions. Reinforcement: The machine learns from its environment using rewards and errors.
• It (feature selection) reduces overfitting.
Feature Selection Methods
Filter Methods
The entropy-based criterion used here involves a discrete random variable X with probability mass function P(X), the expected value operator E, and the information content I(X) of X (itself a random variable); the standard entropy formula is given below.
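In standard form, the entropy of X is

H(X) = E[I(X)] = E[-\log P(X)] = -\sum_{x} P(x)\,\log P(x)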
Mutual information:-
In information theory, mutual information I(X;Y) is the reduction in uncertainty about X due to the knowledge of Y.
Mathematically, mutual information is defined (in the standard form given below) as
where p(x, y) - joint probability function of X and Y, p(x) - marginal probability distribution function of X
p(y) - marginal probability distribution function of Y
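In standard form:

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y)\,\log \frac{p(x, y)}{p(x)\, p(y)}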
Wrapper Methods
• In wrapper methods, we try to use a subset of features and train a model using them.
Based on the inferences that we draw from the previous model, we decide to add or
remove features from your subset.
• The problem is essentially reduced to a search problem.
• These methods are usually computationally very expensive.
Wrapper Methods Examples
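One common wrapper-style example is Recursive Feature Elimination (RFE), which repeatedly trains a model and drops the weakest feature. A minimal sketch with scikit-learn; the Iris dataset is used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Keep the 2 strongest features according to the wrapped model
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected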
• Embedded methods combine the qualities of filter and wrapper methods. It’s
implemented by algorithms that have their own built-in feature selection methods.
• A learning algorithm takes advantage of its own variable selection process and
performs feature selection and classification simultaneously
• Some of the most popular examples of these methods are LASSO and RIDGE
regression which have inbuilt penalization functions to reduce overfitting.
• Lasso regression performs L1 regularization which adds penalty equivalent to
absolute value of the magnitude of coefficients.
• Ridge regression performs L2 regularization which adds penalty equivalent to
square of the magnitude of coefficients.
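A minimal sketch of LASSO acting as an embedded selector, on synthetic data in which only two of five features are informative; coefficients shrunk to (near) zero mark features the model effectively discards:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # near-zero coefficients correspond to discarded features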
Embedded Methods
Feature Selection Embedded in Learning Algorithms
Some learning algorithms perform feature selection as part of their overall
operation. These include:
• L1-regularization techniques, such as sparse regression, LASSO (Least Absolute Shrinkage and Selection Operator), and L1-SVM
• L2 Regularisation- Ridge Regression
• Regularized trees e.g. regularized random Forest
• Decision tree
• Memetic algorithm
• Random multinomial logit (RMNL)
• Auto-encoding networks with a bottleneck-layer
• Submodular feature selection
Differences between Filter &
Wrapper Methods
The main differences between the filter and wrapper methods for feature selection
are:
• Filter methods measure the relevance of features by their correlation with
dependent variable while wrapper methods measure the usefulness of a subset of
feature by actually training a model on it.
• Filter methods are much faster compared to wrapper methods as they do not
involve training the models. On the other hand, wrapper methods are
computationally very expensive as well.
• Filter methods use statistical methods for evaluation of a subset of features while
wrapper methods use cross validation.
• Filter methods might fail to find the best subset of features on many occasions, but wrapper methods can always provide the best subset of features.
• Using the subset of features from the wrapper methods makes the model more prone to overfitting as compared to using the subset of features from the filter methods.
[Table: choice of filter-based feature selection technique according to the input variable type and the output variable type (numerical or categorical).]
Min-Max Normalization
Example: Suppose that the minimum and maximum values for the price of a house are $125,000 and $925,000 respectively. We need to normalize that price range into (0, 1). We can use min-max normalization to transform any value between them (say, 300,000). In this case, we use the min-max formula to find Ui with,
Vi=300,000
X1= 125,000
X2= 925,000
Y1= 0
Y2= 1
Therefore the normalized value Ui will be 0.21875.
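Written out with the variable names above (standard min-max rescaling into the range [Y1, Y2]):

U_i = \frac{V_i - X_1}{X_2 - X_1}\,(Y_2 - Y_1) + Y_1 = \frac{300{,}000 - 125{,}000}{925{,}000 - 125{,}000}\,(1 - 0) + 0 = 0.21875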
Python Code:-
from sklearn import preprocessing
import numpy as np

# Three samples with three features each
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# Rescale each feature (column) to the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)
[[0.5 0. 1. ]
[1. 0.5 0.33333333]
[0. 1. 0. ]]
Exercise :-
There are five numeric values: 14, 9, 24, 39, 60.
Apply Min-Max Normalization.
Solution:-
9:  (9 - 9)  / (60 - 9) = 0 / 51  = 0.00
14: (14 - 9) / (60 - 9) = 5 / 51  = 0.098
24: (24 - 9) / (60 - 9) = 15 / 51 = 0.294
39: (39 - 9) / (60 - 9) = 30 / 51 = 0.588
60: (60 - 9) / (60 - 9) = 51 / 51 = 1.00
Z-Score Normalization
Also called standardization, z-score normalization rescales features so that they follow the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean.
It is the process of re-scaling original data without changing its original nature.The main aim of
normalization is to change the value of data in the dataset to a common scale, without distorting the
differences in the ranges of value. This technique is useful in classification algorithms involving neural
networks or distance-based algorithm (e.g. KNN, K-means).
The standard score or z-score of the samples are calculated using the following formula.
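In standard form, the z-score of a sample x is

z = \frac{x - \mu}{\sigma}

where μ is the mean of the feature and σ is its standard deviation.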
The benefit of performing this type of normalization is that the clear outlier in the dataset (134) has been
transformed in such a way that it’s no longer a massive outlier.
Constant Factor Normalization
The simplest normalization technique is constant factor normalization.
Expressed as a math equation constant factor normalization is x' = x / k, where x
is a raw value, x' is the normalized value, and k is a numeric constant. If k = 100, each constant factor normalized value is simply the raw value divided by 100.
► This criterion is known as the Fisher criterion and can be expressed as the following formula
for two classes:
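For a single feature, with class means μ₁, μ₂ and class variances σ₁², σ₂², the Fisher criterion is commonly written as

F = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}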