
MACHINE LEARNING

By
Dr.V.Srilakshmi
Associate Professor,
CSE, GRIET
Unit-1 Content
► Introduction: Introduction to Machine Learning, Supervised learning, Unsupervised learning, Reinforcement learning, Deep learning.
► Feature Selection: Filter, Wrapper, Embedded methods.
► Feature Normalization: min-max normalization, z-score normalization, and constant factor normalization.
► Introduction to Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA)
Introduction to
Machine Learning
Definitions of Machine Learning
Machine learning (ML) is a subset / branch of Artificial Intelligence.
1. Machine learning is the "field of study that gives computers the ability to learn without being explicitly programmed" – defined by Arthur Samuel in 1959.
In machine learning, algorithms are trained to find patterns and correlations in large data sets and to make the best decisions and predictions based on that analysis.

(OR)
2. "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" – by Tom Mitchell in 1998.
List of reasons why Machine Learning is so important:

• Increase in Data Generation: Due to excessive production of data, we need a method that can be used to structure, analyze and draw useful insights from data. This is where Machine Learning comes in. It uses data to solve problems and find solutions to the most complex tasks faced by organizations.
• Improve Decision Making: By making use of various algorithms, Machine Learning can be
used to make better business decisions. For example, Machine Learning is used to
forecast sales, predict downfalls in the stock market, identify risks and anomalies, etc.
• Uncover patterns & trends in data: Finding hidden patterns and extracting key insights
from data is the most essential part of Machine Learning. By building predictive models
and using statistical techniques, Machine Learning allows you to dig beneath the
surface and explore the data at a minute scale. Understanding data and extracting
patterns manually will take days, whereas Machine Learning algorithms can perform
such computations in less than a second.
• Solve complex problems: From detecting the genes linked to the deadly ALS disease to
building self-driving cars, Machine Learning can be used to solve the most complex
problems.
Machine Learning Applications:

• Netflix’s Recommendation Engine: The core of Netflix is its famous recommendation engine. Over 75% of what you watch is recommended by Netflix, and these recommendations are made by implementing Machine Learning.
• Facebook’s Auto-tagging feature: The logic behind Facebook’s DeepFace face verification system is Machine Learning and Neural Networks. DeepFace studies the facial features in an image to tag your friends and family.
• Amazon’s Alexa: Alexa, which is based on Natural Language Processing and Machine Learning, is an advanced Virtual Assistant that does more than just play songs on your playlist. It can book you an Uber, connect with the other IoT devices at home, track your health, etc.
• Google’s Spam Filter: Gmail makes use of Machine Learning to filter out spam
messages. It uses Machine Learning algorithms and Natural Language
Processing to analyze emails in real-time and classify them as either spam or
non-spam.
Features of Machine Learning
Machine Learning Vs Traditional Programming
► Traditional Programming:
Data and a program are run on the computer to produce the output.
(Diagram: Data + Program → Computer → Output)
► Machine Learning:
Data and output are run on the computer to create a program.
(Diagram: Data + Output → Computer → Program)
Relation between Data Science, Machine learning , Deep Learning &
Artificial Intelligence
Relation between Data Science, Machine learning & Artificial
Intelligence:
► Machine Learning (ML): Algorithms that learn from structured data to
predict outputs and discover patterns in that data.
► ML is an application or subset of AI.
► The major aim of ML is to allow the systems to learn by themselves through
the experience without any kind of human intervention or assistance.
► Ex: We use machine learning in our day to day life when we use services
like recommendation systems on Netflix, Youtube, Spotify; search engines
like google and yahoo; voice assistants like google home and amazon alexa.
(structured data)
► In Machine Learning we train the algorithm by providing it with a lot of
data and allowing it to learn more about the processed information.
Relation between Data Science, Machine learning & Artificial Intelligence:

► Deep Learning (DL): Algorithms based on highly complex neural networks that
mimic the way a human brain works to detect patterns in large unstructured data
sets.
► Deep learning is the evolution of machine learning and neural networks, which uses
advanced computer programming and training to understand complex patterns
hidden in large data sets.
► DL is about understanding how the human brain works in different situations and
then trying to recreate its behaviour.
► Deep learning is used to complete complex tasks and train models using
unstructured data.
► Ex: Deep learning is commonly used in image classification tasks like facial
recognition. Although machine learning models can also identify faces, deep
learning models are more accurate.
► In this case, the model takes the unstructured data (images of faces) and extracts the distinguishing facial features on its own.
Two major advantages of DL
over ML:
1. Feature Extraction
► Machine learning algorithms such as Naive Bayes, Logistic Regression, SVM, etc., are termed “flat algorithms”. By flat, we mean that these algorithms require a pre-processing phase (known as Feature Extraction, which is quite complicated and computationally expensive) before being applied to data such as images, text, or CSV files.
► For instance, if we want to determine whether a particular image is of a cat or dog
using the ML model.
► We have to manually extract features from the image such as size, color, shape,
etc., and then give these features to the ML model to identify whether the image is
of a dog or cat.
► However, DL models do not need any feature extraction pre-processing step and are capable of classifying data into different classes and categories themselves. That is,
in the case of identification of cat or dog in the image, we do not need to extract
features from the image and give it to the DL model. But, the image can be given as
the direct input to the DL model whose job is then to classify it without human
intervention.
Two major advantages of DL
over ML:
2. Big Data
► With technology and the ever-increasing use of the web, it is estimated that every
second 1.7MB of data is generated by every person on the planet Earth.
Therefore, analyzing and learning from data is of utmost importance.
► Deep Learning is seen as a rocket whose fuel is data.
► The accuracy of ML models stops increasing with an increasing amount of data after
a point while the accuracy of the DL model keeps on increasing with increasing data.
All the technologies at a
glance………
ML Tools
Step:1 - Gathering the Data
Data:
► Data: It can be any unprocessed fact, value, text, sound, or picture that is not being
interpreted and analyzed.
► Data is the most important part of all Data Analytics, Machine Learning, Artificial
Intelligence.
► Without data, we can’t train any model, and all modern research and automation would go in vain. Big enterprises are spending lots of money just to gather as much data as possible.
► Data is typically divided into two types: labeled and unlabeled. Labeled data includes a
label or target variable that the model(Supervised) is trying to predict, whereas
unlabeled data does not include a label or target variable (UnSupervised) .
► A labeled dataset is one where you already know the target answer.
► The data used in machine learning is typically numerical or categorical.
• Numerical data: includes values that can be ordered and measured, such as age or income. (Regression, if the target variable is numerical.)
• Categorical/Nominal data: includes values that represent categories, such as gender or type of fruit. (Classification, if the target variable is categorical.)
Types of Data

The data is classified into majorly four categories:

• Nominal data
• Ordinal data
• Discrete data
• Continuous data
Types of Data
What is a Data set ?
A data set is an organized collection of
data. They are generally associated with a
unique body of work and typically cover
one topic at a time.
Each data set has one output variable and one or more input variables.
Rows are also called instances / observations / records / samples / objects.
Columns are also called features / attributes / variables / fields / characteristics.
• Independent variables: input variables / predictor variables.
• Dependent variable: output variable / target variable / response variable.
Types of datasets:
► 1.Data set consists of only numerical attributes
► 2.Data set consists of only categorical attributes
► 3.Data set consists of both numerical and categorical attributes
► Dataset1 (numerical attributes only):
age    income    height    weight
20     12000     6.3       30
40     15000     5.2       70
35     20000     5.6       65
60     100000    5.4       59

► Dataset2 (categorical attributes only):
age       income       student
youth     Fair         Yes
youth     Good         No
senior    excellent    Yes
middle    Good         Yes
senior    Fair         No
middle    good         no

► Dataset3 (numerical and categorical attributes):
age       income    Credit rating
youth     12000     Yes
senior    15000     No
middle    20000     Yes
youth     100000    Yes
Step:2 – Data Preparation

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data
sets and summarize their main characteristics, often employing data visualization methods.
Data Preparation

► Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data Preprocessing is a technique that is used to convert the raw data into a clean
data set. In other words, whenever the data is gathered from different sources it is collected in
raw format which is not feasible for the analysis.
► Data preparation is also known as data "pre-processing," "data wrangling," "data cleaning," and "feature engineering." It is carried out after the data has been gathered and before the model is trained.
► Few essential tasks when working with data in the data preparation step.
• Data cleaning: This task includes the identification of errors and making corrections or improvements to those errors.
• Feature Selection: We need to identify the most important or relevant input data variables for
the model.
• Data Transformation: Data transformation involves converting raw data into a well suitable
format for the model.
• Dimensionality Reduction: The dimensionality reduction process involves converting higher
dimensions into lower dimension features without changing the information
The four stages of data preprocessing
► There are four stages of data preprocessing: cleaning, integration, reduction, and transformation.
1. Data cleaning:
It is the process of cleaning datasets by accounting for missing values, removing outliers, correcting
inconsistent data points, and smoothing noisy data. In essence, the motive behind data cleaning is to offer
complete and accurate samples for machine learning models.
• Missing values
• Noisy data
i) Missing values:
► The problem of missing data values is quite common. It may happen during data collection or due
to some specific data validation rule. In such cases, you need to collect additional data samples or
look for additional datasets.
► The issue of missing values can also arise when you concatenate two or more datasets to form a
bigger dataset. If not all fields are present in both datasets, it’s better to delete such fields before
merging.
► Here are some ways to account for missing data (a short scikit-learn sketch follows this list):
• Manually fill in the missing values. This can be a tedious and time-consuming approach and is
not recommended for large datasets.
• Make use of a standard value to replace the missing data value. You can use a global constant like
“unknown” or “N/A” to replace the missing value. Although a straightforward approach, it isn’t
foolproof.
• Fill the missing value with the most probable value. To predict the probable value, you can use
algorithms like logistic regression or decision trees.
• Use a central tendency to replace the missing value. Central tendency is the tendency of a value to
cluster around its mean, mode, or median.
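The last two strategies can be sketched with scikit-learn's SimpleImputer; the toy values below are made up purely for illustration.
Python code (illustrative sketch):
import numpy as np
from sklearn.impute import SimpleImputer

# toy data with missing entries marked as np.nan (hypothetical age/income columns)
X = np.array([[25.0, 50000.0],
              [np.nan, 64000.0],
              [47.0, np.nan],
              [35.0, 58000.0]])

# fill each missing entry with the column mean (a central-tendency fill)
print(SimpleImputer(strategy="mean").fit_transform(X))

# or fill with a constant placeholder value such as -1
print(SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X))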
ii) Noisy data
► A large amount of meaningless data is called noise. More precisely, it’s the random variance in a
measured variable or data having incorrect attribute values. Noise includes duplicate or
semi-duplicates of data points, data segments of no value for a specific research process, or unwanted
information fields.
► For example, if you need to predict whether a person can drive, information about their hair
color, height, or weight will be irrelevant.
► An outlier can be treated as noise, although some consider it a valid data point. Suppose you’re
training an algorithm to detect tortoises in pictures. The image dataset may contain images of
turtles wrongly labeled as tortoises. This can be considered noise.
► However, there can be a tortoise’s image that looks more like a turtle than a tortoise. That sample can
be considered an outlier and not necessarily noise. This is because we want to teach the algorithm all
possible ways to detect tortoises, and so, deviation from the group is essential.
► For numeric values, you can use a scatter plot or box plot to identify outliers.
► The following are some methods used to solve the problem of noise:
• Regression: Regression analysis can help determine the variables that have an impact. This will
enable you to work with only the essential features instead of analyzing large volumes of data.
Both linear regression and multiple linear regression can be used for smoothing the data.
• Binning: Binning methods can be used for a collection of sorted data. They smooth a sorted value by looking at the values around it. The sorted values are divided into “bins,” i.e. the data is sorted into smaller segments of the same size. There are different techniques for binning, including smoothing by bin means and smoothing by bin medians (see the sketch after this list).
• Clustering: Clustering algorithms such as k-means clustering can be used to group data and
detect outliers in the process.
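As an illustration of smoothing by bin means, here is a small NumPy-only sketch; the values and the choice of 4 bins are arbitrary.
Python code (illustrative sketch):
import numpy as np

# sorted data to be smoothed
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# split the sorted values into 4 equal-size bins and replace each value by its bin mean
bins = np.split(values, 4)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # each bin's values replaced by that bin's mean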
2. Data integration
Data integration is the data analysis task that combines data from multiple sources into a coherent data store. These sources may include multiple databases. How can the data be matched up? A data analyst may find Customer_ID in one database and cust_id in another; how can they be sure that these two columns refer to the same entity? Databases and data warehouses have metadata (data about the data), which helps in avoiding such errors.
Since data is collected from various sources, data integration is a crucial part of data preparation.
Integration may lead to several inconsistent and redundant data points, ultimately leading to models
with inferior accuracy.
► Here are some approaches to integrate data:
• Data consolidation: Data is physically brought together and stored in a single place. Having all data
in one place increases efficiency and productivity. This step typically involves using data warehouse
software.
• Data virtualization: In this approach, an interface provides a unified and real-time view of data from
multiple sources. In other words, data can be viewed from a single point of view.
• Data propagation: Involves copying data from one location to another with the help of specific applications.
3. Data reduction

► As the name suggests, data reduction is used to reduce the amount of data and thereby reduce the
costs associated with data mining or data analysis.
► It offers a condensed representation of the dataset. Although this step reduces the volume, it maintains
the integrity of the original data. This data preprocessing step is especially crucial when working with
big data as the amount of data involved would be gigantic.
► The following are some techniques used for data reduction.
► Dimensionality reduction, also known as dimension reduction, reduces the number of features or
input variables in a dataset.
► The number of features or input variables of a dataset is called its dimensionality. The higher the
number of features, the more troublesome it is to visualize the training dataset and create a
predictive model.
► In some cases, most of these attributes are correlated, hence redundant; therefore,
dimensionality reduction algorithms can be used to reduce the number of random variables and
obtain a set of principal variables.
3. Data reduction
► There are two segments of dimensionality reduction: feature selection and feature extraction.
i. Feature selection (selecting a subset of the variables)--try to find a subset of the original set of features.
This allows us to get a smaller subset that can be used to visualize the problem using data modeling
ii. Feature extraction (extracting new variables from the data)---reduces the data in a high-dimensional
space to a lower-dimensional space, or in other words, space with a lesser number of dimensions.
► The following are some ways to perform dimensionality reduction:
• Principal component analysis (PCA): A statistical technique used to extract a new set of variables from a
large set of variables. The newly extracted variables are called principal components. This method works only
for features with numerical values.
• High correlation filter: A technique used to find highly correlated features and remove them; otherwise, a
pair of highly correlated variables can increase the multicollinearity in the dataset.
• Missing values ratio: This method removes attributes having missing values more than a specified threshold.
• Low variance filter: Involves removing normalized attributes having variance less than a threshold value as
minor changes in data translate to less information.
• Random forest: This technique is used to assess the importance of each feature in a dataset, allowing us to
keep just the top most important features.
4. Data Transformation
► Data transformation is the process of converting data from one format to another. In essence, it involves methods for
transforming data into appropriate formats that the computer can learn efficiently from.
► For example, the speed units can be miles per hour, meters per second, or kilometers per hour. Therefore a dataset may
store values of the speed of a car in different units as such. Before feeding this data to an algorithm, we need to
transform the data into the same unit.
► The following are some strategies for data transformation.
► Smoothing
► This statistical approach is used to remove noise from the data with the help of algorithms. It helps highlight the most
valuable features in a dataset and predict patterns. It also involves eliminating outliers from the dataset to make the
patterns more visible.
► Aggregation
► Aggregation refers to pooling data from multiple sources and presenting it in a unified format for data mining or
analysis. Aggregating data from various sources to increase the number of data points is essential as only then the ML
model will have enough examples to learn from.
► Discretization
► Discretization involves converting continuous data into sets of smaller intervals. For example, it’s more efficient to
place people in categories such as “teen,” “young adult,” “middle age,” or “senior” than using continuous age values.
► Generalization
► Generalization involves converting low-level data features into high-level data features. For instance, categorical
attributes such as home address can be generalized to higher-level definitions such as city or state.
4. Data Transformation
► Normalization
► Normalization refers to the process of converting all data variables into a specific range. In other words, it’s
used to scale the values of an attribute so that it falls within a smaller range, for example, 0 to 1. Decimal
scaling, min-max normalization, and z-score normalization are some methods of data normalization.
► Feature construction
► Feature construction involves constructing new features from the given set of features. This method simplifies
the original dataset and makes it easier to analyze, mine, or visualize the data.
► Concept hierarchy generation
► Concept hierarchy generation lets you create a hierarchy between features, although it isn’t specified. For
example, if you have a house address dataset containing data about the street, city, state, and country, this
method can be used to organize the data in hierarchical forms.
► Accurate data, accurate results
► Machine learning algorithms are like kids. They have little to no understanding of what’s favorable or
unfavorable. Like how kids start repeating foul language picked up from adults, inaccurate or inconsistent data
easily influences ML models. The key is to feed them high-quality, accurate data, for which data
preprocessing is an essential step.
Data Preprocessing
Step:3 – Choosing the Learning Model
Types of Machine Learning
► Supervised Learning
   • Classification: Decision Trees, KNN, Naïve Bayes, SVM, Logistic Regression, Multinomial Logistic Regression
   • Regression: Simple Linear, Multiple Linear, Polynomial
► Unsupervised Learning
   • Clustering: K-Means, K-Modes, K-Medoids, DBScan, Agglomerative, Divisive
► Reinforcement Learning
   • Q-Learning, Markov Decision Process
► Deep Learning
   • Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks
Step:4 – Training the Model
Training set & Test Set
Training the Model
► The dataset split ratio mainly depends on two things: first, the total number of samples (instances/rows) in your data, and second, the actual model you are training.
► Train/Validation/Test is a method to measure the accuracy of your model.
► We can split the data set into three sets: a training set, a validation set and a testing set.
► 70%/80% for training, and 30%/20% for testing (it depends on the given data).
► Training the model means creating the model.
► Testing the model means testing the accuracy of the model.
► The fundamental purpose of splitting the dataset is to assess how effective the trained model will be in generalizing to new data.
► This split can be achieved by using the train_test_split function of scikit-learn, as sketched below.
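A minimal sketch of such a split with scikit-learn; the iris dataset and the 80/20 ratio are just illustrative choices.
Python code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)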
Training the Model
► Training dataset: The sample of data used to fit the model. The actual
dataset that we use to train the model (weights and biases in the case of
a Neural Network). The model sees and learns from this data.
► This is the actual dataset from which the model learns, i.e. the model sees and learns from this data to predict the outcome or to make the right decisions.
► Most of the training data is collected from several resources and then preprocessed and organized to provide proper performance of the model.
► The type of training data hugely determines the ability of the model to generalize, i.e. the better the quality and diversity of the training data, the better the performance of the model will be.
► This data is more than 60% of the total data available for the project.
Training the Model
► Test dataset : The sample of data used to provide an unbiased evaluation of
a final model fit on the training dataset.
► This dataset is independent of the training set but has a somewhat similar type
of probability distribution of classes and is used as a benchmark to evaluate
the model, used only after the training of the model is complete.
► Testing set is usually a properly organized dataset having all kinds of data for
scenarios that the model would probably be facing when used in the real world.
Often the validation and testing set combined is used as a testing set which is
not considered a good practice.
► If the accuracy of the model on the training data is greater than that on the testing data, then the model is said to be overfitting.
► This data is approximately 20-25% of the total data available for the project.
Training the Model
► Validation dataset: The sample of data used to provide an unbiased
evaluation of a model fit on the training dataset while tuning model
hyperparameters. The evaluation becomes more biased as skill on the
validation dataset is incorporated into the model configuration.
► The validation set is used to fine-tune the hyperparameters of the model and is
considered a part of the training of the model.
► The model only sees this data for evaluation but does not learn from this data,
providing an objective unbiased evaluation of the model.
► The validation dataset can also be used for regularization (early stopping), i.e. interrupting the training of the model when the loss on the validation dataset becomes greater than the loss on the training dataset, which helps keep overfitting in check. This data is approximately 10-15% of the total data available for the project, but this can change depending upon the number of hyperparameters, i.e. if the model has quite a lot of hyperparameters, then using a larger validation set will give better results.
Step:5 – Performance Evaluation
Performance metrics
► Evaluating the performance of a Machine learning model is one of the important steps while building an
effective ML model. To evaluate the performance or quality of the model, different metrics are used,
and these metrics are known as performance metrics or evaluation metrics.
► These performance metrics help us understand how well our model has performed for the given data. In
this way, we can improve the model's performance by tuning the hyper-parameters. Each ML model aims
to generalize well on unseen/new data, and performance metrics help determine how well the model
generalizes on the new dataset.
Performance metrics
► In machine learning, each task or problem is divided into classification and Regression. Not all
metrics can be used for all types of problems; hence, it is important to know and understand which
metrics should be used. Different evaluation metrics are used for both Regression and Classification
tasks. In this topic, we will discuss metrics used for classification and regression tasks.
Performance Metrics for Classification
► In a classification problem, the category or classes of data is identified based on training data. The
model learns from the given dataset and then classifies the new data into classes or groups based
on the training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam,
etc. To evaluate the performance of a classification model, different metrics are used, and some
of them are as follows:
1. Accuracy: the ratio of the number of correct predictions to the total number of predictions.
2. Confusion Matrix
3. Precision
4. Recall
5. F-Score
6. AUC(Area Under the Curve)-ROC
Performance metrics
1. Accuracy: It is the ratio of the number of correct predictions to the total number of predictions, i.e. Accuracy = (Number of correct predictions) / (Total number of predictions).

2. Confusion Matrix:
► A confusion matrix is a tabular representation of prediction outcomes of any binary classifier, which is
used to describe the performance of the classification model on a set of test data when true values are
known.
► The confusion matrix is simple to implement, but the terminologies used in this matrix might be confusing
for beginners.
► A typical confusion matrix for a binary classifier looks like the below image(However, it can be extended
to use for classifiers with more than two classes).
Performance metrics
► Accuracy for the matrix can be calculated by summing the values lying on the “main diagonal” and dividing by the total number of samples, i.e.
► Accuracy = (True Positives + True Negatives) / Total Number of Samples

► 3. Precision: It is the number of correct positive results divided by the number of positive results predicted by the classifier.
Precision = TP / (TP + FP)
► 4. Recall: It is the number of correct positive results divided by the number of all relevant samples (i.e. all samples that should have been identified as positive).
Recall = TP / (TP + FN)
► 5. F-Score: The F-score or F1 Score is a metric to evaluate a binary classification model on the basis of the predictions that are made for the positive class. It is calculated with the help of Precision and Recall, and it is a single score that represents both of them. The F1 Score is the harmonic mean of Precision and Recall, assigning equal weight to each of them.
► The formula for calculating the F1 score is given below (a scikit-learn sketch of these metrics follows):
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
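A minimal sketch of these classification metrics with scikit-learn; the true and predicted labels below are made up.
Python code (illustrative sketch):
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (hypothetical)

print(confusion_matrix(y_true, y_pred))    # rows = actual class, columns = predicted class
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))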
Performance metrics

► 6.AUC(Area Under the Curve)-ROC


► Sometimes we need to visualize the performance of the classification model on charts; then, we can use the
AUC-ROC curve. It is one of the popular and important metrics for evaluating the performance of the
classification model.
► Firstly, let's understand ROC (Receiver Operating Characteristic curve) curve. ROC represents a graph to show
the performance of a classification model at different threshold levels. The curve is plotted between two
parameters, which are:
• True Positive Rate
• False Positive Rate
• TPR or True Positive Rate is a synonym for Recall, hence it can be calculated as: TPR = TP / (TP + FN)
• FPR or False Positive Rate can be calculated as: FPR = FP / (FP + TN)
(A scikit-learn sketch of the ROC curve and AUC follows.)
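A minimal sketch of the ROC curve points and AUC with scikit-learn; the labels and scores are hypothetical.
Python code (illustrative sketch):
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]                    # actual labels (hypothetical)
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]  # predicted probabilities (hypothetical)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # FPR/TPR at each threshold
print("AUC:", roc_auc_score(y_true, y_scores))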
Performance metrics
Performance Metrics for Regression
► Regression is a supervised learning technique that aims to find the relationships between the dependent and independent
variables. A predictive regression model predicts a numeric or continuous value. The metrics used for regression are different
from the classification metrics. It means we cannot use the Accuracy metric (explained above) to evaluate a regression
model; instead, the performance of a Regression model is reported as errors in the prediction. Following are the popular
metrics that are used to evaluate the performance of Regression models.
1. Mean Absolute Error (MAE): MAE is one of the simplest metrics; it measures the absolute difference between actual and predicted values, where absolute means taking every difference as positive.
MAE = (1/N) × Σ |Y − Y'|
► Y is the actual outcome, Y' is the predicted outcome, and N is the total number of data points.

2. Mean Squared Error (MSE): It measures the average of the squared differences between the values predicted by the model and the actual values.
MSE = (1/N) × Σ (Y − Y')²
(A scikit-learn sketch of these metrics follows after this list.)

3.R2 Score--R squared error is also known as Coefficient of Determination, which is another popular metric used for Regression
model evaluation. The R-squared metric enables us to compare our model with a constant baseline to determine the performance
of the model. To select the constant baseline, we need to take the mean of the data and draw the line at the mean.
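A minimal sketch of these regression metrics with scikit-learn; the actual and predicted values are made up.
Python code (illustrative sketch):
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # actual outcomes (hypothetical)
y_pred = [2.5,  0.0, 2.0, 8.0]   # model predictions (hypothetical)

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R2 :", r2_score(y_true, y_pred))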
Performance metrics
4. Adjusted R2
► Adjusted R squared, as the name suggests, is the improved version of R squared. R squared has a limitation: its score improves as more terms are added to the model, even if the model is not actually improving, which may mislead data scientists.
► To overcome this issue of R squared, adjusted R squared is used, which will always show a lower value than R². This is because it adjusts for the number of predictors and only shows an improvement if there is a real improvement.
► We can calculate the adjusted R squared as follows (a small sketch follows):
Adjusted R² = 1 − [(1 − R²) × (n − 1) / (n − k − 1)]
► n is the number of observations
► k denotes the number of independent variables
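A small sketch of this adjustment, assuming an R² score, n observations and k predictors are already known (the numbers below are hypothetical).
Python code (illustrative sketch):
def adjusted_r2(r2, n, k):
    # n = number of observations, k = number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(r2=0.85, n=50, k=5))   # ~0.833 for these hypothetical values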
Step:6 – Hyperparameter Tuning
Parameters and hyperparameters
Parameters
► A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.
• They are required by the model when making predictions.
• Their values define the skill of the model on your problem.
• They are estimated or learned from data.
• They are often not set manually by the practitioner.
• They are often saved as part of the learned model.
► Parameters are key to machine learning algorithms. They are the part of the model that is learned from historical training data.
► The term “parameter” is also used in other fields, where it likewise refers to one of a range of possible values:
• Statistics: In statistics, you may assume a distribution for a variable, such as a Gaussian distribution. Two parameters of the Gaussian
distribution are the mean (mu) and the standard deviation (sigma). This holds in machine learning, where these parameters may be
estimated from data and used as part of a predictive model.
• Programming: In programming, you may pass a parameter to a function. In this case, a parameter is a function argument that could have one
of a range of values. In machine learning, the specific model you are using is the function and requires parameters in order to make a
prediction on new data.
► Whether a model has a fixed or variable number of parameters determines whether it may be referred to as “parametric” or
“nonparametric“.
► Some examples of model parameters include:
• The weights in an artificial neural network.
• The support vectors in a support vector machine.
• The coefficients in a linear regression or logistic regression.
Parameters and hyperparameters
► Hyperparameters
► A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
• They are often used in processes to help estimate model parameters.
• They are often specified by the practitioner.
• They can often be set using heuristics.
• They are often tuned for a given predictive modeling problem.
► We cannot know the best value for a model hyperparameter on a given problem. We may use rules of thumb, copy values used on
other problems, or search for the best value by trial and error.
► When a machine learning algorithm is tuned for a specific problem, such as when you are using a grid search or a random search, you are tuning the hyperparameters of the model in order to discover the parameters of the model that result in the most skillful predictions.
► Model hyperparameters are often referred to as model parameters, which can make things confusing. A good rule of thumb to overcome this confusion is as follows:
► If you have to specify a model parameter manually, then it is probably a model hyperparameter.
► Some examples of model hyperparameters include:
• The learning rate for training a neural network.
• The C and sigma hyperparameters for support vector machines.
• The k in k-nearest neighbors.
Hyperparameter Tuning
► Hyperparameters are adjustable parameters that let you control the model training process. For example, with
neural networks, you decide the number of hidden layers and the number of nodes in each layer. Model
performance depends heavily on hyperparameters.
► Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of
hyperparameters that results in the best performance. The process is typically computationally expensive and
manual.
► We are not aware of the optimal values for hyperparameters that would generate the best model output, so we tell the model to explore and select the optimal model architecture automatically. This selection procedure for hyperparameters is known as Hyperparameter Tuning. (A scikit-learn grid-search sketch follows.)
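A minimal sketch of hyperparameter tuning with a grid search in scikit-learn; the KNN model, the candidate k values and the 5-fold cross-validation are arbitrary illustrative choices.
Python code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# candidate values of the hyperparameter k; the best one is chosen by cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the k that scored best on the validation folds
print(search.best_score_)    # its mean cross-validation accuracy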
Hyperparameter Tuning
Step:7 – Prediction
Types of Machine Learning
► Supervised Learning
   • Classification: Decision Trees, KNN, Naïve Bayes, SVM, Logistic Regression, Multinomial Logistic Regression
   • Regression: Simple Linear, Multiple Linear, Polynomial
► Unsupervised Learning
   • Clustering: K-Means, K-Modes, K-Medoids, DBScan, Agglomerative, Divisive
► Reinforcement Learning
   • Q-Learning, Markov Decision Process
► Deep Learning
   • Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks
Types of Machine Learning
► There are primarily three types of machine learning: Supervised, Unsupervised,
and Reinforcement Learning.
• Supervised machine learning: The user supervises the machine while training it to work on its own. This requires labeled training data.
• Unsupervised learning: There is training data, but it won't be labeled.
• Reinforcement learning: The system learns on its own.
1.Supervised Learning
► Supervised learning is a type of machine learning that uses labeled data to train machine
learning models. In labeled data, the output is already known. The model just needs to map
the inputs to the respective outputs.
► An example of supervised learning is to train a system that identifies the image of an
animal.
► Supervised learning algorithms take labeled inputs and map them to the known outputs,
which means you already know the target variable.
► Supervised Learning methods need external supervision to train machine learning models.
Hence, the name supervised. They need guidance and additional information to return the
desired result.
► First, you have to provide a data set that contains pictures of a kind of fruit, e.g., apples.
► Then, provide another data set that lets the model know that these are pictures of apples.
This completes the training phase.
► Next, provide a new set of data that only contains pictures of apples. At this point, the system can recognize what fruit it is and will remember it.
1.Supervised Learning
1.Supervised Learning
► Supervised learning algorithms are generally used for solving
classification and regression problems.
• Classification: predicts a class label (categorical output)
• Regression: predicts a numerical quantity (continuous output)
► Classification: Classification is used when the output variable is
categorical i.e. with 2 or more classes. For example, yes or no,
male or female, true or false, etc.
► In order to predict whether a mail is spam or not, we need to first
teach the machine what a spam mail is. This is done based on a
lot of spam filters - reviewing the content of the mail, reviewing
the mail header, and then searching if it contains any false
information.
► All of these features are used to score the mail and give it a spam score. The lower the total spam score of the email, the more likely that it is not spam.
► Based on the content, label, and the spam score of the new
incoming mail, the algorithm decides whether it should land in
the inbox or spam folder.
1.Supervised Learning
► Regression:
► Regression is used when the output variable is a real or continuous
value. In this case, there is a relationship between two or more
variables i.e., a change in one variable is associated with a change in
the other variable. For example, salary based on work experience or
weight based on height, etc.
► Let’s consider two variables - humidity and temperature. Here,
‘temperature’ is the independent variable and ‘humidity' is the
dependent variable. If the temperature increases, then the humidity
decreases.
► These two variables are fed to the model and the machine learns the
relationship between them. After the machine is trained, it can easily
predict the humidity based on the given temperature.
Note: Real-Life Applications of Supervised Learning
Risk Assessment-to assess the risk in financial services or insurance
domains
Image Classification--Facebook can recognize your friend in a picture from
an album of tagged photos.
Fraud Detection--To identify whether the transactions made by the user are
authentic or not.
2.Unsupervised Learning
► Unsupervised learning is a type of machine learning that uses unlabeled data to train machines.
► Unlabeled data doesn’t have a fixed output variable.
► The model learns from the data, discovers the patterns and features in the data, and returns the
output.
► Consider a cluttered dataset: a collection of pictures of different spoons.
► Feed this data to the model, and the model analyzes it to recognize any patterns.
► The machine then categorizes the photos into two types, as shown in the image, based on their similarities.
► Flipkart uses this model to find and recommend products that are well suited for you.
2.Unsupervised Learning
► Depicted below is an example of an unsupervised learning technique that uses the images of
vehicles to classify if it’s a bus or a truck.
► The model learns by identifying the parts of a vehicle, such as a length and width of the vehicle, the
front, and rear end covers, roof hoods, the types of wheels used, etc.
► Based on these features, the model classifies if the vehicle is a bus or a truck.
2.Unsupervised Learning
► Unsupervised learning finds patterns and understands the trends
in the data to discover the output. So, the model tries to label the
data based on the features of the input data.
► The training process used in unsupervised learning techniques
does not need any supervision to build models. They learn on
their own and predict the output.
► Unsupervised learning can be further grouped into types:
1. Clustering
2. Association
2.Unsupervised Learning
► 1. Clustering: Clustering is the method of dividing the objects into
clusters that are similar between them and are dissimilar to the objects
belonging to another cluster. For example, finding out which
customers made similar product purchases.
2.Unsupervised Learning
► Suppose a telecom company wants to reduce its customer
churn rate by providing personalized call and data plans. The
behavior of the customers is studied and the model segments
the customers with similar traits. Several strategies are adopted
to minimize churn rate and maximize profit through suitable
promotions and campaigns.
► 2. Association:
► Association is a rule-based machine learning to discover the
probability of the co-occurrence of items in a collection. For
example, finding out which products were purchased together.
2.Unsupervised Learning

► Let’s say that a customer goes to a supermarket and buys bread, milk, fruits,
and wheat. Another customer comes and buys bread, milk, rice, and butter.
Now, when another customer comes, it is highly likely that if he buys bread,
he will buy milk too. Hence, a relationship is established based on customer
behavior and recommendations are made.
2.Unsupervised Learning
► Real-Life Applications of Unsupervised Learning:
• Market Basket Analysis: It is a machine learning model based on the idea that if you buy a certain group of items, you are more or less likely to buy another group of items.
• Semantic Clustering: Semantically similar words share a similar context. People
post their queries on websites in their own ways. Semantic clustering groups all
these responses with the same meaning in a cluster to ensure that the customer
finds the information they want quickly and easily. It plays an important role in
information retrieval, good browsing experience, and comprehension.
• Delivery Store Optimization: Machine learning models are used to predict the demand and keep up with supply. They are also used to open stores where the demand is higher and to optimize routes for more efficient deliveries according to past data and behavior.
• Identifying Accident Prone Areas: Unsupervised machine learning models can be
used to identify accident-prone areas and introduce safety measures based on the
intensity of those accidents.
Difference between Supervised and Unsupervised
Learning:
S. No | Supervised Learning | Unsupervised Learning
1 | The data used in supervised learning is labeled. The system learns from the labeled data and makes future predictions. | This algorithm does not require any labeled data, because its job is to look for patterns in the input data and organize it.
2 | We get feedback: once you receive the output, the system remembers it and uses it for the next operation. | That does not happen with unsupervised learning.
3 | Supervised learning is mostly used to predict data. | Unsupervised learning is used to find hidden patterns or structures in data.
Reinforcement learning:
► Reinforcement learning is a sub-branch of Machine Learning that trains a model to return an
optimum solution for a problem by taking a sequence of decisions by itself.
► Reinforcement Learning is a feedback-based Machine learning technique in which an agent learns
to behave in an environment by performing the actions and seeing the results of actions.
► For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or penalty.
• In Reinforcement Learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent is bound to learn from its experience only.
• RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game-playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary goal of an agent
in reinforcement learning is to improve the performance by getting the maximum positive
rewards.
• The agent learns through the process of hit and trial, and based on the experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its arms is an example of Reinforcement Learning.
•Agent(): An entity that can perceive/explore the environment and act upon it.
•Environment(): A situation in which an agent is present or surrounded by. In RL, we assume the stochastic
environment, which means it is random in nature.

•Action(): Actions are the moves taken by an agent within the environment.
•State(): State is a situation returned by the environment after each action taken by the agent.
•Reward(): A feedback returned to the agent from the environment to evaluate the action of the agent.
•Policy(): Policy is a strategy applied by the agent for the next action based on the current state.
•Value(): It is the expected long-term return with the discount factor, as opposed to the short-term reward.
•Q-value(): It is mostly similar to the value, but it takes one additional parameter as a current action (a).
Reinforcement Learning:
► Key Features of Reinforcement Learning
• In RL, the agent is not instructed about the environment and what actions need to be taken.
• It is based on the hit and trial process.
• The agent takes the next action and changes states according to the feedback of the previous action.
• The agent may get a delayed reward.
• The environment is stochastic, and the agent needs to explore it to reach to get the maximum positive
rewards.
► Applications of Reinforcement Learning
Difference between Supervised, Unsupervised and Reinforcement
Learning:
S. No | Supervised | Unsupervised | Reinforcement
1 | Data provided is labeled data with output values specified | Data provided is unlabeled data; the output values are not specified and the machine makes its own predictions | The machine learns from its environment using rewards and errors
2 | Used to solve classification and regression problems | Used to solve clustering and association problems | Used to solve reward-based problems
3 | Labeled data is used | Unlabeled data is used | No predefined data is used
4 | External supervision | No supervision | No supervision
5 | Solves problems by mapping labeled input to known output | Solves problems by understanding patterns and discovering output | Follows a trial-and-error problem-solving approach
Machine Learning algorithms
Deep Learning
► Deep Learning is about learning multiple levels of representation and
abstraction that help to make sense of data such as images, sound, and text.
it makes use of deep neural networks.
► Deep learning mimics the network of neurons in a brain.
► It is a subset of Machine Learning, these algorithms are constructed with
connected layers.
Feature Selection:
Filter, Wrapper, Embedded
methods
What is Feature Selection
• In statistics, machine learning, and information theory, dimensionality
reduction or dimension reduction is the process of reducing the number of random
variables under consideration by obtaining a set of principal variables. Approaches
can be divided into
• feature selection
• feature extraction.
• Feature Selection is the process where you automatically or manually select those features which contribute most to the prediction variable or output in which you are interested.
• Feature projection (also called Feature extraction) transforms the data in
the high-dimensional space to a space of fewer dimensions
• The key difference between feature selection and extraction is that feature
selection keeps a subset of the original features while feature extraction creates
brand new ones.
Feature Selection in Machine Learning
Why feature selection ?
Top reasons to use feature selection are:

• It enables the machine learning algorithm to train faster.

• It reduces the complexity of a model and makes it easier to interpret.

• It improves the accuracy of a model if the right subset is chosen.

• It reduces overfitting.
Feature Selection Methods
Filter Methods

• Filter methods are generally used as a preprocessing step.


• The selection of features is independent of any machine learning
algorithms.
• Instead, features are selected on the basis of their scores in various
statistical tests for their correlation with the outcome variable.
Filter Method Examples
Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation is given as:
correlation(X, Y) = cov(X, Y) / (sigma(X) × sigma(Y))
where cov(X, Y) is the covariance, sigma(X) is the standard deviation of X and sigma(Y) is the standard deviation of Y.


LDA: Linear discriminant analysis is used to find a linear combination of features that
characterizes or separates two or more classes (or levels) of a categorical variable.
ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the
fact that it is operated using one or more categorical independent features and one
continuous dependent feature. It provides a statistical test of whether the means of
several groups are equal or not.
Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution. (A scikit-learn sketch of chi-square-based filtering follows.)
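A minimal sketch of filter-style selection in scikit-learn using chi-square scores; the iris data and the choice of k=2 features are arbitrary.
Python code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# score each (non-negative) feature against the target with the chi-square test, keep the top 2
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)   # chi-square score of each feature
print(X_new.shape)        # (150, 2)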
Filter Method Examples
Entropy:-
Entropy is the measure of the average information content. The higher the entropy, the higher the information contribution by that feature. Entropy H(X) can be formulated as:
H(X) = E[I(X)] = E[−log P(X)] = −Σ P(x) log P(x)
where X is a discrete random variable, P(X) is its probability mass function, E is the expected value operator, and I(X) is the information content of X.
Mutual information:-
In information theory, mutual information I(X; Y) is the reduction in uncertainty about X due to the knowledge of Y. Mathematically, mutual information is defined as
I(X; Y) = Σx Σy p(x, y) log [ p(x, y) / (p(x) p(y)) ]
where p(x, y) is the joint probability function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y. (A short sketch of mutual-information-based feature scoring follows.)
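A small sketch of scoring features by their mutual information with the target using scikit-learn; the dataset is just an example.
Python code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# estimate I(feature; class label) for every feature
mi_scores = mutual_info_classif(X, y, random_state=0)
print(mi_scores)   # higher score = the feature shares more information with the target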
Wrapper Methods
• In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset.
• The problem is essentially reduced to a search problem.
• These methods are usually computationally very expensive.
Wrapper Methods Examples

Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we keep adding the feature that best improves the model, until adding a new variable no longer improves the performance of the model.
Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removal of features.
Recursive Feature Elimination: It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination. (A short sketch follows below.)
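A minimal sketch of recursive feature elimination with scikit-learn; logistic regression as the underlying estimator and keeping 2 features are arbitrary choices.
Python code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# repeatedly fit the model and drop the weakest feature until only 2 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # True for the features that were kept
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier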
Embedded Methods

• Embedded methods combine the qualities of filter and wrapper methods. It’s
implemented by algorithms that have their own built-in feature selection methods.
• A learning algorithm takes advantage of its own variable selection process and
performs feature selection and classification simultaneously
• Some of the most popular examples of these methods are LASSO and RIDGE regression, which have built-in penalization functions to reduce overfitting (see the sketch after this list).
• Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
• Ridge regression performs L2 regularization, which adds a penalty equivalent to the square of the magnitude of the coefficients.
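A minimal sketch of embedded selection through L1 regularization with scikit-learn's Lasso; the diabetes dataset and alpha value are arbitrary illustrative choices.
Python code (illustrative sketch):
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# the L1 penalty drives the coefficients of unhelpful features to exactly zero
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

print(lasso.coef_)                   # zero coefficients correspond to dropped features
print(np.flatnonzero(lasso.coef_))   # indices of the features that survive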
Embedded Methods
Feature Selection Embedded in Learning Algorithms
Some learning algorithms perform feature selection as part of their overall
operation. These include:
• L1-regularization techniques, such as sparse regression, LASSO( Least Absolute Shrinkage and
Selection Operator) , L1-SVM
• L2 Regularisation- Ridge Regression
• Regularized trees e.g. regularized random Forest
• Decision tree
• Memetic algorithm
• Random multinomial logit (RMNL)
• Auto-encoding networks with a bottleneck-layer
• Submodular feature selection
Differences between Filter &
Wrapper Methods
The main differences between the filter and wrapper methods for feature selection
are:
• Filter methods measure the relevance of features by their correlation with
dependent variable while wrapper methods measure the usefulness of a subset of
feature by actually training a model on it.
• Filter methods are much faster compared to wrapper methods, as they do not involve training models. Wrapper methods, on the other hand, are computationally very expensive.
• Filter methods use statistical methods for the evaluation of a subset of features, while wrapper methods use cross validation.
• Filter methods might fail to find the best subset of features on many occasions, but wrapper methods can always provide the best subset of features.
• Using the subset of features from the wrapper methods makes the model more prone to overfitting as compared to using the subset of features from the filter methods.
Input Variable | Output Variable | Feature Selection technique
Numerical | Numerical | Pearson's correlation coefficient (linear correlation); Spearman's rank coefficient (non-linear correlation)
Numerical | Categorical | ANOVA correlation coefficient (linear); Kendall's rank coefficient (non-linear)
Categorical | Numerical | ANOVA correlation coefficient (linear); Kendall's rank coefficient (non-linear)
Categorical | Categorical | Chi-Squared test (contingency tables); Mutual Information
Feature Normalization: Min-max normalization, z-score normalization
Feature/ Data Normalization
Definition:
► The process of transforming the columns in a dataset to the same/standard scale is referred to as normalization. In other words, Data Normalization is the process of organizing data so that it appears consistent across all records and fields, producing clean data.
► Not every dataset needs to be normalized for machine learning. It is required only when the features have different ranges.
► Normalization is also known as feature scaling; it is applied on numerical features. Many machine learning algorithms, such as gradient descent methods, the KNN algorithm, and linear and logistic regression, require data scaling to produce good results. Various scalers are defined for this purpose.
► Data normalization consists of remodeling numeric columns to a standard scale.
► Why do you need Data Normalization?
► As data becomes more useful to all types of businesses, the manner in which it is arranged in mass amounts becomes increasingly important. When Data Normalization is done effectively,
• it results in better overall business function,
• from assuring email delivery to preventing misdials, and
• improving group analysis without the fear of duplicates.
► Let's say we have a dataset containing two variables: time traveled and distance covered. Time is measured in hours (e.g. 5, 10, 25 hours) and distance in miles (e.g. 500, 800, 1200 miles). Do you see the problem?
► One obvious problem is that these two variables are measured in two different units, one in hours and the other in miles. The other problem, which is not obvious but visible on a closer look, is the distribution of the data, which is quite different in these two variables (both within and between variables).
► The purpose of normalization is to transform data in a way that it is either dimensionless and/or has similar distributions. This process of normalization is also known as feature scaling.
Data Normalization
► Types of Normalization techniques in Machine Learning
► The most widely used types of normalization in machine learning are:
1. Min-Max Scaling – Subtract the column's minimum value from each value in that column and divide by the column's range. Each new column has a minimum value of 0 and a maximum value of 1.
2. Standardization Scaling – "Standardization" refers to centering a variable at zero and scaling its variance to one. The procedure is to subtract the mean from each observation and then divide by the standard deviation.
► Normalization and standardization
► Normalization and standardization are not the same thing.
Standardization, interestingly, refers to setting the mean to zero and the
standard deviation to one. Normalization in machine learning is the process
of translating data into the range [0, 1] (or any other range) or simply
transforming data onto the unit sphere.
Min-Max Normalization
Min-max normalization performs a linear transformation on the original data. By default it gives values between 0.0 and 1.0: the smallest value is normalized to 0.0 and the largest value to 1.0.
Let (X1, X2) be the minimum and maximum values of the attribute and (Y1, Y2) be the new range to which we are rescaling. Then the normalized value Ui corresponding to a value Vi of the attribute is given by:

Ui = ((Vi - X1) / (X2 - X1)) * (Y2 - Y1) + Y1
Example : Suppose that the minimum and maximum values for the price of the house be $125,000 and
$925,000 respectively. We need to normalize that price range in between (0,1). We can use min-max
normalization to transform any value between them (say, 300,000). In this case, we use the above
formula to find Ui with,
Vi=300,000
X1= 125,000
X2= 925,000
Y1= 0
Y2= 1
Ui = ((300,000 - 125,000) / (925,000 - 125,000)) * (1 - 0) + 0 = 175,000 / 800,000 = 0.21875
Therefore the normalized value Ui will be 0.21875.
Python Code:-
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)

Output:
[[0.5 0. 1. ]
[1. 0.5 0.33333333]
[0. 1. 0. ]]
Exercise :-
There are five numeric values: 14, 9, 24, 39, 60.
Apply Min-Max Normalization.
Solution:-
9: (9 - 9) / (60 - 9) = 0 / 51= 0.00
14: (14 - 9) / (60 - 9) = 5 / 51 = 0.098
24: (24 -9) / (60 - 9) =15 / 51 = 0.29
39: (39 - 9) / (60 - 9) = 30 / 51 = 0.58
60: (60 -9) / (60 - 9) = 51 / 51 = 1.00
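As a quick check, the same exercise can be reproduced in a few lines of Python (an illustrative snippet, not from the slides):

# Verifying the min-max exercise by hand in Python (illustrative).
values = [14, 9, 24, 39, 60]
lo, hi = min(values), max(values)              # 9 and 60
normalized = [(v - lo) / (hi - lo) for v in values]
print([round(n, 3) for n in normalized])       # [0.098, 0.0, 0.294, 0.588, 1.0]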
Z-Score Normalization
Also called standardization, z-score normalization rescales features so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean.
It rescales the original data without changing its original nature. The main aim of normalization is to bring the values in the dataset to a common scale without distorting the differences in the ranges of values. This technique is useful in classification algorithms involving neural networks or distance-based algorithms (e.g., KNN, K-means).
The standard score, or z-score, of a sample x is calculated using the following formula:

z = (x - μ) / σ
There are five numeric values: 14, 9, 24, 39, 60.

µ = (14 + 9 + 24 + 39 + 60) / 5 = 146 / 5 = 29.2 ≈ 29 (29 is used below to keep the arithmetic simple)
σ = sqrt( [(14 - 29)^2 + (9 - 29)^2 + (24 - 29)^2 + (39 - 29)^2 + (60 - 29)^2] / 5 )
  = sqrt( [(-15)^2 + (-20)^2 + (-5)^2 + (10)^2 + (31)^2] / 5 )
  = sqrt( [225 + 400 + 25 + 100 + 961] / 5 )
  = sqrt( 1711 / 5 )
  = sqrt( 342.2 )
  ≈ 18.5
Therefore, the z-score normalized values are approximately:
14: (14 - 29) / 18.5 = -0.81
9: (9 - 29) / 18.5 = -1.08
24: (24 - 29) / 18.5 = -0.27
39: (39 - 29) / 18.5 = +0.54
60: (60 - 29) / 18.5 = +1.68
A z-score normalized value that is positive corresponds to an x value that is greater than the mean value, while a
z-score that is negative corresponds to an x value that is less than the mean.
Python Code:

from sklearn.preprocessing import StandardScaler

X = [[101, 105, 222, 333, 225, 334, 556],
     [105, 105, 258, 354, 221, 334, 556]]
print("Before standardisation X values are ", X)
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
print("After standardisation X values are ", X)

Output:
Before standardisation X values are
[[101, 105, 222, 333, 225, 334, 556],
 [105, 105, 258, 354, 221, 334, 556]]
After standardisation X values are
[[-1.  0. -1. -1.  1.  0.  0.]
 [ 1.  0.  1.  1. -1.  0.  0.]]
Exercise:-
Using a calculator, we can find that the mean of the dataset is 21.2 and the standard deviation is 29.8.
To perform a z-score normalization on the first value in the dataset, we can use the following formula:
•New value = (x – μ) / σ
•New value = (3 – 21.2) / 29.8
•New value = -0.61
We can use this formula to perform a z-score normalization on every value in the dataset:
The mean of the normalized values is 0 and the standard deviation of the normalized values is 1.
The benefit of performing this type of normalization is that the clear outlier in the dataset (134) has been
transformed in such a way that it’s no longer a massive outlier.
Constant Factor Normalization
The simplest normalization technique is constant factor normalization.
Expressed as a math equation, constant factor normalization is x' = x / k, where x
is a raw value, x' is the normalized value, and k is a numeric constant. If k = 100,
the constant factor normalized values are:

28: 28 / 100 = 0.28
46: 46 / 100 = 0.46
34: 34 / 100 = 0.34
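In code this is a single division (illustrative snippet, with k = 100 as above):

# Constant factor normalization: divide every value by a fixed constant k (illustrative).
import numpy as np

x = np.array([28, 46, 34])
k = 100
print(x / k)   # [0.28 0.46 0.34]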
Introduction to Dimensionality Reduction
Dimensionality Reduction
► The number of input variables or features for a dataset is referred to as its dimensionality.
► Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
► Problem With Many Input Variables
► The performance of machine learning algorithms can degrade with too many input variables.
► If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the
columns that are fed as input to a model to predict the target variable. Input variables are also called features.
► We can consider the columns of data representing dimensions on an n-dimensional feature space and the rows of
data as points in that space. This is a useful geometric interpretation of a dataset.
► Having a large number of dimensions in the feature space can mean that the volume of that space is very large,
and in turn, the points that we have in that space (rows of data) often represent a small and non-representative
sample.
► This can dramatically impact the performance of machine learning algorithms fit on data with many input
features, generally referred to as the “ curse of dimensionality.”
► Therefore, it is often desirable to reduce the number of input features.
► This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.”
Dimensionality Reduction
► An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem where we need to classify whether the e-mail is spam or not. This can involve
a large number of features, such as whether or not the e-mail has a generic title, the content of the
e-mail, whether the e-mail uses a template, etc. However, some of these features may overlap.
► Similarly, a classification problem that relies on both humidity and rainfall can be
collapsed into just one underlying feature, since both of the aforementioned are correlated to a high
degree. Hence, we can reduce the number of features in such problems.
► A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple
2 dimensional space, and a 1-D problem to a simple line. The below figure illustrates this concept,
where a 3-D feature space is split into two 2-D feature spaces, and later, if found to be correlated, the
number of features can be reduced even further.
Dimensionality Reduction
► Why Dimensionality Reduction:
• It protects against the curse of dimensionality
• It improves the performance of the model
• It makes the data easier to visualize and understand

► Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduces storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.

► Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define datasets.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.
Dimensionality Reduction
► Methods of Dimensionality Reduction:
The various methods used for dimensionality reduction include:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. Generalized Discriminant Analysis (GDA)
Principal Component Analysis (PCA)
Principal Component Analysis (PCA)
► In large dimensional datasets, there might be lots of inconsistencies in the features or lots of
redundant features in the dataset, which will only increase the computation time and make
data processing and EDA more convoluted.
► To get rid of the curse of dimensionality, (When working with high-dimensional data, there
are a number of issues known as the “Curse of Dimensionality” ) a process called
dimensionality reduction was introduced.
► Dimensionality reduction techniques can be used to filter only a limited number of significant
features needed for training and this is where PCA comes in.
► Principal components analysis (PCA) is a dimensionality reduction technique that enables you
to identify correlations and patterns in a data set so that it can be transformed into a data
set of significantly lower dimension without loss of any important information.
► The main idea behind PCA is to figure out patterns and correlations among various features in
the data set. On finding a strong correlation between different variables, a final decision is
made about reducing the dimensions of the data in such a way that the significant data is still
retained.
► Such a process is very essential in solving complex data-driven problems that involve the use
of high-dimensional data sets. PCA can be achieved via a series of steps. Let’s discuss the
whole end-to-end process.
Principal Component Analysis (PCA)
Step-by-Step Computation of PCA
► The following steps need to be followed to perform dimensionality reduction using PCA (a short code sketch follows the list):
1. Standardization of the data
2. Computing the covariance matrix
3. Calculating the eigenvectors and eigenvalues
4. Computing the Principal Components
5. Reducing the dimensions of the data set
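A minimal NumPy sketch of these five steps, using a small made-up matrix X purely for illustration (this is not the library's implementation, just the procedure spelled out):

# Step-by-step PCA on a toy matrix, mirroring the five steps above (illustrative sketch).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Standardize the data (zero mean, unit variance per column).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)

# 3. Calculate eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Order the components by decreasing eigenvalue and keep the top k.
order = np.argsort(eigvals)[::-1]
k = 1
components = eigvecs[:, order[:k]]

# 5. Project the standardized data onto the chosen principal components.
X_reduced = X_std @ components
print(X_reduced.shape)   # (6, 1)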
Principal Component Analysis (PCA)
Step1: Standardization of the data:
► Standardization is all about scaling your data in such a way that all the variables and their values lie within a
similar range.
► Consider an example, let’s say that we have 2 variables in our data set, one has values ranging between 10-100 and
the other has values between 1000-5000. In such a scenario, it is obvious that the output calculated by using these
predictor variables is going to be biased since the variable with a larger range will have a more obvious impact on
the outcome.
► Therefore, standardizing the data into a comparable range is very important. Standardization is carried out by
subtracting the mean from each value in the data and dividing by the standard deviation of the data set.
► It can be calculated like so:

z = (value - mean) / standard deviation
Principal Component Analysis (PCA)
Step 2: Computing the covariance matrix
► As mentioned earlier, PCA helps to identify the correlation and dependencies among the features in a data set. A covariance
matrix expresses the correlation between the different variables in the data set. It is essential to identify heavily dependent
variables because they contain biased and redundant information which reduces the overall performance of the model.
► Mathematically, a covariance matrix is a p × p matrix, where p represents the dimensions of the data set. Each entry in the
matrix represents the covariance of the corresponding variables.
► Consider a case where we have a 2-Dimensional data set with variables a and b; the covariance matrix is a 2×2 matrix as
shown below:

[ Cov(a, a)   Cov(a, b) ]
[ Cov(b, a)   Cov(b, b) ]
► In the above matrix:


• Cov(a, a) represents the covariance of a variable with itself, which is nothing but the variance of the variable ‘a’
• Cov(a, b) represents the covariance of the variable ‘a’ with respect to the variable ‘b’. And since covariance is commutative,
Cov(a, b) = Cov(b, a)
► Here are the key takeaways from the covariance matrix:
• The covariance value denotes how co-dependent two variables are with respect to each other
• If the covariance value is negative, it denotes that the respective variables are inversely proportional to each other
• A positive covariance denotes that the respective variables are directly proportional to each other
Principal Component Analysis (PCA)
Step 3: Calculating the Eigenvectors and Eigenvalues
Eigenvectors and eigenvalues are the mathematical constructs that must be computed from the covariance
matrix in order to determine the principal components of the data set.
But first, let’s understand more about principal components
► What are Principal Components?
► Simply put, principal components are the new set of variables that are obtained from the initial set of
variables. The principal components are computed in such a manner that newly obtained variables are
highly significant and independent of each other. The principal components compress and possess most
of the useful information that was scattered among the initial variables.
► If your data set is of 5 dimensions, then 5 principal components are computed, such that, the first
principal component stores the maximum possible information and the second one stores the remaining
maximum info and so on.
► Now, where do Eigenvectors fall into this whole process?
Principal Component Analysis (PCA)
Step 3: Calculating the Eigenvectors and Eigenvalues
► Assuming that you all have a basic understanding of eigenvectors and eigenvalues, we know
that these two algebraic constructs are always computed as a pair, i.e., for every eigenvector
there is an eigenvalue. The number of dimensions in the data determines the number of eigenvectors
that you need to calculate.
► Consider a 2-Dimensional data set, for which 2 eigenvectors (and their respective eigenvalues)
are computed. The idea behind eigenvectors is to use the Covariance matrix to understand
where in the data there is the most amount of variance. Since more variance in the data denotes
more information about the data, eigenvectors are used to identify and compute Principal
Components.
► Eigenvalues, on the other hand, simply denote the scalars of the respective
eigenvectors. Therefore, eigenvectors and eigenvalues will compute the Principal
Components of the data set.
Principal Component Analysis (PCA)
Step 4: Computing the Principal Components
► Once we have computed the Eigenvectors and eigenvalues, all we have to do is order them in the
descending order, where the eigenvector with the highest eigenvalue is the most significant and thus
forms the first principal component. The principal components of lesser significances can thus be
removed in order to reduce the dimensions of the data.
► The final step in computing the Principal Components is to form a matrix known as the feature matrix
that contains all the significant data variables that possess maximum information about the data.
Step 5: Reducing the dimensions of the data set
► The last step in performing PCA is to re-arrange the original data with the final principal
components which represent the maximum and the most significant information of the data set.
In order to replace the original data axis with the newly formed Principal Components, you
simply multiply the transpose of the original data set by the transpose of the obtained feature
vector.
► So that was the theory behind the entire PCA process.
► URL: https://www.youtube.com/watch?v=osgqQy9Hr8s
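In practice, the whole pipeline is usually run with scikit-learn rather than by hand. A brief usage sketch (the Iris dataset is assumed here purely for illustration):

# PCA with scikit-learn: standardize the Iris data, then keep 2 principal components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(X_pca.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component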
Principal Component Analysis (PCA)
► Finding the covariance of a matrix:
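(The worked numbers from the original slide are not reproduced here; the sketch below computes a covariance matrix for a small illustrative 2-feature dataset.)

# Covariance matrix of a small 2-feature dataset (illustrative values, not the slide's).
import numpy as np

data = np.array([[4.0, 11.0], [8.0, 4.0], [13.0, 5.0], [7.0, 14.0]])
cov = np.cov(data, rowvar=False)   # rows are samples, columns are variables
print(cov)                         # [[ 14. -11.]
                                   #  [-11.  23.]]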
Principal Component Analysis (PCA)
► Finding Eigen values and eigen vectors:
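(Again as an illustrative sketch, continuing from the covariance matrix computed in the previous snippet rather than the slide's own numbers:)

# Eigenvalues and eigenvectors of the covariance matrix from the previous sketch.
import numpy as np

cov = np.array([[14.0, -11.0],
                [-11.0, 23.0]])
eigvals, eigvecs = np.linalg.eig(cov)
print("Eigenvalues:", eigvals)       # the largest one marks the first principal component
print("Eigenvectors (columns):")
print(eigvecs)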
Principal Component Analysis (PCA)
► Advantages of Principal Component Analysis
• Correlated features are removed
• It enhances the performance of the algorithm
• It enables better visualization
► Disadvantages of Principal Component Analysis
• The principal components can be difficult to interpret
• Data normalization is required
• There is some loss of information
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)
► Linear Discriminant Analysis (also called Normal Discriminant Analysis or Discriminant Function Analysis)
is a dimensionality reduction technique that is commonly used for supervised classification problems.
► It is used for modelling differences between groups, i.e., separating two or more classes.
► It is used to project features from a higher-dimensional space into a lower-dimensional space.
► LDA is similar to PCA (principal component analysis) in the sense that LDA also reduces the dimensions.
However, the main purpose of LDA is to find the line (or plane) that best separates data points
belonging to different classes.
► The key idea behind LDA is that the decision boundary should be chosen such that it maximizes
the distance between the means of the two classes while simultaneously minimizing the variance
within each class's data (the within-class scatter).
► For the mathematical calculation, see the following link:
► https://www.youtube.com/watch?v=mtTVXZq-9gE
Linear Discriminant Analysis (LDA)
► This criterion is known as the Fisher criterion and can be expressed as the following formula
for two classes:

J(w) = (m1 - m2)^2 / (s1^2 + s2^2)

where m1 and m2 are the means of the two classes after projection and s1^2 and s2^2 are the
corresponding within-class scatters.
► The following are some of the benefits of using LDA:
• LDA is used for classification problems.
• LDA is a powerful tool for dimensionality reduction.
• LDA is less susceptible to the "curse of dimensionality" than many other machine
learning algorithms.
Linear Discriminant Analysis (LDA)
► How does LDA work, and what are the steps involved in the process?
• LDA is a supervised machine learning algorithm that can be used for both classification
and dimensionality reduction.
The LDA algorithm works based on the following steps:
• The first step is to calculate the mean and standard deviation of each feature.
• The within-class scatter matrix and the between-class scatter matrix are calculated.
• These matrices are then used to calculate the eigenvectors and eigenvalues.
• LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation matrix.
• LDA uses this transformation matrix to transform the data into a new space with k dimensions.
• Once the data has been transformed into the new k-dimensional space, LDA can be
used for classification or dimensionality reduction.
► Examples of how LDA can be used in practice
The following are some examples of how LDA can be used in practice:
• LDA can be used for classification, such as classifying emails as spam or not spam.
• LDA can be used for dimensionality reduction, such as reducing the number of features in a dataset.
• LDA can be used to find the most important features in a dataset.
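A brief scikit-learn sketch of both uses (the Iris dataset and the train/test split here are assumptions made purely for illustration):

# LDA for dimensionality reduction and classification on the Iris data (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)    # project onto 2 discriminant axes
print("Reduced shape:", X_train_lda.shape)           # (112, 2)
print("Test accuracy:", lda.score(X_test, y_test))   # same fitted model used as a classifier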