ML NOTES (UNIT 1 & 2)

UNIT-I

Introduction: Introduction to Machine Learning, Supervised Learning, Unsupervised Learning, Reinforcement Learning, Deep Learning. Feature Selection: Filter, Wrapper and Embedded Methods. Feature Normalization: Min-Max Normalization, Z-Score Normalization, and Constant Factor Normalization. Introduction to Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA).

Introduction

1.1 Definition of Machine Learning

Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined machine
learning as “the field of study that gives computers the ability to learn without being
explicitly programmed.” However, there is no universally accepted definition for machine
learning. Different authors define the term differently. We give below two more definitions.

Machine learning is programming computers to optimize a performance criterion


using example data or past experience. We have a model defined up to some
parameters, and learning is the execution of a computer program to optimize the
parameters of the model using the training data or past experience.
The field of study known as machine learning is concerned with the question of how to
construct computer programs that automatically improve with experience.

In the above definitions we have used the term “model”, and we will be using this term in several contexts later. There appears to be no universally accepted one-sentence definition of this term. Loosely, it may be understood as some mathematical expression or equation, some mathematical structure such as a graph or tree, a division of a set into disjoint subsets, a set of logical “if . . . then . . . else . . .” rules, or something similar. It may be noted that this is not an exhaustive list.

Definition of learning

A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks T, as measured by P, improves with
experience E.
Examples:

i) Handwriting recognition learning problem

• Task T: Recognising and classifying handwritten words within images

• Performance P: Percent of words correctly classified

• Training experience E: A dataset of handwritten words with given classifications

ii) A robot driving learning problem

• Task T: Driving on highways using vision sensors

• Performance measure P: Average distance traveled before an error

• Training experience E: A sequence of images and steering commands recorded while observing a human driver

iii) A chess learning problem

• Task T: Playing chess

• Performance measure P: Percent of games won against opponents

• Training experience E: Playing practice games against itself

A computer program which learns from experience is called a machine learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.

1.1.2 How machines learn

Basic components of learning process:

The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. Figure 1.1
illustrates the various components and the steps involved in the learning process.
Data storage: Facilities for storing and retrieving huge amounts of data are an important
component of the learning process. Humans and computers alike utilize data storage as a
foundation for advanced reasoning.

• In a human being, the data is stored in the brain and data is retrieved using electrochemical
signals.

• Computers use hard disk drives, flash memory, random access memory and similar devices
to store data and use cables and other technology to retrieve data.

Abstraction

The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts
about the data as a whole. The creation of knowledge involves application of known models and
creation of new models.

The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original
information.

Generalization

The third component of the learning process is known as generalisation.

The term generalization describes the process of turning the knowledge about stored data into a form that can be utilized for future action. These actions are to be carried out on tasks that are similar, but not identical, to those that have been seen before. In generalization, the goal is to discover those properties of the data that will be most relevant to future tasks.

Evaluation

Evaluation is the last component of the learning process. It is the process of giving feedback to
the user to measure the utility of the learned knowledge. This feedback is then utilised to effect
improvements in the whole learning process.

1.1.3 Applications of machine learning

Application of machine learning methods to large databases is called data mining. In data
mining, a large volume of data is processed to construct a simple model with valuable use, for
example, having high predictive accuracy.

The following is a list of some of the typical applications of machine learning.


1. In retail business, machine learning is used to study consumer behaviour.

2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.

3. In manufacturing, learning models are used for optimization, control, and troubleshooting.

4. In medicine, learning programs are used for medical diagnosis.

5. In telecommunications, call patterns are analyzed for network optimization and


maximizing the quality of service.

6. In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast
enough by computers. The World Wide Web is huge; it is constantly growing and
searching for relevant information cannot be done manually.

7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.

8. It is used to find solutions to many problems in vision, speech recognition, and robotics.

9. Machine learning methods are applied in the design of computer-controlled vehicles to


steer correctly when driving on a variety of roads.

10. Machine learning methods have been used to develop programmes for playing games
such as chess, backgammon and Go.

1.2 Different types of learning

In general, machine learning algorithms can be classified into three types.

Supervised learning:

Supervised learning is the machine learning task of learning a function that maps an input to an
output based on example input-output pairs.

In supervised learning, each example in the training set is a pair consisting of an input object
(typically a vector) and an output value. A supervised learning algorithm analyzes the
training data and produces a function, which can be used for mapping new examples. In the
optimal case, the function will correctly determine the class labels for unseen instances. Both
classification and regression problems are supervised learning problems.
A wide range of supervised learning algorithms are available, each with its strengths and
weaknesses. There is no single learning algorithm that works best on all supervised learning
problems.

Supervised learning is so called because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers (that is, the correct outputs); the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

Example :

Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labelled as “healthy” or “sick”.
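A supervised learner can be trained on such labelled records. The sketch below shows the idea with scikit-learn; the handful of (gender, age) values and their labels are made up purely for illustration, since the original table is not reproduced here.

# A minimal supervised-learning sketch (illustrative data only).
from sklearn.tree import DecisionTreeClassifier

# Features: [gender (0 = male, 1 = female), age]; labels: "healthy" or "sick".
X = [[0, 48], [1, 67], [0, 29], [1, 34], [0, 75]]
y = ["healthy", "sick", "healthy", "healthy", "sick"]

model = DecisionTreeClassifier()   # any supervised learner could be used here
model.fit(X, y)                    # learn the mapping from inputs to labels

print(model.predict([[1, 60]]))    # predict the label for a new patient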

Unsupervised learning

Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses.

In unsupervised learning algorithms, a classification or categorization is not included in the


observations. There are no output values and so there is no estimation of functions. Since the
examples given to the learner are unlabeled, the accuracy of the structure that is output by the
algorithm cannot be evaluated.

The most common unsupervised learning method is cluster analysis, which is used for
exploratory data analysis to find hidden patterns or grouping in data.

Example :

Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
Based on this data, can we infer anything regarding the patients entering the clinic?
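Cluster analysis can be sketched in a few lines; the (gender, age) values below are again made up for illustration, and k-means simply groups the unlabelled records into two clusters.

# A minimal unsupervised-learning sketch (illustrative data only).
from sklearn.cluster import KMeans

# Unlabelled features: [gender (0 = male, 1 = female), age] -- no output values.
X = [[0, 48], [1, 67], [0, 29], [1, 34], [0, 75], [1, 52]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # group the patients into 2 clusters

print(labels)                      # cluster assignment of each patient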

Reinforcement learning

Reinforcement learning is the problem of getting an agent to act in the world so as to


maximize its rewards.

A learner (the program) is not told what actions to take as in most forms of machine
learning, but instead must discover which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect not only the immediate reward but also
the next situations and, through that, all subsequent rewards.

For example, consider teaching a dog a new trick: we cannot tell it what to do, but we can
reward/punish it if it does the right/wrong thing. It has to find out what it did that made it get the
reward/punishment. We can use a similar method to train computers to do many tasks, such
as playing backgammon or chess, scheduling jobs, and controlling robot limbs.
Reinforcement learning is different from supervised learning. Supervised learning is learning
from examples provided by a knowledgeable expert.

1.3 Feature Selection

“Feature selection is a way of selecting the subset of the most relevant features from the
original features set by removing the redundant, irrelevant, or noisy features.”

While developing a machine learning model, only a few variables in the dataset are useful for building the model; the remaining features are either redundant or irrelevant. If we feed the dataset with all these redundant and irrelevant features into the model, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important features, which is done with the help of feature selection in machine learning.

Feature selection is one of the important concepts of machine learning, and it highly impacts the performance of the model. As machine learning works on the principle of “garbage in, garbage out”, we always need to input the most appropriate and relevant dataset to the model in order to get a better result.

In this topic, we will discuss different feature selection techniques for machine learning. But
before that, let's first understand some basics of feature selection.

What is Feature Selection?

A feature is an attribute that has an impact on the problem or is useful for the problem, and choosing the important features for the model is known as feature selection. Every machine learning process depends on feature engineering, which mainly consists of two processes: feature selection and feature extraction.

Although feature selection and extraction processes may have the same objective, both are
completely different from each other. The main difference between them is that feature
selection is about selecting the subset of the original feature set, whereas feature extraction
creates new features.

Feature selection is a way of reducing the number of input variables for the model by using only relevant data, in order to reduce overfitting in the model.

So, we can define feature selection as “a process of automatically or manually selecting the subset of the most appropriate and relevant features to be used in model building.” Feature selection is performed by either including the important features or excluding the irrelevant features from the dataset, without changing them.

Need for Feature Selection:

Before implementing any technique, it is really important to understand the need for it, and the same applies to feature selection. As we know, in machine learning it is necessary to provide a pre-processed and good input dataset in order to get better outcomes. We collect a huge amount of data to train our model and help it to learn better. Generally, the dataset consists of noisy data, irrelevant data, and some useful data. Moreover, the huge amount of data also slows down the training process of the model, and with noise and irrelevant data the model may not predict and perform well. So it is very necessary to remove such noise and less important data from the dataset, and to do this, feature selection techniques are used.
Selecting the best features helps the model to perform well. For example, suppose we want to create a model that automatically decides which car should be crushed for spare parts, and to do this we have a dataset. This dataset contains the Model of the car, Year, Owner's name, and Miles. In this dataset, the name of the owner does not contribute to the model performance, as it does not decide whether the car should be crushed or not, so we can remove this column and select the rest of the features (columns) for model building.

Below are some benefits of using feature selection in machine learning:

It helps in avoiding the curse of dimensionality.

It helps in the simplification of the model so that it can be easily interpreted by the researchers.

It reduces the training time.

It reduces overfitting and hence enhances generalization.

1.2 Feature Selection Techniques:

There are mainly two types of Feature Selection techniques, which are:

Supervised Feature selection techniques consider the target variable and can be used for the
labelled dataset.

Unsupervised Feature selection techniques ignore the target variable and can be used for the
unlabelled dataset.
1.2.1 Filter Methods:

In the filter method, features are selected on the basis of statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step.

The filter method filters out the irrelevant and redundant columns from the model by using different metrics through ranking.

The advantage of using filter methods is that they need low computational time and do not overfit the data.

Some common techniques of filter methods are as follows:

Information Gain
Chi-square Test
Fisher's Score
Missing Value Ratio

Information Gain: Information gain determines the reduction in entropy while transforming the
dataset. It can be used as a feature selection technique by calculating the information gain of each
variable with respect to the target variable.

Chi-square Test: Chi-square test is a technique to determine the relationship between the
categorical variables. The chi-square value is calculated between each feature and the target
variable, and the desired number of features with the best chi-square value is selected.
Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It returns the rank of each variable on Fisher's criterion in descending order. We can then select the variables with a large Fisher's score.

Missing Value Ratio:

The missing value ratio can be used for evaluating a feature against a threshold value. The missing value ratio is the number of missing values in a column divided by the total number of observations. Variables whose ratio is higher than the threshold value can be dropped.
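A minimal sketch of a filter method with scikit-learn's SelectKBest and the chi-square score is given below (other filter scores, such as mutual information for information gain, can be plugged in the same way); the bundled iris dataset is used only for illustration.

# Filter-method sketch: rank features with a statistical test, independently of any model.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)              # 4 numeric features, 3 classes

selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 best-scoring features
X_new = selector.fit_transform(X, y)

print(selector.scores_)                        # chi-square score of each feature
print(X_new.shape)                             # (150, 2): only the selected columns remain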

1.2.2 Wrapper Methods:


In the wrapper methodology, feature selection is treated as a search problem in which different combinations of features are made, evaluated, and compared with other combinations. It trains the algorithm iteratively using subsets of features.

On the basis of the output of the model, features are added or removed, and the model is trained again with the new feature set.

Some techniques of wrapper methods are:

Forward selection - Forward selection is an iterative process which begins with an empty set of features. In each iteration, it adds a feature and evaluates the performance to check whether the performance is improving. The process continues until the addition of a new variable/feature no longer improves the performance of the model.

Backward elimination - Backward elimination is also an iterative approach, but it is the opposite of forward selection. This technique begins by considering all the features and removes the least significant feature. This elimination process continues until removing a feature no longer improves the performance of the model.

Exhaustive Feature Selection - Exhaustive feature selection is one of the best feature selection methods, which evaluates each feature subset by brute force. This method tries every possible combination of features and returns the best-performing feature set.

Recursive feature elimination

Recursive feature elimination is a greedy optimization approach in which features are selected by recursively considering smaller and smaller subsets of features. An estimator is trained on each set of features, and the importance of each feature is determined using the coef_ attribute or the feature_importances_ attribute.
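A minimal sketch of recursive feature elimination with scikit-learn follows; the logistic-regression estimator and the iris data are illustrative choices only.

# Wrapper-method sketch: recursive feature elimination (RFE) around an estimator.
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)                       # drops the weakest feature in each round

print(rfe.support_)                 # which features were kept
print(rfe.ranking_)                 # 1 = selected, higher = eliminated earlier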

1.2.3 Embedded Methods

Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast, like filter methods, but more accurate. These methods are also iterative: during each iteration of model training they evaluate the features and pick out those that contribute the most to that iteration of training. Some techniques of embedded methods are:

Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. This penalty shrinks some coefficients toward zero, and features whose coefficients become zero can be removed from the dataset. Regularization techniques used for feature selection include L1 regularization (Lasso) and Elastic Net (a combination of L1 and L2 regularization).

Random Forest Importance - Tree-based methods provide feature importances, which give us a way of selecting features. Here, feature importance specifies which features matter more in model building or have a greater impact on the target variable. Random Forest is one such tree-based method; it is a bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e. the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged by their impurity values, which allows the tree to be pruned below a specific node. The remaining nodes form a subset of the most important features.
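A minimal sketch of an embedded method based on random-forest feature importances is shown below (the same idea applies to Lasso coefficients); the iris data is again used only for illustration.

# Embedded-method sketch: feature importance is a by-product of training the model itself.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importance of each feature, averaged over all the trees.
print(forest.feature_importances_)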

1.3 Feature normalization:

Normalization is a scaling technique in Machine Learning applied during data preparation to


change the values of numeric columns in the dataset to use a common scale. It is not
necessary for all datasets in a model. It is required only when features of machine learning
models have different ranges.

Although there are many feature normalization techniques in machine learning, a few of them are most frequently used. These are as follows:

1.3.1 Min-max normalization

Min-max normalization (sometimes called feature scaling) performs a linear transformation on the original data. This technique brings all the scaled data into the range [0, 1]. The formula to achieve this is the following:

x' = (x - min) / (max - min)

Consider three example values: 28, 46 and 34. Here min = 28 and max = 46. Therefore, the min-max normalized values are:

28: (28 - 28) / (46 - 28) = 0 / 18 = 0.00

46: (46 - 28) / (46 - 28) = 18 / 18 = 1.00

34: (34 - 28) / (46 - 28) = 6 / 18 = 0.33

The min-max technique results in values between 0.0 and 1.0 where the smallest value is
normalized to 0.0 and the largest value is normalized to 1.0.

1.3.2 Z-score normalization refers to the process of normalizing every value in a dataset
such that the mean of all of the values is 0 and the standard deviation is 1.

We use the following formula to perform a z-score normalization on every value in a dataset:

New value = (x – μ) / σ

where:

x: Original value

μ: Mean of data

σ: Standard deviation of data

For the three example values, mean(μ) = (28 + 46 + 34) / 3 = 108 / 3 = 36.0. The standard
deviation of a set of values is the square root of the sum of the squared difference of each
value and the mean, divided by the number of values, and so is:

σ = sqrt( [(28 - 36.0)^2 + (46 - 36.0)^2 + (34 - 36.0)^2] / 3 )

= sqrt( [(-8.0)^2 + (10.0)^2 + (-2.0)^2] / 3 )

= sqrt( [64.0 + 100.0 + 4.0] / 3 )

= sqrt( 168.0 / 3 )
= sqrt(56.0)

= 7.48

Therefore, the z-score normalized values are:

28: (28 - 36.0) / 7.48 = -1.07

46: (46 - 36.0) / 7.48 = +1.34

34: (34 - 36.0) / 7.48 = -0.27

A z-score normalized value that is positive corresponds to an x value that is greater than the
mean value, and a z-score that is negative corresponds to an x value that is less than
the mean.

1.3.3 Constant Factor Normalization:

The simplest normalization technique is constant factor normalization. Expressed as a math


equation constant factor normalization is x' = x / k, where x is a raw value, x' is the
normalized value, and k is a numeric constant. If k = 100, the constant factor normalized
values are:

28: 28 / 100 = 0.28

46: 46 / 100 = 0.46

34: 34 / 100 = 0.34
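The three techniques can be reproduced in a few lines of Python; the sketch below uses the same example values 28, 46 and 34.

import math

values = [28, 46, 34]

# Min-max normalization: x' = (x - min) / (max - min)
lo, hi = min(values), max(values)
print([(x - lo) / (hi - lo) for x in values])           # [0.0, 1.0, 0.33...]

# Z-score normalization: x' = (x - mean) / std
mean = sum(values) / len(values)
std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
print([(x - mean) / std for x in values])               # approx. [-1.07, 1.34, -0.27]

# Constant factor normalization: x' = x / k, here with k = 100
print([x / 100 for x in values])                        # [0.28, 0.46, 0.34]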

1.4 Dimensionality Reduction

Dimensionality reduction or dimension reduction is the process of reducing the number of


variables under consideration by obtaining a smaller set of principal variables.
Dimensionality reduction may be implemented in two ways.
• Feature selection
In feature selection, we are interested in finding k of the total of n features that give us the
most information and we discard the other (n−k) dimensions. We are going to discuss subset

selection as a feature selection method.

• Feature extraction
In feature extraction, we are interested in finding a new set of k features that are the
combination of the original n features. These methods may be supervised or unsupervised
depending on whether or not they use the output information. The best known and most
widely used feature extraction methods are Principal Components Analysis (PCA) and Linear
Discriminant Analysis (LDA), which are both linear projection methods, unsupervised and
supervised respectively.

Why dimensionality reduction is useful?


There are several reasons why we are interested in reducing dimensionality.
 In most learning algorithms, the complexity depends on the number of input
dimensions, d, as well as on the size of the data sample, N, and for reduced memory and
computation, we are interested in reducing the dimensionality of the problem.
Decreasing d also decreases the complexity of the inference algorithm during testing.
 When an input is decided to be unnecessary, we save the cost of extracting it.
 Simpler models are more robust on small datasets. Simpler models have less variance, that
is, they vary less depending on the particulars of a sample, including noise,
outliers, and so forth.
 When data can be explained with fewer features, we get a better idea about the
process that underlies the data, which allows knowledge extraction.
 When data can be represented in a few dimensions without loss of information, it can be
plotted and analyzed visually for structure and outliers.

1.4.1 Principal component analysis:

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal


transformation to convert a set of observations of possibly correlated variables into a set of values
of linearly uncorrelated variables called principal components. The number of principal
components is less than or equal to the smaller of the number of original variables or the number
of observations. This transformation is defined in such a way that the first principal
component has the largest possible variance (that is, accounts for as much of the variability
in the data as possible), and each succeeding component in turn has the highest
variance possible under the constraint that it is orthogonal to the preceding components.

Computation of the principal component vectors


(PCA algorithm)
The following is an outline of the procedure for performing a principal component analysis on a
given data. The procedure is heavily dependent on mathematical concepts. A knowledge of these
concepts is essential to carry out this procedure.
Step 1. Data
We consider a dataset having n features or variables, denoted by X1, X2, ..., Xn. Let there be N examples, and let the values of the i-th feature Xi be Xi1, Xi2, ..., XiN.

Step 2. Compute the means of the variables

We compute the mean X̄i of the variable Xi:

X̄i = (Xi1 + Xi2 + ... + XiN) / N

Step 3. Calculate the covariance matrix

Consider the variables Xi and Xj (i and j need not be different). The covariance of the ordered pair (Xi, Xj) is defined as

Cov(Xi, Xj) = (1/N) Σk (Xik - X̄i)(Xjk - X̄j),  summing over k = 1, ..., N.

The n × n matrix S whose (i, j)-th entry is Cov(Xi, Xj) is the covariance matrix of the data.

Step 4. Calculate the eigenvalues and eigenvectors of the covariance matrix

Let S be the covariance matrix and let I be the identity matrix having the same dimension as S.

i) Set up the equation det(S - λI) = 0 and solve it to obtain the eigenvalues λ1 ≥ λ2 ≥ ... ≥ λn of S, together with the corresponding eigenvectors. The eigenvector corresponding to the largest eigenvalue is the first principal component vector, the eigenvector corresponding to the next largest eigenvalue is the second, and so on.
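A minimal numerical sketch of these steps with NumPy is given below; the small two-feature dataset is made up purely for illustration.

# PCA sketch: centre the data, form the covariance matrix, then eigen-decompose it.
import numpy as np

# N = 5 examples, n = 2 features (illustrative values only).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

X_centered = X - X.mean(axis=0)                      # Step 2: subtract the feature means
S = np.cov(X_centered, rowvar=False, bias=True)      # Step 3: covariance matrix (divide by N)

eigvals, eigvecs = np.linalg.eigh(S)                 # Step 4: eigenvalues and eigenvectors of S
order = np.argsort(eigvals)[::-1]                    # sort by decreasing variance
components = eigvecs[:, order]

X_projected = X_centered @ components[:, :1]         # keep only the first principal component
print(eigvals[order])
print(X_projected)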
Advantages of Dimensionality Reduction

• It helps in data compression, and hence reduces the required storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
• Improved visualization: High-dimensional data is difficult to visualize, and dimensionality reduction techniques can help in visualizing the data in 2D or 3D, which helps in better understanding and analysis.
• Overfitting prevention: High-dimensional data may lead to overfitting in machine learning models, which can lead to poor generalization performance. Dimensionality reduction can help in reducing the complexity of the data and hence prevent overfitting.
• Feature extraction: Dimensionality reduction can help in extracting important features from high-dimensional data, which can be useful for feature selection in machine learning models.
• Data preprocessing: Dimensionality reduction can be used as a preprocessing step before applying machine learning algorithms, to reduce the dimensionality of the data and hence improve the performance of the model.
• Improved performance: Dimensionality reduction can help in improving the performance of machine learning models by reducing the complexity of the data, and hence reducing the noise and irrelevant information in the data.

Disadvantages of Dimensionality Reduction

• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define the dataset.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.
• Interpretability: The reduced dimensions may not be easily interpretable, and it may be difficult to understand the relationship between the original features and the reduced dimensions.
• Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially when the number of components is chosen based on the training data.
• Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to outliers, which can result in a biased representation of the data.
• Computational complexity: Some dimensionality reduction techniques, such as manifold learning, can be computationally intensive, especially when dealing with large datasets.

1.4.2 Linear Discriminant Analysis (LDA):

Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques in machine learning, used to solve classification problems with two or more classes. It is also known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).

Linear Discriminant Analysis is one of the most popular dimensionality reduction techniques used for supervised classification problems in machine learning. It is also considered a pre-processing step for modelling class differences in ML and in pattern classification applications.

Whenever there is a requirement to separate two or more classes having multiple features efficiently, Linear Discriminant Analysis is considered the most common technique to solve such classification problems. For example, suppose we have two classes with multiple features and need to separate them efficiently. If we classify them using a single feature, the classes may overlap.

Consider a situation where you have plotted the relationship between two variables where
each color represents a different class. One is shown with a red color and the other with
blue.
If you want to reduce the number of dimensions to 1, you can simply project every point onto the x-axis.

This approach, however, neglects any helpful information provided by the second feature. The advantage of LDA is that it uses information from both features to create a new axis, which minimizes the within-class variance and maximizes the distance between the two classes.
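A minimal sketch of this projection with scikit-learn's LinearDiscriminantAnalysis follows; the bundled iris data is used only for illustration.

# LDA sketch: project labelled data onto the axis that best separates the classes.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=1)   # reduce to a single discriminant axis
X_1d = lda.fit_transform(X, y)                     # uses the class labels (supervised)

print(X_1d.shape)                                  # (150, 1)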
Drawbacks of Linear Discriminant Analysis (LDA)

Although LDA is specifically used to solve supervised classification problems for two or more classes, which is not always possible using logistic regression, LDA also fails in cases where the means of the class distributions are shared. In such a case, LDA fails to create a new axis that makes both classes linearly separable.

Real-world Applications of LDA

Some of the common real-world applications of Linear Discriminant Analysis are given below:

o Face Recognition
Face recognition is a popular application of computer vision, where each face is represented as a combination of a number of pixel values. Here, LDA is used to reduce the number of features to a manageable number before the classification process. Each new dimension is a linear combination of pixel values, which forms a template. If the linear combinations are obtained using Fisher's linear discriminant, they are called Fisher's faces.

o Medical
In the medical field, LDA is used to classify a patient's disease as mild, moderate, or severe on the basis of various parameters of the patient's health and the ongoing medical treatment. This classification helps the doctors to either increase or decrease the pace of the treatment.

o Customer Identification
LDA is applied in customer identification: with the help of LDA, we can easily identify and select the features that characterize the group of customers who are likely to purchase a specific product in a shopping mall. This can be helpful when we want to identify such a group of customers.

o For Predictions
LDA can also be used for making predictions, and hence in decision making. For example, "will you buy this product?" will give a predicted result of one of two possible classes: buying or not buying.

o In Learning
Nowadays, robots are being trained to learn and talk so as to simulate human work, and this can also be treated as a classification problem. In this case, LDA builds similar groups on the basis of different parameters such as pitch, frequency, sound, tune, etc.
UNIT – II

Supervised Learning – I (Regression/Classification). Regression models: Simple Linear Regression, Multiple Linear Regression. Cost Function, Gradient Descent. Performance Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-Squared Error, Adjusted R-Squared. Classification models: Decision Trees (ID3, CART), Naive Bayes, K-Nearest Neighbours (KNN), Logistic Regression, Multinomial Logistic Regression, Support Vector Machines (SVM) - Nonlinearity and Kernel Methods.

Linear regression:
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression.
The linear regression model provides a sloped straight line representing the relationship between the variables.

Mathematically, we can represent a linear regression as:


y= a0+a1x+ ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)

a1 = Linear regression coefficient (scale factor to each input value).


ε = random error
The values for x and y variables are training datasets for Linear Regression
model representation.
Regression Models
Linear regression can be further divided into two types of the
algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.
2.1 Linear regression
Linear regression, in simple terms, answers the question “How can I use X to predict Y?”, where X is some information that you have and Y is some information that you want.
Let's say you wanted to sell a house and you wanted to know how much you can sell it for. The information you have about the house is your X, and the selling price that you want to know is your Y.
Linear regression creates an equation in which you input your given numbers (X) and it outputs the target variable that you want to find out (Y).

Linear Regression model representation

Linear regression is such a useful and established algorithm that it is both a statistical model and a machine learning model. Linear regression tries to draw a best-fit line that is close to the data by finding the slope and intercept.
The linear regression equation is
Y = a + bx
In this equation:
• y is the output variable. It is also called the target variable in machine learning or the dependent variable.
• x is the input variable. It is also referred to as the feature in machine learning or the independent variable.
• a is the constant (the intercept).
• b is the coefficient of the independent variable (the slope).

2.2 Multiple linear regression


Multiple Linear Regression assumes there is a linear relationship between two or
more independent variables and one dependent variable.
The formula for multiple linear regression is:

Y = B0 + B1X1 + B2X2 + ... + BnXn + e

• Y = the predicted value of the dependent variable
• B0 = the y-intercept (value of Y when all other parameters are set to 0)
• B1X1 = the regression coefficient (B1) of the first independent variable (X1)
• BnXn = the regression coefficient of the last independent variable
• e = model error
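A minimal sketch fitting a simple and a multiple linear regression with scikit-learn is shown below; the house-style numbers are made up for illustration.

# Linear-regression sketch: one predictor (simple) and several predictors (multiple).
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression, Y = a + b*x (x = house size, Y = price; made-up values).
x = np.array([[50], [80], [110], [140]])
y = np.array([150, 210, 275, 330])
simple = LinearRegression().fit(x, y)
print(simple.intercept_, simple.coef_)      # a (intercept) and b (slope)

# Multiple linear regression, Y = B0 + B1*X1 + B2*X2 (size and number of rooms).
X = np.array([[50, 2], [80, 3], [110, 4], [140, 5]])
multi = LinearRegression().fit(X, y)
print(multi.intercept_, multi.coef_)        # B0 and [B1, B2]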
2.3 Cost Function
The cost function is defined as the measure of the difference or error between the actual values and the predicted values at the current position, expressed as a single real number. For regression, a common choice of cost function is the mean squared error between the predictions and the actual values.
2.4 Gradient Descent
Gradient descent is one of the most commonly used optimization algorithms to train machine learning models; it minimizes the error between actual and predicted results. Gradient descent is also used to train neural networks.

2.4.1 Types of Gradient Descent


Based on the error in various training models, the Gradient Descent learning algorithm can
be divided into Batch gradient descent, stochastic gradient descent, and mini-
batch gradient descent. Let's understand these different types of gradient descent:
1. Batch Gradient Descent:
Batch gradient descent (BGD) is used to find the error for each point in the training set and
update the model after evaluating all training examples. This procedure is known as
the training epoch. In simple words, it is a greedy approach where we have to sum
over all examples for each update.
Advantages of batch gradient descent:
o It produces less noise in comparison to other gradient descent.
o It produces stable gradient descent convergence.
o It is Computationally efficient as all resources are used for all training samples.

2. Stochastic gradient descent


Stochastic gradient descent (SGD) is a type of gradient descent that runs one
training example per iteration.
3. MiniBatch Gradient Descent:
Mini Batch gradient descent is the combination of both batch gradient descent and
stochastic gradient descent.
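A minimal sketch of batch gradient descent minimizing the mean squared error cost of a simple linear model y = a + b*x is shown below; the data, learning rate and iteration count are illustrative choices.

# Batch gradient descent sketch for y = a + b*x with a mean-squared-error cost.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])      # underlying relationship: y = 1 + 2x

a, b = 0.0, 0.0                         # start from arbitrary parameter values
lr = 0.05                               # learning rate

for _ in range(2000):
    error = (a + b * x) - y             # predictions minus actual values
    grad_a = 2 * error.mean()           # gradient of the MSE cost w.r.t. a
    grad_b = 2 * (error * x).mean()     # gradient of the MSE cost w.r.t. b
    a -= lr * grad_a                    # step against the gradient
    b -= lr * grad_b

print(a, b)                             # approaches 1.0 and 2.0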

Performance Metrics:

• Mean Absolute Error represents the average of the absolute differences between the original and predicted values in the data set.
• Mean Squared Error represents the average of the squared differences between the original and predicted values in the data set. It measures the variance of the residuals.
• Root Mean Squared Error is the square root of the Mean Squared Error. It measures the standard deviation of the residuals.
• The coefficient of determination, or R-squared, represents the proportion of the variance in the dependent variable that is explained by the linear regression model. It is a scale-free score, i.e. irrespective of the values being small or large, the value of R-squared will be less than one.
• Adjusted R-squared is a modified version of R-squared that is adjusted for the number of independent variables in the model, and it will always be less than or equal to R². In the formula below, n is the number of observations in the data and k is the number of independent variables in the data:

Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]

2.5 Evaluation Metrics

• Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) penalize large prediction errors more heavily than Mean Absolute Error (MAE). However, RMSE is more widely used than MSE to evaluate the performance of a regression model against other models, as it has the same units as the dependent variable (Y-axis).
• MSE is a differentiable function, which makes it easy to perform mathematical operations on, in comparison to a non-differentiable function like MAE. Therefore, in many models RMSE is used as a default metric for the loss function, despite being harder to interpret than MAE.
• A lower value of MAE, MSE, and RMSE implies higher accuracy of a regression model. However, a higher value of R-squared is considered desirable.
• R-squared and Adjusted R-squared are used for explaining how well the independent variables in the linear regression model explain the variability in the dependent variable. The R-squared value always increases with the addition of independent variables, which might lead to the addition of redundant variables to our model. The adjusted R-squared solves this problem.
• Adjusted R-squared takes into account the number of predictor variables, and it is used to determine the number of independent variables in our model. The value of adjusted R-squared decreases if the increase in R-squared from an additional variable isn't significant enough.
• For comparing the accuracy among different linear regression models, RMSE is a better choice than R-squared.
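A minimal sketch computing these metrics with NumPy and scikit-learn follows; the actual and predicted values are made up, and k is taken here to be the number of independent variables.

# Regression-metric sketch: MAE, MSE, RMSE, R-squared and adjusted R-squared.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3, 10.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                          # same units as the dependent variable
r2 = r2_score(y_true, y_pred)

n, k = len(y_true), 2                        # n observations, k independent variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mae, mse, rmse, r2, adj_r2)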

2.6 Decision Trees


In simple words, a decision tree is a structure that contains nodes (rectangular boxes) and edges (arrows) and is built from a dataset (a table whose columns represent features/attributes and whose rows correspond to records). Each node is either used to make a decision (known as a decision node) or represents an outcome (known as a leaf node).

2.6.1 ID3
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each
step.

Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree.
In simple words, the top-down approach means that we start building the tree from the top
and the greedy approach means that at each iteration we select the best feature at the
present moment to create a node.
Generally, ID3 is used only for classification problems with nominal features.
ID3 Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows don’t belong to the same class, split the dataset S into
subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information
gain.
4. If all rows belong to the same class, make the current node as a leaf node with the
class as its label.
5. Repeat for the remaining features until we run out of all features, or the decision tree
has all leaf nodes.
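A minimal sketch of the entropy and information-gain calculation used in step 1 is given below; the toy labels and the candidate split are invented for illustration.

# Information-gain sketch: entropy(parent) minus the weighted entropy of the child subsets.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# Toy target column and a candidate split of its rows into two subsets.
parent = ["yes", "yes", "no", "no", "yes", "no"]
subsets = [["yes", "yes", "yes"], ["no", "no", "no"]]    # a perfect split

gain = entropy(parent) - sum((len(s) / len(parent)) * entropy(s) for s in subsets)
print(gain)    # 1.0 here: this split removes all uncertainty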
2.6.2 CART Algorithm
The CART algorithm works via the following process:
• The best split point of each input is obtained.
• Based on the best split points of each input in step 1, the new "best" split point is identified.
• The chosen input is split according to the "best" split point.
• Splitting continues until a stopping rule is satisfied or no further desirable splitting is available.

The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.

Gini index / Gini impurity

The Gini index is a metric for classification tasks in CART. It is computed from the sum of squared class probabilities: Gini = 1 - Σ pi². It measures the probability of a specific element being wrongly classified when chosen randomly, and it is a variation of the Gini coefficient. It works on categorical variables, gives outcomes as either "success" or "failure", and hence performs binary splitting only.
The degree of the Gini index varies from 0 to 1:

• A value of 0 indicates that all the elements belong to a single class, or that only one class exists.
• A value of 1 (approached only when there are many classes) indicates that the elements are randomly distributed across the various classes.
• A value of 0.5 indicates that the elements are uniformly distributed over two classes.
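A minimal sketch of the Gini impurity calculation:

# Gini impurity: 1 - sum of the squared class probabilities.
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["yes"] * 6))                 # 0.0: all elements belong to one class
print(gini(["yes", "no", "yes", "no"]))  # 0.5: two classes, evenly mixed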
Classification tree
A classification tree is an algorithm in which the target variable is categorical. The algorithm is then used to identify the "class" within which the target variable is most likely to fall. Classification trees are used when the dataset needs to be split into classes that belong to the response variable (like yes or no).
Regression tree
A Regression tree is an algorithm where the target variable is continuous and the tree is
used to predict its value. Regression trees are used when the response variable is
continuous. For example, if the response variable is the temperature of the day.

CART models are formed by picking input variables and evaluating split points on
those variables until an appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm:
• Greedy algorithm: The input space is divided using a greedy method known as recursive binary splitting. This is a numerical procedure in which all the values are lined up and different split points are tried and assessed using a cost function.
• Stopping criterion: As it works its way down the tree with the training data, the recursive binary splitting method described above must know when to stop splitting. The most frequent halting method is to require a minimum number of training instances assigned to each leaf node. If the count is smaller than the specified threshold, the split is rejected and the node is taken as a final leaf node.
• Tree pruning: A decision tree's complexity is defined as the number of splits in the tree. Trees with fewer branches are recommended as they are simpler to grasp and less prone to overfitting. Working through each leaf node in the tree and evaluating the effect of deleting it using a hold-out test set is the quickest and simplest pruning approach.
• Data preparation for CART: No special data preparation is required for the CART algorithm.
2.7 Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on
Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises the two words “Naïve” and “Bayes”, which can be described as:
o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends
on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of
a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of the evidence.

Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
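A minimal sketch of these steps in plain Python, using a tiny made-up weather/"Play" table (in practice a library classifier such as scikit-learn's CategoricalNB would be used):

# Naive Bayes sketch on a tiny, made-up weather/"Play" table (illustrative only).
from collections import Counter

weather = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast", "Sunny"]
play    = ["No",    "Yes",   "Yes",      "Yes",   "No",    "Yes",      "No"]

def score(outlook, label):
    # P(label) * P(outlook | label), proportional to the posterior P(label | outlook).
    rows = [w for w, p in zip(weather, play) if p == label]
    prior = len(rows) / len(play)
    likelihood = Counter(rows)[outlook] / len(rows)
    return prior * likelihood

scores = {label: score("Sunny", label) for label in ("Yes", "No")}
print(scores)
print(max(scores, key=scores.get))   # pick the label with the larger posterior score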

2.8 K-Nearest Neighbor (KNN) Algorithm


o K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as for classification, but mostly it is used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.
Why do we need a K-NN algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in
each category.
o Step-5: Assign the new data points to that category for which the number of
the neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
o Firstly, we will choose the number of neighbours; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as:

d = sqrt((x2 - x1)² + (y2 - y1)²)

o By calculating the Euclidean distance we get the nearest neighbours: say, three nearest neighbours in category A and two nearest neighbours in category B.
o Since the 3 nearest neighbours (the majority) are from category A, the new data point must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try
some values to find the best out of them. The most preferred value for K is 5.

o A very low value for K, such as K = 1 or K = 2, can be noisy and leads to the effects of outliers in the model.
o Large values for K are good, but too large a value may cause difficulties.
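A minimal sketch of these steps with scikit-learn (k = 5, Euclidean distance) is given below; the two-dimensional points and their categories are made up for illustration.

# K-NN sketch: classify a new point by the majority class of its 5 nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 7], [2, 1], [7, 6]]
y = ["A", "A", "A", "B", "B", "B", "A", "B"]

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)                  # "training" just stores the dataset (lazy learner)

print(knn.predict([[3, 4]]))   # the new point gets the majority category of its neighbours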
2.9 Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for
predicting the categorical dependent variable using a given set of independent
variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0
and 1, it gives the probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except for how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the
classification.

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is
called the Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.

Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In logistic regression y can only be between 0 and 1, so let's divide y by (1 - y):

y / (1 - y);  this is 0 for y = 0 and infinity for y = 1

o But we need a range between -[infinity] and +[infinity]; taking the logarithm of this expression, the equation becomes:

log [ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for logistic regression.
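A minimal sketch of the sigmoid function and a scikit-learn logistic-regression fit follows; the hours-studied data is made up for illustration.

# Logistic-regression sketch: the sigmoid squashes a linear score into a probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # maps any real value into (0, 1)

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # approx. [0.018, 0.5, 0.982]

# Hours studied (made up) versus pass (1) / fail (0).
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))            # probability of each class for a new value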
Types of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dogs", or
"sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".
2.10 Multinomial Logistic Regression
Multinomial Logistic Regression is a classification technique that extends the logistic
regression algorithm to solve multiclass possible outcome problems, given one or
more independent variables.
Example of Multinomial Logistic Regression:

(a) Which flavour of ice cream will a person choose?

Dependent Variable:
• Vanilla
• Chocolate
• Butterscotch
• Black Currant

Independent Variables:
• Gender
• Age
• Occasion
• Happiness
• etc.

Multinomial Logistic Regression is also known as multiclass logistic regression, softmax regression, polytomous logistic regression, multinomial logit, the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.

Dependent Variable:
The dependent variable can have two or more possible outcomes/classes.
The dependent variable is nominal in nature, meaning there is no ordering among the target classes, i.e. the classes cannot be meaningfully ordered.
The dependent variable to be predicted belongs to a limited, predefined set of items.
2.11 Support Vector Machines (SVM)

Basic Steps
The basic steps of the SVM are:
1. Select two parallel hyperplanes (lines, in 2D) which separate the data with no points between them.
2. Maximize their distance (the margin).
3. The average line (the line halfway between the two chosen lines) will be the decision boundary.
This is very nice and easy, but finding the best margin is a non-trivial optimization problem (it is easy in 2D, when we have only two attributes, but what if we have N dimensions, with N a very big number?).
Non-Linear SVM:
If data is linearly arranged, we can separate it with a straight line, but for non-linear data we cannot draw a single straight line that separates it.
To separate such data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space becomes three-dimensional, and SVM can now divide the dataset into classes with a separating plane. Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back into 2-D space with z = 1, the boundary becomes a circle.
Hence we get a circumference of radius 1 in the case of this non-linear data.

Kernel Methods

Kernels or kernel methods (also called kernel functions) are sets of different types of algorithms that are used for pattern analysis. They are used to solve a non-linear problem by using a linear classifier. Kernel methods are employed in SVMs (Support Vector Machines), which are used in classification and regression problems. The SVM uses what is called the "kernel trick", where the data is transformed and an optimal boundary is found for the possible outputs.
The Need for Kernel Methods and their Working
Before we get into the working of kernel methods, it is important to understand support vector machines (SVMs), because kernels are implemented in SVM models. Support Vector Machines are supervised machine learning algorithms that are used in classification and regression problems, such as classifying an apple into the class fruit while classifying a lion into the class animal.
In two dimensions the plane is the ambient space, but the line which divides or classifies the space has one dimension less than the ambient space and is called a hyperplane.
But what if the input points of the two classes are mixed, for example with one class enclosed inside the other?
It is very difficult to solve this classification using a linear classifier, as there is no good straight line that can separate the red and the green dots when the points are distributed like this. Here comes the use of the kernel function, which takes the points to a higher dimension, solves the problem there, and returns the output. Think of it this way: we can see that the green dots are enclosed in some perimeter area while the red ones lie outside it; likewise, there could be other scenarios where the green dots are distributed in, say, a trapezoid-shaped area.
So what we do is convert the two-dimensional plane, which was first classified by a one-dimensional hyperplane ("or a straight line"), into a three-dimensional space, and here our classifier, i.e. the hyperplane, will not be a straight line but a two-dimensional plane which cuts the area.
To get a mathematical understanding of kernels, consider Lili Jiang's formulation of a kernel: K(x, y) = <f(x), f(y)>, where K is the kernel function, x and y are n-dimensional inputs, f is the map from the n-dimensional to the m-dimensional space, and <x, y> denotes the dot product.
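A minimal sketch of the kernel trick with scikit-learn: on data arranged in concentric circles (generated here only for illustration), a linear SVM performs poorly while an RBF-kernel SVM separates the two classes.

# Kernel-method sketch: non-linear (circular) data handled by an RBF-kernel SVM.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class enclosed inside the other, as in the discussion above (randomly generated).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # straight-line boundary: poor fit
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernel trick: implicit higher dimension

print(linear_svm.score(X, y))                 # roughly 0.5 on this data
print(rbf_svm.score(X, y))                    # close to 1.0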
