ML NOTES (UNIT 1 & 2)
Introduction
Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined machine
learning as “the field of study that gives computers the ability to learn without being
explicitly programmed.” However, there is no universally accepted definition for machine
learning. Different authors define the term differently. We give below two more definitions.
In these definitions we have used the term "model", and we will be using this term in several
contexts later. There appears to be no universally accepted one-sentence definition of this term.
Loosely, it may be understood as some mathematical expression or equation, some
mathematical structure such as a graph or tree, a division of sets into disjoint subsets, a set of
logical "if . . . then . . . else . . ." rules, or some such thing. It may be noted that this is not an
exhaustive list.
Definition of learning
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks T, as measured by P, improves with
experience E.
Examples:
• Handwriting recognition: the task T is recognising handwritten words, the performance
measure P is the percentage of words correctly classified, and the training experience E is a
database of handwritten words with given classifications.
• Chess playing: the task T is playing chess, the performance measure P is the fraction of games
won against opponents, and the training experience E is playing practice games against itself.
A computer program which learns from experience is called a machine learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.
The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. Figure 1.1
illustrates the various components and the steps involved in the learning process.
Data storage: Facilities for storing and retrieving huge amounts of data are an important
component of the learning process. Humans and computers alike utilize data storage as a
foundation for advanced reasoning.
• In a human being, the data is stored in the brain and data is retrieved using electrochemical
signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices
to store data and use cables and other technology to retrieve data.
Abstraction
The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts
about the data as a whole. The creation of knowledge involves application of known models and
creation of new models.
The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original
information.
Generalization
The term generalization describes the process of turning the knowledge about stored data into a
form that can be utilized for future action. These actions are to be carried out on tasks that are
similar, but not identical, to those that have been seen before. In generalization, the goal is to
discover those properties of the data that will be most relevant to future tasks.
Evaluation
Evaluation is the last component of the learning process. It is the process of giving feedback to
the user to measure the utility of the learned knowledge. This feedback is then utilised to effect
improvements in the whole learning process.
1. Application of machine learning methods to large databases is called data mining. In data
mining, a large volume of data is processed to construct a simple model with valuable use, for
example, one having high predictive accuracy.
2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control, and troubleshooting.
4. In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast
enough by computers. The World Wide Web is huge and constantly growing, and searching it
for relevant information cannot be done manually.
5. In artificial intelligence, machine learning is used to teach a system to learn and adapt to
changes so that the system designer need not foresee and provide solutions for all possible
situations.
6. It is used to find solutions to many problems in vision, speech recognition, and robotics.
7. Machine learning methods have been used to develop programmes for playing games
such as chess, backgammon and Go.
Supervised learning:
Supervised learning is the machine learning task of learning a function that maps an input to an
output based on example input-output pairs.
In supervised learning, each example in the training set is a pair consisting of an input object
(typically a vector) and an output value. A supervised learning algorithm analyzes the
training data and produces a function, which can be used for mapping new examples. In the
optimal case, the function will correctly determine the class labels for unseen instances. Both
classification and regression problems are supervised learning problems.
A wide range of supervised learning algorithms are available, each with its strengths and
weaknesses. There is no single learning algorithm that works best on all supervised learning
problems.
Supervised learning is so called because the process of the algorithm learning from the
training dataset can be thought of as a teacher supervising the learning process. We know the
correct answers (that is, the correct outputs); the algorithm iteratively makes predictions on the
training data and is corrected by the teacher. Learning stops when the algorithm achieves an
acceptable level of performance.
Example :
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labelled as “healthy” or “sick”.
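As a minimal sketch of this idea in Python with scikit-learn (the tiny clinic dataset below is made
up, standing in for the table referred to above, since the original table is not reproduced here):

# Minimal supervised-learning sketch: learn from labelled (input, output) pairs.
# The clinic data below is hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Features: [gender (0 = male, 1 = female), age]; labels: "healthy" or "sick"
X = [[0, 41], [1, 62], [1, 33], [0, 59], [1, 25], [0, 70]]
y = ["healthy", "sick", "healthy", "sick", "healthy", "sick"]

model = DecisionTreeClassifier()     # the "learner"
model.fit(X, y)                      # training = fitting the model to labelled examples

print(model.predict([[1, 58]]))      # predict the label for an unseen patient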
Unsupervised learning
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses.
The most common unsupervised learning method is cluster analysis, which is used for
exploratory data analysis to find hidden patterns or grouping in data.
Example :
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
Based on this data, can we infer anything regarding the patients entering the clinic?
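A minimal sketch of unsupervised learning in Python on the same kind of (hypothetical)
unlabelled records, using k-means clustering to group the patients:

# Minimal unsupervised-learning sketch: cluster unlabelled patient records.
# No "healthy"/"sick" labels are provided; the data is hypothetical.
from sklearn.cluster import KMeans

X = [[0, 41], [1, 62], [1, 33], [0, 59], [1, 25], [0, 70]]   # [gender, age]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # groupings discovered from the data alone

print(labels)                        # cluster index assigned to each patient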
Reinforcement learning
A learner (the program) is not told what actions to take as in most forms of machine
learning, but instead must discover which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect not only the immediate reward but also
the next situations and, through that, all subsequent rewards.
For example, consider teaching a dog a new trick: we cannot tell it what to do, but we can
reward/punish it if it does the right/wrong thing. It has to find out what it did that made it get the
reward/punishment. We can use a similar method to train computers to do many tasks, such
as playing backgammon or chess, scheduling jobs, and controlling robot limbs.
Reinforcement learning is different from supervised learning. Supervised learning is learning
from examples provided by a knowledgeable expert.
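A minimal sketch of this reward-driven loop in Python (a toy three-armed bandit with an
epsilon-greedy learner; the hidden reward probabilities are made up for illustration):

# Toy reinforcement-learning sketch: learn action values from rewards alone.
import random

true_reward_prob = [0.2, 0.5, 0.8]   # hidden payout probability of each action (made up)
value_estimate = [0.0, 0.0, 0.0]     # learner's current estimate of each action's value
counts = [0, 0, 0]
epsilon = 0.1                        # fraction of the time we explore a random action

random.seed(0)
for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)                          # explore
    else:
        action = value_estimate.index(max(value_estimate))    # exploit best-known action
    reward = 1 if random.random() < true_reward_prob[action] else 0
    counts[action] += 1
    # incremental average: move the estimate toward the observed reward
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(value_estimate)                # should end up close to the hidden probabilities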
1.2 Feature Selection
"Feature selection is a way of selecting the subset of the most relevant features from the
original feature set by removing the redundant, irrelevant, or noisy features."
While developing a machine learning model, only a few variables in the dataset are useful for
building the model, and the rest of the features are either redundant or irrelevant. If we input the dataset
with all these redundant and irrelevant features, it may negatively impact and reduce the overall
performance and accuracy of the model. Hence it is very important to identify and select the most
appropriate features from the data and remove the irrelevant or less important features, which is
done with the help of feature selection in machine learning.
Feature selection is one of the important concepts of machine learning, which highly impacts the
performance of the model. As machine learning works on the concept of "Garbage In
Garbage Out", so we always need to input the most appropriate and relevant dataset to the
model in order to get a better result.
In this topic, we will discuss different feature selection techniques for machine learning. But
before that, let's first understand some basics of feature selection.
A feature is an attribute that has an impact on a problem or is useful for the problem, and
choosing the important features for the model is known as feature selection. Each machine
learning process depends on feature engineering, which mainly contains two processes;
which are Feature Selection and Feature Extraction.
Although feature selection and extraction processes may have the same objective, both are
completely different from each other. The main difference between them is that feature
selection is about selecting the subset of the original feature set, whereas feature extraction
creates new features.
Feature selection is a way of reducing the input variables for the model by using only relevant data
in order to reduce overfitting in the model.
So, we can define feature selection as, "It is a process of automatically or manually
selecting the subset of the most appropriate and relevant features to be used in model building."
Before implementing any technique, it is really important to understand the need for that
technique, and so too for feature selection. As we know, in machine learning, it is necessary to
provide a pre-processed and good input dataset in order to get better outcomes. We collect a huge
amount of data to train our model and help it to learn better. Generally, the dataset consists of
noisy data, irrelevant data, and some part of useful data. Moreover, the huge amount of
data also slows down the training process of the model, and with noise and irrelevant
data, the model may not predict and perform well. So, it is very necessary to remove
such noise and less-important data from the dataset, and to do this, feature selection
techniques are used.
Selecting the best features helps the model to perform well. For example, suppose we want to
create a model that automatically decides which car should be crushed for a spare part, and to do
this, we have a dataset. This dataset contains the Model of the car, Year, Owner's name, and
Miles. In this dataset, the name of the owner does not contribute to the model
performance, as it does not decide if the car should be crushed or not, so we can remove this
column and select the rest of the features (columns) for model building.
It helps in the simplification of the model so that it can be easily interpreted by the
researchers.
There are mainly two types of Feature Selection techniques, which are:
Supervised Feature selection techniques consider the target variable and can be used for the
labelled dataset.
Unsupervised Feature selection techniques ignore the target variable and can be used for the
unlabelled dataset.
1.2.1 Filter Methods:
In Filter Method, features are selected on the basis of statistics measures. This method does not
depend on the learning algorithm and chooses the features as a pre-processing step.
The filter method filters out the irrelevant features and redundant columns from the model by
using different metrics through ranking.
The advantage of using filter methods is that they need low computational time and do not
overfit the data.
Some common techniques of filter methods are Information Gain, Chi-square Test, Fisher's Score, and Missing Value Ratio.
Information Gain: Information gain determines the reduction in entropy while transforming the
dataset. It can be used as a feature selection technique by calculating the information gain of each
variable with respect to the target variable.
Chi-square Test: Chi-square test is a technique to determine the relationship between the
categorical variables. The chi-square value is calculated between each feature and the target
variable, and the desired number of features with the best chi-square value is selected.
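A minimal filter-method sketch in Python using scikit-learn's chi-square test to keep the two
best features (the iris dataset and the choice of k = 2 are only for illustration; chi2 requires
non-negative feature values):

# Filter method sketch: rank features by chi-square score and keep the best k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # 4 non-negative numeric features

selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                    # chi-square score of each original feature
print(X_selected.shape)                    # (150, 2): only the 2 best features remain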
Fisher's Score:
Fisher's score is one of the popular supervised techniques of feature selection. It returns the rank
of each variable on Fisher's criterion in descending order. We can then select the
variables with a large Fisher's score.
Missing Value Ratio: The value of the missing value ratio can be used for evaluating the feature set against
a threshold value. The formula for obtaining the missing value ratio is the number of missing
values in each column divided by the total number of observations. A variable whose missing
value ratio is more than the threshold value can be dropped.
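A minimal sketch of this check in Python with pandas (the small DataFrame and the 0.4
threshold are made up for illustration):

# Drop columns whose fraction of missing values exceeds a chosen threshold.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 51, 46],
    "income": [np.nan, np.nan, np.nan, 40000, 52000],
    "city":   ["A", "B", "A", np.nan, "C"],
})

threshold = 0.4
missing_ratio = df.isnull().sum() / len(df)           # missing values per column / total rows
print(missing_ratio)

df_reduced = df.loc[:, missing_ratio <= threshold]    # keep columns at or below the threshold
print(df_reduced.columns.tolist())                    # 'income' (3/5 = 0.6 missing) is dropped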
1.2.2 Wrapper Methods:
In wrapper methods, a subset of features is evaluated by actually training a model on it. On the
basis of the output of the model, features are added or removed, and the model is trained again
with the new feature set. Some techniques of wrapper methods are:
Forward selection - Forward selection is an iterative process, which begins with an empty
set of features. After each iteration, it keeps adding on a feature and evaluates
the performance to check whether it is improving the performance or not. The process
continues until the addition of a new variable/feature does not improve the performance of
the model.
Exhaustive Feature Selection - Exhaustive feature selection is one of the best feature
selection methods, which evaluates each feature set by brute force. It means this method tries
and makes every possible combination of features and returns the best-performing feature set.
Recursive Feature Elimination - Recursive feature elimination is a recursive greedy
optimization approach, where features are selected by recursively taking a smaller and smaller
subset of features. An estimator is trained with each set of features, and the importance of each
feature is determined using the coef_ attribute or the feature_importances_ attribute.
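A minimal wrapper-method sketch in Python using scikit-learn's RFE with a logistic-regression
estimator (the dataset and the choice of keeping two features are only illustrative):

# Recursive feature elimination: repeatedly drop the least important feature.
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=2)   # keep the 2 strongest features
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # 1 = selected; larger numbers were eliminated earlier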
1.2.3 Embedded Methods:
Embedded methods combine the advantages of both filter and wrapper methods by
considering the interaction of features along with low computational cost. They are fast
processing methods similar to the filter method, but more accurate than the filter method.
These methods are also iterative: they evaluate each model-training iteration and optimally find
the most important features that contribute the most to training in that iteration. Some
techniques of embedded methods are:
Random Forest Importance - Different tree-based methods of feature selection help us with
feature importance to provide a way of selecting features. Here, feature importance specifies
which feature has more importance in model building or has a great impact on the target
variable. Random Forest is such a tree-based method, which is a type of bagging algorithm that
aggregates a different number of decision trees. It automatically ranks the nodes by their
performance or decrease in impurity (Gini impurity) over all the trees. The nodes are arranged as
per the impurity values, and thus it allows pruning of the tree below a specific node. The
remaining nodes create a subset of the most important features.
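A minimal embedded-method sketch in Python using a random forest's built-in feature
importances (the iris dataset is used only for illustration):

# Embedded method sketch: rank features by random-forest importance.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")     # mean decrease in Gini impurity across the trees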
1.3 Feature Normalization
Although there are many feature normalization techniques in Machine Learning, a few of
them are most frequently used. These are as follows:
1.3.1 Min-max normalization
Min-max normalization (usually called feature scaling) performs a linear transformation on the
original data. This technique gets all the scaled data in the range (0, 1). The formula to achieve
this is the following:
New value = (x − min) / (max − min)
For the three example values, min = 28 and max = 46. Therefore, the min-max normalized
values are:
28 → (28 − 28) / (46 − 28) = 0.00
46 → (46 − 28) / (46 − 28) = 1.00
34 → (34 − 28) / (46 − 28) = 0.33
The min-max technique results in values between 0.0 and 1.0, where the smallest value is
normalized to 0.0 and the largest value is normalized to 1.0.
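A minimal sketch of the same min-max calculation in Python with scikit-learn's MinMaxScaler
on the three example values:

# Min-max normalization of the three example values 28, 46, 34.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[28.0], [46.0], [34.0]])     # one feature, three observations

scaler = MinMaxScaler()                    # scales the values into the range [0, 1]
print(scaler.fit_transform(x).ravel())     # [0.  1.  0.333...]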
1.3.2 Z-score normalization refers to the process of normalizing every value in a dataset
such that the mean of all of the values is 0 and the standard deviation is 1.
We use the following formula to perform a z-score normalization on every value in a dataset:
New value = (x – μ) / σ
where:
x: Original value
μ: Mean of data
σ: Standard deviation of data
For the three example values, mean (μ) = (28 + 46 + 34) / 3 = 108 / 3 = 36.0. The standard
deviation of a set of values is the square root of the sum of the squared differences of each
value and the mean, divided by the number of values, and so is:
σ = sqrt( ((28 − 36)² + (46 − 36)² + (34 − 36)²) / 3 ) = sqrt(168.0 / 3) = sqrt(56.0) ≈ 7.48
The z-score normalized values are therefore (28 − 36) / 7.48 ≈ −1.07, (46 − 36) / 7.48 ≈ 1.34
and (34 − 36) / 7.48 ≈ −0.27.
A z-score normalized value that is positive corresponds to an x value that is greater than the
mean value, and a z-score that is negative corresponds to an x value that is less than
the mean.
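A minimal sketch of the same z-score calculation in Python with scikit-learn's StandardScaler
(which uses the population standard deviation, matching the hand calculation above):

# Z-score normalization of the three example values 28, 46, 34.
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[28.0], [46.0], [34.0]])

scaler = StandardScaler()                  # subtract the mean, divide by the standard deviation
print(scaler.fit_transform(x).ravel())     # approx. [-1.07  1.34 -0.27]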
• Feature extraction
In feature extraction, we are interested in finding a new set of k features that are the
combination of the original n features. These methods may be supervised or unsupervised
depending on whether or not they use the output information. The best known and most
widely used feature extraction methods are Principal Components Analysis (PCA) and Linear
Discriminant Analysis (LDA), which are both linear projection methods, unsupervised and
supervised respectively.
Step 4. Calculate the eigenvalues and eigenvectors of the covariance matrix
Let S be the covariance matrix and let I be the identity matrix having the same dimension
as the dimension of S.
i) Set up the characteristic equation det(S − λI) = 0 and solve it for λ to obtain the eigenvalues of S.
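A minimal numpy sketch of this step on a small made-up two-feature dataset (assuming the
earlier steps of centring the data and forming the covariance matrix):

# Eigen-decomposition step of PCA on a small made-up 2-feature dataset.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)                 # centre the data first
S = np.cov(X_centered, rowvar=False)            # covariance matrix (features as columns)

eigenvalues, eigenvectors = np.linalg.eigh(S)   # solves det(S - lambda*I) = 0
print(eigenvalues)                              # variance captured along each component
print(eigenvectors)                             # columns are the principal directions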
Advantages of Dimensionality Reduction
It helps in data compression, and hence reduced storage space.
It reduces computation time.
It also helps remove redundant features, if any.
Improved Visualization: High dimensional data is difficult to visualize,
and dimensionality reduction techniques can help in visualizing the data in 2D or 3D,
which can help in better understanding and analysis.
Overfitting Prevention: High dimensional data may lead to overfitting in
machine learning models, which can lead to poor generalization performance.
Dimensionality reduction can help in reducing the complexity of the data, and hence
prevent overfitting.
Feature Extraction: Dimensionality reduction can help in extracting important
features from high dimensional data, which can be useful in feature selection for
machine learning models.
Data Preprocessing: Dimensionality reduction can be used as a preprocessing
step before applying machine learning algorithms to reduce the dimensionality of
the data and hence improve the performance of the model.
Improved Performance: Dimensionality reduction can help in improving
the performance of machine learning models by reducing the complexity of the
data, and hence reducing the noise and irrelevant information in the data.
Consider a situation where you have plotted the relationship between two variables where
each color represents a different class. One is shown with a red color and the other with
blue.
If you are willing to reduce the number of dimensions to 1, you can just project everything
to the x-axis as shown below:
This approach neglects any helpful information provided by the second feature. However,
you can use LDA to plot it. The advantage of LDA is that it uses information from both the
features to create a new axis which minimizes the within-class variance and maximizes
the distance between the means of the two classes.
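A minimal sketch in Python of projecting two features onto the single LDA axis with
scikit-learn (a two-class toy dataset is generated only for illustration):

# Project two-class, two-feature data onto the single discriminant axis found by LDA.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=1)   # at most (number of classes - 1) components
X_1d = lda.fit_transform(X, y)

print(X_1d.shape)                                  # (100, 1): one discriminant dimension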
Drawbacks of Linear Discriminant Analysis (LDA)
LDA also fails in some cases where the means of the class distributions are shared. In this case,
LDA fails to create a new axis that makes both the classes linearly separable.
Real-world Applications of LDA
Some of the common real-world applications of Linear discriminant Analysis are
given below:
o Face Recognition
Face recognition is the popular application of computer vision, where each
face is represented as the combination of a number of pixel values. In this case,
LDA is used to minimize the number of features to a manageable number before
going through the classification process. It generates a new template in which each
dimension consists of a linear combination of pixel values. If a linear
combination is generated using Fisher's linear discriminant, then it is called
Fisher's face.
o Medical
In the medical field, LDA has a great application in classifying the patient disease
on the basis of various parameters of patient health and the medical treatment
which is going on. On such parameters, it classifies disease as mild, moderate, or
severe. This classification helps the doctors in either increasing or decreasing
the pace of the treatment.
o Customer Identification
LDA is also applied in customer identification. With the help of LDA, we can easily
identify and select the features that specify the group of customers who are likely to
purchase a specific product in a shopping mall. This can be helpful when we want to
identify a group of customers who mostly purchase a product in a shopping mall.
o For Predictions
LDA can also be used for making predictions and hence in decision making.
For example, "will you buy this product?" will give a predicted result of one of two
possible classes, buying or not buying.
o In Learning
Nowadays, robots are being trained for learning and talking to simulate human
work, and it can also be considered a classification problem. In this case, LDA
builds similar groups on the basis of different parameters, including pitches,
frequencies, sound, tunes, etc.
UNIT – II
Linear regression:
The linear regression algorithm shows a linear relationship between a dependent (y) and
one or more independent (x) variables, hence it is called linear regression.
The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:
Mean Squared Error represents the average of the squared difference between
the original and predicted values in the data set. It measures the variance of the
residuals.
Root Mean Squared Error is the square root of Mean Squared error. It measures
the standard deviation of residuals.
For comparing the accuracy among different linear regression models, RMSE
is a better choice than R Squared.
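A minimal sketch in Python of fitting a linear regression and computing MSE and RMSE with
scikit-learn (the small x/y data is made up for illustration):

# Fit a line to made-up (x, y) points and report MSE and RMSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

model = LinearRegression().fit(x, y)       # y ≈ slope * x + intercept
y_pred = model.predict(x)

mse = mean_squared_error(y, y_pred)        # average squared residual
rmse = np.sqrt(mse)                        # same units as y
print(model.coef_, model.intercept_, mse, rmse)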
2.6.1 ID3
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each
step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree.
In simple words, the top-down approach means that we start building the tree from the top
and the greedy approach means that at each iteration we select the best feature at the
present moment to create a node.
Generally, ID3 is used only for classification problems with nominal features.
ID3 Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows don’t belong to the same class, split the dataset S into
subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information
gain.
4. If all rows belong to the same class, make the current node as a leaf node with the
class as its label.
5. Repeat for the remaining features until we run out of all features, or the decision tree
has all leaf nodes.
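A minimal sketch in Python of the Information Gain calculation from step 1, on a tiny made-up
nominal dataset:

# Entropy and information gain for one candidate split, as used by ID3.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# Made-up nominal feature ("outlook") and class labels ("play")
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast", "sunny", "rain"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes",      "yes",   "yes"]

parent_entropy = entropy(play)
weighted_child_entropy = sum(
    (outlook.count(v) / len(outlook)) * entropy([p for o, p in zip(outlook, play) if o == v])
    for v in set(outlook)
)
print(parent_entropy - weighted_child_entropy)   # information gain of splitting on "outlook"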
2.6.2 CART Algorithm
The CART algorithm works via the following process:
The best split point of each input is obtained.
Based on the best split points of each input in Step 1, the new “best” split point
is identified.
Split the chosen input according to the “best” split point.
Continue splitting until a stopping rule is satisfied or no further desirable splitting is
available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by
searching for the best homogeneity for the sub-nodes, with the help of the Gini index criterion.
Gini index/Gini impurity
The Gini index is a metric for classification tasks in CART. It is based on the sum of squared
probabilities of each class: Gini = 1 − Σ pᵢ², where pᵢ is the probability of class i. It measures
how often a randomly chosen element would be wrongly classified if it were labelled randomly
according to the class distribution, and it is a variation of the Gini coefficient. It works on
categorical variables, provides outcomes of either "success" or "failure", and hence conducts
binary splitting only.
The degree of the Gini index varies from 0 to 1:
A value of 0 indicates that all the elements belong to a single class, or that only one class
exists there.
A value close to 1 indicates that the elements are randomly distributed across various
classes, and
A value of 0.5 indicates that the elements are uniformly distributed over two classes.
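A minimal sketch in Python of the Gini impurity computation used by CART, on made-up sets
of class labels:

# Gini impurity: 1 - sum of squared class probabilities.
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes", "yes"]))    # 0.0 -> all elements belong to one class
print(gini(["yes", "no", "yes", "no"]))      # 0.5 -> evenly split between two classes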
Classification tree
A classification tree is an algorithm where the target variable is categorical. The algorithm
is then used to identify the “Class” within which the target variable is most likely
to fall. Classification trees are used when the dataset needs to be split into classes that
belong to the response variable (like yes or no).
Regression tree
A Regression tree is an algorithm where the target variable is continuous and the tree is
used to predict its value. Regression trees are used when the response variable is
continuous. For example, if the response variable is the temperature of the day.
CART models are formed by picking input variables and evaluating split points on
those variables until an appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm:
Greedy algorithm: In this, the input space is divided using the greedy method,
which is known as recursive binary splitting. This is a numerical method within
which all of the values are lined up and several split points are tried and
assessed using a cost function.
Stopping Criterion: As it works its way down the tree with the training data,
the recursive binary splitting method described above must know when to stop splitting.
The most frequent halting method is to utilize a minimum amount of training data
allocated to every leaf node. If the count is smaller than the specified threshold, the
split is rejected and also the node is considered the last leaf node.
Tree pruning: Decision tree’s complexity is defined as the number of splits
in the tree. Trees with fewer branches are recommended as they are simple to
grasp and less prone to cluster the data. Working through each leaf node in the tree and
evaluating the effect of deleting it using a hold-out test set is the quickest and
simplest pruning approach.
Data preparation for the CART: No special data preparation is required for
the CART algorithm.
2.7 Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on
Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simplest and most effective
classification algorithms, which helps in building fast machine learning
models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which
can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature
is independent of the occurrence of other features. For example, if a fruit is identified
on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is
recognized as an apple. Hence each feature individually contributes to identifying
that it is an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends
on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of the evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not on a
particular day according to the weather conditions. To solve this problem, we need to
follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
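A minimal sketch of these steps in Python using scikit-learn's categorical Naive Bayes on a
made-up weather/Play table (the data and the encoding of the categories are illustrative):

# Naive Bayes on a made-up weather dataset: encode categories, fit, predict.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

weather = [["sunny"], ["sunny"], ["overcast"], ["rainy"], ["rainy"],
           ["overcast"], ["sunny"], ["rainy"]]
play    = ["no", "no", "yes", "yes", "no", "yes", "yes", "yes"]

encoder = OrdinalEncoder()
X = encoder.fit_transform(weather).astype(int)   # frequency counting on encoded categories

model = CategoricalNB()
model.fit(X, play)

new_day = encoder.transform([["sunny"]]).astype(int)
print(model.predict(new_day))            # most probable class for a sunny day
print(model.predict_proba(new_day))      # posterior probabilities from Bayes' theorem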
2.8 K-Nearest Neighbour (K-NN) Algorithm
o By calculating the Euclidean distance we get the nearest neighbours: three nearest
neighbours in category A and two nearest neighbours in category B. Consider the
below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data
point must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try
some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
o Large values for K reduce the effect of noise, but a value that is too large may include
points from other classes and blur the class boundaries.
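A minimal sketch in Python with scikit-learn's K-NN classifier using K = 5 (the two-class toy
dataset is generated only for illustration):

# K-nearest neighbours with K = 5: classify a new point by a majority vote of its neighbours.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=60, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)      # K = 5, Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[0.5, -0.2]]))              # class of the majority of the 5 nearest neighbours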
2.9 Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for
predicting the categorical dependent variable using a given set of independent
variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0
and 1, it gives the probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the
classification.
The below image is showing the logistic function:
Dependent Variable:
The dependent Variable can have two or more possible outcomes/classes.
The dependent variable is nominal in nature, meaning there is no ordering among the
target classes, i.e. these classes cannot be meaningfully ordered.
The dependent variable to be predicted belongs to a limited, defined set of items.
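A minimal sketch of logistic regression in Python with scikit-learn, showing both the class
prediction and the probability between 0 and 1 (the single-feature data is made up):

# Logistic regression: predicted class plus the probability given by the S-shaped curve.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # e.g. hours studied (made up)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[4.5]]))          # discrete class: 0 or 1
print(model.predict_proba([[4.5]]))    # probabilities, which lie between 0 and 1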
Basic Steps of SVM
The basic steps of the SVM are:
1. select two hyperplanes (in 2D) which separates the data with no points
between them (red lines)
2. maximize their distance (the margin)
3. the average line (here the line half way between the two red lines) will be the
decision boundary
This is very nice and easy, but finding the best margin is a non-trivial optimization problem
(it is easy in 2D, when we have only two attributes, but not when we have N dimensions
with N a very big number).
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z.
It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below
image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we
convert it in 2d space with z=1, then it will become as:
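A minimal sketch of this idea in Python: generate circular (non-linearly separable) data, add the
third feature z = x² + y², and fit a linear SVM in the new space (the dataset is generated only
for illustration):

# Make circle-shaped data linearly separable by adding the feature z = x^2 + y^2.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)   # the extra dimension z = x^2 + y^2
X_3d = np.hstack([X, z])

clf = SVC(kernel="linear")                         # a plane in 3-D now separates the two circles
clf.fit(X_3d, y)
print(clf.score(X_3d, y))                          # close to 1.0 on this toy data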
Kernel Methods
Kernels or kernel methods (also called Kernel functions) are sets of different types
of algorithms that are being used for pattern analysis. They are used to solve a non-
linear problem by using a linear classifier. Kernels Methods are employed in SVM (Support
Vector Machines) which are used in classification and regression problems. The SVM uses
what is called a “Kernel Trick” where the data is transformed and an optimal boundary is
found for the possible outputs.
The Need for Kernel Methods and their Working
Before we get into the working of the Kernel Methods, it is more important to
understand support vector machines or the SVMs because kernels are implemented in SVM
models. So, Support Vector Machines are supervised machine learning algorithms
that are used in classification and regression problems such as classifying an apple
to class fruit while classifying a Lion to the class animal.
Here we have a 2-dimensional ambient space, but the line which divides or classifies the space
is one dimension less than the ambient space and is called a hyperplane.
But what if we have input like
this:
It is very difficult to solve this classification using a linear classifier, as there is no good
linear line that would be able to classify the red and the green dots, because the points
are randomly distributed. Here comes the use of the kernel function, which takes the
points to a higher dimension, solves the problem there, and returns the output. Think
of it this way: we can see that the green dots are enclosed in some perimeter
area while the red ones lie outside it; likewise, there could be other scenarios where the
green dots are distributed in a trapezoid-shaped area.
So what we do is convert the two-dimensional plane, which was first classified by a
one-dimensional hyperplane ("or a straight line"), into a three-dimensional space, and
here our classifier, i.e. the hyperplane, will not be a straight line but a two-dimensional
plane which will cut the space.
In order to get a mathematical understanding of kernels, let us look at Lili Jiang's
equation of a kernel, which is: K(x, y) = <f(x), f(y)>, where K is the kernel function,
x and y are the n-dimensional inputs, f is the map from n-dimensional
to m-dimensional space, and <x, y> denotes the dot product.
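A minimal sketch in Python of this equation for a degree-2 polynomial kernel: computing
K(x, y) directly gives the same number as mapping x and y with f and then taking the dot
product (the two vectors are made up for illustration):

# Kernel trick check: K(x, y) = <f(x), f(y)> for the degree-2 polynomial kernel K(x, y) = (x . y)^2.
import numpy as np

def f(v):
    # explicit feature map from 2 dimensions to 3 dimensions
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

kernel_value = np.dot(x, y) ** 2       # computed directly in the low-dimensional space
explicit_value = np.dot(f(x), f(y))    # dot product after mapping to the higher-dimensional space

print(kernel_value, explicit_value)    # both are 121.0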