UNIT II Machine Learning
Classification and Regression in Machine Learning
• For example, a classification algorithm will learn to identify animals after being trained on a dataset of images that are properly labeled with the species of the animal and some identifying characteristics.
• Supervised learning problems can be further grouped into Regression and Classification problems.
• Both problems have as their goal the construction of a concise model that can predict the value of the dependent attribute from the attribute variables.
• The difference between the two tasks is that the dependent attribute is numerical for regression and categorical for classification.
Classification and Regression in Machine Learning
The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., while Classification algorithms are used to predict/classify discrete values such as Yes or No, Spam or Not Spam, etc.
Classification in Machine Learning
• A classification problem is when the output variable is a category, such as “apple” or “mango”, or “yes” and “no”. A classification model attempts to draw some conclusion from observed values.
• Given one or more inputs, a classification model will try to predict the value of one or more outcomes.
• For example, when filtering emails as “spam” or “not spam”, or when labeling transaction data as “fraudulent” or “authorized”.
• In short, classification either predicts categorical class labels or classifies data (constructs a model) based on the training set and the values (class labels) of the classifying attributes, and then uses the model to classify new data.
• In classification, data is categorized under different labels according to some parameters given
in input and then the labels are predicted for the data.
• The derived mapping function can be expressed in the form of “IF-THEN” rules.
• The classification process deals with problems where the data can be divided into binary or multiple discrete labels.
• Let’s take an example: suppose we want to predict the possibility of Team A winning a match on the basis of some parameters recorded earlier. Then there would be two labels, Yes and No.
Regression in Machine Learning
• Regression can also identify how the distribution moves depending on the historical data.
• Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane that goes through the points. Common variants include:
• Multiple Linear Regression
• Polynomial Regression
Basis | Classification | Regression
Basic | Mapping function is used for mapping values to predefined classes. | Mapping function is used for mapping values to continuous output.
Involves prediction of | Discrete values | Continuous values or real values
Nature of the predicted data | Unordered | Ordered
Method of calculation | By measuring accuracy | By measuring root mean square error
Algorithms | Decision tree, logistic regression, etc. | Regression tree (random forest), linear regression, etc.
Output | Tries to find the decision boundary, which can divide the dataset into different classes. | Tries to find the best-fit line, which can predict the output more accurately.
Example | Classification algorithms can be used to solve problems such as identification of spam emails, speech recognition, identification of cancer cells, etc. | Regression algorithms can be used to solve problems such as weather prediction, house price prediction, etc.
Types | Classification algorithms can be divided into binary classifiers and multi-class classifiers. | Regression algorithms can be further divided into linear and non-linear regression.
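To make the contrast concrete, here is a minimal Python sketch (assuming scikit-learn is available; the tiny datasets are invented purely for illustration), showing a classifier predicting a discrete label and a regressor predicting a continuous value:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: discrete labels (e.g. 0 = "not spam", 1 = "spam")
X_cls = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_cls = [1, 0, 1, 0]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[1, 0]]))   # -> a class label, e.g. [1]

# Regression: continuous target (e.g. a price)
X_reg = [[1], [2], [3], [4]]
y_reg = [10.0, 20.0, 30.0, 40.0]
reg = DecisionTreeRegressor().fit(X_reg, y_reg)
print(reg.predict([[2.5]]))    # -> a real value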
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
• Support Vector Machines
Decision Tree Learning
• A decision tree is a decision support tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event outcomes, resource costs, and utility.
• It is one way to display an algorithm that only contains conditional control statements.
• each internal node (decision node) represents a “test” on an attribute (e.g. whether a coin
flip comes up heads or tails),
• each branch represents the outcome of the test,
• each leaf node represents a class label (decision taken after computing all attributes).
• Tree-based methods empower predictive models with high accuracy, stability, and ease of interpretation.
• The root node splits further into the next decision node (distance from the office) and one leaf
node based on the corresponding labels.
• The next decision node further gets split into one decision node (Cab facility) and one
leaf node.
• Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Example 1
• Consider the below diagram:
Example 2
• Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
• An instance is classified by starting at the root node, testing the attribute specified by this node, and then moving down the tree branch corresponding to the value of the attribute in the given instance.
• This process is then repeated for the subtree rooted at the new node.
• This algorithm compares the values of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with those of the other sub-nodes and moves further.
• It continues this process until it reaches a leaf node of the tree.
Decision Tree algorithm
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are the leaf nodes (see the sketch below).
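As a rough illustration of these steps, the sketch below uses scikit-learn's DecisionTreeClassifier, which performs the attribute selection and recursive splitting internally; the encoded weather data is a hypothetical stand-in, not the full play-golf table used later:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: Outlook (0=Rainy, 1=Overcast, 2=Sunny), Windy (0=False, 1=True)
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = ["No", "No", "Yes", "Yes", "No", "Yes"]

# criterion="entropy" selects attributes by information gain (an Attribute Selection Measure)
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["Outlook", "Windy"]))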
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes.
• To solve such problems, a technique called an Attribute Selection Measure (ASM), such as Information Gain or the Gini Index, is used. With such a measure, the user can easily select the best attribute for the nodes of the tree.
Example
a) Entropy using the frequency table of one attribute:
Entropy(PlayGolf) = Entropy(5/14, 9/14)
= Entropy(0.36, 0.64)
= -(0.36 log2 0.36) - (0.64 log2 0.64)
= 0.53 + 0.41
= 0.94
Example
b) Entropy using the frequency table of two attributes:
Entropy(T, X) = Σ P(c) × Entropy(c), i.e., the weighted average of the entropy of each branch c of attribute X.
• Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated and added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain, or decrease in entropy: Gain(T, X) = Entropy(T) - Entropy(T, X).
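These two steps can be sketched in plain Python (standard library only), using the Outlook and Play Golf columns of the 14-row play-golf table shown later; the base entropy matches the 0.94 computed above:

from math import log2
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

play = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]
outlook = ["Rainy","Rainy","Overcast","Sunny","Sunny","Sunny","Overcast",
           "Rainy","Rainy","Sunny","Rainy","Overcast","Overcast","Sunny"]

base = entropy(play)                          # ~0.94 (Step 1)
# Weighted entropy after splitting on Outlook (Step 2)
split = sum((outlook.count(v) / len(play)) *
            entropy([p for o, p in zip(outlook, play) if o == v])
            for v in set(outlook))
print("Information gain for Outlook:", base - split)   # ~0.247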
• Step 3: Choose the attribute with the largest information gain (here, Outlook) as the decision node, divide the dataset by its branches, and repeat the same process on every branch.
Example
• Step 4a: A branch with entropy of 0 is a leaf node.
Entropy(Overcast) = E(4, 0) = 0.0
Example
• Step 4b: A branch with entropy more than 0 needs further splitting.
• Step 5: The ID3 algorithm is run recursively on the non-leaf branches until all data is classified.
Decision Tree to Decision Rules
• A decision tree can easily be transformed into a set of rules by mapping the paths from the root node to the leaf nodes one by one, as illustrated below.
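For example, the play-golf tree developed above would map to rules of the following form (a sketch; the exact rule set depends on the tree that was learned):

IF Outlook = Overcast THEN Play = Yes
IF Outlook = Rainy AND Humidity = High THEN Play = No
IF Outlook = Rainy AND Humidity = Normal THEN Play = Yes
IF Outlook = Sunny AND Windy = True THEN Play = No
IF Outlook = Sunny AND Windy = False THEN Play = Yes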
Types of Decision Trees
The type of a decision tree is based on the type of target variable the user has. It can be of two types:
• Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a categorical variable decision tree. E.g., in the above scenario of the student problem, where the target variable was “Student will play golf or not”, i.e. YES or NO.
• Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a continuous variable decision tree.
Advantages of Decision Tree
• Easy to understand
• Decision trees require relatively little effort from users for data preparation.
• Less data cleaning required
• Data type is not a constraint
• Non-parametric method
• Non-linear relationships between parameters do not affect tree performance.
Disadvantages of Decision Tree
• Overfitting
• Not fit for continuous variables
• Calculations can become complex when there are many class labels.
• It generally gives lower prediction accuracy for a dataset compared to other machine learning algorithms.
• Information gain in a decision tree with categorical variables gives a biased response toward attributes with a greater number of categories.
Applications of Decision Tree
• Direct Marketing
• Customer Retention
• Fraud Detection
• Diagnosis of Medical Problems
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
• Support Vector Machines
Naïve Bayes
• The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes’ theorem, and used for solving classification problems.
• With the help of Bayes’ theorem, we can express this in quantitative form as follows:
P(c | x) = P(x | c) P(c) / P(x)
where ‘c’ is the class variable and ‘x’ is a dependent feature vector of size n, x = (x1, x2, ..., xn).
Example: Naïve Bayes
Predictors: Outlook, Temp, Humidity, Windy. Target: Play Golf.

Outlook | Temp | Humidity | Windy | Play Golf
Rainy | Hot | High | False | No
Rainy | Hot | High | True | No
Overcast | Hot | High | False | Yes
Sunny | Mild | High | False | Yes
Sunny | Cool | Normal | False | Yes
Sunny | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Rainy | Mild | High | False | No
Rainy | Cool | Normal | False | Yes
Sunny | Mild | Normal | False | Yes
Rainy | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Sunny | Mild | High | True | No

Total no. of samples for class 1: Play_golf = “Yes” = 9
Total no. of samples for class 2: Play_golf = “No” = 5
Example: Naïve Bayes
For data sample X = (Outlook = Rainy, Temp = Cool, Humidity = High, Windy = True):
P(X | Play = Yes) = P(Rainy | Yes) × P(Cool | Yes) × P(High | Yes) × P(True | Yes) = (2/9) × (3/9) × (3/9) × (3/9) ≈ 0.0081
P(X | Play = No) = P(Rainy | No) × P(Cool | No) × P(High | No) × P(True | No) = (3/5) × (1/5) × (4/5) × (3/5) ≈ 0.0567
Example: Naïve Bayes
• Prior for class “Yes”: P(Yes) = 9/14 ≈ 0.64; prior for class “No”: P(No) = 5/14 ≈ 0.36
P(X | Yes) × P(Yes) = 0.0081 × 0.64 = 0.0051
P(X | No) × P(No) = 0.0567 × 0.36 = 0.0204
• Since 0.0204 > 0.0051, the sample X is classified as Play_golf = “No”.
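The same computation can be reproduced with a short plain-Python sketch over the 14-row table above (standard library only); small differences from the slide’s rounded figures are expected:

rows = [  # (Outlook, Temp, Humidity, Windy, Play)
    ("Rainy","Hot","High","False","No"), ("Rainy","Hot","High","True","No"),
    ("Overcast","Hot","High","False","Yes"), ("Sunny","Mild","High","False","Yes"),
    ("Sunny","Cool","Normal","False","Yes"), ("Sunny","Cool","Normal","True","No"),
    ("Overcast","Cool","Normal","True","Yes"), ("Rainy","Mild","High","False","No"),
    ("Rainy","Cool","Normal","False","Yes"), ("Sunny","Mild","Normal","False","Yes"),
    ("Rainy","Mild","Normal","True","Yes"), ("Overcast","Mild","High","True","Yes"),
    ("Overcast","Hot","Normal","False","Yes"), ("Sunny","Mild","High","True","No"),
]
X = ("Rainy", "Cool", "High", "True")   # the query sample

def score(cls):
    subset = [r for r in rows if r[4] == cls]
    prior = len(subset) / len(rows)          # P(cls)
    likelihood = 1.0
    for i, value in enumerate(X):            # naive independence assumption
        likelihood *= sum(r[i] == value for r in subset) / len(subset)
    return prior * likelihood

print("Yes:", score("Yes"))   # ~0.0053
print("No:", score("No"))     # ~0.0206 -> classified as "No"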
Types of Naïve Bayes
1. Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.
2. Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.
3. Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• Naïve Bayes requires only a small amount of training data to estimate its parameters, so the training period is short.
Disadvantages of Naïve Bayes Classifier:
• Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.
• If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the Zero Frequency problem, and it can be mitigated with a smoothing technique such as Laplace estimation.
Applications of Naïve Bayes Classifier
• Real-time Prediction: Naïve Bayes is an eager learning classifier and is certainly fast. Thus, it can be used for making predictions in real time.
• Multi-class Prediction: This algorithm is also well known for its multi-class prediction feature. It can predict the probability of multiple classes of the target variable.
• Text Classification / Spam Filtering / Sentiment Analysis: Naïve Bayes classifiers, mostly used in text classification (due to better results in multi-class problems and the independence rule), have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
• Recommendation Systems: A Naïve Bayes classifier together with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
• Support Vector Machines
Linear Regression
• Linear regression is one of the easiest and most popular Machine Learning algorithms.
• Linear regression shows a linear relationship, which means it finds how the value of the dependent variable changes according to the value of the independent variable.
Linear Regression
• The linear regression model provides a sloped straight line representing the relationship between the variables. Consider the image:
Y = a0 + a1X + ε
• Here, Y is the dependent (target) variable, X is the independent (predictor) variable, a0 is the intercept of the line, a1 is the linear regression coefficient (the slope), and ε is the random error.
• +ve line of regression: the slope is positive, and the line equation is Y = a0 + a1x.
• -ve line of regression: the slope is negative, and the line equation is Y = a0 - a1x (i.e., the coefficient of x is negative).
Example: Making Predictions with Linear Regression
• Given that the representation is a linear equation, making predictions is as simple as solving the equation for a specific set of inputs.
• Imagine we are predicting weight (y) from height (x).
• A linear regression model representation for this problem would be:
Y = b0 + b1X
or
weight = b0 + b1 * height
Example: Making Predictions with Linear Regression
• Here b0 is the bias coefficient and b1 is the coefficient for the height column.
• Once these are found, the user can plug in different height values to predict the weight.
• Suppose b0 = 0.1 and b1 = 0.5 (illustrative values). Let’s plug them in and calculate the weight (in kilograms) for a person with a height of 182 centimeters:
weight = 0.1 + 0.5 × 182 = 91.1
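A minimal Python sketch of this prediction, assuming the illustrative coefficients b0 = 0.1 and b1 = 0.5 used above:

# Simple linear regression prediction: weight = b0 + b1 * height
b0, b1 = 0.1, 0.5               # assumed coefficients from the worked example

def predict_weight(height_cm):
    return b0 + b1 * height_cm

print(predict_weight(182))      # 91.1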
Example: Making Predictions with Linear Regression
• The above equation could be plotted as a line in two dimensions.
• Rescale inputs: Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization.
Types of Linear Regression
• Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, such a Linear Regression algorithm is called Simple Linear Regression:
Y = a0 + a1x + ε
• Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, such a Linear Regression algorithm is called Multiple Linear Regression.
• In Multiple Linear Regression, the dependent variable (Y) is a linear combination of multiple independent variables x1, x2, x3, ..., xn.
• Since it is an enhancement of Simple Linear Regression, the same form applies, and the equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn + ε
Advantages of Linear Regression:
• Easier to implement and interpret, and efficient to train.
• One more advantage is the possibility of extrapolation beyond a specific data set.
Disadvantages of Linear Regression:
• It is often quite prone to noise and overfitting.
• It is prone to multicollinearity.
Applications of Linear Regression
• Sales Forecasting
• Risk Analysis
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
• Support Vector Machines
Logistic Regression
• Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.
• The outcome can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression
• Logistic Regression is much like Linear Regression except in how each is used.
• The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a person is obese or not based on their weight.
Logistic Regression
• The mathematical steps to obtain the Logistic Regression equation are given below:
1. Start from the straight-line equation: y = b0 + b1x1 + b2x2 + ... + bnxn
2. In logistic regression, y can only be between 0 and 1, so form the odds ratio: y / (1 - y)
3. Take the logarithm to obtain the logit, which can range over all real values:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
Types of Logistic Regression:
1. Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
3. Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High".
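A brief scikit-learn sketch of binomial logistic regression; the tumour-size numbers below are invented for illustration only:

from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumour size (cm) -> 0 = benign, 1 = malignant
X = [[1.0], [1.5], [2.0], [3.5], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)
# predict_proba gives the probabilistic value between 0 and 1
print(model.predict_proba([[2.8]]))  # [[P(benign), P(malignant)]]
print(model.predict([[2.8]]))        # thresholded class label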
Applications of Logistic Regression
• Spam Detection
• Spam detection is a binary classification problem where we are given an email and we
need to classify whether or not it is spam.
• In order to apply Logistic Regression to the spam detection problem, the following features of the email are extracted: the sender of the email, the number of typos in the email, the occurrence of words/phrases like “offer”, “prize”, “free gift”, etc.
• The resulting feature vector is then used to train a Logistic classifier which emits a score in the
range 0 to 1. If the score is more than 0.5, we label the email as spam. Otherwise, we don’t
label it as spam.
• Credit Card Fraud Detection
• In the banking sector, when a credit card transaction happens, the bank makes a note of several factors, for instance the date of the transaction, the amount, the place, the type of purchase, etc. Based on these factors, it develops a Logistic Regression model to predict whether or not the transaction is fraudulent. For instance, if the amount is too high and the bank knows that the concerned person never makes purchases that high, it may label the transaction as a fraud.
• Tumour Prediction
• A Logistic Regression classifier may be used to identify whether a tumour is malignant or if
it is benign. Several medical imaging techniques are used to extract various features of
tumours. For instance, the size of the tumour, the affected body area, etc. These features are then
fed to a Logistic Regression classifier to identify if the tumour is malignant or if it is benign.
• Marketing
• Every day, when you browse your Facebook newsfeed, the powerful
algorithms running behind the scene predict whether or not you would be
interested in certain content (which could be, for instance, an advertisement).
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
• Support Vector Machines
Support Vector Machines
• Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression problems.
• However, primarily, it is used for Classification problems in Machine Learning.
• SVMs have their unique way of implementation as compared to other machine
learning algorithms.
• Lately, they are extremely popular because of their ability to handle multiple
continuous and categorical variables.
• SVM algorithm can be used for Face detection, image classification,
text categorization, etc.
Support Vector Machines
• The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that one can easily put
the new data point in the correct category in the future.
• SVM chooses the extreme points/vectors that help in creating the hyperplane.
• These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
• Consider the below diagram, in which two different categories are classified using a decision boundary, or hyperplane:
Example
• Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm.
• We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature.
• Since the support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), it will consider the extreme cases of cats and dogs.
• On the basis of the support vectors, it will classify the creature as a cat.
Consider the below diagram:
Types of Support Vector Machines
• Linear SVM:
• Linear SVM is used for linearly separable data; that is, if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM:
• Non-linear SVM is used for non-linearly separable data; that is, if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
• Support Vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
• Linear SVM: The working of the SVM algorithm is shown using an example. Consider the below image:
• Non-linear SVM: For data that cannot be separated by a straight line, a third dimension can be added, for example:
z = x² + y²
How does SVM work?
• So now, SVM will divide the datasets into classes in the following way. Consider the below image:
How does SVM work?
• SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space.
• In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions.
SVM Kernels
• Linear Kernel: It can be used as a dot product between any two observations. The formula of the linear kernel is as below:
K(x, xi) = sum(x × xi)
• From the above formula, we can see that the product between two vectors x and xi is the sum of the multiplication of each pair of input values.
• Polynomial Kernel: It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. Following is the formula for the polynomial kernel:
K(x, xi) = (1 + sum(x × xi))^d
• Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm. Both kernels are sketched in code below.
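The two kernel formulas can be written directly in Python (standard library only); x and xi are assumed to be equal-length numeric sequences:

def linear_kernel(x, xi):
    # K(x, xi) = sum(x * xi): the dot product of the two vectors
    return sum(a * b for a, b in zip(x, xi))

def polynomial_kernel(x, xi, d=2):
    # K(x, xi) = (1 + sum(x * xi))^d, with manually chosen degree d
    return (1 + linear_kernel(x, xi)) ** d

print(linear_kernel([1, 2], [3, 4]))       # 11
print(polynomial_kernel([1, 2], [3, 4]))   # 144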
SVM Kernels
• Radial Basis Function (RBF) Kernel: It maps the input space into a higher-dimensional space; a commonly used form is K(x, xi) = exp(-gamma × sum((x - xi)²)).
• Advantages of SVM
• It is effective in cases where the number of dimensions is greater than the number of samples.
• It uses a subset of the training points in the decision function (called support vectors), so it is also memory efficient.
• SVM classifiers offer good accuracy and perform faster prediction compared to other Machine Learning models.
• Disadvantages of SVM
• SVM is not suitable for large datasets because of its high training time.
• It also does not perform very well when the target classes overlap.
• Applications of SVM
• As noted earlier: face detection, image classification, text categorization, etc. A code sketch follows.
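A minimal scikit-learn sketch contrasting a linear SVM with an RBF-kernel SVM on a tiny XOR-style dataset (invented for illustration), where no single straight line can separate the classes:

from sklearn.svm import SVC

# XOR-like data: not separable by a single straight line
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print(linear_svm.predict(X))     # a linear boundary cannot fit all four points
print(rbf_svm.predict(X))        # the RBF kernel trick separates them: [0 1 1 0]
print(rbf_svm.support_vectors_)  # the extreme points (support vectors)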
Beyond binary classification: multiclass classification
• Binary Classifiers for Multi-Class Classification
• Binary classification tasks are those where examples are assigned exactly one of two classes.
• Multi-class classification tasks are those where examples are assigned exactly one of more than two classes.
• Binary Classification: classification tasks with two classes.
• Multi-class Classification: classification tasks with more than two classes.
Beyond binary classification: multiclass classification
• One approach for using binary classification algorithms for multi-classification
problems is to split the multi-class classification dataset into multiple binary
classification datasets and fit a binary classification model on each.
• Two different methods of this approach are the One-vs-Rest and One-vs-One
strategies.
• The One-vs-Rest strategy splits a multi-class classification into one binary
classification problem per class.
• The One-vs-One strategy splits a multi-class classification into one binary
classification problem per each pair of classes.
One-Vs-Rest for Multi-Class Classification
• One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using
• It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier
is then trained on each binary classification problem and predictions are made using the model that is the
most confident.
• For example, given a multi-class classification problem with examples for class ‘red,’ ‘blue,’
each and
‘green‘. This could be divided into three binary classification datasets as follows:
One-Vs-One for Multi-Class Classification
• One-vs-One (OvO) splits the multi-class dataset into one binary classification dataset for each pair of classes. For a problem with 4 classes, for example, the number of binary datasets is:
(NumClasses × (NumClasses - 1)) / 2 = (4 × (4 - 1)) / 2 = (4 × 3) / 2 = 12 / 2 = 6
• Each binary classification model may predict one class label, and the model with the most predictions or votes is predicted by the one-vs-one strategy. Both strategies are sketched in code below.
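Both strategies are available as wrappers in scikit-learn; a brief sketch with an invented three-class toy dataset:

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

# Toy 3-class problem (classes 0, 1, 2); data invented for illustration
X = [[0, 0], [0, 1], [4, 4], [4, 5], [8, 0], [8, 1]]
y = [0, 0, 1, 1, 2, 2]

ovr = OneVsRestClassifier(SVC()).fit(X, y)  # one binary problem per class
ovo = OneVsOneClassifier(SVC()).fit(X, y)   # one binary problem per pair of classes

print(len(ovr.estimators_))  # 3  (one per class)
print(len(ovo.estimators_))  # 3  ((3 * (3 - 1)) / 2 pairs)
print(ovr.predict([[4, 4]]), ovo.predict([[4, 4]]))  # both -> class 1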