UNIT II Machine Learning
• At a high level, these different algorithms can be classified into two groups
based on the way they “learn” about data to make predictions:
• Supervised learning
• Unsupervised learning.
Classification and Regression in Machine Learning
• For example, a classification algorithm will learn to identify animals after being trained on a dataset
of images that are properly labeled with the species of the animal and some identifying
characteristics.
• Supervised learning problems can be further grouped into Regression and Classification problems.
• Both problems aim to construct a concise model that can predict the value of the
dependent attribute from the input attribute variables.
• The difference between the two tasks is the fact that the dependent attribute is numerical for
regression and categorical for classification.
Classification and Regression in Machine Learning
The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., while Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam.
Classification in Machine Learning
• A classification problem is when the output variable is a category, such as “apple” or “mango” or
“yes” and “no”. A classification model attempts to draw some conclusion from observed values.
• Given one or more inputs a classification model will try to predict the value of one or more outcomes.
• For example, when filtering emails “spam” or “not spam”, when looking at transaction data,
“fraudulent”, or “authorized”.
• In short, classification either predicts categorical class labels or classifies data (constructs a model)
based on a training set and the values (class labels) of a classifying attribute, and then uses the model
to classify new data.
• There are a number of classification models. Classification models include logistic regression,
decision tree, random forest, SVM, one-vs-rest, and Naive Bayes.
Classification in Machine Learning
• Note: predicting the number of copies a music album will sell next month is a regression problem (the output is a numeric quantity), not a classification problem.
Classification in Machine Learning
• Classification is the process of finding or discovering a model or function which helps in
separating the data into multiple categorical classes i.e. discrete values.
• In classification, data is categorized under different labels according to some parameters given in
input and then the labels are predicted for the data.
• The derived mapping function could be demonstrated in the form of “IF-THEN” rules.
• The classification process deals with problems where the data can be divided into binary or
multiple discrete labels.
• Let's take an example: suppose we want to predict the possibility of Team A winning a match
on the basis of some parameters recorded earlier. Then there would be two labels, Yes and No.
Regression in Machine Learning
• Regression is the process of finding a model or function which distinguishes the data into continuous real values instead of discrete classes.
• It can also identify the movement of the distribution depending on the historical data.
• Because a regression predictive model predicts a quantity, the skill of the model must be reported as an error in those predictions.
• Let's take an example in regression as well, where we are finding the possibility of rain in some particular regions with the help of some parameters recorded earlier.
• Many different models can be used; the simplest is linear regression.
• It tries to fit the data with the best hyperplane which goes through the points.
• Polynomial Regression fits a nonlinear curve rather than a straight line.
Classification vs. Regression
• Basic: In classification, the mapping function maps values to predefined classes; in regression, it maps values to a continuous output.
• Involves prediction of: Discrete values (classification) vs. continuous or real values (regression).
• Nature of the predicted data: Unordered (classification) vs. ordered (regression).
• Method of calculation: Measured by accuracy (classification) vs. by root mean square error (regression).
• Algorithms: Decision tree, logistic regression, etc. (classification) vs. regression tree (random forest), linear regression, etc. (regression).
• Output: Classification tries to find the decision boundary which can divide the dataset into different classes; regression tries to find the best-fit line which can predict the output more accurately.
• Examples: Identification of spam emails, speech recognition, identification of cancer cells, etc. (classification) vs. weather prediction, house price prediction, etc. (regression).
• Types: Classification algorithms can be divided into binary and multi-class classifiers; regression algorithms can be divided into linear and non-linear regression.
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
Decision Tree
• It is one way to display an algorithm that only contains conditional control statements.
• Each internal node represents a test on an attribute, each branch an outcome of the test, and each
leaf node a class label (the decision taken after computing all attributes).
• Tree-based methods empower predictive models with high accuracy, stability and
ease of interpretation.
• Root Node: The root node is where the tree begins; it represents the entire dataset, which then gets divided into two or more homogeneous sets.
• Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
• The next decision node further splits into one decision node (Cab facility) and one leaf node.
• Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Example 1
• Consider the decision-tree diagram for this example (figure omitted).
Example 2
• Decision trees classify instances by sorting them down the tree from the
root to some leaf node, which provides the classification of the instance.
• An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute as shown in the figure.
• This process is then repeated for the subtree rooted at the new node.
Example 2
The decision tree in the above figure classifies a particular morning by sorting it
down the tree according to whether the morning is suitable for playing tennis, and
returns the classification associated with the leaf it reaches (in this case, Yes or No).
For example, the instance
(Outlook = Sunny, Humidity = High)
would be sorted down the leftmost branch of this decision tree and
would therefore be classified as a negative instance.
How does the Decision Tree algorithm work?
• In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node of the tree.
• The algorithm compares the value of the root attribute with the corresponding record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
• It continues this process until it reaches a leaf node of the tree.
Decision Tree algorithm
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where the nodes
cannot be classified further; these final nodes are called leaf nodes.
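The steps above correspond closely to what a library implementation does. Below is a minimal sketch using scikit-learn's DecisionTreeClassifier on a small, made-up weather-style dataset (the data and integer encodings are illustrative assumptions, not the slides' exact table):

```python
# A minimal sketch (assuming scikit-learn is installed) of training a
# decision tree on a small, illustrative weather dataset.
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: each row is [outlook, temp, humidity, wind] encoded as integers
# (e.g., outlook: 0=rainy, 1=overcast, 2=sunny). Labels: 1 = play, 0 = don't play.
X = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [2, 1, 0, 0],
     [2, 2, 1, 0], [2, 2, 1, 1], [1, 2, 1, 1], [0, 1, 0, 0]]
y = [0, 0, 1, 1, 1, 0, 1, 0]

# criterion="entropy" selects attributes by information gain (as in ID3);
# the default "gini" uses the Gini index (as in CART).
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

print(clf.predict([[2, 1, 0, 1]]))  # classify a new, unseen sample
```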
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes.
• To solve such problems there is a technique called the Attribute Selection
Measure (ASM); the two popular ASM techniques are Information Gain and the Gini Index.
• With this measure, the user can easily select the best attribute for the nodes of the tree.
• Gini Index: the Gini index is a measure of impurity used by the CART algorithm while creating a decision tree.
• An attribute with a low Gini index should be preferred over one with a high Gini index.
• CART only creates binary splits, and it uses the Gini index to create them.
• The Gini index can be calculated using the formula:
Gini Index = 1 - Σj (pj)²
where pj is the proportion of samples belonging to class j at the node.
a) Entropy using the frequency table of one attribute (the target, Play Golf, with 9 Yes and 5 No out of 14 samples):
Entropy(PlayGolf) = Entropy(5/14, 9/14)
= Entropy(0.36, 0.64)
= -(0.36 log2 0.36) - (0.64 log2 0.64)
= 0.53 + 0.41
= 0.94
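The same calculation can be checked in a few lines of Python (standard library only; the probabilities 9/14 and 5/14 come from the frequency table above):

```python
# Reproducing the entropy calculation above in Python.
from math import log2

def entropy(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# 9 "Yes" and 5 "No" out of 14 samples:
print(round(entropy([9/14, 5/14]), 2))  # -> 0.94
```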
Example
b) Entropy using the frequency table of two attributes (the entropy of a split is the weighted average of the entropy of each branch):
E(T, X) = Σc P(c) × E(c)
• Information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a
decision tree is all about finding the attribute that returns the highest information gain (i.e., the most
homogeneous branches).
• Step 1: Calculate the entropy of the target.
• Step 2: The dataset is then split on the different attributes. The entropy of each branch is calculated and
added proportionally to get the total entropy of the split. The resulting entropy is subtracted from the entropy
before the split. The result is the information gain, or decrease in entropy:
Gain(T, X) = Entropy(T) - Entropy(T, X)
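A small sketch of Step 2 in Python; the entropy and information_gain helpers below are illustrative, and the Outlook/Play labels are taken from the standard 14-sample play-golf table:

```python
# Split entropy is the weighted average of branch entropies; the gain is
# the parent entropy minus that weighted average.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    n = len(labels)
    split_entropy = 0.0
    for v in set(attribute_values):
        branch = [l for a, l in zip(attribute_values, labels) if a == v]
        split_entropy += (len(branch) / n) * entropy(branch)  # weighted branch entropy
    return entropy(labels) - split_entropy

# Information gain of Outlook for the 14-sample play-golf data.
outlook = ["Rainy"] * 5 + ["Overcast"] * 4 + ["Sunny"] * 5
play = ["No", "No", "No", "Yes", "Yes"] + ["Yes"] * 4 + ["Yes", "Yes", "Yes", "No", "No"]
print(round(information_gain(outlook, play), 3))  # -> 0.247
```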
• Step 3: Choose the attribute with the largest information gain as the decision node; in this example, Outlook. (The per-attribute frequency tables are omitted.)
Example
• Step 4a: A branch with an entropy of 0 is a leaf node.
• Step 4b: A branch with an entropy greater than 0 needs further splitting.
• Step 5: The ID3 algorithm is run recursively on the non-leaf branches until all data is
classified.
Decision Tree to Decision Rules
• A decision tree can easily be transformed into a set of rules by mapping the paths from the root node
to the leaf nodes, one leaf at a time.
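As a sketch of this idea, scikit-learn's export_text prints each root-to-leaf path of a fitted tree as indented IF-THEN style rules (the two-feature dataset below is a made-up illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 1], [1, 0], [1, 1], [0, 0]]  # two illustrative binary features
y = [0, 1, 1, 0]                       # here the label simply follows the first feature

clf = DecisionTreeClassifier().fit(X, y)
# export_text walks each root-to-leaf path and prints it as rule-like text.
print(export_text(clf, feature_names=["salary_ok", "near_home"]))
```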
Types of Decision Trees
The type of decision tree is based on the type of target variable. It can be
of two types:
• Categorical Variable Decision Tree: A decision tree with a categorical target variable
is called a categorical variable decision tree. E.g., in the above scenario of the student
problem, the target variable was "Student will play golf or not", i.e., YES or NO.
• Continuous Variable Decision Tree: A decision tree with a continuous target variable
is called a continuous variable decision tree.
Advantages of Decision Tree
• Easy to Understand
• Decision trees require relatively little effort from users for data preparation.
• Less data cleaning required
• Data type is not a constraint
• Non-Parametric Method
• Non-linear relationships between parameters do not affect tree performance.
Disadvantages of Decision Tree
• Overfitting
• Not well suited for continuous variables
• Calculations can become complex when there are many class labels.
• It generally gives lower prediction accuracy on a dataset compared to
other machine learning algorithms.
• Information gain in a decision tree with categorical variables gives a biased
response for attributes with a greater number of categories.
Applications of Decision Tree
• Direct Marketing
• Customer Retention
• Fraud Detection
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
Naïve Bayes
• The Naïve Bayes classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make
quick predictions.
• With the help of Bayes' theorem, we can express the probability of a class given the features in quantitative form as follows:
P(c|x) = P(x|c) × P(c) / P(x)
where 'c' is the class variable and 'x' is a dependent feature vector (of size n). Under the naïve independence assumption, P(x|c) = P(x1|c) × P(x2|c) × ... × P(xn|c).
Example: Naïve Bayes
Target: Play Golf. Predictors: Outlook, Temp, Humidity, Wind.

Outlook   Temp  Humidity  Wind   Play Golf
Rainy     Hot   High      False  No
Rainy     Hot   High      True   No
Overcast  Hot   High      False  Yes
Sunny     Mild  High      False  Yes
Sunny     Cool  Normal    False  Yes
Sunny     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Rainy     Mild  High      False  No
Rainy     Cool  Normal    False  Yes
Sunny     Mild  Normal    False  Yes
Rainy     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Sunny     Mild  High      True   No

Total no. of samples for class 1: Play_golf = "Yes" = 9
Total no. of samples for class 2: Play_golf = "No" = 5
Example: Naïve Bayes
For the data sample X = (Outlook = Rainy, Temp = Cool, Humidity = High, Windy = True):
P(X|Yes) = P(Rainy|Yes) × P(Cool|Yes) × P(High|Yes) × P(True|Yes) = (2/9) × (3/9) × (3/9) × (3/9) ≈ 0.0082
P(X|No) = P(Rainy|No) × P(Cool|No) × P(High|No) × P(True|No) = (3/5) × (1/5) × (4/5) × (3/5) = 0.0576
Example: Naïve Bayes
• Prior probabilities: P(Yes) = 9/14 = 0.64 and P(No) = 5/14 = 0.36.
P(X|Yes) × P(Yes) = 0.0082 × 0.64 ≈ 0.0053
P(X|No) × P(No) = 0.0576 × 0.36 ≈ 0.0207
Since 0.0207 > 0.0053, the sample X is classified as Play Golf = No.
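The hand computation above can be verified with a short script (standard library only) built directly from the play-golf table; small differences in the last digit come from rounding:

```python
from collections import Counter

data = [  # (Outlook, Temp, Humidity, Windy, PlayGolf)
    ("Rainy", "Hot", "High", "False", "No"), ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"), ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"), ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"), ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"), ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Sunny", "Mild", "High", "True", "No"),
]
x = ("Rainy", "Cool", "High", "True")  # the query sample X

for cls in ("Yes", "No"):
    rows = [r for r in data if r[4] == cls]
    prior = len(rows) / len(data)            # P(class)
    likelihood = 1.0
    for i, value in enumerate(x):            # product of P(xi | class)
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    print(cls, round(likelihood, 4), round(likelihood * prior, 4))
# -> Yes 0.0082 0.0053 and No 0.0576 0.0206, so the prediction is "No"
# (the slides' 0.0207 comes from using the rounded prior 0.36).
```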
Types of Naïve Bayes
1. Gaussian: The Gaussian Naïve Bayes classifier assumes that continuous features follow a
normal (Gaussian) distribution.
2. Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification problems,
i.e., deciding which category a particular document belongs to, such as Sports, Politics,
Education, etc. The classifier uses the frequency of words as the predictors.
Types of Naïve Bayes
3. Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean variables,
such as whether a particular word is present or not in a document.
This model is also well known for document classification tasks.
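As a sketch, the three variants correspond to the GaussianNB, MultinomialNB, and BernoulliNB classes in scikit-learn (the tiny dataset below is illustrative only):

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X = [[0, 1, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]]  # tiny illustrative data
y = [0, 1, 1, 0]

# Fit each variant on the same data and classify one new sample.
for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model().fit(X, y)
    print(Model.__name__, clf.predict([[1, 0, 0]]))
```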
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• Naive Bayes requires only a small amount of training data to estimate its parameters, so
the training period is short.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
• If a categorical variable has a category in the test data set which was not observed in the
training data set, the model will assign it a zero probability and will be unable to make a
prediction. This is often known as the Zero Frequency problem; it can be solved with a
smoothing technique such as Laplace estimation.
Applications of Naïve Bayes Classifier
• Real-time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast. Thus, it can be used
for making predictions in real time.
• Multi-class Prediction: This algorithm is also well known for its multi-class prediction feature; it can
predict the probability of multiple classes of the target variable.
• Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text
classification (due to better results in multi-class problems and the independence rule) and have a higher
success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying
spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
• Recommendation Systems: A Naive Bayes classifier and collaborative filtering together build a
recommendation system that uses machine learning and data mining techniques to filter unseen information
and predict whether a user would like a given resource or not.
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
Linear Regression
• Linear regression makes predictions for continuous/real or numeric variables such as sales,
salary, age, product price, etc.
• The linear regression algorithm shows a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence the name linear regression.
• Since linear regression shows a linear relationship, it finds how the value of the dependent
variable changes according to the value of the independent variable.
Linear Regression
• The linear regression model provides a sloped straight line
representing the relationship between the variables.
• Consider a scatter plot of the data with the fitted regression line (figure omitted).
Y = a0 + a1X + ε
• Here, Y is the dependent (target) variable, X is the independent (predictor) variable,
a0 is the intercept of the line, a1 is the linear regression coefficient (slope), and ε is
the random error.
• For a positive (+ve) line of regression the equation is Y = a0 + a1x with a1 > 0 (Y increases with X);
for a negative (-ve) line of regression the slope is negative, i.e. Y = a0 - a1x (Y decreases as X increases).
Example: Making Predictions with Linear Regression
• Given the representation is a linear equation, making predictions is as
simple as solving the equation for a specific set of inputs.
• Imagine we are predicting weight (y) from height (x).
• A linear regression model representation for this problem would be:
Y = b0+b1X
or
weight = b0 + b1 * height
Example: Making Predictions with Linear Regression
• Here b0 is the bias coefficient and b1 is the coefficient for the height column.
• Once found, the user can plug in different height values to predict the weight.
• Suppose the learned coefficients are b0 = 0.1 and b1 = 0.5. Let's plug them in and
calculate the weight (in kilograms) for a person with a height of 182 centimeters:
weight = 0.1 + 0.5 × 182
weight = 91.1
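The prediction above as code; note that b0 = 0.1 and b1 = 0.5 are the assumed example coefficients, not values fitted from real data:

```python
def predict_weight(height_cm, b0=0.1, b1=0.5):
    """Simple linear regression prediction: weight = b0 + b1 * height."""
    return b0 + b1 * height_cm

print(predict_weight(182))  # -> 91.1
```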
Preparing Data for Linear Regression
• Gaussian Distributions: Linear regression will make more reliable predictions if your
input and output variables have a Gaussian distribution. You may get some benefit from using
transforms on your variables to make their distributions more Gaussian-looking.
• Rescale Inputs: Linear regression will often make more reliable predictions if you rescale
the input variables using standardization or normalization.
Types of Linear Regression
Simple Linear Regression:
• If a single independent variable is used to predict the value of a numerical dependent
variable, the algorithm is called Simple Linear Regression. Its equation is:
Y = a0 + a1x + ε
Multiple Linear Regression:
• If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a linear regression algorithm is called Multiple Linear Regression.
• Since it is an enhancement of simple linear regression, the same form applies; with n
independent variables the equation becomes:
Y = a0 + a1x1 + a2x2 + ... + anxn + ε
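A minimal sketch (assuming scikit-learn and NumPy) of fitting a multiple linear regression on synthetic data with two independent variables; the true coefficients used to generate the data are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))  # two predictors x1, x2
# Synthetic target: y = 3.0 + 2.0*x1 - 1.5*x2 + noise
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # approximately recovers a0 and (a1, a2)
print(model.predict([[4.0, 2.0]]))    # predict y for a new sample
```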
Advantages and Disadvantages of Linear Regression
• Advantage: It handles overfitting fairly well, using dimensionality-reduction techniques, regularization, and cross-validation.
• Disadvantage: Linear regression is quite sensitive to outliers.
Applications of Linear Regression
• Risk Analysis
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
Logistic Regression
• Logistic regression is used for predicting the categorical dependent variable using a given set of
independent variables.
• The outcome can be Yes or No, 0 or 1, True or False, etc.; but instead of giving the exact
values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression
• Logistic regression is quite similar to linear regression, except in how it is used.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1: values above the threshold tend to 1, and values below
the threshold tend to 0.
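A minimal sketch of the sigmoid function and thresholding described above (standard library only):

```python
import math

def sigmoid(z):
    """Maps any real value into the range (0, 1), forming the S-curve."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    # Probabilities above the threshold map to class 1, below to class 0.
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0))    # -> 0.5
print(classify(2.0))   # -> 1
print(classify(-2.0))  # -> 0
```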
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variables should not have multicollinearity.
Logistic Regression Equation
• The logistic regression equation can be obtained from the linear regression equation by
passing the linear combination of inputs through the sigmoid function; in log-odds form:
log(y / (1 - y)) = b0 + b1x1 + b2x2 + ... + bnxn
Types of Logistic Regression
1. Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep".
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
Applications of Logistic Regression
• Spam Detection
• Spam detection is a binary classification problem where we are given an email and we need to
classify whether or not it is spam.
• The resulting feature vector is then used to train a Logistic classifier which emits a score in the range
0 to 1. If the score is more than 0.5, we label the email as spam; otherwise, we don't label it as spam.
• Credit Card Fraud Detection
• In banking sector when a credit card transaction happens, the bank makes a note of several factors.
For instance, the date of the transaction, amount, place, type of purchase, etc. Based on these factors,
they develop a Logistic Regression model of whether or not the transaction is a fraud. For instance, if
the amount is too high and the bank knows that the concerned person never makes purchases that
high, they may label it as a fraud.
• Tumour Prediction
• A Logistic Regression classifier may be used to identify whether a tumour is malignant or if it is
benign. Several medical imaging techniques are used to extract various features of tumours. For
instance, the size of the tumour, the affected body area, etc. These features are then fed to a Logistic
Regression classifier to identify if the tumour is malignant or if it is benign.
Marketing
• Every day, when you browse your Facebook news feed, powerful algorithms
running behind the scenes predict whether or not you would be interested in certain
content (which could be, for instance, an advertisement).
Machine Learning Algorithms
• Decision Tree
• Naïve Bayes
• Linear Regression
• Logistic Regression
• Support Vector Machine (SVM)
Support Vector Machine (SVM)
• SVM chooses the extreme points/vectors that help in creating the hyperplane.
• These extreme cases are called support vectors, and hence the algorithm is termed
Support Vector Machine.
• Consider a diagram in which two different categories are classified using a decision
boundary or hyperplane (figure omitted).
Example
• Suppose we see a strange cat that also has some features of dogs. If we want a model
that can accurately identify whether it is a cat or a dog, such a model can be created
using the SVM algorithm.
• We first train the model with lots of images of cats and dogs so that it can learn about
their different features, and then we test it with this strange creature.
• Since the support vector machine creates a decision boundary between the two classes
(cat and dog) and chooses the extreme cases (support vectors), it will see the extreme
cases of cats and dogs and classify the new creature accordingly.
Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes using a single straight line, then such data
is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane & Support Vectors in the SVM :
• Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the data
points. This best boundary is known as the hyperplane of SVM. The dimensions of the hyperplane
depend on the features present in the dataset, which means if there are 2 features then hyperplane
will be a straight line. And if there are 3 features, then hyperplane will be a 2-dimension plane. We
always create a hyperplane that has a maximum margin, which means the maximum distance
between the data points.
• Support Vectors: The data points or vectors that are closest to the hyperplane and which affect
the position of the hyperplane are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.
How does SVM work?
• Linear SVM: For linearly separable data, SVM finds the straight-line decision boundary
with the maximum margin between the two classes (figures omitted).
• Non-linear SVM: For data that cannot be separated by a straight line, a third dimension
can be added, for example:
z = x² + y²
How does SVM work?
• In the lifted (x, z) space the data becomes linearly separable, so SVM can now divide
the dataset into classes with a straight boundary, which corresponds to a circular
boundary in the original (x, y) space.
• In simple words, a kernel converts non-separable problems into separable problems by adding more
dimensions.
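A tiny illustration of the idea: the four hand-picked points below are not separable by a straight line in (x, y) (one class near the origin, one far from it), but become separable after adding z = x² + y²:

```python
# Points that are not linearly separable in (x, y) become separable
# after adding the dimension z = x**2 + y**2.
points = [(0.5, 0.0, "inner"), (0.0, 0.6, "inner"),   # close to the origin
          (2.0, 0.0, "outer"), (0.0, 2.5, "outer")]   # far from the origin

for x, y, label in points:
    z = x**2 + y**2  # the added third dimension
    print(label, "z =", z)
# Inner points get small z, outer points get large z, so a plane such as
# z = 1 separates the two classes in the lifted space.
```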
• Polynomial Kernel
• It is a more generalized form of the linear kernel and can distinguish curved or nonlinear
input spaces. The formula for the polynomial kernel is:
K(x, xi) = (1 + sum(x × xi))^d
• Here d is the degree of the polynomial, which we need to specify manually in the
learning algorithm.
SVM Kernels
• Radial Basis Function (RBF) Kernel
The RBF kernel, mostly used in SVM classification, maps the input space into an
indefinite-dimensional space. The following formula describes it mathematically:
K(x, xi) = exp(-γ ||x - xi||²)
Here gamma ranges from 0 to 1 and must be specified manually in the learning
algorithm. A good default value of gamma is 0.1.
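Both kernel formulas can be written directly as code (NumPy assumed; the inputs and parameter values below are illustrative):

```python
import numpy as np

def polynomial_kernel(x, xi, d=2):
    # K(x, xi) = (1 + sum(x * xi)) ** d
    return (1 + np.dot(x, xi)) ** d

def rbf_kernel(x, xi, gamma=0.1):
    # K(x, xi) = exp(-gamma * ||x - xi||^2)
    return np.exp(-gamma * np.linalg.norm(np.asarray(x) - np.asarray(xi)) ** 2)

print(polynomial_kernel([1, 2], [3, 4]))  # -> (1 + 11)**2 = 144
print(rbf_kernel([1, 2], [3, 4]))         # -> exp(-0.1 * 8) ≈ 0.449
```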
• Advantages of SVM
• It works really well with a clear margin of separation.
• It is effective in cases where the number of dimensions is greater than the number of samples.
• It uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.
• SVM Classifiers offer good accuracy and perform faster prediction compared to other Machine
Learning models.
• Disadvantages of SVM
• SVM is not suitable for large datasets because of its high training time.
• It also doesn't perform very well when the target classes overlap.
• Applications of SVM include face detection, text and hypertext categorization, image classification, handwriting recognition, and bioinformatics.
Beyond binary classifications: multiclass classification
• Binary Classifiers for Multi-Class Classification
• Binary classification tasks are those where examples are assigned exactly one of two classes.
• Multi-class classification tasks are those where examples are assigned exactly one of more than
two classes:
• Binary Classification: Classification tasks with two classes.
• Multi-class Classification: Classification tasks with more than two classes.
Beyond binary classifications: multiclass classification
• One approach for using binary classification algorithms for multi-class classification problems
is to split the multi-class classification dataset into multiple binary classification datasets
and fit a binary classification model on each.
• Two different methods of this approach are the One-vs-Rest and One-vs-One strategies.
• The One-vs-Rest strategy splits a multi-class classification into one binary classification
problem per class.
• The One-vs-One strategy splits a multi-class classification into one binary classification
problem per each pair of classes.
One-Vs-Rest for Multi-Class Classification
• One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using binary
classification algorithms for multi-class classification.
• It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is
then trained on each binary classification problem and predictions are made using the model that is the most
confident.
• For example, given a multi-class classification problem with examples for each of the classes 'red',
'blue', and 'green', the dataset could be divided into three binary classification datasets as follows:
• Binary Classification Problem 1: red vs. [blue, green]
• Binary Classification Problem 2: blue vs. [red, green]
• Binary Classification Problem 3: green vs. [red, blue]
One-Vs-One for Multi-Class Classification
• One-vs-One (OvO) splits a multi-class classification into one binary classification problem
per pair of classes. We can see that for four classes, this gives us the expected value of six
binary classification problems:
(NumClasses * (NumClasses - 1)) / 2
= (4 * (4 - 1)) / 2
= (4 * 3) / 2
= 12 / 2
= 6
• Each binary classification model may predict one class label and the model
with the most predictions or votes is predicted by the one-vs-one strategy.
• “An alternative is to introduce K(K − 1)/2 binary discriminant functions,
one for every possible pair of classes. This is known as a one-versus-one
classifier. Each point is then classified according to a majority vote amongst
the discriminant functions.”
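A hedged sketch (assuming scikit-learn) showing both strategies on synthetic 4-class data; OneVsRestClassifier fits one binary classifier per class, while OneVsOneClassifier fits one per pair of classes:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Synthetic 4-class dataset for illustration only.
X, y = make_classification(n_samples=200, n_features=8, n_informative=6,
                           n_classes=4, random_state=1)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # -> 4 (one binary classifier per class)
print(len(ovo.estimators_))  # -> 6 (one per pair: (4 * 3) / 2)
```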
END
Of
UNIT- II