Machine Learning
Module 2
Decision Trees
The Decision Tree algorithm belongs to the family of supervised learning algorithms and is one of the most practical methods for inductive inference. Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be represented as sets of if-then rules. A decision tree is a hierarchical data structure that implements the divide-and-conquer strategy. It is an efficient method that can be used for both classification and regression: a hierarchical model for supervised learning in which a local region of the input space is identified through a sequence of recursive splits in a small number of steps.
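As a rough illustration (not part of the original notes), a decision tree classifier can be trained in a few lines with scikit-learn; the iris dataset and the max_depth value below are illustrative assumptions.

# Minimal sketch, assuming scikit-learn is installed; dataset and parameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3)  # each internal node tests one feature (a discriminant)
tree.fit(X_train, y_train)                  # recursive splitting: divide and conquer
print(tree.score(X_test, y_test))           # accuracy on held-out data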
Key Components of Decision Trees:
1. Root Node: Represents the entire population or sample; it is further divided into two or more homogeneous sets.
2. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
3. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes. A leaf node defines a localized region in input space. Each leaf node has an output label, which is a class label in classification and a numeric value in regression.
4. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
The boundaries of each region are defined by discriminants that are coded in the internal nodes.
Advantages of Decision Tree
● Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
● A decision tree does not require normalization of data.
● A decision tree does not require scaling of data as well.
● Missing values in the data also do NOT affect the process of building a decision tree to
any considerable extent.
● A Decision tree model is very intuitive and easy to explain to technical teams as well as
stakeholders.
Disadvantages of Decision Tree
● A small change in the data can cause a large change in the structure of the decision tree
causing instability.
● For a decision tree, the calculations can sometimes become far more complex than for other algorithms.
● Decision trees often take more time to train than other algorithms.
● Decision tree training is relatively expensive, as the complexity and time required are greater.
● The Decision Tree algorithm is inadequate for applying regression and predicting
continuous values.
Types of problems decision tree learning is good for:
Information Gain
Constructing Decision Trees
ID3
Advantages and disadvantages of ID3
C4.5
● The most significant disadvantage of Decision Trees is that they are prone to overfitting.
Decision Trees overfit because you can end up with a leaf node for every single target
value in your training data.
● Decision Trees are also locally optimized, or greedy, which means that they do not look ahead when deciding how to split at any given node. Rather, each split is made to minimize or maximize the chosen splitting (selection) criterion: Gini or entropy for classification, MSE or MAE for regression.
● Because of the greedy nature of splitting, imbalanced classes also pose a major issue for
Decision Trees when dealing with classification. At each split, the tree is deciding how to
best split up classes into the next two nodes. So when one class has very low
representation (the minority class), many of those observations can get lost in the
majority class nodes, and prediction of the minority class will then be even less likely than it should be, if any nodes predict it at all.
Ensemble methods
Ensemble methods combine several decision trees to produce better predictive performance
than utilizing a single decision tree. The main principle behind the ensemble model is that a
group of weak learners come together to form a strong learner.
Techniques are:
Bagging
Bagging is used when the goal is to reduce the variance of a decision tree classifier. The
objective is to create several subsets of data from training samples chosen randomly with
replacement. Each subset of data is used to train its own decision tree. As a result, we get an ensemble of different models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier.
Bagging Steps:
● Suppose there are N observations and M features in the training data set. A sample from
the training data set is taken randomly with replacement.
● A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node iteratively.
● The tree is grown to its largest possible size.
● The above steps are repeated n times, and the final prediction is based on the aggregation of the predictions from the n trees.
Disadvantages:
● Since the final prediction is based on the mean of the predictions from the subset trees, it will not give precise values for the classification and regression model.
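As a sketch of the bagging procedure described above (assuming scikit-learn; the dataset and the number of estimators are illustrative choices):

# Bagging sketch; BaggingClassifier uses a decision tree as its default base estimator.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(n_estimators=50,    # repeat the sample/train step 50 times
                        bootstrap=True,     # sample with replacement
                        random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))            # aggregated prediction over all trees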
Boosting
Gradient Boosting
● Gradient Boosting uses the gradient descent method to reduce the loss function of the entire operation. Gradient descent is a first-order optimization algorithm that finds a local minimum of a differentiable function. Gradient boosting sequentially trains multiple models, fitting each new model to obtain a more accurate estimate of the response.
● Once a loss function is defined for a particular model, gradient boosting is used to
minimize the value of this function, thus minimizing the error while constructing another
tree, by modifying the weights associated with the data points.
● GradientBoostingRegressor and GradientBoostingClassifier can be used to implement
this method in Python by using the library sklearn.ensemble.
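For example (a minimal sketch; the dataset and hyperparameter values are illustrative assumptions, not prescribed here):

# Gradient boosting via sklearn.ensemble, as mentioned above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gbc.fit(X_train, y_train)         # trees are fitted sequentially to reduce the loss
print(gbc.score(X_test, y_test))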
AdaBoost
XGBoost
XGBoost is an implementation of gradient boosted decision trees designed for speed and
performance. Gradient boosting machines are generally very slow in implementation because of
sequential model training. Hence, they are not very scalable. Thus, XGBoost is focused on
computational speed and model performance. XGBoost provides:
● Parallelization of tree construction using all of your CPU cores during training.
● Distributed Computing for training very large models using a cluster of machines.
● Out-of-Core Computing for very large datasets that don’t fit into memory.
● Cache Optimization of data structures and algorithms to make the best use of hardware.
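A minimal usage sketch, assuming the separate xgboost package is installed (its scikit-learn-style wrapper is used here; parameter values are illustrative):

# XGBoost sketch; requires the xgboost package in addition to scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=100, n_jobs=-1)  # n_jobs=-1 parallelizes over CPU cores
model.fit(X_train, y_train)
print(model.score(X_test, y_test))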
Loss function
Loss functions play an important role in any statistical model. They define an objective against which the performance of the model is evaluated, and the parameters learned by the model are determined by minimizing a chosen loss function. Loss functions define what a good prediction is and is not.
In short, we can say that the loss function is part of the cost function: the cost function is calculated as the average of the loss functions, while the loss function is a value calculated for every instance. So, for a single training cycle, the loss is calculated numerous times, but the cost function is calculated only once.
1. Squared Error Loss
Squared error for each training example is the square of the difference between the predicted and the actual values; it is also known as the L2 loss. The corresponding cost function is the mean of these squared errors (MSE). The MSE loss function penalizes the model for making large errors by squaring them. This property makes the MSE cost function less robust to outliers; therefore, it should not be used if our data is prone to many outliers.
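Written out (with y_i the actual value and \hat{y}_i the predicted value, notation assumed here), the squared error loss and its MSE cost are:

L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2, \qquad MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2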
2. Absolute Error Loss
Absolute Error for each training example is the distance between the predicted and the actual
values, irrespective of the sign. Absolute Error is also known as the L1 loss:
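In the same notation, the absolute error loss and its mean absolute error (MAE) cost are:

L(y_i, \hat{y}_i) = |y_i - \hat{y}_i|, \qquad MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|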
3. Huber Loss
Huber loss is more robust to outliers than MSE. It is used in Robust Regression, M-estimation
and Additive Modelling. A variant of Huber Loss is also used in classification.
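For reference, the usual form of the Huber loss with threshold \delta (a parameter not introduced above) is:

L_{\delta}(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2, \quad \text{if } |y - \hat{y}| \le \delta
L_{\delta}(y, \hat{y}) = \delta\,|y - \hat{y}| - \frac{1}{2}\delta^2, \quad \text{otherwise}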
Binary Classification Loss Functions
Binary Classification refers to assigning an object into one of two classes. This classification is
based on a rule applied to the input feature vector. For example, classifying an email as spam or
not spam based on, say its subject line, is binary classification.
Let’s take the breast cancer dataset: we want to classify a tumor as ‘Malignant’ or ‘Benign’ based on features like average radius, area, perimeter, etc. For simplification, we will use only
two input features (X_1 and X_2) namely ‘worst area’ and ‘mean symmetry’ for classification.
The target value Y can be 0 (Malignant) or 1 (Benign).
1. Binary Cross Entropy Loss
This is also called Log-Loss. To calculate the probability p, we can use the sigmoid function.
Here, z is a function of our input features:
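In standard notation (with weights w_1, w_2 and bias b, symbols assumed here), z and the log-loss for a label y in {0, 1} are:

z = w_1 X_1 + w_2 X_2 + b, \qquad p = \sigma(z) = \frac{1}{1 + e^{-z}}
L = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]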
The range of the sigmoid function is [0, 1] which makes it suitable for calculating probability.
2. Hinge Loss
Hinge loss is primarily used with Support Vector Machine (SVM) Classifiers with class labels -1
and 1. Hinge Loss not only penalizes the wrong predictions but also the right predictions that
are not confident.
Hinge loss for an input-output pair (x, y) is given as:
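In its usual form, with the label y \in \{-1, 1\} and the model score f(x):

L(x, y) = \max\left(0,\; 1 - y \cdot f(x)\right)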
Hinge Loss simplifies the mathematics for SVM while maximizing the margin (as compared to Log-Loss). It is used when we want to make real-time decisions without a laser-sharp focus on accuracy.
Multi-class Classification Loss Function
Emails are not just classified as spam or not spam (this isn’t the 90s anymore!). They are
classified into various other categories – Work, Home, Social, Promotions, etc. This is a
Multi-Class Classification use case.
1. Multi-Class Cross Entropy Loss
The multi-class cross-entropy loss is a generalization of the Binary Cross Entropy loss. The loss
for input vector X_i and the corresponding one-hot encoded target vector Y_i is:
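In standard notation, with c classes and p_{ij} the predicted softmax probability of class j for input X_i:

L(X_i, Y_i) = -\sum_{j=1}^{c} y_{ij} \log(p_{ij}), \qquad p_{ij} = \frac{e^{z_{ij}}}{\sum_{k=1}^{c} e^{z_{ik}}}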
Source: Wikipedia
Softmax is implemented through a neural network layer just before the output layer. The
Softmax layer must have the same number of nodes as the output layer.
Finally, our output is the class with the maximum probability for the given input.
Neural Network Learning as Optimization
● A deep learning neural network learns to map a set of inputs to a set of outputs from
training data.
● We cannot calculate the perfect weights for a neural network; there are too many unknowns. Instead, the problem of learning is cast as a search or optimization problem, and an algorithm is used to navigate the space of possible sets of weights the model may use in order to make good, or good enough, predictions.
● A neural network model is trained using the stochastic gradient descent optimization algorithm, and weights are updated using the backpropagation of error algorithm.
● “Gradient” in gradient descent refers to an error gradient. The model with a given set of weights is used to make predictions, and the error for those predictions is calculated.
● The gradient descent algorithm seeks to change the weights so that the next evaluation
reduces the error, meaning the optimization algorithm is navigating down the gradient
(or slope) of error.
● In the context of an optimization algorithm, the function used to evaluate a candidate
solution (i.e. a set of weights) is referred to as the objective function.
● We may seek to maximize or minimize the objective function, meaning that we are
searching for a candidate solution that has the highest or lowest score respectively.
● Typically, with neural networks, we seek to minimize the error. As such, the objective
function is often referred to as a cost function or a loss function and the value calculated
by the loss function is referred to as simply “loss.”
● The cost or loss function has an important job in that it must faithfully distill all aspects
of the model down into a single number in such a way that improvements in that
number are a sign of a better model.
● In calculating the error of the model during the optimization process, a loss function
must be chosen.
● This can be a challenging problem as the function must capture the properties of the
problem and be motivated by concerns that are important to the project and
stakeholders.
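The gradient descent update described in the bullets above can be written as (with learning rate \eta, a symbol assumed here):

w \leftarrow w - \eta \, \frac{\partial L}{\partial w}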
Model Selection
Model selection is a technique for selecting the best model after the individual models are
evaluated based on the required criteria.
Techniques:
● Resampling methods are simple techniques of rearranging data samples to inspect if the
model performs well on data samples that it has not been trained on.
● Random Splits are used to randomly sample a percentage of data into training, testing, and preferably validation sets.
● Time-wise split: The training set can have data from the last three years and 10 months of the present year; the last two months can be reserved for the testing or validation set.
● K-Fold Cross-Validation: The cross-validation technique works by randomly shuffling the
dataset and then splitting it into k groups. Then on iterating over each group, the group
needs to be considered as a test set while all other groups are clubbed together into the
training set.
● Bootstrap: The first step is to select a sample size (usually equal to the size of the original dataset). Thereafter, a sample data point is randomly selected from the original dataset and added to the bootstrap sample, and then put back into the original dataset (sampling with replacement). This process is repeated N times, where N is the sample size.
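As an illustration of the resampling techniques above, k-fold cross-validation is available in scikit-learn (the model and the value k = 5 are illustrative assumptions):

# k-fold cross-validation sketch with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

cv = KFold(n_splits=5, shuffle=True, random_state=0)             # shuffle, then split into k groups
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)  # each group serves once as the test set
print(scores.mean())                                             # average score over the k folds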
Model Evaluation
For every classification model prediction, a matrix called the confusion matrix can be
constructed which demonstrates the number of test cases correctly and incorrectly classified.
● Accuracy is the simplest metric and can be defined as the number of test cases correctly
classified divided by the total number of test cases.
● Recall tells us the number of positive cases correctly identified out of the total number of actual positive cases.
● Precision tells us the number of cases predicted as positive that are actually positive.
● F1 score is the harmonic mean of Recall and Precision and therefore balances out the strengths of each.
● AUC-ROC: The ROC curve is a plot of the true positive rate (recall) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better the model performance.
● If the curve is somewhere near the 50% diagonal line, it suggests that the model
randomly predicts the output variable.
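The metrics above are available in sklearn.metrics; a minimal sketch with placeholder values (y_true, y_pred, and the probability scores y_score are made up for illustration):

# Evaluation-metric sketch; the label and score arrays below are illustrative only.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             recall_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                  # actual classes
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                  # predicted classes
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]  # predicted probabilities of class 1

print(confusion_matrix(y_true, y_pred))     # correctly vs. incorrectly classified test cases
print(accuracy_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))       # area under the ROC curve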
Bias and Variance
Bias occurs when a model is strictly ruled by assumptions – like the linear regression model
assumes that the relationship of the output variable with the independent variables is a straight
line. This leads to underfitting when the actual values are non-linearly related to the
independent variables.
Variance is high when a model focuses on the training set too much and learns the variations
very closely, compromising on generalization. This leads to overfitting.
An optimal model is one that has both low bias and low variance; since these two attributes are inversely related, the only way to achieve this is through a tradeoff between the two. Therefore, the model should be chosen where the bias and variance curves intersect.
Linear regression
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable. The linear regression model
provides a sloped straight line representing the relationship between the variables.
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values of the x and y variables are the training dataset used for the linear regression model representation.
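A minimal fitting sketch with scikit-learn (the toy x and y values are illustrative assumptions):

# Simple linear regression sketch; the data below is made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # independent variable
y = np.array([2.1, 4.2, 6.1, 8.3, 9.9])            # dependent variable

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)   # a0 (intercept) and a1 (regression coefficient)
print(model.predict([[6.0]]))          # prediction on the fitted straight line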
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
● Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
● Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Logistic Regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
We know the equation of the straight line can be written as:
In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above equation
by (1-y):
But we need a range between -infinity and +infinity, so we take the logarithm of the equation, and it becomes:
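Written out in standard notation (with coefficients b_0, b_1, ..., b_n, symbols assumed here), these steps are:

y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
\frac{y}{1 - y}\,; \quad 0 \text{ for } y = 0 \text{ and } \infty \text{ for } y = 1
\log\!\left[\frac{y}{1 - y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n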
Module 4
Dimensionality Reduction
It is the process of reducing the number of random variables under consideration by obtaining a set of uncorrelated principal variables. Dimension reduction refers to the process of converting a set of data having vast dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely.
Methods of Dimensionality Reduction: Principal Component Analysis (PCA) and Independent Component Analysis (ICA)
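Both methods are available in scikit-learn; a minimal sketch (the dataset and n_components = 2 are illustrative assumptions):

# PCA and ICA sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA

X, _ = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)      # uncorrelated principal components
X_ica = FastICA(n_components=2).fit_transform(X)  # statistically independent components
print(X_pca.shape, X_ica.shape)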
Advantages of Dimensionality Reduction
Principal component analysis(PCA)
Independent Component Analysis (ICA)
Definition of ICA
(Note: Eq. 3 refers to the ICA mixing model x = As.)
What is independence?
Uncorrelated does not mean independent
Why are Gaussian variables forbidden?
2. Whitening
Centering + Whitening = Sphering
● Centering and whitening combined is referred to as sphering, and is necessary to speed
up the ICA algorithm.
● Sphering removes the first- and second-order statistics of the data: the mean is set to zero, the covariances are set to zero, and the variances are equalized.
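A minimal numpy sketch of sphering (one common variant, ZCA whitening; the samples-by-features layout and the small epsilon are assumptions):

import numpy as np

def sphere(X):
    """Center and whiten X (samples x features): zero mean, identity covariance."""
    Xc = X - X.mean(axis=0)                    # centering: remove the mean
    cov = np.cov(Xc, rowvar=False)             # covariance of the centered data
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition of the covariance
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-10)) @ eigvecs.T  # whitening matrix
    return Xc @ W                              # whitened data: variances equalized

# After sphering, np.cov(sphere(X), rowvar=False) is approximately the identity matrix.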