ML Mod 4
In cases where there are non-linear relationships between features and the
target variable.
When dealing with data that may have missing values or doesn’t require
much preprocessing.
For smaller datasets, as Decision Trees can perform well without needing a
vast amount of data.
Overfitting: Decision Trees can become overly complex and capture noise
instead of the underlying pattern, especially when the tree grows too deep.
High Variance: Small changes in data can lead to different splits and result
in a different model, making Decision Trees sensitive to data variability.
Bias toward dominant features: Trees can over-rely on features with more
levels (in categorical data) or wider value ranges (in continuous data).
Lack of smooth predictions: Unlike other models, Decision Trees produce
step-like predictions, which may not generalize well for continuous target
variables.
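To see the step-like behaviour concretely, here is a minimal sketch (assuming scikit-learn is available; the toy sine data is purely illustrative):

```python
# Minimal sketch: a Decision Tree regressor produces piecewise-constant
# (step-like) predictions rather than a smooth curve.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
preds = reg.predict(X)
# Only a handful of distinct outputs: each leaf predicts one constant value.
print("Number of distinct predicted values:", len(np.unique(preds)))
```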
Setting Constraints: Limit the maximum depth of the tree, the minimum
number of samples required to split a node, or the minimum samples
required at a leaf node.
By managing the depth and complexity and by aggregating multiple trees, the
effectiveness and robustness of Decision Trees can be significantly improved.
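As a minimal sketch of the "setting constraints" idea (using scikit-learn and its bundled Iris dataset; the specific hyperparameter values are illustrative, and tree aggregation/ensembling is not shown here):

```python
# Minimal sketch: constraining a Decision Tree to limit overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(
    max_depth=4,           # limit how deep the tree may grow
    min_samples_split=10,  # minimum samples needed to split an internal node
    min_samples_leaf=5,    # minimum samples required at a leaf node
    random_state=42,
)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```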
analysis, astronomy, and molecular biology. Decision trees are the basis of
several commercial rule induction systems.
Q1. What are the key terminologies of Support Vector Machine?
Ans:
SUPPORT VECTOR MACHINE:
1. SVMs are based on the idea of finding a hyperplane that best divides a
dataset into two classes/groups
2. An SVM outputs a map of the sorted data with the margins between the two
classes as far apart as possible
3. When new testing data is added, whichever side of the hyperplane it lands
on decides the class that we assign to it
SEPARATING HYPERPLANE:
1. From figure 4.1, we can see that it is possible to separate the data
2. All the data points representing men will be above the line
3. All the data points representing women will be below the line
MARGIN:
1. The margin is calculated as the perpendicular distance from the line to only
the closest points
2. A good margin is one where this separation is large for both classes
3. The wider the margin, the more optimal the hyperplane we get
SUPPORT VECTORS:
1. The vectors (cases) that define the hyperplane are the support vectors
Q4. Define Support Vector Machine (SVM) and further explain the maximum
margin linear separators concept.
Ans:
SUPPORT VECTOR MACHINE:
1. An SVM outputs a map of the sorted data with the margins between the two
classes as far apart as possible
MAXIMAL-MARGIN CLASSIFIER/SEPARATOR:
1. The numeric input variables (x) in your data (the columns) form an n-
dimensional space
2. For example, if you had two input variables, this would form a two-
dimensional space
3. In two dimensions you can visualize the separating hyperplane as a line, and
let's assume that all of our input points can be completely separated by this line
4. Such a line can be written as \( \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0 \),
where the coefficients (\( \beta_1 \) and \( \beta_2 \)) that determine the slope of
the line and the intercept (\( \beta_0 \)) are found by the learning algorithm
5. By plugging input values into the line equation, you can calculate whether
a new point is above or below the line
6. Above the line, the equation returns a value greater than 0 and the point
belongs to the first class
7. Below the line, the equation returns a value less than 0 and the point
belongs to the second class
8. A point close to the line returns a value close to zero and the point may be
difficult to classify
9. If the magnitude of the value is large, the model may have more confidence
in the prediction
10. The distance between the line and the closest data points is referred to as
the margin
11. The best or optimal line that can separate the two classes is the line that
has the largest margin
12. The margin is calculated as the perpendicular distance from the line to only
the closest points
13. Only those points are relevant in defining the line and in the construction of
the classifier
14. Those points are called the support vectors
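A minimal sketch of the classification rule described in points 4-9 above (the coefficient values below are hypothetical placeholders for what the learning algorithm would actually find):

```python
# Classify a point by which side of the learned line it falls on.
beta0, beta1, beta2 = -1.0, 2.0, 0.5   # hypothetical intercept and slopes

def decision_value(x1, x2):
    """Signed value of the line equation beta0 + beta1*x1 + beta2*x2."""
    return beta0 + beta1 * x1 + beta2 * x2

for point in [(1.0, 3.0), (0.1, 0.2), (0.45, 0.4)]:
    value = decision_value(*point)
    label = "first class" if value > 0 else "second class"
    # values near zero indicate points close to the line (low confidence)
    print(f"point={point}, value={value:+.2f} -> {label}")
```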
1. The hard margin SVM was introduced by Boser et al. (1992) at COLT, and the
soft margin SVM by Cortes and Vapnik (1995)
2. Hard margin SVM can work only when the data is completely linearly separable
without any errors (noise or outliers)
3. In case of errors, either the margin is smaller or the hard margin SVM fails
4. On the other hand, the soft margin SVM was proposed to solve this
problem by introducing slack variables
5. The allowance of softness in margins (i.e. a low cost setting) allows for
errors to be made while helping the model learn
6. Conversely, hard margins will result in the fitting of a model that allows zero
errors
7. A softer margin may produce a more generalizable model when applied to new datasets
8. Forcing rigid margins can result in a model that performs perfectly on the
training set, but is possibly over-fit/less generalizable when applied to a
new dataset
9. Identifying the best setting for the 'cost' parameter is probably specific to the
data set you are working with
10. Currently there aren't many good solutions for simultaneously optimizing
cost, features, and kernel parameters (if using a non-linear kernel)
11. In both the soft margin and hard margin case we are maximizing the margin
between support vectors, i.e. minimizing \( \frac{1}{2}||w||^2 \)
12. In the soft margin case, we let the model relax the constraint for a few points
13. If we forced the margin to accommodate these points, our margin might shrink
significantly and our decision boundary would be poorer
14. Instead, we give them a certain penalty which is proportional to the amount by
which each data point violates the hard constraint
15. These slack variables represent the deviation of the examples from their
theoretically correct positions
16. By doing this we are relaxing the margin, i.e. we are using a soft margin
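For reference, the penalty and slack variables described above are usually written as the following standard soft-margin optimization problem, where \( C \) is the cost parameter and \( \xi_i \) are the slack variables:

\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}||w||^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
\]

A large \( C \) penalizes violations heavily (approaching the hard margin), while a small \( C \) tolerates more violations in exchange for a wider margin.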
Q8. What is Kernel? How kernel can be used with SVM to classify non-linearly
separable data? Also, list standard kernel functions.
Ans:
KERNEL:
1. SVM algorithms use a set of mathematical functions that are defined as the
kernel
2. The function of a kernel is to take data as input and transform it into the
required form
3. A kernel takes two inputs and returns a measure of how similar they are
4. Different SVM algorithms use different types of kernel functions, for example:
linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid
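As a small sketch of "takes two inputs and returns how similar they are", here is the RBF kernel written as a plain Python function (gamma is a kernel hyperparameter):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2): 1.0 for identical inputs, -> 0 as they move apart."""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-gamma * np.sum((x - z) ** 2))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))   # identical points -> 1.0
print(rbf_kernel([1.0, 2.0], [4.0, 6.0]))   # distant points   -> close to 0.0
```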
1. A single entry in this list of features is a data instance, while the collection
of all such entries is the training data, which forms the basis of your prediction
2. i.e. if you know the skin colour, body hair length, height, and so on of a
particular dog, then you can predict the breed it will probably belong to
3. From the example above of trying to predict the breed of a particular dog, it
goes like this:
4. Data (all breeds of dog) + Features (skin colour, hair, etc.) + Learning
algorithm
Margin:
The margin is the distance between the hyperplane and the closest data
points from each class. The margin is crucial because it defines the "buffer
zone" around the hyperplane where no points should ideally fall.
SVM aims to maximize this margin to create a robust classifier that is less
sensitive to new data points and noise. The margin can be calculated as:
\[
\text{Margin} = \frac{2}{||w||}
\]
where \( ||w|| \) is the Euclidean norm (magnitude) of the vector \( w \).
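As a quick worked example with hypothetical numbers: if the learned weight vector were \( w = (3, 4) \), then

\[
||w|| = \sqrt{3^2 + 4^2} = 5,
\qquad
\text{Margin} = \frac{2}{||w||} = \frac{2}{5} = 0.4
\]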
Support Vectors:
Support vectors are the data points that lie closest to the hyperplane and
influence its position and orientation. They are the critical elements of the
training set since the classifier’s margin is based directly on these points.
In other words, support vectors are the data points that, if removed, would
alter the position of the hyperplane, highlighting their importance in defining
the decision boundary.
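A minimal sketch of this idea using scikit-learn (the toy data below is made up for illustration): after fitting, the classifier exposes exactly those training points that define the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only these points determine the position of the separating hyperplane;
# removing any other training point would leave the boundary unchanged.
print("Support vectors:\n", clf.support_vectors_)
print("Indices of support vectors:", clf.support_)
```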
Kernel functions allow the SVM to separate non-linear data by transforming it
into a higher-dimensional space.
Types of Kernels:
1. Linear Kernel:
2. Polynomial Kernel:
3. Radial Basis Function (RBF) Kernel:
Use: Widely used for non-linear data, as the RBF kernel can model more
complex relationships.
4. Sigmoid Kernel:
Use: Behaves similarly to neural networks’ activation functions and can
be applied to certain types of non-linear data.
Applications: Often used in hybrid models but less common than other
kernels due to its limited suitability.
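For reference, these kernels are commonly written in the following standard forms (with \( \gamma \), \( r \), and \( d \) as kernel hyperparameters):

\[
\begin{aligned}
\text{Linear: } & k(x, z) = x \cdot z \\
\text{Polynomial: } & k(x, z) = (\gamma\, x \cdot z + r)^d \\
\text{RBF: } & k(x, z) = \exp\!\left(-\gamma\, ||x - z||^2\right) \\
\text{Sigmoid: } & k(x, z) = \tanh\!\left(\gamma\, x \cdot z + r\right)
\end{aligned}
\]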
Kernel Trick:
The kernel trick lets the SVM compute similarities as if the data had been mapped
into a higher-dimensional space, without ever computing that mapping explicitly.
Linear SVM:
A linear SVM is suitable for datasets that are linearly separable, meaning
that a straight line (in two dimensions) or a plane (in higher dimensions) can
separate the classes.
It finds a linear hyperplane that maximizes the margin between classes, but
it may not perform well on data with non-linear relationships, leading to
poor classification accuracy on such data.
Non-Linear SVM:
Non-linear SVM uses kernel functions to transform the data into a higher-
dimensional space where a linear separation is possible. By mapping data
into this new space, the SVM can find a hyperplane that separates the
classes even if the original data is not linearly separable.
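A minimal sketch of this difference using scikit-learn's synthetic make_circles data (illustrative only; exact accuracies will vary):

```python
# Linear vs. RBF-kernel SVM on data that is not linearly separable
# (two concentric rings of points).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))

# The linear kernel cannot separate the two rings, while the RBF kernel
# implicitly maps the points into a space where a linear separation exists.
```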
Choosing Between Linear and Non-Linear SVM
When to Use Linear SVM: