
Module 04: Decision Tree (DT)


1. What is a Decision Tree?

A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks.

It splits data into subsets based on feature values, forming branches leading to decision outcomes at the end nodes (leaves).

Each internal node represents a feature (attribute), each branch represents a decision rule, and each leaf represents the outcome.

Decision Trees are visually interpretable, allowing users to follow a straightforward path for decision-making.

2. When to Use Decision Trees:

When interpretability is essential, as Decision Trees offer a clear representation of decision processes.

In cases where there are non-linear relationships between features and the target variable.

When handling both categorical and continuous data, as Decision Trees handle different data types well.

When dealing with data that may have missing values or doesn’t require much preprocessing.

For smaller datasets, as Decision Trees can perform well without needing a vast amount of data.

3. Pitfalls of Decision Trees:

Overfitting: Decision Trees can become overly complex and capture noise
instead of the underlying pattern, especially when the tree grows too deep.

High Variance: Small changes in data can lead to different splits and result
in a different model, making Decision Trees sensitive to data variability.

Bias toward dominant features: Trees can over-rely on features with more levels (in categorical data) or wider value ranges (in continuous data).

Lack of smooth predictions: Unlike other models, Decision Trees produce step-like predictions, which may not generalize well for continuous target variables.

4. How to Overcome Decision Tree Pitfalls:

Pruning: By trimming back the tree, unnecessary branches are removed, reducing overfitting and simplifying the model. Common pruning methods include cost-complexity pruning.

Ensemble Methods: Techniques like Bagging (Bootstrap Aggregation) and Boosting (such as AdaBoost, Gradient Boosting) aggregate multiple trees to lower variance and improve performance.

Setting Constraints: Limit the maximum depth of the tree, the minimum number of samples required to split a node, or the minimum samples required at a leaf node.

Cross-Validation: Use cross-validation to tune hyperparameters and check the model’s generalizability on unseen data.

By managing the depth and complexity and by aggregating multiple trees, the effectiveness and robustness of Decision Trees can be significantly improved.
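The constraints and cross-validation steps above can be combined; here is a minimal sketch, assuming scikit-learn and its built-in iris dataset, that tunes max_depth, min_samples_leaf and the cost-complexity pruning strength ccp_alpha with 5-fold cross-validation. The grid values are illustrative choices, not recommendations.

```python
# Sketch: constrain and prune a decision tree, tuned by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "max_depth": [2, 3, 4, None],      # cap tree depth
    "min_samples_leaf": [1, 5, 10],    # require enough samples at each leaf
    "ccp_alpha": [0.0, 0.01, 0.05],    # cost-complexity pruning strength
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```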

"How are decision trees used for classification?"


Given a tuple, X, for which the associated class label is unknown, the attribute
values of the tuple are tested against the decision tree. A path is traced from
the root to a leaf node, which holds the class prediction for that tuple. Decision
trees can easily be converted to classification rules.
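As a small illustration of tracing a tuple from root to leaf and converting the tree to rules, the following sketch (assuming scikit-learn) fits a shallow tree, classifies one unseen tuple, and prints the tree as readable IF-THEN style rules.

```python
# Sketch: classify a new tuple with a decision tree and print its rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))                   # class label for an unseen tuple
print(export_text(tree, feature_names=iris.feature_names))    # the tree as readable rules
```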

"Why are decision tree classifiers so popular?"


The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for exploratory
knowledge discovery. Decision trees can handle multidimensional data. Their
representation of acquired knowledge in tree form is intuitive and generally
easy to assimilate by humans. The learning and classification steps of decision
tree induction are simple and fast. In general, decision tree classifiers have
good accuracy. However, successful use may depend on the data at hand.
Decision tree induction algorithms have been used for classification in many
application areas such as medicine, manufacturing and production, financial
analysis, astronomy, and molecular biology. Decision trees are the basis of
several commercial rule induction systems.

In Section 8.2.1, we describe a basic algorithm for learning decision trees. During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. Popular measures of attribute selection are given in Section 8.2.2. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data. Tree pruning is described in Section 8.2.3. Scalability issues also arise when inducing decision trees from very large databases.
Entropy is used to measure the discriminatory power of an attribute for the classification task. It quantifies the amount of randomness in an attribute with respect to the class labels. When entropy is minimal, the attribute's values fall almost entirely within one class, so the attribute has good discriminatory power for classification.
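A short illustrative sketch of the entropy idea: counts from a pure attribute value give entropy 0, while an even class mix gives the maximum of 1 bit. The counts below are made up for illustration.

```python
# Sketch: entropy of a class distribution, in bits.
import math

def entropy(class_counts):
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([10, 0]))  # 0.0 -> pure, good discriminatory power
print(entropy([5, 5]))   # 1.0 -> maximal randomness, poor discriminatory power
```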
Limitations of ID3

• Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.

• Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.

• Decision trees can be computationally expensive to train. The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field must be sorted before its best split can be found. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.

Support Vector Machine (SVM) Questions & Answers

Q1. What are the key terminologies of Support Vector Machine?
Ans:
SUPPORT VECTOR MACHINE:

1. A support vector machine is a supervised learning algorithm that sorts data into two categories

2. A support vector machine is also known as a support vector network (SVN)

3. It is trained with a series of data already classified into two categories, building the model as it is initially trained

4. An SVM outputs a map of the sorted data with the margins between the two as far apart as possible

5. SVMs are used in text categorization, image classification, handwriting recognition and in life sciences

Q2. What is SVM? Explain the following terms: hyperplane, separating hyperplane, margin and support vectors with suitable example.
Ans:
HYPERPLANE:

1. A hyperplane is a generalization of a plane

2. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes/groups

3. Figure 4.1 shows an example of a hyperplane

4. As a simple example, for a classification task with only two features as shown in figure 4.1, you can think of a hyperplane as a line that linearly separates and classifies a set of data

5. When new testing data is added, whichever side of the hyperplane it lands on decides the class that we assign to it

SEPARATING HYPERPLANE:

1. From figure 4.1, we can see that it is possible to separate the data

2. We can use a line to separate the data

3. All the data points representing men will be above the line

4. All the data points representing women will be below the line

5. Such a line is called a separating hyperplane

MARGIN:

1. A margin is the separation between the line and the closest class points

2. The margin is calculated as the perpendicular distance from the line to only the closest points

3. A good margin is one where this separation is larger for both the classes

4. A good margin allows the points to stay in their respective classes without crossing to the other class

5. The wider the margin, the more optimal the hyperplane we get

SUPPORT VECTORS:

1. The vectors (cases) that define the hyperplane are the support vectors

2. These vectors lie closest to the hyperplane that separates the classes

3. They come from the groups (classes) being classified using the hyperplane

4. Figure 4.2 shows an example of support vectors

Q4. Define Support Vector Machine (SVM) and further explain the maximum
margin linear separators concept.

Ans:
SUPPORT VECTOR MACHINE:

1. A support vector machine is a supervised learning algorithm that sorts data into two categories

2. A support vector machine is also known as a support vector network (SVN)

3. It is trained with a series of data already classified into two categories, building the model as it is initially trained

4. An SVM outputs a map of the sorted data with the margins between the two as far apart as possible

5. SVMs are used in text categorization, image classification, handwriting recognition and in life sciences

MAXIMAL-MARGIN CLASSIFIER/SEPARATOR:

1. The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in practice

2. The numeric input variables (x) in your data (the columns) form an n-dimensional space

3. For example, if you had two input variables, this would form a two-dimensional space

4. A hyperplane is a line that splits the input variable space

5. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1

6. In two dimensions you can visualize this as a line, and let's assume that all of our input points can be completely separated by this line

7. For example: β₀ + (β₁ * x₁) + (β₂ * x₂) = 0

8. Where the coefficients (β₁ and β₂) that determine the slope of the line and the intercept (β₀) are found by the learning algorithm

9. You can make classifications using this line (a small sketch follows this list)

10. By plugging input values into the line equation, you can calculate whether a new point is above or below the line

11. Above the line, the equation returns a value greater than 0 and the point belongs to the first class

12. Below the line, the equation returns a value less than 0 and the point belongs to the second class

13. A point close to the line returns a value close to zero and may be difficult to classify

14. If the magnitude of the value is large, the model may have more confidence in the prediction

15. The distance between the line and the closest data points is referred to as the margin

16. The best or optimal line that can separate the two classes is the line that has the largest margin

17. This is called the Maximal-Margin hyperplane

18. The margin is calculated as the perpendicular distance from the line to only the closest points

19. Only those points are relevant in defining the line and in the construction of the classifier

20. Those points are called the support vectors

21. The hyperplane is learned from training data using an optimization procedure that maximizes the margin
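A toy sketch of points 7–14 above: classify by the sign of the line equation, with the magnitude as a rough confidence. The coefficients below are invented for illustration, not learned by any algorithm.

```python
# Sketch: classify a point by which side of the line b0 + b1*x1 + b2*x2 = 0 it falls on.
def f(x1, x2, b0=-1.0, b1=2.0, b2=-0.5):
    return b0 + b1 * x1 + b2 * x2

point = (1.5, 0.5)
score = f(*point)
label = 1 if score > 0 else 0   # above the line -> class 1, below -> class 0
print(score, label)             # magnitude of the score ~ confidence
```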

Q7. Write short note on - Soft margin SVM

Ans:
SOFT MARGIN SVM:

1. Soft margin SVM is an extended version of hard margin SVM

2. Hard margin SVM was given by Boser et al. (1992) in COLT and soft margin SVM by Vapnik et al. (1995)

3. Hard margin SVM can work only when data is completely linearly separable without any errors (noise or outliers)

4. In case of errors, either the margin is smaller or hard margin SVM fails

5. On the other hand, soft margin SVM was proposed by Vapnik to solve this problem by introducing slack variables

6. As far as usage is concerned, since soft margin SVM is an extended version of hard margin SVM, less attention is paid to hard margin SVM

7. The allowance of softness in margins (i.e. a low cost setting) allows for errors to be made while helping the model learn

8. Conversely, hard margins will result in fitting of a model that allows zero errors

9. Sometimes it can be helpful to allow for errors in the training set

10. It may produce a more generalizable model when applied to new datasets

11. Forcing rigid margins can result in a model that performs perfectly on the training set, but is possibly over-fit/less generalizable when applied to a new dataset

12. Identifying the best setting for 'cost' depends on the specific data set you are working with (see the sketch after this list)

13. Currently there aren't many good solutions for simultaneously optimizing cost, features and kernel parameters (if using a non-linear kernel)

14. In both the soft margin and hard margin case we are maximizing the margin between support vectors, i.e. minimizing ||w||²/2

15. In the soft margin case, we let the model apply some relaxation to a few points

16. If we considered these points as support vectors, our margin might reduce significantly and our decision boundary would be poorer

17. So instead of considering them as support vectors we consider them as error points

18. And we give a certain penalty for them which is proportional to the amount by which each data point violates the hard constraint

19. Slack variables ξ can be added to allow misclassification of difficult or noisy examples

20. These variables represent the deviation of the examples from their theoretically correct positions

21. By doing this we are relaxing the margin; we are using a soft margin
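A minimal sketch of the 'cost' setting discussed above, assuming scikit-learn, where the C parameter of SVC plays that role: a small C gives a softer margin that tolerates errors, while a very large C approximates a hard margin. The dataset and C values are illustrative.

```python
# Sketch: soft vs. (near-)hard margin via the C parameter of a linear SVC.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

soft = SVC(kernel="linear", C=0.1).fit(X, y)    # soft margin: tolerate some errors
hard = SVC(kernel="linear", C=1000).fit(X, y)   # very large C ~ hard margin

print(len(soft.support_), len(hard.support_))   # softer margin usually keeps more support vectors
```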

Q8. What is a Kernel? How can a kernel be used with SVM to classify non-linearly separable data? Also, list standard kernel functions.

Ans:
KERNEL:

1. A kernel is a similarity function

2. SVM algorithms use a set of mathematical functions that are defined as the kernel

3. The function of a kernel is to take data as input and transform it into the required form

4. It is a function that you provide to a machine learning algorithm

5. It takes two inputs and outputs how similar they are

6. Different SVM algorithms use different types of kernel functions

7. For example: linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid (see the sketch after this list)
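A brief sketch, assuming scikit-learn, of selecting the standard kernel functions listed above via the kernel argument of SVC; the concentric-circles dataset is an illustrative non-linearly separable example, where the RBF kernel typically fits well and the linear kernel cannot.

```python
# Sketch: try the standard kernels on a non-linearly separable dataset.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(kernel, round(clf.score(X, y), 3))   # training accuracy for each kernel
```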

EXAMPLE ON EXPLAINING HOW A KERNEL CAN BE USED FOR CLASSIFYING NON-LINEARLY SEPARABLE DATA:

1. To predict if a dog is a particular breed, we load in millions of dog information/properties like height, skin colour, body hair length, etc.

2. In ML language, these properties are referred to as 'features'

3. A single entry of this list of features is a data instance, while the collection of everything is the Training Data, which forms the basis of your prediction

4. i.e. if you know the skin colour, body hair length, height and so on of a particular dog, then you can predict the breed it will probably belong to

5. In support vector machines, this looks somewhat like figure 4.5, which separates the blue balls from the red

6. Therefore the hyperplane of a two-dimensional space is a one-dimensional line dividing the red and blue dots

7. From the example above of trying to predict the breed of a particular dog, it goes like this:

8. Data (all breeds of dog) + Features (skin colour, hair etc.) + Learning algorithm

9. If we want to solve the following example in a linear manner, then it is not possible to separate the classes with a straight line as we did in the above steps

1. Definition of Key SVM Concepts: Hyperplane, Margin, and Support Vectors

Hyperplane:

In the context of SVM, a hyperplane is a decision boundary that separates different classes in a feature space. In two-dimensional space, the hyperplane is a line, while in three-dimensional space, it becomes a plane, and in higher dimensions, it generalizes to a hyperplane.

Mathematically, a hyperplane in \(n\)-dimensional space can be described by the equation:
\[
w \cdot x + b = 0
\]
where \( w \) is the normal vector (perpendicular) to the hyperplane, \( x \) represents the feature vector of data points, and \( b \) is the bias term.

In classification problems, the hyperplane serves as a boundary that separates data points into distinct classes.

Margin:

The margin is the distance between the hyperplane and the closest data points from each class. The margin is crucial because it defines the "buffer zone" around the hyperplane where no points should ideally fall.

SVM aims to maximize this margin to create a robust classifier that is less sensitive to new data points and noise. The margin can be calculated as:
\[
\text{Margin} = \frac{2}{||w||}
\]
where \( ||w|| \) is the Euclidean norm (magnitude) of the vector \( w \).

Support Vectors:

Support vectors are the data points that lie closest to the hyperplane and
influence its position and orientation. They are the critical elements of the
training set since the classifier’s margin is based directly on these points.

In other words, support vectors are the data points that, if removed, would
alter the position of the hyperplane, highlighting their importance in defining
the decision boundary.
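A small sketch, assuming scikit-learn, showing that a fitted linear SVM exposes the weight vector \( w \) and the support vectors, so the margin \( 2/||w|| \) defined above can be computed directly; the dataset and the large C value are illustrative.

```python
# Sketch: compute the margin 2/||w|| and inspect the support vectors of a linear SVC.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs so a near-hard margin is achievable
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

clf = SVC(kernel="linear", C=1e6).fit(X, y)         # very large C ~ hard margin

w = clf.coef_[0]                                    # normal vector w of the hyperplane
print("margin width:", 2 / np.linalg.norm(w))       # Margin = 2 / ||w||
print("support vectors:\n", clf.support_vectors_)   # the points that define the hyperplane
```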

2. Why SVM is Called a Max Margin Classifier

SVM is known as a max margin classifier because it constructs the hyperplane that maximizes the margin between the nearest points of the different classes. The philosophy is to increase the margin to improve classification robustness.

Maximizing the margin has two key benefits:

1. Reduced Overfitting: Larger margins imply a greater separation between classes, which minimizes the chance of misclassification due to minor variations in new data.

2. Better Generalization: By maximizing the margin, SVM tends to produce models that generalize better on unseen data.

3. Kernelized SVM and the Kernel Trick

In practice, many datasets are not linearly separable in their original feature space, making it challenging for a linear hyperplane to separate the classes accurately. This is where kernelized SVM becomes useful, allowing SVM to separate non-linear data by transforming it into a higher-dimensional space.

A kernel function is applied to map data into a higher-dimensional space without explicitly computing the coordinates in that space. Instead, the kernel function computes the inner product of pairs of data points in the transformed space.

Types of Kernels:

1. Linear Kernel:

Equation: \( K(x, y) = x \cdot y \)

Use: Suitable for linearly separable data where classes can be separated with a straight line or plane.

Applications: Text classification and scenarios with a large number of features but fewer data points (e.g., sentiment analysis in NLP).

2. Polynomial Kernel:

Equation: \( K(x, y) = (x \cdot y + 1)^d \), where \( d \) is the degree of the polynomial.

Use: Useful when the relationship between classes is polynomial in nature.

Applications: Image processing tasks where features may exhibit polynomial relationships, making them more suitable for polynomial SVMs.

3. Radial Basis Function (RBF) Kernel:

Equation: \( K(x, y) = \exp(-\gamma ||x - y||^2) \), where \( \gamma \) controls the width of the Gaussian function.

Use: Widely used for non-linear data, as the RBF kernel can model more complex relationships.

Applications: Image recognition, bioinformatics, and applications requiring high flexibility in decision boundaries.

4. Sigmoid Kernel:

Equation: \( K(x, y) = \tanh(\alpha x \cdot y + c) \), where \( \alpha \) and \( c \) are kernel parameters.

Use: Behaves similarly to neural networks’ activation functions and can be applied to certain types of non-linear data.

Applications: Often used in hybrid models but less common than other kernels due to its limited suitability.

Kernel Trick:

The kernel trick enables SVM to operate in a higher-dimensional feature space without explicitly computing the transformations. Instead, it computes the inner product between pairs of data points using the kernel function, which is equivalent to the dot product in the high-dimensional space.

This approach is computationally efficient and allows SVM to handle complex, non-linear decision boundaries.
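A tiny numeric sketch of the kernel trick using the degree-2 polynomial kernel from above: the kernel value computed in the original 2-dimensional space equals an ordinary dot product under an explicit 6-dimensional feature map, which the kernel lets us avoid building. The points are arbitrary.

```python
# Sketch: K(x, y) = (x . y + 1)^2 equals a dot product in an explicit 6-d feature space.
import numpy as np

def phi(v):
    # explicit degree-2 feature map for a 2-d input (x1, x2)
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

kernel_value = (x @ y + 1) ** 2      # computed in the original 2-d space
explicit_value = phi(x) @ phi(y)     # same value via the explicit 6-d map
print(kernel_value, explicit_value)  # both 144.0
```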

4. Handling Non-Linear Data: Differences Between Linear and Non-Linear SVM

Linear SVM:

A linear SVM is suitable for datasets that are linearly separable, meaning that a straight line (in two dimensions) or a plane (in higher dimensions) can separate the classes.

It finds a linear hyperplane that maximizes the margin between classes, but it may not perform well on data with non-linear relationships, leading to poor classification accuracy on such data.

Non-Linear SVM:

Non-linear SVM uses kernel functions to transform the data into a higher-dimensional space where a linear separation is possible. By mapping data into this new space, the SVM can find a hyperplane that separates the classes even if the original data is not linearly separable.

Advantages: Non-linear SVM is highly effective on complex data with intricate relationships, making it flexible enough to handle a wide variety of classification tasks.

Drawback: Non-linear SVMs require more computation due to the transformation of data into higher dimensions, making them less suitable for very large datasets with limited computational resources.

Choosing Between Linear and Non-Linear SVM

When to Use Linear SVM:

If data is linearly separable or if the dataset has a high number of features but relatively few samples.

When computational efficiency is a priority, as linear SVMs are less resource-intensive.

When to Use Non-Linear SVM:

If data is not linearly separable, meaning classes have complex and non-linear relationships.

For applications requiring high flexibility in decision boundaries, such as image or speech recognition tasks (a brief comparison sketch follows).
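A brief comparison sketch, assuming scikit-learn, of the choice above: on a non-linearly separable "two moons" dataset, a linear SVM is limited by its straight boundary, while an RBF-kernel SVM can follow the curved one. Dataset and parameters are illustrative.

```python
# Sketch: linear vs. RBF-kernel SVM on data that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("linear:", linear_svm.score(X_test, y_test))  # constrained to a straight boundary
print("rbf:   ", rbf_svm.score(X_test, y_test))     # can fit the curved boundary
```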
