The document discusses various machine learning concepts, focusing on Support Vector Machines (SVM) and Decision Trees. It details different types of kernels used in SVM, the structure and functioning of decision trees, and introduces Bayesian learning methods, including Naïve Bayes classifiers and Bayesian belief networks. Key challenges and advantages of each method are also highlighted, emphasizing the importance of kernel selection, overfitting, and the handling of uncertainty.

Types of Kernels in Support Vector Machines (SVM)

In SVM, kernels play a crucial role in transforming the input data into a higher-dimensional space,
enabling the algorithm to find optimal hyperplanes for classification or regression. Here are some
common types of kernels used in SVM:

1. Linear Kernel - Used when the data is linearly separable.

• Equation: K(x1, x2) = x1 · x2

• Description: The simplest kernel, calculating the dot product between two vectors.

2. Polynomial Kernel - Suitable for non-linear data.

• Equation: K(x1, x2) = (x1 · x2 + c)^d

• The degree of the polynomial determines the flexibility.

3. Radial Basis Function (RBF) Kernel - Most commonly used for non-linear problems.

• Captures the similarity between points based on distance.

• Equation: K(x1, x2) = exp(-γ ||x1 - x2||²)

• The parameter 'γ' controls the width of the RBF kernel.

4. Sigmoid Kernel - Similar to a neural network activation function.

• Equation: K(x1, x2) = tanh(γ x1 · x2 + c)

5. Custom Kernel

• You can define your own kernel function tailored to the specific problem.

Choosing the Right Kernel

The choice of kernel depends on the nature of the data and the complexity of the problem.

• Linearly separable data: Use the linear kernel.

• Non-linear, low-dimensional data: Consider the polynomial kernel.

• Complex, high-dimensional data: The RBF kernel is often a good choice.
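
As a rough illustration of these choices, the sketch below (assuming scikit-learn and a small synthetic two-moons dataset, both illustrative assumptions) fits an SVC with each kernel type and compares test accuracy; on such non-linear data the RBF kernel usually comes out ahead.

# A minimal sketch comparing SVM kernels; dataset and parameter values are illustrative.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # degree only affects the polynomial kernel; gamma affects poly, rbf and sigmoid
    clf = SVC(kernel=kernel, degree=3, gamma="scale", C=1.0)
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))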


In the context of Support Vector Machines (SVM), a hyperplane is a decision boundary that
separates different classes in the feature space. It is defined mathematically as:

W · X + b = 0

Where:

• W is the weight vector (normal to the hyperplane).

• X is the input vector (data point).

• b is the bias term (offset from the origin).
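
As a minimal sketch (assuming scikit-learn; the tiny dataset below is an illustrative assumption), the weight vector W and bias b can be read off a fitted linear SVM, and the sign of W · X + b tells which side of the hyperplane a new point falls on.

# Recover W and b from a fitted linear SVM and evaluate W·X + b for a new point.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])  # two separable clusters
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
W, b = clf.coef_[0], clf.intercept_[0]

x_new = np.array([3, 3])
score = W @ x_new + b          # sign of the score gives the predicted side
print(W, b, score, clf.predict([x_new]))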

Key Features of a Hyperplane:

1. Dimensionality:

o In a 2D space, the hyperplane is a line.

o In a 3D space, the hyperplane is a plane.

o In higher dimensions, it is a generalization called a "hyperplane."

2. Maximizing Margin:

o SVM aims to find the hyperplane that maximizes the margin between the classes.

o The margin is the distance between the hyperplane and the closest data points from
each class, called support vectors.

3. Linear Separability:

o If the data is linearly separable, the SVM finds a single hyperplane.

o If the data is not linearly separable, kernel functions are used to transform the data
into a higher-dimensional space.

Hyperplane in Different Scenarios:

• For binary classification, the hyperplane separates the two classes.

• For multi-class classification, multiple hyperplanes are used (one-vs-one or one-vs-all strategies).

Properties of SVM (Support Vector Machine):


1. Maximizes the Margin:

o SVM aims to find a hyperplane that maximizes the margin (distance between the
hyperplane and the nearest data points of any class).

2. Uses Support Vectors:

o The decision boundary is determined only by the support vectors, so the model stays
compact: only those points need to be stored to make predictions.
3. Effective in High Dimensions:

o SVM works well with datasets that have a high number of features, as it avoids
overfitting by maximizing the margin.

4. Kernel Trick:

o SVM can handle non-linear relationships by using kernel functions to map data into a
higher-dimensional space.

5. Regularization Parameter (C):

o The parameter C controls the trade-off between achieving a large margin and
minimizing classification error (a small sketch after this list illustrates its effect).

6. Versatile:

o Can be used for classification, regression (SVR), and outlier detection.
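
The following sketch (assuming scikit-learn and a synthetic dataset, both illustrative) shows the margin trade-off controlled by C: smaller values tolerate more margin violations and typically keep more support vectors, while larger values fit the training data more tightly.

# Effect of the regularization parameter C, visible through the number of support vectors.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # smaller C -> wider margin, more support vectors; larger C -> narrower margin
    print(C, clf.n_support_)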

Issues in SVM:
1. Choosing the Right Kernel:

o Selecting an appropriate kernel (linear, polynomial, RBF, etc.) and its parameters is
crucial but can be challenging.

2. Computational Complexity:

o Training time is high for large datasets, especially with non-linear kernels.

3. Sensitivity to Parameters:

o Performance heavily depends on hyperparameters like C (regularization) and γ (kernel parameter).

4. Not Suitable for Noisy Data:

o SVM is sensitive to outliers; noisy data can significantly affect the margin and
hyperplane.

5. Class Imbalance:

o Struggles with datasets where one class has significantly more samples than the
other, as the decision boundary can get skewed.

6. Interpretability:

o The results of an SVM model, especially with non-linear kernels, are less
interpretable compared to simpler models.

7. Scaling of Features:

o SVM requires proper scaling of features to perform optimally, as it is sensitive to the
magnitude of the features (see the scaling and tuning sketch after this list).
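
A hedged sketch addressing issues 3 and 7 together (assuming scikit-learn and its built-in breast-cancer dataset, both illustrative choices): scale the features in a pipeline and search over C and γ with cross-validation rather than guessing them.

# Scale features, then tune C and gamma with a grid search.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, "scale"]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
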
Decision Tree: A decision tree is a supervised learning model used for classification and
regression tasks. It works by splitting data into subsets based on feature values, creating a
tree-like structure of decisions.

Key Components:

1. Root Node: Represents the entire dataset and the starting point for splitting.

2. Internal Nodes: Represent tests on features to split data further.

3. Branches: Represent the outcome of a test (feature value).

4. Leaf Nodes: Represent the final decision or prediction (class or value).

How a Decision Tree Works:

1. Splitting: At each node, the algorithm selects the feature that best splits the dataset.

o Metrics for splitting include:

▪ Gini Index: Measures impurity of a split.

▪ Information Gain: Measures reduction in entropy after the split.

▪ Variance Reduction: Used for regression tasks.

2. Stopping Criterion: Splitting stops when:

▪ All data points in a node belong to one class (pure node).

▪ No features are left for further splitting.

▪ A maximum depth or minimum samples per node is reached.

3. Prediction:

o In classification: Assign the most common class in the leaf node.

o In regression: Assign the mean or median of the target variable in the leaf node.
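
As a small worked example of the splitting metrics above (the class counts are made-up for illustration), the sketch below computes the entropy of a parent node and the information gain of a candidate split into two subsets.

# Entropy and information gain for a candidate split; counts are illustrative only.
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

parent = ["yes"] * 9 + ["no"] * 5      # 9 positive, 5 negative examples
left   = ["yes"] * 6 + ["no"] * 1      # subset where the feature takes value A
right  = ["yes"] * 3 + ["no"] * 4      # subset where the feature takes value B

weighted = (len(left) / len(parent)) * entropy(left) + \
           (len(right) / len(parent)) * entropy(right)
info_gain = entropy(parent) - weighted
print(round(entropy(parent), 3), round(info_gain, 3))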

Algorithm Steps of Decision Tree


1. Input: Training dataset with features and labels.
2. Start at the Root Node: Calculate the splitting criteria (e.g., Information Gain, Gini Impurity)
for all features.
Choose the feature with the highest information gain (or lowest Gini Impurity).
3. Split the Data: Split the data into subsets based on the chosen feature. Each subset forms a
branch from the node.
4. Repeat: Apply the same process to each subset (node) until one of the stopping criteria is
met (e.g., pure nodes, max depth reached).
5. Output: A decision tree where each leaf node provides the final classification or prediction.
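
A hedged sketch of these steps using a library implementation (assuming scikit-learn and its built-in iris dataset, both illustrative choices): criterion="entropy" corresponds to splitting by information gain, and export_text prints the learned tree so the splits and leaf nodes can be inspected.

# Fit a small decision tree and print its splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(export_text(tree))   # text rendering of the learned splits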

Termination Conditions:

1. All data points in a node belong to a single class.

2. No features are left to split on.

3. A pre-defined tree depth or minimum samples per node is reached.


ID3 Algorithm (Iterative Dichotomiser 3): The ID3 algorithm is one of the earliest and
best-known decision tree learning algorithms, developed by Ross Quinlan. ID3 is used to create
a decision tree from a given dataset.

Working Principle of ID3 Algorithm –

• The goal of ID3 is to construct a decision tree that can classify a set of training examples into
given classes based on the features.
• The ID3 algorithm uses Information Gain based on Entropy as the splitting criterion to
determine the best feature at each node of the tree.

Steps in ID3 Algorithm


1. Start at the Root Node: Begin with the full training dataset at the root.
2. Calculate Entropy and Information Gain for all features.
3. Choose the Best Feature: The feature with the highest Information Gain is selected to split
the data.
4. Split Data: Create branches for each possible value of the feature, and assign subsets of the
training data to these branches.
5. Repeat Recursively: Continue splitting each subset based on the next feature with the
highest information gain until all data points are classified or other stopping criteria are met
(e.g., all samples belong to the same class).
6. Create Leaf Nodes: When all data points are classified, the nodes are turned into leaf nodes,
which represent the final decision.
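
A compact sketch of these steps on categorical data follows; the toy records and feature names are illustrative assumptions, and the implementation is a minimal reading of ID3 rather than a production version.

# Minimal ID3: rows are dicts of feature -> value plus a "label" key.
import math
from collections import Counter

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, feature):
    total = len(rows)
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r for r in rows if r[feature] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, features):
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1:                  # pure node -> leaf
        return labels[0]
    if not features:                           # no features left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, f))   # highest information gain
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [f for f in features if f != best])
    return tree

data = [
    {"outlook": "sunny", "windy": "no",  "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "stay"},
    {"outlook": "rain",  "windy": "no",  "label": "play"},
    {"outlook": "rain",  "windy": "yes", "label": "stay"},
]
print(id3(data, ["outlook", "windy"]))   # splits on "windy", the higher-gain feature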

Features of ID3
• Attribute Selection - ID3 uses Information Gain to determine the most informative attribute
at each level.
• Works Well for Categorical Data: ID3 is suited for classification problems, particularly with
categorical data.
Inductive Bias-
Inductive Bias refers to the set of assumptions a machine learning algorithm makes to generalize
from the training data to unseen data.

Inductive Bias in Decision Trees:


The Inductive Bias of a decision tree is:

• Shorter Trees Are Preferred: Decision trees aim to create the shortest possible tree that fits
the data.
• Preference for Features with High Information Gain: The ID3 algorithm selects features
based on their information gain, assuming that features with higher information gain lead to
better classification results.

Importance of Inductive Bias:

• Inductive bias helps decision trees generalize well to unseen data by preventing them from
creating unnecessarily complex models that overfit the training data.

Inductive Bias in Machine Learning Models:

1. Support Vector Machines (SVM):

o Assumes data is separable by a hyperplane or can be made separable with a kernel function.

2. Neural Networks:

o Assumes a smooth mapping between inputs and outputs.

o Architecture (e.g., number of layers, activation functions) defines the bias.

3. Decision Trees:

o Prefers smaller trees with fewer splits (Occam's Razor).

4. k-Nearest Neighbors (k-NN):

o Assumes that data points close to each other have the same label.

Issues in Decision Tree Learning

1. Overfitting:

o Trees grow too complex, capturing noise and reducing generalization.

2. Underfitting:

o Trees are too simple, failing to capture patterns in data.

3. Bias Toward Features with Many Levels:

o Features with many unique levels dominate splits unfairly.

4. Instability:

o Small changes in data can lead to significantly different tree structures.

5. Scalability:

o Computationally expensive for large datasets or high-dimensional data.

6. Lack of Smooth Decision Boundaries:

o Decision trees create step-like boundaries that may not fit complex patterns.

7. Data Imbalance Sensitivity:

o Performance can drop if one class significantly outweighs others.

8. Greedy Algorithm Limitation:

o Greedy splitting may result in suboptimal trees, as it only considers local optima.

9. Difficulty in Handling Missing Data:

o Decision trees struggle to handle missing values effectively without imputation or


specific strategies.

10. Pruning Complexity:

o Pruning decisions can be challenging, and incorrect pruning may reduce tree accuracy (a small pruning sketch follows this list).
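
As a small, hedged sketch of pruning-style controls (assuming scikit-learn and its breast-cancer dataset, both illustrative choices), limiting tree depth or setting the cost-complexity parameter ccp_alpha are two common ways to trade training fit for generalization.

# Compare an unpruned tree with a depth-limited and a cost-complexity-pruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for params in [{}, {"max_depth": 3}, {"ccp_alpha": 0.01}]:
    tree = DecisionTreeClassifier(random_state=0, **params)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(params, round(score, 3))
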
Bayesian Learning - It is a probabilistic framework for machine learning that leverages
Bayes' theorem to update the probability of a hypothesis based on observed evidence or
data. Bayesian learning methods are valuable for dealing with uncertainty and making
predictions that incorporate prior knowledge.
Bayes' Theorem: It is a mathematical formula used to update the probability of an event
happening based on new evidence: P(H|D) = P(D|H) · P(H) / P(D). It tells you how to combine
what you already know (prior knowledge) with new information to make better predictions.
Bayes Optimal Classifier: It is a theoretical model that provides the best possible
prediction for a classification problem. It predicts the class with the highest posterior
probability, considering all possible hypotheses and their probabilities.

Steps in Bayes Optimal Classification:

1. Compute the Posterior Probability of Each Hypothesis:
Use Bayes' theorem to find how probable each hypothesis is given the training data.

2. Combine the Hypotheses:
For each class, weight the prediction of every hypothesis by that hypothesis's posterior probability.

3. Predict the Most Probable Class:
Choose the class C with the maximum posterior probability.
Example:
Scenario: Classify an email as spam or not spam based on words like "win" or "offer."
• Classes: C1 = Spam, C2 = Not Spam
• Hypotheses:
o h1: “Emails with ‘win’ and ‘money’ are spam.”
o h2: “Emails with ‘offer’ are spam.”
Step 1: Compute Prior Probabilities P(C1) and P(C2).
Step 2: Calculate Likelihoods P(D|C1) and P(D|C2).
Step 3: Compute Posterior Probabilities using Bayes’ Theorem.
Step 4: Predict the class with the highest P(C|D).
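
A minimal sketch of these four steps in code, using made-up prior and likelihood values for the spam example (the numbers are illustrative assumptions, not estimates from data):

priors = {"spam": 0.4, "not_spam": 0.6}             # Step 1: P(C1), P(C2)
likelihoods = {"spam": 0.30, "not_spam": 0.05}      # Step 2: P(D|C) for the observed words

# Step 3: posterior P(C|D) = P(D|C) * P(C) / P(D), with P(D) as the normalizer
evidence = sum(likelihoods[c] * priors[c] for c in priors)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

# Step 4: predict the class with the highest posterior probability
print(posteriors, max(posteriors, key=posteriors.get))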
Advantages:
1. Optimal Performance: Guarantees the best prediction if probabilities are known.
2. Accounts for Uncertainty: Considers all hypotheses and weighs their probabilities.
Challenges:
1. Computational Complexity: Requires calculating probabilities for all possible
hypotheses.
2. Dependency on Accurate Probabilities: Requires accurate prior and likelihood probabilities,
which are rarely known exactly in practice.
Naïve Bayes Classifier: It is a method used to classify things (like whether an email is
spam or not) based on probabilities. It uses a formula called Bayes' Theorem to calculate
the likelihood of something belonging to a particular category.

How It Works:
1. Look at the Features:
It checks for specific clues (like whether an email has words like "win" or "offer").
2. Assume Independence:
It assumes that each clue works independently (even if that’s not true in real life).
3. Calculate Probabilities:
It calculates the probability of the email being spam or not spam based on the clues.
4. Pick the Most Likely Category:
It predicts the category (spam or not spam) with the highest probability.

Steps in Naïve Bayes Classification:


1. Calculate Prior Probabilities P(C):
Estimate the prior probability for each class C from the training data.
2. Calculate Likelihood P(Xi|C):
For each feature Xi and class C, compute the likelihood.
3. Apply Bayes' Theorem:
Combine prior and likelihood to calculate P(C|X) for each class.
4. Choose the Class with Maximum Posterior Probability:
Predict the class C that maximizes P(C|X).
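
A hedged sketch of these steps using a library implementation (assuming scikit-learn; the tiny email corpus is an illustrative assumption): MultinomialNB estimates the priors P(C) and likelihoods P(Xi|C) from word counts and predicts the class with the highest posterior.

# Naive Bayes spam classification over word-count features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "exclusive offer win", "meeting at noon", "project status update"]
labels = ["spam", "spam", "not_spam", "not_spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)     # word-count features Xi

model = MultinomialNB()                  # learns P(C) and P(Xi|C) from the counts
model.fit(X, labels)

test = vectorizer.transform(["win a free offer"])
print(model.predict(test), model.predict_proba(test))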
Bayesian Belief Networks (BBNs): A Bayesian Belief Network, also called a Bayesian Network, is a
type of graphical model that represents a set of variables and their probabilistic dependencies using
a directed acyclic graph (DAG). It is based on Bayesian probability theory.
Key Components:
1. Nodes: Each node represents a variable (e.g., weather, traffic). Variables can be
observable, hidden, or query variables.
2. Edges: Directed edges (arrows) between nodes represent dependencies. For
example, if rain affects traffic, an arrow will go from "Rain" to "Traffic."
3. Conditional Probabilities: Each node has a conditional probability table (CPT) that
quantifies the effect of parent nodes on it.
o Example: The probability of traffic given it is raining
4. Independence:
A node is conditionally independent of its non-descendants, given its parent nodes.
How It Works:
1. Nodes = Things to Know: Each circle (node) in the map represents something you
care about. For example:
o "Rain" (Is it raining?)
o "Traffic" (Is there heavy traffic?)
2. Arrows = Connections: Arrows between nodes show how one thing affects another.
For example:
o If it rains, it might cause traffic, so there's an arrow from "Rain" to "Traffic."
3. Probabilities = How Likely Things Are:
Each connection has a table of probabilities. For example:
o If it rains, the chance of traffic might be 90%.
o If it doesn’t rain, the chance of traffic might only be 30%.
4. Make Predictions: If you know one thing (like "It’s raining"), the network can
calculate how likely other things are (like "There will be traffic").
Example: Imagine you want to predict if you'll be late for work:
• Nodes: "Rain," "Traffic," "Late for Work."
• Connections: Rain → Traffic (Rain can cause traffic).
o Traffic → Late for Work (Traffic can make you late).
If you know it's raining, the network can predict the chances of heavy traffic and whether
you might be late for work.
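
A minimal sketch of this Rain → Traffic → Late-for-Work network in code, with hypothetical conditional probabilities (all numbers are illustrative assumptions), queried by summing over the hidden Traffic variable:

# Tiny Bayesian network: Rain -> Traffic -> Late, with made-up CPT values.
p_rain = 0.3                                          # prior P(Rain)
p_traffic_given_rain = {True: 0.9, False: 0.3}        # P(Traffic | Rain)
p_late_given_traffic = {True: 0.7, False: 0.1}        # P(Late | Traffic)

def p_late(rain):
    # sum over the hidden Traffic variable, conditioning on the observed Rain value
    return sum(
        (p_traffic_given_rain[rain] if traffic else 1 - p_traffic_given_rain[rain])
        * p_late_given_traffic[traffic]
        for traffic in (True, False)
    )

print("P(Late | Rain=True)  =", p_late(True))    # 0.9*0.7 + 0.1*0.1 = 0.64
print("P(Late | Rain=False) =", p_late(False))   # 0.3*0.7 + 0.7*0.1 = 0.28
print("P(Late) =", p_rain * p_late(True) + (1 - p_rain) * p_late(False))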
Advantages:
1. Handles Uncertainty: Reasons with probabilities, so useful conclusions can be drawn even from incomplete or noisy evidence.
2. Intuitive Representation: The graph structure makes the dependencies between variables easy to read and explain.
3. Flexible Querying: Any variable can be queried given evidence about any of the others.
Challenges:
1. Complexity: Computationally expensive for large networks.
2. Requires Probabilities: Needs accurate conditional probabilities for all variables.
