Introduction to Support Vector Regression (SVR)
The primal formulation of ϵ-insensitive SVR minimizes
$$
\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
$$
subject to
$$
\begin{cases}
y_i - (w^\top \phi(x_i) + b) \le \epsilon + \xi_i, \\
(w^\top \phi(x_i) + b) - y_i \le \epsilon + \xi_i^*, \\
\xi_i,\ \xi_i^* \ge 0, \quad \forall i \in \{1, 2, \dots, n\}.
\end{cases}
$$
Explanation:
w: Weight vector in the feature space.
b: Bias term.
The objective is to minimize the norm of the weight vector (to ensure flatness of the function) while penalizing deviations beyond ϵ.
To derive the dual, we introduce Lagrange multipliers α_i, α_i^* ≥ 0 for the ϵ-constraints and η_i, η_i^* ≥ 0 for the non-negativity of the slack variables, giving the Lagrangian
$$
L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
- \sum_{i=1}^{n} \alpha_i \left( \epsilon + \xi_i - y_i + w^\top \phi(x_i) + b \right)
- \sum_{i=1}^{n} \alpha_i^* \left( \epsilon + \xi_i^* + y_i - w^\top \phi(x_i) - b \right)
- \sum_{i=1}^{n} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right).
$$
Setting the partial derivatives of L with respect to w, b, ξ_i, and ξ_i^* to zero, we obtain:
1. Derivative with respect to w:
$$
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \phi(x_i) = 0
$$
This implies
$$
w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \phi(x_i).
$$
2. Derivative with respect to b:
$$
\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0,
$$
which gives the equality constraint
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0.
$$
3. Derivatives with respect to ξ_i and ξ_i^*:
$$
\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \eta_i = 0 \quad\Longrightarrow\quad \eta_i = C - \alpha_i,
$$
$$
\frac{\partial L}{\partial \xi_i^*} = C - \alpha_i^* - \eta_i^* = 0 \quad\Longrightarrow\quad \eta_i^* = C - \alpha_i^*.
$$
Since η_i, η_i^* ≥ 0, it follows that 0 ≤ α_i, α_i^* ≤ C.
Bias Term b
To find b, we use the fact that for any support vector x_k with 0 < α_k < C or 0 < α_k^* < C, the following holds:
$$
b = y_k - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x_k) \pm \epsilon,
$$
with −ϵ when 0 < α_k < C (the constraint y_k − f(x_k) = ϵ is then active) and +ϵ when 0 < α_k^* < C.
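In code, this bias computation is only a few lines. The sketch below is purely an illustration: the arrays alpha and alpha_star (assumed to come from some dual solver), the kernel matrix K, and the targets y are hypothetical inputs, not part of any particular library's API.

```python
import numpy as np

def compute_bias(alpha, alpha_star, K, y, C, eps, tol=1e-8):
    """Estimate b by averaging over all support vectors with a free multiplier.

    alpha, alpha_star : dual variables, shape (n,)
    K                 : kernel matrix, K[i, j] = K(x_i, x_j), shape (n, n)
    y                 : targets, shape (n,)
    C, eps            : SVR hyperparameters (box constraint and epsilon-tube width)
    """
    beta = alpha - alpha_star                   # net coefficients (alpha_i - alpha_i^*)
    b_values = []
    for k in range(len(y)):
        wk = beta @ K[:, k]                     # w^T phi(x_k) expressed through kernels
        if tol < alpha[k] < C - tol:            # free alpha_k  ->  b = y_k - wk - eps
            b_values.append(y[k] - wk - eps)
        elif tol < alpha_star[k] < C - tol:     # free alpha_k* ->  b = y_k - wk + eps
            b_values.append(y[k] - wk + eps)
    # Averaging over all free support vectors is numerically more stable
    # than relying on a single one.
    return float(np.mean(b_values)) if b_values else 0.0
```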
The Kernel Trick
Principle
The kernel trick allows us to compute inner products in the high-dimensional feature space without explicitly performing the transformation φ(x). Instead, we use a kernel function K(x_i, x_j) that computes the inner product directly:
$$
K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.
$$
Common kernel functions include:
1. Linear Kernel:
$$K(x_i, x_j) = x_i^\top x_j.$$
2. Polynomial Kernel:
$$K(x_i, x_j) = \left( \gamma\, x_i^\top x_j + r \right)^d, \quad \gamma > 0.$$
3. Radial Basis Function (RBF) Kernel:
$$K(x_i, x_j) = \exp\!\left( -\gamma \, \|x_i - x_j\|^2 \right), \quad \gamma > 0.$$
4. Sigmoid Kernel:
$$K(x_i, x_j) = \tanh\!\left( \gamma\, x_i^\top x_j + r \right).$$
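To make the principle concrete, the following sketch (an illustration of the kernel trick, using a hand-built feature map assumed for the degree-2 polynomial kernel with γ = 1 and r = 0) shows that the kernel value equals an explicit inner product in feature space, even though the kernel itself never forms φ(x):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so that <phi(x), phi(z)> = (x^T z)^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z, gamma=1.0, r=0.0, d=2):
    """Polynomial kernel K(x, z) = (gamma * x^T z + r)^d."""
    return (gamma * np.dot(x, z) + r) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes give the same number, but only the second one builds phi explicitly.
print(poly_kernel(x, z))        # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))   # 1*9 + 4*1 + (2*sqrt(2))*(-3*sqrt(2)) = 9 + 4 - 12 = 1.0
```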
Concrete Example
Let’s consider a simple dataset to illustrate SVR with a linear kernel.
Dataset:
Suppose we have three data points:
x_1 = 1,  y_1 = 2,
x_2 = 2,  y_2 = 3,
x_3 = 3,  y_3 = 2.
Parameters:
ϵ = 0.5
C = 1.0
Kernel: Linear, K(x_i, x_j) = x_i^⊤ x_j
1. Solve the Dual Problem:
Maximize the dual objective subject to
$$
\sum_{i=1}^{3} (\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^* \le 1, \quad \forall i.
$$
Suppose solving this problem yields
$$
\alpha_1 - \alpha_1^* = 0.6, \qquad \alpha_2 - \alpha_2^* = 0, \qquad \alpha_3 - \alpha_3^* = -0.6,
$$
so that w = Σ_i (α_i − α_i^*) x_i = 0.6(1) + 0(2) − 0.6(3) = −1.2.
2. Compute b:
Using a support vector (e.g., x_1):
$$
b = y_1 - w^\top x_1 \pm \epsilon = 2 + 1.2 \pm 0.5 = 3.2 \pm 0.5.
$$
Since ϵ = 0.5, b lies between 2.7 and 3.7. We can choose b = 3.2 for simplicity.
3. Final Regression Function:
$$
f(x) = -1.2\,x + 3.2.
$$
Making Predictions:
For x = 1.5: f(1.5) = −1.2(1.5) + 3.2 = 1.4.
Visualization:
The regression function f ( x ) is a straight line with negative slope, fitting the
data within the ϵ -tube (± 0.5 around the regression line), except possibly at
the support vectors.
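For comparison, the same toy problem can be run through scikit-learn's SVR. This is only a sanity check: the library solves the same dual problem, but the coefficients and intercept it returns may differ somewhat from the hand-picked values above, since more than one function can fit the data within the ϵ-tube.

```python
import numpy as np
from sklearn.svm import SVR

# The three training points from the example above.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 3.0, 2.0])

# Linear kernel, C = 1.0, epsilon = 0.5, matching the parameters in the text.
model = SVR(kernel="linear", C=1.0, epsilon=0.5)
model.fit(X, y)

print("w      =", model.coef_)          # net weight, comparable to -1.2 above
print("b      =", model.intercept_)     # bias term, comparable to 3.2 above
print("f(1.5) =", model.predict([[1.5]]))
```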
Connecting the Dots
From Primal to Dual: The Lagrangian converts the constrained primal problem into the dual; setting its derivatives to zero expresses w as a weighted combination of the mapped training points and produces the equality and box constraints on α_i and α_i^*.
Conclusion
The dual form of SVR, combined with the kernel trick, provides a powerful
framework for performing regression in high-dimensional or infinite-
dimensional spaces efficiently. By understanding the role of each component
in the dual formulation and how they contribute to the final regression
function, we can better interpret and apply SVR to complex real-world
problems.
The primal problem minimizes
$$
\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
$$
subject to:
$$
y_i - w^\top \phi(x_i) - b \le \epsilon + \xi_i
$$
$$
w^\top \phi(x_i) + b - y_i \le \epsilon + \xi_i^*
$$
$$
\xi_i,\ \xi_i^* \ge 0, \quad \forall i
$$
where:
C controls the tradeoff between model complexity and tolerance to
errors.
ϵ defines the margin within which errors are ignored.
ξ_i, ξ_i^* are slack variables for points outside the margin.
The Lagrangian introduces multipliers α_i, α_i^* for the ϵ-constraints and η_i, η_i^* for the slack variables:
$$
L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
- \sum_{i=1}^{n} \alpha_i \left( \epsilon + \xi_i - y_i + w^\top \phi(x_i) + b \right)
- \sum_{i=1}^{n} \alpha_i^* \left( \epsilon + \xi_i^* + y_i - w^\top \phi(x_i) - b \right)
- \sum_{i=1}^{n} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right),
$$
where η_i, η_i^* ≥ 0 are multipliers enforcing ξ_i, ξ_i^* ≥ 0.
Dual Optimization Problem
By setting the derivatives to zero and eliminating w, we obtain the dual problem:
$$
\max_{\alpha,\, \alpha^*} \;\; \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
\;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
\;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
$$
subject to:
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$
$$
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i
$$
where K(x_i, x_j) = φ(x_i)^⊤ φ(x_j) is the kernel function.
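The sketch below (illustrative helper functions, not library code) evaluates this dual objective and checks its constraints for candidate multipliers; the candidate values reuse the three-point example from earlier in the text.

```python
import numpy as np

def dual_objective(alpha, alpha_star, K, y, eps):
    """Value of the SVR dual objective for candidate multipliers."""
    beta = alpha - alpha_star
    return beta @ y - 0.5 * beta @ K @ beta - eps * np.sum(alpha + alpha_star)

def feasible(alpha, alpha_star, C, tol=1e-8):
    """Check the equality and box constraints of the dual problem."""
    balanced = abs(np.sum(alpha - alpha_star)) <= tol
    in_box = np.all((alpha >= -tol) & (alpha <= C + tol) &
                    (alpha_star >= -tol) & (alpha_star <= C + tol))
    return balanced and in_box

# Candidate multipliers for the three-point example (x = [1, 2, 3], linear kernel).
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 2.0])
K = np.outer(X, X)                        # linear kernel matrix
alpha = np.array([0.6, 0.0, 0.0])
alpha_star = np.array([0.0, 0.0, 0.6])

print(feasible(alpha, alpha_star, C=1.0))                 # True
print(dual_objective(alpha, alpha_star, K, y, eps=0.5))   # objective value for these candidates
```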
Common kernel choices:
Radial Basis Function (RBF): K(x_i, x_j) = exp(−γ ||x_i − x_j||²). Models complex, nonlinear patterns.
Sigmoid: K(x_i, x_j) = tanh(β x_i^⊤ x_j + c). Inspired by neural networks.
Steps
1. Compute the Kernel Matrix (linear kernel, X = [1, 2, 3, 4, 5]^⊤):
$$
K(X, X) = X X^\top =
\begin{bmatrix}
1 & 2 & 3 & 4 & 5 \\
2 & 4 & 6 & 8 & 10 \\
3 & 6 & 9 & 12 & 15 \\
4 & 8 & 12 & 16 & 20 \\
5 & 10 & 15 & 20 & 25
\end{bmatrix}
$$
2. Solve the dual problem for the multipliers α_i, α_i^*.
3. Compute b from a support vector with a free multiplier.
4. Predict f(6) using:
$$
f(6) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, 6) + b
$$
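A minimal NumPy version of these steps is shown below. The coefficients beta = α − α^* and the bias b are placeholders standing in for the output of whatever dual solver is used in steps 2 and 3; only the kernel matrix and the prediction formula come from the text.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # training inputs
K = np.outer(X, X)                           # step 1: linear kernel matrix X X^T (5 x 5)

# Steps 2-3 would come from a dual QP solver; these values are purely illustrative.
beta = np.array([0.2, 0.0, 0.0, 0.0, -0.2])  # hypothetical alpha_i - alpha_i^* (sums to 0)
b = 3.0                                      # hypothetical bias

# Step 4: f(6) = sum_i beta_i * K(x_i, 6) + b, with K(x_i, 6) = x_i * 6 for the linear kernel.
x_new = 6.0
k_new = X * x_new
print("f(6) =", beta @ k_new + b)            # 0.2*6 - 0.2*30 + 3 = -1.8
```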
7. Conclusion
SVR in dual form efficiently handles nonlinear regression using
kernels.
The dual variables satisfy
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$
$$
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i
$$
where:
α_i, α_i^* are Lagrange multipliers (dual variables).
Interpretation: Only support vectors (data points with nonzero α_i or α_i^*) contribute to the final prediction.
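This sparsity can be inspected directly in scikit-learn, whose fitted SVR exposes the support-vector indices and the nonzero net multipliers. The synthetic data below is made up purely for illustration.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)    # noisy 1-D regression data

model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)

# dual_coef_ holds (alpha_i - alpha_i^*) for the support vectors only;
# points strictly inside the epsilon-tube have both multipliers equal to zero
# and therefore do not appear here at all.
print("number of training points :", len(X))
print("number of support vectors :", len(model.support_))
print("dual coefficient range    :", model.dual_coef_.min(), model.dual_coef_.max())
print("bias term b               :", model.intercept_)
```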
Statistical Properties
Bias-Variance Tradeoff: Controlled via C and kernel parameters.
7. Practical Applications
7.1. Real-World Use Cases
Financial Forecasting: Stock price predictions.
8. Conclusion
SVR in dual form allows solving complex regression problems via
kernel tricks.
The method efficiently finds optimal regression functions using
support vectors.
Further Exploration:
Advanced Kernels: Exploring graph-based kernels.
The dual objective under discussion is
$$
\max_{\alpha,\, \alpha^*} \;\; \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
\;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
\;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
$$
Where α = [α_1, α_2, …, α_n] and α^* = [α_1^*, α_2^*, …, α_n^*].
Optimization Variables
α_i, α_i^*: Lagrange multipliers (dual variables) associated with each data point i.
Kernel Function
K(x_i, x_j): The kernel function computes the inner product of the images of x_i and x_j in the feature space.
o Definition: K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩
2. Maximization Operator
max_{α, α^*}: Indicates that we aim to find the values of α and α^* that maximize the objective function.
First Term:
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
$$
Operation: Multiply each difference (α_i − α_i^*) by the corresponding target y_i and sum over all data points.
Interpretation:
o This term measures how well the model's predictions align with the actual targets.
o (α_i − α_i^*) represents the net influence of each data point.
o A positive value indicates a direct relationship, while a negative value indicates an inverse relationship.
Second Term:
o Double Summation: Sums the weighted pairwise interactions (α_i − α_i^*)(α_j − α_j^*) K(x_i, x_j) over all pairs of data points.
Equality Constraint
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$
Ensures that the net effect of the dual variables balances out.
Inequality Constraints
$$
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i \in \{1, 2, \dots, n\}
$$
The ϵ-insensitive loss is
$$
L_\epsilon(y_i, f(x_i)) =
\begin{cases}
0, & |y_i - f(x_i)| \le \epsilon \\
|y_i - f(x_i)| - \epsilon, & \text{otherwise}
\end{cases}
$$
Purpose: Errors within the ϵ-tube are ignored; only deviations beyond ϵ are penalized, and then only linearly.
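A one-line NumPy version of this loss (an illustrative helper, not a library function) makes the flat region explicit:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, eps=0.5):
    """L_eps(y, f(x)) = max(0, |y - f(x)| - eps): zero inside the tube, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

# Residuals of 0.2 and 0.4 fall inside the eps = 0.5 tube and cost nothing;
# the residual of 1.0 is penalized only for the part that exceeds eps.
print(epsilon_insensitive_loss(np.array([2.0, 3.0, 2.0]),
                               np.array([1.8, 2.6, 3.0])))   # -> [0.  0.  0.5]
```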
Subject to:
$$
y_i - (w^\top \phi(x_i) + b) \le \epsilon + \xi_i
$$
$$
(w^\top \phi(x_i) + b) - y_i \le \epsilon + \xi_i^*
$$
$$
\xi_i,\ \xi_i^* \ge 0
$$
Variables:
o b: Bias term.
A. Maximization Operator
$$
\max_{\alpha,\, \alpha^*}
$$
Objective: Find the values of α and α^* that maximize the dual objective function.
B. First Term: Linear Term
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
$$
Breaking It Down
1. Compute (α_i − α_i^*):
o For each i, subtract α_i^* from α_i.
2. Multiply by y_i:
o Multiply the difference (α_i − α_i^*) by the target value y_i.
Interpretation
Positive Contribution:
o If (α_i − α_i^*) and y_i have the same sign, the term contributes positively to the objective function.
Negative Contribution:
o If (α_i − α_i^*) and y_i have opposite signs, the term reduces the objective function.
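In code, this first term is a single dot product; the beta values below are the hypothetical (α_i − α_i^*) coefficients from the earlier three-point example:

```python
import numpy as np

y = np.array([2.0, 3.0, 2.0])
beta = np.array([0.6, 0.0, -0.6])   # hypothetical (alpha_i - alpha_i^*) values

# First term of the dual objective: sum_i (alpha_i - alpha_i^*) * y_i
linear_term = beta @ y
print(linear_term)                   # 0.6*2 + 0*3 - 0.6*2 = 0.0
```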
C. Second Term: Quadratic Double Sum
$$
\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
$$
Breaking It Down
1. Compute (α_i − α_i^*) and (α_j − α_j^*):
o Form the net coefficients for each pair of data points (i, j).
2. Multiply the Differences:
o Compute the product (α_i − α_i^*)(α_j − α_j^*).
3. Evaluate the Kernel K(x_i, x_j):
o Measure the similarity of x_i and x_j in the feature space.
4. Multiply:
o Multiply the result from step 2 by the kernel value from step 3.
5. Sum Over All Pairs (i, j):
o Add up the products over all pairs and scale the total by 1/2, as in the sketch below.
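Vectorized, the whole double sum collapses to a quadratic form; this sketch reuses the same hypothetical beta values and linear-kernel matrix as the previous snippets:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
K = np.outer(X, X)                   # linear kernel matrix, K[i, j] = x_i * x_j
beta = np.array([0.6, 0.0, -0.6])    # hypothetical (alpha_i - alpha_i^*) values

# 0.5 * sum_i sum_j beta_i * beta_j * K[i, j], written as 0.5 * beta^T K beta
quadratic_term = 0.5 * beta @ K @ beta
print(quadratic_term)                # 0.5 * (0.6*1 - 0.6*3)^2 = 0.72
```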
D. Kernel Function
K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩
o Parameters (for the polynomial and sigmoid kernels):
γ: Scaling factor.
r: Constant term.
o Example (sigmoid): K(x_i, x_j) = tanh(γ x_i^⊤ x_j + r)
Properties
Symmetric: K(x_i, x_j) = K(x_j, x_i).
E. Constraints Detailed
Equality Constraint
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$
Explanation: This condition comes from setting ∂L/∂b = 0; the positive and negative influences of the data points must balance.
Inequality Constraints
$$
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i
$$
C: Regularization parameter.
Purpose: Bounds the influence of any single data point and controls the tradeoff between model complexity and tolerance to errors.
F. The Regression Function
$$
f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x) + b
$$
Components
1. Sum Over Support Vectors:
o Only data points where (α_i − α_i^*) ≠ 0 contribute to the sum.
o These points are known as support vectors.
2. Kernel Function K(x_i, x):
o Measures the similarity between each training point x_i and the new input x.
Interpretation
The regression function is a weighted sum of kernel evaluations
between the support vectors and the input x , adjusted by the bias term
b.
It predicts the target value for x based on its similarity to the training
data.
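Written as code, the prediction rule is exactly this weighted sum. The support vectors, coefficients, and bias below are hypothetical stand-ins for whatever training would produce, and the RBF kernel is just one possible choice:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def predict(x, X_sv, beta_sv, b, gamma=0.5):
    """f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b, over the support vectors only."""
    return sum(beta_i * rbf_kernel(x_i, x, gamma)
               for x_i, beta_i in zip(X_sv, beta_sv)) + b

# Hypothetical support vectors and net coefficients, purely for illustration.
X_sv = np.array([[1.0], [3.0]])
beta_sv = np.array([0.4, -0.4])      # (alpha_i - alpha_i^*) for each support vector
print(predict(np.array([2.0]), X_sv, beta_sv, b=2.5))
```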
Top-Down Perspective
Let’s consider the broader picture and see how all these pieces fit together.
1. Goal of SVR
Objective: Find a function f ( x ) that has at most ϵ deviation from the
actual target values y i for all training data, and is as flat as possible.
5. Prediction
For a new input x: f(x) = Σ_{i=1}^{n} (α_i − α_i^*) K(x_i, x) + b.
Bottom-Up Perspective
Starting from the fundamental elements, we see how they build up to form
the complete SVR model.
1. Data Points
Each data point (x_i, y_i) contributes to the model through the dual variables α_i and α_i^*.
2. Dual Variables
Reflect the influence of each data point.
4. Objective Function
Balances fitting the data (first term) with controlling model complexity
(second term).
5. Regression Function
Aggregates the contributions from support vectors.
Provides predictions for new inputs based on learned patterns.
Additional Considerations
Hyperparameters
C: Regularization parameter.
ϵ: Width of the insensitive tube.
Kernel parameters (e.g., γ for the RBF kernel, d and r for the polynomial kernel).
Model Selection
Cross-Validation: Used to select optimal hyperparameters by
evaluating model performance on validation sets.
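A common way to do this with scikit-learn is a grid search over C, ϵ, and the kernel parameter γ; the grid values and synthetic data below are arbitrary placeholders meant only to show the pattern:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(60, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)    # synthetic data for illustration

param_grid = {
    "C": [0.1, 1.0, 10.0],          # complexity / error tradeoff
    "epsilon": [0.05, 0.1, 0.5],    # width of the insensitive tube
    "gamma": [0.1, 0.5, 1.0],       # RBF kernel parameter
}

# 5-fold cross-validation over the grid; scoring defaults to R^2 for regressors.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV score  :", search.best_score_)
```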
Computational Complexity
Training Time: Grows quickly with the dataset size, since the n × n kernel matrix must be computed and stored and the dual problem solved over it.
Advantages of SVR
Effective in high-dimensional spaces.
Limitations
Choice of hyperparameters can be non-trivial.
Not well suited for very large datasets without using approximations.