
Introduction to Support Vector Regression (SVR)

Support Vector Regression is an extension of Support Vector Machines (SVM) for regression tasks, where the goal is to predict continuous outcomes. SVR seeks to find a function that has at most a certain deviation (epsilon) from the actual target values for all training data, while being as flat as possible.

Primal Formulation of SVR


Objective Function
In the primal form, SVR aims to solve the following optimization problem:
$$
\min_{w,\,b,\,\xi,\,\xi^*} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right)
$$

subject to

$$
\begin{cases}
y_i - \left( w^\top \phi(x_i) + b \right) \le \epsilon + \xi_i, \\
\left( w^\top \phi(x_i) + b \right) - y_i \le \epsilon + \xi_i^*, \\
\xi_i,\ \xi_i^* \ge 0, \quad \forall i \in \{1, 2, \dots, n\}.
\end{cases}
$$

Explanation:
 w: Weight vector in the feature space.

 b: Bias term.

 ϕ(x_i): Mapping of the input data into a higher-dimensional feature space.

 C > 0: Regularization parameter that controls the trade-off between flatness and allowed deviations.

 ϵ: Epsilon margin specifying the acceptable error without penalty.

 ξ_i, ξ_i^*: Slack variables for data points outside the epsilon margin.

The objective is to minimize the norm of the weight vector (to ensure
flatness of the function) while penalizing deviations beyond ϵ .
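As a small illustration, the sketch below evaluates this primal objective for a candidate linear model, using the fact that at the optimum the slacks ξ_i + ξ_i^* equal the ϵ-insensitive loss max(0, |y_i − f(x_i)| − ϵ). The weights, bias, and data here are made-up placeholders, not values taken from the text.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Residuals smaller than epsilon incur no penalty."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

def primal_objective(w, b, X, y, C=1.0, epsilon=0.1):
    """Evaluate (1/2)||w||^2 + C * sum of slacks for a linear feature map phi(x) = x."""
    slack = epsilon_insensitive_loss(y, X @ w + b, epsilon)
    return 0.5 * np.dot(w, w) + C * slack.sum()

# Tiny illustration with placeholder numbers
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 3.0, 2.0])
print(primal_objective(np.array([0.1]), 1.9, X, y, C=1.0, epsilon=0.5))
```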

Dual Formulation of SVR


To solve the primal problem efficiently and apply the kernel trick, we
formulate the dual problem using Lagrange multipliers.
Deriving the Dual Problem
Introducing Lagrange multipliers α_i, α_i^* ≥ 0 for the two ϵ-tube constraints and η_i, η_i^* ≥ 0 for the non-negativity constraints on the slack variables, we construct the Lagrangian:

$$
L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
- \sum_{i=1}^{n} \alpha_i \left( \epsilon + \xi_i - y_i + w^\top \phi(x_i) + b \right)
- \sum_{i=1}^{n} \alpha_i^* \left( \epsilon + \xi_i^* + y_i - w^\top \phi(x_i) - b \right)
- \sum_{i=1}^{n} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right).
$$

Setting the partial derivatives of L with respect to w, b, ξ_i, ξ_i^* to zero, we obtain:
1. Derivative with respect to w:

$$
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \phi(x_i) = 0
$$

This implies

$$
w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \phi(x_i).
$$

2. Derivative with respect to b:

$$
\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$

This leads to

$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0.
$$

3. Derivatives with respect to ξ_i and ξ_i^*:

$$
\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \eta_i = 0,
\qquad
\frac{\partial L}{\partial \xi_i^*} = C - \alpha_i^* - \eta_i^* = 0.
$$

Since η_i, η_i^* ≥ 0, it follows that 0 ≤ α_i, α_i^* ≤ C.

Dual Objective Function


Substituting w back into the Lagrangian and simplifying, we derive the dual
problem:
$$
\max_{\alpha,\,\alpha^*} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
\;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
\;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, \langle \phi(x_i), \phi(x_j) \rangle
$$

subject to

$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0,
\qquad
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i \in \{1, 2, \dots, n\}.
$$

Simplifying the Objective Function:

To make it consistent with standard formulations, we rearrange terms and write the inner products as a kernel function K(x_i, x_j):

$$
\max_{\alpha,\,\alpha^*} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
\;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
\;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j).
$$

Understanding the Dual Variables


 α_i and α_i^* are the Lagrange multipliers associated with the constraints that the error exceeds ϵ in the positive and negative directions, respectively.

 Points where α_i or α_i^* is greater than zero are the support vectors—they define the regression function.

The Regression Function


After solving the dual problem and finding the optimal α_i, α_i^*, we can construct the regression function:

$$
f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x) + b.
$$

Bias Term b
To find b, we use the fact that for any support vector x_k with 0 < α_k < C or 0 < α_k^* < C, the following holds:

$$
b = y_k - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x_k) \pm \epsilon.
$$

Depending on whether α_k or α_k^* is the free multiplier, the sign of ϵ is chosen accordingly (−ϵ when 0 < α_k < C, +ϵ when 0 < α_k^* < C).
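In practice b is often estimated by averaging over all "free" support vectors rather than picking a single one. The following is a minimal sketch under that assumption, using the sign convention just stated; the function name and tolerance are illustrative choices, not part of the original derivation.

```python
import numpy as np

def compute_bias(alpha, alpha_star, y, K, epsilon, C, tol=1e-6):
    """Estimate b by averaging over 'free' support vectors (0 < alpha < C)."""
    beta = alpha - alpha_star            # net coefficients alpha_i - alpha_i^*
    decision = K @ beta                  # sum_j (alpha_j - alpha_j^*) K(x_j, x_k)
    b_values = []
    for k in range(len(y)):
        if tol < alpha[k] < C - tol:         # y_k lies on the upper boundary y = f(x) + eps
            b_values.append(y[k] - decision[k] - epsilon)
        elif tol < alpha_star[k] < C - tol:  # y_k lies on the lower boundary y = f(x) - eps
            b_values.append(y[k] - decision[k] + epsilon)
    return float(np.mean(b_values)) if b_values else 0.0
```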
The Kernel Trick
Principle
The kernel trick allows us to compute the inner products in the high-
dimensional feature space without explicitly performing the transformation
ϕ ( x ). Instead, we use a kernel function K ( x i , x j ) that computes the inner
product directly:
$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.$$

Common Kernel Functions


1. Linear Kernel:

$$K(x_i, x_j) = x_i^\top x_j.$$

2. Polynomial Kernel:

$$K(x_i, x_j) = \left( \gamma\, x_i^\top x_j + r \right)^d,$$

where γ > 0, r ≥ 0, and d is the degree of the polynomial.

3. Radial Basis Function (RBF) Kernel:

$$K(x_i, x_j) = \exp\!\left( -\gamma \|x_i - x_j\|^2 \right),$$

with γ > 0.

4. Sigmoid Kernel:

$$K(x_i, x_j) = \tanh\!\left( \gamma\, x_i^\top x_j + r \right),$$

where the parameters γ and r are chosen appropriately.
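For reference, here is a sketch of these four kernels as plain NumPy functions; the default parameter values (γ, r, d) are arbitrary placeholders.

```python
import numpy as np

def linear_kernel(xi, xj):
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    return (gamma * np.dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    diff = np.asarray(xi) - np.asarray(xj)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def sigmoid_kernel(xi, xj, gamma=0.1, r=0.0):
    return float(np.tanh(gamma * np.dot(xi, xj) + r))

# Gram (kernel) matrix for a small dataset
X = np.array([[1.0], [2.0], [3.0]])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(K)
```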

Properties of the Dual SVR


1. Convexity:

o The dual problem is a convex quadratic optimization problem,


ensuring a global optimum.
2. Sparsity:

o Only a subset of the training data (the support vectors) has non-zero α_i − α_i^*.

o This sparsity leads to efficient prediction since only support


vectors contribute to the regression function.
3. Robustness to Outliers:

o The parameters C and ϵ control the trade-off between the flatness of f(x) and the degree to which deviations larger than ϵ are tolerated.

o A larger ϵ creates a wider “tube” around the regression function within which errors incur no penalty.

4. Kernel-Induced Feature Space:

o The kernel function implicitly maps data into a higher-


dimensional space where linear regression is performed.

Concrete Example
Let’s consider a simple dataset to illustrate SVR with a linear kernel.

Dataset:
Suppose we have three data points:

$$
x_1 = [1],\; y_1 = 2, \qquad x_2 = [2],\; y_2 = 3, \qquad x_3 = [3],\; y_3 = 2.
$$

Parameters:
 ϵ=0.5

 C=1.0

 Kernel: Linear K ( x i , x j ) =x i x j

Setting Up the Dual Problem:


Our goal is to find α_i, α_i^* that maximize:

$$
\sum_{i=1}^{3} (\alpha_i - \alpha_i^*)\, y_i
- \epsilon \sum_{i=1}^{3} (\alpha_i + \alpha_i^*)
- \frac{1}{2} \sum_{i=1}^{3} \sum_{j=1}^{3} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j).
$$

Subject to:

$$
\sum_{i=1}^{3} (\alpha_i - \alpha_i^*) = 0,
\qquad
0 \le \alpha_i,\ \alpha_i^* \le 1, \quad \forall i.
$$

Solving the Dual Problem:


Due to the simplicity of the dataset, we can use numerical optimization tools
(like quadratic programming solvers) to find the optimal α_i, α_i^*. For illustration, let's assume the solution yields:

$$
\alpha_1 - \alpha_1^* = 0.6,
\qquad
\alpha_2 - \alpha_2^* = 0,
\qquad
\alpha_3 - \alpha_3^* = -0.6.
$$

Constructing the Regression Function:


1. Compute w (since the kernel is linear):

$$
w = \sum_{i=1}^{3} (\alpha_i - \alpha_i^*)\, x_i = (0.6)(1) + (0)(2) + (-0.6)(3) = -1.2.
$$

2. Compute b:
Using a support vector (e.g., x_1):

$$
b = y_1 - w x_1 \pm \epsilon = 2 + 1.2 \pm 0.5.
$$

Since ϵ = 0.5, b lies between 2.7 and 3.7. We can choose b = 3.2 for simplicity.
3. Final Regression Function:

$$f(x) = -1.2\,x + 3.2.$$

Making Predictions:
 For x = 1.5:

$$f(1.5) = -1.2 \times 1.5 + 3.2 = -1.8 + 3.2 = 1.4.$$

 For x = 2.5:

$$f(2.5) = -1.2 \times 2.5 + 3.2 = -3 + 3.2 = 0.2.$$

Visualization:
The regression function f ( x ) is a straight line with negative slope, fitting the
data within the ϵ -tube (± 0.5 around the regression line), except possibly at
the support vectors.
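To make the "use a quadratic programming solver" step concrete, here is a minimal sketch that solves this three-point dual numerically with SciPy's general-purpose SLSQP optimizer rather than a dedicated QP solver. The α values it returns need not match the illustrative numbers assumed above, and the fallback used for b when no free support vector exists is only a convenience for this sketch.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 2.0])
C, eps = 1.0, 0.5
n = len(X)
K = np.outer(X, X)  # linear kernel matrix: K[i, j] = x_i * x_j

def neg_dual(z):
    """Negative dual objective; z stacks [alpha, alpha_star]."""
    a, a_star = z[:n], z[n:]
    beta = a - a_star
    return -(beta @ y - eps * np.sum(a + a_star) - 0.5 * beta @ K @ beta)

constraints = {"type": "eq", "fun": lambda z: np.sum(z[:n] - z[n:])}
bounds = [(0.0, C)] * (2 * n)
result = minimize(neg_dual, x0=np.zeros(2 * n), bounds=bounds,
                  constraints=constraints, method="SLSQP")

alpha, alpha_star = result.x[:n], result.x[n:]
beta = alpha - alpha_star

# Recover b from a free support vector if one exists, else fall back to an average
free = [(i, eps) for i in range(n) if 1e-6 < alpha[i] < C - 1e-6] + \
       [(i, -eps) for i in range(n) if 1e-6 < alpha_star[i] < C - 1e-6]
if free:
    b = float(np.mean([y[i] - beta @ K[:, i] - s for i, s in free]))
else:
    b = float(np.mean(y - K @ beta))  # crude fallback, for this sketch only

def f(x_new):
    return float(beta @ (X * x_new) + b)  # linear kernel: K(x_i, x) = x_i * x

print(beta, b, f(1.5), f(2.5))
```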
Connecting the Dots
 From Primal to Dual:

o The dual formulation allows leveraging the kernel trick, making it


feasible to work with high-dimensional feature spaces without
explicit computation.
 Role of Support Vectors:

o Data points with errors exceeding ϵ or lying exactly on the margin contribute to the model (non-zero α_i − α_i^*).
 Model Complexity Control:

o Parameters C and ϵ help balance the trade-off between model


complexity and fitting accuracy.
 Kernel Choice:

o The kernel function determines the shape of the decision


boundary (or regression function) in the input space.

Conclusion
The dual form of SVR, combined with the kernel trick, provides a powerful
framework for performing regression in high-dimensional or infinite-
dimensional spaces efficiently. By understanding the role of each component
in the dual formulation and how they contribute to the final regression
function, we can better interpret and apply SVR to complex real-world
problems.

1. Introduction to Support Vector Regression (SVR)
1.1 Overview
Support Vector Regression (SVR) extends Support Vector Machines (SVM) to
regression tasks. Instead of finding a maximum margin separator, SVR
aims to find a function that predicts real-valued outputs while allowing for
some deviation (controlled by ε-insensitivity).
1.2 Why Use the Dual Form?
The primal optimization problem in SVR involves minimizing a loss function
with constraints. However, solving it directly is computationally expensive,
especially in high dimensions. Instead, we derive a dual formulation using
Lagrange multipliers, which enables:
 Efficient computation using quadratic programming.

 Implicit transformation to higher-dimensional spaces via the kernel


trick.

 Use of only a subset of the data (support vectors) to make


predictions.

2. Mathematical Formulation of SVR


2.1 The Primal Optimization Problem
Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where:
 x_i ∈ R^d are input feature vectors.

 y_i ∈ R are target values.

SVR aims to find a function:



$$f(x) = w^\top \phi(x) + b$$
where ϕ ( x ) maps input data into a high-dimensional feature space.
Objective Function (Primal Form)
$$
\min_{w,\,b,\,\xi,\,\xi^*} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
$$

subject to:

$$
y_i - w^\top \phi(x_i) - b \le \epsilon + \xi_i,
\qquad
w^\top \phi(x_i) + b - y_i \le \epsilon + \xi_i^*,
\qquad
\xi_i,\ \xi_i^* \ge 0, \quad \forall i
$$

where:
 C controls the tradeoff between model complexity and tolerance to
errors.
 ϵ defines the margin within which errors are ignored.
 ξ_i, ξ_i^* are slack variables for points outside the margin.

2.2 The Dual Formulation


To solve this efficiently, we introduce Lagrange multipliers α_i, α_i^* and derive the Lagrangian:

$$
L(w, b, \xi, \xi^*, \alpha, \alpha^*) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
- \sum_{i=1}^{n} \alpha_i \left( \epsilon + \xi_i - y_i + w^\top \phi(x_i) + b \right)
- \sum_{i=1}^{n} \alpha_i^* \left( \epsilon + \xi_i^* + y_i - w^\top \phi(x_i) - b \right)
- \sum_{i=1}^{n} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right)
$$

where η_i, η_i^* ≥ 0 are multipliers enforcing ξ_i, ξ_i^* ≥ 0.
Dual Optimization Problem
By setting the derivatives to zero and eliminating w , we obtain the dual
problem:
$$
\max_{\alpha,\,\alpha^*} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
- \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
$$

subject to:

$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0,
\qquad
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i
$$

where K(x_i, x_j) = ϕ(x_i)^⊤ ϕ(x_j) is the kernel function.

3. Regression Function in Dual Form


Once we solve for α_i, α_i^*, the regression function is:

$$
f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x) + b
$$

where b is computed using the support vectors:

$$
b = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, K(x_j, x_i)
$$

4. The Kernel Trick


4.1 Principle
Instead of explicitly mapping data into a high-dimensional space, we use a
kernel function K ( x i , x j ) that directly computes inner products in that space.

$$K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$$

4.2 Common Kernels


Kernel Type                   Formula                                    Characteristics
Linear                        K(x_i, x_j) = x_i^T x_j                    No transformation; simple and efficient
Polynomial                    K(x_i, x_j) = (x_i^T x_j + c)^d            Captures polynomial relationships
Radial Basis Function (RBF)   K(x_i, x_j) = exp(-γ ||x_i - x_j||^2)      Models complex, nonlinear patterns
Sigmoid                       K(x_i, x_j) = tanh(β x_i^T x_j + c)        Inspired by neural networks

5. Logical Reasoning and Theoretical Framework
5.1 Why the Dual Form Works
1. Transformation to High-Dimensional Space: The kernel trick
allows SVR to operate in high dimensions without explicitly computing
feature mappings.
2. Convex Optimization: The dual formulation is a quadratic
programming problem, ensuring a unique global solution.
3. Sparsity: Only support vectors (points with nonzero α_i or α_i^*) affect the final model.
5.2 Statistical Learning Theory
 Structural Risk Minimization (SRM): Balances model complexity
and generalization.
 Reproducing Kernel Hilbert Space (RKHS): Provides a theoretical
basis for kernel methods.

6. Example: A Simple SVR Model


Problem Statement
Given data points:
X ={1 , 2 ,3 , 4 ,5 },Y ={1.1 , 1.9 ,3.2 , 4.0 , 5.1 }
we fit an SVR with a linear kernel.

Steps
1. Compute Kernel Matrix:

$$
K(X, X) = X X^\top =
\begin{bmatrix}
1 & 2 & 3 & 4 & 5 \\
2 & 4 & 6 & 8 & 10 \\
3 & 6 & 9 & 12 & 15 \\
4 & 8 & 12 & 16 & 20 \\
5 & 10 & 15 & 20 & 25
\end{bmatrix}
$$

2. Solve the Dual Optimization Problem.

3. Compute f ( x ) using support vectors.

4. Predict f(6) using:

$$
f(6) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, 6) + b
$$

This provides an estimated value.
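As a cross-check of these steps, one possible sketch uses scikit-learn's SVR, which solves the same dual internally via libsvm. The hyperparameters C and epsilon below are illustrative choices, and the printed prediction for x = 6 is simply whatever the fitted model produces.

```python
import numpy as np
from sklearn.svm import SVR

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.1, 1.9, 3.2, 4.0, 5.1])

model = SVR(kernel="linear", C=1.0, epsilon=0.1)  # hyperparameters chosen for illustration
model.fit(X, y)

print(model.dual_coef_)        # (alpha_i - alpha_i^*) for the support vectors
print(model.support_vectors_)  # the support vectors themselves
print(model.predict([[6.0]]))  # estimate of f(6)
```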

7. Conclusion
 SVR in dual form efficiently handles nonlinear regression using
kernels.

 Support vectors determine predictions.

 Choosing an appropriate kernel and tuning hyperparameters are


critical for performance.
Dual Form of Support Vector Regression
(SVR) and the Kernel Trick
1. Introduction
Support Vector Regression (SVR) is an extension of Support Vector Machines
(SVM) for regression tasks. The dual formulation of SVR enables efficient
optimization by leveraging the kernel trick, allowing nonlinear regression in
high-dimensional feature spaces without explicit transformations.

Key Objectives of Dual SVR


 To transform a nonlinear problem into a higher-dimensional space
where it becomes linearly solvable.
 To optimize a dual objective function using Lagrange multipliers.

 To apply kernel functions for implicit feature mapping.

 To control model complexity using regularization parameters.

2. Mathematical Formulation of SVR in Dual Form


2.1. The Dual Optimization Problem
The primal formulation of SVR aims to minimize a regularized risk function.
The corresponding dual problem is:
$$
\max_{\alpha,\,\alpha^*} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
- \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
$$

Subject to:

$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0,
\qquad
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i
$$

where:
 α_i, α_i^* are Lagrange multipliers (dual variables).

 y i is the target value for data point x i.

 K ( x i , x j ) is the kernel function.

 C is the regularization parameter controlling model flexibility.


2.2. Regression Function from Dual Form
After solving for the optimal α_i, α_i^*, the regression function is:

$$
f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x) + b
$$

where b is the bias term computed as:

$$
b = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, K(x_j, x_i)
$$

Interpretation: Only support vectors (data points with nonzero α_i or α_i^*) contribute to the final prediction.

3. Properties and Classification of Properties


3.1. Key Properties
1. Convexity:

o The optimization problem is convex, ensuring a global


minimum.
2. Sparsity:

o Only a subset of data points (support vectors) determines the


regression function.
3. Robustness to Outliers:

o Controlled by C and epsilon (ϵ ), allowing some margin for noise.


4. Generalization Ability:
o SVR prevents overfitting through regularization.

3.2. Categorization of Properties


Mathematical Properties
 Quadratic Programming: Optimization is a constrained quadratic
problem.

 Dual Space Representation: Works in dual space using Lagrange


multipliers.

Statistical Properties
 Bias-Variance Tradeoff: Controlled via C and kernel parameters.

 Kernel-Induced Feature Space: Data is implicitly mapped to a


high-dimensional space.
Computational Properties
 Kernel Matrix Complexity: Computation of K ( x i , x j ) affects
performance.

 Scalability: Complexity grows quadratically with dataset size.

4. The Kernel Trick


4.1. Principle of the Kernel Trick
The kernel trick implicitly maps input data x to a higher-dimensional space
via a function ϕ ( x ), without explicitly computing ϕ ( x ).
K ( x i , x j ) =ϕ ( xi ) ⋅ ϕ ( x j )

4.2. Why It Works


 In many problems, linear regression in transformed feature space
performs well.

 Computing ϕ ( x ) directly is computationally expensive.

 Kernel functions allow computing inner products without explicit


transformations.

4.3. Common Kernel Functions


Kernel Type                   Formula                                    Characteristics
Linear Kernel                 K(x_i, x_j) = x_i · x_j                    No feature transformation
Polynomial Kernel             K(x_i, x_j) = (x_i · x_j + c)^d            Captures polynomial relations
Radial Basis Function (RBF)   K(x_i, x_j) = exp(-γ ||x_i - x_j||^2)      Handles complex relationships
Sigmoid Kernel                K(x_i, x_j) = tanh(β x_i · x_j + c)        Inspired by neural networks

5. Logical Reasoning and Theoretical Framework


5.1. Why Dual Form Works
 The primal problem has constraints making direct optimization
challenging.

 Using Lagrange multipliers, we transform it into an equivalent dual


problem.
 This formulation allows applying kernel functions, making nonlinear
problems solvable.

5.2. Theoretical Framework


 Based on Statistical Learning Theory.

 Uses Structural Risk Minimization to balance bias-variance tradeoff.

 Relies on Reproducing Kernel Hilbert Space (RKHS) for kernel-


based learning.

6. Historical Development and Expert Insights


 Cortes and Vapnik (1995) introduced Support Vector Machines in their modern form, building on the statistical learning theory Vapnik developed with Alexey Chervonenkis.

 Smola and Schölkopf (1998) wrote an influential tutorial on SVR, helping extend SVMs to regression problems.

 Recent research explores deep learning-SVR hybrids and efficient


solvers.

7. Practical Applications
7.1. Real-World Use Cases
 Financial Forecasting: Stock price predictions.

 Engineering: Predictive maintenance models.

 Medical Diagnostics: Disease progression modeling.


 Energy Consumption Prediction: Smart grid forecasting.

7.2. Implementation Steps


1. Prepare Data: Normalize and split dataset.
2. Choose Kernel: Based on domain knowledge.
3. Optimize Hyperparameters: Tune C, ϵ , and kernel parameters.
4. Train SVR Model: Use Quadratic Programming Solver.
5. Evaluate Performance: Use Mean Squared Error (MSE).
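Here is a hedged end-to-end sketch of these five steps using scikit-learn; the synthetic sine dataset, the parameter grid, and the train/test split sizes are placeholders standing in for a real problem.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

# 1. Prepare data (synthetic placeholder for a real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2-3. Choose a kernel and tune C, epsilon, gamma by cross-validation
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {
    "svr__C": [0.1, 1.0, 10.0],
    "svr__epsilon": [0.01, 0.1, 0.5],
    "svr__gamma": ["scale", 0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5)

# 4. Train the SVR model (libsvm's SMO-based solver runs under the hood)
search.fit(X_train, y_train)

# 5. Evaluate performance with MSE
mse = mean_squared_error(y_test, search.predict(X_test))
print(search.best_params_, mse)
```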

8. Conclusion
 SVR in dual form allows solving complex regression problems via
kernel tricks.
 The method efficiently finds optimal regression functions using
support vectors.

 Choosing the right kernel and tuning hyperparameters is


critical for model performance.

Further Exploration:
 Advanced Kernels: Exploring graph-based kernels.

 Optimization Techniques: Faster solvers like Sequential Minimal


Optimization (SMO).

 Hybrid Models: Combining SVR with deep learning.

Introduction to Support Vector Regression (SVR)


Support Vector Regression (SVR) is a type of regression analysis that
extends the principles of Support Vector Machines (SVM) to predict
continuous outcomes. It is particularly powerful for dealing with nonlinear
data.

The Challenge with Nonlinear Data


 Nonlinearity: Real-world data often exhibit nonlinear relationships
between independent variables (features) and the dependent variable
(target).

 Linear Models Limitation: Linear regression models struggle to


capture these complex patterns.

The Kernel Trick


 Purpose: The Kernel Trick allows SVR to handle nonlinear data by
implicitly mapping input data into a higher-dimensional feature space
where linear relationships can be found.

 How It Works: Instead of transforming the data explicitly, it uses


kernel functions to compute inner products in the higher-dimensional
space efficiently.

The Dual Form of SVR


The dual form of SVR is an optimization problem expressed as:

$$
\max_{\alpha,\,\alpha^*} \left[ \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
- \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j) \right]
$$

Where:

 max_{α, α^*} indicates maximization with respect to the vectors α = [α_1, α_2, …, α_n] and α^* = [α_1^*, α_2^*, …, α_n^*].

Detailed Breakdown of Each Component


Let’s dissect the formula step by step, accounting for every variable and
operation.

1. Variables and Notations


Data Variables
 n : Total number of data points in the dataset.

 i , j: Indices representing data points, where i , j∈ {1 , 2, … , n }.

 x i: The input feature vector of the i-th data point.

 y i: The target value (actual output) corresponding to x i.

Optimization Variables
 α_i, α_i^*: Lagrange multipliers (dual variables) associated with each data point i.

o These variables are introduced during the dual formulation of the


optimization problem.

o They adjust the influence of each data point in the regression


function.
o Both α_i and α_i^* are non-negative real numbers, i.e., α_i, α_i^* ≥ 0.

Kernel Function
 K ( x i , x j ) : The kernel function computes the inner product of the images
of x i and x j in the feature space.

o Definition: K ( x i , x j ) =⟨ ϕ ( x i ) , ϕ ( x j ) ⟩

o ϕ : The mapping function that transforms input data into a higher-


dimensional feature space.

2. Maximization Operator
 max_{α, α^*}: Indicates that we aim to find the values of α and α^* that maximize the objective function.


3. Objective Function
The objective function consists of two main terms:
First Term: Linear Part
$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i$$

 Operation:

o For each data point i :


 Compute (α_i − α_i^*).

 Multiply by the corresponding target value y i.

o Sum this product over all data points from i=1 to n.

 Interpretation:

o This term measures how well the model’s predictions align with
the actual targets.
o (α_i − α_i^*) represents the net influence of each data point.
o A positive value indicates a direct relationship, while a negative
value indicates an inverse relationship.

Second Term: Quadratic Part


$$
-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
$$
 Operations:

o Double Summation:

 For all pairs of data points ( i , j ):


 Compute (α_i − α_i^*) and (α_j − α_j^*).

 Multiply these differences together: (α_i − α_i^*)(α_j − α_j^*).

 Compute the kernel function K ( x i , x j ) .

 Multiply the above results.


o Sum over all i and j, then multiply by −1/2.
 Interpretation:

o This term accounts for the interactions between data points in


the transformed feature space.
o It penalizes large values of (α_i − α_i^*) when the corresponding data points are similar (i.e., when K(x_i, x_j) is large).

o Acts as a regularization term to prevent overfitting.

4. Constraints (Not Explicitly Shown)


While not present in the given formula, the optimization is subject to several
constraints:
Equality Constraint
$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0$$

 Ensures that the net effect of the dual variables balances out.
Inequality Constraints
$$0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i \in \{1, 2, \dots, n\}$$

 C : A regularization parameter that controls the trade-off between


model complexity and training error.
 These constraints limit the values of α_i and α_i^* to be within a specified range.

5. The Role of ϵ -Insensitive Loss


Although not explicitly shown in the dual form, the SVR utilizes an ϵ -
insensitive loss function:
 Definition:

$$
L_\epsilon\big(y_i, f(x_i)\big) =
\begin{cases}
0, & \text{if } |y_i - f(x_i)| \le \epsilon, \\
|y_i - f(x_i)| - \epsilon, & \text{otherwise.}
\end{cases}
$$

 Purpose:

o Allows for a margin ϵ where errors are not penalized.

o Encourages the model to be as flat as possible (i.e., less


complex) while fitting the data within the ϵ -tube.
6. Relationship Between Primal and Dual Forms
Primal Formulation
In the primal form, the SVR optimization problem is:
$$
\min_{w,\,b,\,\xi,\,\xi^*} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
$$

Subject to:

$$
\begin{cases}
y_i - \left( w^\top \phi(x_i) + b \right) \le \epsilon + \xi_i, \\
\left( w^\top \phi(x_i) + b \right) - y_i \le \epsilon + \xi_i^*, \\
\xi_i,\ \xi_i^* \ge 0.
\end{cases}
$$

 Variables:

o w : Weight vector in the feature space.

o b : Bias term.

o ξ_i, ξ_i^*: Slack variables for handling errors beyond ϵ.

From Primal to Dual


 By applying the method of Lagrange multipliers and leveraging the KKT
(Karush-Kuhn-Tucker) conditions, we derive the dual formulation.
 The dual variables α_i, α_i^* correspond to the Lagrange multipliers associated with the inequality constraints in the primal problem.

Step-by-Step Detailed Explanation


Let’s now perform a meticulous examination of each operation and variable,
ensuring every detail is accounted for.

A. Maximization Operator

$$\max_{\alpha,\,\alpha^*}$$

 Objective:

o Find the set of dual variables α and α^* that maximize the objective function under the given constraints.

o This is a convex optimization problem due to the quadratic


nature of the objective function and the linear constraints.
B. First Term: Linear Sum
$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i$$

Breaking It Down
1. Compute (α_i − α_i^*):

o For each i, subtract α_i^* from α_i.

o This difference reflects the net contribution of data point i to the


model.

2. Multiply by y i:
o Multiply the difference (α_i − α_i^*) by the target value y_i.

o This aligns the contribution with the actual output.

3. Sum Over All i :

o Aggregate the products from step 2 for all data points.


Example Calculation
Suppose n=3:
 (α_1 − α_1^*) y_1
 (α_2 − α_2^*) y_2
 (α_3 − α_3^*) y_3
Sum:

$$
\text{Sum} = (\alpha_1 - \alpha_1^*)\, y_1 + (\alpha_2 - \alpha_2^*)\, y_2 + (\alpha_3 - \alpha_3^*)\, y_3
$$

Interpretation
 Positive Contribution:
o If (α_i − α_i^*) and y_i have the same sign, the term contributes positively to the objective function.
 Negative Contribution:
o If (α_i − α_i^*) and y_i have opposite signs, the term reduces the objective function.
C. Second Term: Quadratic Double Sum
$$
-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
$$
Breaking It Down
1. Compute (α_i − α_i^*) and (α_j − α_j^*):

o For each pair ( i , j ), compute these differences independently.


2. Multiply the Differences:
o Multiply (α_i − α_i^*) by (α_j − α_j^*) for each pair (i, j).
3. Compute the Kernel Function K ( x i , x j ) :

o Evaluate the kernel function for the input vectors x i and x j .


4. Multiply All Components:

o Multiply the result from step 2 by the kernel value from step 3.
5. Sum Over All Pairs ( i , j ):

o Aggregate the products for all combinations of i and j .


6. Multiply by −1/2:

o Scale the entire sum by −1/2.
Example Calculation
For n=2:
 Pairs: ( 1 ,1 ) , ( 1 ,2 ) , ( 2 ,1 ) , ( 2 , 2 )
Calculations:
 (α_1 − α_1^*)(α_1 − α_1^*) K(x_1, x_1)
 (α_1 − α_1^*)(α_2 − α_2^*) K(x_1, x_2)
 (α_2 − α_2^*)(α_1 − α_1^*) K(x_2, x_1)
 (α_2 − α_2^*)(α_2 − α_2^*) K(x_2, x_2)
Sum:

$$
\text{Sum} = (\alpha_1 - \alpha_1^*)^2 K_{11} + 2(\alpha_1 - \alpha_1^*)(\alpha_2 - \alpha_2^*) K_{12} + (\alpha_2 - \alpha_2^*)^2 K_{22}
$$

(Note: K_{ij} = K(x_i, x_j), and K_{12} = K_{21} due to kernel symmetry.)


Interpretation
 The term penalizes the objective function when similar data points
(high K ( x i , x j ) ) have large differences in their dual variables.

 Encourages smoothness in the function by controlling the contributions


of similar data points.
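The double sum is usually evaluated as a quadratic form rather than two explicit loops. Below is a small sketch that computes both terms of this objective for given α, α^* and kernel matrix; the α values are arbitrary placeholders, and the ϵ term (present in the earlier, fuller statement of the dual) is included as an optional argument.

```python
import numpy as np

def dual_objective(alpha, alpha_star, y, K, epsilon=0.0):
    """Linear term, minus optional epsilon term, minus the halved double-sum term."""
    beta = alpha - alpha_star                 # alpha_i - alpha_i^*
    linear = beta @ y                         # sum_i (alpha_i - alpha_i^*) y_i
    eps_term = epsilon * np.sum(alpha + alpha_star)
    quadratic = 0.5 * (beta @ K @ beta)       # full double sum over (i, j), times 1/2
    return linear - eps_term - quadratic

# Placeholder values for a 3-point problem with a linear kernel
x = np.array([1.0, 2.0, 3.0])
K = np.outer(x, x)
y = np.array([2.0, 3.0, 2.0])
alpha      = np.array([0.6, 0.0, 0.0])
alpha_star = np.array([0.0, 0.0, 0.6])
print(dual_objective(alpha, alpha_star, y, K, epsilon=0.5))
```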

D. The Kernel Function K ( x i , x j )


Definition
 General Form:

K ( x i , x j ) =⟨ ϕ ( x i ) , ϕ ( x j ) ⟩

 ϕ ( ⋅ ) : Mapping function transforming input data into a higher-


dimensional feature space.
Common Kernel Functions
1. Linear Kernel:

K ( x i , x j ) =x i x j

o No transformation; equivalent to standard dot product.


2. Polynomial Kernel:

$$K(x_i, x_j) = \left( \gamma\, x_i^\top x_j + r \right)^d$$

o Parameters:

 γ : Scaling factor.

 r : Constant term.

 d : Degree of the polynomial.

3. Radial Basis Function (RBF) Kernel:

K ( x i , x j ) =exp (−γ||x i−x j||2 )

o γ : Controls the spread of the kernel.


4. Sigmoid Kernel:

$$K(x_i, x_j) = \tanh\!\left( \gamma\, x_i^\top x_j + r \right)$$

Properties
 Symmetric: K ( x i , x j ) =K ( x j , x i ).

 Positive Definite: Ensures that the optimization problem remains


convex.
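Both properties can be spot-checked numerically on a sample of points. The sketch below builds an RBF Gram matrix and verifies symmetry and, up to round-off, non-negative eigenvalues; the sample data and γ value are arbitrary.

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(0).normal(size=(20, 3))
K = rbf_gram(X)

print(np.allclose(K, K.T))                      # symmetry
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)  # eigenvalues non-negative up to round-off
```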
Role in SVR
 Enables the algorithm to operate in high-dimensional (possibly infinite-
dimensional) spaces without explicit computations in those spaces.

 Captures nonlinear relationships by measuring similarity in the


transformed feature space.

E. Constraints Detailed
Equality Constraint
$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0$$

 Explanation:

o Ensures that the weighted sum of dual variables balances to


zero.

o Derived from the requirement that the regression function must


be unbiased.

Inequality Constraints
$$0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i$$

 C : Regularization parameter.

o Controls the penalty for errors exceeding ϵ .

o Larger C allows for less violation of the ϵ -insensitive zone but


may lead to overfitting.

 Purpose:

o Keeps the dual variables within reasonable bounds.

o Prevents any single data point from dominating the model.

F. Constructing the Regression Function


After solving the dual problem, we derive the regression function:
$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x) + b$$

Components
1. Sum Over Support Vectors:
o Only data points where (α_i − α_i^*) ≠ 0 contribute to the sum.
o These points are known as support vectors.

2. Kernel Function K ( x i , x ):

o Computes the similarity between a training data point x i and a


new input x .
3. Bias Term b :

o Calculated from the KKT conditions:


$$
b = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, K(x_j, x_i) - \epsilon
\quad \text{for any } \alpha_i \text{ such that } 0 < \alpha_i < C
$$

o Ensures that the regression function aligns properly with the


data.

Interpretation
 The regression function is a weighted sum of kernel evaluations
between the support vectors and the input x , adjusted by the bias term
b.

 It predicts the target value for x based on its similarity to the training
data.

Top-Down Perspective
Let’s consider the broader picture and see how all these pieces fit together.

1. Goal of SVR
 Objective: Find a function f ( x ) that has at most ϵ deviation from the
actual target values y i for all training data, and is as flat as possible.

2. Utilizing the Kernel Trick


 By applying the kernel trick, we extend SVR to handle nonlinear
relationships in the data without explicitly performing nonlinear
transformations.
3. Dual Formulation Advantages
 Computational Efficiency:

o Solving the dual problem allows us to exploit the kernel function,


making computations feasible even in very high-dimensional
spaces.
 Sparsity:
o Solutions tend to be sparse, involving only a subset of the data
points (support vectors), leading to efficient prediction
computations.
4. Optimization Process Overview
1. Set Up the Dual Problem:

o Define the objective function and constraints based on the


dataset and chosen kernel.
2. Compute the Kernel Matrix:

o Evaluate K ( x i , x j ) for all i , j.


3. Solve the Quadratic Programming Problem:

o Use numerical optimization techniques (e.g., Sequential Minimal


Optimization) to find the optimal α and α^*.
4. Determine the Bias Term b :

o Use the support vectors and KKT conditions to compute b .


5. Construct the Regression Function:

o Combine the dual variables, kernel evaluations, and bias term to


form f ( x ) .

5. Prediction
 For a new input x :

o Compute f ( x ) using the regression function.

o The predicted value is based on the weighted similarities


between x and the support vectors.
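A minimal sketch of this prediction step, assuming the dual variables, support vectors, and bias b have already been obtained; the numbers below reuse the illustrative values from the earlier linear-kernel example.

```python
import numpy as np

def predict(x_new, support_X, coef, b, kernel):
    """f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b over the support vectors."""
    return sum(c * kernel(x_i, x_new) for c, x_i in zip(coef, support_X)) + b

support_X = [np.array([1.0]), np.array([3.0])]   # support vectors from the earlier example
coef = [0.6, -0.6]                               # alpha_i - alpha_i^* (illustrative values)
b = 3.2
linear_kernel = lambda a, c: float(np.dot(a, c))
print(predict(np.array([1.5]), support_X, coef, b, linear_kernel))  # about 1.4
```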

Bottom-Up Perspective
Starting from the fundamental elements, we see how they build up to form
the complete SVR model.

1. Data Points
 Each data point (x_i, y_i) contributes to the model through the dual variables α_i and α_i^*.

2. Dual Variables
 Reflect the influence of each data point.

 Constrained values ensure a balanced and generalizable model.


3. Kernel Evaluations
 Measure similarity in feature space.

 Allow the capture of complex, nonlinear patterns.

4. Objective Function
 Balances fitting the data (first term) with controlling model complexity
(second term).

 Maximization leads to the optimal set of dual variables under


constraints.

5. Regression Function
 Aggregates the contributions from support vectors.
 Provides predictions for new inputs based on learned patterns.

Additional Considerations
Hyperparameters
 C : Regularization parameter.

 ϵ : Width of the ϵ -insensitive zone.

 Kernel Parameters: Specific to the chosen kernel function (e.g., γ in


the RBF kernel).

Model Selection
 Cross-Validation: Used to select optimal hyperparameters by
evaluating model performance on validation sets.

 Kernel Choice: Impacts the ability of the SVR to capture underlying


data patterns.

Computational Complexity
 Training Time: Depends on the size of the dataset due to kernel
matrix computations.

 Prediction Time: Proportional to the number of support vectors.

Advantages of SVR
 Effective in high-dimensional spaces.

 Memory efficient due to support vectors.


 Versatile with different kernel functions.

Limitations
 Choice of hyperparameters can be non-trivial.

 Not well suited for very large datasets without using approximations.

Recap and Conclusion


We have meticulously dissected every component of the dual form of
Support Vector Regression using the Kernel Trick. By accounting for each
variable, operation, and concept, we’ve gained an in-depth understanding of
how SVR constructs a robust regression model capable of handling nonlinear
data.
From the maximization of the objective function to the constraints ensuring
model generalizability, every element plays a crucial role in the performance
of the SVR. The kernel function serves as a cornerstone, enabling the
algorithm to operate in complex feature spaces efficiently.
