Introduction to Support Vector Regression (SVR)
The primal formulation of ϵ-insensitive SVR minimizes
$$
\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
$$
subject to
$$
\begin{cases}
y_i - (w^\top \phi(x_i) + b) \le \epsilon + \xi_i, \\
(w^\top \phi(x_i) + b) - y_i \le \epsilon + \xi_i^*, \\
\xi_i,\ \xi_i^* \ge 0, \quad \forall i \in \{1, 2, \dots, n\}.
\end{cases}
$$
Explanation:
w: Weight vector in the feature space.
b: Bias term.
The objective is to minimize the norm of the weight vector (to ensure flatness of the function) while penalizing deviations beyond ϵ.
To derive the dual, we introduce Lagrange multipliers α_i, α_i^* ≥ 0 for the ϵ-constraints and η_i, η_i^* ≥ 0 for the non-negativity of the slack variables, giving the Lagrangian
$$
L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
- \sum_{i=1}^{n} \alpha_i \left( \epsilon + \xi_i - y_i + w^\top \phi(x_i) + b \right)
- \sum_{i=1}^{n} \alpha_i^* \left( \epsilon + \xi_i^* + y_i - w^\top \phi(x_i) - b \right)
- \sum_{i=1}^{n} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right).
$$
Setting the partial derivatives of L with respect to w, b, ξ_i, and ξ_i^* to zero, we obtain:
1. Derivative with respect to w:
$$
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \phi(x_i) = 0
$$
This implies
$$
w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \phi(x_i).
$$
2. Derivative with respect to b:
$$
\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0,
$$
which gives the equality constraint
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0.
$$
3. Derivatives with respect to ξ_i and ξ_i^*:
$$
\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \eta_i = 0 \quad\Longrightarrow\quad \eta_i = C - \alpha_i,
$$
$$
\frac{\partial L}{\partial \xi_i^*} = C - \alpha_i^* - \eta_i^* = 0 \quad\Longrightarrow\quad \eta_i^* = C - \alpha_i^*.
$$
Since η_i, η_i^* ≥ 0, it follows that 0 ≤ α_i, α_i^* ≤ C.
Bias Term b
To find b, we use the fact that for any support vector x_k with 0 < α_k < C or 0 < α_k^* < C, the following holds:
$$
b = y_k - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x_k) \pm \epsilon,
$$
with −ϵ when 0 < α_k < C (the constraint y_k − f(x_k) = ϵ is then active) and +ϵ when 0 < α_k^* < C.
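In code, this bias computation is only a few lines. The sketch below is purely an illustration: the arrays alpha and alpha_star (assumed to come from some dual solver), the kernel matrix K, and the targets y are hypothetical inputs, not part of any particular library's API.

```python
import numpy as np

def compute_bias(alpha, alpha_star, K, y, C, eps, tol=1e-8):
    """Estimate b by averaging over all support vectors with a free multiplier.

    alpha, alpha_star : dual variables, shape (n,)
    K                 : kernel matrix, K[i, j] = K(x_i, x_j), shape (n, n)
    y                 : targets, shape (n,)
    C, eps            : SVR hyperparameters (box constraint and epsilon-tube width)
    """
    beta = alpha - alpha_star                   # net coefficients (alpha_i - alpha_i^*)
    b_values = []
    for k in range(len(y)):
        wk = beta @ K[:, k]                     # w^T phi(x_k) expressed through kernels
        if tol < alpha[k] < C - tol:            # free alpha_k  ->  b = y_k - wk - eps
            b_values.append(y[k] - wk - eps)
        elif tol < alpha_star[k] < C - tol:     # free alpha_k* ->  b = y_k - wk + eps
            b_values.append(y[k] - wk + eps)
    # Averaging over all free support vectors is numerically more stable
    # than relying on a single one.
    return float(np.mean(b_values)) if b_values else 0.0
```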
The Kernel Trick
Principle
The kernel trick allows us to compute inner products in the high-dimensional feature space without explicitly performing the transformation φ(x). Instead, we use a kernel function K(x_i, x_j) that computes the inner product directly:
$$
K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.
$$
Common kernel functions include:
1. Linear Kernel:
$$K(x_i, x_j) = x_i^\top x_j.$$
2. Polynomial Kernel:
$$K(x_i, x_j) = \left( \gamma\, x_i^\top x_j + r \right)^d, \quad \gamma > 0.$$
3. Radial Basis Function (RBF) Kernel:
$$K(x_i, x_j) = \exp\!\left( -\gamma \, \|x_i - x_j\|^2 \right), \quad \gamma > 0.$$
4. Sigmoid Kernel:
$$K(x_i, x_j) = \tanh\!\left( \gamma\, x_i^\top x_j + r \right).$$
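To make the principle concrete, the following sketch (an illustration of the kernel trick, using a hand-built feature map assumed for the degree-2 polynomial kernel with γ = 1 and r = 0) shows that the kernel value equals an explicit inner product in feature space, even though the kernel itself never forms φ(x):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so that <phi(x), phi(z)> = (x^T z)^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z, gamma=1.0, r=0.0, d=2):
    """Polynomial kernel K(x, z) = (gamma * x^T z + r)^d."""
    return (gamma * np.dot(x, z) + r) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes give the same number, but only the second one builds phi explicitly.
print(poly_kernel(x, z))        # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))   # 1*9 + 4*1 + (2*sqrt(2))*(-3*sqrt(2)) = 9 + 4 - 12 = 1.0
```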
Concrete Example
Let’s consider a simple dataset to illustrate SVR with a linear kernel.
Dataset:
Suppose we have three data points:
x_1 = 1,  y_1 = 2,
x_2 = 2,  y_2 = 3,
x_3 = 3,  y_3 = 2.
Parameters:
ϵ = 0.5
C = 1.0
Kernel: Linear, K(x_i, x_j) = x_i^⊤ x_j
1. Solve the Dual Problem:
Maximize the dual objective subject to
$$
\sum_{i=1}^{3} (\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^* \le 1, \quad \forall i.
$$
Suppose solving this problem yields
$$
\alpha_1 - \alpha_1^* = 0.6, \qquad \alpha_2 - \alpha_2^* = 0, \qquad \alpha_3 - \alpha_3^* = -0.6,
$$
so that w = Σ_i (α_i − α_i^*) x_i = 0.6(1) + 0(2) − 0.6(3) = −1.2.
2. Compute b:
Using a support vector (e.g., x_1):
$$
b = y_1 - w^\top x_1 \pm \epsilon = 2 + 1.2 \pm 0.5 = 3.2 \pm 0.5.
$$
Since ϵ = 0.5, b lies between 2.7 and 3.7. We can choose b = 3.2 for simplicity.
3. Final Regression Function:
$$
f(x) = -1.2\,x + 3.2.
$$
Making Predictions:
For x = 1.5: f(1.5) = −1.2(1.5) + 3.2 = 1.4.
Visualization:
The regression function f ( x ) is a straight line with negative slope, fitting the
data within the ϵ -tube (± 0.5 around the regression line), except possibly at
the support vectors.
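For comparison, the same toy problem can be run through scikit-learn's SVR. This is only a sanity check: the library solves the same dual problem, but the coefficients and intercept it returns may differ somewhat from the hand-picked values above, since more than one function can fit the data within the ϵ-tube.

```python
import numpy as np
from sklearn.svm import SVR

# The three training points from the example above.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 3.0, 2.0])

# Linear kernel, C = 1.0, epsilon = 0.5, matching the parameters in the text.
model = SVR(kernel="linear", C=1.0, epsilon=0.5)
model.fit(X, y)

print("w      =", model.coef_)          # net weight, comparable to -1.2 above
print("b      =", model.intercept_)     # bias term, comparable to 3.2 above
print("f(1.5) =", model.predict([[1.5]]))
```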
Connecting the Dots
From Primal to Dual: The Lagrangian converts the constrained primal problem into the dual; setting its derivatives to zero expresses w as a weighted combination of the mapped training points and produces the equality and box constraints on α_i and α_i^*.
Conclusion
The dual form of SVR, combined with the kernel trick, provides a powerful
framework for performing regression in high-dimensional or infinite-
dimensional spaces efficiently. By understanding the role of each component
in the dual formulation and how they contribute to the final regression
function, we can better interpret and apply SVR to complex real-world
problems.
The primal problem minimizes
$$
\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
$$
subject to:
$$
y_i - w^\top \phi(x_i) - b \le \epsilon + \xi_i
$$
$$
w^\top \phi(x_i) + b - y_i \le \epsilon + \xi_i^*
$$
$$
\xi_i,\ \xi_i^* \ge 0, \quad \forall i
$$
where:
C controls the tradeoff between model complexity and tolerance to
errors.
ϵ defines the margin within which errors are ignored.
ξ_i, ξ_i^* are slack variables for points outside the margin.
The Lagrangian introduces multipliers α_i, α_i^* for the ϵ-constraints and η_i, η_i^* for the slack variables:
$$
L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
- \sum_{i=1}^{n} \alpha_i \left( \epsilon + \xi_i - y_i + w^\top \phi(x_i) + b \right)
- \sum_{i=1}^{n} \alpha_i^* \left( \epsilon + \xi_i^* + y_i - w^\top \phi(x_i) - b \right)
- \sum_{i=1}^{n} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right),
$$
where η_i, η_i^* ≥ 0 are multipliers enforcing ξ_i, ξ_i^* ≥ 0.
Dual Optimization Problem
By setting the derivatives to zero and eliminating w, we obtain the dual problem:
$$
\max_{\alpha,\, \alpha^*} \;\; \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
\;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
\;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
$$
subject to:
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$
$$
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i
$$
where K(x_i, x_j) = φ(x_i)^⊤ φ(x_j) is the kernel function.
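The sketch below (illustrative helper functions, not library code) evaluates this dual objective and checks its constraints for candidate multipliers; the candidate values reuse the three-point example from earlier in the text.

```python
import numpy as np

def dual_objective(alpha, alpha_star, K, y, eps):
    """Value of the SVR dual objective for candidate multipliers."""
    beta = alpha - alpha_star
    return beta @ y - 0.5 * beta @ K @ beta - eps * np.sum(alpha + alpha_star)

def feasible(alpha, alpha_star, C, tol=1e-8):
    """Check the equality and box constraints of the dual problem."""
    balanced = abs(np.sum(alpha - alpha_star)) <= tol
    in_box = np.all((alpha >= -tol) & (alpha <= C + tol) &
                    (alpha_star >= -tol) & (alpha_star <= C + tol))
    return balanced and in_box

# Candidate multipliers for the three-point example (x = [1, 2, 3], linear kernel).
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 2.0])
K = np.outer(X, X)                        # linear kernel matrix
alpha = np.array([0.6, 0.0, 0.0])
alpha_star = np.array([0.0, 0.0, 0.6])

print(feasible(alpha, alpha_star, C=1.0))                 # True
print(dual_objective(alpha, alpha_star, K, y, eps=0.5))   # objective value for these candidates
```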
Common kernel choices:
Radial Basis Function (RBF): K(x_i, x_j) = exp(−γ ||x_i − x_j||²). Models complex, nonlinear patterns.
Sigmoid: K(x_i, x_j) = tanh(β x_i^⊤ x_j + c). Inspired by neural networks.
Steps
1. Compute the Kernel Matrix (linear kernel, X = [1, 2, 3, 4, 5]^⊤):
$$
K(X, X) = X X^\top =
\begin{bmatrix}
1 & 2 & 3 & 4 & 5 \\
2 & 4 & 6 & 8 & 10 \\
3 & 6 & 9 & 12 & 15 \\
4 & 8 & 12 & 16 & 20 \\
5 & 10 & 15 & 20 & 25
\end{bmatrix}
$$
2. Solve the dual problem for the multipliers α_i, α_i^*.
3. Compute b from a support vector with a free multiplier.
4. Predict f(6) using:
$$
f(6) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, 6) + b
$$
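A minimal NumPy version of these steps is shown below. The coefficients beta = α − α^* and the bias b are placeholders standing in for the output of whatever dual solver is used in steps 2 and 3; only the kernel matrix and the prediction formula come from the text.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # training inputs
K = np.outer(X, X)                           # step 1: linear kernel matrix X X^T (5 x 5)

# Steps 2-3 would come from a dual QP solver; these values are purely illustrative.
beta = np.array([0.2, 0.0, 0.0, 0.0, -0.2])  # hypothetical alpha_i - alpha_i^* (sums to 0)
b = 3.0                                      # hypothetical bias

# Step 4: f(6) = sum_i beta_i * K(x_i, 6) + b, with K(x_i, 6) = x_i * 6 for the linear kernel.
x_new = 6.0
k_new = X * x_new
print("f(6) =", beta @ k_new + b)            # 0.2*6 - 0.2*30 + 3 = -1.8
```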
7. Conclusion
SVR in dual form efficiently handles nonlinear regression using
kernels.
The dual variables satisfy
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$
$$
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i
$$
where:
α_i, α_i^* are Lagrange multipliers (dual variables).
Interpretation: Only support vectors (data points with nonzero α_i or α_i^*) contribute to the final prediction.
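This sparsity can be inspected directly in scikit-learn, whose fitted SVR exposes the support-vector indices and the nonzero net multipliers. The synthetic data below is made up purely for illustration.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)    # noisy 1-D regression data

model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)

# dual_coef_ holds (alpha_i - alpha_i^*) for the support vectors only;
# points strictly inside the epsilon-tube have both multipliers equal to zero
# and therefore do not appear here at all.
print("number of training points :", len(X))
print("number of support vectors :", len(model.support_))
print("dual coefficient range    :", model.dual_coef_.min(), model.dual_coef_.max())
print("bias term b               :", model.intercept_)
```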
Statistical Properties
Bias-Variance Tradeoff: Controlled via C and kernel parameters.
7. Practical Applications
7.1. Real-World Use Cases
Financial Forecasting: Stock price predictions.
8. Conclusion
SVR in dual form allows solving complex regression problems via
kernel tricks.
The method efficiently finds optimal regression functions using
support vectors.
Further Exploration:
Advanced Kernels: Exploring graph-based kernels.
The dual objective under discussion is
$$
\max_{\alpha,\, \alpha^*} \;\; \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
\;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
\;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
$$
Where α = [α_1, α_2, …, α_n] and α^* = [α_1^*, α_2^*, …, α_n^*].
Optimization Variables
α_i, α_i^*: Lagrange multipliers (dual variables) associated with each data point i.
Kernel Function
K(x_i, x_j): The kernel function computes the inner product of the images of x_i and x_j in the feature space.
o Definition: K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩
2. Maximization Operator
max_{α, α^*}: Indicates that we aim to find the values of α and α^* that maximize the objective function.
First Term:
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
$$
Operation: Multiply each difference (α_i − α_i^*) by the corresponding target y_i and sum over all data points.
Interpretation:
o This term measures how well the model's predictions align with the actual targets.
o (α_i − α_i^*) represents the net influence of each data point.
o A positive value indicates a direct relationship, while a negative value indicates an inverse relationship.
Second Term:
o Double Summation: Sums the weighted pairwise interactions (α_i − α_i^*)(α_j − α_j^*) K(x_i, x_j) over all pairs of data points.
Equality Constraint
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$
Ensures that the net effect of the dual variables balances out.
Inequality Constraints
$$
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i \in \{1, 2, \dots, n\}
$$
The ϵ-insensitive loss is
$$
L_\epsilon(y_i, f(x_i)) =
\begin{cases}
0, & |y_i - f(x_i)| \le \epsilon \\
|y_i - f(x_i)| - \epsilon, & \text{otherwise}
\end{cases}
$$
Purpose: Errors within the ϵ-tube are ignored; only deviations beyond ϵ are penalized, and then only linearly.
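A one-line NumPy version of this loss (an illustrative helper, not a library function) makes the flat region explicit:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, eps=0.5):
    """L_eps(y, f(x)) = max(0, |y - f(x)| - eps): zero inside the tube, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

# Residuals of 0.2 and 0.4 fall inside the eps = 0.5 tube and cost nothing;
# the residual of 1.0 is penalized only for the part that exceeds eps.
print(epsilon_insensitive_loss(np.array([2.0, 3.0, 2.0]),
                               np.array([1.8, 2.6, 3.0])))   # -> [0.  0.  0.5]
```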
Subject to:
$$
y_i - (w^\top \phi(x_i) + b) \le \epsilon + \xi_i
$$
$$
(w^\top \phi(x_i) + b) - y_i \le \epsilon + \xi_i^*
$$
$$
\xi_i,\ \xi_i^* \ge 0
$$
Variables:
o b: Bias term.
A. Maximization Operator
$$
\max_{\alpha,\, \alpha^*}
$$
Objective: Find the values of α and α^* that maximize the dual objective function.
B. First Term: Linear Term
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
$$
Breaking It Down
1. Compute (α_i − α_i^*):
o For each i, subtract α_i^* from α_i.
2. Multiply by y_i:
o Multiply the difference (α_i − α_i^*) by the target value y_i.
Interpretation
Positive Contribution:
o If (α_i − α_i^*) and y_i have the same sign, the term contributes positively to the objective function.
Negative Contribution:
o If (α_i − α_i^*) and y_i have opposite signs, the term reduces the objective function.
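In code, this first term is a single dot product; the beta values below are the hypothetical (α_i − α_i^*) coefficients from the earlier three-point example:

```python
import numpy as np

y = np.array([2.0, 3.0, 2.0])
beta = np.array([0.6, 0.0, -0.6])   # hypothetical (alpha_i - alpha_i^*) values

# First term of the dual objective: sum_i (alpha_i - alpha_i^*) * y_i
linear_term = beta @ y
print(linear_term)                   # 0.6*2 + 0*3 - 0.6*2 = 0.0
```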
C. Second Term: Quadratic Double Sum
$$
\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
$$
Breaking It Down
1. Compute (α_i − α_i^*) and (α_j − α_j^*):
o Form the net coefficients for each pair of data points (i, j).
2. Multiply the Differences:
o Compute the product (α_i − α_i^*)(α_j − α_j^*).
3. Evaluate the Kernel K(x_i, x_j):
o Measure the similarity of x_i and x_j in the feature space.
4. Multiply:
o Multiply the result from step 2 by the kernel value from step 3.
5. Sum Over All Pairs (i, j):
o Add up the products over all pairs and scale the total by 1/2, as in the sketch below.
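Vectorized, the whole double sum collapses to a quadratic form; this sketch reuses the same hypothetical beta values and linear-kernel matrix as the previous snippets:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
K = np.outer(X, X)                   # linear kernel matrix, K[i, j] = x_i * x_j
beta = np.array([0.6, 0.0, -0.6])    # hypothetical (alpha_i - alpha_i^*) values

# 0.5 * sum_i sum_j beta_i * beta_j * K[i, j], written as 0.5 * beta^T K beta
quadratic_term = 0.5 * beta @ K @ beta
print(quadratic_term)                # 0.5 * (0.6*1 - 0.6*3)^2 = 0.72
```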
D. Kernel Function
K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩
o Parameters (for the polynomial and sigmoid kernels):
γ: Scaling factor.
r: Constant term.
o Example (sigmoid): K(x_i, x_j) = tanh(γ x_i^⊤ x_j + r)
Properties
Symmetric: K(x_i, x_j) = K(x_j, x_i).
E. Constraints Detailed
Equality Constraint
$$
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0
$$
Explanation: This condition comes from setting ∂L/∂b = 0; the positive and negative influences of the data points must balance.
Inequality Constraints
$$
0 \le \alpha_i,\ \alpha_i^* \le C, \quad \forall i
$$
C: Regularization parameter.
Purpose: Bounds the influence of any single data point and controls the tradeoff between model complexity and tolerance to errors.
F. The Regression Function
$$
f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(x_i, x) + b
$$
Components
1. Sum Over Support Vectors:
o Only data points where (α_i − α_i^*) ≠ 0 contribute to the sum.
o These points are known as support vectors.
2. Kernel Function K(x_i, x):
o Measures the similarity between each training point x_i and the new input x.
Interpretation
The regression function is a weighted sum of kernel evaluations
between the support vectors and the input x , adjusted by the bias term
b.
It predicts the target value for x based on its similarity to the training
data.
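Written as code, the prediction rule is exactly this weighted sum. The support vectors, coefficients, and bias below are hypothetical stand-ins for whatever training would produce, and the RBF kernel is just one possible choice:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def predict(x, X_sv, beta_sv, b, gamma=0.5):
    """f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b, over the support vectors only."""
    return sum(beta_i * rbf_kernel(x_i, x, gamma)
               for x_i, beta_i in zip(X_sv, beta_sv)) + b

# Hypothetical support vectors and net coefficients, purely for illustration.
X_sv = np.array([[1.0], [3.0]])
beta_sv = np.array([0.4, -0.4])      # (alpha_i - alpha_i^*) for each support vector
print(predict(np.array([2.0]), X_sv, beta_sv, b=2.5))
```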
Top-Down Perspective
Let’s consider the broader picture and see how all these pieces fit together.
1. Goal of SVR
Objective: Find a function f ( x ) that has at most ϵ deviation from the
actual target values y i for all training data, and is as flat as possible.
5. Prediction
For a new input x: f(x) = Σ_{i=1}^{n} (α_i − α_i^*) K(x_i, x) + b.
Bottom-Up Perspective
Starting from the fundamental elements, we see how they build up to form
the complete SVR model.
1. Data Points
Each data point (x_i, y_i) contributes to the model through the dual variables α_i and α_i^*.
2. Dual Variables
Reflect the influence of each data point.
4. Objective Function
Balances fitting the data (first term) with controlling model complexity
(second term).
5. Regression Function
Aggregates the contributions from support vectors.
Provides predictions for new inputs based on learned patterns.
Additional Considerations
Hyperparameters
C: Regularization parameter.
ϵ: Width of the insensitive tube.
Kernel parameters (e.g., γ for the RBF kernel, d and r for the polynomial kernel).
Model Selection
Cross-Validation: Used to select optimal hyperparameters by
evaluating model performance on validation sets.
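A common way to do this with scikit-learn is a grid search over C, ϵ, and the kernel parameter γ; the grid values and synthetic data below are arbitrary placeholders meant only to show the pattern:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(60, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)    # synthetic data for illustration

param_grid = {
    "C": [0.1, 1.0, 10.0],          # complexity / error tradeoff
    "epsilon": [0.05, 0.1, 0.5],    # width of the insensitive tube
    "gamma": [0.1, 0.5, 1.0],       # RBF kernel parameter
}

# 5-fold cross-validation over the grid; scoring defaults to R^2 for regressors.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV score  :", search.best_score_)
```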
Computational Complexity
Training Time: Grows quickly with the dataset size, since the n × n kernel matrix must be computed and stored and the dual problem solved over it.
Advantages of SVR
Effective in high-dimensional spaces.
Limitations
Choice of hyperparameters can be non-trivial.
Not well suited for very large datasets without using approximations.