w4 Generalisation
• The value of this quantity will depend on the dataset Ɗj on which it is trained.
• We write the average over the complete ensemble of datasets as an expected value: the arithmetic mean of a large number of independent realizations.
• It may be that the hypothesis function h(·) is, on average, different from the regression function f(x). This is called bias.
• It may be that the hypothesis function is very sensitive to the particular dataset
Ɗj, so that for a given x, it is larger than the required value for some datasets,
and smaller for other datasets. This is called variance.
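As a hedged aside (not on the slide, but a standard result), these two effects appear as the two terms of the bias–variance decomposition of the expected squared error, where h(x; Ɗ) denotes the hypothesis trained on dataset Ɗ:

```latex
\mathbb{E}_{\mathcal{D}}\!\left[\big(h(x;\mathcal{D})-f(x)\big)^{2}\right]
=\underbrace{\big(\mathbb{E}_{\mathcal{D}}[h(x;\mathcal{D})]-f(x)\big)^{2}}_{\text{bias}^{2}}
+\underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(h(x;\mathcal{D})-\mathbb{E}_{\mathcal{D}}[h(x;\mathcal{D})]\big)^{2}\right]}_{\text{variance}}
```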
Bias and variance
[Figure: Price plots for regression models of increasing complexity (more weights in the neural network, a larger number of nodes in the decision tree, more rules in a fuzzy logic model, etc.)]
Addressing overfitting:
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm (later in course).
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters β.
― Works well when we have a lot of features, each of which contributes a bit to predicting y.
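As a hedged sketch of the two options (the synthetic data and the scikit-learn estimators are my own choices for illustration, not from the slides): option 1 fits on a hand-picked subset of features, while option 2 keeps every feature but shrinks the coefficients with a ridge penalty.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))                      # 10 candidate features (synthetic)
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Option 1: manually select which features to keep, then fit ordinary least squares.
selected = [0, 1]
ols_subset = LinearRegression().fit(X[:, selected], y)

# Option 2: keep all features, but reduce the magnitude of the parameters (regularization).
ridge_all = Ridge(alpha=10.0).fit(X, y)

print("subset OLS coefficients:", ols_subset.coef_.round(3))
print("ridge coefficients (all features, shrunk):", ridge_all.coef_.round(3))
```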
Regularization
• Fitting by least squares (minimizing the RSS) will adjust the coefficients based on your training data.
• If there is noise in the training data (or extra terms in the equation), the estimated coefficients won't generalize well to future data.
• This is where regularization comes in: it shrinks, or regularizes, these learned estimates towards zero.
• Regularization significantly reduces the variance of the model without a substantial increase in its bias.
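A minimal simulation of this claim (synthetic data and made-up coefficients; a sketch rather than anything from the slides): refit OLS and ridge on many independent training sets Ɗj and compare how much the coefficient estimates vary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
true_beta = np.array([2.0, 0.0, 0.0, 0.0, 0.0])

def draw_dataset(n=30):
    X = rng.normal(size=(n, 5))
    y = X @ true_beta + rng.normal(scale=2.0, size=n)
    return X, y

ols_coefs, ridge_coefs = [], []
for _ in range(200):                              # 200 independent training sets D_j
    X, y = draw_dataset()
    ols_coefs.append(LinearRegression().fit(X, y).coef_)
    ridge_coefs.append(Ridge(alpha=5.0).fit(X, y).coef_)

ols_coefs, ridge_coefs = np.array(ols_coefs), np.array(ridge_coefs)
print("OLS   coefficient variance:", ols_coefs.var(axis=0).round(3))
print("Ridge coefficient variance:", ridge_coefs.var(axis=0).round(3))   # markedly smaller
print("OLS   mean estimate:", ols_coefs.mean(axis=0).round(3))           # ~unbiased
print("Ridge mean estimate:", ridge_coefs.mean(axis=0).round(3))         # slightly shrunk (small bias)
```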
RIDGE Regression
Basics
• The image above shows the ridge regression objective, in which the RSS is modified by adding the shrinkage quantity (written out after this list).
• The coefficients are then estimated by minimizing this function.
• Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model.
• An increase in the flexibility of a model is represented by an increase in its coefficients.
• If we want to minimize the above function, these coefficients need to be small.
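The objective in the image is not reproduced in this text; presumably it is the standard ridge criterion, in the notation used elsewhere on these slides:

```latex
\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}
\;+\;\lambda\sum_{j=1}^{p}\beta_j^{2}
\;=\;\mathrm{RSS}\;+\;\lambda\sum_{j=1}^{p}\beta_j^{2}
```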
Basics
• This is how the Ridge regression technique prevents coefficients from rising
too high.
• Also, notice that we shrink the estimated association of each variable with the response, except the intercept β0. The intercept is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.
• When λ = 0, the penalty term has no effect, and the estimates produced by
ridge regression will be equal to least squares.
• However, as λ→∞, the impact of the shrinkage penalty grows, and the
ridge regression coefficient estimates will approach zero.
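This behaviour is easy to check numerically; a small sketch (synthetic data, with scikit-learn's Ridge and its alpha parameter playing the role of λ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.3, size=40)

print("OLS:", LinearRegression().fit(X, y).coef_.round(3))
for lam in [0.0, 1.0, 10.0, 1000.0]:
    coef = Ridge(alpha=lam).fit(X, y).coef_
    # lambda = 0 reproduces the OLS estimates; a large lambda drives them towards zero
    print(f"lambda={lam:>6}:", coef.round(3))
```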
RIDGE vs Ordinary Least Squares
• With I the identity matrix, the ridge coefficients have the closed-form solution
β* = (XXᵀ + λI)⁻¹Xy
• You may recall that the optimal solution in the case of Ordinary Least Squares (OLS) is
β* = (XXᵀ)⁻¹Xy
RIDGE vs Ordinary Least Squares
• The λ parameter is the regularization penalty. Notice that setting λ to 0 is the same as using OLS, while the larger its value, the more strongly the size of the coefficients is penalized.
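A small numerical check of the closed-form expressions above (synthetic data; X is laid out with samples as columns so the formulas apply as written):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 50
X = rng.normal(size=(d, n))                      # d features, n samples (samples as columns)
y = X.T @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

def ridge_closed_form(X, y, lam):
    """Solve (X Xᵀ + λI) β = X y; λ = 0 gives the OLS normal equations."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

print("OLS   (lambda=0):  ", ridge_closed_form(X, y, 0.0).round(3))
print("Ridge (lambda=10): ", ridge_closed_form(X, y, 10.0).round(3))   # shrunk towards zero
```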
Constraints
• As can be seen, selecting a good value of λ is critical.
• Cross-validation comes in handy for this purpose (a sketch follows this list).
• A larger λ means the predictions become less sensitive to the independent variables.
• The penalty used here is the (squared) L2 norm of the coefficients, which is why ridge regression is also known as L2 regularization.
• The coefficients that are produced by the standard least squares method are
scale equivariant, i.e. if we multiply each input by c then the corresponding
coefficients are scaled by a factor of 1/c.
• Therefore, regardless of how the predictor is scaled, the multiplication of
predictor and coefficient (Xjβj) remains the same.
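A hedged sketch of choosing λ by cross-validation (synthetic data; scikit-learn's RidgeCV and StandardScaler are my choice of tooling, and the standardization step anticipates the next slide):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 1.0])   # mixed scales
y = X @ np.array([0.5, 0.05, 0.005, 5.0, 0.5]) + rng.normal(size=100)

# Standardize the predictors, then let RidgeCV pick lambda (alpha) by cross-validation.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5),
)
model.fit(X, y)
print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
```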
Standardizing the predictors
• However, this is not the case with ridge regression, and therefore, we need to
standardize the predictors or bring the predictors to the same scale before
performing ridge regression.
• The formula used to do this is given below.
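The formula is not reproduced in this text; the usual choice (an assumption on my part) divides each predictor by its sample standard deviation, so that all standardized predictors are on the same scale:

```latex
\tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\big(x_{ij}-\bar{x}_{j}\big)^{2}}}
```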
• This implies that ridge regression coefficients have the smallest RSS (loss
function) for all points that lie within the circle given by β1² + β2² ≤ s.
• Similarly, the lasso coefficients have the smallest RSS (loss function) for
all points that lie within the diamond given by |β1|+|β2|≤ s.
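Equivalently, in the two-predictor case described here, ridge and lasso can be written as constrained problems (a standard reformulation, stated for reference):

```latex
\hat{\beta}^{\text{ridge}}=\arg\min_{\beta}\ \mathrm{RSS}(\beta)\quad\text{subject to}\quad \beta_{1}^{2}+\beta_{2}^{2}\le s,
\qquad
\hat{\beta}^{\text{lasso}}=\arg\min_{\beta}\ \mathrm{RSS}(\beta)\quad\text{subject to}\quad |\beta_{1}|+|\beta_{2}|\le s.
```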
LASSO vs RIDGE
Details are in the next slides
LASSO vs RIDGE
• For a very large value of s, the green regions will contain the center of
the ellipse, making coefficient estimates of both regression techniques,
equal to the least squares estimates.
• Otherwise, when s is small enough that the constraint region does not contain the least squares estimate, the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region.
LASSO vs RIDGE
• Since ridge regression has a circular constraint with no sharp points, this
intersection will not generally occur on an axis, and so the ridge
regression coefficient estimates will be exclusively non-zero.
• However, the lasso constraint has corners at each of the axes, and so the
ellipse will often intersect the constraint region at an axis.
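A quick illustration of this geometric difference (synthetic data; scikit-learn's Ridge and Lasso with made-up penalty strengths): lasso tends to set some coefficients exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=80)

ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print("ridge:", ridge.coef_.round(3))   # shrunk, but typically all non-zero
print("lasso:", lasso.coef_.round(3))   # several coefficients driven exactly to 0
```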
LASSO vs RIDGE