
Programming for Data Science

Lecture 7 – Supervised Learning, Continued.

Thomas Lavastida
University of Texas at Dallas
[email protected]
Spring 2023
Agenda

• Assignment 2 Review
• Quick review of Supervised Learning and Linear Regression
• Linear Regression in Python
• Start Regularization and Cross Validation

Assignment 2 Review
Supervised Learning and Regression Review
Supervised Learning

• Given – labelled data points $(x_1, y_1), \dots, (x_n, y_n)$

  • $x$ – features, independent variables, predictors, columns, etc.
  • $y$ – target, dependent variable, outcome, etc.
  • Continuous $y$ -> then we call this regression
  • Discrete/categorical $y$ -> then we call this classification

• Goal: Find a mapping/function $f$ from $x$’s to $y$’s such that $f(x_i) \approx y_i$


Linear Regression

• Simple class of regression models


• Let $x_1, x_2, \dots, x_k$ be the independent variables
• Model parameters $\beta_1, \dots, \beta_k$ (one for each indep. variable), plus an intercept $\beta_0$
• Predicted outcome computed via a linear function:
  $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$

• Compute the $\beta$’s by minimizing the average squared error


Overfitting

• As the model gets more complex, it can fit the data more closely
• New data we see (and want to make predictions about) may not be fit well (i.e., high error)
• This is called overfitting

• Main idea to deal with this -> split into a train and a test set
  • Training set – used to compute model parameters
  • Test set – used to estimate the accuracy of the model on new data
PYTHON PRACTICE
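A minimal sketch of this train/test workflow with scikit-learn; the file name and column names (data.csv, x1, x2, y) are assumptions for illustration, not the dataset used in class:

# Minimal sketch -- assumed file and column names, adapt to the dataset in class
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("data.csv")                 # hypothetical dataset
X = df[["x1", "x2"]]                         # feature columns (assumed names)
y = df["y"]                                  # target column (assumed name)

# Hold out a test set so we can estimate accuracy on new data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)                  # computes the beta's on the training data

print(model.intercept_, model.coef_)         # beta_0 and beta_1..beta_k
print(mean_squared_error(y_test, model.predict(X_test)))   # error on unseen data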
Review: Overfitting

• Model with an overfitting problem
  • Good performance on the data in hand
  • Poor predictive accuracy on new data

• Solution 1 – Splitting data
  • Training set: train the model (get parameters)
  • Test set: evaluate performance

• Solution 2 – Regularization
Regularization – Intuition

• Overfitting occurrence: too many variables

• True relationship: $y = \beta_0 + \beta_1 x + \varepsilon$

• Fit the data w/ a 10th-degree polynomial:
  $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_{10} x^{10} + \varepsilon$

Fewer variables
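A small sketch of this intuition with NumPy (the data here is synthetic, not from the lecture): fit the same noisy linear data with a degree-1 and a degree-10 polynomial and compare errors on held-out points.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, 30)     # true relationship is linear plus noise

x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

for degree in (1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)             # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_mse, test_mse)   # degree 10 typically fits train better but test worse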
Regularization – Intuition (Cont.)

• Overfitting occurrence: large variance/fluctuation

• Large coefficients => large fluctuations

• Under the same scale:

  • Green: $f(x) = -x^4 + 7x^3 - 5x^2 - 31x + 30$

  • Blue: $g(x) = -\frac{1}{5} f(x)$

Smaller coefficients

https://www.datacamp.com/community/tutorials/towards-preventing-overfitting-regularization
Regularization – Intuition (Cont.)

• What we need
• Smaller coefficients (coefficient closer to 0)
• Fewer variables (coefficient = 0)

• Penalize the magnitude of coefficients

• Regularization
• Modify our original linear regression model
• Add terms to penalize the magnitude of coefficients
Regularization

• Linear regression (fit only)
  • Minimize the error between the actual and predicted values

  $f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} x_i + b) \right)^2$

• Regularization (fit + control overfitting)
  • Minimize the error between the predicted and actual values
  • Penalize the magnitude of the feature coefficients

  $f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} x_i + b) \right)^2 + \mathit{Penalty}(\boldsymbol{\omega})$
Regularization – Two Methods

$f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} x_i + b) \right)^2 + \underbrace{\mathit{Penalty}(\boldsymbol{\omega})}_{\text{Shrinkage Penalty}}$

• Two formulations of the shrinkage penalty

  • L2 regularization: the sum of squared coefficient magnitudes, $\lambda \sum_{j} \omega_j^2$
    => Ridge regression

  • L1 regularization: the sum of absolute coefficient magnitudes, $\lambda \sum_{j} |\omega_j|$
    => Lasso regression
Ridge Regression

• Linear regression with L2 regularization (sum of squared parameters)

• Minimize the function:

  $f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} x_i + b) \right)^2 + \lambda \sum_{j=1}^{k} \omega_j^2$   (the last term is the shrinkage penalty)

  where $\lambda \ge 0$

• Larger coefficient magnitudes increase the amount of penalty


Ridge Regression – Tuning Parameter

$f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} x_i + b) \right)^2 + \lambda \sum_{j=1}^{k} \omega_j^2$

• $\lambda$ – the amount of penalty
  • $\lambda = 0$ => a plain linear regression
  • $\lambda \to \infty$ => all coefficients would be driven toward zero
  • Higher $\lambda$, more penalty, smaller coefficients

• $\lambda$ – hyperparameter
  • NOT estimated with the other parameters
  • Set “manually” before model estimation
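A minimal sketch with scikit-learn, where the tuning parameter is called alpha rather than $\lambda$; X_train, y_train, X_test, y_test are assumed from the earlier split sketch:

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# alpha plays the role of lambda: higher alpha => stronger shrinkage
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

print(ridge.coef_)                                        # shrunken coefficients
print(mean_squared_error(y_test, ridge.predict(X_test)))  # test-set error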
LASSO

• Linear regression with L1 regularization (sum of absolute values of parameters)

  $f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} x_i + b) \right)^2 + \lambda \sum_{j=1}^{k} |\omega_j|$

  where $\lambda \ge 0$.

• L1 penalty can force some coefficient estimates to be exactly zero

• Combines the shrinking advantage of ridge regression with variable selection

• LASSO: Least absolute shrinkage and selection operator
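A minimal sketch with scikit-learn's Lasso (again alpha plays the role of $\lambda$; X_train and y_train are assumed from the earlier sketch), illustrating that some coefficients come out exactly zero:

import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)       # larger alpha => more coefficients forced to exactly zero
lasso.fit(X_train, y_train)

print(lasso.coef_)                                   # some entries are exactly 0
print("variables kept:", np.sum(lasso.coef_ != 0))   # built-in variable selection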


Hyperparameter Tuning and Cross Validation
Hyperparameter Tuning

• Hyperparameters – set before running the model

• Examples
  • LASSO and Ridge – $\lambda$
  • Polynomial regression – degree of the polynomial ($n$)

• Intuition of tuning (polynomial case)
  • Start with some potential values, e.g., $n = 1, 2, 3, \dots$
  • For each $n$, run the model
  • Select the model with the best performance
Tuning Method – Grid Search

• Try all possible hyperparameters of interest

• Most commonly used method for hyperparameter tuning

• Polynomial regression case


  • Define a set of potential polynomial degrees
  • Estimate, evaluate, choose

  [Table on slide: degree vs. MSE values – the degree with the lowest MSE is the selected model]


• Select the model with best performance … on which dataset?


Data Splitting – Model Training

[Diagram on slide: Labeled Data is split into a Training Set (model training, producing parameter estimates) and a Test Set (prediction and evaluation); the performance measure (e.g., MSE) on the test set is unbiased because the test data is untouched new data]

• Model selection?
  • For each model, get the performance measure on the test set
  • Select the model with the best performance on the test data

• Problem
  • The “best model”? It is only the “best fit for the test set!”
  • Overfitting the test set

• Solution: more splits
Data Splitting – Model Selection

[Diagram on slide: the original training set is further split into a Training Set (model training, parameter estimates) and a Validation Set (model selection); the Test Set remains separate]

• Validation set:
  • Used for model selection (e.g., hyperparameter tuning)

• Test set:
  • Untouched during training and selection
  • Used for model assessment (generalizability)
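A minimal sketch of such a three-way split using two calls to train_test_split (the proportions are illustrative assumptions; X and y are from the earlier sketch):

from sklearn.model_selection import train_test_split

# First carve off the untouched test set, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# Result: 60% training, 20% validation, 20% test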
Limitations of Single Splitting (Partition)

• Data waste: the method is applied to less data

• If there is not enough data – unreliable results
  • Small training set
  • Small test set

• Solution: Cross Validation


K-Fold Cross Validation

• Randomly cut the dataset into $K$ segments (folds)

  • Use the $k$-th segment as the test set, the rest as the training set
  • Obtain $MSE_k$, the mean squared error on the $k$-th segment (test set)
  • After $K$ iterations, calculate the mean of $MSE_1, \dots, MSE_K$

[Diagram on slide: 5 folds; in each of the 5 iterations a different fold serves as the test set, yielding $MSE_1, MSE_2, \dots, MSE_5$]
K-Fold Cross Validation

• No data put to waste
  • Works even with a small dataset
  • Involves more data in training the model
  • More reliable, by taking the mean of multiple $MSE_k$ values

• Model selection
  • Uses more data to evaluate the performance of each model
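A minimal sketch of 5-fold CV with scikit-learn's cross_val_score (X and y assumed from the earlier sketch; the scoring convention is negative MSE, so we flip the sign):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold CV: each fold serves once as the held-out segment
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
mse_per_fold = -scores              # MSE_1, ..., MSE_5
print(mse_per_fold.mean())          # cross-validated estimate of the MSE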
CV for Model Selection

• Combine CV with grid search

• Example: polynomial regression, grid search for the degree, with CV

  • Leave a portion of the data aside as the test set
  • Set a grid for the hyperparameter (let $n$ be the polynomial degree)
  • Select the model with the lowest CV score

  [Table on slide: degree vs. CV MSE values – the degree with the lowest CV score is selected, and that model is then applied to the test set]
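A minimal sketch of this loop with scikit-learn (X_train and y_train assumed from the earlier split); PolynomialFeatures expands the feature columns into powers up to the given degree:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

best_degree, best_mse = None, float("inf")
for degree in range(1, 11):                              # grid of candidate degrees
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X_train, y_train, cv=5,
                           scoring="neg_mean_squared_error").mean()
    if mse < best_mse:
        best_degree, best_mse = degree, mse

print(best_degree)    # lowest CV score; refit this model and apply it to the test set last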

Grid Search with CV

• Manually set a grid of discrete hyperparameter values


• Set a metric for model performance

• Search exhaustively through the grid


• For each set of hyperparameters, evaluate each model’s CV score
• The optimal hyperparameters are those of the model achieving the best CV score
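scikit-learn packages this procedure as GridSearchCV; a minimal sketch for tuning Ridge's alpha (the grid values are illustrative; X_train, y_train, X_test, y_test assumed from before):

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}           # manually chosen grid
search = GridSearchCV(Ridge(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")   # metric for model performance
search.fit(X_train, y_train)                              # exhaustive search with 5-fold CV

print(search.best_params_)                # hyperparameter value with the best CV score
print(search.score(X_test, y_test))       # final assessment on the untouched test set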
Tuning is expensive

• Run the model repeatedly
  • $N$ grid points, $K$-fold CV => $N \cdot K$ model fits
  • Example: 20 grid points, 5-fold CV => 100 fits

• Computationally expensive

• Sometimes very slight improvement


PYTHON PRACTICE
