Linear Models and Learning via Optimization
Introduction to Machine Learning (CS771A)
Piyush Rai
August 9, 2018
Recap
Decision Trees: Learning by asking questions. Ask the “important” questions first!
[Figure: a decision tree for a 1-D regression problem: internal nodes ask YES/NO questions about the input x (thresholds such as 5 and 3), and the leaves predict y = 3.5, y = 2.5, or y = 0.5; shown alongside the resulting piecewise-constant fit over x ∈ [0, 8].]
Linear Models
Linear Models
Consider learning to map an input x ∈ R^D to its output y (say, real-valued)
Assume the output to be a linear weighted combination of the D input features:
y = w^⊤ x = Σ_{d=1}^{D} w_d x_d
where x is the input (with D features), w ∈ R^D is the weight vector, and y is the (predicted) output
Linear Models for Binary Classification
Predict using the sign of the score: y = sign(w^⊤ x). If desired, can turn the score w^⊤ x into the probability of the label being +1 (as in logistic regression)
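A minimal sketch of this prediction rule (not from the slides), assuming an already-learned weight vector w: the label is the sign of the score, and a sigmoid turns the score into P(y = +1 | x), logistic-regression style.

```python
import numpy as np

def predict_binary(w, X):
    """Linear binary classification: label = sign of the score w^T x for each row of X."""
    scores = X @ w                                # one score per input
    labels = np.where(scores >= 0, 1, -1)         # hard +1/-1 predictions
    probs = 1.0 / (1.0 + np.exp(-scores))         # sigmoid: P(y = +1 | x)
    return labels, probs

# Toy usage with an illustrative (not learned) weight vector
w = np.array([0.5, -1.2, 0.3])
X = np.array([[1.0, 0.2, 0.5],
              [0.1, 1.5, -0.3]])
labels, probs = predict_binary(w, X)
```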
Linear Models for Multi-class/Multi-label Classification
Recall that, in multi-class/multi-label classification, y = [y_1, y_2, …, y_M] is a vector of length M
Just like multi-output regression, each component y_m of y can be modeled by a weight vector w_m
Need a way to convert y ∈ R^M to a one-hot vector (for multi-class) or a binary vector (for multi-label)
Note: In some cases, the scores need not be converted, e.g.,
Can use the index of the largest entry in y as the predicted class in multi-class classification (e.g., class 2 for scores [0.25, 0.6, 0.1, 0.4, 0.2])
Can use the indices of the top few entries in y as the predicted labels in multi-label classification (e.g., labels {2, 4} for the same scores, taking the top two)
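A minimal sketch of these two decision rules (illustrative, not from the slides), using the example score vector above; class/label indices here start at 1 to match the counting in the example.

```python
import numpy as np

scores = np.array([0.25, 0.6, 0.1, 0.4, 0.2])             # scores y for one input, M = 5

# Multi-class: predict the class with the largest score
predicted_class = np.argmax(scores) + 1                    # -> 2

# Multi-label: predict the indices of the top-k scores (here k = 2)
k = 2
predicted_labels = np.sort(np.argsort(scores)[-k:] + 1)    # -> array([2, 4])
```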
Linear Models for Dimensionality Reduction
Linear models can be used to reduce data-dimensionality (e.g., Principal Component Analysis)
[Figure: a linear mapping from the D input features to K latent features z_1, z_2, …, z_K, where K may be less than D.]
Note that it looks similar to multi-output regression but the output vector z is latent
An example of an unsupervised learning problem
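As an illustrative sketch (one concrete choice, PCA via the SVD, rather than anything specific from the slides): a linear map z = W^⊤ x reducing D features to K latent features.

```python
import numpy as np

def pca_project(X, K):
    """Project N x D data onto its top-K principal directions: a linear map z = W^T x."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:K].T                                     # D x K matrix of principal directions
    Z = Xc @ W                                       # N x K latent representations
    return Z, W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                       # N = 100 inputs, D = 10 features
Z, W = pca_project(X, K=3)                           # K = 3 < D latent features
```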
Linear Models to construct Deep Neural Networks
Linear models are used as basic components of deep neural networks (nonlinear models)
[Figure: the D input features feed into a stack of hidden layers.]
Each hidden layer learns a latent-feature-based representation of the original input x
Note: After each hidden layer, there is also a nonlinearity (not shown); the stack of layers acts as a "deep" feature learner
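A minimal sketch of this construction (sizes and weights are illustrative, not learned): each hidden layer is a linear model followed by a nonlinearity (ReLU here), and the output is again a linear model on the last layer's features.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, W1, W2, w_out):
    """Stack of linear models with nonlinearities in between (a tiny feedforward network)."""
    h1 = relu(W1 @ x)        # hidden layer 1: latent features of the D inputs
    h2 = relu(W2 @ h1)       # hidden layer 2: latent features of latent features
    return w_out @ h2        # output: a linear model on the last hidden layer

rng = np.random.default_rng(0)
D, K1, K2 = 10, 8, 4
x = rng.normal(size=D)
W1, W2 = rng.normal(size=(K1, D)), rng.normal(size=(K2, K1))
w_out = rng.normal(size=K2)
y_hat = forward(x, W1, W2, w_out)
```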
Linear Models with Offset (Bias) Parameter
A linear model with an offset (bias) term is y = w^⊤ x + b. Can append a constant feature "1" to each input and absorb b into w, rewriting it as y = w^⊤ x with x, w ∈ R^{D+1}
We will assume the same and omit the explicit bias for simplicity of notation
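A minimal sketch of this bias-absorbing trick (data and weights are illustrative): append a constant feature 1 to each input and fold b into the weight vector.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])                           # N x D inputs (D = 2)
w = np.array([0.5, -1.0])                            # weights
b = 0.25                                             # bias/offset

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])     # append the constant feature "1"
w_aug = np.append(w, b)                              # absorb b into the weights

assert np.allclose(X @ w + b, X_aug @ w_aug)         # y = w^T x + b  ==  w_aug^T x_aug
```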
Learning Linear Models
[Figure: recap of the linear models seen so far, each mapping the input (D features) to outputs or to latent features z_1, z_2, …, z_K via weights W that are to be learned.]
Linear Regression: Pictorially
[Figure: a line fit to training data with scalar inputs x_n and outputs y_n.]
Error of the model for an example = y_n − w^⊤ x_n (= y_n − w x_n for the scalar-input case)
Linear Regression
Define the total error or “loss” on the training data, when using w as our model, as
L(w) = Σ_{n=1}^{N} (y_n − w^⊤ x_n)²
Note: Squared loss chosen for simplicity. Can define other types of losses too (more on this later)
The best w will be the one that minimizes the above error (requires optimization w.r.t. w )
ŵ = arg min_w L(w) = arg min_w Σ_{n=1}^{N} (y_n − w^⊤ x_n)²
This is known as "least squares" linear regression (Gauss/Legendre, early 19th century)
Taking the derivative (gradient) of L(w) w.r.t. w and setting it to zero:
Σ_{n=1}^{N} 2 (y_n − w^⊤ x_n) ∂/∂w (y_n − x_n^⊤ w) = 0   ⇒   Σ_{n=1}^{N} x_n (y_n − x_n^⊤ w) = 0
Linear Regression
[Figure: the matrix form of the problem, y ≈ Xw, with X the N × D input matrix, y the N × 1 output vector, and w the D × 1 weight vector.]
Consider the closed form solution we obtained for linear regression based on least squares: ŵ = (X^⊤ X)^{−1} X^⊤ y
The above closed form solution is nice but has some issues
The D × D matrix X^⊤ X may not be invertible
Based solely on minimizing the training error Σ_{n=1}^{N} (y_n − w^⊤ x_n)² ⇒ can overfit the training data
Expensive inversion for large D: can use iterative optimization techniques (will come to this later)
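A minimal sketch of the closed-form least squares solution on synthetic data (not from the slides); solving the D × D normal equations with np.linalg.solve avoids forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))                          # N x D input matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)            # noisy linear outputs

# w = (X^T X)^{-1} X^T y, computed by solving the normal equations
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```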
Regularized Linear Regression (a.k.a. Ridge Regression)
Consider the regularized loss: training error + squared ℓ₂ norm of w, i.e., ‖w‖₂² = w^⊤ w = Σ_{d=1}^{D} w_d²
L_reg(w) = [ Σ_{n=1}^{N} (y_n − w^⊤ x_n)² ] + λ w^⊤ w
There is a trade-off between the two terms: The regularization hyperparam λ > 0 controls it
Very small λ means almost no regularization (can overfit)
Very large λ means very high regularization (can underfit - high training error)
Can use cross-validation to choose the “right” λ
The solution to the above optimization problem is: w = (X^⊤ X + λI_D)^{−1} X^⊤ y
Note that, in this case, regularization also makes the inversion possible: the λI_D term ensures X^⊤ X + λI_D is invertible for any λ > 0
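A minimal sketch of the ridge solution on synthetic data (lam plays the role of λ; data and values are illustrative).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I_D)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w_ridge = ridge_fit(X, y, lam=0.1)      # larger lam -> smaller weights (more regularization)
```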
How Does ℓ₂ Regularization Help Here?
Suppose two inputs x_n and x_m are nearly identical, differing only (slightly) in their d-th feature. As per the model y = f(x) = w^⊤ x, the predictions y_n and y_m will differ by w_d times that small difference
Unless we constrain w_d to have a small value, the difference in predictions could also be very large (which isn't what we want)
That's why regularizing (via ℓ₂ regularization) and making the individual components of the weight vector small helps
Regularization: Some Comments
Note: Since it learns a sparse w, ℓ₀ or ℓ₁ regularization is also useful for doing feature selection (w_d = 0 means feature d is irrelevant). We will revisit ℓ₁ later to formally see why it gives sparsity
Other techniques for regularization: early stopping (of training), "dropout", etc. (popular in deep neural networks; we will revisit these later when discussing deep learning)
Linear/Ridge Regression via Gradient Descent
Both least squares regression and ridge regression require matrix inversion
Least Squares: w = (X^⊤ X)^{−1} X^⊤ y,   Ridge: w = (X^⊤ X + λI_D)^{−1} X^⊤ y
Alternative (avoids matrix inversion): gradient descent. Starting from some initial w^(0), repeat until convergence:
w^(t) = w^(t−1) − η (∂L/∂w)|_{w = w^(t−1)},   where η is the learning rate
For least squares, the gradient is ∂L/∂w = −2 Σ_{n=1}^{N} x_n (y_n − x_n^⊤ w) (no matrix inversion involved)
Such iterative methods for optimizing loss functions are widely used in ML. Will revisit these later
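A minimal sketch of this iterative scheme for least squares on synthetic data (the learning rate eta and iteration count are illustrative and would normally be tuned).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

w = np.zeros(D)                                      # initialization w^(0)
eta = 0.001                                          # learning rate
for t in range(2000):
    grad = -2 * X.T @ (y - X @ w)                    # gradient of the squared loss (no inversion)
    w = w - eta * grad                               # gradient descent update
```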
Linear Regression via Gradient-based Methods: Some Notes
We will revisit gradient-based methods later, but a few things to keep in mind:
Gradient Descent is guaranteed to converge to a local minimum
Gradient Descent converges to the global minimum if the function is convex
For Gradient Descent, the learning rate is important (should not be too large or too small)
Linear Regression as Solving System of Linear Equations
Each training example (x_n, y_n) gives one (approximate) linear equation in the unknowns w: y_n ≈ w^⊤ x_n. Can therefore view the linear regression problem as solving a system of N linear equations in D unknowns
However, in linear regression we would rarely have N = D; typically N > D (overdetermined system) or D > N (underdetermined system)
Linear Regression: Some Other Comments
A simple and interpretable method. Very widely used.
Least squares and ridge regression are among the very few ML problems with closed-form solutions
Least Squares: w = (X^⊤ X)^{−1} X^⊤ y,   Ridge: w = (X^⊤ X + λI_D)^{−1} X^⊤ y
Many ML problems can be easily reduced to the form y = Xw or Y = XW
Equivalence to over/underdetermined systems of linear equations enables us to use efficient solvers (a lot of work in the numerical linear algebra community has gone into scaling up linear-system solvers)
An interesting bit: note that w = (X^⊤ X)^{−1} X^⊤ y ⇒ Aw = b, where A = X^⊤ X and b = X^⊤ y
Using this relation, we can solve for w by solving Aw = b, a standard linear system with D equations and D unknowns, using efficient linear-system solvers
The basic (regularized) linear regression can also be easily extended to
Nonlinear Regression y_n ≈ w^⊤ φ(x_n), by replacing the original feature vector x_n with a nonlinear transformation φ(x_n), where φ may be pre-defined or itself learned (a code sketch follows this list)
Generalized Linear Model y_n = g(w^⊤ x_n), when the response y_n is not real-valued but binary/categorical/count, etc., and g is a "link function"
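A minimal sketch of the nonlinear-regression extension (a hand-picked polynomial φ and illustrative data): the model stays linear in w, only the features change.

```python
import numpy as np

def poly_features(x, degree):
    """phi(x) = [1, x, x^2, ..., x^degree] for a vector of scalar inputs."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)       # a nonlinear target

Phi = poly_features(x, degree=3)                     # N x (degree + 1) transformed inputs
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)          # ordinary least squares on phi(x)
y_pred = Phi @ w                                     # nonlinear fit in x, linear in w
```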
General Supervised Learning as Optimization
We saw that regularized least squares regression required solving
" N #
X λ
ŵ = arg min Lreg (w ) = arg min (yn − w x n ) + w > w > 2
w w
n=1
2
This is essentially the training loss (called “empirical loss”), plus the regularization term
In general, for supervised learning, the goal is to learn a function f , s.t. f (x n ) ≈ yn , ∀n
Moreover, we also want to have a simple f , i.e., have some regularization
Therefore, learning the best f amounts to solving the following optimization problem
fˆ = arg min_f L_reg(f) = arg min_f Σ_{n=1}^{N} ℓ(y_n, f(x_n)) + λ R(f)
where ℓ(y_n, f(x_n)) measures the model f's training loss on (x_n, y_n) and R(f) is a regularizer
For least squares regression, f(x_n) = w^⊤ x_n, R(f) = w^⊤ w, and ℓ(y_n, f(x_n)) = (y_n − w^⊤ x_n)²
As we’ll see later, different supervised learning problems differ in the choice of f , R(.), and `
General Unsupervised Learning as Optimization
Can we formulate unsupervised learning problems as optimization problems? Yes, of course! :-)
Consider an unsupervised learning problem with N inputs X = {x_n}_{n=1}^{N}
Suppose we want to learn, for each input x_n, a new representation z_n, along with a function f, such that
x_n ≈ f(z_n)   ∀n
In this case, we can define a loss function ℓ(x_n, f(z_n)) that measures how well f can "reconstruct" the original x_n from its new representation z_n
This generic unsup. learning problem can thus be written as the following optimization problem
(fˆ, Ẑ) = arg min_{f, Z} Σ_{n=1}^{N} ℓ(x_n, f(z_n)) + λ R(f, Z)
In this case both f and Z need to be learned. Typically learned via alternating optimization
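A minimal sketch of alternating optimization for the simplest linear case f(z_n) = W z_n with squared reconstruction loss and no regularizer (dimensions and iteration count are illustrative): fix W and solve for Z, then fix Z and solve for W, and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 10, 3
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D))   # data with low-dimensional structure

W = rng.normal(size=(D, K))      # parameters of f(z) = W z
Z = rng.normal(size=(N, K))      # one latent representation z_n per input

for it in range(50):
    # Fix W, update Z: each x_n ~ W z_n is a least squares problem in z_n
    Z = np.linalg.solve(W.T @ W, W.T @ X.T).T
    # Fix Z, update W: least squares in the other direction
    W = np.linalg.solve(Z.T @ Z, Z.T @ X).T

reconstruction_error = np.linalg.norm(X - Z @ W.T)       # small once the alternation has converged
```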