
Linear Models and Learning via Optimization

Piyush Rai

Introduction to Machine Learning (CS771A)

August 9, 2018

Recap
Decision Trees: Learning by asking questions. Ask the “important” questions first!

[Figure: Left, a decision tree for binary classification over 2D inputs: each internal node asks a threshold question (NO/YES branches) and each leaf predicts Red or Blue, partitioning the input space into axis-aligned regions. Right, a regression tree over a scalar input x: the leaves predict constant values y = 0.5, y = 2.5, and y = 3.5 in different intervals of x.]

Linear Models

Linear Models
Consider learning to map an input x ∈ R^D to its output y (say, real-valued)
Assume the output to be a linear weighted combination of the D input features
[Figure: the D input features of x are combined into the predicted output y = w^T x = w_1 x_1 + w_2 x_2 + . . . + w_D x_D.]

This is an example of a linear model with D parameters w = [w_1, w_2, . . . , w_D]


Inspired by linear models of neurons
w ∈ R^D is also known as the weight vector
Here w_d denotes how important the d-th input feature is for predicting y
The above is basically a linear model for simple regression (a single, real-valued output y)
This basic model can also be used as a building block in many more complex models
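As a quick, hypothetical illustration (not from the slides), here is a minimal NumPy sketch of the prediction rule y = w^T x; the feature and weight values are made up:

```python
import numpy as np

# Hypothetical example: D = 3 input features and their weights
x = np.array([1.0, 2.0, 3.0])      # input features x_1, ..., x_D
w = np.array([0.5, -1.0, 0.25])    # weights w_1, ..., w_D

# Linear model prediction: y = w^T x = sum_d w_d * x_d
y_pred = w @ x
print(y_pred)   # 0.5*1.0 - 1.0*2.0 + 0.25*3.0 = -0.75
```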
Linear Models for Multi-output Regression

Can assume each of the M outputs in y ∈ R^M to be modeled by a linear model

Each output y_m (m = 1, . . . , M) is modeled by a weight vector w_m ∈ R^D: y_m = w_m^T x

The entire model for all M outputs can be represented as y = W^T x


W = [w_1, w_2, . . . , w_M] is a D × M matrix
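A minimal sketch (with made-up dimensions and random values) of the multi-output prediction y = W^T x:

```python
import numpy as np

D, M = 4, 3                 # hypothetical: 4 input features, 3 outputs
rng = np.random.default_rng(0)

x = rng.normal(size=D)      # a single input x in R^D
W = rng.normal(size=(D, M)) # weight matrix W = [w_1, ..., w_M], shape D x M

y = W.T @ x                 # all M outputs at once; y_m = w_m^T x
print(y.shape)              # (3,)
```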

Linear Models for Binary Classification

Use the sign of the “score” w^T x to predict the binary label

If desired, can turn the score w^T x into the probability of the label being +1 (logistic regression)
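A small, hedged sketch of both decision rules: the sign of the score for a hard label in {−1, +1}, and a sigmoid of the score for the probability of label +1 (logistic-regression style); the weights and input are hypothetical:

```python
import numpy as np

def predict_label(w, x):
    """Hard prediction: sign of the score w^T x (label in {-1, +1})."""
    return 1 if w @ x >= 0 else -1

def prob_label_plus1(w, x):
    """Logistic-regression style: squash the score through a sigmoid."""
    score = w @ x
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([0.5, -1.0])
x = np.array([2.0, 0.5])
print(predict_label(w, x), prob_label_plus1(w, x))  # 1, sigmoid(0.5) ≈ 0.62
```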

Linear Models for Multi-class/Multi-label Classification
Recall that, in multi-class/multi-label classification, y = [y_1, y_2, . . . , y_M] is a vector of length M

Just like multi-output regression, each component y_m of y can be modeled by a weight vector w_m
Need a way to convert y ∈ R^M to a one-hot vector (for multi-class) or a binary vector (for multi-label)
Note: In some cases, the score need not be converted, e.g.,
Can use the index of the largest entry in y as the predicted class in multi-class classification, e.g., scores [0.25, 0.6, 0.1, 0.4, 0.2] → predict the class of the second (largest) entry

Can use the indices of the top few entries in y as the predicted labels in multi-label classification, e.g., scores [0.25, 0.6, 0.1, 0.4, 0.2] → predict the labels of the top-2 entries (0.6 and 0.4)
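A minimal sketch of both rules applied to the score vector shown above (indices are 0-based here):

```python
import numpy as np

scores = np.array([0.25, 0.6, 0.1, 0.4, 0.2])   # y = W^T x for some input x

# Multi-class: predict the index of the largest score
pred_class = int(np.argmax(scores))              # -> 1, i.e., the 0.6 entry

# Multi-label: predict the indices of the top-k scores (here k = 2)
k = 2
pred_labels = np.argsort(scores)[::-1][:k]       # -> [1, 3], i.e., the 0.6 and 0.4 entries
print(pred_class, pred_labels)
```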

Linear Models for Dimensionality Reduction

Linear models can be used to reduce data-dimensionality (e.g., Principal Component Analysis)

[Figure: a linear model mapping the D input features of x to K latent features z = [z_1, . . . , z_K], where K may be less than D.]

Note that it looks similar to multi-output regression but the output vector z is latent
An example of an unsupervised learning problem

Need to learn both z and W in these problems
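As one concrete, hedged illustration, PCA can be computed from the SVD of the centered data matrix; the sketch below assumes an N × D data matrix X and keeps K latent features:

```python
import numpy as np

def pca(X, K):
    """Project N x D data onto the top-K principal directions (a PCA sketch)."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: X_centered = U S Vt; rows of Vt are principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:K].T                 # D x K projection ("weight") matrix
    Z = X_centered @ W           # N x K latent representation
    return Z, W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, W = pca(X, K=2)
print(Z.shape, W.shape)          # (100, 2) (5, 2)
```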

Linear Models to construct Deep Neural Networks

Linear models are used as basic components of deep neural networks (nonlinear models)

[Figure: a stack of hidden layers built from linear models: the D input features feed into Hidden Layer 1 with K_1 latent features z^(1), which feeds into Hidden Layer 2 with K_2 latent features z^(2), and so on up to Hidden Layer L with K_L latent features z^(L).]

Each hidden layer provides a learned, latent-feature-based representation of the original input x

Linear Models to construct Deep Neural Networks

Linear models are used as basic components of deep neural networks (nonlinear models)
Note: After each hidden layer, there is also a nonlinearity (not shown); the stack of hidden layers acts as a “deep” feature learner

[Figure: the same stacked architecture: D input features → Hidden Layers 1, . . . , L with K_1, . . . , K_L latent features z^(1), . . . , z^(L).]

Each hidden layer provides a learned, latent-feature-based representation of the original input x
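A minimal forward-pass sketch of this idea, with hypothetical layer sizes, randomly initialized weights, and ReLU as the (assumed) nonlinearity; each hidden layer is a linear model followed by the nonlinearity, and the output is another linear model on the last layer's latent features:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K1, K2 = 5, 8, 4            # hypothetical: input dim and two hidden-layer widths

# One weight matrix per layer (biases omitted, as in the slides)
W1 = rng.normal(size=(D, K1))
W2 = rng.normal(size=(K1, K2))
w_out = rng.normal(size=K2)

def relu(a):
    return np.maximum(a, 0.0)

def forward(x):
    z1 = relu(W1.T @ x)        # hidden layer 1: linear model + nonlinearity
    z2 = relu(W2.T @ z1)       # hidden layer 2: linear model + nonlinearity
    return w_out @ z2          # output: a linear model on the last latent features

x = rng.normal(size=D)
print(forward(x))
```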

Linear Models to construct Deep Neural Networks

Can even construct multiple-output versions of a deep neural network

[Figure: the same stacked architecture, with the D input features feeding Hidden Layers 1, . . . , L (K_1, . . . , K_L latent features), here used to produce multiple outputs.]

These can be used for multi-output regression, multi-class/multi-label classification, etc.

Linear Models with Offset (Bias) Parameter

Some linear models use an additional bias parameter b

Can append a constant feature “1” to each input and rewrite as y = w^T x, with x, w ∈ R^(D+1)
We will assume the same and omit the explicit bias for simplicity of notation
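A tiny sketch of this trick on a hypothetical N × D data matrix: append a column of ones so that the bias becomes just another weight:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])                    # N x D data matrix (N = 2, D = 2)

# Append a constant "1" feature to every input: X_aug is N x (D + 1)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

w_aug = np.array([0.5, -1.0, 2.0])            # [w_1, w_2, b]: the bias is the last weight
print(X_aug @ w_aug)                          # predictions w^T x + b for each row
```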

Learning Linear Models

[Figure: schematic recap of the linear models introduced so far, each parameterized by weights W (and latent features z_1, . . . , z_K where applicable).]

Linear Models are ubiquitous!


How do we learn them from data?
For linear models, learning = Learning the model parameters (the weights)
We will formulate learning as an optimization problem w.r.t. these parameters
Learning a Linear Model for Regression
Let’s focus on learning the simplest linear model for now: Linear Regression
[Figure: the linear regression model, mapping the D input features of x to the predicted output y.]

Suppose we are given regression training data {(x_n, y_n)}_{n=1}^N with x_n ∈ R^D and y_n ∈ R
Let's model the training data using w and assume y_n ≈ w^T x_n, ∀n (equivalently, y ≈ Xw)
[Figure: the training data arranged as a linear system y ≈ Xw, with X the N × D input matrix, y the N × 1 output vector, and w the D × 1 weight vector; solving this system gives the optimal w.]

Linear Regression: Pictorially

With one-dimensional inputs, linear regression would look like

[Figure: training points (x_n, y_n) plotted against a fitted line y = w x; the vertical gap between each point and the line is that example's error.]

Error of the model on an example = y_n − w^T x_n (= y_n − w x_n in the scalar-input case)

Linear Regression
Define the total error or “loss” on the training data, when using w as our model, as
L(w) = Σ_{n=1}^N (y_n − w^T x_n)^2

Note: Squared loss chosen for simplicity. Can define other types of losses too (more on this later)
The best w will be the one that minimizes the above error (requires optimization w.r.t. w )
ŵ = arg min_w L(w) = arg min_w Σ_{n=1}^N (y_n − w^T x_n)^2

This is known as “least squares” linear regression (Gauss/Legendre, early 19th century)
Taking the derivative (gradient) of L(w) w.r.t. w and setting it to zero:

Σ_{n=1}^N 2(y_n − w^T x_n) · ∂/∂w (y_n − x_n^T w) = 0  ⇒  Σ_{n=1}^N x_n (y_n − x_n^T w) = 0

Simplifying further, we get a closed-form solution for w ∈ R^D


w = (Σ_{n=1}^N x_n x_n^T)^{-1} (Σ_{n=1}^N y_n x_n) = (X^T X)^{-1} X^T y
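A hedged NumPy sketch of this closed-form solution on synthetic data (generated here purely for illustration); np.linalg.solve is used instead of an explicit inverse, and np.linalg.lstsq is shown as a more numerically robust alternative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)      # noisy linear data (synthetic)

# Closed-form least squares: w = (X^T X)^{-1} X^T y, via a linear solve
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and more robust when X^T X is ill-conditioned)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat, w_lstsq)                          # both should be close to w_true
```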

Linear Regression


Consider the closed form solution we obtained for linear regression based on least squares

The above closed-form solution is nice but has some issues:
The D × D matrix X^T X may not be invertible
Based solely on minimizing the training error Σ_{n=1}^N (y_n − w^T x_n)^2 ⇒ can overfit the training data

Expensive inversion for large D: Can use iterative optimization techniques (will come to this later)

Regularized Linear Regression (a.k.a. Ridge Regression)
Consider a regularized loss: training error + squared ℓ_2 norm of w, i.e., ||w||_2^2 = w^T w = Σ_{d=1}^D w_d^2
" N #
X
> 2 >
Lreg (w ) = (yn − w x n ) + λw w
n=1

Minimizing the above objective w.r.t. w does two things


Keeps the training error small
Keeps the ℓ_2 norm of w small (and thus also the individual components of w): Regularization

There is a trade-off between the two terms: The regularization hyperparam λ > 0 controls it
Very small λ means almost no regularization (can overfit)
Very large λ means very high regularization (can underfit - high training error)
Can use cross-validation to choose the “right” λ

The solution to the above optimization problem is: w = (X^T X + λ I_D)^{-1} X^T y
Note that, in this case, regularization also made the inversion possible (note the λ I_D term)
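A small sketch of the ridge solution on synthetic data similar to the previous sketch; the value of λ below is an arbitrary illustrative choice (in practice it would be picked by cross-validation):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge regression: w = (X^T X + lam * I)^{-1} X^T y, via a linear solve."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Hypothetical usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w_ridge = ridge_closed_form(X, y, lam=1.0)
print(w_ridge)   # slightly shrunk towards zero compared to the unregularized solution
```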

How Does ℓ_2 Regularization Help Here?

We saw that ℓ_2 regularization encourages the individual weights in w to be small


Small weights ensure that the function y = f(x) = w^T x is smooth (i.e., we expect similar x's to have similar y's). Below is an informal justification:
Consider two inputs x_n ∈ R^D and x_m ∈ R^D that are identical in all features except the d-th, where they differ by a small value, say ε
Assuming a simple/smooth function f(x), y_n and y_m should also be close

However, as per the model y = f(x) = w^T x, y_n and y_m will differ by w_d ε
Unless we constrain w_d to have a small value, the difference w_d ε would also be very large (which isn't what we want)
That's why regularizing (via ℓ_2 regularization) and making the individual components of the weight vector small helps

Regularization: Some Comments

Many ways to regularize ML models (for linear as well as other models)


Some are based on adding a norm of w to the loss function (as we already saw)
Using the ℓ_2 norm in the loss function promotes the individual entries of w to be small (we saw that)
Using the ℓ_0 norm encourages very few non-zero entries in w (thereby promoting a “sparse” w)

||w||_0 = #nnz(w)

Optimizing with ℓ_0 is difficult (NP-hard problem); can use the ℓ_1 norm as an approximation


||w||_1 = Σ_{d=1}^D |w_d|

Note: Since they learn a sparse w, ℓ_0 or ℓ_1 regularization is also useful for doing feature selection (w_d = 0 means feature d is irrelevant). We will revisit ℓ_1 later to formally see why ℓ_1 gives sparsity (see the sketch below for a quick illustration)
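As one concrete, hedged illustration of ℓ_1 regularization producing a sparse w, here is a sketch using scikit-learn's Lasso (assuming scikit-learn is available); the data is synthetic and only the first two features are actually relevant:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, D = 200, 10
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:2] = [3.0, -2.0]                 # only features 0 and 1 matter
y = X @ w_true + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1)                 # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)                       # most entries should be exactly zero (sparse w)
```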

Other techniques for regularization: Early stopping (of training), “dropout”, etc. (popular in deep neural networks; will revisit these later when discussing deep learning)
Linear/Ridge Regression via Gradient Descent

Both least squares regression and ridge regression require matrix inversion
Least Squares: w = (X^T X)^{-1} X^T y,  Ridge: w = (X^T X + λ I_D)^{-1} X^T y

Can be computationally expensive when D is very large


A faster way is to use iterative optimization, such as batch or stochastic gradient descent
A basic batch gradient-descent based procedure looks like
Start with an initial value of w = w^(0)
Update w by moving in the direction of the negative gradient of the loss function L

w^(t) = w^(t−1) − η (∂L/∂w)|_{w = w^(t−1)}, where η is the learning rate
Repeat until convergence
For least squares, the gradient is ∂L/∂w = −2 Σ_{n=1}^N x_n (y_n − x_n^T w) (no matrix inversion involved)

Such iterative methods for optimizing loss functions are widely used in ML. Will revisit these later
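A minimal batch gradient descent sketch for least squares; the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def least_squares_gd(X, y, eta=0.001, num_iters=2000):
    """Batch gradient descent for the squared loss sum_n (y_n - w^T x_n)^2."""
    N, D = X.shape
    w = np.zeros(D)                      # initial value w^(0)
    for _ in range(num_iters):
        grad = -2 * X.T @ (y - X @ w)    # dL/dw = -2 * sum_n x_n (y_n - x_n^T w)
        w = w - eta * grad               # move against the gradient
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(least_squares_gd(X, y))            # should be close to [1, -2, 0.5]
```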

Linear Regression via Gradient-based Methods: Some Notes
We will revisit gradient-based methods later, but a few things to keep in mind:
Gradient Descent is guaranteed to converge to a local minimum
Gradient Descent converges to the global minimum if the function is convex

A function is convex if its second derivative is non-negative everywhere (for functions of a scalar variable), or if its Hessian is positive semi-definite (for functions of a vector variable). For a convex function, every local minimum is also a global minimum.
Note: The squared loss function in linear regression is convex
With the ℓ_2 regularizer, it becomes strictly convex (a single global minimum)

For Gradient Descent, the learning rate is important (should not be too large or too small)

Linear Regression as Solving System of Linear Equations

Solving y = Xw for w is like solving for D unknowns w_1, . . . , w_D using N equations

y_1 = x_11 w_1 + x_12 w_2 + . . . + x_1D w_D
y_2 = x_21 w_1 + x_22 w_2 + . . . + x_2D w_D
. . .
y_N = x_N1 w_1 + x_N2 w_2 + . . . + x_ND w_D

Can therefore view the linear regression problem as a system of linear equations
However, in linear regression, we would rarely have N = D, but N > D or D > N

N > D case is an overdetermined system of linear equations (# equations > # unknowns)


D > N case is an underdetermined system of linear equations (# unknowns > # equations)
Thus methods to solve over/underdetermined systems can be used to solve linear regression as well
Many of these don’t require a matrix inversion (will provide a separate note with details)
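As a hedged example, NumPy's np.linalg.lstsq handles both the overdetermined (N > D) and underdetermined (D > N) cases without forming an explicit inverse, returning the (minimum-norm) least squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined: more equations than unknowns (N > D)
X_over = rng.normal(size=(50, 3))
y_over = rng.normal(size=50)
w_over, *_ = np.linalg.lstsq(X_over, y_over, rcond=None)

# Underdetermined: more unknowns than equations (D > N)
X_under = rng.normal(size=(3, 50))
y_under = rng.normal(size=3)
w_under, *_ = np.linalg.lstsq(X_under, y_under, rcond=None)

print(w_over.shape, w_under.shape)   # (3,) and (50,)
```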

Linear Regression: Some Other Comments
A simple and interpretable method. Very widely used.
Least squares and ridge regression are among the very few ML problems with closed-form solutions
Least Squares: w = (X^T X)^{-1} X^T y,  Ridge: w = (X^T X + λ I_D)^{-1} X^T y
Many ML problems can be easily reduced to the form y = Xw or Y = XW
Equivalence to over/underdetermined system of linear equations enables us to use efficient solvers
(a lot of work in the numerical linear algebra community to scale up linear systems solvers)
An interesting bit: Note that w = (X^T X)^{-1} X^T y ⇒ Aw = b, where A = X^T X and b = X^T y
Using the above relation, can solve for w by solving Aw = b. A standard linear system with D
equations and D unknowns; can be solved using efficient linear systems solvers.
The basic (regularized) linear regression can also be easily extended to
Nonlinear Regression: y_n ≈ w^T φ(x_n), obtained by replacing the original feature vector x_n with a nonlinear transformation φ(x_n) (where φ may be pre-defined or itself learned)
Generalized Linear Model: y_n = g(w^T x_n), used when the response y_n is not real-valued but binary/categorical/count, etc., and g is a “link function”
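A small sketch of the nonlinear-regression extension, using a hand-crafted (hypothetical) polynomial feature map φ(x) = [x, x^2, x^3] for scalar inputs, followed by ordinary least squares on the transformed features:

```python
import numpy as np

def phi(x):
    """Hypothetical pre-defined nonlinear feature map for scalar inputs x."""
    return np.stack([x, x**2, x**3], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.5 * x - 0.8 * x**3 + 0.1 * rng.normal(size=200)   # synthetic nonlinear data

Phi = phi(x)                                            # N x 3 transformed features
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)             # linear regression on phi(x)
print(w)                                                # close to [1.5, 0.0, -0.8]
```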
General Supervised Learning as Optimization
We saw that regularized least squares regression required solving
" N #
X λ
ŵ = arg min Lreg (w ) = arg min (yn − w x n ) + w > w > 2
w w
n=1
2

This is essentially the training loss (called “empirical loss”), plus the regularization term
In general, for supervised learning, the goal is to learn a function f such that f(x_n) ≈ y_n, ∀n
Moreover, we also want to have a simple f, i.e., have some regularization
Therefore, learning the best f amounts to solving the following optimization problem
f̂ = arg min_f L_reg(f) = arg min_f Σ_{n=1}^N ℓ(y_n, f(x_n)) + λ R(f)

where ℓ(y_n, f(x_n)) measures the model f's training loss on (x_n, y_n) and R(f) is a regularizer
For least squares regression, f(x_n) = w^T x_n, R(f) = w^T w, and ℓ(y_n, f(x_n)) = (y_n − w^T x_n)^2
As we'll see later, different supervised learning problems differ in the choice of f, R(·), and ℓ
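To make the generic objective concrete, here is a small sketch with pluggable loss and regularizer functions, instantiated (as one hypothetical choice) with the squared loss and the ℓ_2 regularizer for a linear f:

```python
import numpy as np

def regularized_objective(w, X, y, loss, reg, lam):
    """Generic supervised objective: sum_n loss(y_n, f(x_n)) + lam * R(f)."""
    preds = X @ w                              # here f(x) = w^T x (a linear model)
    return np.sum(loss(y, preds)) + lam * reg(w)

# One particular choice: squared loss + L2 regularizer (i.e., ridge regression)
squared_loss = lambda y, f: (y - f) ** 2
l2_reg = lambda w: w @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = np.zeros(3)
print(regularized_objective(w, X, y, squared_loss, l2_reg, lam=0.1))
```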
General Unsupervised Learning as Optimization

Can we formulate unsupervised learning problems as optimization problems? Yes, of course! :-)
Consider an unsupervised learning problem with N inputs X = {x_n}_{n=1}^N

Unsupervised, so no labels. Suppose we are interested in learning a new representation Z = {z_n}_{n=1}^N

Assume a function f that models the relationship between x_n and z_n

x_n ≈ f(z_n), ∀n

In this case, we can define a loss function ℓ(x_n, f(z_n)) that measures how well f can “reconstruct” the original x_n from its new representation z_n
This generic unsupervised learning problem can thus be written as the following optimization problem
f̂ = arg min_{f, Z} Σ_{n=1}^N ℓ(x_n, f(z_n)) + λ R(f, Z)

In this case both f and Z need to be learned. Typically learned via alternating optimization
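As a hedged illustration of alternating optimization for the linear case x_n ≈ W z_n with squared reconstruction loss (a matrix-factorization-style sketch; the dimensions, regularization strength, and iteration count are arbitrary):

```python
import numpy as np

def alternating_linear_reconstruction(X, K, num_iters=50, lam=0.1):
    """Alternately solve for Z given W and for W given Z, minimizing
    sum_n ||x_n - W z_n||^2 + lam * (||W||_F^2 + ||Z||_F^2)."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, K))
    Z = np.zeros((N, K))
    for _ in range(num_iters):
        # Fix W, solve for Z: a ridge-regression problem per data point
        Z = np.linalg.solve(W.T @ W + lam * np.eye(K), W.T @ X.T).T
        # Fix Z, solve for W: a ridge-regression problem per input dimension
        W = np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ X).T
    return W, Z

X = np.random.default_rng(1).normal(size=(100, 5))
W, Z = alternating_linear_reconstruction(X, K=2)
print(np.mean((X - Z @ W.T) ** 2))   # average reconstruction error after alternating updates
```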

