Linear Models and Learning via Optimization
Introduction to Machine Learning (CS771A)
Piyush Rai
August 9, 2018
Recap
Decision Trees: Learning by asking questions. Ask the “important” questions first!
[Figure: a decision tree for a 1-D regression problem: internal nodes ask YES/NO questions about the input x (thresholds such as 5 and 3), and the leaves predict y = 3.5, y = 2.5, or y = 0.5; shown alongside the resulting piecewise-constant fit over x ∈ [0, 8].]
Linear Models
Linear Models
Consider learning to map an input x ∈ R^D to its output y (say, real-valued)
Assume the output to be a linear weighted combination of the D input features:
y = w^⊤ x = Σ_{d=1}^{D} w_d x_d
where x is the input (with D features), w ∈ R^D is the weight vector, and y is the (predicted) output
Linear Models for Binary Classification
Predict using the sign of the score: y = sign(w^⊤ x). If desired, can turn the score w^⊤ x into the probability of the label being +1 (as in logistic regression)
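A minimal sketch of this prediction rule (not from the slides), assuming an already-learned weight vector w: the label is the sign of the score, and a sigmoid turns the score into P(y = +1 | x), logistic-regression style.

```python
import numpy as np

def predict_binary(w, X):
    """Linear binary classification: label = sign of the score w^T x for each row of X."""
    scores = X @ w                                # one score per input
    labels = np.where(scores >= 0, 1, -1)         # hard +1/-1 predictions
    probs = 1.0 / (1.0 + np.exp(-scores))         # sigmoid: P(y = +1 | x)
    return labels, probs

# Toy usage with an illustrative (not learned) weight vector
w = np.array([0.5, -1.2, 0.3])
X = np.array([[1.0, 0.2, 0.5],
              [0.1, 1.5, -0.3]])
labels, probs = predict_binary(w, X)
```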
Linear Models for Multi-class/Multi-label Classification
Recall that, in multi-class/multi-label classification, y = [y_1, y_2, …, y_M] is a vector of length M
Just like multi-output regression, each component y_m of y can be modeled by a weight vector w_m
Need a way to convert y ∈ R^M to a one-hot vector (for multi-class) or a binary vector (for multi-label)
Note: In some cases, the scores need not be converted, e.g.,
Can use the index of the largest entry in y as the predicted class in multi-class classification (e.g., class 2 for scores [0.25, 0.6, 0.1, 0.4, 0.2])
Can use the indices of the top few entries in y as the predicted labels in multi-label classification (e.g., labels {2, 4} for the same scores, taking the top two)
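A minimal sketch of these two decision rules (illustrative, not from the slides), using the example score vector above; class/label indices here start at 1 to match the counting in the example.

```python
import numpy as np

scores = np.array([0.25, 0.6, 0.1, 0.4, 0.2])             # scores y for one input, M = 5

# Multi-class: predict the class with the largest score
predicted_class = np.argmax(scores) + 1                    # -> 2

# Multi-label: predict the indices of the top-k scores (here k = 2)
k = 2
predicted_labels = np.sort(np.argsort(scores)[-k:] + 1)    # -> array([2, 4])
```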
Linear Models for Dimensionality Reduction
Linear models can be used to reduce data-dimensionality (e.g., Principal Component Analysis)
[Figure: a linear mapping from the D input features to K latent features z_1, z_2, …, z_K, where K may be less than D.]
Note that it looks similar to multi-output regression but the output vector z is latent
An example of an unsupervised learning problem
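As an illustrative sketch (one concrete choice, PCA via the SVD, rather than anything specific from the slides): a linear map z = W^⊤ x reducing D features to K latent features.

```python
import numpy as np

def pca_project(X, K):
    """Project N x D data onto its top-K principal directions: a linear map z = W^T x."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:K].T                                     # D x K matrix of principal directions
    Z = Xc @ W                                       # N x K latent representations
    return Z, W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                       # N = 100 inputs, D = 10 features
Z, W = pca_project(X, K=3)                           # K = 3 < D latent features
```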
Linear Models to construct Deep Neural Networks
Linear models are used as basic components of deep neural networks (nonlinear models)
[Figure: the D input features feed into a stack of hidden layers.]
Each hidden layer learns a latent-feature-based representation of the original input x
Note: After each hidden layer, there is also a nonlinearity (not shown); the stack of layers acts as a "deep" feature learner
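A minimal sketch of this construction (sizes and weights are illustrative, not learned): each hidden layer is a linear model followed by a nonlinearity (ReLU here), and the output is again a linear model on the last layer's features.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, W1, W2, w_out):
    """Stack of linear models with nonlinearities in between (a tiny feedforward network)."""
    h1 = relu(W1 @ x)        # hidden layer 1: latent features of the D inputs
    h2 = relu(W2 @ h1)       # hidden layer 2: latent features of latent features
    return w_out @ h2        # output: a linear model on the last hidden layer

rng = np.random.default_rng(0)
D, K1, K2 = 10, 8, 4
x = rng.normal(size=D)
W1, W2 = rng.normal(size=(K1, D)), rng.normal(size=(K2, K1))
w_out = rng.normal(size=K2)
y_hat = forward(x, W1, W2, w_out)
```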
Linear Models with Offset (Bias) Parameter
A linear model with an offset (bias) term is y = w^⊤ x + b. Can append a constant feature "1" to each input and absorb b into w, rewriting it as y = w^⊤ x with x, w ∈ R^{D+1}
We will assume the same and omit the explicit bias for simplicity of notation
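A minimal sketch of this bias-absorbing trick (data and weights are illustrative): append a constant feature 1 to each input and fold b into the weight vector.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])                           # N x D inputs (D = 2)
w = np.array([0.5, -1.0])                            # weights
b = 0.25                                             # bias/offset

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])     # append the constant feature "1"
w_aug = np.append(w, b)                              # absorb b into the weights

assert np.allclose(X @ w + b, X_aug @ w_aug)         # y = w^T x + b  ==  w_aug^T x_aug
```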
Learning Linear Models
[Figure: recap of the linear models seen so far, each mapping the input (D features) to outputs or to latent features z_1, z_2, …, z_K via weights W that are to be learned.]
Linear Regression: Pictorially
[Figure: a line fit to training data with scalar inputs x_n and outputs y_n.]
Error of the model for an example = y_n − w^⊤ x_n (= y_n − w x_n for the scalar-input case)
Linear Regression
Define the total error or “loss” on the training data, when using w as our model, as
L(w) = Σ_{n=1}^{N} (y_n − w^⊤ x_n)²
Note: Squared loss chosen for simplicity. Can define other types of losses too (more on this later)
The best w will be the one that minimizes the above error (requires optimization w.r.t. w )
ŵ = arg min_w L(w) = arg min_w Σ_{n=1}^{N} (y_n − w^⊤ x_n)²
This is known as "least squares" linear regression (Gauss/Legendre, early 19th century)
Taking the derivative (gradient) of L(w) w.r.t. w and setting it to zero:
Σ_{n=1}^{N} 2 (y_n − w^⊤ x_n) ∂/∂w (y_n − x_n^⊤ w) = 0   ⇒   Σ_{n=1}^{N} x_n (y_n − x_n^⊤ w) = 0
Linear Regression
[Figure: the matrix form of the problem, y ≈ Xw, with X the N × D input matrix, y the N × 1 output vector, and w the D × 1 weight vector.]
Consider the closed form solution we obtained for linear regression based on least squares: ŵ = (X^⊤ X)^{−1} X^⊤ y
The above closed form solution is nice but has some issues
The D × D matrix X^⊤ X may not be invertible
Based solely on minimizing the training error Σ_{n=1}^{N} (y_n − w^⊤ x_n)² ⇒ can overfit the training data
Expensive inversion for large D: can use iterative optimization techniques (will come to this later)
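A minimal sketch of the closed-form least squares solution on synthetic data (not from the slides); solving the D × D normal equations with np.linalg.solve avoids forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))                          # N x D input matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)            # noisy linear outputs

# w = (X^T X)^{-1} X^T y, computed by solving the normal equations
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```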
Regularized Linear Regression (a.k.a. Ridge Regression)
Consider the regularized loss: training error + squared ℓ₂ norm of w, i.e., ‖w‖₂² = w^⊤ w = Σ_{d=1}^{D} w_d²
L_reg(w) = [ Σ_{n=1}^{N} (y_n − w^⊤ x_n)² ] + λ w^⊤ w
There is a trade-off between the two terms: The regularization hyperparam λ > 0 controls it
Very small λ means almost no regularization (can overfit)
Very large λ means very high regularization (can underfit - high training error)
Can use cross-validation to choose the “right” λ
The solution to the above optimization problem is: w = (X^⊤ X + λI_D)^{−1} X^⊤ y
Note that, in this case, regularization also makes the inversion possible: the λI_D term ensures X^⊤ X + λI_D is invertible for any λ > 0
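A minimal sketch of the ridge solution on synthetic data (lam plays the role of λ; data and values are illustrative).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I_D)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w_ridge = ridge_fit(X, y, lam=0.1)      # larger lam -> smaller weights (more regularization)
```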
How Does ℓ₂ Regularization Help Here?
Suppose two inputs x_n and x_m are nearly identical, differing only (slightly) in their d-th feature. As per the model y = f(x) = w^⊤ x, the predictions y_n and y_m will differ by w_d times that small difference
Unless we constrain w_d to have a small value, the difference in predictions could also be very large (which isn't what we want)
That's why regularizing (via ℓ₂ regularization) and making the individual components of the weight vector small helps
Regularization: Some Comments
Note: Since it learns a sparse w, ℓ₀ or ℓ₁ regularization is also useful for doing feature selection (w_d = 0 means feature d is irrelevant). We will revisit ℓ₁ later to formally see why it gives sparsity
Other techniques for regularization: early stopping (of training), "dropout", etc. (popular in deep neural networks; we will revisit these later when discussing deep learning)
Linear/Ridge Regression via Gradient Descent
Both least squares regression and ridge regression require matrix inversion
Least Squares: w = (X^⊤ X)^{−1} X^⊤ y,   Ridge: w = (X^⊤ X + λI_D)^{−1} X^⊤ y
Alternative (avoids matrix inversion): gradient descent. Starting from some initial w^(0), repeat until convergence:
w^(t) = w^(t−1) − η (∂L/∂w)|_{w = w^(t−1)},   where η is the learning rate
For least squares, the gradient is ∂L/∂w = −2 Σ_{n=1}^{N} x_n (y_n − x_n^⊤ w) (no matrix inversion involved)
Such iterative methods for optimizing loss functions are widely used in ML. Will revisit these later
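A minimal sketch of this iterative scheme for least squares on synthetic data (the learning rate eta and iteration count are illustrative and would normally be tuned).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

w = np.zeros(D)                                      # initialization w^(0)
eta = 0.001                                          # learning rate
for t in range(2000):
    grad = -2 * X.T @ (y - X @ w)                    # gradient of the squared loss (no inversion)
    w = w - eta * grad                               # gradient descent update
```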
Linear Regression via Gradient-based Methods: Some Notes
We will revisit gradient-based methods later, but a few things to keep in mind:
Gradient Descent is guaranteed to converge to a local minimum
Gradient Descent converges to the global minimum if the function is convex
For Gradient Descent, the learning rate is important (should not be too large or too small)
Linear Regression as Solving System of Linear Equations
Each training example (x_n, y_n) gives one (approximate) linear equation in the unknowns w: y_n ≈ w^⊤ x_n. Can therefore view the linear regression problem as solving a system of N linear equations in D unknowns
However, in linear regression we would rarely have N = D; typically N > D (overdetermined system) or D > N (underdetermined system)
Linear Regression: Some Other Comments
A simple and interpretable method. Very widely used.
Least squares and ridge regression are among the very few ML problems with closed-form solutions
Least Squares: w = (X^⊤ X)^{−1} X^⊤ y,   Ridge: w = (X^⊤ X + λI_D)^{−1} X^⊤ y
Many ML problems can be easily reduced to the form y = Xw or Y = XW
Equivalence to over/underdetermined systems of linear equations enables us to use efficient solvers (a lot of work in the numerical linear algebra community has gone into scaling up linear-system solvers)
An interesting bit: note that w = (X^⊤ X)^{−1} X^⊤ y ⇒ Aw = b, where A = X^⊤ X and b = X^⊤ y
Using this relation, we can solve for w by solving Aw = b, a standard linear system with D equations and D unknowns, using efficient linear-system solvers
The basic (regularized) linear regression can also be easily extended to
Nonlinear Regression y_n ≈ w^⊤ φ(x_n), by replacing the original feature vector x_n with a nonlinear transformation φ(x_n), where φ may be pre-defined or itself learned (a code sketch follows this list)
Generalized Linear Model y_n = g(w^⊤ x_n), when the response y_n is not real-valued but binary/categorical/count, etc., and g is a "link function"
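A minimal sketch of the nonlinear-regression extension (a hand-picked polynomial φ and illustrative data): the model stays linear in w, only the features change.

```python
import numpy as np

def poly_features(x, degree):
    """phi(x) = [1, x, x^2, ..., x^degree] for a vector of scalar inputs."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)       # a nonlinear target

Phi = poly_features(x, degree=3)                     # N x (degree + 1) transformed inputs
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)          # ordinary least squares on phi(x)
y_pred = Phi @ w                                     # nonlinear fit in x, linear in w
```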
General Supervised Learning as Optimization
We saw that regularized least squares regression required solving
" N #
X λ
ŵ = arg min Lreg (w ) = arg min (yn − w x n ) + w > w > 2
w w
n=1
2
This is essentially the training loss (called “empirical loss”), plus the regularization term
In general, for supervised learning, the goal is to learn a function f , s.t. f (x n ) ≈ yn , ∀n
Moreover, we also want to have a simple f , i.e., have some regularization
Therefore, learning the best f amounts to solving the following optimization problem
fˆ = arg min_f L_reg(f) = arg min_f Σ_{n=1}^{N} ℓ(y_n, f(x_n)) + λ R(f)
where ℓ(y_n, f(x_n)) measures the model f's training loss on (x_n, y_n) and R(f) is a regularizer
For least squares regression, f(x_n) = w^⊤ x_n, R(f) = w^⊤ w, and ℓ(y_n, f(x_n)) = (y_n − w^⊤ x_n)²
As we’ll see later, different supervised learning problems differ in the choice of f , R(.), and `
General Unsupervised Learning as Optimization
Can we formulate unsupervised learning problems as optimization problems? Yes, of course! :-)
Consider an unsupervised learning problem with N inputs X = {x_n}_{n=1}^{N}
Suppose we want to learn, for each input x_n, a new representation z_n, along with a function f, such that
x_n ≈ f(z_n)   ∀n
In this case, we can define a loss function ℓ(x_n, f(z_n)) that measures how well f can "reconstruct" the original x_n from its new representation z_n
This generic unsup. learning problem can thus be written as the following optimization problem
(fˆ, Ẑ) = arg min_{f, Z} Σ_{n=1}^{N} ℓ(x_n, f(z_n)) + λ R(f, Z)
In this case both f and Z need to be learned. Typically learned via alternating optimization
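A minimal sketch of alternating optimization for the simplest linear case f(z_n) = W z_n with squared reconstruction loss and no regularizer (dimensions and iteration count are illustrative): fix W and solve for Z, then fix Z and solve for W, and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 10, 3
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D))   # data with low-dimensional structure

W = rng.normal(size=(D, K))      # parameters of f(z) = W z
Z = rng.normal(size=(N, K))      # one latent representation z_n per input

for it in range(50):
    # Fix W, update Z: each x_n ~ W z_n is a least squares problem in z_n
    Z = np.linalg.solve(W.T @ W, W.T @ X.T).T
    # Fix Z, update W: least squares in the other direction
    W = np.linalg.solve(Z.T @ Z, Z.T @ X).T

reconstruction_error = np.linalg.norm(X - Z @ W.T)       # small once the alternation has converged
```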