
MACHINE LEARNING

WHY MACHINE LEARNING WAS INTRODUCED

 Statistics: How do we efficiently train large, complex models?

 Computer Science & Artificial Intelligence: How do we train more robust versions of AI systems?

 Neuroscience: How do we design operational models of the brain?


CAN YOU RECOGNIZE THESE PICTURES?

If yes, how do you recognize them?


ORIGIN OF MACHINE LEARNING

……… lies in the very effort of understanding intelligence.

What is intelligence?
It can be defined as the ability to comprehend; to understand and profit
from experience.

The capability to acquire and apply knowledge.


LEARNING? 2300 YEARS AGO……

 Plato (427 – 347 BC)

 Abstract concepts are known
to us a priori, through a mystic
connection with the world.
 He concluded that the ability to think is founded
on a priori knowledge of concepts.
LEARNING?
 Plato’s pupil
 Aristotle (384 – 322 BC)
 Criticized his teacher’s theory
for failing to take into account an
important aspect:
the ability to learn and adapt to a
changing world.
MACHINE LEARNING
 Machine learning is a subset of AI that uses statistical
methods to enable machines to improve with experience.

• Learning –
– “A computer program is said to learn from
• experience E
• with respect to some class of tasks T
• and performance measure P
– if its performance at tasks in T, as measured by P, improves with experience
E.” (Mitchell, 1997)
LEARNING ALGORITHMS…
• General tasks
– Classification, regression, transcription, machine translation, etc.

• Performance measures
– Depend on the type of problem; examples include
• accuracy, error rate, etc.
– Performance is measured on a dataset called the test dataset, which is different
from the dataset used to train the algorithm.
– It is often difficult to choose a performance measure that corresponds well to the
desired behavior of the system.

• Experience
– Algorithms are termed supervised or unsupervised learning
algorithms based on the experience they are allowed to have with datasets.
EXAMPLE (HANDWRITING RECOGNITION LEARNING PROBLEM)

 Task T: Recognizing and classifying handwritten words within images

 Performance measure P: Percentage of words correctly classified

 Training experience E: A database of handwritten words with given
classifications
MACHINE LEARNING

• Learning from experience on data to make predictions.

[Diagram: training phase: Data → Machine Learning algorithm → Trained model;
prediction phase: Unseen data → Trained model → Prediction]
BRANCHES OF MACHINE LEARNING

Source: https://towardsdatascience.com/coding-deep-learning-for-beginners-types-of-machine-learning-b9e651e1ed9d
SUPERVISED MACHINE LEARNING APPROACH

 For each specific task:

 We collect lots of examples with their known outcomes
 Learn a function that maps inputs to outputs
 These programs tend to be data centric, i.e. driven by the learning
examples, and try to learn a hypothesis function that
describes the mapping as closely as possible.
SUPERVISED MACHINE LEARNING APPROACH

 We collect lots of examples with their
known outcomes.
 Learn a function that maps inputs to
outputs.

 Supervised learning models try to find
parameter values that allow them to
perform well on historical data. They
are then used for making predictions on unseen
data that was not part of the training dataset.
There are two main problems that can be solved with supervised
learning: regression and classification.

Regression                      Classification
Linear Regression               Logistic Regression
Multiple Linear Regression      K-Nearest Neighbors
Polynomial Linear Regression    Support Vector Machine
Support Vector Regression       Naïve Bayes
Decision Tree Regression        Decision Tree Classification
Random Forest Regression        Random Forest Classification
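To make the split concrete, here is a minimal scikit-learn sketch (not part of the original slides; it assumes scikit-learn is installed, and the toy data and values are invented for illustration) showing that the two problem types differ mainly in the target being predicted:

```python
# Hedged illustration: regression predicts a continuous target,
# classification a categorical one. Toy data invented for demonstration.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]                                 # one input feature per example

reg = LinearRegression().fit(X, [1.1, 1.9, 3.2, 3.9])   # continuous outcomes
print(reg.predict([[5]]))                                # roughly 5.0

clf = LogisticRegression().fit(X, [0, 0, 1, 1])          # class labels
print(clf.predict([[5]]))                                # class 1
```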


SUPERVISED EXAMPLE & USE CASES

UNSUPERVISED EXAMPLES & USE CASES
UNSUPERVISED MACHINE LEARNING APPROACH

 Finding patterns in data

 Draw inferences from unlabeled data (without reference to
known or labeled outcomes).
 Models based on this type of algorithm can be used for
discovering unknown data patterns and the structure of the data itself.
CLUSTERING
ASSOCIATION RULE MINING

Source: https://www.quora.com/How-is-association-rule-compared-with-collaborative-filtering-in-recommender-systems
DIMENSION REDUCTION METHOD

Clustering: K-Means, Hierarchical, DBSCAN
Association Rule Mining: Apriori, FP-Growth, Eclat
Dimension Reduction: PCA, LDA
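As a hedged sketch (not from the slides; it assumes scikit-learn and NumPy are installed, and the toy data is invented), two methods from the lists above can be exercised like this:

```python
# Hedged illustration of unsupervised methods named above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two invented blobs of 3-D points; no labels are given to the algorithms.
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # clustering
X_2d = PCA(n_components=2).fit_transform(X)              # dimension reduction
print(labels, X_2d.shape)                                # cluster ids, (40, 2)
```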
UNSUPERVISED EXAMPLE & USE CASES
REINFORCEMENT LEARNING

 Reinforcement learning is a type of machine learning where an agent learns to behave
in an environment by performing actions and observing the results.
 Exploration (trial and error)
 Exploitation (knowledge gained from the environment)
DEEP LEARNING

• The difference in artificial intelligence approaches over
two decades (1997-2017):
– 1997: The IBM chess computer Deep Blue was explicitly
programmed to win against the grandmaster Garry Kasparov.
– 2017: AlphaGo was not preprogrammed to play Go.
– It learned using a general-purpose algorithm that allowed it to
interpret the game’s patterns.
• The AlphaGo program applied deep learning.
DEEP LEARNING

 Deep learning is a newer area of machine learning research,
introduced with the objective of moving
machine learning closer to its original goal:
artificial intelligence.

 It is inspired by the functionality of our brain cells, called
neurons, which led to the concept of artificial neural networks.
DEEP LEARNING

Source: https://citrusbits.com/killer-deep-learning-softwares/
MACHINE LEARNING VS DEEP LEARNING

Deep learning IS machine learning. The two are commonly compared along:

• Data dependency
• Hardware requirements
• Execution time
• Feature engineering
• Interpretability
• Problem-solving approach
REGRESSION
SUPERVISED LEARNING

 Learning a discrete function (classification): the
algorithm attempts to estimate the mapping
function from the input variables to
discrete or categorical output variables.

 Learning a continuous function (regression): the
algorithm attempts to estimate the mapping
function from the input variables to
numeric or continuous output variables.
CLASSIFICATION VS REGRESSION

[Figure: examples of classification (discrete classes) and regression (continuous fit)]
Source: https://in.springboard.com/blog/regression-vs-classification-in-machine-learning/
SUPERVISED LEARNING

Image Source: https://www.javatpoint.com/supervised-machine-learning


WHAT IS REGRESSION

 It is used to predict target variables on a continuous scale.

[Diagram: Dataset → Regression → identify the relationship, mapping x → y]
SALARY AFTER COMPLETING THE COURSE

How much will your salary be? (y)

Depends on x = performance in the course, quality of projects, etc.
TWEET POPULARITY

 How many people will retweet your tweet? (y)

 Depends on x = # of followers, # of followers of followers, features of the text tweeted,
popularity of hashtags, # of past retweets, …
REGRESSION ANALYSIS

 Regression analysis is a statistical tool for investigating the
relationship between a dependent variable and one or more
independent (explanatory) variables.

 Regression analysis is widely used for prediction and
forecasting.
INDEPENDENT AND DEPENDENT VARIABLE

 Independent variable (explanatory variable):
A variable whose value does not change under the effect of other variables and
is used to manipulate the dependent (target) variable. It is often denoted
by X.

 Dependent variable:
A variable whose value changes when there is any manipulation of the
values of the independent variable. It is often denoted by Y.
CASE STUDY: PREDICTING HOUSE PRICE

 Size of house (sq ft) is the independent variable, also
known as the control variable.

 Price of house is the dependent variable, also known as the
response variable.
CASE STUDY: PREDICTING HOUSE PRICE

[Diagram: the house-price dataset is fed to regression.]
BIVARIATE AND MULTIVARIATE MODEL

 Bivariate or simple regression model:

Size of house (X) → Price (Y)

 Multivariate or multiple regression model:

Size of house (X1), # of bedrooms (X2), age of house (X3) → Price (Y)
SIMPLE/BIVARIATE LINEAR REGRESSION

 Simple linear regression is a linear regression model with a single explanatory
variable.

 It concerns two-dimensional sample points with one independent variable and one
dependent variable, and finds a linear function (a non-vertical straight line) that, as
accurately as possible, predicts the dependent variable values as a function of the
independent variable.

 The adjective simple refers to the fact that the outcome variable is related to a
single predictor.
HOW MUCH IS MY HOUSE WORTH?
LOOK AT RECENT SALES IN MY NEIGHBORHOOD

 How much did they sell for?

Each recent sale gives a data point $(x^{(i)}, y^{(i)})$: the size of house i and the
price it sold for.
REGRESSION (HOUSE PRICE PREDICTION)

A scatter plot is a mathematical diagram that displays values of two variables for a
set of data; each point is a pair $(x^{(i)}, y^{(i)})$.

 Size of house (sq ft) is the independent variable (control variable);
price of house is the dependent variable (response variable).

 Scatter plots are used to investigate the
relationship between the variables.
SIMPLE LINEAR REGRESSION
House price prediction:
we want to fit the best line (a linear function
y = f(x)) to explain the data.
SIMPLE LINEAR REGRESSION

 The equation that describes how the dependent variable (y) is related to the
independent variable (x) is referred to as the regression equation:
$y = mx + c$

 The simple linear regression model is:

$h_\theta(x) = \theta_0 + \theta_1 x$
• x is the independent variable
• The parameters/regression coefficients are $\theta_0$ (intercept) and $\theta_1$ (slope)
REGRESSION

The simple linear regression equation is

$h_\theta(x) = \theta_0 + \theta_1 x$

It represents the relationship between input (x) and output (y).
[Plot: house price (y) against size of house (x), with the fitted line $h_\theta(x)$.]

1. The regression equation is a straight line
2. $\theta_0$ is the intercept of the regression line
3. $\theta_1$ is the slope of the regression line
4. $h_\theta(x)$ is the hypothesis of the model
ESTIMATION PROCESS

Regression equation: $h_\theta(x) = \theta_0 + \theta_1 x$, with $\theta_0, \theta_1$ unknown.

Sample data $(x, y)$ is used to estimate the parameters,
yielding the estimated regression equation
$h_\theta(x) = \theta_0 + \theta_1 x$ with $\theta_0, \theta_1$ now known.
GOAL OF REGRESSION MODEL

 Our goal is to learn the model parameters that minimize the error in the
model’s predictions.

[Plot: house price (y) against size of house (x), with the fitted line
$h_\theta(x) = \theta_0 + \theta_1 x$; for each training point, the vertical gap between
the observed value $y^{(i)}$ and the fitted value $h_\theta(x^{(i)})$ is the prediction error.]

 To find the best parameters:
 Define the cost function, or loss function, that measures how inaccurate our
model’s predictions are.
 The error on the ith example is $y^{(i)} - h_\theta(x^{(i)})$
(equivalently $h_\theta(x^{(i)}) - y^{(i)}$, with opposite sign).


SIMPLE LINEAR REGRESSION

Parameters (regression coefficients): $\theta_0, \theta_1$

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
EFFECTS OF PARAMETERS ON LINE PLACEMENT

Candidate lines for the data below:
$h_\theta(x) = 1.5 + 0 \cdot x$
$h_\theta(x) = 0 + 0.5 \cdot x$
$h_\theta(x) = 1 + 0.5 \cdot x$

x  y
1  1
2  2
3  3

Example
Suppose x = 2.5 and $h_\theta(x) = 1 + 0.5 \cdot x$.

Predict the outcome:
$h_\theta(x) = 1 + 0.5 \cdot 2.5 = 2.25$
ESTIMATION PROCESS

[Scatter plot: house price (y) against size of house (x).]
LEAST SQUARE METHOD

 One of the most common estimation
techniques for linear regression is least
squares estimation.

 The least squares method is a statistical
procedure to find the best fit for a set
of data points by minimizing the sum
of the squared offsets (residuals) of the points
from the fitted curve.
LEAST SQUARE METHOD

$y^{(i)} = \theta_0 + \theta_1 x^{(i)} + \varepsilon^{(i)}$

$\varepsilon^{(i)} = y^{(i)} - h_\theta(x^{(i)})$ is the residual error in the ith observation.

$J(\theta_0, \theta_1) = (y^{(1)} - h_\theta(x^{(1)}))^2 + (y^{(2)} - h_\theta(x^{(2)}))^2 + (y^{(3)} - h_\theta(x^{(3)}))^2 + \cdots$,
including all training houses.

So, our aim is to minimize the total error:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))^2$   (cost function)

$\underset{\theta_0, \theta_1}{\text{minimize}}\ J(\theta_0, \theta_1)$
EXAMPLE

 Let’s keep only one parameter, $\theta_1$ (fix $\theta_0 = 0$, so $h_\theta(x) = \theta_1 x$):

$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - \theta_1 x^{(i)})^2$

 Goal: $\underset{\theta_1}{\text{minimize}}\ J(\theta_1)$
EXAMPLE

Training data: (x, y) = (1, 1) and (2, 2). Model: $h_\theta(x) = \theta_1 \cdot x$.

 $h_\theta(x)$: for fixed $\theta_1$, this is a function of x.
 $J(\theta_1)$ is a function of $\theta_1$.

With $\theta_1 = 1$, the line passes exactly through both points:

$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - \theta_1 x^{(i)})^2 = \frac{1}{2 \cdot 2}(0^2 + 0^2) = 0$
EXAMPLE

With $\theta_1 = 1.5$, the predictions at x = 1 and x = 2 are 1.5 and 3:

$J(\theta_1) = \frac{1}{2 \cdot 2}\left((1 - 1.5)^2 + (2 - 3)^2\right) = \frac{1.25}{4} = 0.3125$
EXAMPLE

With $\theta_1 = 0.75$, the predictions at x = 1 and x = 2 are 0.75 and 1.5:

$J(\theta_1) = \frac{1}{2 \cdot 2}\left((1 - 0.75)^2 + (2 - 1.5)^2\right) = \frac{0.3125}{4} \approx 0.078$
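A minimal sketch (not from the slides) of the cost function on the two training points (1, 1) and (2, 2); it reproduces the three values of $J(\theta_1)$ worked out above:

```python
# Cost J(theta1) = (1 / 2m) * sum over i of (y_i - theta1 * x_i)^2,
# evaluated on the two training points used in the slides.
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0])

def cost(theta1):
    m = len(x)
    return np.sum((y - theta1 * x) ** 2) / (2 * m)

for t1 in (1.0, 1.5, 0.75):
    print(t1, cost(t1))   # 1.0 -> 0.0, 1.5 -> 0.3125, 0.75 -> 0.078125
```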
COST FUNCTION SURFACE PLOT
CONTOUR PLOT

 Contour plots are also known as level plots.

 They are used to visualize the change in $J(\theta_0, \theta_1)$ as a function of the
two inputs $\theta_0$ and $\theta_1$:
$J(\theta_0, \theta_1) = f(\theta_0, \theta_1)$

 For a function $f(\theta_0, \theta_1)$ of two variables,
assign different colors to different
values of f.

 Pick some values to plot. The result will
be contours: curves in the graph along
which the values of $f(\theta_0, \theta_1)$ are constant.
EXAMPLE

[Three paired plots: on the left, $h_\theta(x)$ for fixed $\theta_0, \theta_1$ as a
function of x; on the right, $J(\theta_0, \theta_1)$ as a function of the
parameters $\theta_0, \theta_1$, shown as a contour plot.]
SUMMARY

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

Parameters: $\theta_0, \theta_1$

Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))^2$

Goal: $\underset{\theta_0, \theta_1}{\text{minimize}}\ J(\theta_0, \theta_1)$
CONVEX AND CONCAVE FUNCTION

Convex function: $g''(z) \ge 0$; on an interval (a, b) the minimum occurs where the slope is 0.
Concave function: $g''(z) < 0$; on an interval (a, b) the maximum occurs where the slope is 0.

Example
$g(z) = 5 - (z - 10)^2$

$\frac{dg(z)}{dz} = 0 - 2(z - 10) = -2z + 20$

Set $\frac{dg(z)}{dz} = 0$:
$z = 10$
FINDING MAXIMUM VIA HILL CLIMBING

At the maximum, the derivative is 0.

How do we know whether to move θ to the right
or to the left (increase or decrease θ)?
If $\frac{dg(\theta)}{d\theta} > 0$ (positive slope), increase θ;
if $\frac{dg(\theta)}{d\theta} < 0$ (negative slope), decrease θ.

While not converged:
$\theta^{(t+1)} \leftarrow \theta^{(t)} + \alpha \frac{dg(\theta^{(t)})}{d\theta}$
(t indexes the iteration; α is the step size; the update moves toward max g(θ).)
FINDING MINIMUM VIA HILL DESCENT

At the minimum, the derivative is 0.

When the derivative is positive, we want to decrease
θ, and when the derivative is negative, we want to
increase θ.

While not converged:
$\theta^{(t+1)} \leftarrow \theta^{(t)} - \alpha \frac{dg(\theta^{(t)})}{d\theta}$
(t indexes the iteration; α is the step size; the update moves toward min g(θ).)
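A minimal sketch (not from the slides) of hill descent on the convex function $f(\theta) = (\theta - 10)^2$, the mirror image of the earlier concave example $g(z) = 5 - (z - 10)^2$; the starting point, step size, and tolerance are arbitrary illustrative choices:

```python
# Hill descent: theta^(t+1) <- theta^(t) - alpha * df/dtheta,
# on f(theta) = (theta - 10)^2, whose minimum is at theta = 10.
def df(theta):
    return 2.0 * (theta - 10.0)        # derivative of f

theta, alpha = 0.0, 0.1                # arbitrary start and step size
while abs(df(theta)) >= 1e-6:          # stop when the derivative is near zero
    theta -= alpha * df(theta)         # move against the slope
print(theta)                           # approximately 10.0
```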
STEP SIZE/LEARNING RATE (α)

 With a fixed learning rate we slowly reach the optimum position.

Small step size
Advantage: will converge to the global optimum (for a convex objective)
Disadvantage: slow convergence

Large step size
Advantage: moves fast toward the optimum
Disadvantage: may overshoot the optimum point
STEP SIZE/LEARNING RATE (α)

 Decreasing step size: the step size is scheduled over iterations t.

Common choice:
$\alpha_t = \frac{\beta}{t}$
CONVERGENCE CRITERIA

 For a convex function, the optimum occurs when

$\frac{dg(\theta)}{d\theta} = 0$

In practice, stop when

$\left|\frac{dg(\theta)}{d\theta}\right| < \epsilon$

While not converged:
$\theta^{(t+1)} \leftarrow \theta^{(t)} - \alpha \frac{dg(\theta^{(t)})}{d\theta}$
FINDING THE LEAST SQUARES LINE

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))^2$   (cost function)

$\underset{\theta_0, \theta_1}{\text{minimize}}\ J(\theta_0, \theta_1)$

This cost function is convex, so the solution for $\theta_0, \theta_1$ is unique and
gradient descent will converge to the minimum.
COMPUTE THE GRADIENT

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))^2$, with $h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - (\theta_0 + \theta_1 x^{(i)})\right)^2$

Differentiating with respect to each parameter (chain rule):

$\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} = \frac{1}{2m} \sum_{i=1}^{m} 2\left(y^{(i)} - (\theta_0 + \theta_1 x^{(i)})\right)(-1) = -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - (\theta_0 + \theta_1 x^{(i)})\right)$

$\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} = \frac{1}{2m} \sum_{i=1}^{m} 2\left(y^{(i)} - (\theta_0 + \theta_1 x^{(i)})\right)(-x^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - (\theta_0 + \theta_1 x^{(i)})\right) x^{(i)}$
COMPUTE THE GRADIENT

Putting it together:

$\nabla J(\theta_0, \theta_1) = \begin{bmatrix} -\frac{1}{m} \sum_{i=1}^{m} \left[y^{(i)} - (\theta_0 + \theta_1 x^{(i)})\right] \\ -\frac{1}{m} \sum_{i=1}^{m} \left[y^{(i)} - (\theta_0 + \theta_1 x^{(i)})\right] x^{(i)} \end{bmatrix}$
APPROACH 1: SET GRADIENT = 0

Setting $\nabla J(\theta_0, \theta_1) = 0$ and solving:

Top term:
$\theta_0 = \frac{\sum_{i=1}^{m} y^{(i)}}{m} - \theta_1 \frac{\sum_{i=1}^{m} x^{(i)}}{m}$

Bottom term (substituting $\theta_0$):
$\theta_1 = \dfrac{\sum y^{(i)} x^{(i)} - \frac{\sum y^{(i)} \sum x^{(i)}}{m}}{\sum (x^{(i)})^2 - \frac{\left(\sum x^{(i)}\right)^2}{m}}$

Note: the closed form only requires the sums $\sum y^{(i)} x^{(i)}$, $\sum x^{(i)}$, $\sum (x^{(i)})^2$, and $\sum y^{(i)}$.
QUESTION 1

Find the least squares regression line for the following data.
Also estimate the value of y when x = 10.

X Y
0 2
1 3
2 5
3 4
4 6
SOLUTION

$h_\theta(x) = 2.2 + 0.9x$

At x = 10:

$h_\theta(10) = 2.2 + 0.9 \cdot 10 = 11.2$
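As a check (not part of the slides), a short NumPy sketch of Approach 1 on the Question 1 data reproduces this answer:

```python
# Closed-form least squares (gradient set to zero) on the Question 1 data.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
m = len(x)

theta1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / m) / (
    np.sum(x ** 2) - np.sum(x) ** 2 / m)
theta0 = np.mean(y) - theta1 * np.mean(x)

print(theta0, theta1)            # 2.2 0.9
print(theta0 + theta1 * 10)      # 11.2
```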
APPROACH 2: GRADIENT DESCENT

Gradient descent is an optimization algorithm used to find the values of the parameters
(coefficients) of a function f that minimize a cost function.
GRADIENT DESCENT

 Gradient descent algorithm
 Get estimated parameters
 Intercept
 Slope
 Used to form predictions

Have some function
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))^2$
with $h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$

$\underset{\theta_0, \theta_1}{\text{minimize}}\ J(\theta_0, \theta_1)$

Outline:
Start with some $\theta_0, \theta_1$.
Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully
end up at a minimum.

While not converged:
    for j = 0 to 1:
        $\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_j}$
GRADIENT DESCENT ALGORITHM

If the slope of the line is negative, $\frac{\partial J(\theta_1)}{\partial \theta_1} < 0$:

$\theta_1 := \theta_1 - \alpha \cdot (\text{negative value})$

increases the value of $\theta_1$ by some quantity.
GRADIENT DESCENT ALGORITHM

If the slope of the line is positive, $\frac{\partial J(\theta_1)}{\partial \theta_1} > 0$:

$\theta_1 := \theta_1 - \alpha \cdot (\text{positive value})$

decreases the value of $\theta_1$ by some quantity.
GRADIENT DESCENT ALGORITHM

If the slope of the line is 0, $\frac{\partial J(\theta_1)}{\partial \theta_1} = 0$:

$\theta_1 := \theta_1 - \alpha \cdot 0$

and $\theta_1$ does not change.
GRADIENT DESCENT ALGORITHM

While not converged:

$\theta_0 := \theta_0 + \alpha \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right)$

$\theta_1 := \theta_1 + \alpha \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x^{(i)}$
LINEAR REGRESSION WITH GRADIENT DESCENT

 Linear regression model:

$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))^2$

 Gradient descent algorithm:

While not converged:
    for j = 0 to 1:
        $\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_j}$
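A minimal sketch (not from the slides) of batch gradient descent for this model, run on the Question 1 data; the learning rate and iteration count are arbitrary illustrative choices:

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x.
# Each iteration uses all training examples; should approach 2.2 and 0.9.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

theta0, theta1, alpha = 0.0, 0.0, 0.05
for _ in range(20000):
    error = y - (theta0 + theta1 * x)       # y(i) - h(x(i)) for every example
    theta0 += alpha * np.mean(error)        # from dJ/dtheta0
    theta1 += alpha * np.mean(error * x)    # from dJ/dtheta1

print(theta0, theta1)                       # close to 2.2 and 0.9
```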
GRADIENT DESCENT ALGORITHM
 Types of gradient descent algorithms

 Stochastic gradient descent (SGD)
 SGD randomly picks one data point from the whole dataset at each iteration.

 Batch gradient descent
 Every step of gradient descent uses all the training examples.

 Mini-batch gradient descent
 A balance between the stability of batch gradient descent and the speed of SGD.
 Samples a small number of data points, instead of just one, at each step
(see the sketch below).
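A hedged mini-batch sketch (not from the slides; the batch size, learning rate, and iteration count are arbitrary choices), reusing the Question 1 data:

```python
# Mini-batch gradient descent: each step uses a small random sample.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

theta0, theta1, alpha, batch = 0.0, 0.0, 0.01, 2
for _ in range(50000):
    idx = rng.choice(len(x), size=batch, replace=False)  # sample a mini-batch
    error = y[idx] - (theta0 + theta1 * x[idx])
    theta0 += alpha * np.mean(error)
    theta1 += alpha * np.mean(error * x[idx])

print(theta0, theta1)   # noisy estimates near 2.2 and 0.9
```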
COEFFICIENT OF DETERMINATION ($r^2$)

 Quantifies the goodness of a fit.

 $r^2$
 is a measure of how closely the data
points fit the regression line.

 In other words, it represents the
fraction of variance in the dependent
variable (response) that has been
explained by the regression model.

 R-squared is a way of measuring how much better than the mean line
you have done, based on the summed squared error.
Our objective is to do better than the mean. For instance, this regression line will give a
lower summed squared error than using the horizontal (mean) line.
Ideally, you would have zero regression error, i.e. your regression line would perfectly
match the data. In that case you would get an r-squared value of 1.
In the plot, each observed ("actual") point $y_i$ is compared with the fitted value
$y_{Regression,i}$ on the line and with the mean line $\bar{y}$:

$SS_{Regression} = \sum_i (y_i - y_{Regression,i})^2$   (residual error around the fitted line)

$SS_{Total} = \sum_i (y_i - \bar{y})^2$   (total error around the mean line)

$SS_{Explained} = \sum_i (y_{Regression,i} - \bar{y})^2$

so that $r^2 = 1 - \frac{SS_{Regression}}{SS_{Total}}$.
EXAMPLE

Regression line: $y_{Regression} = 6x - 5$

X   Y    $(Y - \bar{y})^2$   $y_{Regression}$   $Y - y_{Regression}$   $(Y - y_{Regression})^2$
0   0    169                 -5                 5                      25
1   1    144                 1                  0                      0
2   4    81                  7                  -3                     9
3   9    16                  13                 -4                     16
4   16   9                   19                 -3                     9
5   25   144                 25                 0                      0
6   36   529                 31                 5                      25

Average $\bar{y}$ = 13; $SS_{Total}$ = 1092; $SS_{Regression}$ = 84

$r^2 = 1 - \frac{84}{1092} = 0.923$

Source: http://www.fairlynerdy.com/what-is-r-squared/
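A short sketch (not from the slides) that reproduces the table's computation:

```python
# r-squared for the table above: y = x^2 fitted by the line y_hat = 6x - 5.
import numpy as np

x = np.arange(7)
y = x ** 2
y_hat = 6 * x - 5

ss_total = np.sum((y - y.mean()) ** 2)         # 1092
ss_regression = np.sum((y - y_hat) ** 2)       # 84
print(1 - ss_regression / ss_total)            # 0.923...
```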
