Linear Regression
Introduction
Simple regression
Making predictions
Cost function
Gradient descent
Training
Model evaluation
Summary
Multivariable regression
Growing complexity
Normalization
Making predictions
Initialize weights
Cost function
Gradient descent
Simplifying with matrices
Bias term
Model evaluation
Introduction
Linear Regression is a supervised machine learning algorithm where the predicted output is continuous
and has a constant slope. It's used to predict values within a continuous range (e.g. sales, price) rather
than trying to classify them into categories (e.g. cat, dog). There are two main types:
Simple regression
Simple linear regression uses traditional slope-intercept form, where m and b are the variables our
algorithm will try to "learn" to produce the most accurate predictions. x represents our input data and
y represents our prediction.
y = mx + b
Multivariable regression
A more complex, multi-variable linear equation might look like this, where w represents the coefficients, or
weights, our model will try to learn.

f(x, y, z) = w_1 x + w_2 y + w_3 z
The variables x, y, z represent the attributes, or distinct pieces of information, we have about each
observation. For sales predictions, these attributes might include a company's advertising spend on radio,
TV, and newspapers.

Sales = w_1 Radio + w_2 TV + w_3 News
Simple regression
Let’s say we are given a dataset with the following columns (features): how much a company spends on
Radio advertising each year and its annual Sales in terms of units sold. We are trying to develop an
equation that will let us predict units sold based on how much a company spends on radio advertising.
The rows (observations) represent companies.
Making predictions
Our prediction function outputs an estimate of sales given a company's radio advertising spend and our
current values for Weight and Bias.
Sales = Weight ⋅ Radio + Bias
Weight
the coefficient for the Radio independent variable. In machine learning we call coefficients weights.
Radio
the independent variable. In machine learning we call these variables features.
Bias
the intercept where our line crosses the y-axis. In machine learning we call intercepts bias. Bias
offsets all of our predictions.
Our algorithm will try to learn the correct values for Weight and Bias. By the end of our training, our
equation will approximate the line of best fit.
Code
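A minimal sketch of this prediction function (the name predict_sales is illustrative):

def predict_sales(radio, weight, bias):
    return weight * radio + bias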
Cost function
The prediction function is nice, but for our purposes we don't really need it. What we need is a
:doc:`cost function <loss_functions>` so we can start optimizing our weights.
Let's use :ref:`mse` as our cost function. MSE measures the average squared difference between an
observation's actual and predicted values. The output is a single number representing the cost, or score,
associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our
model.
Math
Given our simple linear equation y = mx + b, we can calculate MSE as:

MSE = \frac{1}{N} \sum_{i=1}^{n} (y_i - (mx_i + b))^2
Code
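A sketch of MSE as a cost function, assuming radio and sales are parallel lists of observations:

def cost_function(radio, sales, weight, bias):
    companies = len(radio)
    total_error = 0.0
    for i in range(companies):
        # Squared difference between actual and predicted sales
        total_error += (sales[i] - (weight * radio[i] + bias))**2
    return total_error / companies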
Gradient descent
To minimize MSE we use :doc:`gradient_descent` to calculate the gradient of our cost function. Gradient
descent works by measuring the error our current weight gives us, using the derivative of the cost
function to find the gradient (the slope of the cost function at our current weight), and then updating
our weight in the direction opposite the gradient. We move opposite the gradient because the gradient
points up the slope, so stepping against it decreases our error.
Math
There are two :ref:`parameters <glossary_parameters>` (coefficients) in our cost function we can control:
weight m and bias b. Since we need to consider the impact each one has on the final prediction, we use
partial derivatives. To find the partial derivatives, we use the :ref:`chain_rule`. We need the chain rule
because (y_i - (mx_i + b))^2 is really 2 nested functions: the inner function y_i - (mx_i + b) and the outer
function x^2.
Returning to our cost function:
f(m, b) = \frac{1}{N} \sum_{i=1}^{n} (y_i - (mx_i + b))^2
Using the following:
(y_i - (mx_i + b))^2 = A(B(m, b))
We can split the derivative into
A(x) = x^2

\frac{df}{dx} = A'(x) = 2x

and

B(m, b) = y_i - (mx_i + b) = y_i - mx_i - b

\frac{dx}{dm} = B'(m) = 0 - x_i - 0 = -x_i

\frac{dx}{db} = B'(b) = 0 - 0 - 1 = -1
And then using the :ref:`chain_rule` which states:
\frac{df}{dm} = \frac{df}{dx} \cdot \frac{dx}{dm}

\frac{df}{db} = \frac{df}{dx} \cdot \frac{dx}{db}
We then plug in each of the parts to get the following derivatives:

\frac{df}{dm} = A'(B(m, b)) \cdot B'(m) = 2(y_i - (mx_i + b)) \cdot -x_i

\frac{df}{db} = A'(B(m, b)) \cdot B'(b) = 2(y_i - (mx_i + b)) \cdot -1
We can calculate the gradient of this cost function as:
f'(m, b) = \begin{bmatrix} \frac{df}{dm} \\ \frac{df}{db} \end{bmatrix} = \begin{bmatrix} \frac{1}{N} \sum -2x_i(y_i - (mx_i + b)) \\ \frac{1}{N} \sum -2(y_i - (mx_i + b)) \end{bmatrix}
Code
To solve for the gradient, we iterate through our data points using our new weight and bias values and
take the average of the partial derivatives. The resulting gradient tells us the slope of our cost function at
our current position (i.e. weight and bias) and the direction we should update to reduce our cost function
(we move in the direction opposite the gradient). The size of our update is controlled by the
:ref:`learning rate <glossary_learning_rate>`.
def update_weights(radio, sales, weight, bias, learning_rate):
    weight_deriv = 0
    bias_deriv = 0
    companies = len(radio)
    for i in range(companies):
        # Calculate partial derivatives:
        # -2x(y - (mx + b)) and -2(y - (mx + b))
        weight_deriv += -2 * radio[i] * (sales[i] - (weight * radio[i] + bias))
        bias_deriv += -2 * (sales[i] - (weight * radio[i] + bias))
    # Average the derivatives and step opposite the gradient
    weight -= (weight_deriv / companies) * learning_rate
    bias -= (bias_deriv / companies) * learning_rate
    return weight, bias
Training
Training a model is the process of iteratively improving your prediction equation by looping through the
dataset multiple times, each time updating the weight and bias values in the direction indicated by the
slope of the cost function (gradient). Training is complete when we reach an acceptable error threshold, or
when subsequent training iterations fail to reduce our cost.
Before training we need to initialize our weights (set default values), set our
:ref:`hyperparameters <glossary_hyperparameters>` (learning rate and number of iterations), and prepare
to log our progress over each iteration.
Code
def train(radio, sales, weight, bias, learning_rate, iters):
    cost_history = []
    for i in range(iters):
        weight, bias = update_weights(radio, sales, weight, bias, learning_rate)
        # Calculate cost for auditing purposes
        cost = cost_function(radio, sales, weight, bias)
        cost_history.append(cost)
    return weight, bias, cost_history
Model evaluation
If our model is working, we should see our cost decrease after every iteration.
Logging
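A sketch of per-iteration logging, placed inside the training loop above (the format is illustrative):

if i % 10 == 0:
    print(f"iter={i}  weight={weight:.2f}  bias={bias:.4f}  cost={cost:.2f}")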
Visualizing
Cost history
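One way to visualize the cost history returned by train, assuming matplotlib is available:

import matplotlib.pyplot as plt

plt.plot(cost_history)
plt.xlabel("Iteration")
plt.ylabel("MSE cost")
plt.title("Cost history")
plt.show()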
Summary
By learning the best values for weight (.46) and bias (.25), we now have an equation that predicts future
sales based on radio advertising investment.
Sales = .46 Radio + .25
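For illustration, a company that spends 50 (in the dataset's units) on radio advertising would be predicted
to sell .46 ⋅ 50 + .25 = 23.25 units.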
How would our model perform in the real world? I’ll let you think about it :)
Multivariable regression
Let’s say we are given data on TV, radio, and newspaper advertising spend for a list of companies, and
our goal is to predict sales in terms of units sold.
Growing complexity
As the number of features grows, the complexity of our model increases and it becomes increasingly
difficult to visualize, or even comprehend, our data.
One solution is to break the data apart and compare 1-2 features at a time. In this example we explore
how Radio and TV investment impacts Sales.
Normalization
As the number of features grows, calculating the gradient takes longer. We can speed this up by
"normalizing" our input data to ensure all values are within the same range. This is especially important for
datasets with high standard deviations or differences in the ranges of the attributes. Our goal now will be to
normalize our features so they are all in the range -1 to 1.
Code
import numpy as np

def normalize(features):
    '''
    features   -  (200, 3)
    features.T -  (3, 200)

    We transpose the input matrix, swapping
    cols and rows to make vector math easier.
    '''
    for feature in features.T:
        fmean = np.mean(feature)
        frange = np.amax(feature) - np.amin(feature)

        # Vector subtraction
        feature -= fmean

        # Vector division
        feature /= frange

    return features
Note
Matrix math. Before we continue, it's important to understand basic :doc:`linear_algebra` concepts
as well as numpy functions like numpy.dot().
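As a quick refresher, here is numpy.dot on the shapes we use below (values are made up):

import numpy as np

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])        # (2, 3) feature matrix
W = np.array([[.1], [.2], [.3]])    # (3, 1) weight vector
print(np.dot(X, W))                 # (2, 1) predictions: [[1.4], [3.2]]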
Making predictions
Our predict function outputs an estimate of sales given our current weights (coefficients) and a company's
TV, radio, and newspaper spend. Our model will try to identify weight values that most reduce our cost
function.
Sales = W_1 TV + W_2 Radio + W_3 Newspaper
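A minimal sketch of this prediction step, assuming numpy is imported as np and a (200, 3) feature matrix:

def predict(features, weights):
    '''
    features - (200, 3)
    weights  - (3, 1)
    returns predictions - (200, 1)
    '''
    return np.dot(features, weights)

Initialize weights

Before training we also need starting values for our weights. Starting from zero is a simple choice (an
assumption here, not the only option):

W1 = 0.0
W2 = 0.0
W3 = 0.0
weights = np.array([
    [W1],
    [W2],
    [W3]
])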
Cost function
Now we need a cost function to audit how our model is performing. The math is the same, except we swap
the mx + b expression for W_1 x_1 + W_2 x_2 + W_3 x_3. We also divide the expression by 2 to make derivative
calculations simpler.

MSE = \frac{1}{2N} \sum_{i=1}^{n} (y_i - (W_1 x_1 + W_2 x_2 + W_3 x_3))^2
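A sketch of this cost function, reusing the predict function above:

def cost_function(features, targets, weights):
    '''
    features - (200, 3), targets - (200, 1), weights - (3, 1)
    returns the average squared error between targets and predictions
    '''
    N = len(targets)
    sq_error = (predict(features, weights) - targets)**2
    # Divide by 2N to match the formula above
    return sq_error.sum() / (2.0 * N)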
Gradient descent
Again using the :ref:`chain_rule` we can compute the gradient: a vector of partial derivatives describing
the slope of the cost function for each weight.
f'(W_1) = -x_1(y - (W_1 x_1 + W_2 x_2 + W_3 x_3))
f'(W_2) = -x_2(y - (W_1 x_1 + W_2 x_2 + W_3 x_3))
f'(W_3) = -x_3(y - (W_1 x_1 + W_2 x_2 + W_3 x_3))
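A sketch of how these derivatives might be applied one weight at a time, assuming numpy as np and the
predict function above (the error is flattened so each (200,) feature column multiplies element-wise):

def update_weights(features, targets, weights, lr):
    '''
    features - (200, 3), targets - (200, 1), weights - (3, 1)
    '''
    error = (targets - predict(features, weights)).flatten()

    # One partial derivative per feature: -x * (y - prediction)
    d_w1 = -features[:, 0] * error
    d_w2 = -features[:, 1] * error
    d_w3 = -features[:, 2] * error

    # Subtract the mean derivative, scaled by the learning rate
    # (the gradient points in the direction of steepest ascent)
    weights[0][0] -= lr * np.mean(d_w1)
    weights[1][0] -= lr * np.mean(d_w2)
    weights[2][0] -= lr * np.mean(d_w3)
    return weights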
Simplifying with matrices

The per-weight update above has a lot of duplication. Vectorized gradient descent removes it: instead of
updating each weight separately, we operate on the whole feature matrix X and weight vector at once using
matrix multiplication.

X = [
    [x1, x2, x3]
    [x1, x2, x3]
    .
    .
    .
    [x1, x2, x3]
]

targets = [
    [1],
    [2],
    [3]
]

A sketch of the vectorized update, using the predict function and shapes from above:

def update_weights_vectorized(X, targets, weights, lr):
    '''
    gradient = X.T * (predictions - targets) / N
    X - (200, 3), targets - (200, 1), weights - (3, 1)
    '''
    companies = len(X)

    #1 - Get Predictions
    predictions = predict(X, weights)

    #2 - Calculate error/loss
    error = targets - predictions

    #3 - Transpose X from (200, 3) to (3, 200) and multiply by the
    #    (200, 1) error matrix, returning a (3, 1) matrix that holds
    #    one partial derivative per feature: the aggregate slope of
    #    the cost function across all observations
    gradient = np.dot(-X.T, error)

    #4 - Take the average error derivative for each feature
    gradient /= companies

    #5 - Multiply the gradient by our learning rate
    gradient *= lr

    #6 - Subtract from our weights to minimize cost
    weights -= gradient

    return weights
Bias term
Our train function is the same as for simple linear regression; however, we're going to make one final tweak
before running it: add a :ref:`bias term <glossary_bias_term>` to our feature matrix.
In our example, it's very unlikely that sales would be zero if companies stopped advertising. Possible
reasons for this might include past advertising, existing customer relationships, retail locations, and
salespeople. A bias term will help us capture this base case.
Code
Below we add a constant 1 to every row of our features matrix. Because this input never changes, the
weight our model learns for it acts as a constant offset: our bias term.
bias = np.ones(shape=(len(features),1))
features = np.append(bias, features, axis=1)
Model evaluation
After training our model through 1000 iterations with a learning rate of .0005, we finally arrive at a set of
weights we can use to make predictions:
Sales = 4.7 TV + 3.5 Radio + .81 Newspaper + 13.9
Our MSE cost dropped from 110.86 to 6.25.