SLR in R
Formula and basics
The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where:
• b0 and b1 are known as the regression beta coefficients or parameters:
• b0 is the intercept of the regression line; that is, the predicted value when x = 0.
• b1 is the slope of the regression line.
• e is the error term (also known as the residual error), the part of y that cannot be explained by the regression model.
The figure illustrates the linear regression model, where:
• the best-fit regression line is in blue
• the intercept (b0) and the slope (b1) are shown in green
• the error terms (e) are represented by vertical red lines
• From the scatter plot, it can be seen that not all the data points fall exactly on the fitted regression line. Some of the points are above the blue line and some are below it; overall, the residual errors (e) have approximately mean zero.
• The sum of the squares of the residual errors is called the Residual Sum of Squares (RSS).
• The average variation of the points around the fitted regression line is called the Residual Standard Error (RSE). This is one of the metrics used to evaluate the overall quality of the fitted regression model: the lower the RSE, the better the fit.
Contd...
• Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:
• y ~ b0 + b1*x
• Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as small as possible. This method of determining the beta coefficients is technically called least squares regression or ordinary least squares (OLS) regression; a short sketch of the computation follows this list.
• Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictor (x) and the outcome variable (y).
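As a concrete sketch of what minimizing the RSS computes, here is OLS written out by hand in R (the x and y vectors are hypothetical, for illustration only; lm(y ~ x) returns the same coefficients):

x <- c(1, 2, 3, 4, 5)            # hypothetical predictor values
y <- c(2.1, 3.9, 6.2, 7.8, 10.1) # hypothetical outcome values
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
rss <- sum((y - (b0 + b1 * x))^2)   # the quantity OLS minimizes
c(b0 = b0, b1 = b1)                 # matches coef(lm(y ~ x))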
Loading required R packages
Load the required packages (a code sketch follows the list):
• tidyverse for data manipulation and visualization
• ggpubr for easily creating publication-ready plots
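In code, a minimal loading step might look like this (the theme_set() line is an optional assumption, setting ggpubr's publication theme as the default for all plots):

library(tidyverse)        # data manipulation and visualization (includes ggplot2)
library(ggpubr)           # publication-ready plots
theme_set(theme_pubr())   # optional: apply ggpubr's theme globally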
Examples of data and problem
• We’ll use the marketing data set [datarium package]. It contains the impact of three advertising media (youtube, facebook and newspaper) on sales. The data are the advertising budgets, in thousands of dollars, along with the sales. The advertising experiment has been repeated 200 times with different budgets, and the observed sales have been recorded.
• First install the datarium package using devtools::install_github("kassambara/datarium"), then load and inspect the marketing data as follows:
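A sketch of those steps (the install line needs to be run only once):

# install the datarium package from GitHub (requires devtools)
devtools::install_github("kassambara/datarium")

# load the marketing data set and inspect the first rows
data("marketing", package = "datarium")
head(marketing)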
Inspect the data:
We want to predict future sales on the basis of advertising budget spent on youtube.
Visualization
• Create a scatter plot displaying the sales units versus youtube advertising
budget.
• Add a smoothed line
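A minimal sketch with ggplot2 (geom_smooth() defaults to a loess smoother for a data set of this size):

ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +   # scatter plot of sales versus youtube budget
  geom_smooth()    # add a smoothed line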
The graph suggests a linearly increasing relationship between the sales and youtube variables. This is a good thing, because one important assumption of linear regression is that the relationship between the outcome and predictor variables is linear and additive.
Correlation Coefficient
• It’s also possible to compute the correlation coefficient between the two variables
using the R function cor():
cor(marketing$sales, marketing$youtube)
## [1] 0.782
• The correlation coefficient measures the strength of the association between two variables x and y. Its value ranges between -1 (perfect negative correlation: when x increases, y decreases) and +1 (perfect positive correlation: when x increases, y increases).
• A value closer to 0 suggests a weak relationship between the variables. A low correlation (roughly between -0.2 and 0.2) probably suggests that much of the variation of the outcome variable (y) is not explained by the predictor (x). In that case, we should probably look for better predictor variables.
• In our example, the correlation coefficient is large enough, so we can continue by
building a linear model of y as a function of x.
Computation
• The simple linear regression tries to find the best line to predict sales on the basis of
youtube advertising budget.
• The linear model equation can be written as follows: sales = b0 + b1 * youtube
• The R function lm() can be used to determine the beta coefficients of the linear model:
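In code, a minimal sketch (the coefficients printed below are the rounded values quoted on the Interpretation slide, so the exact digits from lm() will differ slightly):

model <- lm(sales ~ youtube, data = marketing)
model
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept)      youtube
##        8.44        0.048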
• The results show the intercept and the beta coefficient for the youtube variable.
Interpretation
From the output above:
• the estimated regression line equation can be written as follows: sales = 8.44 + 0.048*youtube
• the intercept (b0) is 8.44. It can be interpreted as the predicted sales units for a youtube advertising budget of zero. Recall that we are operating in units of a thousand dollars. This means that, for a youtube advertising budget equal to zero, we can expect sales of 8.44 * 1000 = 8440 dollars.
• the regression beta coefficient for the variable youtube (b1), also known as the slope, is 0.048. This means that, for a youtube advertising budget equal to 1000 dollars, we can expect an increase of 48 units (0.048*1000) in sales. That is, sales = 8.44 + 0.048*1000 = 56.44 units. As we are operating in units of a thousand dollars, this represents sales of 56,440 dollars.
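As a sketch, the same arithmetic can be reproduced with predict(); the result will be close to 56.44 but not identical, since lm() keeps unrounded coefficients:

# predicted sales (in thousands of dollars) for a youtube budget of 1000
predict(model, newdata = data.frame(youtube = 1000))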
Regression line
• To add the regression line onto the scatter plot, you can use the function stat_smooth() [ggplot2]. By default, the fitted line is presented with a confidence interval around it. The confidence bands reflect the uncertainty about the line. If you don’t want to display them, specify the option se = FALSE in stat_smooth().
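A sketch of the plot; method = "lm" makes stat_smooth() draw the fitted regression line:

ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth(method = "lm")   # add se = FALSE here to hide the confidence band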
Model assessment
• We built a linear model of sales as a function of youtube advertising
budget: sales = 8.44 + 0.048*youtube.
• Before using this formula to predict future sales, you should make sure that the model is statistically significant, that is:
• there is a statistically significant relationship between the predictor and the outcome variables
• the model that we built fits the data in hand well.
Model summary
• summary(model) prints the model’s statistical summary; a sketch of the output is shown after this list.
• The summary output shows six components, including:
• Call. Shows the function call used to compute the regression model.
• Residuals. Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
• Coefficients. Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked with stars.
• Residual standard error (RSE), R-squared (R2) and the F-statistic are metrics that are used to check how well the model fits our data.
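A sketch of the summary printout’s layout. The numbers shown are the ones quoted elsewhere in these slides (R2 is derived as 0.782^2, and the degrees of freedom follow from the 200 observations); entries the slides do not quote are elided with "...":

summary(model)
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##    ...    ...    ...    ...    ...
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.44         ...     ...      ... ***
## youtube        0.048    0.00269   17.67      ... ***
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.612, Adjusted R-squared: ...
## F-statistic: 312.1 on 1 and 198 DF, p-value: 1.46e-42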
Coefficients significance
• The coefficients table, in the model’s statistical summary, shows:
• the estimates of the beta coefficients
• the standard errors (SE), which reflect the accuracy of the beta coefficients. For a given beta coefficient, the SE reflects how the coefficient would vary under repeated sampling. It can be used to compute the confidence intervals and the t-statistic.
• the t-statistic and the associated p-value, which define the statistical significance of the beta coefficients.
t-statistic and p-values:
• For a given predictor, the t-statistic (and its associated p-value) tests whether or not there is a statistically significant relationship between that predictor and the outcome variable, that is, whether or not the beta coefficient of the predictor is significantly different from zero.
• The statistical hypotheses are as follows:
• Null hypothesis (H0): the coefficient is equal to zero (i.e., no relationship between x and y)
• Alternative hypothesis (Ha): the coefficient is not equal to zero (i.e., there is some relationship between x and y)
• Mathematically, for a given beta coefficient (b), the t-test is computed as t = (b - 0)/SE(b), where SE(b) is the standard error of the coefficient b. The t-statistic measures the number of standard deviations that b is away from 0; thus, a large t-statistic produces a small p-value. In our example, b1 ≈ 0.048 and SE(b1) ≈ 0.00269 give t ≈ 17.67 (computed with unrounded values).
• The higher the t-statistic (and the lower the p-value), the more significant the predictor. The symbols to the right visually indicate the level of significance. The line below the table shows the definition of these symbols; one star means 0.01 < p < 0.05. The more stars beside a variable’s p-value, the more significant the variable.
• A statistically significant coefficient indicates that there is an association between the predictor (x) and the outcome (y) variable.
• In our example, the p-values for both the intercept and the predictor variable are highly significant, so we can reject the null hypothesis in favor of the alternative, which means that there is a significant association between the predictor and the outcome variables.
Standard errors and confidence intervals:
• The standard error measures the variability/accuracy of the beta coefficients. It can be used to compute the confidence intervals of the coefficients.
• For example, the 95% confidence interval for the coefficient b1 is defined as b1 +/- 2*SE(b1), where:
• the lower limit of b1 = b1 - 2*SE(b1) = 0.047 - 2*0.00269 = 0.042
• the upper limit of b1 = b1 + 2*SE(b1) = 0.047 + 2*0.00269 = 0.052
• That is, there is approximately a 95% chance that the interval [0.042, 0.052] will contain the true value of b1. Similarly, the 95% confidence interval for b0 can be computed as b0 +/- 2*SE(b0).
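The same intervals can be obtained in R with confint(); it uses the exact t quantile rather than 2, so expect small differences from the hand computation. The limits shown are the slide’s values, and the intercept row is left elided:

confint(model)
##                2.5 %  97.5 %
## (Intercept)      ...     ...
## youtube        0.042   0.052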
Model accuracy
• Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostics by checking how well the model fits the data. This process is also referred to as assessing the goodness of fit.
• The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:
• the Residual Standard Error (RSE)
• the R-squared (R2)
• the F-statistic
Residual standard error (RSE)
• The RSE (also known as the model sigma) is the residual variation, representing the average variation of the observed points around the fitted regression line. It is the standard deviation of the residual errors.
• The RSE provides an absolute measure of the patterns in the data that can’t be explained by the model. When comparing two models, the one with the smaller RSE is a good indication of the better fit to the data.
• Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible.
• In our example, RSE = 3.91, meaning that the observed sales values deviate from the true regression line by approximately 3.9 units on average.
• Whether or not an RSE of 3.9 units is an acceptable prediction error is subjective and depends on the problem context. However, we can calculate the percentage error. In our data set, the mean value of sales is 16.827, so the percentage error is 3.9/16.827 ≈ 23%.
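A one-line sketch of this computation (sigma() returns the model’s RSE):

sigma(model) / mean(marketing$sales)   # ≈ 3.91 / 16.827 ≈ 0.23, i.e. about 23%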
R-squared and Adjusted R-squared:
• The R-squared (R2) ranges from 0 to 1 and represents the proportion of information (i.e., variation) in the data that can be explained by the model. The adjusted R-squared adjusts for the degrees of freedom.
• The R2 measures how well the model fits the data. For a simple linear regression, R2 is the square of the Pearson correlation coefficient (verified in the sketch after this list).
• A high value of R2 is a good indication. However, as the value of R2 tends to increase when more predictors are added to the model, as in a multiple linear regression model, you should mainly consider the adjusted R-squared, which is an R2 penalized for a higher number of predictors.
• An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the
outcome has been explained by the regression model.
• A number near 0 indicates that the regression model did not explain much of the variability in the
outcome.
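A quick sketch verifying that identity with the correlation computed earlier:

cor(marketing$sales, marketing$youtube)^2   # ≈ 0.782^2 ≈ 0.61
summary(model)$r.squared                    # should match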
F-Statistic:
• The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.
• In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficients table. In fact, the F-statistic is the square of the t-statistic: 312.1 = (17.67)^2 (checked in the sketch below). This is true in any model with 1 degree of freedom.
• The F-statistic becomes more important once we start using multiple predictors, as in multiple linear regression.
• A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 312.1, producing a p-value of 1.46e-42, which is highly significant.
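A one-line sketch of the F = t^2 check, using the t value from the coefficients table:

17.67^2   # ≈ 312.2, matching F = 312.1 up to rounding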
Summary
• After computing a regression model, a first step is to check whether at least one predictor is significantly associated with the outcome variable.
• If one or more predictors are significant, the second step is to assess how well the model fits the data by inspecting the Residual Standard Error (RSE), the R2 value and the F-statistic. These metrics give the overall quality of the model.
• RSE: the closer to zero, the better
• R-squared: the higher, the better
• F-statistic: the higher, the better
Any Queries?
Thank you