Engineering - Simple Correlation and Regression - 2024

Uploaded by

Malack Chagwa
Copyright
© All Rights Reserved
PROBABILITY AND STATISTICS

SIMPLE LINEAR REGRESSION


INTRODUCTION
• Correlation and regression help us examine the
relationships between quantitative variables.
• Correlation analysis enables us to know
1. If there is a linear relationship between variables
2. The strength of the linear relationship between variables
3. The direction of the linear relationship between
variables
• Regression analysis enables us to
1. Estimate one variable on the basis of another
Linear relationships between variables
• There are simple linear relationships that deal with the
relationships between two variables.
• There are multiple linear relationships that deal with
relationships to do with more than two variables.
• This class will look at correlation and regression analysis
between two variables
Simple linear relationships
• Simple linear relationships consist of an independent
variable and a dependent variable
• An independent variable (also known as an explanatory
variable) can be manipulated and/or controlled.
• A dependent variable (also known as a response variable)
is one that is expected to change when the independent
variable is manipulated.
Simple linear relationships (Cont’d)
• A simple linear relationship can be positive or
negative
• A positive relationship occurs when both variables
increase or decrease at the same time
• A negative relationship occurs when one variable
increases while the other variable decreases, and
vice versa.
Scatter Plots
• Simple linear relationships can be presented diagrammatically
using a scatter plot
• A scatter plot is a graph of the ordered pairs (𝑥, 𝑦) of numbers
consisting of the independent variable 𝑥 and the dependent
variable 𝑦.
• The scales of the variables can be different, and the coordinates
of the axes are determined by the smallest and largest data
values of the variables.
• Below is data obtained in a study on the number of absences from
team meetings and quarterly appraisals (in %) of seven randomly
selected employees from a particular consultancy firm. Use it to
construct a scatter plot and comment on the relationship
Employee   Number of absences 𝑥   Appraisal score (in %) 𝑦
1          6                      82
2          2                      86
3          15                     43
4          9                      74
5          12                     58
6          5                      90
7          8                      78
Correlation
• Correlation describes the nature of the linear relationship
between two variables.
• The existence, strength and direction of a linear
relationship between variables is quantified using a
correlation coefficient.
• There are various measures of the correlation coefficient
but the one we’ll use is the Pearson product moment
correlation coefficient (which we shall refer to as
Pearson’s coefficient) which we will denote as 𝑟
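The slides do not write out how 𝑟 is calculated. A minimal Python sketch, assuming the standard definition 𝑟 = Sxy / √(Sxx · Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means, and using the absences and appraisal data from the scatter plot exercise. The function name pearson_r is illustrative, not from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of paired samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired samples with at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Data from the absences / appraisal exercise
absences = [6, 2, 15, 9, 12, 5, 8]
appraisal = [82, 86, 43, 74, 58, 90, 78]

r = pearson_r(absences, appraisal)
print(round(r, 3))  # about -0.944
```

A value this close to −1 points to a strong negative linear relationship: employees with more absences tend to have lower appraisal scores.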
Pearson’s coefficient (Cont’d)
• Below is a summary diagram of what different values of 𝑟 mean in terms of
the direction and strength of the linear relationship between the two variables
Pearson’s coefficient (Cont’d)
• Below are diagrammatic representations of the relationship
between 𝑟 and scatter plots
Pearson’s coefficient (Cont’d)
• 𝑟 has no units. Its value remains the same if the
units of measurement are changed or if the
𝑥 and 𝑦 values are switched
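Both properties can be checked numerically. The sketch below reuses the absences and appraisal data, assumes the standard formula for 𝑟, and treats a change of units as a positive rescaling (percent to proportion); the helper pearson_r is an illustrative name.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

absences = [6, 2, 15, 9, 12, 5, 8]
appraisal = [82, 86, 43, 74, 58, 90, 78]

r_xy = pearson_r(absences, appraisal)
# Switching the roles of x and y
r_yx = pearson_r(appraisal, absences)
# Changing units: appraisal as a proportion (0 to 1) instead of a percentage
appraisal_fraction = [score / 100 for score in appraisal]
r_rescaled = pearson_r(absences, appraisal_fraction)

same_when_switched = math.isclose(r_xy, r_yx)
same_when_rescaled = math.isclose(r_xy, r_rescaled)
print(same_when_switched, same_when_rescaled)
```

Note that multiplying one variable by a negative constant would flip the sign of 𝑟; ordinary unit conversions use positive factors, so the value is unchanged.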
Regression
• Regression analysis is a technique used to derive an
equation that expresses the linear relationship between
variables, allowing us to estimate/predict the value of a
dependent variable from the value of an independent
variable.
• The regression equation is an equation that enables us to
express the linear relationship between quantitative
variables
• It is given by 𝑦̂ = 𝑎 + 𝑏𝑥, where 𝑦̂ is the estimate of 𝑦, 𝑎 is
the 𝑦-intercept, and 𝑏 is the slope
Derivation of the regression equation
• This is done using the least squares method/principle.
• It produces the line of best fit: the values of 𝑎 and 𝑏 are chosen so that
the sum of squared vertical distances between the estimated/predicted
y-values and the observed y-values is as small as possible
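The slides do not show the resulting formulas. As a sketch using the standard textbook result (with 𝑛 the number of data pairs, an added symbol), minimising the sum of squared errors with respect to 𝑎 and 𝑏 gives:

```latex
\begin{aligned}
\mathrm{SSE}(a, b) &= \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
                    = \sum_{i=1}^{n} \left(y_i - a - b x_i\right)^2 \\[4pt]
b &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
a = \bar{y} - b\,\bar{x}
\end{aligned}
```

Here the formulas for 𝑎 and 𝑏 follow from setting the partial derivatives of SSE with respect to 𝑎 and 𝑏 equal to zero.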
• Using the work absences and appraisal data
a) Derive the least squares regression equation that
relates the two variables
b) Interpret the slope in (a)
c) Use the equation in (a) to predict the quarterly
appraisal score for an employee with 10
absences.
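One way to work through parts (a) to (c) is sketched below in Python, assuming the standard least squares formulas for the slope and intercept. The function name least_squares_line is illustrative.

```python
def least_squares_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                  # intercept
    return a, b

absences = [6, 2, 15, 9, 12, 5, 8]
appraisal = [82, 86, 43, 74, 58, 90, 78]

# (a) regression equation
a, b = least_squares_line(absences, appraisal)
print(f"y-hat = {a:.3f} + ({b:.3f})x")

# (c) prediction for 10 absences (10 lies inside the observed range 2 to 15)
predicted = a + b * 10
print(f"predicted appraisal at 10 absences: {predicted:.1f}%")
```

With these data the fitted line is roughly 𝑦̂ = 102.493 − 3.622𝑥. For part (b), the slope suggests that each additional absence is associated with a drop of about 3.6 percentage points in the appraisal score, and the prediction in part (c) comes out at about 66.3%.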
NOTE
• Predictions should only be made using values within the
range of the x-values used to derive the regression
equation
• The slope of the regression equation will always have the
same sign as the correlation coefficient
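The second point follows from a standard identity linking the slope to 𝑟, sketched here with 𝑠ₓ and 𝑠ᵧ denoting the sample standard deviations of 𝑥 and 𝑦 (symbols not used elsewhere in the slides):

```latex
b = r \, \frac{s_y}{s_x}, \qquad s_x > 0,\; s_y > 0
\;\Longrightarrow\; \operatorname{sign}(b) = \operatorname{sign}(r)
```

Because both standard deviations are positive, the slope and the correlation coefficient can only differ in magnitude, never in sign.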
SIMPLE LINEAR REGRESSION USING
SPSS
Assumptions
1. Linearity of variables
• This means there must be a linear relationship between
the dependent and independent variables.
✓This can be checked visually using a scatterplot and
numerically using Pearson's coefficient.
2. The dataset must not contain extreme outliers
✓Check this using the standardized residuals row. If the
minimum standardized residual is less than −3 or the
maximum standardized residual is greater than 3, the
dataset contains extreme outliers.
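The outlier check can also be sketched outside SPSS. The Python example below assumes the usual definition of a standardized residual (the raw residual divided by the standard error of the estimate, √(SSE/(𝑛 − 2))) and uses the absences and appraisal data purely for illustration, since the salary dataset is not reproduced in these slides.

```python
import math

def standardized_residuals(x, y):
    """Residuals of the least squares line divided by the standard error of the estimate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Standard error of the estimate: sqrt(SSE / (n - 2))
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [e / s for e in residuals]

absences = [6, 2, 15, 9, 12, 5, 8]
appraisal = [82, 86, 43, 74, 58, 90, 78]

z = standardized_residuals(absences, appraisal)
has_extreme_outliers = min(z) < -3 or max(z) > 3
print(round(min(z), 3), round(max(z), 3), has_extreme_outliers)
```

For this small dataset all standardized residuals fall well inside the ±3 band, so no extreme outliers would be flagged.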
Assumptions (Cont’d)
3. Independence of the observations
✓This is generally assumed for a random sample but can
also be tested using the Durbin-Watson statistic. Values of
the Durbin-Watson statistic between 1.5 and 2.5 are
generally acceptable.
4. Homoscedasticity
• This means that residuals must have constant variance at
every level of the independent variable
✓This is checked using a residual plot. The goal is that the
residuals must not show any kind of noticeable pattern.
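For the independence check in assumption 3, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 suggesting no first-order autocorrelation. A minimal Python sketch follows; the residual values are illustrative only and are not taken from the salary dataset.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Illustrative residuals (hypothetical values, in the order the data were collected)
residuals = [1.24, -9.25, -5.16, 4.10, -1.03, 5.62, 4.48]

dw = durbin_watson(residuals)
# Rule of thumb from the slides: 1.5 to 2.5 is generally acceptable
acceptable = 1.5 <= dw <= 2.5
print(round(dw, 3), acceptable)
```

The statistic depends on the order of the observations, so it is most meaningful when the data have a natural sequence such as time.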
Assumptions (Cont’d)
5. The residuals must be normally or approximately
normally distributed
✓This is checked using the Normal P-P plot. We generally
want the residual points to be on the diagonal line or as
close to it as possible
Dataset
• You have been given a dataset that has employee salary
and years of experience data. Use it to answer the
questions that follow.
a) State and perform diagnostics to check the
assumptions of simple linear regression.
b) At 𝛼 = 0.05, test the hypothesis that the slope is zero
c) State and interpret the model
SOLUTIONS
a) Answer
• Assumption 1
✓The scatterplot depicts a positive linear relationship.
✓The correlation coefficient (R = 0.978) indicates a strong positive
linear relationship since its value is between 0.5 and 1.
SOLUTIONS (Cont’d)
• Assumption 2

✓The minimum value of std. residual (-1.375) is more than -3, meaning
there are no extreme outliers on the lower end.
✓The maximum value of std residual (1.978) is less than 3, meaning there
are no extreme outliers on the upper end.
SOLUTIONS (Cont’d)
• Assumption 3

✓The Durbin-Watson value (1.648) is between 1.5 and 2.5,
therefore independence of observations is assumed.
SOLUTIONS (Cont’d)
• Assumption 4
✓The residual plot shows a somewhat constant spread of
residuals above zero and below zero. This is observed in
the spread of the residual points not showing any
noticeable pattern.
SOLUTIONS (Cont’d)
• Assumption 5
✓We can see from the Normal P-P Plot that the points rest
on the diagonal line or have a minimal deviation from the
diagonal line. This shows normality of the residuals.
SOLUTIONS (Cont’d)
b) Step 1
✓𝐻₀ : 𝛽 = 0
✓𝐻₁ : 𝛽 ≠ 0
• Step 2
✓𝛼 = 0.05 (given)
• Step 3
✓𝑡 = 24.950
• Step 4
✓p-value < 0.001
• Step 5
✓Since the p-value is less than 0.05, we reject the null hypothesis. We
conclude that the true slope is significantly different from zero.
✓Supporting this conclusion, the 95% CI for the slope, [8674.119, 10225.806],
does not contain zero.
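The reported figures can be cross-checked against each other. The sketch below uses only the slope, the 𝑡 statistic and the confidence interval quoted in the solution, and assumes the usual relations 𝑡 = 𝑏 / SE(𝑏) and CI = 𝑏 ± 𝑡* · SE(𝑏). The variable names are illustrative.

```python
# Values reported in the solution
slope = 9449.962
t_stat = 24.950
ci_low, ci_high = 8674.119, 10225.806

# Since t = b / SE(b), the standard error of the slope is b / t
std_error = slope / t_stat

# The CI is b +/- t_crit * SE(b), so the critical value can be recovered
half_width = (ci_high - ci_low) / 2
implied_t_crit = half_width / std_error

# Rejecting H0 at the 5% level corresponds to zero lying outside the 95% CI
ci_contains_zero = ci_low <= 0 <= ci_high

print(round(std_error, 3), round(implied_t_crit, 3), ci_contains_zero)
```

The implied critical value of about 2.048 is what a two-sided 𝑡 test at 𝛼 = 0.05 would use with roughly 28 degrees of freedom, although the slides do not state the sample size.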
SOLUTIONS (Cont’d)
c) Answer
The model:
𝑦̂ = 24848.204 + 9449.962𝑥
The interpretation:
For each additional year of experience, we expect salary to increase by
about MWK 9,449.96.
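The fitted model can be used for prediction as in the minimal Python sketch below. The function name predict_salary and the input of 5 years are illustrative choices; as with any regression, the input should lie within the range of experience values in the dataset, which is not shown in these slides.

```python
INTERCEPT = 24848.204  # a, from the fitted model
SLOPE = 9449.962       # b, expected change in salary per extra year of experience

def predict_salary(years_experience):
    """Predicted salary from the fitted model y-hat = a + b*x."""
    return INTERCEPT + SLOPE * years_experience

salary_5_years = predict_salary(5)
# The difference between consecutive years equals the slope
one_year_difference = predict_salary(6) - predict_salary(5)

print(round(salary_5_years, 3))       # 72098.014
print(round(one_year_difference, 3))  # 9449.962
```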
HELPFUL RESOURCE
• https://ezspss.com/simple-linear-regression-in-spss-including-interpretation/
Thank You.