Presentation4 - Bivariate Analysis and Simple Linear Regression

for data analysis

Uploaded by

cfchalimba

Analytics and Quantitative Techniques
Introduction to Bivariate Analysis and Simple
Linear Regression
Introduction to Bivariate Analysis and
Simple Linear Regression
• Welcome to the lecture on Bivariate Analysis and Simple Linear
Regression.
• Today, we will cover key concepts including:
• Correlation coefficient
• Simple linear regression model
• Interpretation of regression coefficients
• Goodness of fit (R-squared).
• We will use a real-world case study to understand the relationship
between advertising spend and sales revenue.
Introduction to Bivariate Analysis and
Simple Linear Regression…
• Bivariate analysis involves the analysis of two variables to
determine the empirical relationship between them.
• Simple linear regression is a statistical method used to
understand and quantify the relationship between two continuous
variables.
Correlation Coefficient
• The correlation coefficient (r) measures the strength and direction of the
linear relationship between two variables.
• It ranges from -1 to 1:
• r = 1: Perfect positive linear relationship
• r = -1: Perfect negative linear relationship
• r = 0: No linear relationship
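• Formula:
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \]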
• Round the value of r to three decimal places


Scatter plot
• A scatter plot is a type of data visualization that shows the
relationship between two quantitative variables.
• Each point on the scatter plot represents an observation in the
dataset

Subject   Age (x)   Blood Pressure (y)
A         43        128
B         48        120
C         56        135
D         61        143
E         67        141
F         70        152
F 70 152
Scatter plot …
• Scatter plots can reveal different types of
patterns (linear, non-linear).
• A linear pattern indicates a straight-line
relationship between variables.
• Non-linear patterns include curves or clusters.
• Direction of the relationship indicates whether
the variables increase together (positive
correlation) or if one increases while the other
decreases (negative correlation).
• No clear direction suggests no correlation
Scatter plot…
• A strong correlation means data points are closely packed around
the line of best fit.
• A weak correlation means data points are more scattered.
• Outliers are data points that are significantly different from other
observations.
• Outliers can distort the overall impression of the data's pattern
and strength.
• It is important to investigate outliers to understand their cause.
• Steps to interpret scatter plots: look for patterns, determine
direction, assess strength, and identify outliers.
Correlation Coefficient
• Example: given the following dataset of advertising spend and sales revenue:

Advertising Spend (x)   Sales Revenue (y)
100                     200
150                     300
200                     400
250                     450
300                     500

• Calculation Steps:
1. Calculate the sums of x, y, x^2, y^2, and xy.
2. Substitute into the formula to find r.
Correlation Coefficient…
• n = 5, Σx = 1,000, Σy = 1,850, Σx^2 = 225,000, Σy^2 = 742,500, Σxy = 407,500
• Numerator: 5 × 407,500 − 1,000 × 1,850 = 187,500
• Denominator: sqrt((5 × 225,000 − 1,000^2) × (5 × 742,500 − 1,850^2)) = sqrt(125,000 × 290,000) ≈ 190,394
• r = 187,500 / 190,394 ≈ 0.985
Correlation Coefficient…
The correlation coefficient is 0.985, indicating a very strong positive linear relationship between advertising spend and sales revenue.
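The two calculation steps for this example can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `pearson_r` is chosen here for illustration and does not come from the slides.

```python
from math import sqrt

# Dataset from the example: advertising spend (x) and sales revenue (y).
x = [100, 150, 200, 250, 300]
y = [200, 300, 400, 450, 500]


def pearson_r(xs, ys):
    """Pearson's r computed from the sums of x, y, x^2, y^2 and xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(x, y)
print(round(r, 3))  # 0.985, rounded to three decimal places
```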
Example 2
Consider a dataset of a company’s advertising spend ($) and corresponding sales ($).
Calculate the correlation coefficient and interpret the results.

Advertising Spend (x)   Sales (y)
10                      25
20                      45
30                      65
40                      70
50                      85
Example 2…
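One way to work Example 2 is with the deviation form of the correlation coefficient, which is algebraically equivalent to the sums-based formula. The sketch below is an illustration rather than the lecture's own solution, and the function name `pearson_r_deviation` is an assumption.

```python
from math import sqrt

# Example 2 dataset: advertising spend ($) and sales ($).
x = [10, 20, 30, 40, 50]
y = [25, 45, 65, 70, 85]


def pearson_r_deviation(xs, ys):
    """Pearson's r from deviations about the means of x and y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r_deviation(x, y)
print(round(r, 3))  # 0.982
```

A value of about 0.982 indicates a very strong positive linear relationship between advertising spend and sales in this dataset.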
Correlation Limitations
• Correlation does not imply causation.
• Correlation measures the strength of a relationship between two
variables but does not indicate that one variable causes the other.
• For example, ice cream sales and drowning incidents may have a high
correlation during summer months.
• However, this does not mean ice cream sales cause drowning
incidents.
• Both variables are likely influenced by a third variable: hot weather.
• As another example, children's shoe size and reading ability may be
correlated.
• However, larger shoe sizes do not cause better reading ability.
• Both variables are influenced by age.
Correlation Limitations…

• Outliers are data points that are significantly different from others in the
dataset.
• Outliers can distort the correlation coefficient, making it appear stronger or
weaker than it actually is.
• To establish causation, use additional methods such as controlled experiments or
longitudinal studies.
• To deal with outliers, perform robust statistical analyses, consider data
transformations, or run the analysis both with and without the outliers.
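Running the analysis with and without an outlier can be sketched as follows. The data are the Example 2 points plus one extra point, (60, 20), which is a hypothetical outlier invented purely for illustration; the helper name `correlation` is also an assumption.

```python
from math import sqrt


def correlation(points):
    """Pearson's r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    s_xy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    s_xx = sum((px - mean_x) ** 2 for px, _ in points)
    s_yy = sum((py - mean_y) ** 2 for _, py in points)
    return s_xy / sqrt(s_xx * s_yy)


# Example 2 data from the lecture.
clean = [(10, 25), (20, 45), (30, 65), (40, 70), (50, 85)]
# Hypothetical outlier (not from the lecture): high spend, very low sales.
with_outlier = clean + [(60, 20)]

r_without = correlation(clean)
r_with = correlation(with_outlier)
print(round(r_without, 3), round(r_with, 3))
```

Comparing the two values shows how much a single unusual observation can weaken the measured relationship, which is why the cause of each outlier should be investigated before it is kept or dropped.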
Correlation and causation
• Understand the nature of the linear relationship between the independent variable x and the
dependent variable y.
• When a hypothesis test indicates that a significant linear relationship exists between the variables
(that is, the null hypothesis has been rejected at a specific significance level α), the researcher
must consider the following five possibilities:
1. There is a direct cause-and-effect relationship between the variables, e.g. water causes plants
to grow.
2. There is a reverse cause-and-effect relationship between the variables, e.g. one might conclude
that excessive coffee consumption causes nervousness, when in fact an extremely nervous
person craves coffee to calm the nerves.
3. The relationship between the variables may be caused by a third variable, e.g. the number of
drowning deaths correlates with the number of soft drinks consumed in the summer, but both
are driven by high temperatures.
4. There may be a complexity of interrelationships among many variables, e.g. a student's
secondary school grades correlate with college grades, but hours of study, motivation,
IQ and other factors may also be involved.
5. The relationship may be coincidental.
Linear regression
• In studying relationships between two variables, collect the data, then construct
a scatter plot to determine the nature of the relationship:
1. Positive linear relationship
2. Negative linear relationship
3. Curvilinear relationship
4. No discernible relationship
• Next, compute the correlation coefficient and test whether it is significant.
• If the correlation is significant, proceed to calculate the regression line,
i.e. the data's line of best fit.
• The purpose of the regression line is to enable the researcher to see the trend
and make predictions on the basis of the data.
• Regression analysis is of little use if the correlation is not significant,
because predictions based on such data are unreliable.
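The slides do not name the significance test. One common choice, assumed here, is the t-test for a correlation coefficient, with t = r·sqrt(n − 2) / sqrt(1 − r^2) and n − 2 degrees of freedom. The sketch below uses hypothetical inputs (r = 0.9, n = 10) and a critical value read from a standard t table; none of these numbers come from the lecture.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing H0: no linear correlation (df = n - 2)."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# Hypothetical sample correlation and sample size, for illustration only.
r = 0.9
n = 10

# Two-tailed critical value for alpha = 0.05 and df = 8, from a t table.
T_CRITICAL_DF8 = 2.306

t_stat = correlation_t_statistic(r, n)
significant = abs(t_stat) > T_CRITICAL_DF8
print(round(t_stat, 2), significant)  # 5.84 True
```

Because the test statistic exceeds the critical value, the null hypothesis would be rejected in this hypothetical case and it would be reasonable to go on to fit a regression line.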
Simple Linear Regression Model
• Simple linear regression aims to model the relationship between a
dependent variable (y) and an independent variable (x) using a
linear equation.
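• Model Equation:
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
where β0 is the intercept, β1 is the slope, and ε is the error term.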
Regression model
• Relation between variables where changes in some variables
may “explain” or possibly “cause” changes in other variables.
• Explanatory variables are termed the independent variables
and the variables to be explained are termed the dependent
variables.
• A regression model estimates the nature of the relationship
between the independent and dependent variables:
• Size of the relationship, i.e. the change in the dependent variable that
results from changes in the independent variables.
• Strength of the relationship.
• Statistical significance of the relationship.
Examples
• Dependent variable is retail price of gasoline in Regina – independent
variable is the price of crude oil.
• Dependent variable is employment income – independent variables might
be hours of work, education, occupation, sex, age, region, years of
experience, unionization status, etc.
• Price of a product and quantity produced or sold:
• Quantity sold affected by price. Dependent variable is quantity of
product sold – independent variable is price.
• Price affected by quantity offered for sale. Dependent variable is price –
independent variable is quantity sold.
[Figure: Crude oil price index (1997 = 100, left axis) and regular gasoline prices in Regina (cents per litre, right axis), monthly, January 1981 to January 2008. Source: CANSIM II Database (Vectors v1576530 and v735048 respectively).]
Bivariate and multivariate models
Bivariate or simple regression model
(Education) x → y (Income)

Multivariate or multiple regression model
(Education) x1, (Sex) x2, (Experience) x3, (Age) x4 → y (Income)

Model with simultaneous relationship
Price of wheat ↔ Quantity of wheat produced
Assumptions of the Model
• For the simple linear regression model to be valid, the following
assumptions must be satisfied:
1. Linearity: The relationship between the independent and
dependent variables is linear.
2. Independence: The residuals (errors) are independent.
3. Homoscedasticity: The residuals have constant variance at
every level of x.
4. Normality: The residuals are normally distributed.
Fitting a Simple Linear Regression Model
• Using the same dataset, we will fit a simple linear regression
model.

Advertising Spend (x)   Sales Revenue (y)
100                     200
150                     300
200                     400
250                     450
300                     500

• Steps:
1. Calculate the means of x and y.
2. Calculate the slope (β1).
3. Calculate the intercept (β0).
4. Formula:
\[ \beta_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x} \]
Fitting a Simple Linear Regression Model…
Calculation Steps:
• Means: x̄ = 1,000 / 5 = 200 and ȳ = 1,850 / 5 = 370
• Σ(x − x̄)(y − ȳ) = 37,500 and Σ(x − x̄)^2 = 25,000
• Slope: β1 = 37,500 / 25,000 = 1.5
• Intercept: β0 = 370 − 1.5 × 200 = 70

Regression Equation: y = 70 + 1.5x

Interpretation of Regression Coefficients
• Intercept (β0): The expected value of y when x is zero. In this case,
70, so the model predicts sales revenue of 70 with no advertising spend.
Because x = 0 lies outside the observed range (100 to 300), this is an
extrapolation, and the intercept may not be meaningful in all contexts.
• Slope (β1): The change in y for a one-unit change in x. Here, for
every additional unit of advertising spend, sales revenue
increases by 1.5 units on average.
Goodness of Fit (R-squared)
• R-squared measures the proportion of variance in the
dependent variable that is predictable from the independent
variable.
• It ranges from 0 to 1:
• R^2 = 1: Perfect fit
• R^2 = 0: No fit
• Formula:
\[ R^2 = 1 - \frac{RSS}{TSS}, \qquad RSS = \sum (y - \hat{y})^2, \qquad TSS = \sum (y - \bar{y})^2 \]
Goodness of Fit (R-squared)
• Calculation Steps:
1. Compute the predicted values.
2. Compute the residual sum of squares (RSS) and total sum of
squares (TSS).
3. Calculate R^2 = 1 − RSS / TSS.

Advertising Spend (x)   Sales Revenue (y)   Predicted Sales Revenue   Residual
100                     200                 70 + 1.5*100 = 220        -20
150                     300                 70 + 1.5*150 = 295        5
200                     400                 70 + 1.5*200 = 370        30
250                     450                 70 + 1.5*250 = 445        5
300                     500                 70 + 1.5*300 = 520        -20

• RSS = (-20)^2 + 5^2 + 30^2 + 5^2 + (-20)^2 = 1,750
• TSS = Σ(y − ȳ)^2 = 58,000
• R^2 = 1 − 1,750 / 58,000 ≈ 0.970

The R^2 value of 0.970 indicates that approximately 97.0% of the variance in
sales revenue is explained by advertising spend.
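The fitting steps and the goodness-of-fit calculation can be reproduced with a short script. This is a minimal sketch of ordinary least squares for one predictor, using only the standard library; the function names `fit_simple_linear_regression` and `r_squared` are chosen here for illustration.

```python
# Dataset from the lecture: advertising spend (x) and sales revenue (y).
x = [100, 150, 200, 250, 300]
y = [200, 300, 400, 450, 500]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """R^2 = 1 - RSS/TSS for the fitted line."""
    mean_y = sum(ys) / len(ys)
    predicted = [intercept + slope * a for a in xs]
    rss = sum((obs - pred) ** 2 for obs, pred in zip(ys, predicted))
    tss = sum((obs - mean_y) ** 2 for obs in ys)
    return 1 - rss / tss


b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)  # 70.0 1.5
print(round(r_squared(x, y, b0, b1), 3))  # 0.97
```

The same two functions can be applied to any paired dataset of equal length, provided the x values are not all identical (otherwise the slope is undefined).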
Exercise - Employee Performance and
Training Hours
• Scenario: Telekom Networks Malawi (TNM) operates a call centre staffed by 100
call agents. TNM has invested a large sum of money in training the call agents
and now wants to analyze the impact of training hours on employee performance.
• Conduct the following analyses:
• Correlation Coefficient: To see if there is a relationship between training
hours and performance ratings.
• Regression Model: To predict performance based on training hours.
• Interpretation: Helps in designing effective training programs.
• Goodness of Fit: To measure how well training explains performance variations.
Employee ID Training Hours (x) Performance Rating (y)
1 5 60
2 7 65
3 10 70
4 15 80
5 20 85
6 25 90
7 30 95
8 35 100
9 40 105
10 45 110
Summary and Key Takeaways
• We covered the correlation coefficient, simple linear regression,
interpretation of regression coefficients, and goodness of fit.
• We used a case study on the relationship between advertising
spend and sales revenue.
• Key takeaway: Regression analysis is a powerful tool for
understanding relationships between variables and making
predictions.
