DataAnalysis1 Lectures12and13

MODULE 13: Simple Linear Regression

Regression Analysis
What is it?

Regression analysis is the process of studying how an outcome variable changes when one or more predictor variables change. In short, it is estimating relationships between variables.

If you recall from our review of statistical testing, many statistical tests, including regression analysis, require you to provide a formula. This formula dictates what outcome you are trying to predict and what your possible predictors are.

Your Test Formula


In a linear regression analysis, just like in many other statistical tests, you will
need to identify an outcome variable (also called response or dependent
variable) and one or more predictor variables (also called independent
variables). This is extremely easy to do.

As shared in our previous module, the formula looks something like this:

> outcome_variable ~ predictor_variable1 + predictor_variable2

This symbol (~), called a tilde, is what separates and defines your outcome and predictor variables.

Regression Analysis
Regression tests count as parametric tests. That means that you must be able to meet certain statistical assumptions about your data. Below are the three basic statistical assumptions that your data must meet:

1. Independence of Observations. The observations or variables that you include in your test are not related. For example, multiple measurements of a single test subject are not independent, but measurements of multiple different test subjects are.

2. Homogeneity of Variance. The variance within each group being compared is similar among all groups. If one group has more variation than others, it will limit the test's effectiveness. This means that the distribution or "spread" of datapoints is equal.

3. Normality of Data. The data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data.

Assumptions Review
What Does This Mean Again?
Independence of Observations

By collecting your data through valid sampling methods, the measurements made with your sample should not have been influenced by measurements from another sample. There should also be no correlation between your independent variables.

Assumptions Review
What Does This Mean Again?
Homogeneity of Variance

Variance is the measure of variability. Variability refers to how spread apart data points lie from the center of a distribution. Basically, the groups you are comparing have similar spreads in their data.

Assumptions Review
What Does This Mean Again?
Normality of Data

Provided that you are working with quantitative data, it must be normally distributed. It is wise to check for normality using the Shapiro-Wilk test to ensure that your data is normal. Most things in the natural world follow a normal distribution (e.g. height, birth weight, reading ability).
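As a quick illustration (an addition, not from the original slides), a Shapiro-Wilk check in R is one line; dataframe and variable are placeholder names standing in for your own data:

> shapiro.test(dataframe$variable) # p > 0.05 means we fail to reject normality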

Regression Analysis
The Line in Linear

Linear regressions try to fit a line to your observed data. That means they employ an additional assumption: your data has a linear relationship!

Types of Linear Regression:

1. Simple Linear Regression
• One response, one predictor

2. Multiple Linear Regression
• One response, multiple predictors

[Figure: examples of data that do and do not follow a linear pattern. Source: Statistics Solutions]

Simple Linear Regression


Let’s say that you’re a herpetologist who studies reptiles and amphibians. You want to know if there’s a relationship between the weight of alligators and the length of their snouts. You have data from a sample of alligators:

> alligator <- data.frame(
+   SnoutLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
+                   3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
+   Weight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
+              3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25))

What’s Happening in this Code?


Here, we are creating a dataframe using the data.frame() function and assigning it to the object name alligator. The dataframe contains two columns, SnoutLength and Weight, each holding a numeric vector of values.
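If you are following along, a quick structural check (not shown in the original slides) confirms the dataframe was built correctly:

> str(alligator) # should report 15 obs. of 2 numeric variables: SnoutLength and Weight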

Example Scenario
Now that we have the dataframe
set up, we can write our formula.
We have to decide what our
response and predictor variables
are. We should start by forming our
research question.

Example Scenario
The question we would like to ask is: Can snout length predict an alligator’s weight? In order to frame that into a formula, we’d have to write it using the basic format below:

> outcome ~ predictor # Your basic format for a test formula

> Weight ~ SnoutLength # The test formula for our scenario

If we flip snout length and weight in the research question, the positions of the variables in the formula also switch.

Fitting your Model


In order to run a simple linear regression, we only need the linear model, or lm(), function. We have to fit the model to the dataset, which means we need to supply the necessary parameters in order for the model to run.

Here is the basic code structure of a simple linear regression using lm():

> plot <- lm(outcome ~ predictor, data = dataframe)

This is what the code looks like once we fit the model to the dataset:

> plot <- lm(Weight ~ SnoutLength, data = alligator)
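As a quick visual check (a supplementary sketch, not from the slides), you can plot the data and overlay the fitted line. The model is refit here into an object named fit, which avoids reusing the name plot:

> fit <- lm(Weight ~ SnoutLength, data = alligator) # same model, clearer name
> plot(Weight ~ SnoutLength, data = alligator)      # scatterplot of the raw data
> abline(fit)                                       # draw the fitted regression line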

Reading the Results


In order to read the results, you need to use the summary() function on the object containing your linear model. Like this:

> plot <- lm(Weight ~ SnoutLength, data = alligator)
> summary(plot)

This applies to most methods of statistical testing. If you want to view the results, make sure to take the steps above.

Reading the Results


Once you run the code for viewing the results, you should see an output like this in your R Notebook:

[Figure: summary() output for the alligator model]

We can see the p-value in the coefficients table (the Pr(>|t|) column) and at the bottom of the results. The p-value is below 0.05, which means that the results are statistically significant.

Reading the Results


The estimate column displays the regression coefficient, which is the estimated effect. (Note that the regression coefficient is not the same thing as the R² value, which instead measures how much of the variation in the outcome the model explains.) The value given tells us that for every unit increase in snout length, there is a 3.4311 unit increase in weight.

Reading the Results


The std. error column displays
the standard error of the estimate.
This value tells us the amount of
variation there is in our estimate of
the relationship between alligator
weight and snout length.

Reading the Results


The t-value column displays the test statistic. It shows how closely the distribution of your data matches the distribution expected under the null hypothesis. The larger the test statistic, the less likely it is that the results occurred by chance.

MODULE 14: Multiple Linear Regression and GLMs

Line of Best Fit


Also called the trendline, the line of best fit shows the best location in which a linear equation would fit a set of data on a scatterplot. This is what happens when you perform a linear regression.

[Figure: three scatterplots. 1: A perfect fit! (never happens). 2: What if we move a point down? 3: What if there’s nothing there? Source: statisticshowto.com]

Residual Values
A residual (e) is the difference between a data point and the regression line. It is calculated using this formula:

Residual (e) = observed value (y) − predicted value (ŷ)

Remember: You do not have to calculate residuals yourself; your model does it for you.

[Figure: scatterplot with residuals drawn as vertical distances from the regression line. Source: Statology]
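As a supplementary sketch (not in the original slides), R returns the predicted values and residuals directly from the fitted model; fit is the model object from the earlier sketch:

> fitted(fit)    # the predicted (y-hat) value for each alligator
> residuals(fit) # e = observed minus predicted, one per data point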

Residual Values
Residuals are at times called errors, but not because there is a mistake or because something is wrong with the model. The term simply points to unexplained differences in your data.

[Figure: residual plot. Source: Statology]

Residual Values
Interpreting Residuals

You can actually tell a lot about your linear regression based on the residuals, without even seeing the plot first.

Min 1Q Median 3Q Max
-0.24348 -0.03186 0.03740 0.07727 0.12669

• Min: the point furthest below the regression line
• 1Q: 25% of residuals are less than this number
• 3Q: 25% of residuals are greater than this number
• Max: the point furthest above the regression line
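These five numbers are simply the quantiles of the residuals, so you can reproduce them yourself (a supplementary sketch, again using the fit object from earlier):

> quantile(residuals(fit)) # 0%, 25%, 50%, 75%, 100%: Min, 1Q, Median, 3Q, Max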

Residual Values
➊ The residuals should be symmetrically distributed in terms of magnitude. Symmetrical in magnitude means the absolute values are similar, disregarding positive or negative signs.

➋ The median should be as close to zero as possible. A median close to zero implies that the model is not skewed one way or another.

Min 1Q Median 3Q Max
-0.24348 -0.03186 0.03740 0.07727 0.12669

Residual Values
Let’s take a look at an example of skewed residuals:

Min 1Q Median 3Q Max
-279.72 -98.15 -47.17 60.29 890.42

These numbers tell us that there is an outlier in our data. We can see it when we plot it out.

[Figure: two scatterplots. 1: With the outlier… 2: Without the outlier!]

Multiple Linear Regression

A Multiple Linear Regression estimates the relationship between one outcome variable and two or more predictor variables. This is in contrast to a Simple Linear Regression, in which you work with only one predictor variable.

It’s great for answering these two questions:

1. How strong is the relationship between variables?


2. What is the value of the dependent variable at a certain value of the
independent variables?
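As a supplementary sketch (not part of the original module), here is what a multiple linear regression looks like in R using the built-in mtcars dataset, predicting fuel efficiency from weight and horsepower:

> model <- lm(mpg ~ wt + hp, data = mtcars) # one outcome, two predictors
> summary(model)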

Generalized LMs
What if your dependent variable does not have a normal distribution? You still have options: you can use a Generalized Linear Model (GLM). For these types of linear models, you'll need to choose the correct distribution and link function, set through the family argument of R's glm() function, as sketched after the list below.

1. Poisson. You have count data whose maximal value is unknown.


2. Binomial. You have count data whose maximal value is known.
3. Gamma. Your data has continuous values that are all >0.
4. Gaussian. Your data has continuous values that might be <0.
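A generic sketch of those four choices (placeholder names df, outcome, and predictor stand in for your own data):

> glm(outcome ~ predictor, data = df, family = "poisson")                     # 1. counts, maximum unknown
> glm(cbind(successes, failures) ~ predictor, data = df, family = "binomial") # 2. counts, maximum known
> glm(outcome ~ predictor, data = df, family = Gamma(link = "log"))           # 3. continuous, all > 0
> glm(outcome ~ predictor, data = df, family = "gaussian")                    # 4. continuous, may be < 0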

Logistic Regression
Let’s learn how GLMs are used by playing out a scenario using logistic regression. A Logistic Regression is a type of regression model that is used when you have a binary outcome variable. For example, it can estimate the probability of an event occurring: Will it, or will it not?

We can combine this with our topic on multiple linear regression by tackling a
scenario with multiple independent variables. Let’s start.

Logistic Regression
A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school. The outcome variable, admit or don’t admit, is binary.

You have the option to follow along in this demonstration if you would like to.
You can load the data for this scenario with this code :

> admissions <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
> head(admissions) # View the first 6 rows of the dataframe

Logistic Regression
Your binary response variable is called admit. It’s safe to say that 1 refers to yes and 0 refers to no.

##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2

GRE and GPA are continuous variables, while rank is categorical (values 1-4). Institutions with a rank of 1 have the highest prestige.

Logistic Regression
First, we need to convert the rank column into a factor so that it will be
treated as a categorical variable.

> admissions$rank <- factor(admissions$rank)

Once we have done this, we can use the glm() function to fit our model with
the necessary parameters. We need to use summary() to view the results.

> model <- glm(admit ~ gre + gpa + rank, data = admissions, family = "binomial")
> summary(model)
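One common follow-up not covered in the slides: the coefficients of a logistic regression are on the log-odds scale, so exponentiating them gives odds ratios, and predict() turns the model into a probability for a hypothetical applicant (the values below are made up for illustration):

> exp(coef(model)) # convert log-odds coefficients to odds ratios
> applicant <- data.frame(gre = 600, gpa = 3.5, rank = factor(2, levels = 1:4))
> predict(model, applicant, type = "response") # predicted probability of admission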

Thank You
