DataAnalysis1 Lectures12and13

MODULE 13: Simple Linear Regression

Regression Analysis
What is it?

Regression analysis is the process of studying how an outcome variable changes when one or more predictor variables change. In short, it is estimating relationships between variables.

If you recall from our review of statistical testing, many statistical tests, including regression analysis, require you to provide a formula. This formula dictates what outcome you are trying to predict and what your possible predictors are.

Your Test Formula


In a linear regression analysis, just like in many other statistical tests, you will
need to identify an outcome variable (also called response or dependent
variable) and one or more predictor variables (also called independent
variables). This is extremely easy to do.

As shared in our previous module, the formula looks something like this:

> outcome_variable ~ predictor_variable1 + predictor_variable2

This symbol (~), called a tilde, is what separates and defines your outcome and predictor variables.

Regression Analysis
Regression tests count as parametric tests. That means that you must be able to meet certain statistical assumptions about your data. Below are the three basic statistical assumptions that your data must meet:

1. Independence of Observations. The observations or variables that you include in your test are not related. For example, multiple measurements of a single test subject are not independent, but measurements of multiple different test subjects are.

2. Homogeneity of Variance. The variance within each group being compared is similar among all groups. If one group has more variation than others, it will limit the test's effectiveness. This means that the distribution or "spread" of datapoints is equal.

3. Normality of Data. The data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data.

Assumptions Review
What Does This Mean Again?
Independence of Observations

By collecting your data through valid sampling methods, the measurements made with your sample should not have been influenced by measurements from another sample. There should also be no correlation between your independent variables.

Assumptions Review
What Does This Mean Again?
Homogeneity of Variance

Variance is the measure of variability. Variability refers to how spread apart data points lie from the center of a distribution. Basically, the groups you are comparing have similar spreads in their data.

Assumptions Review
What Does This Mean Again?
Normality of Data

Provided that you are working with quantitative data, it must be normally distributed. It is wise to check for normality using the Shapiro-Wilk test to ensure that your data is normal. Most things in the natural world follow a normal distribution (e.g. height, birth weight, reading ability).
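As a quick illustration (an addition, not from the original slides), a Shapiro-Wilk check in R is one line; dataframe and variable are placeholder names standing in for your own data:

> shapiro.test(dataframe$variable) # p > 0.05 means we fail to reject normality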

Regression Analysis
The Line in Linear

Linear regressions try to fit a line to your observed data. That means they employ an additional assumption: your data has a linear relationship!

Types of Linear Regression:

1. Simple Linear Regression
• One response, one predictor

2. Multiple Linear Regression
• One response, multiple predictors

[Figure: examples of data that do and do not follow a linear pattern. Source: Statistics Solutions]

Simple Linear Regression


Let’s say that you’re a herpetologist who studies reptiles and amphibians. You want to know if there’s a relationship between the weight of alligators and the length of their snouts. You have data from a sample of alligators:

> alligator <- data.frame(
+   SnoutLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
+                   3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
+   Weight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
+              3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25))

What’s Happening in this Code?


Here, we are creating a dataframe using the data.frame() function and assigning it to the object name alligator. The dataframe contains two columns, SnoutLength and Weight, each holding a numeric vector of values.
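If you are following along, a quick structural check (not shown in the original slides) confirms the dataframe was built correctly:

> str(alligator) # should report 15 obs. of 2 numeric variables: SnoutLength and Weight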

Example Scenario
Now that we have the dataframe
set up, we can write our formula.
We have to decide what our
response and predictor variables
are. We should start by forming our
research question.

Example Scenario
The question we would like to ask is: Can snout length predict an alligator’s weight? In order to frame that into a formula, we’d have to write it using the basic format below:

> outcome ~ predictor # Your basic format for a test formula

> Weight ~ SnoutLength # The test formula for our scenario

If we flip snout length and weight in the research question, the positions of the variables in the formula also switch.

Fitting your Model


In order to run a simple linear regression, we only need the linear model, or lm(), function. We have to fit the model to the dataset, which means we need to supply the necessary parameters in order for the model to run.

Here is the basic code structure of a simple linear regression using lm():

> plot <- lm(outcome ~ predictor, data = dataframe)

This is what the code looks like once we fit the model to the dataset:

> plot <- lm(Weight ~ SnoutLength, data = alligator)
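As a quick visual check (a supplementary sketch, not from the slides), you can plot the data and overlay the fitted line. The model is refit here into an object named fit, which avoids reusing the name plot:

> fit <- lm(Weight ~ SnoutLength, data = alligator) # same model, clearer name
> plot(Weight ~ SnoutLength, data = alligator)      # scatterplot of the raw data
> abline(fit)                                       # draw the fitted regression line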

Reading the Results


In order to read the results, you need to use the summary() function on the object containing your linear model. Like this:

> plot <- lm(Weight ~ SnoutLength, data = alligator)
> summary(plot)

This applies to most methods of statistical testing. If you want to view the results, make sure to take the steps above.

Reading the Results


Once you run the code for viewing the results, you should see an output like this in your R Notebook:

[Figure: summary() output for the alligator model]

We can see the p-value in the coefficients table (the Pr(>|t|) column) and at the bottom of the results. The p-value is below 0.05, which means that the results are statistically significant.

Reading the Results


The estimate column displays the regression coefficient, which is the estimated effect. (Note that the regression coefficient is not the same thing as the R² value, which instead measures how much of the variation in the outcome the model explains.) The value given tells us that for every unit increase in snout length, there is a 3.4311 unit increase in weight.

Reading the Results


The std. error column displays
the standard error of the estimate.
This value tells us the amount of
variation there is in our estimate of
the relationship between alligator
weight and snout length.

Reading the Results


The t-value column displays the test statistic. It shows how closely the distribution of your data matches the distribution expected under the null hypothesis. The larger the test statistic, the less likely it is that the results occurred by chance.

MODULE 14: Multiple Linear Regression and GLMs

Line of Best Fit


Also called the trendline, the line of best fit shows the best location in which a linear equation would fit a set of data on a scatterplot. This is what happens when you perform a linear regression.

[Figure: three scatterplots. 1: A perfect fit! (never happens). 2: What if we move a point down? 3: What if there’s nothing there? Source: statisticshowto.com]

Residual Values
A residual (e) is the difference between a data point and the regression line. It is calculated using this formula:

Residual (e) = observed value (y) − predicted value (ŷ)

Remember: You do not have to calculate residuals yourself; your model does it for you.

[Figure: scatterplot with residuals drawn as vertical distances from the regression line. Source: Statology]
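As a supplementary sketch (not in the original slides), R returns the predicted values and residuals directly from the fitted model; fit is the model object from the earlier sketch:

> fitted(fit)    # the predicted (y-hat) value for each alligator
> residuals(fit) # e = observed minus predicted, one per data point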

Residual Values
Residuals are at times called errors, but not because there is a mistake or because something is wrong with the model. The term simply points to unexplained differences in your data.

[Figure: residual plot. Source: Statology]

Residual Values
Interpreting Residuals

You can actually tell a lot about your linear regression based on the residuals, without even seeing the plot first.

Min 1Q Median 3Q Max
-0.24348 -0.03186 0.03740 0.07727 0.12669

• Min: the point furthest below the regression line
• 1Q: 25% of residuals are less than this number
• 3Q: 25% of residuals are greater than this number
• Max: the point furthest above the regression line
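These five numbers are simply the quantiles of the residuals, so you can reproduce them yourself (a supplementary sketch, again using the fit object from earlier):

> quantile(residuals(fit)) # 0%, 25%, 50%, 75%, 100%: Min, 1Q, Median, 3Q, Max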

Residual Values
➊ The residuals should be symmetrically distributed in terms of magnitude. Symmetrical in magnitude means the absolute values are similar, disregarding positive or negative signs.

➋ The median should be as close to zero as possible. A median close to zero implies that the model is not skewed one way or another.

Min 1Q Median 3Q Max
-0.24348 -0.03186 0.03740 0.07727 0.12669

Residual Values
Let’s take a look at an example of skewed residuals:

Min 1Q Median 3Q Max
-279.72 -98.15 -47.17 60.29 890.42

These numbers tell us that there is an outlier in our data. We can see it when we plot it out.

[Figure: two scatterplots. 1: With the outlier… 2: Without the outlier!]

Multiple Linear Regression

A Multiple Linear Regression estimates the relationship between one outcome variable and two or more predictor variables. This is in contrast to a Simple Linear Regression, in which you work with only one predictor variable.

It’s great for answering these two questions:

1. How strong is the relationship between variables?


2. What is the value of the dependent variable at a certain value of the
independent variables?
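As a supplementary sketch (not part of the original module), here is what a multiple linear regression looks like in R using the built-in mtcars dataset, predicting fuel efficiency from weight and horsepower:

> model <- lm(mpg ~ wt + hp, data = mtcars) # one outcome, two predictors
> summary(model)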

Generalized LMs
What if your dependent variable does not have a normal distribution? You still have options: you can use a Generalized Linear Model (GLM). For these types of linear models, you'll need to choose the correct distribution and link function, set through the family argument of R's glm() function, as sketched after the list below.

1. Poisson. You have count data whose maximal value is unknown.


2. Binomial. You have count data whose maximal value is known.
3. Gamma. Your data has continuous values that are all >0.
4. Gaussian. Your data has continuous values that might be <0.
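A generic sketch of those four choices (placeholder names df, outcome, and predictor stand in for your own data):

> glm(outcome ~ predictor, data = df, family = "poisson")                     # 1. counts, maximum unknown
> glm(cbind(successes, failures) ~ predictor, data = df, family = "binomial") # 2. counts, maximum known
> glm(outcome ~ predictor, data = df, family = Gamma(link = "log"))           # 3. continuous, all > 0
> glm(outcome ~ predictor, data = df, family = "gaussian")                    # 4. continuous, may be < 0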

Logistic Regression
Let’s learn how GLMs are used by playing out a scenario using logistic regression. A Logistic Regression is a type of regression model that is used when you have a binary outcome variable. For example, it can estimate the probability of an event occurring: Will it, or will it not?

We can combine this with our topic on multiple linear regression by tackling a
scenario with multiple independent variables. Let’s start.

Logistic Regression
A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school. The outcome variable, admit or don’t admit, is binary.

You have the option to follow along in this demonstration if you would like to.
You can load the data for this scenario with this code :

> admissions <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
> head(admissions) # View the first 6 rows of the dataframe

Logistic Regression
Your binary response variable is called admit. It’s safe to say that 1 refers to yes and 0 refers to no.

##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2

GRE and GPA are continuous variables, while rank is categorical (values 1-4). Institutions with a rank of 1 have the highest prestige.

Logistic Regression
First, we need to convert the rank column into a factor so that it will be
treated as a categorical variable.

> admissions$rank <- factor(admissions$rank)

Once we have done this, we can use the glm() function to fit our model with
the necessary parameters. We need to use summary() to view the results.

> model <- glm(admit ~ gre + gpa + rank, data = admissions, family = "binomial")
> summary(model)
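One common follow-up not covered in the slides: the coefficients of a logistic regression are on the log-odds scale, so exponentiating them gives odds ratios, and predict() turns the model into a probability for a hypothetical applicant (the values below are made up for illustration):

> exp(coef(model)) # convert log-odds coefficients to odds ratios
> applicant <- data.frame(gre = 600, gpa = 3.5, rank = factor(2, levels = 1:4))
> predict(model, applicant, type = "response") # predicted probability of admission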

Thank You
