Data Analysis 1: Lectures 12 and 13
MODULE 13
Simple Linear Regression
Regression Analysis
What is it?
If you recall from our review of statistical testing, many statistical tests, including regression analysis, require you to provide a formula. This formula dictates what outcome you are trying to predict and what your possible predictors are.
As shared in our previous module, the formula looks something like this:

outcome ~ predictor

The curly symbol (~), called a tilde, is what separates and defines your outcome and predictor variables: the outcome goes on the left, the predictors on the right.
Regression Analysis
Regression tests count as parametric tests. That means your data must meet certain statistical assumptions. Below are the three basic statistical assumptions your data must meet (a short R sketch for checking them follows the list):

1. Independence of observations: the observations or variables that you include in your test are not related. For example, multiple measurements of a single test subject are not independent, but measurements of multiple different test subjects are.
2. Homogeneity of variance: the variance within each group being compared is similar among all groups. If one group has more variation than the others, it will limit the test's effectiveness. This means that the distribution or "spread" of data points should be roughly equal.
3. Normality: the data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data.
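A minimal R sketch of quick checks for the last two assumptions, using base functions; group1 and group2 are hypothetical numeric vectors invented for illustration:

# Hypothetical example data: two groups of measurements.
set.seed(42)
group1 <- rnorm(30, mean = 10, sd = 2)
group2 <- rnorm(30, mean = 12, sd = 2)

# Normality: Shapiro-Wilk test (p > 0.05 suggests no evidence against normality).
shapiro.test(group1)

# Homogeneity of variance: F test comparing the two groups' variances.
var.test(group1, group2)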
Assumptions Review
What Does This Mean Again?
Independence of observations: by collecting your data through valid sampling methods, you help ensure that the observations are unrelated to one another.
Homogeneity of variance: variance is the measure of variability; it should be similar across all of the groups you compare.
Normality of data: provided that you are working with quantitative data, it should follow a normal distribution (the bell curve).
Regression Analysis
The Line in Linear
Linear regression tries to fit a line to your observed data. That means it employs an additional assumption: your data has a linear relationship!
Source: Statistics Solutions
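A quick way to eyeball this extra assumption is a scatterplot with the fitted line drawn over it. A minimal sketch, using a hypothetical data frame df with made-up columns x and y:

# Hypothetical data with a roughly linear relationship.
set.seed(7)
df <- data.frame(x = runif(40, 0, 10))
df$y <- 2 + 0.5 * df$x + rnorm(40)

plot(y ~ x, data = df)          # look for a roughly straight-line pattern
abline(lm(y ~ x, data = df))    # overlay the fitted regression line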
Example Scenario
Now that we have the data frame set up, we can write our formula. We have to decide what our response and predictor variables are. We should start by forming our research question.
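The slides' data-frame setup isn't reproduced in this extract; a minimal hypothetical stand-in (column names and values invented for illustration) might look like this:

alligators <- data.frame(
  snout_length = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76),
  weight       = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50)
)
head(alligators)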
Example Scenario
The question we would like to ask is: can snout length predict an alligator's weight? In order to frame that as a formula, we'd have to write it using the basic format below:
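Using the outcome ~ predictor format from earlier, with the hypothetical column names from the sketch above:

weight ~ snout_length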
If I decide to flip snout length and weight in the research question, then the positions of the variables in the formula also switch.
Here is the basic code structure of a simple linear regression using lm(), and what the code looks like once we fit the model to the dataset; see the sketch below. This applies to most methods of statistical testing: if you want to view the results, make sure to take the same two steps of fitting the model and then summarizing it.
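A sketch of that structure, continuing the hypothetical alligators data frame from above:

# Fit the simple linear regression: weight as the response, snout length as the predictor.
model <- lm(weight ~ snout_length, data = alligators)

# Fitting alone only stores the results; call summary() to actually view them.
summary(model)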
MODULE 14
Multiple Linear Regression and GLMs
[Figure: three fitted-line panels. 1: A perfect fit! (never happens) 2: What if we move a point down? 3: What if there's nothing there?]
Residual Values
A residual (e) is the difference between a data point and the regression line. It is calculated using this formula:

e = y - ŷ (the observed value minus the value predicted by the line)
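In R you can pull the residuals straight from a fitted model; a quick sketch using the model object from the earlier alligator example:

resid(model)                        # one residual per observation
alligators$weight - fitted(model)   # the same numbers: observed minus predicted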
Residual Values
Residuals are at times called errors, but not because there's a mistake or because something is wrong with the model; the term just points at unexplained differences in your data.
Source: Statology
Residual Values
Interpreting Residuals
In the model summary, residuals are described by a five-number summary:
Min: the point furthest below the regression line
1Q: 25% of residuals are less than this number
Median: the middle residual
3Q: 25% of residuals are greater than this number
Max: the point furthest above the regression line
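Those numbers are just the quartiles of the residuals, which you can also compute directly from the earlier model object:

quantile(resid(model))   # Min, 1Q (25%), Median, 3Q (75%), Max of the residuals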
Residual Values
➊ Residuals should be symmetrically distributed in terms of magnitude. Symmetrical in magnitude means the absolute values are similar, disregarding positive or negative signs.
Residual Values
Let's take a look at an example of skewed residuals:
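The slide's figure isn't reproduced here; as a stand-in, this sketch simulates data with a curved relationship, so a straight-line fit leaves clearly skewed residuals:

set.seed(1)
x <- runif(100, 0, 3)
y <- exp(x) + rnorm(100)      # the true relationship is curved, not linear
skewed_fit <- lm(y ~ x)
quantile(resid(skewed_fit))   # the quartiles are far from symmetric around zero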
Multiple Linear Regression
A multiple linear regression estimates the relationship between one outcome variable and two or more predictor variables. This is in contrast to the simple linear regression, which works with only one predictor variable.
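In lm(), extra predictors are added with +. A minimal sketch, assuming a hypothetical tail_length column alongside the earlier alligator variables:

# Adding a second (hypothetical) predictor to the earlier alligator sketch.
alligators$tail_length <- c(1.9, 1.7, 2.2, 1.6, 1.8, 1.9, 1.7, 1.8)

multi_model <- lm(weight ~ snout_length + tail_length, data = alligators)
summary(multi_model)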
Generalized LMs
What if your dependent variable does not have a normal distribution? You still have options: you're better off using a Generalized Linear Model (GLM). For these types of linear models, you'll need to choose the correct distribution and link function.
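For reference, a few of the distribution/link pairings built into R's glm(); which one is correct depends on your outcome variable (the data frame here is made up just so the calls run):

# Hypothetical toy data, just to make the calls runnable.
df <- data.frame(x = rnorm(50),
                 hits = rbinom(50, 1, 0.5),    # binary outcome
                 counts = rpois(50, 3),        # count outcome
                 size = rnorm(50, 10))         # continuous outcome

glm(size ~ x, data = df, family = gaussian(link = "identity"))  # normal errors
glm(hits ~ x, data = df, family = binomial(link = "logit"))     # logistic regression
glm(counts ~ x, data = df, family = poisson(link = "log"))      # Poisson regression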
Logistic Regression
Let's learn how GLMs are used by playing out a scenario with logistic regression. A logistic regression is a type of regression model used when you have a binary outcome variable. For example, it can estimate the probability of an event occurring: will it, or will it not?
We can combine this with our topic on multiple linear regression by tackling a
scenario with multiple independent variables. Let’s start.
Logistic Regression
A researcher is interested in how variables such as GRE (Graduate Record Exam) scores, GPA (grade point average), and the prestige of the undergraduate institution affect admission into graduate school. The outcome variable, admit or don't admit, is binary.
You have the option to follow along in this demonstration if you would like to. You can load the data for this scenario with code like the sketch below.
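The slide's loading code isn't reproduced in this extract. The variables here match the classic UCLA graduate-admissions example, which is commonly loaded from the UCLA IDRE statistics site (assuming that URL still serves the file):

admissions <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(admissions)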
Logistic Regression
Your binary response variable is called admit. It's safe to say that 1 refers to yes and 0 refers to no. GRE and GPA are continuous variables, while rank is categorical (values 1-4). Institutions with a rank of 1 have the highest prestige.

##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
Logistic Regression
First, we need to convert the rank column into a factor so that it will be treated as a categorical variable, as shown in the line below.
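A one-line sketch of that conversion, using the admissions data frame from above:

admissions$rank <- factor(admissions$rank)   # treat rank as categorical, not numeric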
Once we have done this, we can use the glm() function to fit our model with
the necessary parameters. We need to use summary() to view the results.
> model <- glm(admit ~ gre + gpa + rank, data = admissions, family = "binomial")
> summary(model)
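One common follow-up step (not shown on the slide): logistic-regression coefficients are reported on the log-odds scale, so exponentiating them turns each one into an odds ratio.

exp(coef(model))   # odds ratios for each predictor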
Thank You