8_2_correlations+models_ninell

Uploaded by

Sophia Lindholm

Correlations & Linear Models
Lecture 13
Empirical Methods 2 & Theory of Science
01.11.2024 2

What you know so far


• Hypothesis testing - NHST
• The t-test & Fisher’s p-value
• Test statistics & statistical power
• Confidence intervals & statistical significance
• Problems with NHST
• Effect sizes (Cohen’s d)
• Correlation (Pearson’s r)

Today
• Associations & Scatter Plots
• Correlation (Pearson’s r)
• Linear Models & how to fit them
• Residuals
• Assumptions of Linear Models &
Steps of Hypothesis Testing

Statistical Inference — what we’ve seen so far.


Estimate Probabilities
• Calculate the likelihood of specific outcomes in your data.
• Example: What’s the probability of me choosing Kebabistan over Kösem?
Analyze Central Tendency and Distribution
• Identify averages (mean, median, mode) and the spread of data.
• Example: What is the average kebab score in Nørrebro, and how much do scores vary?
Compare Categories
• Examine differences and relationships between groups within the same variable.
• Example: Do kebab lovers and pizza lovers differ in average scores on a stats test?
Determine Effect Size and Significance
• Measure how strong the results are and whether they’re statistically meaningful.
• Example: Does a new bread type significantly improve Kebabistan’s score?

* Asking more Questions


1. Exploring Associations Between Variables
• Questions: How are two continuous variables related?
• Examples: Is there a relationship between voting for Harris and shoe size? Do
happier students tend to eat more kebab?
2. Making Predictions
• Questions: What can we predict based on known information?
• Examples: After how much time will my grade likely stop improving?
Who will win the presidential election?
3. Understanding Causality
• Questions: How do changes in one variable affect another?
• Examples: How expensive should a pita bread be to gain the highest reward?

Associations

[Figure: scatter plot with the axes "Price" and "Amount of ice cream"]

Questions to ask:
• Who or What is Described?
• What Variables are Included?
• How are the Variables Measured?
• What Types of Variables Are There?

Graphing Associations: The Scatter Plot.


Who or What is Described?
• Identify the individuals or cases the data
represent (e.g., people, products, events).
What Variables are Included?
• List the variables you’re examining and
understand their role in the data.
How are the Variables Measured?
• Consider the measurement scale for each
variable (e.g., dollars, years, scores).
What Types of Variables Are There?
• Determine which variables are quantitative
(e.g., size, price) and which are
categorical (e.g., type of ice cream).

Misleading Scatter Plots



* How to Read a Scatter Plot


1. Identify the Overall Pattern
• Look for trends, shapes, and any deviations that stand out.
2. Describe Key Aspects of the Pattern
• Form: Is the relationship linear, curved, or random?
• Direction: Is there a positive (upward) or negative (downward) trend?
• Strength: Are the points tightly clustered (strong relationship) or spread
out (weak relationship)?
3. Spot Deviations
• Outliers: Identify any points that fall far outside the overall trend—they
may indicate unique cases or data errors.
4. Recognize Clusters
• Clusters suggest subgroups within the data, which might represent
different categories or types of individuals in the dataset.

Let’s Examine a Scatter Plot


• Identify the Overall Pattern
• Describe Key Aspects of the Pattern
• Spot Deviations
• Recognize Clusters

Today
• Associations & Scatter Plots
• Correlation (Pearson’s r)
• Linear Models & how to fit them
• Residuals
• Assumptions of Linear Models &
Steps of Hypothesis Testing

Pearson’s r
Definition: Pearson’s r measures the direction and strength of the linear
relationship between two quantitative variables.
• Ranges from -1 (perfect negative linear relationship) to +1 (perfect positive
linear relationship); a value of 0 means no linear relationship.

Assumptions:
• Normality
• Linearity
• Homoscedasticity – constant scatter
pattern across range
• Independence of errors

* Correlation
Based on Means and Standard Deviations
• Correlation is calculated using the average values (means) and variability (standard
deviations) of two variables.
No Need for Independent or Dependent Variables
• Correlation measures the relationship without assuming one variable depends on the
other.
Applies Only to Continuous Variables
• Both variables need to be continuous (e.g., height, temperature) for correlation to be
meaningful.
Standardized Values, No Units
• Correlation uses standardized scores (z-scores), meaning it has no units of
measurement—just a value.

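$r_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\mu_x}{s_x}\right)\left(\frac{y_i-\mu_y}{s_y}\right) = \frac{\sum_{i=1}^{n}(x_i-\mu_x)(y_i-\mu_y)}{(n-1)\,s_x\,s_y}$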

* Correlation Formula
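$r_{xy} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu_x)(y_i-\mu_y)}{s_x\,s_y}$, with $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu_x)^2}$ (and likewise for $s_y$)

How to read the formula:
• r_xy: the result
• x_i: one data point from group x
• μ_x: the mean from group x
• Σ: sum of all individual differences (the y part works the same way as the x part)
• 1/(n-1): calculate the average; using “n-1” instead of “n” is “Bessel's correction”
(look it up if you’re interested, not covered here ;))
• s_x: the standard deviation of group x
• In the formula of the standard deviation: square to remove the sign, then take the
square root to reverse the square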

Correlation
           Size (x)      Price (y)     (xi-μx)/sx   (yi-μy)/sy   Product
           0.75          85.5          -1.015       -0.991       1.0058849
           1             103.5         -0.564       -0.604       0.3407469
           1.5           148.5          0.3382       0.3627      0.1226689
           2             189            1.2402       1.2331      1.5292722

Mean       1.3125        131.625       -            -            ∑: 2.9985729
St.Dev.    0.554338946   46.53023211   -            -            ∑/(n-1): 0.9995243

The last value, 0.9995243, is our correlation.
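
The table can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function names (`mean`, `sample_sd`, `pearson_r`) are chosen here for illustration and are not part of the lecture material.

```python
import math

# Size (x) and price (y) values taken from the table above.
sizes = [0.75, 1.0, 1.5, 2.0]
prices = [85.5, 103.5, 148.5, 189.0]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    # Standard deviation with n-1 in the denominator (Bessel's correction).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    # Standardize both variables, multiply the z-scores pairwise,
    # sum the products and divide by n-1.
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


r = pearson_r(sizes, prices)
print(f"r = {r:.7f}")
```

The printed value should agree with the 0.9995243 in the table, up to rounding.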

* Correlation
• Measures the strength of linear relationships only.
• Is very sensitive to sample size.
• Is sensitive to outliers: use caution!
• Cannot prove causality; it can only indicate the presence of a relationship.

Correlation

Correlation measures the direction and strength of a linear association

Correlation and The Problem of Outliers



Correlation is not, never ever, ever causation



What can you say about these statements? Discuss!


• There is a high correlation between number of sodas sold in one year and
number of divorces, years 1950-2010.
• There is also a high correlation between number of teachers and number of
bars for cities in Germany.
• There is a high correlation between amount of daily walking and quality of
health for men aged over 65.

Association vs. Causality



Association vs. Causality

[Diagram: Is there a relation between owning cats and being struck by lightning?
Owning cats → climb things to get cat → being struck by lightning.
The intermediate step is labelled “confounding variable”.]

Today
• Associations & Scatter Plots
• Correlation (Pearson’s r)
• Linear Models & how to fit them
• Residuals
• Assumptions of Linear Models &
Steps of Hypothesis Testing

From correlation to Linear Models


• Correlation measures the extent to which two
variables have a linear relationship.
• Linear models summarize the relationship between
two variables, establishing a direction of effect and
allowing us to predict an outcome from a known
measure.

* What is a *Model*? Discuss :)


Definition: A model is a simplified representation of a system, concept, or
phenomenon used to understand, describe, or predict behaviors and outcomes.
Key Characteristics
• Simplification: Focuses on important aspects while ignoring details.
• Representation: Can take any form: mathematical equations, diagrams, simulations, physical replicas.
• Purpose: Used for prediction, explanation, and analysis.
• Assumptions: Built on specific assumptions that define its scope and limitations.
• Validation: Tested against real-world data to ensure accuracy and reliability.
Examples of Models
• Economic Models: Forecast supply and demand curves, the stock market
• Biological Models: Represent population dynamics, or where and when birds migrate.
• Physics Models: Describe laws of motion, the universe.
• Machine Learning Models: ChatGPT

What is a *Linear* Model?


Definition: A model that shows linear relationships, i.e. it describes a straight-line graph.

[Figure panels: linear, exponential, log-normal, sine, non-parametric]



* Variables
Definition of a Linear Model
• Dependent Variable (Response Variable): 𝑦
• The outcome you are trying to predict or explain, e.g. ice cream price.
• Independent Variables (Explanatory Variables): 𝑥1, 𝑥2, …
• The predictors whose values are used to explain changes in the dependent variable, e.g. ice cream size.
Purpose of Variables
• Objective: To explain patterns in the dependent variable (𝑦) using the
independent variables (𝑥1, 𝑥2, …).
Model Representation
• Relationship: 𝑦 = model + error
• Variance Partitioning: The variance in 𝑦 can be divided into:
• Explained Variance: Variance attributed to the independent variables.
• Unexplained Variance: Variance due to random error or other factors.
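
The variance partitioning can be checked numerically. The sketch below assumes a straight-line model fitted by least squares to the ice cream size and price values from the correlation table; the variable names are illustrative only.

```python
# Size (x) and price (y) values from the correlation table.
sizes = [0.75, 1.0, 1.5, 2.0]
prices = [85.5, 103.5, 148.5, 189.0]

mean_x = sum(sizes) / len(sizes)
mean_y = sum(prices) / len(prices)

# Least squares slope and intercept of the line y = intercept + slope * x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
    (x - mean_x) ** 2 for x in sizes
)
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in sizes]

# Total variation in y, split into the part the model explains
# and the part it leaves unexplained (the error).
total = sum((y - mean_y) ** 2 for y in prices)
explained = sum((p - mean_y) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(prices, predicted))

r_squared = explained / total
print(f"total = {total:.2f}")
print(f"explained + unexplained = {explained + unexplained:.2f}")
print(f"share explained = {r_squared:.4f}")
```

For a model with a single predictor, the share of explained variance equals the squared correlation r², so here it is very close to 1.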

From Correlation to Linear Model Fitting

Linear Regression: A straight line describes how a response variable y
changes as an explanatory variable x changes.
y = b0 + b1x

We can predict the value of y for a given value of x.

* The Regression Equation
𝑌 = a + b𝑋 + ε, where a is the intercept, b is the slope, a + b𝑋 is the linear
function of 𝑋, and ε is the random error.
Linear Function of an Independent Variable
• Represents the relationship between the independent 𝑋 and the dependent 𝑌.
Random Error
• Accounts for the variability in 𝑌 that cannot be explained by the linear function.
Intercept (a):
• Represents the value of 𝑌 when 𝑋 = 0.
Slope (b):
• How much 𝑌 changes for a one-unit change in 𝑋.
• Is a ratio of change, i.e. the steepness of the line.

* *Fitting* a Regression Model


Fitting a model means actually finding and drawing the line that
comes as close as possible to the points representing our data

*Fitting* a Regression Model


With this line, we can predict the values of Y that we haven’t
actually observed ourselves!

* *Fitting* a Regression Model – Extrapolating


Sometimes people extrapolate, i.e. they predict values far outside the range
of values of the explanatory variable x used to obtain the line. Such predictions
are often not accurate.
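
A small sketch of what extrapolation looks like in practice, assuming a least squares line fitted to the ice cream size and price values from the correlation table. The query sizes 0 and 20 and the helper names are hypothetical, chosen only for illustration.

```python
# Size (x) and price (y) values from the correlation table.
sizes = [0.75, 1.0, 1.5, 2.0]
prices = [85.5, 103.5, 148.5, 189.0]

mean_x = sum(sizes) / len(sizes)
mean_y = sum(prices) / len(prices)

# Least squares slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
    (x - mean_x) ** 2 for x in sizes
)
intercept = mean_y - slope * mean_x


def predict(size):
    return intercept + slope * size


def is_extrapolation(size):
    # True when the requested size lies outside the observed range of x.
    return size < min(sizes) or size > max(sizes)


for size in [1.25, 0.0, 20.0]:  # 0.0 and 20.0 are hypothetical query points
    label = "extrapolation" if is_extrapolation(size) else "within observed range"
    print(f"size={size}: predicted price={predict(size):.2f} ({label})")
```

The line happily returns a positive price for an ice cream of size 0 and keeps rising without limit for very large sizes. The data say nothing about either region, so these numbers should not be trusted.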

Today
• Associations & Scatter Plots
• Correlation (Pearson’s r)
• Linear Models & how to fit them
• Residuals
• Assumptions of Linear Models &
Steps of Hypothesis Testing

Least Squares Regression


Regression means fitting a best line. How do we pick the best one?

* Least Squares Regression


The least squares line minimizes the sum of the
squared vertical deviations of the data points from
the line (the least squares error approach).
Steps for finding the best line:
1. Create a scatterplot to see if there is
a linear relationship in the data.
2. Use the data to estimate b1 (slope)
and b0 (intercept).
y = b0 + b1x (the mean of y at a given x)
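
Step 2 can be sketched with the standard closed-form least squares estimates, b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0 = ȳ - b1·x̄. The example assumes the ice cream size and price values from the correlation table; the function name `fit_line` is made up for this sketch.

```python
def fit_line(xs, ys):
    """Return (b0, b1) of the least squares line y = b0 + b1 * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: forces the line through the point (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Size (x) and price (y) values from the correlation table.
sizes = [0.75, 1.0, 1.5, 2.0]
prices = [85.5, 103.5, 148.5, 189.0]

b0, b1 = fit_line(sizes, prices)
print(f"price = {b0:.2f} + {b1:.2f} * size")
```

With these four data points the slope comes out at roughly 83.9 price units per unit of size, and the intercept at roughly 21.5.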

Residuals
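A residual is the difference between the observed value and the value
predicted by the model. Examining the residuals helps assess how well the line
describes the data.
The model can under- or over-estimate: a positive residual means the line
under-estimates the observation, a negative residual means it over-estimates.
residual = observed y – predicted y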
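
As a sketch of the definition, the residuals of a least squares line fitted to the ice cream size and price values from the correlation table can be computed as follows (variable names are illustrative).

```python
# Size (x) and price (y) values from the correlation table.
sizes = [0.75, 1.0, 1.5, 2.0]
prices = [85.5, 103.5, 148.5, 189.0]

mean_x = sum(sizes) / len(sizes)
mean_y = sum(prices) / len(prices)

# Least squares slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
    (x - mean_x) ** 2 for x in sizes
)
intercept = mean_y - slope * mean_x

# residual = observed y - predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(sizes, prices)]

for x, res in zip(sizes, residuals):
    direction = "under-estimates" if res > 0 else "over-estimates"
    print(f"size={x}: residual={res:+.3f} (model {direction})")

print(f"sum of residuals = {sum(residuals):.6f}")
```

For a least squares line with an intercept, the residuals always sum to zero (up to floating point error), so the size of individual residuals and any pattern in them are what matter.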

* Residuals
A residual plot is a scatterplot of the regression residuals against the explanatory
variable. It helps us assess the model assumptions. If the regression line fits the
overall pattern of the data, there should be no pattern in the residuals.
A trend in dispersion can indicate
that a data transformation is
necessary.

Today
• Associations & Scatter Plots
• Correlation (Pearson’s r)
• Linear Models & how to fit them
• Residuals
• Assumptions of Linear Models &
Steps of Hypothesis Testing

* Regression Assumptions
• Additivity and linearity (same as correlation!)
• Independent errors (the independent and identically
distributed, IID, assumption)
• Normally distributed errors
• Predictors are uncorrelated with external variables
(nothing lurking!)
• No multicollinearity
• Homoscedasticity
• Non-zero variance (something to explain!)

Steps in hypothesis testing


1. Define study question.
2. Set null and alternative hypothesis.
3. Choose an appropriate test and calculate a test statistic.
4. Calculate a p-value.
5. Make a decision and interpret your conclusions.
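
As one possible illustration of these steps, the sketch below tests whether the correlation between ice cream size and price differs from zero. It assumes the usual t-test for Pearson's r, t = r·√(n-2) / √(1-r²) with n-2 degrees of freedom, and it replaces the p-value of step 4 with a comparison against the tabulated two-sided 5% critical value for 2 degrees of freedom (4.303). The data come from the correlation table; with only four points this is a toy example.

```python
import math

# Step 1: study question. Are ice cream size and price linearly related?
sizes = [0.75, 1.0, 1.5, 2.0]
prices = [85.5, 103.5, 148.5, 189.0]
n = len(sizes)

# Step 2: H0: the population correlation is 0; H1: it is not 0.

# Step 3: test statistic, based on Pearson's r.
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
sxx = sum((x - mean_x) ** 2 for x in sizes)
syy = sum((y - mean_y) ** 2 for y in prices)
r = sxy / math.sqrt(sxx * syy)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Step 4 (simplified): compare |t| with the critical value from a t-table
# instead of computing an exact p-value.
critical_value = 4.303  # two-sided, alpha = 0.05, df = n - 2 = 2
reject_null = abs(t) > critical_value

# Step 5: decision.
print(f"r = {r:.4f}, t = {t:.2f}, reject H0: {reject_null}")
```

An exact p-value requires the t distribution, which the Python standard library does not provide; a statistics library such as SciPy (for example `scipy.stats.pearsonr`) reports r together with a p-value.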

Thanks! :)