Statistical Methods

1/18/24

Simple Linear Regression

A hat symbol (^) over a quantity means it is an estimated value


● Cause and Effect - Soccer Example
● “Simple” - only 1 cause
● “Linear” - linear relationship between the force of the kick and how far the ball travels
● Economic model: represents relationships
Simple Linear Regression Model
● Assumptions
○ 1. Let X be the magnitude or value of the CAUSE (or INPUT) event
Let Y be the value of the RESPONSE (or OUTPUT) event
■ In regression models, X is known as the “PREDICTOR” and Y is known
as the “PREDICTED”
■ Stats textbooks call X “INDEPENDENT” and Y “DEPENDENT,” but this
terminology is borrowed from functions between 2 sets of objects and can be
misleading in some cases
■ A “function” is a relationship between 2 sets: it picks an object from one
set X, the Domain of the function, and sends it to an element of a set Y
called the Range of the function
■ Each X point can only create 1 image in Y
○ Linear relationship between the response and value of the cause
Y = β0 + β1X + ε
■ ε means everything else: an independent random variable subject to a
normal probability distribution
■ The independent random variable (r.v.) ε is governed by ~, which reads
“is distributed as” and here indicates the Normal Law N
■ ε ~ N(0, σ²)
● In the long run, the avg distance the ball travels from a penalty
kick to the goal reflects only the force used to kick, since other
forces like wind cancel out → the expected value of ε is 0
● σ² denotes the variance of ε.
■ Here, β0 and β1 are constant real numbers, constant parameters called
structural parameters. σ² is a constant real number too
■ Y is the distance the ball travels to the goal; ε represents everything
else but the kick (a random variable governed by the normal distribution)
■ μ represents the mean
■ σ² is the variance; σ is the standard deviation
■ Σ (capital sigma) denotes summation; do not confuse it with σ
■ Σi=1..n Xi = X1 + X2 + … + Xn
■ var(Z) = E[(Z - μZ)²]
■ For the X data: Σi=1..n (Xi - μX)² appears inside the variance
● Data: { (X, Y) }
○ 1. (X1 , Y1): first observation, 2 dimensional (strength of kick, distance ball
traveled)
○ 2. (X2 , Y2) …
○ Nth observation (Xn , Yn)
○ n number of independent observations (random)
● Construct a relationship that measures the average response generated by a known cause
value supported by the evidence
○ E[Y | X] → expected average of the response generated by the cause
■ “Expected average distance traveled given that we are kicking the ball
with the same force”
○ E[Y | X] = E[β0 + β1X + ε | X] = E[β0 + β1X | X] + E[ε | X]
■ E[ε | X] = E[ε] since ε is independent of X, and the expected value of ε
is 0, so the term vanishes
○ (expected value of response) Ŷ = expected value of the first parameter +
expected value of the second parameter * the known force
○ Ŷ = β̂0 + β̂1X : the Linear Regression Equation
■ The task is to find the best possible numerical values for β̂0 and β̂1 by
using the data set
■ β̂0 and β̂1 are unknown values; the hat symbol represents the predicted or
estimated value of the variable
○ Objective: search for and find the best possible numerical values for β̂0
and β̂1
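To make the model concrete, here is a minimal Python sketch (numpy assumed; the parameter values, sample size, and X range are invented for illustration) that generates data from Y = β0 + β1X + ε with ε ~ N(0, σ²):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Invented structural parameters, for illustration only
beta0, beta1, sigma = 5.0, 2.0, 3.0

n = 30                                # number of independent observations
X = rng.uniform(10, 40, size=n)       # force of each kick (the cause)
eps = rng.normal(0.0, sigma, size=n)  # ε ~ N(0, σ²): everything else
Y = beta0 + beta1 * X + eps           # distance traveled (the response)
```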

1/25/24
Simple Linear Regression Continued
Model
● Outcome represented by Y
● Value of cause denoted by X
1. Y = β0 + β1X + ε
2. ε ~ N(0, σ²) ← THIS IS OUR MODEL
● β1 > 0: positive relationship between X and Y
● β1 = 0: no relationship between X and Y
● β1 < 0: negative or inverse relationship between X and Y
● The cause is not the only factor that generates an outcome; external factors
enter through “ε”
a. ε does not contribute anything at all in the long run to the average
response value
● Ŷ = β̂0 + β̂1X
● Data: {n independent observations}
○ Observation i: (Xi , Yi)
■ i=1, observation 1 (X1 , Y1) → Ŷ1 = β̂0 + β̂1X1
■ i=2, observation 2 (X2 , Y2) → Ŷ2 = β̂0 + β̂1X2
■ i=n, observation n (Xn , Yn) → Ŷn = β̂0 + β̂1Xn
○ Compute the sample averages (X̄ , Ȳ)
○ “Centroid”: the center of the data set, the equilibrium point where the data
points balance
○ Prediction Error for each observation (e), the “Residual”: e = Y - Ŷ
■ e1 = Y1 - Ŷ1 = Y1 - [ β̂0 + β̂1X1 ]
■ e2 = Y2 - Ŷ2 = Y2 - [ β̂0 + β̂1X2 ]
■ en = Yn - Ŷn = Yn - [ β̂0 + β̂1Xn ]
○ First idea: minimize{β̂0 , β̂1} Σi=1..n ei , but positive and negative errors
cancel out
■ So instead of adding all the error terms, add all the squared error terms
○ minimize{β̂0 , β̂1} Σi=1..n ei² : the least squared error method of estimating
the parameters
■ the “LSE” Method
● “Measure of Variations” captured by a statistic called the “Sum of Squares”
○ SSX = Σi=1..n (Xi - X̄)²
○ SSY = Σi=1..n (Yi - Ȳ)²
○ SXY = Σi=1..n (Xi - X̄)(Yi - Ȳ)
○ (SXY/SSX) represents what happens to the value of Y when you increase the
value of X by 1 unit
■ i.e., the amount of change in the Y value in the data set per 1-unit
increase in X
○ 1. β̂1 = (SXY/SSX) = b1
○ 2. β̂0 = Ȳ - b1X̄ = b0
● Ŷ = b0 + b1X : the best fitted line, the line that gets as close as possible
to the data points
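Continuing the simulation sketch from the 1/18 notes (same assumed X, Y, n), the LSE estimates follow directly from these sums of squares:

```python
# Sums of squares computed from the data
Xbar, Ybar = X.mean(), Y.mean()       # the centroid (X̄, Ȳ)
SSX = np.sum((X - Xbar) ** 2)
SXY = np.sum((X - Xbar) * (Y - Ybar))

b1 = SXY / SSX           # slope estimate β̂1: change in Ŷ per unit of X
b0 = Ybar - b1 * Xbar    # intercept estimate β̂0
Yhat = b0 + b1 * X       # the best fitted line at the observed X values
```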
● WARNING:
——Xinput———Xmin————Xinput————Xmax———Xinput———>
The fitted line Ŷ = b0 + b1X is only supported between Xmin and Xmax, so we can
safely predict for the middle Xinput, which lies between them
○ We cannot trust the relationship between X and Y if we choose an Xinput
outside of the max and min points
○ Stay within the observations; you can go a bit outside, but be very careful
○ The observations are supported by reality; outside of them we don’t know
what will happen
Question: Can we use our equation?
Answer: we must have evidence of agreement between the model assumptions and
the information recorded in the data set
Model Assumptions (must see evidence of these in the data set)
1. Linearity - must show evidence of linear relationships
2. Normality - normal probability distribution (bell curve)
3. Homoscedasticity - the spread of Y is the same for every value of X:
same variance for the Y values no matter the X value
a. “Scedasticity” refers to the spread, how much things are scattered around
4. Independence
Performance Measures
● Measure of Variations for Y:
○ Total variations captured in data set:
(Total) SST = Σi=1..n (Yi - Ȳ)² ; dfT = (n-1)
○ (Aside on sample averages) Let X be the number of children a person has:
X1 = 0, X2 = 1, X3 = 9. Find the avg number of children.
X̄ = (X1 + X2 + X3)/3 = (0+1+9)/3 = 10/3 ≈ 3.333 children/person
○ (Regression): SSR = Σi=1..n (Ŷi - Ȳ)² ; dfR = 1
○ (Random/Error): SSE = Σi=1..n (Yi - Ŷi)² ; dfE = n-2
○ SST = SSR + SSE ; dfT = dfR + dfE
● Variance Estimator: known as “Mean Squares,” denoted by MS
MS = SS/df
○ MSE = SSE/dfE = SSE/(n-2) = σ̂²
○ σ̂ = Standard Deviation Estimator
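In the running sketch, the decomposition and the variance estimator can be checked numerically:

```python
SST = np.sum((Y - Ybar) ** 2)      # total variation, dfT = n - 1
SSR = np.sum((Yhat - Ybar) ** 2)   # variation explained by regression, dfR = 1
SSE = np.sum((Y - Yhat) ** 2)      # leftover random variation, dfE = n - 2

assert np.isclose(SST, SSR + SSE)  # SST = SSR + SSE holds for the LSE fit

MSE = SSE / (n - 2)                # mean square error = σ̂²
sigma_hat = np.sqrt(MSE)           # σ̂, the standard error of estimation
```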

1. Standard Error of Estimation

a. S(Y|X) = √(SSE/(n-2))
i. The smaller this value, the better the fit
2. Coefficient of Determination, aka R-Square, r²
a. Tells us how much of the variation in Y the regression explains
b. r² = SSR/SST = (r)², where r is the Linear Correlation Coefficient of the X
and Y values recorded in the data
i. 0%? Knowledge of cause is completely useless. Can’t explain anything
about response behavior, and response behavior is completely random
ii. 100%? Perfect predictor. Knowing the cause completely describes the
response. Will almost never get 100%
iii. Smaller the number = cause less important, randomness dominates
Higher the number = more important to know the cause
iv. Tells us how reliable the cause event is
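In the sketch, r² and its link to the correlation coefficient are one line each:

```python
r_squared = SSR / SST                 # coefficient of determination
r = SXY / np.sqrt(SSX * SST)          # correlation coefficient (SSY = SST here)
assert np.isclose(r_squared, r ** 2)  # r² = (r)² in simple linear regression
```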
Question: Can we use the “Best Fitted Line”?
Answer: A) yes if there is an agreement between the data set and the model
B) no, if there is no agreement between the data set and the model
Investigate
1. Linearity
a. Global Test for Linearity (aka “F-Test”)
Step 1:
i. (null hypothesis) H0: there is no significant linear relation between X
and Y
(alternative hypothesis) H1: there is a significant linear relation between X
and Y
ii. Evidence will make one of these sentences disappear
b. Step 2: Level of Significance, α = P(rejecting H0 given that H0 is actually
true)
i. The probability of making a mistake (rejecting a true H0)
c. Step 3: Test Statistic, Fstat = MSR/MSE ~ F(df1, df2) → df1 = dfR = 1; df2 = dfE = n-2
i. Fcritical = Fα ; df1 , df2 (the tolerable distance)
ii. Fstat is the measure of distance from H0 to reality
d. Step 4: Decision Rule; Declare Rule for how to reject H0
i. If Fstat > Fcritical then reject Ho .
e. Step 5: Decision; Make a decision based on the evidence
i. Case A: since Fstat <= Fcritical we do not reject H0
ii. Case B: since Fstat > Fcritical , reject H0
f. Conclusion
Conclusion: we are 100(1-α)% confident that
● Case A —> we cannot use the best fitted line
● Case B —> we can use the best fitted line
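A sketch of the five steps, using scipy's F distribution for the critical value (scipy assumed available; α is an arbitrary choice):

```python
from scipy import stats

alpha = 0.05                       # Step 2: level of significance
MSR = SSR / 1                      # dfR = 1
F_stat = MSR / MSE                 # Step 3: test statistic
F_crit = stats.f.ppf(1 - alpha, dfn=1, dfd=n - 2)

if F_stat > F_crit:                # Steps 4-5: decision rule and decision
    print("Case B: reject H0, significant linear relation")
else:
    print("Case A: do not reject H0")
```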

2/1/24
Ŷ = b0 + b1X, from the LSE method, is the best fitted line. Can we use it?
Linearity Investigation
1. Global Test, “F-Test”:
a. Ho: no significant relationship between input and output
b. H1: there is a significant relationship between input and output
c. Level of Significance: α
● Test Statistic: Fstat = MSR/MSE ~ F(df1 , df2)
a. Fcrit = Fα ; df1 , df2
● Decision Rule: if Fstat > Fcrit , then reject Ho
● Decision
a. Case A: since Fstat < Fcrit , we cannot reject Ho
b. Case B: Since Fstat > Fcrit , we reject Ho
● Conclusion: we are 100(1-α)% confident that
a. Case A: there is no linear relationship between input and output
(we cannot use the “best fitted line” for this statement)
b. Case B: there is a linear relationship between the input X and the output Y
(we may use the “best fitted line” → continue the investigation)
2. Individual Test for each predictor (X variable)
● A. “t-test” for the “partial slope” , β1
○ Ho: β1 = 0
○ H1: β1 ≠ 0
○ Level of Significance: α
● Test statistic: tstat = (b1 - β1) / Sb1 ~ tdf
○ Sb1 = S(Y|X) / √SSX ; df = n-2
● Decision Rule:
○ If |tstat| > tcrit, reject H0
● Decision
○ Case A: since |tstat| <= tcrit , we cannot reject H0
○ Case B: since |tstat| > tcrit , we reject H0
● B. Construct a 100(1-α)% confidence interval estimator for β1:
*A* b1 - (tcrit)(Sb1) <= β1 <= b1 + (tcrit)(Sb1) *B*
○ tcrit sets the width of the construct; tcrit = tα/2 ; (n-2)
○ Case A: if the endpoints A and B have different signs (the interval contains
0), then β1 may be 0 → we do not need X
○ Case B: if A and B have the same sign, then β1 ≠ 0 → we need X
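Continuing the sketch, the individual t-test and the interval for β1 (reusing sigma_hat, SSX, b1, alpha, and stats from the earlier blocks):

```python
S_b1 = sigma_hat / np.sqrt(SSX)                # standard error of the slope
t_stat = (b1 - 0) / S_b1                       # test statistic under H0: β1 = 0
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # two-sided critical value

A = b1 - t_crit * S_b1                         # lower endpoint of the interval
B = b1 + t_crit * S_b1                         # upper endpoint of the interval
need_X = (A > 0) == (B > 0)   # same sign → interval excludes 0 → keep X
```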
Normality
Use the visual inspection tool known as the “Normal Probability Plot for Y”
● On the graph, match the estimated quantiles for the Y variable (on the y-axis)
against the actual quantiles of the data set Y values
● P(ai <= Yi <= bi) under N(Ȳ , MSE)
● If the plot looks like a straight line, the actual values of Y come from a
normal distribution; this is the “normal probability plot of Y”
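A sketch of the plot with scipy and matplotlib (both assumed available); the notes draw it for the Y values, though in practice it is often drawn for the residuals instead:

```python
import matplotlib.pyplot as plt

# Normal probability plot: ordered data vs theoretical normal quantiles.
# A roughly straight line supports the normality assumption.
stats.probplot(Y, dist="norm", plot=plt)
plt.show()
```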
Homoscedasticity
● For each i: ei = Yi - Ŷi
● If the data points along the spread from Ymin to Ymax are relatively even, with
parallel trend lines (looking like this: = ), Y is homoscedastic, meaning Y has
a constant variance
○ This graph is the “Residual Plot”
● If the data points are more scattered and don’t have even parallel trend lines
(say, lines that diverge from min to max and look like this: < ), Y is
heteroscedastic
○ The lines can also come together, looking like this: >
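A minimal residual-plot sketch; look for an even horizontal band (=) rather than a fan (< or >):

```python
e = Y - Yhat                   # residuals, one per observation
plt.scatter(Yhat, e)           # residual plot: residuals vs fitted values
plt.axhline(0, color="gray")   # reference line at zero
plt.xlabel("fitted value")
plt.ylabel("residual e")
plt.show()
```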
Independence
Time Series
● Observations may have 2 kinds of autocorrelation. We’re only focused on the
positive kind
○ a: negative, we don’t care about this one
○ b: positive
● Case A: if there is no evidence of positive autocorrelation → we can use this
data for linear regression
● Case B: if there is evidence of positive autocorrelation → we cannot use this
data for linear regression
Durbin-Watson Test
● Fit Ŷ = b0 + b1X and compute a residual for each observation, kept in time
order:

i    Data        Ŷ      e
1    X1 , Y1     Ŷ1     Y1 - Ŷ1 = e1
2    X2 , Y2     Ŷ2     Y2 - Ŷ2 = e2
3    X3 , Y3     Ŷ3     Y3 - Ŷ3 = e3
…
n    Xn , Yn     Ŷn     Yn - Ŷn = en

● From the Durbin-Watson table, (α, n, k) → (dupper = “du” and dlower = “dl”)
○ n is the number of observations; k is the number of predictors (here k = 1)
● Test Statistic: Dstat = [Σi=2..n (ei - ei-1)²] / [Σi=1..n ei²]
● Decision
○ Case A (proceed): if Dstat > du , then there is no evidence of positive
autocorrelation between observations → observations are independent
○ Case B (stop): if Dstat < dl , then there is strong evidence of positive
autocorrelation between observations
○ Case C: if dl <=Dstat<= du , we do not have sufficient info to make any decisions
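The statistic itself is one line over the time-ordered residuals from the sketch; dl and du still have to come from a Durbin-Watson table for the chosen (α, n, k):

```python
# Durbin-Watson statistic; the residuals e must be in time order
D_stat = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Compare with table values dl and du for (alpha, n, k):
#   D_stat > du         → Case A: no evidence of positive autocorrelation (proceed)
#   D_stat < dl         → Case B: strong evidence of positive autocorrelation (stop)
#   dl <= D_stat <= du  → Case C: inconclusive
```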
