Statistical Methods
1/25/24
Simple Linear Regression Continued
Model
● Outcome (response) represented by Y
● Value of the cause (predictor) denoted by X
1. Y = β0 + β1X + ε
2. ε ~ N(0, σ²) ← THIS IS OUR MODEL
● β1 positive: positive relationship between X and Y
● β1 = 0: no relationship between X and Y
● β1 negative: negative or inverse relationship between X and Y
● The cause is not the only factor that generates the outcome; external factors are captured by “ε”
a. In the long run, ε contributes nothing to the average response value (its mean is 0)
● Ŷ = β̂0 + β̂1X
● Data: {n independent observations}
○ Observation i: (xi , yi)
■ i = 1, observation 1 (x1 , y1) → Ŷ1 = β̂0 + β̂1X1
■ i = 2, observation 2 (x2 , y2) → Ŷ2 = β̂0 + β̂1X2
■ i = n, observation n (xn , yn) → Ŷn = β̂0 + β̂1Xn
○ Compute the sample averages (X̄ , Ȳ)
○ “Centroid”: the center of the data set, the equilibrium point where the data points balance
○ Prediction error for each observation, e, called the “residual”: e = Y - Ŷ
■ e1 = Y1 - Ŷ1 = Y1 - [β̂0 + β̂1X1]
■ e2 = Y2 - Ŷ2 = Y2 - [β̂0 + β̂1X2]
■ en = Yn - Ŷn = Yn - [β̂0 + β̂1Xn]
○ First idea: minimize over (β̂0 , β̂1) the sum Σᵢ₌₁ⁿ ei
■ instead of adding all the error terms (which can cancel each other out), we add all the squared error terms
○ minimize over (β̂0 , β̂1) Σᵢ₌₁ⁿ ei² : the least squared error method of estimating the parameters
■ the “LSE” Method
● “Measures of Variation”, captured by statistics called “Sums of Squares”:
○ SSX = Σᵢ₌₁ⁿ (Xi - X̄)²
○ SSY = Σᵢ₌₁ⁿ (Yi - Ȳ)²
○ SXY = Σᵢ₌₁ⁿ (Xi - X̄)(Yi - Ȳ)
○ (SXY/SSX) represents what happens to the value of Y when you increase the value of X by 1 unit
■ i.e., the amount of change in the Y value in the data set when X increases by one unit
○ 1. β̂1 = SXY/SSX = b1
○ 2. β̂0 = Ȳ - b1X̄ = b0
● Ŷ = b0 + b1X : the best fitted line to the data set, i.e., the line that gets as close as possible to the data points (see the sketch below)
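To make the LSE recipe concrete, here is a minimal Python sketch (not from class) that computes b1 = SXY/SSX and b0 = Ȳ - b1X̄ directly from the sums-of-squares formulas above; the x and y arrays are made-up illustrative values.

```python
# Minimal LSE sketch; x and y are made-up illustrative data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()        # centroid (X̄, Ȳ)
SSX = np.sum((x - x_bar) ** 2)           # Σ (Xi - X̄)²
SXY = np.sum((x - x_bar) * (y - y_bar))  # Σ (Xi - X̄)(Yi - Ȳ)

b1 = SXY / SSX           # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate

y_hat = b0 + b1 * x      # fitted values Ŷ
e = y - y_hat            # residuals e = Y - Ŷ
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```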
● WARNING:
○ (Sketch: a number line with Xmin and Xmax marked; an Xinput can fall below Xmin, between Xmin and Xmax, or above Xmax. Only for an Xinput between Xmin and Xmax can we use Ŷ = b0 + b1X.)
○ We cannot trust the relationship between X and Y if we choose an Xinput outside of the max and min points.
○ Stay within the observations; we can go a bit outside, but be very careful.
○ Observations are supported by reality; outside of them we don’t know what will happen.
Question: Can we use our equation?
Answer: We must have evidence of agreement between the model assumptions and the information recorded in the data set.
Model Assumptions (must see evidence of these in the data set)
1. Linearity - must show evidence of a linear relationship
2. Normality - the response follows a normal probability distribution (bell curve)
3. Homoscedasticity - Y has the same spread at every value of X:
same variance for all Y values no matter the X value
a. “Scedasticity” refers to the spread: how much things are scattered around
4. Independence - the observations are independent of one another
Performance Measures
● Measures of Variation for Y:
○ Total variation captured in the data set:
(Total) SST = Σᵢ₌₁ⁿ (Yi - Ȳ)² ; dfT = n - 1
○ (Aside on sample averages) Let X be the number of children a person has, with X1 = 0, X2 = 1, X3 = 9. Find the average number of children:
X̄ = (X1 + X2 + X3)/3 = (0 + 1 + 9)/3 = 10/3 ≈ 3.333 children/person
○ (Regression): SSR = Σᵢ₌₁ⁿ (Ŷi - Ȳ)² ; dfR = 1
○ (Random/Error): SSE = Σᵢ₌₁ⁿ (Yi - Ŷi)² ; dfE = n - 2
○ SST = SSR + SSE ; dfT = dfR + dfE
● Variance Estimators: known as “Mean Squares”, denoted by MS
MS = SS/df
○ MSE = SSE/dfE = SSE/(n - 2) = σ̂²
○ σ̂ = Standard Deviation Estimator
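A short sketch of the SST = SSR + SSE decomposition and the MSE estimator, assuming the same made-up data and LSE formulas as the earlier sketch; all values are illustrative.

```python
# Variance decomposition sketch; data values are illustrative only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(y)

# Refit with the same LSE formulas as before
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)      # total variation, dfT = n - 1
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression variation, dfR = 1
SSE = np.sum((y - y_hat) ** 2)         # error variation, dfE = n - 2

MSE = SSE / (n - 2)      # variance estimator σ̂²
sigma_hat = MSE ** 0.5   # standard deviation estimator σ̂
assert np.isclose(SST, SSR + SSE)  # SST = SSR + SSE holds for LSE fits
```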
2/1/24
Ŷ = b0 + b1X, the best fitted line from the LSE method. Can we use it?
Linearity Investigation
1. Global Test, “F-Test”:
a. Ho: there is no significant relationship between input and output
b. H1: there is a significant relationship between input and output
c. Level of Significance: α
● Test Statistic: Fstat = MSR/MSE ~ F(df1 , df2) , where MSR = SSR/dfR, df1 = dfR = 1, and df2 = dfE = n - 2
a. Fcrit = F(α ; df1 , df2)
● Decision Rule: if Fstat > Fcrit , then reject Ho
● Decision
a. Case A: since Fstat < Fcrit , we cannot reject Ho
b. Case B: since Fstat > Fcrit , we reject Ho
● Conclusion: we are (1 - α)×100% confident that
a. Case A: there is no linear relationship between input and output
(we cannot use the “best fitted line” in this case)
b. Case B: there is a linear relationship between X and Y
(we may use the “best fitted line” → continue the investigation)
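A sketch of the F-test decision rule in Python, assuming SSR and SSE have already been computed; the numbers below are placeholders, not class data, and scipy’s f.ppf supplies the critical value F(α ; df1 , df2).

```python
# Global F-test sketch; SSR, SSE, and n are placeholder values.
from scipy import stats

n, alpha = 20, 0.05
SSR, SSE = 40.0, 9.0    # hypothetical sums of squares
df1, df2 = 1, n - 2

F_stat = (SSR / df1) / (SSE / df2)         # MSR / MSE
F_crit = stats.f.ppf(1 - alpha, df1, df2)  # upper-tail critical value

if F_stat > F_crit:
    print("Reject Ho: evidence of a linear relationship")  # Case B
else:
    print("Cannot reject Ho: no evidence of linearity")    # Case A
```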
2. Individual Test for each predictor (x-variable)
● A. “t-test” for the “partial slope” β1
○ Ho: β1 = 0
○ H1: β1 ≠ 0
○ Level of Significance: α
● Test statistic: tstat = (b1 - β1)/Sb1 ~ t(df) ; df = n - 2
○ Sb1 = SY·X/√SSX , where SY·X = √MSE is the standard error of the estimate
● Decision Rule:
○ If |tstat| > tcrit , reject Ho (two-sided test)
● Decision
○ Case A: since |tstat| < tcrit , we cannot reject Ho
○ Case B: since |tstat| > tcrit , we reject Ho
● B. Construct a (1 - α)×100% confidence interval estimator for β1:
A = b1 - (tcrit)(Sb1) ≤ β1 ≤ b1 + (tcrit)(Sb1) = B
○ tcrit = tα/2 ; (n-2) is the multiplier: the number of standard errors Sb1 stepped out on each side of b1
○ Case A: if A and B have different signs (the interval contains 0), then β1 could be 0 → we do not need X
○ Case B: if A and B have the same sign (the interval excludes 0), then β1 ≠ 0 → we need X
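A sketch of the t-test and confidence interval for β1, assuming b1, MSE, SSX, and n are already known; all numbers are placeholders.

```python
# t-test and CI sketch for the slope; all inputs are placeholders.
from scipy import stats

n, alpha = 20, 0.05
b1, MSE, SSX = 2.0, 0.5, 10.0

s_b1 = (MSE ** 0.5) / (SSX ** 0.5)          # Sb1 = SY·X / √SSX
t_stat = b1 / s_b1                          # tstat under Ho: β1 = 0
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)  # tα/2 ; (n-2)

A = b1 - t_crit * s_b1  # lower endpoint
B = b1 + t_crit * s_b1  # upper endpoint
# Same sign for A and B means the interval excludes 0, so we need X.
print(f"t = {t_stat:.3f}, t_crit = {t_crit:.3f}, CI = ({A:.3f}, {B:.3f})")
```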
Normality
Use the visual inspection tool known as the “Normal Probability Plot for Y”
● On the graph, match the estimated quantiles for the Y variable (on the y-axis) against the actual Y values of the data set
● P(ai ≤ Yi ≤ bi) , computed under N(Ȳ , MSE)
● If the plot looks like a straight line, the actual values of Y come from a normal distribution (“normal probability plot of Y”)
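A sketch of a normal probability plot using scipy’s probplot, applied here to the residuals (the same idea works for the Y values, as in the notes); the residual values are made up.

```python
# Normal probability plot sketch; residual values are made up.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

e = np.array([0.12, -0.25, 0.31, -0.18, 0.05, -0.05])  # hypothetical residuals

stats.probplot(e, dist="norm", plot=plt)  # theoretical vs. observed quantiles
plt.title("Normal probability plot")
plt.show()  # points close to a straight line suggest normality
```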
Homoscedasticity
● For each i: ei = Yi - Ŷi
● If the data points along the spread from Ymin to Ymax are relatively even, with parallel trend lines (looking like this: =), then Y is homoscedastic, meaning Y has a constant variance
○ This is read from the “Residual Plot”
● If the data points are more scattered and don’t have even parallel trend lines (say, lines that diverge from min to max and look like this: <), then Y is heteroscedastic
○ Can also be lines that come together, looking like this: >
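A sketch of a residual plot, assuming fitted values and residuals are already computed (the arrays below are made up); an even horizontal band around 0 is the “=” pattern described above.

```python
# Residual plot sketch; y_hat and e are made-up illustrative values.
import numpy as np
import matplotlib.pyplot as plt

y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])  # fitted values Ŷ
e = np.array([0.1, -0.2, 0.15, -0.1, 0.2, -0.15])   # residuals e = Y - Ŷ

plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")  # reference line at e = 0
plt.xlabel("Fitted values Ŷ")
plt.ylabel("Residuals e")
plt.show()  # an even band ( = ) suggests homoscedasticity; < or > does not
```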
Independence
Time Series
● Observations may have 2 kinds of autocorrelation; we are only focused on the positive kind
○ a: negative (we don’t care about this one)
○ b: positive
● Case A: there is evidence of positive autocorrelation → we cannot use this data for linear regression
● Case B: there is no evidence of positive autocorrelation → we can use this data for linear regression
Durbin-Watson Test
● Based on Σᵢ₌₂ⁿ (ei - ei-1)², the sum of squared differences between successive residuals
● Residuals come from the fitted line Ŷ = b0 + b1X:
i | Data | Ŷ | e
1 | X1 , Y1 | Ŷ1 | Y1 - Ŷ1 = e1
2 | X2 , Y2 | Ŷ2 | Y2 - Ŷ2 = e2
3 | X3 , Y3 | Ŷ3 | Y3 - Ŷ3 = e3
… | … | … | …
n | Xn , Yn | Ŷn | Yn - Ŷn = en
● From a Durbin-Watson table: (α, n, k) → (dupper = “du” and dlower = “dl”)
○ n is the number of observations; k is the number of predictors (here k = 1)
● Test Statistic: Dstat = Σᵢ₌₂ⁿ (ei - ei-1)² / Σᵢ₌₁ⁿ ei²
● Decision
○ Case A (proceed): if Dstat > du , then there is no evidence of positive autocorrelation between observations → observations are independent
○ Case B (stop): if Dstat < dl , then there is strong evidence of positive autocorrelation between observations
○ Case C: if dl ≤ Dstat ≤ du , we do not have sufficient information to make a decision
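A sketch of the Durbin-Watson statistic, assuming the residuals are in time order; the residuals and the (dl, du) bounds are placeholders, since the real bounds come from a Durbin-Watson table for (α, n, k).

```python
# Durbin-Watson sketch; residuals and (dl, du) bounds are placeholders.
import numpy as np

e = np.array([0.5, 0.4, -0.1, -0.3, 0.2, 0.1])  # residuals in time order

num = np.sum(np.diff(e) ** 2)  # Σ (ei - ei-1)² for i = 2..n
den = np.sum(e ** 2)           # Σ ei²
D_stat = num / den

dl, du = 0.61, 1.40            # placeholder table values for (α, n, k = 1)
if D_stat > du:
    print("Case A (proceed): no evidence of positive autocorrelation")
elif D_stat < dl:
    print("Case B (stop): strong evidence of positive autocorrelation")
else:
    print("Case C: insufficient information to decide")
```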