Da Public Slides ch09 v3 2023
regression result
Data Analysis for Business, Economics, and Policy 2 / 47 Gábor Békés (Central European University)
Generalizing Results Testing, p-values Intervals for Predicted Values External validity CS:B1 CS:A1 CS:A2 CS:A3 CS:A4
Generalizing: reminder
- We have uncovered some pattern in our data. We are interested in generalizing the
results.
The simple SE formula of the slope is

$$SE(\hat{\beta}) = \frac{Std[e]}{\sqrt{n}\, Std[x]}$$

- Where:
  - Residual: $e = y - \hat{\alpha} - \hat{\beta}x$
  - $Std[e]$, the standard deviation of the regression residual,
  - $Std[x]$, the standard deviation of the explanatory variable,
  - $\sqrt{n}$, the square root of the number of observations in the data.
  - Smaller sample: may use $\sqrt{n-2}$.
- A smaller standard error translates into
  - a narrower confidence interval,
  - an estimate of the slope coefficient with more precision.
- More precision if
  - the standard deviation of the residual is smaller: better fit, smaller errors.
  - the standard deviation of the explanatory variable is larger: more variation in x is good.
  - there are more observations in the data.
- This formula is correct assuming homoskedasticity.
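A minimal Python sketch of the simple SE formula may help fix ideas. It assumes standard deviations are computed with divisor n; the function name `simple_slope_se` and the numbers are made up for illustration and do not come from the slides.

```python
import math

def simple_slope_se(x, y):
    """Return (beta_hat, SE) with SE = Std[e] / (sqrt(n) * Std[x])."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # OLS slope and intercept of y on x
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    # Residuals e = y - alpha_hat - beta_hat * x (their mean is zero by construction)
    residuals = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    std_e = math.sqrt(sum(e ** 2 for e in residuals) / n)  # assumption: divisor n
    std_x = math.sqrt(sxx / n)                             # assumption: divisor n
    return beta, std_e / (math.sqrt(n) * std_x)

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
beta, se = simple_slope_se(x, y)
print(f"slope = {beta:.3f}, simple SE = {se:.4f}")
```

Duplicating every observation leaves Std[e] and Std[x] unchanged but doubles n, so in this sketch the SE falls by a factor of √2. This is the "more observations" channel in isolation.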
Heteroskedasticity Robust SE
- The simple SE formula is not correct in general.
- Homoskedasticity assumption: the fit of the regression line is the same across the
entire range of the x variable.
- In general this is not true.
- Heteroskedasticity: the fit may differ at different values of x, so the spread of
actual y around the regression line is different for different values of x.
- Heteroskedasticity-robust SE formulas remain valid whether or not the fit is the same across x.
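The slides do not spell out a robust formula. As a hedged illustration, the sketch below uses the basic White (HC0) estimator for the slope of a simple regression, which weights each squared residual by that observation's squared deviation of x from its mean. Statistical software often applies additional small-sample corrections, so its numbers may differ. The function name and the data are hypothetical.

```python
import math

def slope_se_simple_and_robust(x, y):
    """Return (beta_hat, simple SE, HC0 robust SE) for a regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    beta = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    alpha = y_bar - beta * x_bar
    e = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    # Simple SE: Std[e] / (sqrt(n) * Std[x]), written with sums (divisor n)
    se_simple = math.sqrt(sum(ei * ei for ei in e) / n / sxx)
    # HC0: sqrt( sum (x_i - x_bar)^2 e_i^2 ) / sum (x_i - x_bar)^2
    se_robust = math.sqrt(sum((d * ei) ** 2 for d, ei in zip(dx, e))) / sxx
    return beta, se_simple, se_robust

# Made-up data in which the spread of y grows with x
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 2.1, 2.8, 4.5, 4.0, 7.5, 5.0, 10.0]
beta, se_simple, se_robust = slope_se_simple_and_robust(x, y)
print(f"slope={beta:.3f}  simple SE={se_simple:.3f}  robust SE={se_robust:.3f}")
```

When all squared residuals happen to be equal, the two formulas in this sketch coincide. They diverge when large residuals cluster at particular values of x.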
$$H_0: \beta_{true} = 0, \qquad H_A: \beta_{true} \neq 0$$
Practical guidance:
- Choose a critical value.
- p-value: the probability of a false positive in our dataset
- Balancing act: false positive (FP) and false negative (FN)
- Higher critical value
  - FP: less likely (rejecting the null is less likely).
  - FN: more likely (higher risk of not rejecting the null even though it is false)
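The trade-off can be sketched in a few lines of Python. The sketch assumes a large sample, so the t-statistic is compared with the standard normal distribution. The function name `two_sided_test`, the coefficient, and the SE are hypothetical.

```python
import math

def two_sided_test(beta_hat, se, critical_value=1.96):
    """Test H0: beta_true = 0 against HA: beta_true != 0."""
    t = beta_hat / se
    # Two-sided p-value under the normal approximation:
    # P(|Z| > |t|) = erfc(|t| / sqrt(2))
    p = math.erfc(abs(t) / math.sqrt(2))
    reject = abs(t) > critical_value
    return t, p, reject

# Hypothetical estimate: the same t-statistic, judged by two critical values
t, p, reject_196 = two_sided_test(0.2, 0.1)                     # conventional 5% level
_, _, reject_260 = two_sided_test(0.2, 0.1, critical_value=2.6) # stricter rule
print(f"t = {t:.2f}, p = {p:.3f}, reject at 1.96: {reject_196}, reject at 2.6: {reject_260}")
```

Raising the critical value makes rejection rarer. That lowers the chance of a false positive but raises the chance of missing a true effect.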
p-Hacking
- Very often many steps lead to a regression analysis
- Many of these steps involve arbitrary decisions
- Often we work with a bias: looking to reinforce expectations
- The temptation: make those decisions so that we can show a "significant" result.
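A small simulation illustrates why this matters. The sketch below regresses pure noise on pure noise many times and counts how often the slope looks "significant" at the 5% level. Everything in it (function names, sample sizes, seed) is an illustrative assumption, not part of the slides.

```python
import math
import random

def slope_t_stat(x, y):
    """t-statistic of the OLS slope, using the simple SE with n - 2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    alpha = y_bar - beta * x_bar
    ssr = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / sxx)
    return beta / se

def share_false_positives(n_tries=200, n_obs=50, seed=1):
    """Share of regressions with |t| > 1.96 when y is unrelated to x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tries):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        y = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to x by construction
        if abs(slope_t_stat(x, y)) > 1.96:
            hits += 1
    return hits / n_tries

share = share_false_positives()
print(f"share of 'significant' slopes with no true relationship: {share:.3f}")
```

With a 5% rule, roughly one in twenty such regressions is expected to come out "significant" by chance. Trying many specifications and reporting only the one that works produces exactly this kind of result.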
Prediction uncertainty
- Goal: predicting the value of y for observations outside the dataset, when only the
value of x is known.
- We predict y based on coefficient estimates, which are relevant in the general
pattern/population. With linear regression you have a simple model:

$$y_i = \hat{\alpha} + \hat{\beta} x_i + \epsilon_i$$

- The estimated statistic here is a predicted value for a particular observation, $\hat{y}_j$. For
an observation j with known value $x_j$ this is

$$\hat{y}_j = \hat{\alpha} + \hat{\beta} x_j$$

- Two kinds of intervals:
  - Confidence interval for the predicted value/regression line: uncertainty about $\hat{\alpha}$, $\hat{\beta}$
  - Prediction interval: uncertainty about $\hat{\alpha}$, $\hat{\beta}$ and $\epsilon_i$
- Confidence interval (CI) of the predicted value = the CI of the regression line.
- The predicted value $\hat{y}_j$ is based on $\hat{\alpha}$ and $\hat{\beta}$ only.
- The CI of the predicted value combines the CI for $\hat{\alpha}$ and the CI for $\hat{\beta}$.
- It answers: what value to expect if we know the value of $x_j$ and we have estimates of the
coefficients $\hat{\alpha}$ and $\hat{\beta}$ from the data.
- The 95% CI of the predicted value, $95\%\,CI(\hat{y}_j)$, is
  - the value estimated from the sample
  - plus and minus about two times its standard error.
$$SE(\hat{y}_j) = Std[e] \sqrt{\frac{1}{n} + \frac{(x_j - \bar{x})^2}{n\, Var[x]}}$$
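The two kinds of intervals can be compared in a short Python sketch. The confidence interval uses the standard error of the predicted value, Std[e] times the square root of 1/n + (x_j − x̄)²/(n Var[x]). For the prediction interval, the sketch assumes the usual standard prediction error, which adds 1 under the square root to account for the residual. That formula is not shown on these slides. The function name, the multiplier of 2, and the data are illustrative.

```python
import math

def predict_with_intervals(x, y, x_j, multiplier=2.0):
    """Return (y_hat, CI of predicted value, prediction interval) at x_j."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n  # assumption: divisor n
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n * var_x)
    alpha = y_bar - beta * x_bar
    std_e = math.sqrt(sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y)) / n)
    y_hat = alpha + beta * x_j
    core = 1 / n + (x_j - x_bar) ** 2 / (n * var_x)
    se_fit = std_e * math.sqrt(core)   # uncertainty about alpha_hat and beta_hat only
    spe = std_e * math.sqrt(1 + core)  # assumed formula: adds the residual's spread
    ci = (y_hat - multiplier * se_fit, y_hat + multiplier * se_fit)
    pi = (y_hat - multiplier * spe, y_hat + multiplier * spe)
    return y_hat, ci, pi

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
for x_j in (3.0, 5.0):
    y_hat, ci, pi = predict_with_intervals(x, y, x_j)
    print(f"x_j={x_j}: y_hat={y_hat:.2f}, CI width={ci[1] - ci[0]:.2f}, PI width={pi[1] - pi[0]:.2f}")
```

Both intervals are narrowest at the mean of x and widen as x_j moves away from it. The prediction interval is always the wider of the two and does not shrink to zero as n grows.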
Prediction interval
External validity
- To learn about external validity, we always need additional data on, say, other
countries or time periods.
- We can then repeat the regression and see if the slope is similar.
- Here we ask a different question: whether we can infer something about the
price–distance pattern for situations outside the data.
- Is the slope coefficient close to what we have for Vienna, November, weekday?
  - Other dates (focus in class)
  - Other cities
  - Other types of accommodation: apartments
- Compare them to our benchmark model result.
- Learn about the uncertainty of using the model in settings that differ from the data, that is, about some types of external validity.
Benchmark model
The benchmark model is a spline with a knot at 2 miles.
The data is restricted to a November 2017 weekday in Vienna, 3-4 star hotels, within 8 miles.
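One common way to set up such a spline is to split distance into two regressors, one for the part up to the knot and one for the part beyond it. The sketch below shows only that construction, under the assumptions that the spline is piecewise linear and that "within 8 miles" is inclusive. The function name and the distances are hypothetical.

```python
def linear_spline_terms(distance, knot=2.0):
    """Split distance into two regressors for a piecewise linear spline."""
    below = min(distance, knot)        # grows up to the knot, then stays at the knot value
    above = max(distance - knot, 0.0)  # zero up to the knot, then grows
    return below, above

# Hypothetical distances in miles
hotels = [0.5, 1.5, 2.0, 3.0, 7.5, 9.0]
sample = [d for d in hotels if d <= 8.0]  # assumption: the 8-mile limit is inclusive
terms = [linear_spline_terms(d) for d in sample]
for d, (below, above) in zip(sample, terms):
    print(f"distance={d}: below-knot term={below}, above-knot term={above}")
```

Regressing price on the two terms gives one slope for distances below 2 miles and another for distances above.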
Comparing dates
- 17% difference on average in per-hour earnings between men and women
- For linear regression analysis, we will use ln wage to compare relative differences.
- One key reason for the gap could be women working in sectors / occupations that pay less.
Focus on a single one: computer science occupations, N = 4,740

$$\ln(w)^E = \alpha + \beta \times G_{female}$$

- We regressed log earnings per hour on G, a binary variable that is one if the
individual is female and zero if male.
- The log-level regression estimate is $\hat{\beta} = -0.1475$
  - Female computer science employees earn 14.7 percent less, on average, than
male employees with the same occupation in this dataset.
- Statistical inference based on 2014 data.
  - SE: 0.0177; 95% CI: [-0.182, -0.112]
  - Simple vs robust SE: here no practical difference.
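The reported interval can be checked with a few lines of arithmetic. The sketch below takes the coefficient and SE from the slide and assumes a multiplier of 1.96. The conversion from log points to an exact percentage difference, exp(β̂) − 1, is a standard identity and is not shown on the slide.

```python
import math

beta_hat = -0.1475  # slope estimate from the slide
se = 0.0177         # standard error from the slide

# 95% CI as estimate plus/minus 1.96 standard errors (assumed multiplier)
ci_low = beta_hat - 1.96 * se
ci_high = beta_hat + 1.96 * se

# t-statistic for H0: beta_true = 0
t_stat = beta_hat / se

# Exact percentage difference implied by a log-point difference
exact_pct = (math.exp(beta_hat) - 1) * 100

print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"t-statistic: {t_stat:.2f}")
print(f"exact difference: {exact_pct:.1f} percent")
```

The bounds computed this way may differ from the reported interval in the last digit, because the slide's inputs are rounded. The interval is far from zero either way.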
Model:
- $\ln wage = \alpha + \beta \times female$
- Only one industry: market analysts, N = 281

Variables       ln wage
Female          -0.11
                (0.062)
Constant        3.31**
                (0.049)
Observations    281
R-squared       0.012

Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1.
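As a rough check on what this smaller sample can tell us, the sketch below recomputes the t-statistic, an approximate p-value, and a 95% CI from the rounded table values. It assumes a normal approximation and a multiplier of 1.96, so exact software output may differ slightly.

```python
import math

beta_hat = -0.11  # Female coefficient from the table (rounded)
se = 0.062        # robust SE from the table (rounded)

t_stat = beta_hat / se
# Two-sided p-value under the normal approximation
p_value = math.erfc(abs(t_stat) / math.sqrt(2))
ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

With these rounded inputs the 95% CI includes zero. The point estimate is of similar size to the computer science result, but with only 281 observations the estimate is much less precise.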
Model:
- $\ln wage = \alpha + f(age)$
- Only one industry: market analysts, N = 281

Variables       ln wage
age             0.014**
                (0.003)
Constant        2.732**
                (0.101)
Observations    281
R-squared       0.098

Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1.