UDJ Cheat Sheet - Merged

1. There are several probability distributions that can be used to calculate probabilities depending on whether the variable is discrete or continuous. The key distributions are binomial, Poisson, and normal. 2. Hypothesis testing involves stating the null and alternative hypotheses, calculating a test statistic such as z-score, choosing a significance level, and comparing the test statistic to critical values to determine whether to reject or fail to reject the null hypothesis. 3. Confidence intervals provide a range of values that is likely to contain the true population parameter based on a sample. For means, the margin of error depends on the standard deviation and sample size. For proportions, the margin of error decreases as the sample size increases.

Uploaded by

dew
Copyright
© © All Rights Reserved

What is the probability of…?

1. Identify whether you have a discrete or continuous variable.

2. If discrete, which distribution should you use? Make sure to state this.
a. Binomial:
• 2 possible outcomes – success/failure
• p: probability of success – make sure p ≤ 0.5
• q: probability of failure: q = 1 - p
• n: the number of trials (the size of the sample)
• x: outcome – can be any whole number from 0 to n
• Summary: x ~ Bin(n; p)
• Questions will be, given x ~ Bin(6; 0.3)
o What is P(x=2)?
Table: n=6, p=0.3, so P(x=2) = 0.3241 = 32.41%
o What is P(x>4)? P(x>4) = P(x=5) + P(x=6)
Table: n=6, p=0.3, so P(x=5) + P(x=6) = 0.0102 + 0.0007 = 0.0109 = 1.09%
b. Poisson:
• Events are coming at a constant rate λ during an interval I
• λ (= µ): average number of events that occur during the initial interval
• σ = √λ
• x: number of actual events – x has no upper limit
• Questions will be, given λ1 = 5 events in 3 hours
o What is P(x=2) in 1 hour?
Calculate λ2 = 5/3 ≈ 1.7
Table: λ2 = 1.7, x = 2, so P(x=2) = 0.2640 = 26.40%
o What is P(x>2) in 1 hour? P(x>2) = 1 - [P(x=0) + P(x=1) + P(x=2)]
3. If continuous, Normal distribution
• x ~ Norm(µ; σ)
• Questions will be, given x ~ Norm(3; 0.5)
o What is P(x>2)? Note: P(x>2) = P(x≥2) because P(x=2) = 0
Calculate z-score: Z = [x-µ]/σ = [2-3]/0.5 = -1/0.5 = -2
Table: Z = -2, F(Z=-2) = 0.9772 = 97.72%, where F(z) = P(Z > z) is the upper-tail probability
o What is P(x<2)? 1 - P(x>2) = 1 - 0.9772 = 2.28%

Building Confidence Intervals
We are looking at a sample of a population.
Characteristic   Population   Sample
Size             N            n
Mean             µ            x̄
SD               σ            s
Proportion       P            P̂ = x/n
I. Averages
Two possibilities: n ≥ 30 or n < 30
1. n ≥ 30
a. “Per the Central Limit Theorem, if n is large (n ≥ 30) then the variable x̄ follows a Normal distribution such that: x̄ ~ Norm(µ; σx̄ = σ/√n)”
b. σ?
i. We know σ of the population
• Questions will be, given a sample n=32 with mean x̄ = 10 and standard deviation s = 2, and σ = 2.5, build a 95% confidence interval for µ.
o We are looking for: {x̄ – E; x̄ + E}
o 95% confidence interval means α = 0.05 and α/2 = 0.025
o Table: Zα/2 = 1.96
o E = Zα/2 * σ/√n = 1.96 * 2.5/√32 = 0.866
o {x̄ – E; x̄ + E} = {10 – 0.866; 10 + 0.866} = {9.134; 10.866}
ii. We do not know σ of the population – we are given s
• Questions will be, given a sample n=32 with mean x̄ = 10 and standard deviation s = 2, build a 95% confidence interval for µ.
o We are looking for: {x̄ – E; x̄ + E}
o 95% confidence interval means α = 0.05 and α/2 = 0.025
o Table: Zα/2 = 1.96
o “Because we do not know σ, we use s to approximate σ”
o E = Zα/2 * s/√n = 1.96 * 2/√32 = 0.693
o {x̄ – E; x̄ + E} = {10 – 0.693; 10 + 0.693} = {9.307; 10.693}
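The two n ≥ 30 intervals above can be checked numerically. This is a minimal sketch using only the Python standard library; the function name `z_interval` is made up for illustration, and `NormalDist().inv_cdf` stands in for the Z table lookup.

```python
import math
from statistics import NormalDist


def z_interval(sample_mean, sd, n, confidence=0.95):
    """Return (lower, upper, margin) for a z-based interval around a sample mean."""
    alpha = 1.0 - confidence
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    margin = z_crit * sd / math.sqrt(n)               # E = Z(alpha/2) * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin


# Case i: population sigma known (sigma = 2.5), n = 32, sample mean 10.
known = z_interval(10, 2.5, 32)
# Case ii: sigma unknown, so the sample s = 2 stands in for it (n >= 30).
approx = z_interval(10, 2.0, 32)

print("sigma known:", [round(v, 3) for v in known])
print("s in place: ", [round(v, 3) for v in approx])
```

The small sample case (n < 30, σ unknown) needs a t value instead of `z_crit`; the standard library has no t-distribution, so that value would still come from the table.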
2. n < 30
a. “n is small, so we must assume that the population is normally distributed in order to apply the Central Limit Theorem.”
b. σ?
i. We know σ of the population
• Questions will be, given a sample n=10 with mean x̄ = 10 and standard deviation s = 2, and σ = 2.5, build a 95% confidence interval for µ.
• Same as 1.b.i (E = Zα/2 * σ/√n).
ii. We do not know σ of the population – we are given s
• Questions will be, given a sample n=10 with mean x̄ = 10 and standard deviation s = 2, build a 95% confidence interval for µ.
o “Because we have a small sample size, and because we do not know σ, we use the t-distribution and we use s to approximate σ.”
o We are looking for: {x̄ – E; x̄ + E} where E = t(n-1; α/2) * s/√n
o n-1 = 9, and 95% confidence interval means α = 0.05 and α/2 = 0.025
o Table: t(9; 0.025) = 2.262
o E = t(n-1; α/2) * s/√n = 2.262 * 2/√10 = 1.431
o {x̄ – E; x̄ + E} = {10 – 1.431; 10 + 1.431} = {8.569; 11.431}
II. Proportions
We are looking at a sample proportion. n will always be larger than 30.
• Questions will be, sample n=32 and we have noticed that P̂ = 0.3, build a 95% confidence interval for P.
o We are looking for {P̂ – E; P̂ + E} where E = Zα/2 * √{[(x/n) * (1 - x/n)] / n}
o 95% confidence interval means α = 0.05 and α/2 = 0.025
o Table: Zα/2 = 1.96
o E = 1.96 * √(0.3*0.7/32) = 0.1588
o {P̂ – E; P̂ + E} = {0.3 – 0.1588; 0.3 + 0.1588} = {14.12%; 45.88%}
• Questions will be, keeping the 95% CI, what should the sample size n be to reduce the margin of error to 10%? E=0.1. Only for Proportions!
o “Because we need a new n, we cannot use the original P̂ given. Let’s be conservative and use x/n = 0.5 (which maximizes E) and we apply the formula n = (Zα/2 / 2E)²”
o n = (Zα/2 / 2E)² = (1.96 / 0.20)² = 96.04, so round up to n = 97

Combination of two random variables
• You have two random variables X (µX; σX) and Y (µY; σY)
• You are given the correlation coefficient ρXY, with -1 ≤ ρXY ≤ 1
Note that X and Y independent means ρXY = 0
• You are given a combination of the two, W, such that W = a + bX + cY
• µW = E[W] = a + b*E[X] + c*E[Y] = a + b*µX + c*µY
• σW² = (b*σX)² + (c*σY)² + 2*ρXY*b*c*σX*σY, and σW = √(σW²)
• Questions will be, given X (5;2) and Y (6;1), ρXY = 0.5 and W = 2 + X - Y
o What is P(W>2)?
1. “If X and Y follow a Normal Distribution, you can assume that W also follows a Normal Distribution such that W ~ Norm(µW; σW)”
2. Calculate µW; σW:
µW = 2 + 1*5 + (-1)*6 = 1
σW² = (1*2)² + (-1*1)² + 2*0.5*1*(-1)*2*1 = 4 + 1 - 2 = 3, so σW = √3 ≈ 1.73
3. Calculate Z-score: Z = [x-µW]/σW = [2-1]/1.73 ≈ 0.58
4. P(W>2) = F(Z=0.58) = 28.10%

Hypothesis Testing
We are trying to decide whether what we are given (either µ0 or P0) is within a certain confidence interval of the true µ or P.
1. We state our hypotheses: H0: µ = µ0 or P = P0 and HA: µ ≠ µ0 or P ≠ P0
2. We calculate the test statistic:
a. If we are looking at averages, σ known: Zobs. = [x̄ - µ0] / [σ/√n]
b. If we are looking at averages, n>30, σ unknown: Zobs. = [x̄ - µ0] / [s/√n]
c. If we are looking at averages, n<30, σ unknown: tobs. = [x̄ - µ0] / [s/√n] (will not be used)
d. If we are looking at proportions, Zobs. = [(x/n) – P0] / √[P0*(1-P0) / n]
3. We choose a level of significance α - usually 5% or 1%
a. For α=0.05, Zα/2 = 1.96
b. For α=0.01, Zα/2 = 2.58
4. We compare our findings and decide.
a. If Zobs. > Zα/2 or if Zobs. < -Zα/2, we reject H0.
b. If -Zα/2 < Zobs. < Zα/2: “I do not have sufficient evidence to reject H0 therefore I accept it.”
5. Alternatively, you can use P-values. Note: the smaller the P-value, the stronger the evidence against H0.
a. P-value = 2*Prob(Z > |Zobs.|) = 2*F(|Zobs.|)
b. P-value > α: “I do not have sufficient evidence to reject H0 therefore I accept it.”
c. P-value < α: reject H0.
6. Certain questions will ask you to find the “level of significance” at which H0 can be accepted. Here you will need to find α < P-value.
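The test steps above (statistic, P-value, decision) can be sketched for the "averages, σ known" case. The numbers in the example question, in particular µ0 = 9, are hypothetical and not from the sheet, and `two_sided_z_test` is an illustrative name.

```python
import math
from statistics import NormalDist


def two_sided_z_test(sample_mean, mu0, sd, n, alpha=0.05):
    """Return (z_obs, p_value, reject_h0) for H0: mu = mu0 vs HA: mu != mu0."""
    z_obs = (sample_mean - mu0) / (sd / math.sqrt(n))
    upper_tail = 1.0 - NormalDist().cdf(abs(z_obs))  # F(|Zobs|) = P(Z > |Zobs|)
    p_value = 2.0 * upper_tail
    return z_obs, p_value, p_value < alpha


# Hypothetical question: sample mean 10, n = 32, sigma = 2.5, claimed mu0 = 9.
z_obs, p_value, reject_5 = two_sided_z_test(10, 9, 2.5, 32, alpha=0.05)
_, _, reject_1 = two_sided_z_test(10, 9, 2.5, 32, alpha=0.01)

print(f"Zobs = {z_obs:.3f}, P-value = {p_value:.4f}")
print(f"reject H0 at 5%: {reject_5}, at 1%: {reject_1}")
```

Comparing the P-value with α gives the same decision as comparing |Zobs.| with Zα/2, which is why step 5 is described as an alternative to step 4.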
Building a regression model
Regression: a linear relationship between one variable you would like to predict (the dependent variable) and one or more
independent variables (X1, X2, X3…)
Y = A + B1X1 + B2X2 + B3X3 …
1. Test that all coefficients (B1, B2, B3…) are different from 0. This is to ensure there is a relationship between Xi and Y.
a. H0: Bi = 0 HA: Bi ≠ 0
b. Once you reject H0, you can say that yes, this independent variable Xi has an impact on Y.
c. t-statistic = (bi – Bi) / SE(bi), where bi is the estimated coefficient.
However, because you are testing for Bi = 0, t-statistic = bi / SE(bi)
Dependent variable: Y
Ind. variable   Coeff.   SE       t-stat        p-value                  0.05 significance?
Xi              bi       SE(bi)   bi / SE(bi)   2*Prob(Z > |t-stat|)     Can you reject H0?
                                                = 2*F(|t-stat|)
• p-value > α: I cannot reject H0 and accept that Bi = 0: not significant.
• p-value < α: I reject H0 and accept that Bi ≠ 0: yes, Xi has an impact on Y.
“If X1 increases by 1 [unit], all other independent variables remaining constant, Y will increase on average by B1 [units].”
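The significance check for each coefficient can be sketched as follows. The coefficients and standard errors are hypothetical, and the P-value uses the normal approximation 2*F(|t-stat|) exactly as the sheet writes it; statistical software would usually use the t-distribution instead.

```python
from statistics import NormalDist


def coefficient_test(b, se, alpha=0.05):
    """Return (t_stat, p_value, significant) for H0: Bi = 0."""
    t_stat = b / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))  # 2 * F(|t-stat|)
    return t_stat, p_value, p_value < alpha


# Hypothetical regression output: name -> (coefficient, standard error).
output = {"X1": (1.8, 0.6), "X2": (0.4, 0.5)}
results = {name: coefficient_test(b, se) for name, (b, se) in output.items()}

for name, (t_stat, p_value, significant) in results.items():
    verdict = "significant" if significant else "not significant"
    print(f"{name}: t = {t_stat:.2f}, p = {p_value:.4f}, {verdict}")
```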
2. Deal with the risk of multicollinearity
a. Multicollinearity means that two independent variables explain the same part of the dependent variable – one of the two is therefore redundant. Keeping both in this case distorts the model rather than improving it.
b. There is a risk of multicollinearity if the correlation coefficient between two independent variables has an absolute value greater than 0.7: |r| > 0.7
c. For each pair of independent variables with a high correlation coefficient (above 0.7 in absolute terms), look at whether both are significant, only one is significant, or neither is significant.
d. Both are significant: consider whether the model makes sense (e.g., it makes sense that profit increases as the size of a business increases and decreases as the number of employees increases – keeping both makes sense). If it doesn’t make sense, remove the one with the highest p-value.
e. Only one is significant: remove the one with the highest p-value from the model.
f. Neither is significant: remove the one with the highest p-value from the model.
• Questions will be, what can you infer from the correlation matrix of the variables?
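Reading a correlation matrix for the |r| > 0.7 rule can be sketched like this. The observations are made up for illustration, and `pearson_r` is a hand-written helper rather than a library call.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical observations for three independent variables.
data = {
    "X1": [1, 2, 3, 4, 5, 6],
    "X2": [2, 4, 5, 8, 10, 13],
    "X3": [5, 1, 4, 2, 6, 3],
}

# One entry per pair of independent variables (upper triangle of the matrix).
correlations = {
    (a, b): pearson_r(data[a], data[b]) for a, b in combinations(data, 2)
}
# Pairs above the 0.7 threshold carry a risk of multicollinearity.
at_risk = [pair for pair, r in correlations.items() if abs(r) > 0.7]

for pair, r in correlations.items():
    print(pair, round(r, 3))
print("risk of multicollinearity:", at_risk)
```

For each pair listed in `at_risk`, the next step is the significance comparison in items d to f.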
3. Choose the best model among those models that only have significant variables by comparing the adjusted R². The best is the one with the highest adjusted R². Note that the R-square measures how well the model explains the variations of the dependent variable. The closer the data points are to the regression line, the higher the R-square, and the better the model.
• Questions will be, what is the best model?
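The model comparison can be sketched with the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of independent variables. The sheet does not spell this formula out, and the two models below are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical candidate models, all variables assumed significant.
n_obs = 30
models = {
    "A": {"r2": 0.80, "k": 2},
    "B": {"r2": 0.81, "k": 5},
}

scores = {name: adjusted_r2(m["r2"], n_obs, m["k"]) for name, m in models.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"model {name}: adjusted R2 = {score:.4f}")
print("best model:", best)
```

Model B has the higher plain R² here, but the penalty for its extra variables leaves model A with the higher adjusted R².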
4. Compute the forecast.
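Computing the forecast means plugging values of the independent variables into Y = A + B1X1 + B2X2 + …. A minimal sketch, with a hypothetical intercept, coefficients and inputs:

```python
def forecast(intercept, coefficients, values):
    """Evaluate Y = A + B1*X1 + B2*X2 + ... for one set of X values."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Hypothetical fitted model: Y = 2.0 + 1.5*X1 - 0.5*X2, evaluated at X1 = 4, X2 = 6.
y_hat = forecast(2.0, [1.5, -0.5], [4, 6])
print("forecast:", y_hat)
```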

Looking at graphs
State:
1. What has been plotted
2. What assumption you are testing with this kind of plot.
3. Your conclusion based on this kind of plot.
4. What might be the reasons
Assumption on the errors (residuals)
Plot Type: either:
(i) residuals ei against time or observations number; or
(ii) residuals versus lagged residuals
Testing: whether errors are random (i.e., not autocorrelated).
Conclusion: if, in plot type (i), the residuals fluctuate randomly around the horizontal axis or, in plot type (ii), you have a cloud of data with no pattern, the errors are not autocorrelated.
Reasons (if the errors are autocorrelated): either (a) problem of non-linearity between the dependent variable and each independent variable – we should try a transformation; or (b) we are missing an independent variable in the model.
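Plot type (ii) has a simple numerical counterpart: the correlation between each residual and the one before it. This sketch is a stand-in for the visual check, not a method from the sheet. The residuals are hypothetical and no cut-off value is given in the source.

```python
import math


def lag1_correlation(residuals):
    """Pearson correlation between e[t-1] and e[t]."""
    previous, current = residuals[:-1], residuals[1:]
    n = len(current)
    mean_p, mean_c = sum(previous) / n, sum(current) / n
    spc = sum((p - mean_p) * (c - mean_c) for p, c in zip(previous, current))
    spp = sum((p - mean_p) ** 2 for p in previous)
    scc = sum((c - mean_c) ** 2 for c in current)
    return spc / math.sqrt(spp * scc)


# Hypothetical residuals that drift steadily upward over time.
drifting = [-3, -2, -1, 0, 1, 2, 3]
r_drifting = lag1_correlation(drifting)
print("lag-1 correlation:", round(r_drifting, 3))
```

A value near +1 or -1 corresponds to a clear line in the lagged plot (autocorrelated errors), while a value near 0 corresponds to a cloud.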

Plot type: residuals ei against predicted value


Testing: whether errors are homoscedastic (i.e., there is a constant variance)
Conclusion: If errors all stay within one corridor of constant width, homoscedastic. If errors are in a widening corridor, heteroscedastic.
Reasons (if heteroscedastic): problem of non-linearity between the dependent variable and each independent variable – we should try a transformation, or split the data into groups and run a separate regression for each group.
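The "corridor" reading of this plot can be roughly approximated by comparing the spread of the residuals for low and high predicted values. This is a crude numerical stand-in for the visual check, not a formal test and not a method from the sheet. The data are hypothetical.

```python
from statistics import stdev


def spread_ratio(predicted, residuals):
    """Residual spread in the upper half of predictions divided by the lower half."""
    ordered = [e for _, e in sorted(zip(predicted, residuals))]
    half = len(ordered) // 2
    return stdev(ordered[half:]) / stdev(ordered[:half])


# Hypothetical residuals whose corridor widens as the predicted value grows.
predicted = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 1.0, -1.2, 1.5, -1.6]

ratio = spread_ratio(predicted, residuals)
print("spread ratio (high / low):", round(ratio, 2))
```

A ratio close to 1 is consistent with a constant corridor, while a ratio far above 1 matches the widening corridor described above.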

Plot type: histogram of the residuals with the superimposed normal curve
Testing: whether errors are normally distributed with a mean of 0
Conclusion: “I assume that the errors are roughly normally distributed.”
Reasons (if the histogram departs from the normal curve): not enough observations. Try to get more.
