DS-2, Week 4 - Lectures
1 LEARNING OBJECTIVES
1.1 What are you expected to learn from this lecture?
• Learn about the different statistical tests used in data science.
• Learn how to calculate these statistical tests in R.
4.2 Z-Test
• It is a parametric hypothesis test.
• It tests the significance of a difference between mean values when the sample size is
large (i.e., greater than 30) and the population variance is known.
• Assumptions of this test are as follows:
– Population distribution is normal.
– Samples are random and independent.
– The sample size is large.
– Population standard deviation is known.
$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
where $\bar{x}$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population standard deviation, and $n$ is the sample size.
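The one-sample statistic can be sketched in a few lines. This is a minimal illustration in plain Python (standard library only) rather than R; the function name `one_sample_z` and the data are hypothetical, and the two-sided p-value is an addition that the formula itself does not require.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z(sample, mu, sigma):
    """Return (z, two-sided p-value) for H0: population mean equals mu.

    sigma is the population standard deviation, assumed known.
    """
    n = len(sample)
    z = (mean(sample) - mu) / (sigma / sqrt(n))
    # Two-sided p-value from the standard normal distribution
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical data: 36 observations with sample mean 53, assumed sigma = 6
sample = [52] * 18 + [54] * 18
z, p = one_sample_z(sample, mu=50, sigma=6)
print(z, p)
```

With these made-up numbers the standard error is 6/√36 = 1, so z is the raw difference between the sample mean and 𝜇.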
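For two independent samples with known population variances, the statistic is:
$$ z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} $$
where $\bar{x}_1, \bar{x}_2$ are the sample means, $\sigma_1, \sigma_2$ are the population standard deviations, and $n_1, n_2$ are the sample sizes.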
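The two-sample version works from summary statistics alone. The sketch below is plain Python, not R; `two_sample_z` is a hypothetical helper and every input value is invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """Return (z, two-sided p-value) for H0: the two population means are equal.

    sigma1 and sigma2 are the population standard deviations, assumed known.
    """
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = (mean1 - mean2) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical summary statistics for two large samples
z2, p2 = two_sample_z(mean1=75, mean2=72, sigma1=6, sigma2=8, n1=36, n2=64)
print(z2, p2)
```

Here each variance term equals 1 (36/36 and 64/64), so the standard error is √2 and z is 3/√2.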
4.3 F-Test
• It is a parametric hypothesis test based on Snedecor's F-distribution.
• It tests the null hypothesis that two normal populations have the same variance.
• The F-test compares two sample variances to judge whether the population variances are equal.
• The F-statistic is simply the ratio of two variances.
• By changing which variances enter the ratio, the F-test becomes very flexible. It can be used to:
– Test the overall significance of a regression model.
– Compare the fits of different models.
– Test the equality of means.
• Assumptions of this test are as follows:
– Population distribution is normal
– Samples are random and independent
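As a rough sketch of the variance-ratio idea, the snippet below computes the F-statistic and its degrees of freedom for two samples. It is plain Python rather than R. Placing the larger sample variance in the numerator is a common convention assumed here, not something the notes specify; `f_statistic` and the data are hypothetical. The critical value would still come from an F table.

```python
from statistics import variance

def f_statistic(sample1, sample2):
    """Return (F, df_numerator, df_denominator).

    Assumed convention: the larger sample variance goes in the numerator,
    so F is always at least 1.
    """
    v1, v2 = variance(sample1), variance(sample2)  # sample variances (n - 1 denominator)
    if v1 >= v2:
        return v1 / v2, len(sample1) - 1, len(sample2) - 1
    return v2 / v1, len(sample2) - 1, len(sample1) - 1

# Hypothetical samples with sample variances 10 and 2.5
a = [2, 4, 6, 8, 10]
b = [4, 5, 6, 7, 8]
F, df_num, df_den = f_statistic(a, b)
print(F, df_num, df_den)
```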
$$ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
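The chi-square sum can be sketched for a contingency table, reading 𝑂𝑖𝑗 as the observed count in row 𝑖 and column 𝑗 and 𝐸𝑖𝑗 as the matching expected count. The snippet is plain Python, not R. Deriving the expected counts from the row and column totals (a test of independence) is an assumption here, since the notes give only the statistic; `chi_square` and the table are hypothetical.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts.

    Assumption: expected counts come from the row and column totals,
    E_ij = row_total_i * col_total_j / grand_total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# Hypothetical 2 x 2 table of counts
table = [[30, 10],
         [20, 40]]
chi2, dof = chi_square(table)
print(chi2, dof)
```

For this table the expected counts are 20, 20, 30 and 30, so the four terms are 5, 5, 10/3 and 10/3.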
When consulting the significance tables, the smaller of $U_1$ and $U_2$ is used. The sum of the two values is
given by
$$ U_1 + U_2 = R_1 - \frac{n_1(n_1 + 1)}{2} + R_2 - \frac{n_2(n_2 + 1)}{2} $$
Knowing that $R_1 + R_2 = \frac{N(N+1)}{2}$ and $N = n_1 + n_2$, and doing some algebra, we can find that the sum is:
$$ U_1 + U_2 = n_1 n_2 $$
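The identity for the sum can be checked numerically. The sketch below, in plain Python rather than R, ranks the pooled data, forms the rank sums 𝑅1 and 𝑅2, and computes each 𝑈 as its rank sum minus 𝑛(𝑛 + 1)/2, as in the sum above. The helper names and the data are hypothetical, and giving tied values their average rank is an assumed convention.

```python
def average_ranks(values):
    """Map each distinct value to its 1-based rank; ties share the average rank."""
    positions = {}
    for pos, x in enumerate(sorted(values), start=1):
        positions.setdefault(x, []).append(pos)
    return {x: sum(p) / len(p) for x, p in positions.items()}

def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) computed from the rank sums R1 and R2 of the pooled data."""
    n1, n2 = len(sample1), len(sample2)
    rank = average_ranks(list(sample1) + list(sample2))
    r1 = sum(rank[x] for x in sample1)
    r2 = sum(rank[x] for x in sample2)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    return u1, u2

# Hypothetical samples with n1 = 4 and n2 = 5
u1, u2 = mann_whitney_u([3, 5, 8, 12], [1, 2, 4, 6, 7])
print(u1, u2, u1 + u2)
```

In this example the rank sums are 25 and 20, giving U values of 15 and 5, whose sum is 4 × 5 = 20. The smaller value, 5, is the one compared with the table.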
$$ H = \left[ \frac{12}{n(n + 1)} \sum_{i=1}^{k} \frac{T_i^2}{n_i} \right] - 3(n + 1) $$
where $n$ is the sum of the sample sizes over all samples, $k$ is the number of samples, $T_i$ is the sum of ranks in the $i$-th sample,
and $n_i$ is the size of the $i$-th sample.
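A minimal sketch of the H computation follows, in plain Python rather than R. It ranks the pooled observations, sums the ranks per sample, and applies the formula as written, with no tie correction. The function name `kruskal_wallis_h` and the data are hypothetical; tied values, if any, are assumed to share their average rank.

```python
def kruskal_wallis_h(*samples):
    """Return H for k samples, using the formula without a tie correction."""
    pooled = sorted(x for s in samples for x in s)
    n = len(pooled)  # total number of observations across all samples
    # Rank of each distinct value; tied values share the average rank (assumption)
    positions = {}
    for pos, x in enumerate(pooled, start=1):
        positions.setdefault(x, []).append(pos)
    rank = {x: sum(p) / len(p) for x, p in positions.items()}
    # Sum over samples of T_i^2 / n_i, where T_i is the rank sum of sample i
    total = sum(sum(rank[x] for x in s) ** 2 / len(s) for s in samples)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical data: three samples of size 3 with no overlap
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(h)
```

For these samples the rank sums are 6, 15 and 24, so the summation is 12 + 75 + 192 = 279 and H = (12/90)(279) − 30 = 7.2.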
OVERALL CONCLUSIONS
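If the value of the test statistic is greater than the critical (table) value, reject the null hypothesis.
The rank-based U statistic is the exception: because the smaller of 𝑈1 and 𝑈2 is compared with the table,
the null hypothesis is rejected when that value is less than or equal to the critical value.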