
MDS 202: Data Science II with R

Lectures 10-12: What are Statistical Tests?

Dr. Shatrughan Singh (Amity University Rajasthan, Jaipur; [email protected])

Week 4 (27 Feb. - 3 March) 2023

1 LEARNING OBJECTIVES
1.1 What are you expected to learn from this lecture?
• Learn about the different statistical tests used in data science.
• Learn how to calculate these statistical tests in R.

4 Common Statistical Tests


Continued from the previous lectures …

4.2 Z-Test
• It is a parametric test of hypothesis testing.
• Essentially, it tests the significance of the difference between mean values when the sample size is large (i.e., greater than 30) and the population variance is known.
• Assumptions of this test are as follows:
– Population distribution is normal.
– Samples are random and independent.
– The sample size is large.
– Population standard deviation is known.

4.2.1 One-Sample Z-Test


To compare a sample mean with the population mean.

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu$ the population mean, $\sigma$ the population standard deviation, and $n$ the sample size.
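Base R does not ship a z-test function (the BSDA package provides a packaged z.test()), so a minimal hand computation is sketched below; the sample x, hypothesized mean mu0, and known population standard deviation sigma are illustrative assumptions:

```r
# One-sample z-test computed by hand (base R has no built-in z-test).
set.seed(1)
x     <- rnorm(50, mean = 102, sd = 15)  # illustrative sample, n > 30
mu0   <- 100                             # hypothesized population mean
sigma <- 15                              # assumed known population std. deviation

z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
p_value <- 2 * pnorm(-abs(z))            # two-tailed p-value
c(z = z, p = p_value)
```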

4.2.2 Two-Sample Z-Test


To compare the means of two different samples.

$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$$

where $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $\sigma_1$ and $\sigma_2$ are the known population standard deviations, and $n_1$ and $n_2$ are the sample sizes for groups 1 and 2, respectively.
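The same hand computation extends to two samples; x1, x2 and the known population standard deviations sigma1, sigma2 below are illustrative:

```r
# Two-sample z-test computed by hand.
set.seed(1)
x1 <- rnorm(60, mean = 100, sd = 15)  # illustrative group 1
x2 <- rnorm(55, mean = 105, sd = 12)  # illustrative group 2
sigma1 <- 15                          # assumed known population std. deviations
sigma2 <- 12

z <- (mean(x1) - mean(x2)) / sqrt(sigma1^2 / length(x1) + sigma2^2 / length(x2))
p_value <- 2 * pnorm(-abs(z))         # two-tailed p-value
c(z = z, p = p_value)
```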

4.3 F-Test
• It is a parametric test of hypothesis testing based on Snedecor's F-distribution.
• It tests the null hypothesis that two normal populations have the same variance.
• The F-test is regarded as a comparison of the equality of sample variances.
• The F-statistic is simply a ratio of two variances.
• By changing the variances in the ratio, the F-test becomes a very flexible test. It can then be used to:
– Test the overall significance of a regression model,
– Compare the fits of different models, and
– Test the equality of means.
• Assumptions of this test are as follows:
– Population distribution is normal
– Samples are random and independent

4.3.1 F-Test is calculated as:


$$F = \frac{s_1^2}{s_2^2}$$

where

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$

$x_i$ = value of the observations from $i = 1$ to $n$, $\bar{X}$ is the mean of the sample, and $n$ is the number of observations.
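In R, the two-variance F-test is available as var.test() in the base stats package; the two samples below are illustrative:

```r
# F-test for equality of two variances: F = var(x) / var(y).
set.seed(1)
x <- rnorm(30, sd = 2)  # illustrative sample 1
y <- rnorm(30, sd = 3)  # illustrative sample 2

var.test(x, y)          # reports F, degrees of freedom, p-value, and CI
```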

4.4 ANOVA: ANalysis Of VAriance


• It is a parametric test of hypothesis testing.
• It is an extension of the T-test and Z-test.
• It is used to test the significance of the differences in the mean values among more than two sample groups.
• It uses the F-test to statistically test the equality of means and the relative variance between them.
• Assumptions of this test:
– Population distribution is normal
– Samples are random and independent
– Homogeneity of sample variance
• One-Way ANOVA and Two-Way ANOVA are different types of the ANOVA test.
• The F-test determines whether the variability between group means is larger than the variability of the observations within the groups:

$$F = \frac{\text{between-groups variance}}{\text{within-groups variance}}$$
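In R, a one-way ANOVA is fit with aov() and summarized with summary(); the data frame below is an illustrative sketch with three groups:

```r
# One-way ANOVA comparing mean scores across three illustrative groups.
set.seed(1)
df <- data.frame(
  score = c(rnorm(20, mean = 10), rnorm(20, mean = 12), rnorm(20, mean = 11)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)

fit <- aov(score ~ group, data = df)
summary(fit)  # F-statistic and p-value for the group effect
```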

4.5 Chi-Square Test


• It is a non-parametric test of hypothesis testing.
• As a non-parametric test, chi-square can be used:
– As a test of 'goodness of fit', which determines whether a particular distribution fits the observed data or not.
– As a test of independence of two variables.
• It helps in assessing the goodness of fit between a set of observed and theoretically expected values, by comparing the expected frequencies with the observed frequencies. In a nutshell, the greater the difference, the greater the value of chi-square.
• If there is no difference between the expected and observed frequencies, then the value of chi-square is
equal to zero.
• This test is calculated as:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $O$ = observed frequencies and $E$ = expected frequencies.
• Conditions for chi-square test:
– Randomly collect and record the observations.
– All the entities in a sample must be independent.
– None of the groups should contain very few items (say, fewer than 10).
– The overall number of observations should be reasonably large; normally, it should be at least 50, however small the number of groups may be.
• Chi-square as a parametric test is used as a test for population variance based on sample variance.
• If we take each one of a collection of sample variances, divide it by the known population variance, and multiply the quotient by $(n - 1)$, where $n$ is the number of items in the sample, we get the values of chi-square.
• It is calculated as:
$$\chi^2 = (n - 1)\,\frac{S^2}{\sigma^2}$$

where $S^2$ = sample variance, $\sigma^2$ = hypothesized population variance, and $n$ = sample size.


Or,

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ = observed frequencies and $E_{ij}$ = expected frequencies.
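In R, chisq.test() covers both uses of the test; the observed counts, hypothesized proportions, and contingency table below are illustrative:

```r
# Goodness of fit: do observed counts match hypothesized proportions?
observed   <- c(18, 22, 20, 40)      # illustrative observed frequencies
expected_p <- c(0.2, 0.2, 0.2, 0.4)  # hypothesized proportions (sum to 1)
chisq.test(observed, p = expected_p)

# Independence of two variables, via an illustrative 2x2 contingency table.
tab <- matrix(c(30, 10, 20, 40), nrow = 2)
chisq.test(tab)
```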

4.6 Mann-Whitney ‘U’ Test


• It is a non-parametric test of hypothesis testing.
• This test is used to investigate whether two independent samples were selected from a population
having the same distribution.
• It is a true non-parametric counterpart of the T-test and gives the most accurate estimates of significance, especially when sample sizes are small and the population is NOT normally distributed.
• It is based on the comparison of every observation in the first sample with every observation in the
other sample.
• Test statistic is: U
• The maximum value of U is $n_1 n_2$ and the minimum value is zero.
• This test is also known as:
– Mann-Whitney Wilcoxon Rank Sum Test (note: the Wilcoxon Signed Rank Test is different from the Wilcoxon Rank Sum Test, as the Rank Sum Test involves independent samples while the Signed Rank Test involves dependent samples)
• Mathematically, U is calculated as:

$$U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}$$



$$U_2 = R_2 - \frac{n_2(n_2 + 1)}{2}$$

where $R_1$ and $R_2$ are the sums of ranks, and $n_1$ and $n_2$ are the sample sizes for samples 1 and 2, respectively.

When consulting the significance tables, the smaller of the two values $U_1$ and $U_2$ is used. The sum of the two values is given by

$$U_1 + U_2 = R_1 - \frac{n_1(n_1 + 1)}{2} + R_2 - \frac{n_2(n_2 + 1)}{2}$$

Knowing that $R_1 + R_2 = \frac{N(N+1)}{2}$ and $N = n_1 + n_2$, and doing some algebra, we can find that the sum is:

$$U_1 + U_2 = n_1 n_2$$
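In R, the Mann-Whitney U test is run with wilcox.test(); the two independent samples below are illustrative (the W statistic that R reports is the U value for the first sample):

```r
# Mann-Whitney U / Wilcoxon Rank Sum test for two independent samples.
x <- c(12, 15, 9, 20, 14, 11)  # illustrative sample 1
y <- c(16, 22, 19, 25, 18)     # illustrative sample 2

wilcox.test(x, y)              # W is U for the first sample
```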

4.7 Kruskal Wallis ‘H’ Test


• It is a non-parametric test of hypothesis testing.
• This test is used for comparing two or more independent samples of equal or different sample sizes.
• It extends the Mann-Whitney 'U' Test, which is used to compare only two groups.
• It is the non-parametric equivalent of the (parametric) One-Way ANOVA test. Actually, it is a One-Way ANOVA on RANKS.
• Assumption of normality is not required for this test.
• Test statistic is: H

$$H = \left[ \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{T_i^2}{n_i} \right] - 3(n + 1)$$

where $n$ = sum of the sample sizes for all samples, $k$ = number of samples, $T_i$ = sum of ranks in the $i$-th sample, and $n_i$ = size of the $i$-th sample.
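In R, this test is run with kruskal.test(); the values and group labels below are illustrative:

```r
# Kruskal-Wallis H test across three illustrative groups.
values <- c(7, 14, 14, 13, 12, 8, 15, 17, 16, 15, 10, 11, 19, 11, 9)
groups <- factor(rep(c("A", "B", "C"), each = 5))

kruskal.test(values ~ groups)  # H is reported as "Kruskal-Wallis chi-squared"
```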

OVERALL CONCLUSIONS
If the value of the test statistic is greater than the tabulated (critical) value, then Reject the Null Hypothesis.

End of the Lecture!!
