Chapter 3 Notes
Ex: The weights of a group of students (in lbs) are given below:
135, 105, 118, 163, 172, 183, 122, 150, 121, 162
Mean = (135 + 105 + 118 + 163 + 172 + 183 + 122 + 150 + 121 + 162)/10 = 1431/10 = 143.1 lbs
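In R, the same calculation is a one-liner with the built-in mean() function (the vector name weights is just an illustrative choice):
R> weights <- c(135, 105, 118, 163, 172, 183, 122, 150, 121, 162)
R> mean(weights)
[1] 143.1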
Sample Median:
The number such that half of the observations are smaller and half are larger, i.e., the midpoint of
a distribution.
The following data represent the number of weeks it took seven individuals to obtain their
driver's licenses. Find the sample median.
2, 110, 5, 7, 6, 7, 3
Solution
First arrange the data in increasing order.
2, 3, 5, 6, 7, 7, 110
Since the sample size is 7, it follows that the sample median is the fourth-smallest value. That is,
the sample median number of weeks it took to obtain a driver's license is m=6 weeks.
The following data represent the number of days it took 6 individuals to quit smoking after
completing a course designed for this purpose. Find the sample median.
1, 2, 3, 5, 8, 100
Solution
The data are already in increasing order. Since the sample size is 6, an even number, the sample
median is the average of the two middle (third- and fourth-smallest) values. That is, m = (3 + 5)/2
= 4 days.
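Both medians can be checked in R with the built-in median() function, which handles the even-sample-size case automatically:
R> median(c(2, 110, 5, 7, 6, 7, 3))
[1] 6
R> median(c(1, 2, 3, 5, 8, 100))
[1] 4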
Sample Variance:
The sample variance S² is used to describe the variation around the mean. We use
S² = Σ(xᵢ − x̄)² / (n − 1),
where x̄ is the sample mean and n is the sample size. The sample standard deviation S is the
square root of the sample variance.
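In R, the functions var() and sd() compute the sample variance and sample standard deviation (both use the n − 1 denominator); applied to the weights vector from the first example:
R> var(weights)
[1] 700.9889
R> sd(weights)
[1] 26.47619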
Sampling Distribution:
In general, the sampling distribution of a given statistic is the distribution of the values taken by
the statistic in all possible samples of the same size from the same population.
In other words, if we repeatedly collect samples of the same sample size from the population,
compute the statistic (mean, standard deviation, proportion) for each sample, and then draw a
histogram of those statistics, the distribution that the histogram tends toward is called the
sampling distribution of that statistic.
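A quick way to see this in R is by simulation. The following sketch (the population parameters and the number of repetitions are illustrative choices) draws many samples of size 5 from a normal population and plots a histogram of the resulting sample means:
R> set.seed(1)  # for reproducibility
R> sample.means <- replicate(10000, mean(rnorm(5, mean=22, sd=1.5)))
R> hist(sample.means, main="Sampling distribution of the mean (n = 5)")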
Let X̄ be the mean of a random sample of size 50 drawn from a population with mean 112
and standard deviation 40. Find the mean and standard deviation of X̄.
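Solution: the mean of X̄ equals the population mean, and its standard deviation (the standard error) is the population standard deviation divided by the square root of the sample size:
R> 40/sqrt(50)
[1] 5.656854
So X̄ has mean 112 and standard deviation 40/√50 ≈ 5.66.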
As an example, suppose that the daily maximum temperature in the month of January follows a
normal distribution, with a mean of 22 degrees Celsius and a standard deviation of 1.5 degrees.
Then, in line with the comments for situation 1, for samples of size n = 5, the sampling
distribution of X̄ will be normal, with mean 22 and standard error 1.5/√5 ≈ 0.671 (the standard
error formula is σ/√n, estimated by s/√n when σ must be estimated from the sample).
In this example, the sampling distribution of X̄ is clearly a taller, thinner normal distribution
than the one tied to the individual observations.
You can now ask various probability questions; note that distinguishing between the
measurement distribution and the sampling distribution is important. For example, the following
code provides Pr(X < 21.5), the probability that a randomly chosen day in January has a
maximum temperature of less than 21.5 degrees:
R> pnorm(21.5,mean=22,sd=1.5)
[1] 0.3694413
The next bit of code provides the probability that the sample mean will be less than 21.5 degrees,
Pr(X̄ < 21.5), based on a sample of five random days in January:
R> pnorm(21.5,mean=22,sd=1.5/sqrt(5))
[1] 0.2280283
The standard error is the standard deviation of a statistic's sampling distribution, not of the
population itself. It measures the accuracy with which a sample statistic represents the
corresponding population parameter.
The standard error of a statistic is the standard deviation of its sampling distribution or an
estimate of that standard deviation. If the statistic is the sample mean, it is called the standard
error of the mean.
HYPOTHESIS TESTING:
A hypothesis is an assumption made by the researcher that is not necessarily true. In simple
words, a hypothesis is a claim about the population that is to be checked against data collected in
an experiment. Hypothesis testing, in a way, is a formal process of validating the hypothesis
made by the researcher.
To perform hypothesis testing, a random sample of data from the population is taken and the test
is performed. Based on the results of the test, the hypothesis is either retained or rejected. This
process of drawing conclusions about a population from sample data is known as statistical
inference.
P-Value or probability value is a number that denotes the likelihood of your data having
occurred under the null hypothesis of your statistical test.
To give you an example of hypothesis testing, suppose I told you that 7 percent of a certain
population was allergic to dust. You then randomly selected 20 individuals from that population
and found that 18 of them were allergic to dust. Assuming your sample was balanced and truly
reflective of the population, what would you then think about my claim that the true proportion
of allergic individuals is 7 percent?
Naturally, you would doubt the correctness of my claim. In other words, there is such a small
probability of observing 18 or more successes out of 20 trials for a set success rate of 0.07 that
you can state that you have statistical evidence against the claim that the true rate is 0.07. Indeed,
when defining X as the number of allergic individuals out of 20 by assuming
X ∼ BIN(20,0.07), evaluating Pr(X ≥ 18) gives you the precise p-value, which is tiny.
R> dbinom(18,size=20,prob=0.07) + dbinom(19,size=20,prob=0.07) +
dbinom(20,size=20,prob=0.07)
[1] 2.69727e-19
This p-value represents the probability of observing the results in your sample, X = 18, or a more
extreme outcome (X = 19 or X = 20), if the chance of success was truly 7 percent.
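The same tail probability can be obtained more directly from the binomial cumulative distribution function pbinom(), by asking for the upper tail beyond 17:
R> pbinom(17, size=20, prob=0.07, lower.tail=FALSE)
[1] 2.69727e-19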
Components of a Hypothesis Test:
Hypotheses:
As the name would suggest, in hypothesis testing, formally stating a claim and the subsequent
hypothesis test is done with a null and an alternative hypothesis.
The null hypothesis is interpreted as the baseline or no-change hypothesis and is the claim that is
assumed to be true.
The alternative hypothesis is the conjecture that you’re testing for, against the null hypothesis;
the alternative is favored only if the data provide evidence that the null hypothesis is false.
In general, null and alternative hypotheses are denoted 𝐻𝑜 and 𝐻𝐴 , respectively, and they are
written as follows:
𝑯𝒐 : . . .
𝑯𝑨 : . . .
The null hypothesis is often (but not always) defined as an equality, =, to a null value.
Conversely, the alternative hypothesis (the situation you’re testing for) is often defined in terms
of an inequality to the null value.
When 𝑯𝑨 is defined in terms of a less-than statement, with <, it is one-sided; this is also
called a lower-tailed test.
When 𝑯𝑨 is defined in terms of a greater-than statement, with >, it is one-sided; this is
also called an upper-tailed test.
When 𝑯𝑨 is merely defined in terms of a different-to statement, with ≠, it is two-sided;
this is also called a two-tailed test.
Test Statistic:
Once the hypotheses are formed, sample data are collected, and statistics are calculated
according to the parameters detailed in the hypotheses. The test statistic is the statistic that is
compared to the appropriate standardized sampling distribution to yield the p-value.
Specifically, the test statistic is determined by both the difference between the original sample
statistic and the null value and the standard error of the sample statistic.
p-value:
The p-value is the probability value that is used to quantify the amount of evidence, if any,
against the null hypothesis. More formally, the p-value is found to be the probability of
observing the test statistic, or something more extreme, assuming the null hypothesis is true.
Put simply, the more extreme the test statistic, the smaller the p-value. The smaller the p-value,
the greater the amount of statistical evidence against the assumed truth of 𝑯𝑶 .
Significance Level:
For every hypothesis test, a significance level, denoted α, is assumed. This is used to qualify the
result of the test. The significance level defines a cutoff point, at which you decide whether there
is sufficient evidence to view 𝑯𝑶 as incorrect and favour 𝑯𝑨 instead.
If the p-value is greater than or equal to α, then you conclude there is insufficient
evidence against the null hypothesis, and therefore you retain 𝑯𝑶 when compared to 𝑯𝑨 .
If the p-value is less than α, then the result of the test is statistically significant. This
implies there is sufficient evidence against the null hypothesis, and therefore you reject
𝑯𝑶 in favor of 𝑯𝑨 .
Common or conventional values of α are α = 0.1, α = 0.05, and α = 0.01.
Four Step Process of Hypothesis Testing:
There are 4 major steps in hypothesis testing:
1. State the hypotheses: state the null hypothesis, which is presumed true, and the alternative
hypothesis that is to be tested against it.
2. Formulate an analysis plan and set the criteria for decision: in this step, a significance
level for the test is set. The significance level is the probability of a false rejection of the
null hypothesis.
3. Analyse sample data: a test statistic is computed that measures the discrepancy between
the sample statistic (such as the sample mean or standard deviation) and the
corresponding population value claimed under the null hypothesis, relative to the
standard error.
4. Interpret the decision: the p-value associated with the test statistic is compared to the
significance level. For example, if the significance level is set to 0.1, then the null
hypothesis is rejected whenever the p-value is below 0.1; otherwise, the null hypothesis
is retained.
Testing Means:
Hypothesis tests involving sample means use the t-distribution instead of the normal
distribution, because the population standard deviation must be estimated from the sample.
Suppose a businessman with two sweet shops in a town wants to check if the average number of
sweets sold in a day in both stores is the same or not.
So, the businessman records the number of sweets sold to 15 random customers in each of the
respective shops. He finds that the first shop sold 30 sweets on average, whereas the second
shop sold 40. So, from the owner’s point of view, the second shop was doing better business than
the former. But the thing to notice is that the data set is based on a small number of random
customers who cannot represent all the customers. This is where t-testing comes into play: it
helps us understand whether the difference between the two means is real or simply due to
chance.
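As a rough sketch of how that comparison would look in R, with made-up daily sales counts for the two shops (the vectors shop1 and shop2 and all of their values are purely illustrative):
R> shop1 <- c(28, 32, 25, 30, 35, 27, 31, 29, 33, 26, 30, 28, 34, 29, 31)
R> shop2 <- c(38, 42, 35, 44, 39, 41, 37, 45, 40, 36, 43, 39, 42, 38, 41)
R> t.test(shop1, shop2)
A small p-value in the resulting output would suggest that the difference between the two mean sales figures is unlikely to be due to chance alone.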
In R programming, you can perform hypothesis testing using various built-in functions.
Here’s an overview of some commonly used hypothesis testing methods in R:
1. T-test (one-sample, paired, and independent two-sample)
2. Chi-square test
3. ANOVA (Analysis of Variance)
4. Wilcoxon signed-rank test
5. Mann-Whitney U test
Single Mean:
One-Sample t-Test:
The one-sample t-test is used to test the statistical difference between a sample mean and a
known or assumed/hypothesized value of the mean in the population. In other words, it is used
to compare the mean of a sample to a known value (usually a population mean) to see if there
is a significant difference.
Example:
A manufacturer of a snack was interested in the mean net weight of contents in an advertised
80-gram pack. Say that a consumer calls in with a complaint: over time they have bought and
precisely weighed the contents of 44 randomly selected 80-gram packs from different stores and
recorded the weights as follows:
R> snacks <- c(87.7,80.01,77.28,78.76,81.52,74.2,80.71,79.5,77.87,81.94,80.7,
82.32,75.78,80.19,83.91,79.4,77.52,77.62,81.4,74.89,82.95,
73.59,77.92,77.18,79.83,81.23,79.28,78.44,79.01,80.47,76.23,
78.89,77.14,69.94,78.54,79.7,82.45,77.29,75.52,77.21,75.99,
81.94,80.41,77.7)
The customer claims that they’ve been shortchanged (giving them less than they deserve)
because their data cannot have arisen from a distribution with mean μ = 80, so the true mean
weight must be less than 80.
To investigate this claim, the manufacturer conducts a hypothesis test using a significance level
of α = 0.05.
First, the hypotheses must be defined, with a null value of 80 grams.
Remember, the alternative hypothesis is “what you’re testing for”; in this case, 𝑯𝑨 is that μ is
smaller than 80.
The null hypothesis, interpreted as “no change,” will be defined as μ = 80: that the true mean is
in fact 80 grams.
These hypotheses are formalized like this:
𝑯𝑶 : μ = 80
𝑯𝑨 : μ < 80
Second, the mean and standard deviation must be estimated from the sample.
R> n <- length(snacks)
R> snack.mean <- mean(snacks)
R> snack.mean
[1] 78.91068
R> snack.sd <- sd(snacks)
R> snack.sd
[1] 3.056023
The question your hypotheses seek to answer is this: given the estimated standard deviation,
what’s the probability of observing a sample mean (when n = 44) of 78.91 grams or less if the
true mean is 80 grams? To answer this, you need to calculate the relevant test statistic.
Formally, the test statistic T in a hypothesis test for a single mean with respect to a null value of
μ0 is given as
T = (x̄ − μ0) / (s/√n)
based on a sample of size n, a sample mean of x̄, and a sample standard deviation of s (the
denominator, s/√n, is the estimated standard error of the mean).
Assuming the relevant conditions have been met, T follows a t-distribution with ν = n − 1
degrees of freedom.
In R, the following provides you with the standard error of the sample mean for the snacks data:
R> snack.se <- snack.sd/sqrt(n)
R> snack.se
[1] 0.4607128
Then, T can be calculated as follows:
R> snack.T <- (snack.mean-80)/snack.se
R> snack.T
[1] -2.364419
Finally, the test statistic is used to obtain the p-value: the probability of observing T or
something more extreme (here, something smaller, since the test is lower-tailed).
R> pt(snack.T, df=n-1)
[1] 0.01132175
Here the p-value (0.01132175) is less than α (0.05). This implies there is sufficient evidence
against the null hypothesis, and therefore you reject 𝑯𝑶 in favor of 𝑯𝑨.
R Function: t.test( )
The result of the one-sample t-test can also be found with the built-in t.test( ) function.
Syntax: t.test(x, mu, alternative)
Parameters: x is the numeric data vector, mu is the null value, and alternative gives the
direction of the test ("less", "greater", or "two.sided").
R> t.test(x=snacks, mu=80, alternative="less")

	One Sample t-test

data:  snacks
t = -2.3644, df = 43, p-value = 0.01132
alternative hypothesis: true mean is less than 80
95 percent confidence interval:
     -Inf 79.68517
sample estimates:
mean of x 
 78.91068 
Example:
# Data
data <- c(12, 10, 15, 14, 18, 20, 11, 9, 17, 13)
# Hypothesis test
t.test(data, mu = 15)
A confidence interval (CI) is a range of estimates for an unknown parameter. A confidence
interval is computed at a designated confidence level; the 95% confidence level is most common,
but other levels, such as 90% or 99%, are sometimes used.
A confidence interval is the mean of your estimate plus and minus the variation in that estimate.
This is the range of values you expect your estimate to fall between if you redo your test, within
a certain level of confidence.
Confidence, in statistics, is another way to describe probability. For example, if you construct a
confidence interval with a 95% confidence level, you are confident that 95 out of 100 times the
estimate will fall between the upper and lower values specified by the confidence interval.
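In R, a confidence interval for a mean comes for free with t.test(): the conf.int component of the returned object holds the interval at the requested confidence level. Applied to the snacks data, the default two-sided call gives a 95% confidence interval of roughly 77.98 to 79.84 grams:
R> t.test(snacks)$conf.int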
Two Means
Two-sample t-test:
Unpaired/Independent Samples
The two-sample t-test is used to compare the means of two independent samples to see if there is
a significant difference.
In two-sample t-testing, two sample vectors are compared. If var.equal = TRUE, the test
assumes that the variances of both samples are equal; otherwise R performs the Welch test,
which does not require equal variances.
Syntax: t.test(x, y)
Parameters: x and y: Numeric vectors
Example: After collecting a sample of 44 packs from the original manufacturer (label this sample
size n1), the unhappy consumer goes out and collects n2 = 31 randomly selected 80-gram packs
from a rival snack manufacturer. This second set of measurements is stored as snacks2.
R> snacks2 <- c(80.22,79.73,81.1,78.76,82.03,81.66,80.97,81.32,80.12,78.98,
79.21,81.48,79.86,81.06,77.96,80.73,80.34,80.01,81.82,79.3,
79.08,79.47,78.98,80.87,82.24,77.22,80.03,79.2,80.95,79.17,81)
We already know the mean and standard deviation of the first sample of size n1 = 44: these are
stored as snack.mean (around 78.91) and snack.sd (around 3.06), respectively. Think of these as
x̄1 and s1. Compute the same quantities, x̄2 and s2, respectively, for the new data.
R> snack2.mean <- mean(snacks2)
R> snack2.mean
[1] 80.1571
R> snack2.sd <- sd(snacks2)
R> snack2.sd
[1] 1.213695
Let the true mean of the original sample be denoted with μ1 and the true mean of the new sample
from the rival company packs be denoted with μ2.
We are interested in testing whether there is statistical evidence to support the claim that μ2 is
greater than μ1.
This suggests the hypotheses of 𝑯𝑶 : μ1 = μ2 and 𝑯𝑨 : μ1 < μ2, which can be written as follows:
𝑯𝑶 : μ2 − μ1 = 0
𝑯𝑨 : μ2 − μ1 > 0
R> t.test(x=snacks2, y=snacks, alternative="greater", conf.level=0.9)
Welch Two Sample t-test
data: snacks2 and snacks
t = 2.4455, df = 60.091, p-value = 0.008706
alternative hypothesis: true difference in means is greater than 0
The sample proportion is a random variable: it varies from sample to sample in a way that cannot
be predicted with certainty. Viewed as a random variable it will be written P̂.
It has a mean μP̂ and a standard deviation σP̂, given by
μP̂ = p and σP̂ = √(p(1 − p)/n),
where p is the true population proportion and n is the sample size.
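For instance, with a true proportion of p = 0.5 and samples of size n = 160 (the numbers used in the heart-disease example below), the standard deviation of P̂ works out as follows:
R> sqrt(0.5*(1-0.5)/160)
[1] 0.03952847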
The general rules regarding the setup and interpretation of hypothesis tests for sample
proportions remain the same as for sample means. In this introduction to Z-tests, you can
consider these as tests regarding the true value of a single proportion or the difference between
two proportions.
We may also consider sample proportions, interpreted as the mean of a series of n binary trials,
in which the results are success (1) or failure (0).
Single Proportion (One-Sample Z-Test):
The One proportion Z-test is used to compare an observed proportion to a theoretical one when
there are only two categories.
For example, we have a population of people containing half males and half females (p = 0.5 =
50%). Some of these people (n = 160) have heart disease: 95 males and 65 females.
We want to know whether heart disease affects more males than females. So in this problem:
The number of successes (male with heart diseases) is 95
The observed proportion (po) of the male is 95/160
The observed proportion (q) of the female is 1 – po
The expected proportion (pe) of the male is 0.5 (50%)
The number of observations (n) is 160
The Formula for the One-Proportion Z-Test
The test statistic (the z-statistic) can be calculated as follows:
z = (po − pe) / √(pe(1 − pe)/n)
where,
po: the observed proportion
q: 1 − po, the observed proportion of failures
pe: the expected proportion under the null hypothesis
n: the sample size
(The denominator is the standard error of the proportion under the null hypothesis.)
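Carrying the calculation out by hand in R for the heart-disease example above:
R> po <- 95/160   # observed proportion of males among those with heart disease
R> pe <- 0.5      # expected proportion under the null hypothesis
R> n <- 160
R> (z <- (po - pe)/sqrt(pe*(1-pe)/n))
[1] 2.371708
The corresponding two-sided p-value, 2*pnorm(-abs(z)), is about 0.0177; prop.test(x = 95, n = 160, p = 0.5, correct = FALSE) reports the same p-value, and its X-squared statistic is exactly z² = 5.625.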
R Function: prop.test( )
prop.test( ) can be used for testing the null that the proportions (probabilities of success) in
several groups are the same, or that they equal certain given values.
Syntax:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"),
conf.level = 0.95, correct = TRUE)
sample estimates:
        p
0.5333333
The p-value of the test is 4.486269e-05, which is less than the significance level alpha = 0.05.
The claim that 30 out of 70 people recommend street food to their friends is therefore not
supported by the data.
Example:
Suppose that current vitamin pills cure 80% of all cases. A new vitamin pill has been developed.
In a sample of 150 patients with a vitamin deficiency who were treated with the new vitamins, 95
were cured. Do the results of this study support the claim that the new vitamins have a higher
cure rate than the existing vitamins?
Solution: Here x = 95, p = 0.80, and n = 150. Since the claim is about a higher cure rate, the
alternative is "greater". Let’s use the function prop.test() in R.
prop.test(x = 95, n = 150, p = 0.8,
          alternative = "greater", correct = FALSE)
Two-Proportions Z-Test:
A two-proportion z-test allows us to compare two proportions to see if they are the same.
It calculates the range of values that is likely to include the difference between the population
proportions. The two-proportion z-test is used to compare two observed proportions.
Syntax:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"), correct = TRUE)
Parameters:
x = number of successes and failures in data set.
n = size of data set.
p = probabilities of success. It must be in the range of 0 to 1.
alternative = a character string specifying the alternative hypothesis.
correct = a logical indicating whether Yates’ continuity correction should be applied where
possible.
Let’s say we have two groups of students, A and B. Group A, an early morning class, has 400
students, of whom 342 are female. Group B, a late class, has 400 students, of whom 290 are
female. Use a 5% alpha level. We want to know whether the proportions of females are the
same in the two groups of students.
> prop.test(x = c(342, 290),
+ n = c(400, 400))
If we want to test whether the observed proportion of females in group A (pA) is greater than the
observed proportion of females in group B (pB), then the command is:
> prop.test(x = c(342, 290),
+ n = c(400, 400),
+ alternative = "greater")
Example 2
ABC company manufactures tablets. For quality control, two sets of tablets were tested. In the
first group, 32 out of 700 were found to contain some sort of defect. In the second group, 30 out
of 400 were found to contain some sort of defect. Is the difference between the two groups
significant? Use a 5% alpha level.
prop.test(x = c(32, 30),
n = c(700, 400))
Thus, as a result, the p-value of the test, 0.0587449, is greater than the significance level
alpha = 0.05. That means there is no significant difference between the two proportions.
TYPE I AND TYPE II ERRORS:
Consider a court trial, where the null hypothesis is that the convict is not guilty. For every
decision, the truth can be either that the convict is really guilty or that the convict is not guilty in
reality. Hence there are two types of errors: a Type I error is rejecting the null hypothesis when it
is actually true, and a Type II error is retaining the null hypothesis when it is actually false.
Ho = Not Guilty
Ha = Guilty
In the above example,
Type I Error will be: Innocent in Jail
Type II Error will be: Guilty Set Free
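A small simulation sketch makes the Type I error rate concrete: when the null hypothesis is true and tests are run at α = 0.05, about 5% of them reject it anyway. The sample size and number of repetitions below are illustrative choices:
R> set.seed(42)  # for reproducibility
R> pvals <- replicate(10000, t.test(rnorm(30, mean=0), mu=0)$p.value)
R> mean(pvals < 0.05)  # proportion of false rejections, close to 0.05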
ANALYSIS OF VARIANCE:
It is used to compare multiple means in a test for equivalence. In that sense, it’s a straightforward
extension of the hypothesis test comparing two means. There’s a continuous variable from which
the means of interest are calculated, and there’s at least one categorical variable that tells you
how to define the groups for those means.
An analysis of variance (ANOVA) tests whether statistically significant differences exist
between more than two samples. For this purpose, the means and variances of the respective
groups are compared with each other. In contrast to the t-test, which tests whether there is a
difference between two samples, the ANOVA tests whether there is a difference between more
than two groups.
Types of ANOVA Test in R
1. One-way ANOVA:
There are many situations where you need to compare the mean between multiple groups. For
instance, the marketing department wants to know if three teams have the same sales
performance.
Team: 3 level factor: A, B, and C
Sale: A measure of performance
The ANOVA test can tell if the three groups have similar performances.
Hypothesis in one-way ANOVA test
H0: The means between groups are identical
HA: At least, the mean of one group is different
For instance, imagine we're analysing the performance scores of students from three different
schools. We want to determine if there are any statistically significant differences in the mean
scores among these schools. Here's how you can perform a one-way ANOVA in R:
library(stats)
# Create a dataset with a dependent variable 'scores' and an independent variable 'school'
data <- data.frame(
scores = c(85, 72, 90, 78, 91, 88, 65, 76, 82),
school = factor(c("A", "A", "B", "B", "C", "C", "A", "B", "C"))
)
print(data)
# Perform one-way ANOVA
result <- aov(scores ~ school, data = data)
# Display ANOVA table
print(summary(result))
Output:
scores school
1 85 A
2 72 A
3 90 B
4 78 B
5 91 C
6 88 C
7 65 A
8 76 B
9 82 C
Df Sum Sq Mean Sq F value Pr(>F)
school 2 254.9 127.44 2.108 0.203
Residuals 6 362.7 60.44
Residuals: the deviations of the individual observations from their group means are known
as residuals; the second row of the table summarizes them.
A small p-value in the summary indicates significant differences among the schools' mean
scores.
The output provides valuable information, including the F-statistic and p-value. A low p-
value (typically ≤ 0.05) indicates that there are significant differences in mean scores
among the schools.
Df: refers to the degrees of freedom. For the independent variable it is the number of levels
minus 1; for the residuals it is the total number of observations minus the number of
levels.
Sum Sq: Refers to the total variation between overall and group means.
Mean Sq: Refers to the mean of Sum Sq which is calculated by dividing the sum of
squares by the degrees of freedom.
F-value: refers to the mean square of the independent variable divided by the mean square
of the residuals. In simple terms, it is the test statistic from the F-test. The larger the value,
the more likely it is that the variation explained by the independent variable is real rather
than due to chance.
Pr(>F): refers to how likely an F-value at least as large as the one calculated would be if
the null hypothesis were true; this is the p-value of the test.
Finally, depending on how high or low the p-value in the summary is, we can understand
what impact the independent variable had on the final result.
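The Pr(>F) entry can be reproduced directly from the F-distribution with pf(), using the two Df values from the table (2 for school, 6 for the residuals):
pf(2.108, df1 = 2, df2 = 6, lower.tail = FALSE)
[1] 0.2025868
This agrees with the 0.203 shown in the ANOVA table.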
2. Two-way ANOVA:
Two-way ANOVA becomes relevant when two categorical independent variables influence a
dependent variable. This type of ANOVA delves into interactions between these variables,
offering insights into how their combined effects impact the outcome.
Suppose we're investigating the effects of both gender and treatment on patient recovery times.
Here's how you can execute a two-way ANOVA in R:
data <- data.frame(
  recovery_time = c(10, 12, 15, 14, 9, 11, 16, 18),
  gender = factor(c("Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female")),
  treatment = factor(c("A", "B", "A", "B", "A", "B", "A", "B"))  # treatment labels assumed; the notes omit them
)
# Perform two-way ANOVA with an interaction between gender and treatment
result <- aov(recovery_time ~ gender * treatment, data = data)
# Display ANOVA table
print(summary(result))
The code builds the data set and then applies two-way ANOVA using aov() with interaction (*)
between gender and treatment and summarizes the result. Interaction effects are revealed in the
summary, indicating whether the influence of one variable depends on the level of the other.