
Hypothesis

In our daily life, we often hear statements like "Dhoni is a better captain than his contemporaries", a motorcycle company claiming that a certain model gives an average mileage of 100 km per litre, or a toothpaste company claiming to be the number one brand recommended by dentists.

Let's suppose you must purchase a motorcycle and you have heard the above claim made by the motorcycle company. Would you just go and buy it, or would you look for proof? There must be a parameter on which to judge the correctness of the statement. In this case the parameter is the average mileage, which you can use to check whether the statement is true or just a hoax.

A hypothesis is a statement, assumption or claim about the value of the parameter (mean,
variance, median etc.).

A hypothesis is an educated guess about something in the world around you. It should be testable,
either by experiment or observation.

For example, if we make the statement "Dhoni is the best Indian captain ever", the assumption we are making is based on the average wins and losses the team had under his captaincy. We can test this statement against all the match data.

Simple and Composite Hypothesis

When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis and if it
specifies a range of values then it is called a composite hypothesis.

e.g., a motorcycle company claiming that a certain model gives an average mileage of 100 km per litre is a case of a simple hypothesis.

The average age of students in a class is greater than 20. This statement is a composite hypothesis.

Null Hypothesis
The null hypothesis is the hypothesis to be tested for possible rejection under the assumption that it is true. The concept of the null is like "innocent until proven guilty": we assume innocence until we have enough evidence to prove that a suspect is guilty.

It is denoted by H0.
Alternate Hypothesis
The alternative hypothesis complements the null hypothesis. It is the opposite of the null hypothesis, such that the alternate and null hypotheses together cover all the possible values of the population parameter.

It is denoted by H1.

Let’s understand this with an example:

A soap company claims that its product kills on average 99% of germs. To test this company's claim we formulate the null and alternate hypotheses.

Null hypothesis (H0): average = 99%

Alternate hypothesis (H1): average is not equal to 99%

Note: the rule of thumb is that the statement containing equality is the null hypothesis.

Hypothesis Testing

When we test a hypothesis, we assume the null hypothesis to be true until there is sufficient
evidence in the sample to prove it false. In that case we reject the null hypothesis and support the
alternate hypothesis.

If the sample fails to provide sufficient evidence for us to reject the null hypothesis, we cannot say
that the null hypothesis is true because it is based on just the sample data. For saying the null
hypothesis is true we will have to study the whole population data.

One Tailed and Two Tailed Tests

If the alternate hypothesis gives the alternate in both directions (less than and greater than) of the
value of the parameter specified in null hypothesis, it is called Two tailed test.

If the alternate hypothesis gives the alternate in only one direction (either less than or greater than)
of the value of the parameter specified in null hypothesis, it is called One tailed test.

e.g., if H0: mean = 100 and H1: mean not equal to 100,

then according to H1 the mean can be greater than or less than 100. This is an example of a two-tailed test.

Similarly, if H0: mean >= 100, then H1: mean < 100.

Here H1 allows only values less than 100, so this is a one-tailed test.


Critical Region

The critical region is that region in the sample space in which if the calculated value lies then we
reject the null hypothesis.

Let’s understand this with an example:

Suppose you are looking to rent an apartment. You have listed all the available apartments from different real estate websites. You have a budget of Rs. 15000/month and cannot spend more than that. The list of apartments you have made has prices ranging from 7000/month to 30000/month.

You select a random apartment from the list and assume below hypothesis:

H0: You will rent the apartment.

H1: You won’t rent the apartment.

Now, since your budget is 15000, you must reject all the apartments above that price.

Here all the Prices greater than 15000 become your critical region. If the random apartment’s price
lies in this region, you must reject your null hypothesis and if the random apartment’s price doesn’t
lie in this region, you do not reject your null hypothesis.

The critical region lies in one tail or in both tails of the probability distribution curve, according to the alternative hypothesis. The critical region is a pre-defined area corresponding to a cut-off value in the probability distribution curve; its size is denoted by α.

Critical values are the values separating the region that supports the null hypothesis from the region that rejects it, and they are calculated based on α.

We will see more examples later, and it will become clear how α is chosen.
Based on the alternative hypothesis, three cases of critical region arise:

Case 1) Two-tailed test: the critical region is split between both tails of the distribution.

Case 2) Left-tailed test: the critical region lies entirely in the left tail.

Case 3) Right-tailed test: the critical region lies entirely in the right tail.

Type I and Type II Error

Decision             H0 True            H0 False

Reject H0            Type I error       Correct decision
Do not reject H0     Correct decision   Type II error

A false positive (type I error) — when you reject a true null hypothesis.

A false negative (type II error) — when you accept a false null hypothesis.

The probability of committing Type I error (False positive) is equal to the significance level or size of
critical region α.

α= P [rejecting H0 when H0 is true]

The probability of committing a Type II error (false negative) is denoted by β; the quantity 1 - β is called the 'power of the test'.

β = P [not rejecting H0 when H1 is true]

Example:

A person is arrested on the charge of burglary. A jury has to decide whether the person is guilty or not guilty.

H0: Person is innocent.

H1: Person is guilty.


Type I error will be if the Jury convicts the person [rejects H0] although the person was innocent [H0
is true].

Type II error will be the case when Jury released the person [Do not reject H0] although the person is
guilty [H1 is true].

Level of Significance (α):

It is the probability of a Type I error, and it is also the size of the critical region.

Generally, strong control on α is desired, and in tests it is prefixed at very low levels such as 0.05 (5%) or 0.01 (1%).

If H0 is not rejected at a significance level of 5%, we cannot conclude that the null hypothesis is true; we can only say that the sample data are consistent with it at the 5% level.

Steps involved in Hypothesis testing:


1) Set up the null hypothesis and the alternate hypothesis.
2) Decide a level of significance, i.e. alpha = 5% or 1%.
3) Choose the type of test you want to perform as per the sample data (z-test, t-test, chi-squared etc.). (We will study all the tests in the next section.)
4) Calculate the test statistic (z-score, t-score etc.) using the respective formula of the chosen test.
5) Obtain the critical value in the sampling distribution to construct the rejection region of size alpha, using the z-table, t-table, chi-squared table etc.
6) Compare the test statistic with the critical value and locate the position of the calculated test statistic, i.e. whether it is in the rejection region or the non-rejection region.
7) I) If the test statistic lies in the rejection region, we reject the null hypothesis, i.e. the sample data provide sufficient evidence against the null hypothesis and there is a significant difference between the hypothesized value and the observed value of the parameter.
II) If the test statistic lies in the non-rejection region, we do not reject the null hypothesis, i.e. the sample data do not provide sufficient evidence against the null hypothesis and the difference between the hypothesized and observed values of the parameter is due to sampling fluctuation.

p-value

Let’s suppose we are conducting a hypothesis test at a significance level of 1%.

Where H0: mean <= X (we are just assuming a one-tailed test scenario).

We obtain our critical value (based on the type of test we are using) and find that our test statistic is greater than the critical value. So we must reject the null hypothesis here, since it lies in the rejection region. Now if the null hypothesis is rejected at 1%, then it will certainly be rejected at higher significance levels, say 5% or 10%.

What if we take a significance level lower than 1%: would we still have to reject our hypothesis? There may be significance levels at which we would no longer reject, and this is where the p-value comes into play.

The p-value is the smallest level of significance at which the null hypothesis can be rejected (we reject when p < alpha).

That is why most tests nowadays report a p-value: it is preferred since it gives more information than a comparison with the critical value alone.

For a right-tailed test:

p-value = P [Test statistic >= observed value of the test statistic]

For a left-tailed test:

p-value = P [Test statistic <= observed value of the test statistic]

For a two-tailed test:

p-value = 2 * P [Test statistic >= |observed value of the test statistic|]

Decision making with the p-value

The p-value is compared with the significance level (alpha) to make a decision on the null hypothesis.

If p-value is greater than alpha, we do not reject the null hypothesis.

If p-value is smaller than alpha, we reject the null hypothesis.
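
As a quick illustration, here is a minimal Python sketch (scipy is assumed; the observed statistic z_obs = 1.85 is an illustrative value, not a figure from these notes) showing how the three formulas above translate into code:

```python
from scipy import stats

z_obs = 1.85   # assumed observed value of the test statistic
alpha = 0.05   # chosen significance level

# Right-tailed: P[Z >= z_obs]
p_right = stats.norm.sf(z_obs)

# Left-tailed: P[Z <= z_obs]
p_left = stats.norm.cdf(z_obs)

# Two-tailed: 2 * P[Z >= |z_obs|]
p_two = 2 * stats.norm.sf(abs(z_obs))

for name, p in [("right", p_right), ("left", p_left), ("two", p_two)]:
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(f"{name}-tailed p-value = {p:.4f} -> {decision}")
```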

Confidence Intervals
A confidence interval, in statistics, refers to the probability that a population parameter will fall
between two set values. Confidence intervals measure the degree of uncertainty or certainty in a
sampling method. A confidence interval can take any number of probabilities, with the most
common being a 95% or 99% confidence level.

Calculating a Confidence Interval (Theory)

Suppose a group of researchers is studying the heights of high school basketball players. The
researchers take a random sample from the population and establish a mean height of 74 inches.
The mean of 74 inches is a point estimate of the population mean. A point estimate by itself is of
limited usefulness because it does not reveal the uncertainty associated with the estimate; you do
not have a good sense of how far away this 74-inch sample mean might be from the population
mean. What's missing is the degree of uncertainty in this single sample.

Confidence intervals provide more information than point estimates. By establishing a 95%
confidence interval using the sample's mean and standard deviation, and assuming a normal
distribution as represented by the bell curve, the researchers arrive at an upper and lower bound
that contains the true mean 95% of the time. Assume the interval is between 72 inches and 76
inches. If the researchers take 100 random samples from the population of high school basketball
players, the mean should fall between 72 and 76 inches in 95 of those samples.
If the researchers want even greater confidence, they can expand the interval to 99% confidence. Doing so invariably creates a broader range, as it makes room for a greater number of sample means. If they establish the 99% confidence interval as being between 70 inches and 78 inches, they can expect 99 of 100 samples evaluated to contain a mean value between these numbers. A 90% confidence level means that we would expect 90% of the interval estimates to include the population parameter; likewise, a 99% confidence level means that 99% of the intervals would include the parameter.

The Confidence Interval is based on Mean and Standard Deviation and is given as:

For n>30

Confidence interval = X ± (z * s/√n)

where z critical value is derived from the z score table based on the confidence level.

X is the sample mean

s is sample standard deviation.

n is the sample size

We obtain these z values from the z-score table; since confidence levels are usually fixed at the standard values, the commonly used critical values are z = 1.645 (90%), z = 1.96 (95%) and z = 2.576 (99%).

For n<30

Confidence interval = X ± (t * s/√n)

where t critical value is derived from the t score table based on the confidence level.

X is the sample mean

s is sample standard deviation.

n is the sample size.

We will see how to create confidence intervals in the examples to follow.
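
As a minimal sketch of the two formulas above (the values x_bar = 74, s = 5 and n = 50 are illustrative assumptions loosely based on the basketball example, not figures from the text):

```python
import math
from scipy import stats

x_bar = 74   # sample mean (assumed)
s = 5        # sample standard deviation (assumed)
n = 50       # sample size (assumed)
conf = 0.95  # confidence level

se = s / math.sqrt(n)  # standard error of the mean

if n > 30:
    crit = stats.norm.ppf(1 - (1 - conf) / 2)      # z critical value
else:
    crit = stats.t.ppf(1 - (1 - conf) / 2, n - 1)  # t critical value

lower, upper = x_bar - crit * se, x_bar + crit * se
print(f"{conf:.0%} CI: ({lower:.2f}, {upper:.2f})")  # roughly (72.61, 75.39)
```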

Now that we have all the theory behind hypothesis testing, let's see the different types of tests that are used. We have already seen examples of finding the z-score and t-score; now we will see how they are used in the testing scenario.
General points for selecting the type of test:

Sample size   Population variance   Normality of sample   Sample variance                  Type of test
Large (>30)   Known                 Normal/Non-normal     -                                Z-test
Large (>30)   Unknown               Normal                Use it to calculate the t-score  t-test
Large (>30)   Unknown               Unknown               Use it to calculate the z-score  Z-test
Small (<30)   Known                 Normal                -                                Z-test
Small (<30)   Unknown               Normal                Use it to calculate the t-score  t-test

Note: We will learn about other non-parametric tests and their use cases later.

Hypothesis Testing for Large Size Samples

Thumb rule: a sample of size greater than 30 is considered a large sample, and as per the central limit theorem we assume that the sampling distribution of the mean is approximately normal.

We are familiar with the steps of hypothesis testing as shown earlier. We also know, from the above
table, when to use which type of test.

Let's start with a few practical examples to strengthen our understanding.

Note: We learned in the previous section how to use the z-score table to calculate probabilities. In this section we have some standard significance levels for which we need to find the critical value (z-score), so instead of going through the whole table each time, we can use the standardized critical values below (reconstructed from the standard z-table):

Significance level (α)   One-tailed critical z   Two-tailed critical z
0.10                     1.282                   1.645
0.05                     1.645                   1.960
0.01                     2.326                   2.576
Q) A manufacturer of printer cartridges claims that a certain cartridge manufactured by him has a mean printing capacity of at least 500 pages. A wholesale purchaser selects a sample of 100 cartridges and tests them. The mean printing capacity of the sample came out to be 490 pages with a standard deviation of 30 pages.

Should the purchaser reject the claim of the manufacturer at a significance level of 5%?

Ans. population mean = 500


Sample mean = 490
Sample standard deviation = 30
Significance level(alpha) = 5% = 0.05
Sample size = 100
H0: Mean printing capacity >=500
H1: Mean printing capacity < 500
We can clearly see it is a one-tailed test (left tail).
Here, the sample is large with unknown population variance. Since we don't know about the normality of the data, we will use the Z-test (from the table above).

We will use the sample standard deviation to calculate the test statistic.

Standard error (SE) = sample standard deviation / (sample size)^0.5

= 30 / (100)^0.5 = 3

Z(test) = (sample mean - population mean) / SE

= (490 - 500)/3 = -3.33

Let's find the critical value at the 5% significance level using the critical value table above.

Z(0.05) = -1.645 (since it is a left-tailed test).

We can clearly see that Z(test) < Z(0.05), which means our test value lies in the rejection region.

Thus, we can reject the null hypothesis, i.e. the manufacturer's claim, at the 5% significance level.

Using the p-value to test the above hypothesis:

p-value = P[Z <= -3.33]

(Recall that Φ(-x) = 1 - Φ(x), where Φ(x) denotes the cumulative probability up to x.)

Let's use the z-table to find the p-value:

p-value = 1 - 0.9996 = 0.0004

Here, the p-value is less than the significance level of 5%. So we are right to reject the null hypothesis.
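
The whole calculation can be reproduced in Python; a minimal sketch using scipy for the normal distribution:

```python
import math
from scipy import stats

pop_mean, x_bar = 500, 490   # claimed mean and sample mean
s, n = 30, 100               # sample std deviation and sample size
alpha = 0.05

se = s / math.sqrt(n)                  # 3.0
z_test = (x_bar - pop_mean) / se       # -3.33
z_crit = stats.norm.ppf(alpha)         # -1.645 (left-tailed)
p_value = stats.norm.cdf(z_test)       # ~0.0004

print(f"z = {z_test:.2f}, critical = {z_crit:.3f}, p = {p_value:.4f}")
if z_test < z_crit:                    # equivalently: p_value < alpha
    print("Reject H0: the claim is not supported")
else:
    print("Do not reject H0")
```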

Q) A company used a specific brand of tube lights in the past which has an average life of 1000 hours. A new brand has approached the company with new tube lights of the same power at a lower price. A sample of 120 tube lights was tested, which yielded an average of 1010 hours with a standard deviation of 90 hours. Should the company give the contract to this new company at a 1% significance level?

Also, find the confidence interval.

Ans. Population mean = 1000

Sample mean = 1010

Significance level = 1% = 0.01

Sample size = 120

Sample standard deviation = 90

H0: average life of tube lights >= 1000

H1: average life of tube lights < 1000

Here, the sample is large with an unknown population variance. Since we don't know about the normality of the data, we will use the Z-test (from the table above).

Standard error (SE) = sample standard deviation / (sample size)^0.5

= 90 / (120)^0.5 = 8.22

Z(test) = (Sample mean - population mean)/ (SE)

= (1010-1000)/8.22 = 1.22

Let's find the critical value at the 1% significance level using the critical value table above.

Z(0.01) = -2.33 (since it is a left-tailed test).

We can clearly see that Z(test) > Z(0.01), which means our test value doesn't lie in the rejection region.

Thus, we cannot reject the null hypothesis, i.e. the company can give the contract at the 1% significance level.

Using the p-value to test the above hypothesis:

p-value = P[Z < 1.22] = 0.89

Here, the p-value is greater than the significance level of 1%. So we do not reject the null hypothesis.

For the confidence interval at the 99% level (z = 2.576):

Confidence interval = 1010 ± (2.576 * 8.22) = [988.8, 1031.2]

Comparing Two Population Means Using the Z-test

The comparison of two population means is very common. The difference between the two samples
depends on both the means and the standard deviations. Very different means can occur by chance
if there is great variation among the individual samples. In order to account for the variation, we
take the difference of the sample means, X1(mean) - X2(mean), and divide by the standard error
(shown below) in order to standardize the difference.

Because we do not know the population standard deviations, we estimate them using the two
sample standard deviations from our independent samples. For the hypothesis test, we calculate the
estimated standard deviation i.e., standard error.

The standard error (SE) is:

SE = [ (S1^2 / n1) + (S2^2 / n2) ]^0.5

where S1 and S2 are the sample standard deviations. Z is given as:

Z = [ (X1(mean) - X2(mean)) - (µ(1) - µ(2)) ] / SE

In this comparison case, our null assumption is that µ(1) = µ(2).

So Z becomes: Z = (X1(mean) - X2(mean)) / SE

Q) In two samples of men from two different states A and B, the mean heights of 1000 men and 2000 men respectively are 76.5 and 77 inches. If the population standard deviation for both states is the same and equal to 7 inches, can we regard the mean heights of both states as the same at a 5% level of significance?

Ans. n1 = 1000

n2 = 2000

X1(mean) = 76.5

X2(mean) = 77

S1=S2= 7

Let µ(1) and µ(2) be the mean heights of men from states A and B.

H0: µ(1) = µ(2)

H1: µ(1) is not equal to µ(2)

Standard error(SE) = [((S1)^2/n1 )+((S2)^2/n2)]^0.5 = 0.27

Z(test) = (X1(mean) - X2(mean)) / SE = (76.5 - 77)/0.27 = -1.85

Since, it is a two tailed test, we need to find critical value for 2.5% on each tail.

Z(2.5%) = 1.96 and Z(-2.5%) = -1.96

We can clearly see, Z(-2.5%) < Z(test) <Z(2.5%)

Thus, we cannot reject the null hypothesis.


Using the p-value:

p-value = 2 * P[Z >= |-1.85|] = 2 * P[Z >= 1.85]

p-value = 2 * (1 - 0.9678) = 0.0644

We can clearly see that the p-value is greater than 0.05, thus we cannot reject the null hypothesis.
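
A minimal Python sketch of this two-sample z comparison, using the numbers from the example:

```python
import math
from scipy import stats

x1_bar, x2_bar = 76.5, 77.0   # sample means for states A and B
n1, n2 = 1000, 2000           # sample sizes
s1 = s2 = 7.0                 # common standard deviation

se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # ~0.271
z_test = (x1_bar - x2_bar) / se           # ~-1.85
p_value = 2 * stats.norm.sf(abs(z_test))  # two-tailed, ~0.065

z_crit = stats.norm.ppf(0.975)            # 1.96 for alpha = 0.05
print(f"z = {z_test:.2f}, p = {p_value:.4f}")
if abs(z_test) > z_crit:
    print("Reject H0: the mean heights differ")
else:
    print("Do not reject H0: means can be regarded as equal")
```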

Hypothesis Testing for Small Size Samples

In real-world scenarios, large sample sizes are often not possible because of limited resources such as money. We therefore frequently do hypothesis testing based on small samples, the main assumption being the normality of the sample data.

We will see how to use t-tests in this section and how to use the t-score table (continued from the topic of Student's t distribution).

All the steps involved are similar to the z-test, only we will calculate t-score instead of z-score.

Let’s start with an example:

Q) A tyre manufacturer claims that the average life of a particular category of its tyres is 18000 km when used under normal driving conditions. A random sample of 16 tyres was tested. The mean and SD of the lives of the tyres in the sample were 20000 km and 6000 km respectively. Assuming that the life of the tyres is normally distributed, test the claim of the manufacturer at a 1% level of significance. Construct the confidence interval also.
Ans: population mean = 18000 km

Sample mean = 20000 km

Standard deviation = 6000 km

Sample size = 16

H0: population mean = 18000km

H1: population mean is not equal to 18000km (It will be a two tailed test.)

Since the sample size is small, the population variance is unknown and the sample is normally distributed, we will use the t-test for this.

Standard error = [6000/(16)^0.5] = 1500

t-score(test) = (20000 - 18000)/1500 = 1.33

Let's find the critical t-value for significance level 1% (two-tailed) and degrees of freedom = 16 - 1 = 15:

t(0.005) = 2.947 and t(-0.005) = -2.947

We can see that, t (- 0.005) < t-score(test) = 1.33 < t (0.005)

So, the value lies in non-rejection region and we cannot reject our null hypothesis.

Using the p-value:

p-value = 2 * P[t > 1.33] (two-tailed)

degrees of freedom = 15

Let's read the p-value from the t-table for the above values:

From the table we can see: 0.20 < p < 0.30

Here p > significance level (1%), thus we cannot reject the null hypothesis.

Confidence interval = [20000 - 2.947*1500, 20000 + 2.947*1500]

= [15580, 24420] (approximately)
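
A minimal Python sketch of this one-sample t-test from the summary statistics, using scipy's t distribution for the critical value, p-value and confidence interval:

```python
import math
from scipy import stats

pop_mean, x_bar = 18000, 20000   # claimed mean and sample mean (km)
s, n = 6000, 16                  # sample std deviation and sample size
alpha, df = 0.01, n - 1

se = s / math.sqrt(n)                       # 1500.0
t_test = (x_bar - pop_mean) / se            # ~1.33
t_crit = stats.t.ppf(1 - alpha / 2, df)     # 2.947 (two-tailed)
p_value = 2 * stats.t.sf(abs(t_test), df)   # ~0.20

print(f"t = {t_test:.2f}, critical = ±{t_crit:.3f}, p = {p_value:.3f}")

# 99% confidence interval for the mean
lo, hi = x_bar - t_crit * se, x_bar + t_crit * se
print(f"99% CI: ({lo:.0f}, {hi:.0f})")      # roughly (15580, 24420)
```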

Comparing Two Population Means Using the t-test

Just like the case we saw with the z-test, the t-test is actually more suitable for comparing two population samples, because in practice the population standard deviations of both populations are rarely known.

We assume a normal distribution of samples and though the population standard deviations are
unknown, we assume them to be equal.

Also, samples are independent to each other.

Let's assume two independent samples with sizes n1 and n2:

Degrees of freedom = n1 + n2 - 2

Pooled sample variance: S^2 = ( ∑[X - X(mean)]^2 + ∑[Y - Y(mean)]^2 ) / (n1 + n2 - 2)

Standard error (SE) = S * (1/n1 + 1/n2)^0.5

The test statistic t in this case is given as:

t = (X(mean) - Y(mean)) / SE


Q) The means of two random samples of sizes 10 and 8 from two normal populations are 210.40 and 208.92. The sums of squares of deviations from their means are 26.94 and 24.50 respectively. Assuming populations with equal variances, can we consider that the normal populations have equal means? (Significance level = 5%)

Ans.

n1 =10 , n2= 8 , X(mean) = 210.40 , Y(mean) = 208.92

std. deviation (pooled sample) = [(26.94 + 24.50)/(10 + 8 - 2)]^0.5 = 1.79

H0: Population means are equal

H1: Population means are not equal (two tailed test)

Standard error = 1.79 * (1/10 + 1/8)^0.5 = 0.84

t(test) = (X(mean) - Y(mean))/0.84 = 1.48/0.84 = 1.76

Degree of freedom = 10 +8 -2 = 16

Let's look up the critical value in the t-table for 5% significance (two-tailed) and d.o.f. 16:

t(0.025) = 2.120 and t(-0.025) = -2.120

We can see that t(-0.025) < t(test) = 1.76 < t(0.025).

So the value lies in the non-rejection region and we cannot reject our null hypothesis.
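
scipy can run this pooled two-sample t-test directly from summary statistics via ttest_ind_from_stats; note that it works with unrounded intermediate values and reports t ≈ 1.74 where the hand calculation's rounding gives 1.76:

```python
import math
from scipy import stats

n1, n2 = 10, 8
x_mean, y_mean = 210.40, 208.92
ss_x, ss_y = 26.94, 24.50            # sums of squared deviations

# Per-sample standard deviations, since ss / (n - 1) is the sample variance
s_x = math.sqrt(ss_x / (n1 - 1))
s_y = math.sqrt(ss_y / (n2 - 1))

# Pooled (equal-variance) two-sample t-test from summary stats
t_stat, p_value = stats.ttest_ind_from_stats(
    x_mean, s_x, n1, y_mean, s_y, n2, equal_var=True
)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # t ~ 1.74, p ~ 0.10 -> do not reject H0
```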
Paired Sample t-Tests
A paired t-test is used to compare two population means when the two samples are not independent, e.g. observations recorded on a patient before and after taking a medicine, or the weight of a person before and after they started working out.

Now, instead of two separate populations, we create a new column with the differences of the paired observations, and instead of testing equality of two population means we test the hypothesis that the mean of the population of differences is zero. Also, we assume the samples are of the same size. The population variances are not known and not necessarily equal.

Standard error = standard deviation of the differences / (n^0.5)

t = D(mean) / standard error, where D(mean) is the mean of the differences.

Q) A group of 20 students were tested to see whether their marks improved after a special lecture on the subject.

Marks before   Marks after   Difference (D)   (D - D(mean))^2
18             22            4                3.24
21             25            4                3.24
16             17            1                1.44
22             24            2                0.04
19             15            -4               38.44
24             26            2                0.04
17             20            3                0.64
21             23            2                0.04
13             18            5                7.84
18             20            2                0.04
15             15            0                4.84
16             15            -1               10.24
18             21            3                0.64
14             16            2                0.04
19             22            3                0.64
20             24            4                3.24
12             18            6                14.44
22             25            3                0.64
14             18            4                3.24
19             18            -1               10.24
Total                        44               103.2

Mean of differences D(mean) = 44/20 = 2.2
Variance of differences = 103.2/19 = 5.43
Standard deviation of differences = 2.33
H0: Difference mean >= 0

H1: Difference mean < 0

Standard error = 2.33 / (20)^0.5 = 0.52

t = 2.2 / 0.52 = 4.23

df (degree of freedom) = 19

At the 5% significance level, with 19 df and a one-tailed (left-tailed) test, the critical value is:

t(5%) = -1.729

Since t = 4.23 is greater than the critical t, it lies in the non-rejection region, and hence we cannot reject the null hypothesis (the data are consistent with the marks having improved).
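
With the raw before/after columns available, scipy's paired t-test reproduces these numbers; a sketch matching the hypotheses above (the alternative argument needs a reasonably recent scipy):

```python
from scipy import stats

before = [18, 21, 16, 22, 19, 24, 17, 21, 13, 18,
          15, 16, 18, 14, 19, 20, 12, 22, 14, 19]
after  = [22, 25, 17, 24, 15, 26, 20, 23, 18, 20,
          15, 15, 21, 16, 22, 24, 18, 25, 18, 18]

# Paired t-test on D = after - before.
# H0: mean(D) >= 0, H1: mean(D) < 0, so the alternative is 'less'.
t_stat, p_value = stats.ttest_rel(after, before, alternative='less')
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# t ~ 4.22, p ~ 1.0 -> do not reject H0 (consistent with improved marks)
```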

Testing of Hypothesis for Population Variance Using the Chi-Squared Test

Till now we were dealing with hypothesis testing for the means of various samples, but sometimes it is also necessary to test the variance of the population under study. Suppose we obtain a certain variance for a sample which is different from the population variance; we then need to find out whether the variance is within an acceptable limit or whether it differs from the desired population variance by more than chance.

The chi-square test for variance is a statistical procedure with a chi-square-distributed test statistic that is used to determine whether the variance of a variable obtained from a particular sample has the same size as the known population variance of the same variable. The test statistic of the chi-square test for variance is calculated as follows:

χ^2 = (n - 1) * s^2 / σ^2

where n is the sample size, s is the sample standard deviation and σ is the population standard deviation.

As with other tests, the critical value is obtained from a chi-squared table on the basis of the degrees of freedom and the significance level.

We will see about it with an example:

Q) The variance of a certain size of towel produced by a machine is 7.2 over a long period of time. A random sample of 20 towels gave a variance of 8. You need to check whether the variability of the towels has increased, at a 5% level of significance, assuming a normally distributed sample.

Ans.

n = 20

sample variance = 8

population variance = 7.2

H0: variance <= 7.2


H1: variance > 7.2 (Right tailed test)

Using the chi-squared test,

χ^2 = (20-1) * 8/7.2 = 21.11

Critical value for D.o.f = 19 and 5% significance level,

Critical value = 30.14

Here, the chi value is less than the critical value, thus we do not reject the null hypothesis.
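
A short Python sketch of this variance test using scipy's chi-square distribution:

```python
from scipy import stats

n = 20
sample_var, pop_var = 8.0, 7.2
alpha, df = 0.05, n - 1

chi2_stat = (n - 1) * sample_var / pop_var        # 21.11
chi2_crit = stats.chi2.ppf(1 - alpha, df)         # 30.14 (right-tailed)
p_value = stats.chi2.sf(chi2_stat, df)            # ~0.33

print(f"chi2 = {chi2_stat:.2f}, critical = {chi2_crit:.2f}, p = {p_value:.3f}")
if chi2_stat > chi2_crit:
    print("Reject H0: variability has increased")
else:
    print("Do not reject H0")
```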

Chi-Squared Test for Categorical Variables

The chi-square test is widely used to estimate how closely the distribution of a categorical variable
matches an expected distribution (the goodness-of-fit test), or to estimate whether two categorical
variables are independent of one another (the test of independence).

In mathematical terms, the χ2 variable is the sum of the squares of a set of normally distributed
variables.

Suppose that a particular value Z1 is randomly selected from a standardized normal distribution.
Then suppose another value Z2 is selected from the same standardized normal distribution. If there
are d degrees of freedom, then let this process continue until d different Z values are selected from
this distribution. The χ2 variable is defined as the sum of the squares of these Z values.
This sum of squares of d normally distributed variables has a distribution which is called the χ2 distribution with d degrees of freedom.

Chi-Squared Test for Goodness of Fit

Chi Square test for testing goodness of fit is used to decide whether there is any difference between
the observed (experimental) value and the expected (theoretical) value.

A goodness of fit test is a test that is concerned with the distribution of one categorical variable.

The null and alternative hypotheses reflect this focus:

H0: The population distribution of the variable is the same as the proposed distribution.

HA: The distributions are different

The chi-square statistic is calculated as:

χ^2 = ∑ (Observed - Expected)^2 / Expected

where Observed = the actual count in each category, and

Expected = the predicted (expected) count in each category if the null hypothesis were true.

Let’s see an example for better understanding:

Q) A survey conducted by a Pet Food Company determined that 60% of dog owners have only one
dog, 28% have two dogs, and 12% have three or more. You were not convinced by the survey and
decided to conduct your own survey and have collected the data below,

Data: Out of 129 dog owners, 73 had one dog, 38 had two dogs, and 18 had three or more.

Determine whether your data support the results of the survey by the pet food company.

Use a significance level of 0.05

Ans: E(1 dog) = 0.60

E(2 dogs) = 0.28

E(3+ dogs) = 0.12

H0: the proportions of dog owners are equal to the survey proportions.

H1: the proportions of dog owners are not equal to the survey proportions.


                          1 Dog            2 Dogs             3+ Dogs            Total
Observed                  73               38                 18                 129
Expected                  0.60*129 = 77.4  0.28*129 = 36.12   0.12*129 = 15.48   129
Observed - Expected       -4.4             1.88               2.52
(Observed - Expected)^2   19.36            3.53               6.35

Chi statistic = 19.36/77.4 + 3.53/36.12 + 6.35/15.48 = 0.758

Let’s see the critical value using d.o.f 2 and significance 5%:

Critical chi = 5.99

Here, our chi statistic is less than the critical chi. Thus, we will not reject the null hypothesis.
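
scipy's chisquare function performs this goodness-of-fit test directly; a sketch using the counts above:

```python
from scipy import stats

observed = [73, 38, 18]
total = sum(observed)                          # 129
expected = [0.60 * total, 0.28 * total, 0.12 * total]

chi2_stat, p_value = stats.chisquare(observed, f_exp=expected)
chi2_crit = stats.chi2.ppf(0.95, df=len(observed) - 1)   # 5.99

print(f"chi2 = {chi2_stat:.3f}, critical = {chi2_crit:.2f}, p = {p_value:.3f}")
# chi2 ~ 0.758 < 5.99 -> do not reject H0: the data agree with the survey
```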

Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a statistical technique that is used to check if the means of two or
more groups are significantly different from each other by analyzing comparisons of variance
estimates. ANOVA checks the impact of one or more factors by comparing the means of different
samples.

When we have only two samples, t-test and ANOVA give the same results. However, using a t-test
would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests
for comparing more than two samples, it will have a compounded effect on the type 1 error.

Assumptions in ANOVA

1) Assumption of Randomness: The samples should be selected in a random way such that
there is no dependence among the samples.
2) The experimental errors of the data are normally distributed.
3) Assumption of equality of variance (homoscedasticity) and zero correlation: the variance should be constant across all the groups and the covariances among them should be zero, although the means may vary from group to group.
One Way ANOVA

When we are comparing groups based on only one factor variable, it is said to be a one-way analysis of variance (ANOVA).

For example, we may want to test whether the mean output of three workers is the same, based on the working hours of the three workers.

The ANOVA model:

Mathematically, ANOVA can be written as:

xij = μi + εij

where x are the individual data points (i and j denote the group and the individual observation), ε is
the unexplained variation and the parameters of the model (μ) are the population means of each
group. Thus, each data point (xij) is its group mean plus error.

Let’s understand the working procedure of One-way Anova with an example:

Sample (k)   Obs 1   Obs 2   Obs 3   Mean
1            x11     x12     x13     Xm1
2            x21     x22     x23     Xm2
3            x31     x32     x33     Xm3
4            x41     x42     x43     Xm4

Suppose we are given the above data set: we have 4 samples, each with 3 observations of a variable x, and each sample has its respective mean as shown in the last column.

Grand Mean

Mean is a simple or arithmetic average of a range of values. There are two kinds of means that we
use in ANOVA calculations, which are separate sample means and the grand mean.

The grand mean (Xgm) is the mean of all observations combined, irrespective of the sample; when all samples are of equal size it also equals the mean of the sample means.

Xgm = (Xm1 + Xm2 + Xm3 + ... + Xmk)/k, where k is the number of (equally sized) samples.

For our dataset, k = 4:

Xgm = (Xm1 + Xm2 + Xm3 + Xm4)/4


Between Group Variability (SST)

It refers to variations between the distributions of individual groups (or levels) as the values within
each group are different.

Each sample is considered and the difference between its mean and the grand mean is used to measure the variability. If the distributions overlap or are close, the grand mean will be similar to the individual means, whereas if the distributions are far apart, the differences between the sample means and the grand mean will be large.

Let’s calculate Sum of Squares for between group variability:

SSbetween = n1*(Xm1 - Xgm)^2 + n2*(Xm2 - Xgm)^2 + n3*(Xm3 - Xgm)^2 + ... + nk*(Xmk - Xgm)^2

where, n1, n2,....,nk are the number of observations in each sample

Degrees of freedom for between-group variability = number of samples - 1 = k - 1

MeanSSbetween = SSbetween/(k-1)

In our dataset example we have k = 4 and nk = 3, so for our dataset:

SSbetween = 3*(Xm1 - Xgm)^2 + 3*(Xm2 - Xgm)^2 + 3*(Xm3 - Xgm)^2 + 3*(Xm4 - Xgm)^2

MeanSSbetween (MSST) = SSbetween/(4-1) = SSbetween/3


Within Group Variability (SSE)

It refers to variations caused by differences within individual groups (or levels) as not all the values
within each group are the same. Each sample is looked at on its own and variability between the
individual points in the sample is calculated. In other words, no interactions between samples are
considered.

We can measure Within-group variability by looking at how much each value in each sample differs
from its respective sample mean. So, first, we’ll take the squared deviation of each value from its
respective sample mean and add them up. This is the sum of squares for within-group variability.

Degrees of freedom for within-group variability = N - k

where N is the total number of observations and k is the number of samples.

In our dataset example we have k = 4 and N = 12, so for our dataset:

SSwithin = (X11 - Xm1)^2 + (X12 - Xm1)^2 + (X13 - Xm1)^2 +
(X21 - Xm2)^2 + (X22 - Xm2)^2 + (X23 - Xm2)^2 +
(X31 - Xm3)^2 + (X32 - Xm3)^2 + (X33 - Xm3)^2 +
(X41 - Xm4)^2 + (X42 - Xm4)^2 + (X43 - Xm4)^2

Degrees of freedom = N - k = 12 - 4 = 8

MeanSSwithin (MSSE) = SSwithin/8

Total Sum of Squares (TSS)

TSS = SSbetween + SSwithin = SST + SSE


Hypothesis In ANOVA

The Null hypothesis in ANOVA is valid when all the sample means are equal, or they don’t have any
significant difference. Thus, they can be considered as a part of a larger set of the population. On the
other hand, the alternate hypothesis is valid when at least one of the sample means is different from
the rest of the sample means. In mathematical form, they can be represented as:

H0: µ1 = µ2 = ... = µk

H1: µl ≠ µm for at least one pair (l, m)

where µl and µm belong to any two sample means out of all the samples considered for the test. In other words, the null hypothesis states that all the sample means are equal, or the factor did not have any significant effect on the results, whereas the alternate hypothesis states that at least one of the sample means is different from another.

To test the null hypothesis, the test statistic used is the F-statistic.

F-Statistic

The statistic which measures whether the means of different samples are significantly different is called the F-ratio. The lower the F-ratio, the more similar the sample means; in that case, we cannot reject the null hypothesis.

F = MeanSSbetween / MeanSSwithin

F = MSST / MSSE, with k-1 and N-k degrees of freedom.

The above formula is intuitive. The numerator term in the F-statistic calculation is the between-group variability. As we read earlier, as between-group variability increases, the sample means grow further apart from each other; in other words, the samples are more likely to come from totally different populations.

This F-statistic calculated here is compared with the F-critical value for making a conclusion.

F-critical is calculated using the F-table, degree of freedoms and Significance level.

If the observed value of F is greater than the F-critical value then we reject the null hypothesis.

Let’s see an example on One-way ANOVA analysis:


Q) A survey was conducted to test the knowledge of mathematics among 4 different schools in a city. The sample data collected for the marks of students out of 10 are below:

School Marks
School 1 8 6 7 5 9
School 2 6 4 6 5 6 7
School 3 6 5 5 6 7 8 5
School 4 5 6 6 7 6 7

Ans:

H0: All the schools have equal means.

H1: Difference in means of schools is significant.

k=4

N = 24

School 1 (S1)   School 2 (S2)   School 3 (S3)   School 4 (S4)   (S1-mean)^2   (S2-mean)^2   (S3-mean)^2   (S4-mean)^2
8               6               6               5               1             0.111         0             1.361
6               4               5               6               1             2.778         1             0.028
7               6               5               6               0             0.111         1             0.028
5               5               6               7               4             0.444         0             0.694
9               6               7               6               4             0.111         1             0.028
                7               8               7                             1.778         4             0.694
                                5                                                           1
Total           35              34              42              37            10            5.333         8             2.833
Mean            7               5.667           6               6.167

Grand mean (mean of the sample means) = 6.208

SSbetween = 5 * (7-6.21) ^2 + 6 * (5.7 – 6.21)^2 + 7 * (6-6.21)^2 + 6 * (6.17 – 6.21)^2

= 4.99

MSST = 4.99 / (4-1) = 1.66

SSwithin = 10 + 5.33 + 8 + 2.83 = 26.16

MSSE = 26.16/(N-k) = 26.16/20 = 1.308

F-statistics = MSST/MSSE = 1.66 / 1.308 = 1.27


Critical F-value

At 5% significance and degree of freedom (3, 20):

F-critical = 3.098

Clearly, our F-statistic is less than F-critical, so we cannot reject our null hypothesis.
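
scipy's f_oneway runs this one-way ANOVA from the raw marks. Note that it uses the overall mean of all 24 observations as the grand mean (the appropriate choice when sample sizes differ), so it reports F ≈ 1.32 rather than the hand computation's 1.27; the conclusion is the same:

```python
from scipy import stats

school1 = [8, 6, 7, 5, 9]
school2 = [6, 4, 6, 5, 6, 7]
school3 = [6, 5, 5, 6, 7, 8, 5]
school4 = [5, 6, 6, 7, 6, 7]

f_stat, p_value = stats.f_oneway(school1, school2, school3, school4)
f_crit = stats.f.ppf(0.95, dfn=3, dfd=20)   # ~3.10 at 5% significance

print(f"F = {f_stat:.2f}, critical = {f_crit:.2f}, p = {p_value:.3f}")
# F ~ 1.32 < 3.10 -> do not reject H0: the school means are equal
```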

Two Way ANOVA

Two-way ANOVA allows us to compare population means when the populations are classified according to two independent factors.

Example: We might like to look at SAT scores of students who are male or female (first factor) and
either have or have not had a preparatory course (second factor).

The Two-way ANOVA model:

Mathematically, ANOVA can be written as:

xij = μij + εij

where x are the individual data points (i and j denote the group and the individual observation), ε is
the unexplained variation and the parameters of the model (μ) are the population means of each
group. Thus, each data point (xij) is its group mean plus error.

Just like in the one-way model, we calculate the between-group sums of squares; in this case there will be two SSTs, one for each factor, together with the sum of squares of errors (within).

We then calculate an F-statistic for each factor's MSST and compare each with the F-critical value to find the effect of each factor on the response.
Example:

Below is the data of yield of crops based on temperature and salinity. Calculate the ANOVA for the
table.

Temperature (F)   Salinity 700   Salinity 1400   Salinity 2100   Total   Mean (temp)
60                3              5               4               12      4
70                11             10              12              33      11
80                16             21              17              54      18
Total             30             36              33              99
Mean (salinity)   10             12              11                      Grand mean = 11

Ans:

Hypothesis for Temperature:

H0: Yield is the same for all temperatures.

H1: Yield varies with temperature with a significant difference.

Hypothesis for Salinity:

H0: Yield is the same for all salinities.

H1: Yield varies with salinity with a significant difference.

Grand mean = 11

N = 9, K =3, nt= 3, ns = 3

SSbetween_temp = 3 *(4-11)^2 + 3*(11-11)^2 + 3*(18-11)^2 = 294

MSSTtemp = 294 / 2 = 147

SSbetween_salinity = 3*(10-11)^2 + 3*(12-11)^2 + 3*(11-11)^2 = 6

MSSTsalinity = 6/2 = 3

In such questions calculating SSE directly can be tricky, so instead of calculating SSE let's calculate TSS; we can then subtract the SST values from it to get SSE.

To calculate the total sum of squares, we find the sum of the squared differences of each value from the grand mean.

TSS = (3-11)^2 + (5-11)^2 + (4-11)^2 + (11-11)^2 + (10-11)^2 + (12-11)^2 + (16-11)^2 + (21-11)^2 + (17-11)^2

TSS = 312
SSE = TSS - SSbetween_temp - SSbetween_salanity = 312 – 294-6 = 12

Degree of freedom for SSE = (nt-1)( ns-1) =(3-1)(3-1) = 4

MSSE = SSE/4 = 3

F-Test For temperature

Ftemp = MSSTtemp / MSSE = 147/3 = 49

F-Test For Salinity

Fsalinity = MSSTsalinity/MSSE = 3/3 = 1

F-critical for 5% significance and degrees of freedom (k-1, (nt-1)(ns-1)), i.e. (2, 4):

F-critical = 6.94

Clearly, we can see that Ftemp is greater than F-critical, so we reject the null hypothesis and support
that temperature has a significant effect on yield.

On the other hand, Fsalinity is less than the F-critical value, so we do not reject the null hypothesis and
support that salinity doesn’t affect the yield.
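
scipy has no built-in two-way ANOVA, so here is a minimal numpy sketch that reproduces the sums of squares above for this 3x3 table (one observation per cell, no replication):

```python
import numpy as np
from scipy import stats

# Rows: temperature (60, 70, 80); columns: salinity (700, 1400, 2100)
yields = np.array([[3, 5, 4],
                   [11, 10, 12],
                   [16, 21, 17]], dtype=float)

grand = yields.mean()                                            # 11.0
n_rows, n_cols = yields.shape

ss_temp = n_cols * ((yields.mean(axis=1) - grand) ** 2).sum()    # 294
ss_sal  = n_rows * ((yields.mean(axis=0) - grand) ** 2).sum()    # 6
tss     = ((yields - grand) ** 2).sum()                          # 312
sse     = tss - ss_temp - ss_sal                                 # 12

df_err = (n_rows - 1) * (n_cols - 1)                             # 4
f_temp = (ss_temp / (n_rows - 1)) / (sse / df_err)               # 49
f_sal  = (ss_sal / (n_cols - 1)) / (sse / df_err)                # 1

f_crit = stats.f.ppf(0.95, 2, df_err)                            # ~6.94
print(f"F(temp) = {f_temp:.1f}, F(salinity) = {f_sal:.1f}, critical = {f_crit:.2f}")
```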

Bayes Statistics (Bayes Theorem)


Bayes' theorem is used for calculating conditional probabilities.

P(Ak | B) = P(Ak ∩ B) / [ P(A1 ∩ B) + P(A2 ∩ B) + ... + P(An ∩ B) ]

where P(Ak ∩ B) = P(Ak) * P(B | Ak).

It can also be written as follows:

P(Ak | B) = P(Ak) * P(B | Ak) / [ P(A1) * P(B | A1) + P(A2) * P(B | A2) + ... + P(An) * P(B | An) ]
When to Apply Bayes' Theorem:

• When the sample space is divided (partitioned) into a set of events {A1, A2, ..., An}.
• An event B is present, for which P(B) > 0 exists within the sample space.
• P( Ak | B ) is the form to compute a conditional probability.
And one of the two sets of probabilities is known:

• For each Ak, the probability P(Ak ∩ B).

• For each Ak, the probabilities P(Ak) and P(B | Ak).

Example:

Problem

There is a marriage ceremony in the desert and Marie is getting married tomorrow. In past years, it has rained only five days a year. The weatherman has forecast rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it does not rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding day?

Solution: The sample space is defined by two events: it rains or it does not rain. Furthermore, a third event occurs when the weatherman predicts rain.

Event A: It rains on Marie's wedding.

Event B: It does not rain on Marie's wedding.

Event C: The weatherman predicts rain.

• P(A) = 5/365 = 0.0137 [It rains only 5 days in a year.]

• P(B) = 360/365 = 0.9863 [It does not rain 360 days of the year.]

• P(C | A) = 0.9 [When it rains, the weatherman predicts rain 90% of the time.]

• P(C | B) = 0.1 [When it does not rain, the weatherman predicts rain 10% of the time.]

We want to know P(A | C), the probability that it will rain on Marie's wedding day given the forecast:

P(A | C) = P(A) * P(C | A) / [ P(A) * P(C | A) + P(B) * P(C | B) ]

P(A | C) = (0.0137)(0.9) / [ (0.0137)(0.9) + (0.9863)(0.1) ]

P(A | C) = 0.0123 / 0.1110 = 0.111

Even when the weatherman predicts rain, it might rain only about 11% of the time. So, there is a
good chance that Marie might not get rained on at her wedding.
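
The same computation in Python, as a sketch:

```python
# Bayes' theorem for the rain-forecast example
p_rain = 5 / 365             # P(A): prior probability of rain
p_dry = 1 - p_rain           # P(B)
p_forecast_given_rain = 0.9  # P(C | A)
p_forecast_given_dry = 0.1   # P(C | B)

# P(A | C) = P(A) P(C|A) / [P(A) P(C|A) + P(B) P(C|B)]
numerator = p_rain * p_forecast_given_rain
p_rain_given_forecast = numerator / (
    numerator + p_dry * p_forecast_given_dry
)
print(f"P(rain | forecast) = {p_rain_given_forecast:.3f}")   # ~0.111
```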
