
UNIT-IV

TESTING OF HYPOTHESIS

SAMPLING THEORIES

Sampling is the practice of selecting a group of individuals from a population in order to study the whole population.

Suppose we want to know the percentage of people who use iPhones in a city. One way to do this is to call up everyone in the city and ask them what type of phone they use. The other way is to select a smaller subgroup of individuals, ask them the same question, and then use this information as an approximation for the total population.

However, this process is not as simple as it sounds. Whenever you follow this method, your sample size has to be ideal: neither too large nor too small. Once you have decided on the size of your sample, you must use the right type of sampling technique to collect the sample from the population. Ultimately, every sampling technique falls under two broad categories:

 Probability sampling - Random selection techniques are used to select the sample.

 Non-probability sampling - Non-random selection techniques based on certain criteria are used to select the sample.

Types of Sampling Techniques in Data Analytics

Now, let’s discuss the types of sampling in data analytics. First, let us start with the
Probability Sampling techniques.

Probability Sampling Techniques

Probability sampling techniques are one of the important types of sampling techniques. Probability sampling gives every member of the population a chance of being selected. It is mainly used in quantitative research when you want to produce results representative of the whole population.

1. Simple Random Sampling

In simple random sampling, the researcher selects the participants randomly. Data analytical tools such as random number generators and random number tables are used, so selection is based entirely on chance.

Example: The researcher assigns every member in a company database a number from 1 to 1000 (depending on the size of the company) and then uses a random number generator to select 100 members.
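A minimal sketch of this idea in Python, assuming a hypothetical list of 1000 employee IDs; random.sample draws without replacement, so no member is picked twice.

import random

# Hypothetical population: employee IDs 1 to 1000
population = list(range(1, 1001))

# Draw a simple random sample of 100 members without replacement
sample = random.sample(population, k=100)
print(sample[:10])  # first ten selected IDs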

2. Systematic Sampling

In systematic sampling, every member of the population is given a number, as in simple random sampling. However, instead of randomly generating numbers, the samples are chosen at regular intervals.

Example: The researcher assigns every member in the company database a number.
Instead of randomly generating numbers, a random starting point (say 5) is selected.
From that number onwards, the researcher selects every, say, 10th person on the list
(5, 15, 25, and so on) until the sample is obtained.
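A sketch of systematic selection in Python, assuming the same hypothetical 1000-member list; the starting point is randomised within the first interval.

import random

population = list(range(1, 1001))  # hypothetical employee numbers
interval = 10                      # fixed sampling interval

# Random starting point within the first interval
start = random.randint(0, interval - 1)

# Every 10th member from the starting point onward
sample = population[start::interval]
print(sample[:5])  # e.g. [5, 15, 25, 35, 45] if start == 4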

3. Stratified Sampling

In stratified sampling, the population is subdivided into subgroups, called strata, based on shared characteristics (age, gender, income, etc.). After forming the subgroups, you can then use random or systematic sampling to select a sample from each subgroup. This method allows you to draw more precise conclusions because it ensures that every subgroup is properly represented.

Example: A company has 500 male employees and 100 female employees, and the researcher wants to ensure that the sample reflects the gender ratio as well. So the population is divided into two subgroups based on gender, and a sample is drawn from each.
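A sketch of proportional stratified sampling in Python, assuming hypothetical ID ranges for the two strata in the example above.

import random

# Hypothetical strata: 500 male and 100 female employee IDs
strata = {
    "male": list(range(1, 501)),
    "female": list(range(501, 601)),
}

# Sample 10% from each stratum so both groups are represented
sample = []
for name, members in strata.items():
    n = int(len(members) * 0.10)        # proportional allocation
    sample.extend(random.sample(members, n))

print(len(sample))  # 50 males + 10 females = 60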

4. Cluster Sampling

In cluster sampling, the population is divided into subgroups, but each subgroup has characteristics similar to those of the whole population. Instead of selecting a sample from each subgroup, you randomly select entire subgroups. This method is helpful when dealing with large and diverse populations.

Example: A company has over a hundred offices in ten cities across the world, each with roughly the same number of employees in similar job roles. The researcher randomly selects 2 to 3 offices and uses their employees as the sample.
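A sketch of cluster selection in Python; the office names and employee IDs are invented placeholders.

import random

# Hypothetical clusters: office codes mapped to employee IDs
offices = {f"office_{i}": list(range(i * 100, i * 100 + 100)) for i in range(10)}

# Randomly select 3 entire offices and take every employee in them
chosen = random.sample(list(offices), k=3)
sample = [emp for office in chosen for emp in offices[office]]
print(chosen, len(sample))  # 3 office names, 300 employees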

Next come the non-probability sampling techniques.

Non-Probability Sampling Techniques

Non-probability sampling techniques are the other important type of sampling techniques. In non-probability sampling, not every individual has a chance of being included in the sample. This sampling method is easier and cheaper but also carries a high risk of sampling bias. It is often used in exploratory and qualitative research with the aim of developing an initial understanding of the population.

1. Convenience Sampling

In this sampling method, the researcher simply selects the individuals who are most easily accessible. This is an easy way to gather data, but there is no way to tell whether the sample is representative of the entire population. The only criterion is that people are available and willing to participate.

Example: The researcher stands outside a company and asks the employees coming in
to answer questions or complete a survey.

2. Voluntary Response Sampling

Voluntary response sampling is similar to convenience sampling, in the sense that the only criterion is that people are willing to participate. However, instead of the researcher choosing the participants, the participants volunteer themselves.

Example: The researcher sends out a survey to every employee in a company and
gives them the option to take part in it.

3. Purposive Sampling

In purposive sampling, the researcher uses their expertise and judgment to select a
sample that they think is the best fit. It is often used when the population is very small
and the researcher only wants to gain knowledge about a specific phenomenon rather
than make statistical inferences.

Example: The researcher wants to know about the experiences of disabled employees
at a company. So the sample is purposefully selected from this population.

4. Snowball Sampling

In snowball sampling, the research participants recruit other participants for the study.
It is used when participants required for the research are hard to find. It is called
snowball sampling because like a snowball, it picks up more participants along the
way and gets larger and larger.

Example: The researcher wants to know about the experiences of homeless people in a city. Since there is no detailed list of homeless people, a probability sample is not possible. The only way to get the sample is to get in touch with one homeless person, who then puts the researcher in touch with other homeless people in a particular area.

Statistical Inference:
Statistical inference refers to the process of selecting and using a sample
statistic to draw conclusions about the population parameter. Statistical inference
deals with two types of problems.
They are:-
1. Testing of Hypothesis
2. Estimation
Hypothesis:
Hypothesis is a statement subject to verification. More precisely, it is a
quantitative statement about a population, the validity of which remains to be
tested. In other words, hypothesis is an assumption made about a population
parameter.
Testing of Hypothesis:
Testing of hypothesis is the process of examining whether the hypothesis formulated by the researcher is valid or not. The main objective of hypothesis testing is to decide whether to accept or reject the hypothesis.
Procedure for Testing of Hypothesis:
The various steps in testing of hypothesis are the following:-
1. Set Up a Hypothesis:
The first step in testing of hypothesis is to set up a hypothesis about the population parameter. Normally, the researcher has to fix two types of hypotheses. They are the null hypothesis and the alternative hypothesis.
Null Hypothesis:-
Null hypothesis is the original hypothesis. It states that there is no
significant difference between the sample and population regarding a
particular matter under consideration. The word “null” means ‘invalid’ or ‘void’
or ‘amounting to nothing’. Null hypothesis is denoted by H0. For example,
suppose we want to test whether a medicine is effective in curing cancer. Hence,
the null hypothesis will be stated as follows:-
H0: The medicine is not effective in curing cancer (i.e., there is no
significant difference between the given medicine and other
medicines in curing cancer disease.)
Alternative Hypothesis:-
Any hypothesis other than null hypothesis is called alternative hypothesis.
When a null hypothesis is rejected, we accept the other hypothesis,
known as alternative hypothesis. Alternative hypothesis is denoted by
H1. In the above example, the alternative hypothesis may be stated as
follows:-

H1: The medicine is effective in curing cancer. (i.e., there is
significant difference between the given medicine and other medicines
in curing cancer disease.)

2. Set up a suitable level of significance:


After setting up the hypothesis, the researcher has to set up a suitable level of significance. The level of significance is the probability with which we may reject a null hypothesis when it is true. For example, if the level of significance is 5%, it means that in the long run, the researcher is rejecting a true null hypothesis 5 times out of every 100 times. The level of significance is denoted by α (alpha).

α = Probability of rejecting H0 when it is true.

Generally, the level of significance is fixed at 1% or 5%.
3. Decide a test criterion:
The third step in testing of hypothesis is to select an appropriate test
criterion. Commonly used tests are the z-test, t-test, χ²-test, F-test, etc.
4. Calculation of test statistic:
The next step is to calculate the value of the test statistic using the appropriate formula. The general form for computing the value of the test statistic is:-

Value of test statistic = Difference / Standard Error
5. Making Decision:
Finally, we may draw conclusions and take decisions. The decision may
be either to accept or reject the null hypothesis.
If the calculated value is more than the table value, we reject the null
hypothesis and accept the alternative hypothesis.
If the calculated value is less than the table value, we accept the null
hypothesis.
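As a sketch of how these five steps fit together, here is a hypothetical one-sample test of a mean with a known population S.D. in Python; all numbers (mean, S.D., sample size) are invented for illustration, and the "table value" is computed from the normal distribution.

import math
from statistics import NormalDist

# Step 1: H0: population mean = 50; H1: mean != 50 (hypothetical values)
mu0, sample_mean = 50, 52
sigma, n = 8, 64            # assumed known population S.D. and sample size

# Step 2: level of significance
alpha = 0.05

# Steps 3-4: test criterion (z-test); test statistic = difference / S.E.
se = sigma / math.sqrt(n)
z = (sample_mean - mu0) / se          # (52 - 50) / 1 = 2.0

# Step 5: compare with the table value for a two-tailed test
critical = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96
print("reject H0" if abs(z) > critical else "accept H0")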
Sampling Distribution
The distribution of all possible values which can be assumed by some
statistic, computed from samples of the same size randomly drawn from the same
population is called Sampling distribution of that statistic.
Standard Error (S.E)
Standard Error is the standard deviation of the sampling distribution of a statistic. Standard error plays a very important role in the large sample theory.
The following are the important uses of standard errors:-
1. Standard Error is used for testing a given hypothesis
2. S.E. gives an idea about the reliability of a sample, because the
reciprocal of S.E. is a measure of reliability of the sample.
3. S.E. can be used to determine the confidence limits within which
the population parameters are expected to lie.
Test Statistic
The decision to accept or to reject a null hypothesis is made on the
basis of a statistic computed from the sample. Such a statistic is called the test
statistic. There are different types of test statistics. All these test statistics can be
classified into two groups. They are
a. Parametric Tests
b. Non-Parametric Tests
PARAMETRIC TESTS
The statistical tests based on the assumption that the population is normally distributed are called parametric tests. The important parametric tests are:-
1. z-test
2. t-test
3. f-test
Z-test:
Z-test is applied when the test statistic follows the normal distribution. It was developed by Prof. R.A. Fisher. The following are the important uses of the z-test:-
1. To test the population mean when the sample is large or when
the population standard deviation is known.
2. To test the equality of two sample means when the samples are
large or when the population standard deviation is known.
3. To test the population proportion.
4. To test the equality of two sample proportions.
5. To test the population standard deviation when the sample is large.
6. To test the equality of two sample standard deviations when the samples
are large or when population standard deviations are known.
7. To test the equality of correlation coefficients.

Z-test is used in testing of hypothesis on the basis of some assumptions. The important assumptions in the z-test are:-
1. The sampling distribution of the test statistic is normal.
2. Sample statistics are close to the population parameters; therefore, for finding the standard error, sample statistics may be used in place of population parameters.
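As an illustration of use 4 above (testing the equality of two sample proportions), a sketch in Python with made-up counts; it uses the standard large-sample pooled-proportion standard error.

import math

# Hypothetical data: successes x out of n in two large samples
x1, n1 = 120, 400
x2, n2 = 90, 360

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)                       # pooled proportion under H0
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(round(z, 3))  # compare |z| with 1.96 at the 5% level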
T-test:
The t-distribution was originated by W.S. Gosset in the early 1900s. The t-test is applied when the test statistic follows the t-distribution.
Uses of the t-test are:-
1. To test the population mean when the sample is small and the population S.D. is unknown.
2. To test the equality of two sample means when the samples are small and the population S.D. is unknown.
3. To test the difference in values of two dependent samples.
4. To test the significance of correlation coefficients.
The following are the important assumptions in the t-test:-
1. The population from which the sample is drawn is normal.
2. The sample observations are independent.
3. The population S.D. is unknown.
4. When the equality of two population means is tested, the samples are assumed to be independent, and the population variances are assumed to be equal and unknown.
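A sketch of use 1 above (testing a population mean from a small sample when the population S.D. is unknown), assuming the scipy library is available; the data values are invented.

from scipy import stats

# Hypothetical small sample (n < 30), population S.D. unknown
data = [12.1, 11.8, 12.4, 12.0, 11.6, 12.3, 12.2, 11.9]

# H0: population mean = 12.0
t_stat, p_value = stats.ttest_1samp(data, popmean=12.0)
print(round(t_stat, 3), round(p_value, 3))  # reject H0 if p_value < alpha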
F-test:
The F-test is used to determine whether two independent estimates of population variance differ significantly, or to establish whether both have come from the same population. For carrying out the test of significance, we calculate a ratio, called the F-ratio. The F-test is named in honour of the great statistician R.A. Fisher. It is also called the Variance Ratio Test.
The F-ratio is defined as follows:-

F = S₁² / S₂²

where

S₁² = Σ(X₁ − X̄₁)² / (n₁ − 1)
S₂² = Σ(X₂ − X̄₂)² / (n₂ − 1)

While calculating the F-ratio, the numerator is the greater variance and the denominator is the smaller variance. So,

F = Greater Variance / Smaller Variance

Uses of F-distribution:-
1. To test the equality of variances of two populations.
2. To test the equality of means of three or more populations.
3. To test the linearity of regression

Assumptions of F-distribution:-
1. The values in each group are normally distributed.
2. The variance within each group should be equal for all groups (σ₁² = σ₂² = σ₃² = …).
3. The error (variation of each value around its own group mean) should be independent for each value.
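A minimal sketch of the variance-ratio computation defined above, with two invented samples; Python's statistics.variance uses the (n − 1) denominator, matching S₁² and S₂².

from statistics import variance  # sample variance with n - 1 denominator

# Hypothetical independent samples from two populations
sample1 = [23, 25, 28, 30, 26, 27, 24]
sample2 = [20, 22, 21, 23, 25, 19, 24, 22]

s1_sq, s2_sq = variance(sample1), variance(sample2)

# Numerator is the greater variance, denominator the smaller
f_ratio = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
print(round(f_ratio, 3))  # compare with the F table value at the chosen level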

TYPES OF ERRORS IN TESTING OF HYPOTHESIS:


In any test of hypothesis, the decision is to accept or reject a null hypothesis.
The four possibilities of the decision are:-
1. Accepting a null hypothesis when it is true.
2. Rejecting a null hypothesis when it is false.

3. Rejecting a null hypothesis when it is true.
4. Accepting a null hypothesis when it is false.
Out of the above 4 possibilities, 1 and 2 are correct, while 3 and 4 are errors. The error involved in the 3rd possibility above is called Type I error and that in the 4th possibility is called Type II error.

Type I Error
The error committed by rejecting a null hypothesis when it is true, is
called Type I error.
The probability of committing Type I error is denoted by α (alpha).

α = Prob. (Type I error)
  = Prob. (Rejecting H0 when it is true)

Type II Error
The error committed by accepting a null hypothesis when it is false is called
Type II error.
The probability of committing Type II error is denoted by β (beta).
β = Prob. (Type II error)
β = Prob. (Accepting H0 when it is false)
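As an illustration of α, a small simulation sketch in Python: we draw many samples from a population where H0 is actually true and count how often a 5% two-tailed z-test wrongly rejects it; the parameter values are invented.

import random, math

random.seed(1)
mu, sigma, n, trials = 50, 10, 36, 10_000
rejections = 0

for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]   # H0 is true here
    z = (sum(sample) / n - mu) / (sigma / math.sqrt(n))
    if abs(z) > 1.96:                                       # 5% two-tailed test
        rejections += 1                                     # a Type I error

print(rejections / trials)  # close to alpha = 0.05 in the long run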

Small and Large samples


When the size of the sample is 30 or less, the sample is called a small sample. When the size of the sample exceeds 30, the sample is called a large sample.

Degree of freedom
Degree of freedom is defined as the number of independent observations
which is obtained by subtracting the number of constraints from the total number
of observations.
Degree of freedom = Total number of observations – Number of constraints.

Rejection region and Acceptance region


The entire area under a normal curve may be divided into two parts.
They are rejection region and acceptance region.
Rejection Region:
Rejection region is the area which corresponds to the predetermined
level of significance. If the calculated value of the test statistic falls in the
rejection region, we reject the null hypothesis. Rejection region is also
called critical region. It is denoted by α (alpha).
Acceptance Region:
Acceptance region is the area which corresponds to 1 – α.

Acceptance region = 1 – rejection region = 1 – α

If the calculated value of the test statistic falls in the acceptance region, we accept the null hypothesis.

NON-PARAMETRIC TESTS

A non-parametric test is a test which is not concerned with the testing of parameters. Non-parametric tests do not make any assumption regarding the form of the population. Therefore, non-parametric tests are also called distribution-free tests.
Following are the important non-parametric tests:-
1. Chi-square test (χ²-test)
2. Sign test
3. Signed rank test (Wilcoxon matched pairs test)
4. Rank sum test (Mann-Whitney U-test and Kruskal-Wallis H-test)
5. Run test
6. Kolmogorov-Smirnov test (K-S test)

CHI-SQUARE TEST (χ²-test)


The value of chi-square describes the magnitude of the difference between observed frequencies and expected frequencies under certain assumptions. The χ² value (χ² quantity) ranges from zero to infinity. It is zero when the expected frequencies and observed frequencies completely coincide; the greater the value of χ², the greater is the discrepancy between observed and expected frequencies.
The χ²-test is a statistical test which tests the significance of the difference between observed frequencies and the corresponding theoretical frequencies of a distribution, without any assumption about the distribution of the population. It is one of the simplest and most widely used non-parametric tests in statistical work. This test was developed by Prof. Karl Pearson in 1900.

Uses of χ²-test
The uses of the chi-square test are:-
1. Useful for the test of goodness of fit:- The χ²-test can be used to test whether there is goodness of fit between the observed frequencies and expected frequencies.
2. Useful for the test of independence of attributes:- The χ²-test can be used to test whether two attributes are associated or not.
3. Useful for the test of homogeneity:- The χ²-test is very useful to test whether two attributes are homogeneous or not.
4. Useful for testing a given population variance:- The χ²-test can be used for testing whether the given population variance is acceptable on the basis of samples drawn from that population.

χ²-test as a test of goodness of fit:

As a non-parametric test, the χ²-test is mainly used to test the goodness of fit between the observed frequencies and expected frequencies.
Procedure:-

1. Set up the null hypothesis that there is goodness of fit between observed and expected frequencies.

2. Find the χ² value using the following formula:-

χ² = Σ (O − E)² / E

where O = Observed frequencies and E = Expected frequencies

3. Compute the degree of freedom.


d. f. = n – r – 1
Where ‘r’ is the number of independent constraints to be
satisfied by the frequencies
4. Obtain the table value corresponding to the level of significance and the degrees of freedom.

5. Decide whether to accept or reject the null hypothesis. If the calculated value is less than the table value, we accept the null hypothesis and conclude that there is goodness of fit. If the calculated value is more than the table value, we reject the null hypothesis and conclude that there is no goodness of fit.
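A sketch of this procedure in Python, assuming scipy is available; the observed die-roll counts and uniform expected counts are invented for illustration.

from scipy import stats

# Hypothetical observed rolls of a die tested against a uniform expectation
observed = [8, 12, 9, 11, 10, 10]
expected = [10, 10, 10, 10, 10, 10]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 3), round(p_value, 3))  # accept H0 if p_value > alpha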

Qn:- A sample analysis of the examination results of 200 students was made. It was found that 46 students had failed, 68 secured IIIrd class, 62 secured IInd class, and the rest were placed in the Ist class. Are these figures commensurate with the general examination results, which are in the ratio 2 : 3 : 3 : 2 for the various categories respectively?

Sol: H0: The figures are commensurate with the general examination results.

H1: The figures are not commensurate with the general examination results.

χ² = Σ (O − E)² / E

Computation of χ² value:

O      E                   O − E    (O − E)²    (O − E)²/E
46     200 × 2/10 = 40     6        36          0.9000
68     200 × 3/10 = 60     8        64          1.0667
62     200 × 3/10 = 60     2        4           0.0667
24     200 × 2/10 = 40     −16      256         6.4000

Σ (O − E)²/E = 8.4334

χ² = 8.4334
The table value at the 5% level of significance and 3 degrees of freedom = 7.815
(d.f. = n − r − 1 = 4 − 0 − 1 = 3)

The calculated value is more than the table value.

∴ We reject H0.
∴ We conclude that the analytical figures are not commensurate with the general examination results. In other words, there is no goodness of fit between the observed and expected frequencies.
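As a quick cross-check of the computation above, assuming scipy is available:

from scipy import stats

observed = [46, 68, 62, 24]
expected = [40, 60, 60, 40]   # 200 students split in the ratio 2 : 3 : 3 : 2

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 4))         # ≈ 8.4333 (8.4334 above reflects component rounding)
print(p_value < 0.05)         # True, so H0 is rejected at the 5% level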

Qn: Test whether accidents occur uniformly over the days of the week on the basis of the following information:-

Days of the week: Sun Mon Tue Wed Thu Fri Sat
No. of accidents: 11 13 14 13 15 14 18

Sol: H0: There is goodness of fit between observed and expected frequencies, i.e., accidents occur uniformly over the days of the week.
H1: There is no goodness of fit between observed and expected frequencies, i.e., accidents do not occur uniformly over the days of the week.

χ² = Σ (O − E)² / E

Computation of χ² value:
(Total accidents = 98, so E = 98/7 = 14 for each day.)

O     E     O − E    (O − E)²    (O − E)²/E
11    14    −3       9           0.6429
13    14    −1       1           0.0714
14    14    0        0           0.0000
13    14    −1       1           0.0714
15    14    1        1           0.0714
14    14    0        0           0.0000
18    14    4        16          1.1429

Σ (O − E)²/E = 2.0000

The value of χ² at the 5% level of significance and n − r − 1 = 7 − 0 − 1 = 6 d.f. = 12.592

The calculated value is less than the table value.

∴ We accept the null hypothesis. We may conclude that there is goodness of fit between observed and expected frequencies, i.e., the accidents occur uniformly over the days of the week.

NB: For problems refer notes
