
FDSA UNIT 3

Unit 3 of the Fundamentals of Data Science and Analytics covers inferential statistics, including concepts such as populations, samples, random sampling, hypothesis testing, and confidence intervals. It defines key terms like probability, mutually exclusive events, and dependent/independent events, along with statistical procedures like z-tests and the central limit theorem. The unit emphasizes the importance of random sampling for inferential statistics and provides guidelines for hypothesis testing and estimation.


AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS UNIT 3

UNIT III – INFERENTIAL STATISTICS


SYLLABUS:
Populations – samples – random sampling – Sampling distribution- standard
error of the mean - Hypothesis testing – z-test – z-test procedure –decision
rule – calculations – decisions – interpretations - one-tailed and two-tailed
tests – Estimation – point estimate – confidence interval – level of confidence
– effect of sample size.

PART A

1. Define Population and its types.


 Population
 Any complete set of observations (or potential observations).
Types of Population
 Real Populations
o A real population is one in which all potential observations are
accessible at the time of sampling.
 Hypothetical Populations
o A hypothetical population is one in which not all potential
observations are accessible at the time of sampling.

2. Define Sample and Random Sampling.


 Sample
 Any subset of observations from a population.
 The sample size is small relative to the population size.
 Random Sampling
 A selection process that guarantees all potential observations in
the population have an equal chance of being selected.
 Inferential statistics requires that samples be random.

3. Define the term probability.


 Probability
 The proportion or fraction of times that a particular event is likely to
occur.

4. What is meant by Mutually Exclusive Events? State the Addition Rule
for Mutually Exclusive Events.
Mutually Exclusive Events
 Events that cannot occur together.

PREPARED BY: Dr. S. ARTHEESWARI, Prof.& HEAD/AI&DS 1



Addition Rule
 Add together the separate probabilities of several mutually exclusive
events to find the probability that any one of these events will occur.

$\Pr(A \text{ or } B) = \Pr(A) + \Pr(B)$

where Pr( ) refers to the probability of the event in parentheses and A
and B are mutually exclusive events.

5. What is meant by Dependent and Independent Events? State the
Multiplication Rule for Independent Events.
Dependent Events
 When the occurrence of one event affects the probability of the other
event, these events are dependent.
 Although the heights of randomly selected pairs of men are independent,
the heights of brothers are dependent.

Independent Events
 The occurrence of one event has no effect on the probability that the other
event will occur.

Multiplication Rule
 Multiply together the separate probabilities of several independent
events to find the probability that these events will occur together.

$\Pr(A \text{ and } B) = \Pr(A) \times \Pr(B)$

where A and B are independent events.

6. Define Conditional Probability and the Alternative Approach to Conditional
Probabilities.
Conditional Probability
 The probability of one event, given the occurrence of another event.

Alternative Approach to Conditional Probabilities


 Conditional probabilities can be easily misinterpreted.
 Convert probabilities to frequencies (which, for example, total 100);
solve the problem with frequencies; and then convert the answer back to a
probability.


7. Define sampling distribution of the mean.


 The sampling distribution of the mean refers to the probability
distribution of means for all possible random samples of a given size from
some population.

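8. Narrate the symbols used for the mean and standard deviation of three
types of distributions.

Type of distribution                 Mean               Standard deviation
Sample                               $\bar{X}$          s
Population                           μ                  σ
Sampling distribution of the mean    $\mu_{\bar{X}}$    $\sigma_{\bar{X}}$ (standard error of the mean)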

9. Define mean of all sample means.

 MEAN OF ALL SAMPLE MEANS


 The mean of all sample means always equals the population mean.

$\mu_{\bar{X}} = \mu$

where $\mu_{\bar{X}}$ represents the mean of the sampling distribution and μ
represents the mean of the population.

10. Define Standard error of the mean.

 STANDARD ERROR OF THE MEAN


 The distribution of sample means also has a standard deviation,
referred to as the standard error of the mean.
 The standard error of the mean equals the standard deviation of
the population divided by the square root of the sample size:
$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$


11. Define Shape of the sampling distribution or state the central limit
theorem.
 SHAPE OF THE SAMPLING DISTRIBUTION
Central Limit Theorem
 The central limit theorem states that, regardless of the shape of the
population, the shape of the sampling distribution of the mean
approximates a normal curve if the sample size is sufficiently large.

12. Define Hypothesis Testing and its types.


Hypothesis Testing
 Hypothesis testing is a statistical method used to determine if there is
enough evidence in the sample data to draw conclusions about a
population.
 It is used to assess the relationship between two statistical variables.
 It involves formulating two competing hypotheses, the null hypothesis
(H0) and the alternative hypothesis (H1), and then collecting data to
assess the evidence.
 Hypothesis testing evaluates two mutually exclusive population
statements to determine which statement is most supported by
sample data.

13. Define Null Hypothesis and Alternative Hypothesis.


 Null hypothesis (H0):
In statistics, the null hypothesis is a general statement or default
position that there is no relationship between two measured cases or
no relationship among groups. In other words, it is a basic
assumption made on the basis of knowledge of the problem.
Example:
A company’s mean production is 50 units per day.
H0: μ = 50.
 Alternative hypothesis (H1):
The alternative hypothesis is the hypothesis used in hypothesis
testing that is contrary to the null hypothesis.
Example:
A company’s mean production is not equal to 50 units per day, i.e.,
H1: μ ≠ 50.


14. Explain testing of Null Hypothesis. Define Common Outcome and Rare
Outcome.
Testing Null Hypothesis
 The null hypothesis is tested by determining whether the one observed
sample mean qualifies as a common outcome or a rare outcome in the
hypothesized sampling distribution.
Common Outcomes
o An observed sample mean qualifies as a common outcome if the
difference between its value and that of the hypothesized population
mean is small enough to be viewed as a probable outcome under the
null hypothesis.
o Because there is no compelling reason for rejecting the null hypothesis,
it is retained.
Rare Outcomes
o An observed sample mean qualifies as a rare outcome if the difference
between its value and the hypothesized population mean is too large to
be reasonably viewed as a probable outcome under the null hypothesis.

15. Discuss z test for a population mean.

Z TEST FOR A POPULATION MEAN


 A hypothesis test that evaluates how far the observed sample mean
deviates, in standard error units, from the hypothesized population
mean.
 This z test is accurate only when
(1) the population is normally distributed or the sample size is large
enough to satisfy the requirements of the central limit theorem
(2) the population standard deviation is known.

16. List the z-test step-by-step procedure.


Step 1 - State the research problem.
Step 2 - Identify the statistical hypotheses.
Step 3 - Specify a decision rule.
Step 4 - Calculate the value of the observed z.
Step 5 - Make a decision.
Step 6 - Interpret the decision.

17. Define Critical z Score


 A z score that separates common from rare outcomes and hence
dictates whether H0 should be retained or rejected.


18. Define Level of Significance (α)


 The degree of rarity required of an observed outcome in order to reject
the null hypothesis (H0).

19. What is the use of one-tailed and two-tailed tests in hypothesis
testing? When should each be used?
 One-tailed and two-tailed tests are ways to identify the relationship
between the statistical variables.
 To check for a relationship between variables in a single direction
(left or right), use a one-tailed test.
 To check for a relationship between variables in either direction, use a
two-tailed test.

20. Define One-Tailed or Directional Test


 A one-tailed test is based on a uni-directional hypothesis where the
area of rejection is on only one side of the sampling distribution.
 It determines whether a particular population parameter is larger or
smaller than the predefined parameter. It uses one single critical value
to test the data.


21. Define Two-Tailed or Non-directional Test


 Rejection regions are located in both tails of the sampling distribution.
 To check whether the sample value is either greater than or less than
the hypothesized value, use a two-tailed test.
 It tests the null hypothesis against a nondirectional alternative
hypothesis.

Figure 3.9 – Two tailed test
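The critical z scores that bound the rejection regions of one-tailed and two-tailed tests can be looked up in a standard normal table or computed. The following is a minimal sketch using Python's standard library; the function name `critical_z` and the choice of α = .05 are assumptions made for illustration and do not come from these notes.

```python
from statistics import NormalDist


def critical_z(alpha, tails=2):
    """Return the positive critical z score for a level of significance.

    Hypothetical helper: for a two-tailed test, alpha is split equally
    between the two tails; for a one-tailed test, all of alpha sits in
    one tail.
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    tail_area = alpha / tails
    # inv_cdf gives the z score with the stated area to its left.
    return NormalDist().inv_cdf(1 - tail_area)


two_tailed = critical_z(0.05, tails=2)  # rejection regions beyond -z and +z
one_tailed = critical_z(0.05, tails=1)  # rejection region in one tail only
print(round(two_tailed, 2), round(one_tailed, 3))  # 1.96 1.645
```

For the same level of significance, the one-tailed critical value is smaller in magnitude because the whole rejection area is concentrated in a single tail.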

22. Define Point Estimate.


POINT ESTIMATE
 A single value that represents some unknown population
characteristic, such as the population mean.
 The best single point estimate for the unknown population mean is
simply the observed value of the sample mean.

23. Define Confidence interval


CONFIDENCE INTERVAL (CI) FOR μ
 A confidence interval for μ uses a range of values that, with a known
degree of certainty, includes an unknown population characteristic,
such as a population mean.

Confidence Interval for μ Based on z

$\bar{X} \pm (z_{\text{conf}})(\sigma_{\bar{X}})$

where

$\bar{X}$ represents the sample mean;

$z_{\text{conf}}$ represents a number from the standard normal table that satisfies
the confidence specifications for the confidence interval; and

$\sigma_{\bar{X}}$ represents the standard error of the mean.
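A confidence interval based on z can be computed as the sample mean plus and minus a z multiple of the standard error. The sketch below assumes illustrative values (a sample mean of 533, σ = 110, n = 100) that are not taken from these notes, and the function name `z_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Return (lower, upper) limits of a z-based confidence interval."""
    # z_conf leaves (1 - level) / 2 of the area in each tail.
    z_conf = NormalDist().inv_cdf(1 - (1 - level) / 2)
    standard_error = sigma / sqrt(n)
    margin = z_conf * standard_error
    return sample_mean - margin, sample_mean + margin


# Illustrative (assumed) values: sample mean 533, sigma 110, n 100.
lower, upper = z_confidence_interval(533, sigma=110, n=100, level=0.95)
print(round(lower, 2), round(upper, 2))  # 511.44 554.56

# Effect of sample size: a larger n gives a narrower interval.
lower_big, upper_big = z_confidence_interval(533, sigma=110, n=400)
print(upper_big - lower_big < upper - lower)  # True
```

Raising the level of confidence (for example, from 95 to 99 percent) increases the z multiple and therefore widens the interval, while increasing the sample size shrinks the standard error and narrows it.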


24. Define level of confidence


 The level of confidence indicates the percent of time that a series of
confidence intervals includes the unknown population characteristic,
such as the population mean.
 Any level of confidence may be assigned to a confidence interval
merely by substituting an appropriate value for zconf in the confidence
interval formula.
 Although many different levels of confidence have been used, 95
percent and 99 percent are the most prevalent.

25. Which is more informative: hypothesis tests or confidence intervals?


 Hypothesis tests merely indicate whether or not an effect is present, whereas
confidence intervals indicate the possible size of the effect.
 Confidence intervals tend to be more informative than hypothesis tests.


PART B

1. Give a detailed introduction to Population, Sample, and Probability.


 Population
 Any complete set of observations (or potential observations).
Types of Population
 Real Populations
o A real population is one in which all potential observations are
accessible at the time of sampling.
 Hypothetical Populations
o A hypothetical population is one in which not all potential
observations are accessible at the time of sampling.

 Sample
 Any subset of observations from a population.
 The sample size is small relative to the population size.

Example 3.1
For each of the following pairs, indicate with a Yes or No
whether the relationship between the first and second
expressions could describe that between a sample and its
population, respectively.
(a) students in the last row; students in class
(b) citizens of Wyoming; citizens of New York
(c) 20 lab rats in an experiment; all lab rats, similar to those
used, that could undergo the same experiment
(d) all U.S. presidents; all registered Republicans
(e) two tosses of a coin; all possible tosses of a coin
Solution
(a) Yes
(b) No. Citizens of Wyoming aren’t a subset of citizens of New York.
(c) Yes
(d) No. All U.S. presidents aren’t a subset of all registered Republicans.
(e) Yes

Example 3.2
Identify all of the expressions from Example 3.1 that involve a
hypothetical population.
Solution
Expressions in 3.1(c) and 3.1(e) involve hypothetical populations.


 Random Sampling
 A selection process that guarantees all potential observations in the
population have an equal chance of being selected.
 Inferential statistics requires that samples be random.

Example 3.3
Indicate whether each of the following statements is True or False.
A random selection of 10 playing cards from a deck of 52 cards implies that
(a) the random sample of 10 cards accurately represents the important
features of the whole deck.
(b) each card in the deck has an equal chance of being selected.
(c) it is impossible to get 10 cards from the same suit (for example, 10
hearts).
(d) any outcome, however unlikely, is possible.
Solution
a. False. Sometimes, just by chance, a random sample of 10 cards fails to
represent the important features of the whole deck.
b. True
c. False. Although unlikely, 10 hearts could appear in a random sample of 10
cards.
d. True

 Tables Of Random Numbers


 Tables of random numbers can be used to obtain a random sample.
 These tables are generated by a computer designed to equalize the
occurrence of any one of the 10 digits: 0, 1, 2, . . . , 8, 9.

Example 3.4
Describe how you would use the table of random numbers to take
a. a random sample of five statistics students in a classroom where
each of nine rows consists of nine seats.
b. a random sample of size 40 from a large directory consisting of
3041 pages, with 480 lines per page.
Solution
a. There are many ways. For instance, consult the tables of random numbers,
using the first digit of each 5-digit random number to identify the row
(previously labelled 1, 2, 3, and so on), and the second digit of the same
random number to locate a particular student’s seat within that row.
Repeat this process until five students have been identified. (If the
classroom is larger, use additional digits so that every student can be
sampled.)

b. Once again, there are many ways. For instance, use the initial 4 digits
of each random number (between 0001 and 3041) to identify the page
number of the telephone directory and the next 3 digits (between 001
and 480) to identify the particular line on that page. Repeat this
process, using 7-digit numbers, until 40 telephone numbers have been
identified.

 Probability
 The proportion or fraction of times that a particular event is likely to
occur.

Mutually Exclusive Events


 Events that cannot occur together.
Addition Rule
 Add together the separate probabilities of several mutually exclusive
events to find the probability that any one of these events will occur.

$\Pr(A \text{ or } B) = \Pr(A) + \Pr(B)$

where Pr( ) refers to the probability of the event in parentheses and A
and B are mutually exclusive events.

Example 3.5
Assuming that people are equally likely to be born during any
one of the months, what is the probability of Jack being born
during
(a) June?
(b) any month other than June?
(c) either May or June?
Solution
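One way to work this example is to apply the addition rule to the twelve equally likely, mutually exclusive months. The sketch below uses exact fractions; the variable names are chosen here for illustration only.

```python
from fractions import Fraction

# Each of the 12 months is assumed equally likely, as the example states.
p_month = Fraction(1, 12)

p_june = p_month                    # (a) June
p_not_june = 1 - p_june             # (b) any month other than June
p_may_or_june = p_month + p_month   # (c) addition rule: Pr(May) + Pr(June)

print(p_june, p_not_june, p_may_or_june)  # 1/12 11/12 1/6
```

Part (c) gives 2/12, which reduces to 1/6. Part (b) can equally be found by adding the separate probabilities of the eleven other months.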

Independent Events
 The occurrence of one event has no effect on the probability that the
other event will occur.
Multiplication Rule
 Multiply together the separate probabilities of several independent
events to find the probability that these events will occur together.

$\Pr(A \text{ and } B) = \Pr(A) \times \Pr(B)$

where A and B are independent events.

Example 3.6
Assuming that people are equally likely to be born during any of the
months, and also assuming (possibly over the objections of astrology
fans) that the birthdays of married couples are independent, what’s
the probability of
(a) the husband being born during January and the wife being born
during February?
(b) both husband and wife being born during December?
(c) both husband and wife being born during the spring (April or May)?
(Hint: First, find the probability of just one person being born during
April or May.)
Solution
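One way to work this example is to apply the multiplication rule to the two independent birthdays, using the addition rule first for part (c) as the hint suggests. The variable names in this sketch are illustrative assumptions.

```python
from fractions import Fraction

# Each month is assumed equally likely, and the two birthdays independent.
p_month = Fraction(1, 12)

# (a) husband born in January AND wife born in February
p_jan_and_feb = p_month * p_month

# (b) both born in December
p_both_december = p_month * p_month

# (c) both born in April or May: addition rule for one person first,
#     then multiplication rule for the couple.
p_april_or_may = p_month + p_month
p_both_spring = p_april_or_may * p_april_or_may

print(p_jan_and_feb, p_both_december, p_both_spring)  # 1/144 1/144 1/36
```

Part (c) gives (2/12)(2/12) = 4/144, which reduces to 1/36.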

Dependent Events
 When the occurrence of one event affects the probability of the other
event, these events are dependent.
 Although the heights of randomly selected pairs of men are independent,
the heights of brothers are dependent.

Conditional Probability
 The probability of one event, given the occurrence of another event.

Alternative Approach to Conditional Probabilities


 Conditional probabilities can be easily misinterpreted.
 Convert probabilities to frequencies (which, for example, total 100);
solve the problem with frequencies; and then convert the answer back to a
probability.


Example –

Figure 3.1 – A frequency analysis of 100 drivers who caused fatal accidents

Figure 3.1 shows a frequency analysis for the 100 drivers involved in fatal
accidents.
Working from the top down, notice that among the 100 drivers, 40 are drunk
(from .40 × 100 = 40) and 20 take drugs (from .20 × 100 = 20). Also notice
that 12 of the 40 drunk drivers also take drugs (from .30 × 40 = 12). Now, it
is fairly straightforward to establish the probability of drivers both being
drunk and taking drugs. It is simply the number of drivers who are drunk
and take drugs, 12, divided by the total number of drivers, 100, that is,
12/100 = .12, which, of course, is the same as the previous answer.
Once a frequency analysis has been done, it often is easy to answer other
questions.
For example,
“What is the conditional probability of being drunk, given that the driver
takes illegal drugs?”
Referring to Figure 3.1, divide the number of drivers who are drunk and take
drugs, 12, by the number of drivers who take drugs, 20, that is, 12/20 = .60.
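The frequency analysis just described can be retraced in a few lines: convert the probabilities to counts out of 100 drivers, work with the counts, and convert back. This is a minimal sketch; the variable names are assumptions made for illustration.

```python
total_drivers = 100

# Convert the stated probabilities to frequencies.
drunk = round(0.40 * total_drivers)      # 40 drunk drivers
drugs = round(0.20 * total_drivers)      # 20 drivers who take drugs
drunk_and_drugs = round(0.30 * drunk)    # 12 of the 40 drunk drivers

# Convert the frequencies back to probabilities.
p_drunk_and_drugs = drunk_and_drugs / total_drivers
p_drunk_given_drugs = drunk_and_drugs / drugs

print(p_drunk_and_drugs)     # 0.12
print(p_drunk_given_drugs)   # 0.6
```

Note that the two conditional probabilities differ: .30 of drunk drivers take drugs, whereas .60 of drivers who take drugs are drunk, because the denominators (40 versus 20) are different.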

Example 3.7
Among 100 couples who had undergone marital counselling, 60
couples described their relationships as improved, and among this
latter group, 45 couples had children. The remaining couples
described their relationships as unimproved, and among this group, 5
couples had children. (Hint: Using a frequency analysis, begin with the
100 couples, first branch into the number of couples with improved
and unimproved relationships, then under each of these numbers,
branch into the number of couples with children and without children.
Enter a number at each point of the diagram before proceeding.)
a. What is the probability of randomly selecting a couple who described
their relationship as improved?
b. What is the probability of randomly selecting a couple with children?
c. What is the conditional probability of randomly selecting a couple
with children, given that their relationship was described as
improved?
d. What is the conditional probability of randomly selecting a couple
without children, given that their relationship was described as not
improved?
e. What is the conditional probability of an improved relationship, given
that a couple has children?
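Following the hint, one way to work Example 3.7 is to fill in the frequency tree and then divide the relevant counts. The sketch below is only one possible working; the variable names are illustrative assumptions.

```python
total = 100

# First branch: improved versus unimproved relationships.
improved = 60
unimproved = total - improved                              # 40

# Second branch: with and without children under each group.
improved_children = 45
improved_no_children = improved - improved_children        # 15
unimproved_children = 5
unimproved_no_children = unimproved - unimproved_children  # 35

children = improved_children + unimproved_children         # 50

p_improved = improved / total                                         # (a)
p_children = children / total                                         # (b)
p_children_given_improved = improved_children / improved              # (c)
p_no_children_given_unimproved = unimproved_no_children / unimproved  # (d)
p_improved_given_children = improved_children / children              # (e)

print(p_improved, p_children, p_children_given_improved,
      p_no_children_given_unimproved, p_improved_given_children)
# 0.6 0.5 0.75 0.875 0.9
```

Parts (c) and (e) use the same numerator (45 couples) but different denominators, which is why the two conditional probabilities are not interchangeable.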


2. Discuss in detail the sampling distribution and how a sampling
distribution is created in inferential statistics.

 Sampling distribution of the mean


 Creating a sampling distribution

 Mean of all sample means

 Standard error of the mean

 Shape of the sampling distribution

 SAMPLING DISTRIBUTION OF THE MEAN


 The sampling distribution of the mean refers to the probability
distribution of means for all possible random samples of a given size
from some population.

 CREATING A SAMPLING DISTRIBUTION


 Imagine a small population of four observations with values of 2, 3, 4,
and 5, as shown in Figure 3.2.

Figure 3.2 - Graph of a miniature population.

 Itemize all possible random samples, each of size two, that could be
taken from this population.
 There are four possibilities on the first draw from the population and
also four possibilities on the second draw from the population, as
indicated in Table 3.1.
 The two sets of possibilities combine to yield a total of 16 possible
samples.
 Table 3.1 also lists a sample mean (found by adding the two
observations and dividing by 2) and its probability of occurrence
(expressed as 1⁄16, since each of the 16 possible samples is equally
likely).

Table 3.1 - All possible samples of size two from a miniature population

 When cast into a relative frequency or probability distribution, as in
Table 3.2, the 16 sample means constitute the sampling distribution of
the mean, previously defined as the probability distribution of means for
all possible random samples of a given size from some population.
 Not all values of the sample mean occur with equal probabilities in Table
3.2 since some values occur more than once among the 16 possible
samples.
 For instance, a sample mean value of 3.5 appears among 4 of 16
possibilities and has a probability of 4⁄16.
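The construction described above can be checked by enumeration. The sketch below lists all 16 samples of size two from the miniature population and tallies their means; it assumes sampling with replacement, which is what four possibilities on each of the two draws implies.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 3, 4, 5]

# All ordered samples of size two (4 x 4 = 16 possibilities).
samples = list(product(population, repeat=2))

# Mean of each sample, kept exact with Fraction.
sample_means = [Fraction(first + second, 2) for first, second in samples]

# Tally how often each sample mean occurs among the 16 samples.
counts = Counter(sample_means)

print(len(samples))  # 16
for sample_mean, count in sorted(counts.items()):
    print(f"{float(sample_mean):.1f}  {count}/{len(samples)}")
```

The tally shows that a mean of 3.5 occurs for 4 of the 16 samples, while the extreme means of 2.0 and 5.0 each occur only once.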


Table 3.2 – Sampling Distribution of the Mean (samples of size two from a
miniature population)

 Probability of a Particular Sample Mean


 The distribution in Table 3.2 can be consulted to determine the
probability of obtaining a particular sample mean or set of sample
means.
 The probability of a randomly selected sample mean of either 5.0 or
2.0 equals 1⁄16 + 1⁄16 = 2⁄16 = .1250.
 This type of probability statement, based on a sampling distribution,
assumes an essential role in inferential statistics.
 Refer to Figure 3.3.


FIGURE 3.3
Emergence of the sampling distribution of the mean from all possible
samples.


Example 3.8
Without peeking, list the special symbols for the
(a) mean of the population
(b) mean of the sampling distribution of the mean
(c) mean of the sample
(d) standard error of the mean
(e) standard deviation of the sample
(f) standard deviation of the population.
Solution
(a) μ (b) $\mu_{\bar{X}}$ (c) $\bar{X}$ (d) $\sigma_{\bar{X}}$ (e) s (f) σ

Example 3.9
Imagine a very simple population consisting of only five observations:
2, 4, 6, 8, 10.
(a) List all possible samples of size two.

(b) Construct a relative frequency table showing the sampling
distribution of the mean.
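One way to work Example 3.9 is to enumerate the samples programmatically. The sketch below assumes sampling with replacement, as in the earlier miniature-population illustration, so there are 5 × 5 = 25 possible samples; the function name `sampling_distribution_of_mean` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sampling_distribution_of_mean(population, sample_size):
    """Return (all samples, {sample mean: probability}) by enumeration.

    Assumes sampling with replacement, so every ordered sample is
    equally likely.
    """
    samples = list(product(population, repeat=sample_size))
    counts = Counter(Fraction(sum(s), sample_size) for s in samples)
    total = len(samples)
    distribution = {
        mean: Fraction(count, total) for mean, count in sorted(counts.items())
    }
    return samples, distribution


# (a) all possible samples of size two
samples, distribution = sampling_distribution_of_mean([2, 4, 6, 8, 10], 2)
print(len(samples))  # 25

# (b) relative frequency table: sample mean, probability
for sample_mean, probability in distribution.items():
    print(f"{float(sample_mean):.0f}  {float(probability):.2f}")
```

The resulting table is symmetric about 6, the population mean, with probabilities rising from 1/25 at a mean of 2 to 5/25 at a mean of 6 and falling back to 1/25 at a mean of 10.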


 MEAN OF ALL SAMPLE MEANS


 The mean of all sample means always equals the population mean.

$\mu_{\bar{X}} = \mu$

where $\mu_{\bar{X}}$ represents the mean of the sampling distribution and μ
represents the mean of the population.

Example 3.10
Indicate whether the following statements are True or False.

The mean of all sample means, $\mu_{\bar{X}}$, ...
(a) always equals the value of a particular sample mean.
(b) equals 100 if, in fact, the population mean equals 100.
(c) usually equals the value of a particular sample mean.
(d) is interchangeable with the population mean.
Solution
a. False. It always equals the value of the population mean.
b. True
c. False. Because of chance, most sample means tend to be either
larger or smaller than the mean of all sample means.
d. True

 STANDARD ERROR OF THE MEAN


 The distribution of sample means also has a standard deviation,
referred to as the standard error of the mean.
 The standard error of the mean serves as a special type of
standard deviation that measures variability in the sampling
distribution.
 A rough measure of the average amount by which sample means
deviate from the mean of the sampling distribution or from the
population mean.
 The standard error of the mean equals the standard deviation of
the population divided by the square root of the sample size:
$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
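The standard error relationship can be verified on the miniature population of 2, 3, 4, and 5 by comparing σ/√n with the standard deviation computed directly from all 16 sample means. This is a minimal sketch, assuming sampling with replacement; the variable names are illustrative.

```python
from itertools import product
from math import isclose, sqrt
from statistics import fmean, pstdev

population = [2, 3, 4, 5]
n = 2

# Means of all 16 possible samples of size two (with replacement).
sample_means = [fmean(sample) for sample in product(population, repeat=n)]

sigma = pstdev(population)                     # population standard deviation
standard_error_formula = sigma / sqrt(n)       # sigma divided by sqrt(n)
standard_error_direct = pstdev(sample_means)   # SD of the sampling distribution

print(fmean(sample_means), fmean(population))  # 3.5 3.5
print(isclose(standard_error_formula, standard_error_direct))  # True

# Arithmetic check for sigma = 40 and n = 64.
print(40 / sqrt(64))  # 5.0
```

Both routes give a standard error of about 0.79, and the mean of all sample means matches the population mean of 3.5.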


Example 3.10
Indicate whether the following statements are True or False. The
standard error of the mean, $\sigma_{\bar{X}}$, ...
(a) roughly measures the average amount by which sample
means deviate from the population mean.
(b) measures variability in a particular sample.
(c) increases in value with larger sample sizes.
(d) equals 5, given that σ = 40 and n = 64.
Solution
(a) True
(b) False. It measures variability among sample means.
(c) False. It decreases in value with larger sample sizes.
(d) True

 SHAPE OF THE SAMPLING DISTRIBUTION


Central Limit Theorem
 The central limit theorem states that, regardless of the shape of the
population, the shape of the sampling distribution of the mean
approximates a normal curve if the sample size is sufficiently large.
 Example - For the two non-normal populations in the top panel of Figure
3.4, the shapes of the sampling distributions in the middle panel show
essentially the same preliminary drift toward normality when the sample
size equals only 2, while the shapes of the sampling distributions in the
bottom panel closely approximate normality when the sample size equals
25.
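The drift toward normality can also be seen in a small simulation. The sketch below draws sample means from a strongly skewed population and reports their skewness, which is 0 for a normal curve. The exponential population, the seed, the number of trials, and the helper names are all assumptions made for illustration; they are not the populations shown in Figure 3.4.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable


def skewness(values):
    """Average cubed z score; 0 for a perfectly symmetric distribution."""
    mean = fmean(values)
    sd = pstdev(values)
    return fmean(((value - mean) / sd) ** 3 for value in values)


def simulated_sample_means(sample_size, trials=5000):
    """Means of `trials` random samples from a skewed (exponential) population."""
    return [
        fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(trials)
    ]


for sample_size in (1, 2, 25):
    means = simulated_sample_means(sample_size)
    print(sample_size, round(skewness(means), 2))
```

With a sample size of 1 the sample means simply reproduce the skewed population, while at a sample size of 25 the skewness should be much closer to 0. The exact figures depend on the random draws.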


Figure 3.4 – Effect of Central limit theorem

Example 3.11
Indicate whether the following statements are True or False. The
central limit theorem
a. states that, with sufficiently large sample sizes, the shape of the
population is normal.
b. states that, regardless of sample size, the shape of the sampling
distribution of the mean is normal.
c. ensures that the shape of the sampling distribution of the mean
equals the shape of the population.
d. applies to the shape of the sampling distribution—not to the shape
of the population and not to the shape of the sample.

Solution
a. False. The shape of the population remains the same regardless of
sample size.
b. False. It requires that the sample size be sufficiently large—usually
between 25 and 100.
c. False. It ensures that the shape of the sampling distribution
approximates a normal curve, regardless of the shape of the population
(which remains intact).
d. True

3. Explain in detail about Hypothesis Testing and its types.


Hypothesis Testing
 Hypothesis testing is a statistical method used to determine if there is
enough evidence in the sample data to draw conclusions about a
population.
 It is used to assess the relationship between two statistical variables.
 It involves formulating two competing hypotheses, the null hypothesis
(H0) and the alternative hypothesis (H1), and then collecting data to
assess the evidence.
 Hypothesis testing evaluates two mutually exclusive population
statements to determine which statement is most supported by
sample data.

Defining Hypotheses
 Null hypothesis (H0):
In statistics, the null hypothesis is a general statement or default
position that there is no relationship between two measured cases or
no relationship among groups. In other words, it is a basic
assumption made on the basis of knowledge of the problem.
Example:
A company’s mean production is 50 units per day.
H0: μ = 50.
 Alternative hypothesis (H1):
The alternative hypothesis is the hypothesis used in hypothesis
testing that is contrary to the null hypothesis.
Example:
A company’s mean production is not equal to 50 units per day, i.e.,
H1: μ ≠ 50.


Key Terms of Hypothesis Testing


 Level of significance:
o It is the degree of rarity required of an observed outcome in order to
reject the null hypothesis. Because a decision based on a sample can
never be 100% certain, a small level of significance is selected in
advance, usually 5%.
o It is normally denoted by α and is generally set at 0.05 (5%), which
means that, when the null hypothesis is true, it will be wrongly
rejected in about 5% of all samples.
 P-value:
o The p-value, or calculated probability, is the probability of obtaining
results at least as extreme as those observed, given that the null
hypothesis (H0) is true.
o If the p-value is less than the chosen significance level, reject the
null hypothesis; that is, the sample supports the alternative
hypothesis.
 Test Statistic:
o The test statistic is a numerical value calculated from sample data
during a hypothesis test, used to determine whether to reject the
null hypothesis.
o It is compared to a critical value, or converted to a p-value, to make
decisions about the statistical significance of the observed results.
 Critical value:
o The critical value in statistics is a threshold or cutoff point used to
determine whether to reject the null hypothesis in a hypothesis test.
 Degrees of freedom:
o Degrees of freedom are associated with the variability or freedom one
has in estimating a parameter.
o The degrees of freedom are related to the sample size and determine
the shape of the sampling distribution of the test statistic.

Testing Null Hypothesis


 The null hypothesis is tested by determining whether the one observed
sample mean qualifies as a common outcome or a rare outcome in the
hypothesized sampling distribution of Figure 3.5.

Figure 3.5 - Hypothesized sampling distribution of the mean centred about a
hypothesized population mean of 500.
 Common Outcomes
o An observed sample mean qualifies as a common outcome if the
difference between its value and that of the hypothesized population
mean is small enough to be viewed as a probable outcome under the null
hypothesis.
o Because there is no compelling reason for rejecting the null hypothesis,
it is retained.
 Rare Outcomes
o An observed sample mean qualifies as a rare outcome if the difference
between its value and the hypothesized population mean is too large to be
reasonably viewed as a probable outcome under the null hypothesis.
Boundaries for Common and Rare Outcomes

Figure 3.6 - One possible set of common and rare outcomes (values of X̄).

Figure 3.6 shows one possible set of boundaries for common and rare
outcomes, expressed in values of the sample mean, X̄.
If the one observed sample mean is located between 478 and 522, it will
qualify as a common outcome, and the null hypothesis will be retained.
If, however, the one observed sample mean is greater than 522 or less than
478, it will qualify as a rare outcome, and the null hypothesis will be
rejected.
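
The boundaries of Figure 3.6 can be written as a small check. The sketch below is only an illustration: the function name classify_sample_mean is an assumption, the default boundaries 478 and 522 are the values quoted above, and treating a mean that falls exactly on a boundary as rare is also an assumption.

```python
def classify_sample_mean(sample_mean, lower=478, upper=522):
    """Label an observed sample mean as a common or a rare outcome.

    The default boundaries are the ones quoted for Figure 3.6. A mean that
    falls exactly on a boundary is treated as rare (an assumption).
    """
    if lower < sample_mean < upper:
        return "common outcome: retain H0"
    return "rare outcome: reject H0"


# Illustrative sample means, not data from the text.
for mean in (505, 530, 470):
    print(mean, "->", classify_sample_mean(mean))
```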


4. Discuss in detail about z test for a population mean and z test procedure.
Converting a Raw Score to z
 To convert a raw score into a standard score, express the raw score as a
distance from its mean (by subtracting the mean from the raw score), and
then split this distance into standard deviation units (by dividing by the
standard deviation):

$$ z = \frac{X - \mu}{\sigma} $$

where X is the raw score, μ is the mean, and σ is the standard deviation.

Converting a Sample Mean to z

$$ z = \frac{\bar{X} - \mu_{hyp}}{\sigma_{\bar{X}}} $$

where

X̄ - observed sample mean;

μhyp - the hypothesized population mean

σX̄ - the standard error of the mean (σ/√n)

Z TEST FOR A POPULATION MEAN


 A hypothesis test that evaluates how far the observed sample mean
deviates, in standard error units, from the hypothesized population
mean.
 This z test is accurate only when
(1) the population is normally distributed or the sample size is large
enough to satisfy the requirements of the central limit theorem
(2) the population standard deviation is known.
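
As a rough illustration of this definition, the sketch below computes the observed z for a sample mean. The function names and the input values are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: population standard deviation / sqrt(n)."""
    return sigma / math.sqrt(n)


def z_for_sample_mean(sample_mean, mu_hyp, sigma, n):
    """Distance of the sample mean from mu_hyp, in standard error units."""
    return (sample_mean - mu_hyp) / standard_error(sigma, n)


# Hypothetical numbers: standard error = 110 / 10 = 11, so z = 33 / 11 = 3.
z = z_for_sample_mean(sample_mean=533, mu_hyp=500, sigma=110, n=100)
print(round(z, 2))
```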

Example 3.12
Calculate the value of the z test for each of the following situations:


Z - TEST STEP BY STEP PROCEDURE

Step 1 - State the research problem.


 State the problem to be resolved by the investigation.

Step 2 - Identify the statistical hypotheses.


 The statistical hypotheses consist of a null hypothesis (H0)
and an alternative (or research) hypothesis (H1).
Null Hypothesis (H0)
 A statistical hypothesis that usually asserts that nothing
special is happening with respect to some characteristic of
the underlying population.

H0: μ = μhyp

where μ is the population mean and μhyp is its hypothesized
value (500 in Figure 3.5).

Alternative Hypothesis (H1)
 The opposite of the null hypothesis; it asserts that H0 is
false. For a two-tailed test, H1: μ ≠ μhyp.

 Depending on the outcome of the hypothesis test, H0 will
either be retained or rejected.

Step 3 - Specify a decision rule.


 This rule indicates precisely when H0 should be rejected.

Step 4 - Calculate the value of the observed z.

 Express the one observed sample mean as an observed z, using
z = (X̄ − μhyp) / σX̄, where σX̄ is the standard error of the mean.


Critical z Score
 A z score that separates common from rare outcomes and hence
dictates whether H0 should be retained or rejected.

Level of Significance (α)


 The degree of rarity required of an observed outcome in order to
reject the null hypothesis (H0).

Step 5 - Make a decision.


 Either retain or reject H0 at the specified level of
significance, justifying this decision by noting the
relationship between observed and critical z scores.
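Retaining H0 is a Weak Decision
 H0 is retained whenever the observed z qualifies as a
common outcome on the assumption that H0 is true.
 The decision is weak because a common outcome is consistent
not only with H0 but also with other population means close
to the hypothesized value, so H0 is not proven true.
Rejecting H0 is a Strong Decision
 H0 is rejected whenever the observed z qualifies as a rare
outcome on the assumption that H0 is true.
 The decision is strong because such an outcome would be
unlikely if H0 were true.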

Step 6 - Interpret the decision.


 Using words, interpret the decision in terms of the original
research problem.
 Rejection of the null hypothesis supports the research
hypothesis, while retention of the null hypothesis fails to
support the research hypothesis.
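
Steps 3 to 5 of the procedure can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: it assumes a two-tailed test, the function name two_tailed_z_test is an assumption, and the numbers in the usage lines are hypothetical. The critical z comes from the standard library's statistics.NormalDist.

```python
import math
from statistics import NormalDist


def two_tailed_z_test(sample_mean, mu_hyp, sigma, n, alpha=0.05):
    """Run steps 3 to 5 of the z test for a two-tailed (nondirectional) H1."""
    # Step 3: decision rule. The critical z cuts off alpha / 2 in each tail.
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 4: observed z, using the standard error of the mean.
    observed_z = (sample_mean - mu_hyp) / (sigma / math.sqrt(n))
    # Step 5: reject H0 only if the observed z is at least as extreme
    # as the critical z.
    decision = "reject H0" if abs(observed_z) >= critical_z else "retain H0"
    return observed_z, critical_z, decision


# Hypothetical inputs, used only to show the call.
observed_z, critical_z, decision = two_tailed_z_test(533, 500, 110, 100)
print(round(observed_z, 2), round(critical_z, 2), decision)
```

Step 6, the interpretation in words, still has to be written by the investigator in terms of the research problem.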


Example 3.13
Indicate what’s wrong with each of the following statistical
hypotheses:

(a) Different numbers appear in H0 and H1.


(b) Sample means (rather than population means) appear in H0 and H1.

Example 3.14
First using words, then symbols, identify the null hypothesis for
each of the following situations.
a. A school administrator wishes to determine whether sixth-grade
boys in her school district differ, on average, from the national
norms of 10.2 pushups for sixth-grade boys.
b. A consumer group investigates whether, on average, the true
weights of packages of ground beef sold by a large supermarket
chain differ from the specified 16 ounces.
c. A marriage counselor wishes to determine whether, during a
standard conflict-resolution session, his clients differ, on average,
from the 11 verbal interruptions reported for “well-adjusted
couples.”

(a) Sixth-grade boys in her school district average 10.2 pushups.
H0: μ = 10.2
(b) On average, weights of packages of ground beef sold by a large
supermarket chain equal 16 ounces.
H0: μ = 16
(c) The marriage counselor’s clients average 11 interruptions per
session.
H0: μ = 11

Example 3.15
For each of the following situations, indicate whether H0 should be
retained or rejected and justify your answer by specifying the precise
relationship between observed and critical z scores. Should H0 be
retained or rejected, given a hypothesis test with critical z scores of
±1.96 and
(a) z = 1.74 (b) z = 0.13 (c) z = −2.51

a. Retain H0 at the .05 level of significance because z = 1.74 is less positive
than 1.96.
b. Retain H0 at the .05 level of significance because z = 0.13 is less positive
than 1.96.
c. Reject H0 at the .05 level of significance because z = −2.51 is more
negative than –1.96.
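
The three decisions above follow one rule, which can be sketched as below. The function name decide_two_tailed is an assumption; the observed z values are the ones quoted in the answers.

```python
def decide_two_tailed(observed_z, critical_z=1.96):
    """Reject H0 if the observed z equals or exceeds either critical z."""
    if observed_z >= critical_z or observed_z <= -critical_z:
        return "reject H0"
    return "retain H0"


# Observed z scores from answers a, b and c.
for z in (1.74, 0.13, -2.51):
    print(z, "->", decide_two_tailed(z))
```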


Example 3.16
According to the American Psychological Association, members
with a doctorate and a full-time teaching appointment earn, on
average, $82,500 per year, with a standard deviation of $6,000. An
investigator wishes to determine whether $82,500 is also the mean
salary for all female members with a doctorate and a full-time
teaching appointment. Salaries are obtained for a random sample of
100 women from this population, and the mean salary equals
$80,100.
(a) Someone claims that the observed difference between $80,100
and $82,500 is large enough by itself to support the conclusion
that female members earn less than male members. Explain why
it is important to conduct a hypothesis test.
(b) The investigator wishes to conduct a hypothesis test for what
population?
(c) What is the null hypothesis, H0?
(d) What is the alternative hypothesis, H1?
(e) Specify the decision rule, using the .05 level of significance.
(f) Calculate the value of z. (Remember to convert the standard
deviation to a standard error.)
(g) What is your decision about H0?
(h) Using words, interpret this decision in terms of the original
problem.
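
Parts (e) to (g) can be checked numerically. The sketch below assumes a two-tailed test (H1: μ ≠ 82,500) with critical z scores of ±1.96, because the investigator asks only whether $82,500 is also the mean for female members; the variable names are illustrative.

```python
import math

mu_hyp = 82_500       # hypothesized population mean (H0: mu = 82,500)
sigma = 6_000         # population standard deviation
n = 100               # sample size
sample_mean = 80_100  # observed sample mean

standard_error = sigma / math.sqrt(n)                 # 6,000 / 10 = 600
observed_z = (sample_mean - mu_hyp) / standard_error  # -2,400 / 600 = -4.0
critical_z = 1.96     # two-tailed test at the .05 level (assumption)

decision = "reject H0" if abs(observed_z) >= critical_z else "retain H0"
print(standard_error, observed_z, decision)
```

Under these assumptions the observed z of −4.00 is more negative than −1.96, so H0 is rejected at the .05 level.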


5. Discuss and differentiate between one-tailed and two-tailed tests in
hypothesis testing.

 One-tailed and two-tailed tests differ in where the rejection region is
placed in the hypothesized sampling distribution.
 Use a one-tailed test when the research hypothesis predicts a difference in
a single direction (lower tail or upper tail).
 Use a two-tailed test when a difference in either direction is of
interest.

One-Tailed or Directional Test


 A one-tailed test is based on a uni-directional hypothesis where the area of
rejection is on only one side of the sampling distribution.
 It determines whether a population parameter is larger than (upper tail
critical) or smaller than (lower tail critical) its hypothesized value.
Refer to Figures 3.7 and 3.8.
 It uses one single critical value to test the data.

Figure 3.7 – One tailed test

Figure 3.8 a. One-Tailed or Directional Test (Lower Tail Critical)


Figure 3.8 b. One-Tailed or Directional Test (Upper Tail Critical)


 Figure 3.8 a illustrates a rejection region that is associated with only the
lower tail of the hypothesized sampling distribution.
 The corresponding decision rule, with its critical z of –1.65 (for α = .05),
is referred to as a one-tailed or directional test with the lower tail critical.
 Figure 3.8 b illustrates a one-tailed or directional test with the upper tail
critical. This one-tailed test is the mirror image of the previous test.
 The corresponding decision rule, with its critical z of 1.65 (for α = .05),
is referred to as a one-tailed or directional test with the upper tail critical.

Two-Tailed or Non-directional Test


 Rejection regions are located in both tails of the sampling distribution.
 Use a two-tailed test to check whether the population parameter differs
from the hypothesized value in either direction, greater or less.
 The level of significance (α) is split equally between the two tails.

Figure 3.9 – Two tailed test

Figure 3.10 – Two-Tailed or Nondirectional Test

 Figure 3.10 shows rejection regions that are associated with both tails of
the hypothesized sampling distribution.
 The corresponding decision rule, with its pair of critical z scores of ±1.96,
is referred to as a two-tailed or nondirectional test.
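
The critical z scores quoted for these tests can be recovered from the standard normal distribution. The sketch below uses Python's statistics.NormalDist; the function name critical_z and its tail labels are assumptions. Note that the exact one-tailed value at α = .05 is about 1.645, which tables commonly list as 1.65.

```python
from statistics import NormalDist


def critical_z(alpha, tail):
    """Critical z for a lower-tailed, upper-tailed, or two-tailed test.

    For tail="two" the returned value is used as a +/- pair.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "lower":
        return std_normal.inv_cdf(alpha)
    if tail == "upper":
        return std_normal.inv_cdf(1 - alpha)
    if tail == "two":
        return std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


for alpha in (0.05, 0.01):
    for tail in ("lower", "upper", "two"):
        print(alpha, tail, round(critical_z(alpha, tail), 3))
```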


Difference Between One and Two-Tailed Test:

One-Tailed Test | Two-Tailed Test
A test of a statistical hypothesis where the alternative hypothesis is one-tailed, either right-tailed or left-tailed. | A test of a statistical hypothesis where the alternative hypothesis is two-tailed.
For one-tailed, use either the > or the < sign for the alternative hypothesis. | For two-tailed, use the ≠ sign for the alternative hypothesis.
When the alternative hypothesis specifies a direction, use a one-tailed test. | If no direction is given, use a two-tailed test.
Critical region lies entirely on either the right side or the left side of the sampling distribution. | Critical region is given by the portion of the area lying in both tails of the probability curve of the test statistic.
The entire level of significance (α), e.g. 5%, lies in either the left tail or the right tail. | The level of significance (α) is split in half between the two tails.
Rejection region is on either the left side or the right side of the sampling distribution. | Rejection region is on both sides, i.e. left and right, of the sampling distribution.
It checks the relation between the variables in a single direction. | It checks the relation between the variables in either direction.
It is used to check whether one mean is greater than (or less than) another. | It is used to check whether two means differ from one another.


Example 3.17
For each of the following situations, indicate whether H0 should be
retained or rejected.
Given a one-tailed test, lower tail critical with α = .01, and
(a) z = – 2.34 (b) z = – 5.13 (c) z = 4.04
Given a one-tailed test, upper tail critical with α = .05, and
(d) z = 2.00 (e) z = – 1.80 (f) z = 1.61
a. Reject H0 at the .01 level of significance because z = –2.34 is more negative
than –2.33.
b. Reject H0 at the .01 level of significance because z = –5.13 is more negative
than –2.33.
c. Retain H0 at the .01 level of significance because z = 4.04 is less negative
than –2.33. (The value of the observed z is in the direction of no concern.)
d. Reject H0 at the .05 level of significance because z = 2.00 is more positive
than 1.65.
e. Retain H0 at the .05 level of significance because z = –1.80 is less positive
than 1.65. (The value of the observed z is in the direction of no concern.)
f. Retain H0 at the .05 level of significance because z = 1.61 is less positive
than 1.65.
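
The six decisions above can be reproduced with a short one-tailed rule. In this sketch the function name decide_one_tailed is an assumption, and the critical values –2.33 and 1.65 are the table values used in the answers.

```python
def decide_one_tailed(observed_z, critical_z, tail):
    """Apply a one-tailed decision rule with the given tail critical."""
    if tail == "lower":
        reject = observed_z <= critical_z
    elif tail == "upper":
        reject = observed_z >= critical_z
    else:
        raise ValueError("tail must be 'lower' or 'upper'")
    return "reject H0" if reject else "retain H0"


cases = [
    ("a", -2.34, -2.33, "lower"),
    ("b", -5.13, -2.33, "lower"),
    ("c", 4.04, -2.33, "lower"),
    ("d", 2.00, 1.65, "upper"),
    ("e", -1.80, 1.65, "upper"),
    ("f", 1.61, 1.65, "upper"),
]
for label, z, critical, tail in cases:
    print(label, decide_one_tailed(z, critical, tail))
```

A large z in the wrong tail, as in cases c and e, never leads to rejection in a one-tailed test.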

Example 3.18
Specify the decision rule for each of the following situations (referring
to the standard normal table to find critical z values):
(a) a two-tailed test with α = .05
(b) a one-tailed test, upper tail critical, with α = .01
(c) a one-tailed test, lower tail critical, with α = .05
(d) a two-tailed test with α = .01
a. Reject H0 at the .05 level of significance if z equals or is more positive than
1.96 or if z equals or is more negative than –1.96.
b. Reject H0 at the .01 level of significance if z equals or is more positive than
2.33.
c. Reject H0 at the .05 level of significance if z equals or is more negative than
–1.65.
d. Reject H0 at the .01 level of significance if z equals or is more positive than
2.58 or if z equals or is more negative than –2.58.


6. Discuss in detail about Estimation.


 POINT ESTIMATE
 A single value that represents some unknown population
characteristic, such as the population mean.
 The best single point estimate for the unknown population mean is
simply the observed value of the sample mean.

Example 3.19
A random sample of 200 graduates of U.S. colleges reveals a mean
annual income of $62,600. What is the best estimate of the
unknown mean annual income for all graduates of U.S. colleges?
$62,600

 CONFIDENCE INTERVAL (CI) FOR μ


 A confidence interval for μ uses a range of values that, with a known
degree of certainty, includes an unknown population characteristic,
such as a population mean.

Confidence Interval for μ Based on z

$$ \bar{X} \pm (z_{conf})(\sigma_{\bar{X}}) $$

where

X̄ represents the sample mean;

zconf represents a number from the standard normal table that satisfies
the confidence specifications for the confidence interval; and

σX̄ represents the standard error of the mean.

Example 3.20
Reading achievement scores are obtained for a group of fourth
graders. A score of 4.0 indicates a level of achievement
appropriate for fourth grade, a score below 4.0 indicates
underachievement, and a score above 4.0 indicates
overachievement. Assume that the population standard
deviation equals 0.4. A random sample of 64 fourth graders
reveals a mean achievement score of 3.82.
a. Construct a 95 percent confidence interval for the unknown
population mean. (Remember to convert the standard deviation
to a standard error.)


b. Interpret this confidence interval; that is, do you find any
consistent evidence either of overachievement or of
underachievement?

(a) 3.82 ± (1.96)(0.4/√64) = 3.82 ± 0.098, that is, from 3.72 to 3.92.
(b) We can claim, with 95 percent confidence, that the interval between
3.72 and 3.92 includes the true population mean reading score
for the fourth graders. All of these values suggest that, on
average, the fourth graders are underachieving.
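
The interval quoted in part (b) can be checked with a short calculation. The function name z_confidence_interval is an assumption; the inputs are the values given in Example 3.20, and 1.96 is the table value of zconf for 95 percent confidence.

```python
import math


def z_confidence_interval(sample_mean, sigma, n, z_conf=1.96):
    """Confidence interval for mu: sample mean +/- z_conf * standard error."""
    margin = z_conf * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


lower, upper = z_confidence_interval(sample_mean=3.82, sigma=0.4, n=64)
print(round(lower, 2), round(upper, 2))  # 3.72 3.92
```

Because the whole interval lies below 4.0, every plausible value of the population mean indicates underachievement.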

Example 3.21
Before taking the GRE, a random sample of college seniors received
special training on how to take the test. After analysing their scores
on the GRE, the investigator reported a dramatic gain, relative to
the national average of 500, as indicated by a 95 percent confidence
interval of 507 to 527. Are the following interpretations true or
false?
a. About 95 percent of all subjects scored between 507 and 527.
b. The interval from 507 to 527 refers to possible values of the
population mean for all students who undergo special training.
c. The true population mean definitely is between 507 and 527.
d. This particular interval describes the population mean about 95
percent of the time.
e. In practice, we never really know whether the interval from 507
to 527 is true or false.
f. We can be reasonably confident that the population mean is
between 507 and 527.
a. False. We can be 95 percent confident that the mean for all subjects
will be between 507 and 527.
b. True
c. False. We can be reasonably confident—but not absolutely confident—
that the true population mean lies between 507 and 527.
d. False. This particular interval either describes the one true population
mean or fails to describe the one true population mean.
e. True
f. True

 LEVEL OF CONFIDENCE
 The level of confidence indicates the percent of time that a series of
confidence intervals includes the unknown population characteristic,
such as the population mean.
 Any level of confidence may be assigned to a confidence interval
merely by substituting an appropriate value for zconf in the confidence
interval formula.


Choosing a Level of Confidence


 Although many different levels of confidence have been used, 95
percent and 99 percent are the most prevalent.
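
The value of zconf for a chosen level of confidence can be looked up in a table or computed. The sketch below assumes Python's statistics.NormalDist; the function name z_conf is an assumption.

```python
from statistics import NormalDist


def z_conf(level):
    """Two-sided z for a confidence level given as a proportion, e.g. 0.95."""
    tail_area = (1 - level) / 2  # area left in each tail
    return NormalDist().inv_cdf(1 - tail_area)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z_conf = {z_conf(level):.3f}")
```

The 95 and 99 percent levels give approximately 1.96 and 2.58, the values used in the examples of this unit.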

 EFFECT OF SAMPLE SIZE


 The larger the sample size, the smaller the standard error and, hence,
the more precise (narrower) the confidence interval will be.
 Indeed, as the sample size grows larger, the standard error will
approach zero and the confidence interval will shrink to a point
estimate.
 Given this perspective, the sample size for a confidence interval,
unlike that for a hypothesis test, never can be too large.
 Factors in selecting the sample size
i. Experience – Small samples can result in wide confidence intervals
and a greater risk of errors.
ii. Confidence level – For a given interval width, the higher the level of
confidence, the larger the sample size required.
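
The effect of sample size on precision can be illustrated numerically. In this sketch the function name interval_width is an assumption, the population standard deviation of 0.4 is borrowed from Example 3.20, and the sample sizes are hypothetical.

```python
import math


def interval_width(sigma, n, z_conf=1.96):
    """Full width of a z-based confidence interval for the mean."""
    return 2 * z_conf * sigma / math.sqrt(n)


# Quadrupling the sample size halves the width of the interval.
for n in (16, 64, 256, 1024):
    print(n, round(interval_width(sigma=0.4, n=n), 3))
```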

Example 3.22
On the basis of a random sample of 120 adults, a pollster
reports, with 95 percent confidence, that between 58 and 72
percent of all Americans believe in life after death.
a. If this interval is too wide, what, if anything, can be done
with the existing data to obtain a narrower confidence
interval?
b. What can be done to obtain a narrower 95 percent
confidence interval if another similar investigation is being
planned?
a. Switch to an interval having a lesser degree of confidence, such
as 90 percent or 75 percent.
b. Increase the sample size.

 HYPOTHESIS TESTS OR CONFIDENCE INTERVALS?


 Hypothesis tests merely indicate whether or not an effect is present, whereas
confidence intervals indicate the possible size of the effect.
 Confidence intervals tend to be more informative than hypothesis tests.


Example 3.23
In a recent scientific sample of about 900 adult Americans, 70
percent favour stricter gun control of assault weapons, with a
margin of error of ±4 percent for a 95 percent confidence interval.
Therefore, the 95 percent confidence interval equals 66 to 74
percent. Indicate whether the following interpretations are true or
false:
a. The interval from 66 to 74 percent refers to possible values of
the sample percent.
b. The true population percent is between 66 and 74 percent.
c. In the long run, a series of intervals similar to this one would
fail to include the population percent about 5 percent of the
time.
d. We can be reasonably confident that the population percent is
between 66 and 74 percent.

(a) False. The interval from 66 to 74 percent refers to possible values of
the population proportion.
(b) False. We can be reasonably confident, but not absolutely confident,
that the true population proportion is between 66 and 74 percent.
(c) True
(d) True


Example 3.23
For the population at large, the Wechsler Adult Intelligence Scale is
designed to yield a normal distribution of test scores with a mean of
100 and a standard deviation of 15. School district officials wonder
whether, on the average, an IQ score different from 100 describes the
intellectual aptitudes of all students in their district. Wechsler IQ
scores are obtained for a random sample of 25 of their students, and
the mean IQ is found to equal 105. Using the step-by-step procedure,
test the null hypothesis at the .05 level of significance.
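
One way to work through this example is sketched below. It assumes a two-tailed test (H0: μ = 100, H1: μ ≠ 100), since the officials ask only whether the district mean is different from 100, with critical z scores of ±1.96 at the .05 level. The function name z_test_report is an assumption.

```python
import math


def z_test_report(sample_mean, mu_hyp, sigma, n, critical_z=1.96):
    """Return the standard error, observed z, and two-tailed decision."""
    standard_error = sigma / math.sqrt(n)
    observed_z = (sample_mean - mu_hyp) / standard_error
    decision = "reject H0" if abs(observed_z) >= critical_z else "retain H0"
    return standard_error, observed_z, decision


se, z, decision = z_test_report(sample_mean=105, mu_hyp=100, sigma=15, n=25)
print("H0: mu = 100, H1: mu != 100")
print(f"standard error = {se:.2f}, observed z = {z:.2f}")
print(f"decision at the .05 level: {decision}")
```

Under these assumptions the observed z of about 1.67 is less positive than 1.96, so H0 is retained: the sample gives no evidence that the mean IQ of students in the district differs from 100.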

Example 3.24
Consult the power curves in Figure 11.7 to estimate the approximate
detection rates, rounded to the nearest tenth, for the following
situations:
(a) a three-point effect, with a sample size of 29
(b) a six-point effect, with a sample size of 13
(c) a twelve-point effect, with a sample size of 13

(a) .3
(b) .4
(c) .9


Example 3.25
An investigator consults a chart to determine the sample size required
to detect an eight-point effect with a probability of .80. What happens
to this detection rate of .80—will it actually be smaller, the same, or
larger—if, unknown to the investigator, the true effect actually equals
(a) twelve points?
(b) five points?
a. The power for the 12-point effect is larger than .80 because the true
sampling distribution is shifted further into the rejection region for the
false H0.
b. The power for the 5-point effect is smaller than .80 because the true
sampling distribution is shifted further into the retention region for the
false H0.

Example 3.26
In Example 3.16, it was concluded that the mean salary
among the population of female members of the American
Psychological Association is less than that ($82,500) for all
comparable members who have a doctorate and teach full time.
(a) Given a population standard deviation of $6,000 and a sample
mean salary of $80,100 for a random sample of 100 female members,
construct a 99 percent confidence interval for the mean salary for all
female members.
(b) Given this confidence interval, is there any consistent evidence that
the mean salary for all female members falls below $82,500, the mean
salary for all members?

(a) $80,100 ± (2.58)($6,000/√100) = $80,100 ± $1,548, that is, from $78,552
to $81,648.
(b) We can claim, with 99 percent confidence, that the interval between $78,552
and $81,648 includes the true population mean salary for all female members
of the American Psychological Association. All of these values suggest that,
on average, female members' salaries are less than the $82,500 mean for all
members.
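
The 99 percent interval can be verified as follows. The sketch uses the table value zconf = 2.58 so that it matches the figures quoted in the answer; the variable names are illustrative.

```python
import math

sample_mean = 80_100
sigma = 6_000
n = 100
z_conf = 2.58  # table value for 99 percent confidence

margin = z_conf * sigma / math.sqrt(n)  # 2.58 * 600, about 1,548
lower, upper = sample_mean - margin, sample_mean + margin
print(round(lower), round(upper))
print("82,500 inside interval:", lower <= 82_500 <= upper)
```

The hypothesized mean of $82,500 lies above the upper limit, which agrees with the conclusion of the hypothesis test.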

Example 3.27
Imagine that one of the following 95 percent confidence intervals
estimates the effect of vitamin C on IQ scores:


(a) Which one most strongly supports the conclusion that vitamin C
increases IQ scores?
(b) Which one implies the largest sample size?
(c) Which one most strongly supports the conclusion that vitamin C
decreases IQ scores?
(d) Which one would most likely stimulate the investigator to
conduct an additional experiment using larger sample sizes?
(a) 3 (b) 1 (c) 5 (d) 4
