Statistical Inference

Chapter 6 of CIS 370 discusses statistical inference, focusing on sampling distributions, sampling methods, and the importance of sample size in reducing sampling error. It explains the concepts of confidence intervals and hypothesis testing, including the formulation of null and alternative hypotheses, and the trade-offs between Type I and Type II errors. The chapter emphasizes the necessity of normal distribution for sampling distributions and the application of the central limit theorem in statistical analysis.

Uploaded by

manojpruthvi650
Copyright
© © All Rights Reserved

CIS 370

Business Analytics
Chapter 6: Statistical Inference
Instructor: Hamed Qahri-Saremi, PhD
Sampling Distributions
• In many applications, we are interested in the characteristics of a population.
– It is difficult, if not impossible, to analyze the entire population.
• We do not have information on the population parameters.

– Instead, we make inferences about the characteristics of the population based on a random sample.
• These inferences are based on sample statistics.
– Inference: drawing conclusions on the basis of evidence and reasoning.

– There is only one population, but many possible samples.


Sampling
• Different types of sampling schemes have different properties
– Random sampling
– Systematic sampling
– Stratified sampling
– Cluster sampling

• There is typically a tradeoff between cost and accuracy across the different sampling techniques.
– Bias − the tendency of a sample statistic to systematically over- or underestimate a population parameter.
– Selection bias − a systematic underrepresentation of certain groups from consideration for the sample.
• Regression to the mean.
– Nonresponse bias − a systematic difference in preferences between respondents and non-respondents to a survey or a poll.
Sampling Methods
• Simple random sample is a sample of n observations that has the same probability of
being selected from the population as any other sample of n observations.
– Most statistical methods presume simple random samples.
– However, in some situations other sampling methods have an advantage over simple random
samples.

Sampling Methods
• Alternatives to Simple Random Sampling
[Figure: alternative sampling methods. Source: https://www.scribbr.com/methodology/sampling-methods/]
Sampling Error
• A point estimate is a single numeric value, or a “best guess” of a population parameter, based on the
data in the sample, such as a mean.
• The sampling error is the difference between the point estimate and the true value of the population
parameter. Sampling error is unavoidable when collecting a random sample.

[Figure: a population of 2500 employees and a sample of 30 employees.]


Effect of Sample Size
• Sampling error decreases as sample size increases.

– There are many tables and references available online to provide a suggested sample size based
on the size of the population and the acceptable level of sampling error.

– However, businesses need to consider the cost and time required to perform the sampling.
Sampling Distributions
• The sampling distribution of a point estimate is the distribution of point estimates from
all possible samples from the population.
– Like any distribution, a sampling distribution has a mean and a standard deviation.
Each column (X1 to X4, values of the random variable X) is one simple random sample drawn from the population. The last row holds the means for several samples drawn from the population: a distribution of means from each random draw from the population, that is, a sampling distribution.

          X1     X2     X3     X4
           6     10      8      4
           5     10      4      3
           1      8      4      3
           4      1      6      2
           6      6      8      4
           7      7      8      6
           1      5     10      5
           5      5      9      1
           4      6      4      2
           7      4      9      5
           8      5      8      6
           9      2      7      7
           9      1      2      3
           6     10      2      6
Means   5.57   5.71   6.36   4.07

Mean of the means (mean of the sampling distribution): 5.43
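The arithmetic behind the table can be checked with a short Python sketch. It assumes, consistent with the reported means, that each column X1 to X4 holds one sample of 14 observations; the variable names are illustrative only.

```python
from statistics import mean

# Four simple random samples (columns X1..X4 of the table), 14 observations each.
samples = {
    "X1": [6, 5, 1, 4, 6, 7, 1, 5, 4, 7, 8, 9, 9, 6],
    "X2": [10, 10, 8, 1, 6, 7, 5, 5, 6, 4, 5, 2, 1, 10],
    "X3": [8, 4, 4, 6, 8, 8, 10, 9, 4, 9, 8, 7, 2, 2],
    "X4": [4, 3, 3, 2, 4, 6, 5, 1, 2, 5, 6, 7, 3, 6],
}

# One sample mean per sample: these values form a (tiny) sampling distribution.
sample_means = {name: mean(values) for name, values in samples.items()}

# The mean of the sample means estimates the mean of the sampling distribution.
mean_of_means = mean(list(sample_means.values()))

for name, m in sample_means.items():
    print(name, round(m, 2))
print("mean of means:", round(mean_of_means, 2))
```

With only four samples this is a very coarse picture of the sampling distribution; the idea is the same when the number of samples is large.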
Sampling Distributions
• The sampling distribution of the sample mean is the probability distribution derived from all the means that come from all possible samples of a given size.
– Consider a sample mean derived from n observations.
– Another sample mean can be derived from a different sample of n observations (from the same population).
– Repeat the process many times; the frequency distribution of the sample means is the sampling distribution.
“Regression to the Mean”
• “Regression to the mean”: a statistical phenomenon that occurs when repeated
measurements over separate samples are made on the same subject or unit of observation.
• In general, when observing repeated measurements in the same subject, relatively high (or
relatively low) observations/samples are likely to be followed by less extreme ones nearer
the subject's true mean (population mean).
– It happens because values are observed with random error.
• Random error is a non-systematic variation in the observed values around a true mean (e.g., random
measurement error, random sampling error, or random fluctuations in a subject).

Reference: Int J Epidemiol, Volume 34, Issue 1, February 2005, Pages 215–220, https://doi.org/10.1093/ije/dyh299.
Sampling Distributions
• The mean ($E(\bar{X})$) of the sampling distribution of sample means is the same as the population mean: $E(\bar{X}) = \mu$.
• The standard error of the sample mean ($se(\bar{X})$) is the standard deviation of the sampling distribution, which is equal to the population standard deviation divided by the square root of the sample size: $se(\bar{X}) = \sigma/\sqrt{n}$.
Sampling Distributions
• For making statistical inferences, it is essential that the sampling distribution of $\bar{X}$ is (at least approximately) normally distributed.
– What if the underlying population is not normally distributed?
• The central limit theorem (CLT) states that the sum or the average of a large number of independent observations (samples) from the same underlying distribution has an approximate normal distribution.
– The approximation steadily improves as the number of observations increases.
• Practitioners often use the normal distribution approximation when $n \ge 30$.
– If the population is normal, then the sampling distribution of $\bar{X}$ is normal regardless of the sample size.
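The behavior described by the central limit theorem can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (a skewed, non-normal choice made here for illustration), and the function name, seed, and number of samples are arbitrary.

```python
import math
import random
import statistics

def simulate_sample_means(n, num_samples, seed=0):
    """Draw num_samples samples of size n from an exponential population
    (mean 1, standard deviation 1) and return the list of sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

n = 30
means = simulate_sample_means(n=n, num_samples=5000)

# The mean of the sample means should be close to the population mean (1.0),
# and their standard deviation close to sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 3))
print("std. dev. of sample means:", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):", round(1.0 / math.sqrt(n), 3))
```

A histogram of `means` would look roughly bell-shaped even though the underlying population is strongly skewed.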
Sampling Distributions of the Sample Proportion
• Sometimes, sampling is done in order to estimate the proportion of a population that has a specific
characteristic
– Example 1: Proportion of students attaining final letter grades C in a class.
– Example 2: Proportion of products manufactured in an assembly line that are defective.

• The sample proportion ($\bar{P}$) can be assumed to follow a normal distribution if the sample size is sufficiently large:
– $np \ge 5$ and $n(1-p) \ge 5$
• where p is the estimated probability of success.
Estimation
• A confidence interval, or interval estimate, provides a range of values that, with a certain level of
confidence, contains the population parameter of interest.
– A confidence interval around the point estimate is calculated from the sample data. The interval is very likely to
contain the true value of the population parameter.
– The confidence level describes the likelihood that the confidence interval will contain the population parameter
(for example, 95%)
– A confidence interval is associated with a margin of error.
• The margin of error around the confidence interval accounts for the standard error of the estimator and the desired
confidence level.

• The confidence interval for the population mean and population proportion is constructed as:
– point estimate ± margin of error.

• To construct a confidence interval for the population mean or the population proportion, it is essential that the sampling distributions of $\bar{X}$ and $\bar{P}$ follow (approximately) a normal distribution.
Estimation: Calculating the Confidence Interval
• Specify a confidence level, usually 90%, 95%, or 99%.
– This determines the level of significance: $\alpha$ = 1 – the confidence level.

• Let $\alpha$ (alpha) denote the allowed probability of error.
– This is the probability that the estimation procedure will generate an interval that does not contain the population mean.
• It is related to the significance level used later in hypothesis testing.

• Use the level of significance ($\alpha$), the degrees of freedom ($n-1$), and the t-distribution to determine the multiple of the standard error on either side of the point estimate.
– That multiple times the standard error is the margin of error.
Estimation
• A 90% confidence interval implies that if numerous samples of some predetermined size (n) are drawn from a given population, then about 90% of the intervals constructed from those samples will contain the population mean.
• A confidence interval for the population mean is computed as below.

$\bar{x} \pm t_{\alpha/2,\,df}\,\dfrac{s}{\sqrt{n}}$, where $s$ = sample standard deviation and $df = n - 1$

Point estimate: $\bar{x}$. Margin of error: $t_{\alpha/2,\,df}\, s/\sqrt{n}$.
– This is only valid when $\bar{X}$ (approximately) follows a normal distribution.
Estimation
• The t table
– Only lists probabilities for a limited number of values (columns).
– Provides probabilities in the upper tail.
– For $\alpha/2 = 0.05$ and $df = 10$: $t_{0.05,10} = 1.812$, that is, $P(T_{10} \ge 1.812) = 0.05$.
Estimation
Example: In a sample of 25 ultra-green cars, we find $\bar{x} = 96.52$ mpg and $s = 10.70$ mpg.
• E4Q1: Construct a 90% confidence interval for the population mean, assuming that mpg follows a normal distribution.
Estimation

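• For a 90% confidence interval, $\alpha = 0.10$, $\alpha/2 = 0.05$, and $df = n - 1 = 24$.

• $t_{0.05,24} = 1.711$ (by the Student’s t Distribution table)

• $\bar{x} \pm t_{\alpha/2,\,df}\, s/\sqrt{n} = 96.52 \pm 1.711\,(10.70/\sqrt{25}) = 96.52 \pm 3.66$, which gives [92.86, 100.18] mpg

• The 90% confidence interval for the average mpg of all ultra-green cars is between 92.86 mpg and 100.18 mpg.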
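The interval can be reproduced with a few lines of Python. This is a minimal sketch: the sample mean 96.52 and sample standard deviation 10.70 are assumed here as the values consistent with the reported interval, the critical value 1.711 is read from a t table for 24 degrees of freedom rather than computed, and the function name is hypothetical.

```python
import math

def mean_confidence_interval(x_bar, s, n, t_crit):
    """Return (lower, upper) for x_bar +/- t_crit * s / sqrt(n).

    t_crit is the t-table value t_{alpha/2, n-1}; it is passed in by hand
    because the standard library has no t distribution.
    """
    margin = t_crit * s / math.sqrt(n)  # margin of error
    return x_bar - margin, x_bar + margin

# Assumed inputs: x_bar = 96.52 mpg, s = 10.70 mpg, n = 25, t_{0.05,24} = 1.711.
low, high = mean_confidence_interval(x_bar=96.52, s=10.70, n=25, t_crit=1.711)
print(round(low, 2), round(high, 2))
```

Passing the critical value in as a parameter keeps the sketch dependency-free; a statistics package could supply it directly.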
Example 1 (confidence interval)
A random sample of 40 customers who ordered a new sandwich were surveyed. Each
customer was asked to rate the sandwich on a scale of 1 to 10 (Satisfaction Ratings.xlsx).
What is the 95% confidence interval for the customer satisfaction?
1. Use JMP: Analyze ► Distribution
2. Select Satisfaction and click Y, Columns
3. On the report page, click the red triangle next to Satisfaction, select Confidence Interval, and select
95% as the confidence level.
Example 1 Solution (confidence interval)
• The confidence interval is 5.739 to 6.761. We can say with 95% confidence that the
population mean is between these two numbers.
Estimation
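• $\bar{P}$ is approximately normally distributed when $np \ge 5$ and $n(1-p) \ge 5$.
– $E(\bar{P}) = p$ and $se(\bar{P}) = \sqrt{p(1-p)/n}$
– Replace $p$ with $\bar{p}$, because the population proportion is unknown.

• A confidence interval for the population proportion is computed as below.

$\bar{p} \pm z_{\alpha/2}\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}}$

• Note that $z_{\alpha/2}$ is the value associated with an upper-tail probability of $\alpha/2$ under the standard normal distribution.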
Estimation – Confidence Interval for a Sample Proportion
• Specify a confidence level, usually 90%, 95%, or 99%.
• This determines the $z_{\alpha/2}$ value for the confidence interval equation (you can also use the Z table).
– 90% use 1.645
– 95% use 1.96
– 99% use 2.58
Estimation
Example: In a sample of 25 ultra-green cars, seven of the cars obtained over 100 miles per gallon (mpg).
• E5Q1: Construct a 90% confidence interval for the population proportion of all ultra-
green cars that obtain over 100 mpg.
Estimation

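• The point estimate is $\bar{p} = 7/25 = 0.28$. The normality condition is satisfied since $n\bar{p} = 25(0.28) = 7 \ge 5$ and $n(1-\bar{p}) = 25(0.72) = 18 \ge 5$.

• For a 90% confidence interval, $\alpha = 0.10$ and $\alpha/2 = 0.05$
– $z_{0.05} = 1.645$ (by the z table)
– $0.28 \pm 1.645\sqrt{0.28(0.72)/25} = 0.28 \pm 0.148$

The 90% confidence interval for the proportion of cars that obtain over 100 mpg is between 13.2% and 42.8%.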
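The same calculation can be sketched in Python with the standard library's `statistics.NormalDist`, which supplies the z value instead of a table lookup. The function name and the choice to raise an error when the normality condition fails are assumptions made for this illustration.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """Return (lower, upper) for p_bar +/- z_{alpha/2} * sqrt(p_bar(1-p_bar)/n)."""
    p_bar = successes / n
    # Normality condition from the slides: n*p_bar >= 5 and n*(1 - p_bar) >= 5.
    if n * p_bar < 5 or n * (1 - p_bar) < 5:
        raise ValueError("sample too small for the normal approximation")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper-tail area alpha/2
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar - margin, p_bar + margin

# Seven of 25 ultra-green cars obtained over 100 mpg; 90% confidence level.
low, high = proportion_confidence_interval(successes=7, n=25, confidence=0.90)
print(round(low, 3), round(high, 3))
```

The computed z value is not rounded to 1.645, so results can differ from table-based answers in the last decimal place.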
Hypothesis Testing
• Hypothesis testing is used to resolve conflicts between two competing hypotheses on a particular population of interest.
– Determines the validity of an assumption using sample data.

– The null hypothesis is denoted $H_0$.
• Presumed default state of nature or status quo.
• Specified by $=$, $\le$, or $\ge$.

– The alternative hypothesis is denoted $H_A$.
• Contradicts default state of nature or status quo.
• Whatever we wish to establish, something new.
• Opposite sign as found in the null hypothesis: $\ne$, $>$, or $<$ respectively.
Hypothesis Testing
• We can make one of two decisions:
– Reject the null hypothesis.
– Not reject the null hypothesis.

• The sample evidence is either inconsistent with $H_0$, in which case we reject $H_0$,
OR
• The evidence is not inconsistent with $H_0$, in which case we cannot reject $H_0$.
– Note that we cannot “accept” $H_0$, because a limited sample cannot establish that $H_0$ is true; it can only fail to contradict it.
– We also cannot prove that the null hypothesis is true.
• Maintain status quo or business as “usual”.
Hypothesis Testing
• A crucial step in a hypothesis test concerns the formulation of the two competing hypotheses because the conclusion of the test depends on how the hypotheses are stated.

• There are One-Tailed and Two-Tailed hypothesis tests about a population mean:
– (One tailed) Upper-tail test: $H_0: \mu \le \mu_0$ and $H_A: \mu > \mu_0$
– (One tailed) Lower-tail test: $H_0: \mu \ge \mu_0$ and $H_A: \mu < \mu_0$
– Two-tailed test: $H_0: \mu = \mu_0$ and $H_A: \mu \ne \mu_0$

• In general, we follow three steps when formulating the competing hypotheses:
1. Identify the relevant population parameter of interest.
2. Determine whether it is a one- or two-tailed test.
3. Include some form of the equality sign in the null hypothesis and use the alternative hypothesis to establish a claim.
Hypothesis Testing
• Because the decision of a hypothesis test is based on limited sample information, we are bound to make errors.
– Correct decisions: reject the null hypothesis when the null hypothesis is false, not reject the null hypothesis when the null hypothesis is true.
– Incorrect decisions:
• Reject the null hypothesis when we should not (that is, when $H_0$ is true): Type I Error, with allowed probability $\alpha$.
– Recall that when constructing confidence intervals, $\alpha$ (alpha) is the level of significance and denotes the allowed probability of error in a confidence interval: the probability that the estimation procedure will generate an interval that does not contain the population mean.
• Not reject the null hypothesis when we should (that is, when $H_0$ is false): Type II Error, with probability $\beta$.
Hypothesis Testing
• There is a trade-off between these errors; by reducing the likelihood of a Type I error, we implicitly increase the likelihood of a Type II error, and vice versa.
– The only way we can lower both $\alpha$ and $\beta$ is by increasing n (sample size).

• For a given n, however, we can reduce $\alpha$ only at the expense of a higher $\beta$ and reduce $\beta$ only at the expense of a higher $\alpha$.
– The optimal choice of $\alpha$ and $\beta$ depends on the relative cost of these two types of errors, and determining these costs is not always easy.
Hypothesis Testing
• A hypothesis test regarding the population mean is based on the assumption that the sample mean is (approximately) normally distributed.
1. Assume that the population mean $\mu = \mu_0$, the hypothesized value of the population mean in the null hypothesis.
• Assume null hypothesis is true.
2. Compute the test statistic: $t_{df} = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with $df = n - 1$.
3. Then find the p-value:
• The likelihood/probability of observing a sample mean that is at least as extreme as the one derived from the sample assuming the null hypothesis is true.
• This depends on the specification in the alternative.
Hypothesis Testing
• Finding the p-value from the test statistic:
– Use the t Table for finding the p-value
– Or use an online calculator that converts a t statistic to a p-value
– Or use JMP to calculate the test statistic and p-value from sample data
• For one-tailed upper-tailed test, use the p-value referred to as Prob > t
• For one-tailed lower-tailed test, use the p-value referred to as Prob < t
• For two-tailed test, use the p-value referred to as Prob > |t|
Hypothesis Testing
• We define $\alpha$ as the allowed probability of making a Type I error.
– We refer to $\alpha$ as the significance level.
• Most hypothesis tests are conducted using a significance level of 1%, 5%, or 10%, using $\alpha$ = 0.01, 0.05, or 0.10, respectively.
• We generally choose a value for $\alpha$ before implementing a hypothesis test.

• The p-value is often referred to as the observed probability of making a Type I error.
– More precisely, it is the probability, assuming the null hypothesis is true, of obtaining a sample result at least as extreme as the one observed.
– The hypothesis test uses the sample data to calculate the p-value.
– Use the p-value to make the decision.
• Reject the null hypothesis ($H_0$) if the p-value < $\alpha$.
– The observed p-value is less than your threshold for Type I error (i.e., $\alpha$).
• Do not reject the null hypothesis ($H_0$) if the p-value ≥ $\alpha$.
Hypothesis Testing
Example 6: The dean at a large university in California wonders if students at her university
study less than the 1961 national average of 24 hours per week. She randomly selects 35
students and asks their average study time per week (in hours). From their responses, she
calculates a sample mean of 16.3714 hours and a sample standard deviation of 7.2155
hours.
– E6Q1: Specify the competing hypotheses to test the dean’s concern.
– E6Q2: Calculate the value of the test statistic.
– E6Q3: Find the p-value.
– E6Q4: At the 5% significance level, what is the conclusion to the hypothesis test?
Hypothesis Testing
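E6Q1: The hypothesized mean is $\mu_0 = 24$. The hypotheses are $H_0: \mu \ge 24$ and $H_A: \mu < 24$.

E6Q2: $t_{34} = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{16.3714 - 24}{7.2155/\sqrt{35}} = -6.255$

E6Q3: The p-value is $P(T_{34} \le -6.255)$, which is less than 0.0005 using the table.

E6Q4: Reject the null hypothesis because the p-value is less than $\alpha = 0.05$. At the 5% significance level, we conclude that the average study time at the university is less than the 1961 average of 24 hours per week.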
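The steps of this test can be sketched in Python using only the standard library. Because the standard library has no t distribution, the p-value below is approximated by numerically integrating the t density with Simpson's rule; this is an illustrative substitute for the t table or JMP, and the helper names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=2000):
    """Approximate P(T_df <= t) with Simpson's rule over [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return 0.5 - central if t < 0 else 0.5 + central

# Example 6 inputs: sample mean, sample standard deviation, n, hypothesized mean, alpha.
x_bar, s, n, mu_0, alpha = 16.3714, 7.2155, 35, 24.0, 0.05

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))  # test statistic with df = n - 1
p_value = t_lower_tail(t_stat, n - 1)         # lower-tail test: H_A is mu < 24
reject = p_value < alpha                      # decision rule: reject H_0 if p-value < alpha

print("t =", round(t_stat, 3))
print("p-value below 0.0005:", p_value < 0.0005)
print("reject H0:", reject)
```

For an upper-tail test the p-value would be `1 - t_lower_tail(t_stat, df)`, and for a two-tailed test twice the smaller tail area.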
Example 2 (Pizza Restaurant)
A manager of a pizza restaurant would like to know whether a new preparation method
provides better tasting pizza. Customers were asked to rate the new style pizza vs. the old
style on a scale of -10 to +10 (Pizza_Ratings.xlsx). A negative rating favors the old style. A
positive rating favors the new style. A rating around zero indicates indifference.
Example 2 (Pizza Restaurant) Solution
• The hypotheses are $H_0: \mu \le 0$ and $H_A: \mu > 0$.
• This is an upper-tail test (alternative hypothesis is one-tailed of the “greater than” variety).
• JMP instructions:
1. Analyze ► Distribution
2. Select Rating and click Y, Columns
3. On the report page, click the red triangle next to Rating, select Test Mean, and enter 0 as the
Hypothesized Mean.
Example 2 (Pizza Restaurant) Solution
• The p-value is 0.0038.

– This is less than an $\alpha$ of 0.01, so at the 1% significance level we conclude that the population mean rating is greater than the hypothesized value of zero.

– Reject $H_0$, which means the new style is more favored by the customers.
Example 3 (Holiday Toys)
Holiday Toys manufactures and distributes products to more than 1000 retail outlets. For
this year’s most important new toy, Holiday’s marketing director is expecting demand to
average 40 units per retail outlet. To check this estimate, Holiday Toys has surveyed a
sample of 25 retailers for an anticipated order quantity (Orders.xlsx).
Example 3 (Holiday Toys) Solution
• The hypotheses are $H_0: \mu = 40$ and $H_A: \mu \ne 40$.
• This is a two-tailed test.
• JMP instructions:
1. Analyze ► Distribution
2. Select Units and click Y, Columns
3. On the report page, click the red triangle next to Units, select Test Mean, and enter 40 as the
Hypothesized Mean
Example 3 (Holiday Toys) Solution
• The p-value is 0.2811. With a significance level of 0.05 we cannot reject the null hypothesis of $\mu = 40$.

• That doesn’t mean the demand will be 40, only that it’s possible the demand could be 40.
Hypothesis Testing for Population Proportion
• A hypothesis test regarding the population proportion is based on the assumption that the sample proportion is (approximately) normally distributed.
• Assume that the population proportion $p = p_0$, the hypothesized value of the population proportion in the null hypothesis.

• The three forms for a hypothesis test about a population proportion are:
– (One tailed) Lower-tail test: $H_0: p \ge p_0$ and $H_A: p < p_0$
– (One tailed) Upper-tail test: $H_0: p \le p_0$ and $H_A: p > p_0$
– Two-tailed test: $H_0: p = p_0$ and $H_A: p \ne p_0$

• Compute the test statistic: $z = \dfrac{\bar{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$

• Find the p-value similarly as with the test about the population mean, but use Z rather than T.
Hypothesis Testing for Population Proportion
• Finding the p-value from the test statistic:
– Use the z Table for finding the p-value
– Or use an online calculator that converts a z statistic to a p-value
– Or use JMP to calculate the test statistic and p-value using sample data
Hypothesis Testing for Population Proportion
Example 7: Driven by growing public support, the legalization of marijuana in America has
been moving at a very rapid rate. Today, 57% of adults say the use of marijuana should be
made legal. A health practitioner in Ohio collects data from 200 adults and finds that 102 of
them favor marijuana legalization.
• E7Q1: The health practitioner believes that the proportion of adults who favor marijuana
legalization in Ohio is not representative of the national proportion. Specify the
competing hypotheses to test her claim.
• E7Q2: Calculate the value of the test statistic and the p-value.
• E7Q3: At the 10% significance level, do the sample data support the health
practitioner’s belief?
Hypothesis Testing for Population Proportion
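E7Q1: The hypotheses are $H_0: p = 0.57$ and $H_A: p \ne 0.57$.

E7Q2: With $\bar{p} = 102/200 = 0.51$, $z = \dfrac{0.51 - 0.57}{\sqrt{0.57(0.43)/200}} = -1.71$. The p-value is $2P(Z \le -1.71) = 2(0.0436) = 0.0872$.

E7Q3: Because the p-value of 0.0872 is less than $\alpha = 0.10$, we reject the null hypothesis. Therefore, at the 10% significance level, the proportion of adults who favor marijuana legalization in Ohio differs from the national proportion of 0.57.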
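A short Python sketch of this two-tailed proportion test, using the standard library's `statistics.NormalDist` in place of the z table. The function name is hypothetical. Because the z statistic is not rounded to -1.71 before the lookup, the computed p-value comes out slightly below the table-based 0.0872.

```python
import math
from statistics import NormalDist

def proportion_z_test(successes, n, p_0):
    """Two-tailed z test of H_0: p = p_0; returns (z statistic, p-value)."""
    p_bar = successes / n
    # The standard error uses the hypothesized proportion p_0, as in the test statistic formula.
    z = (p_bar - p_0) / math.sqrt(p_0 * (1 - p_0) / n)
    p_value = 2 * NormalDist().cdf(-abs(z))  # area in both tails
    return z, p_value

# Example 7: 102 of 200 Ohio adults favor legalization; national proportion 0.57.
z, p_value = proportion_z_test(successes=102, n=200, p_0=0.57)
alpha = 0.10
print("z =", round(z, 3))
print("p-value =", round(p_value, 3))
print("reject H0:", p_value < alpha)
```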
Example 4 (Pine Creek)
Over the past year, 20% of the players at Pine Creek golf course were women. In an effort
to increase this proportion, Pine Creek has implemented a special promotion to attract
women golfers (WomenGolf.xlsx). Develop hypotheses to assess whether the promotion
helped increase the proportion of women golfers.
Example 4 (Pine Creek) Solution
• The hypotheses are $H_0: p \le 0.20$ and $H_A: p > 0.20$.
• This is a one-tailed (upper-tail) test.
• JMP instructions:
1. Analyze ► Distribution
2. Select Golfer and click Y, Columns
3. On the report page, click the red triangle next to Golfer, select Test Probabilities, and enter 0.2 as the Hypothesized Mean for female
Example 4 (Pine Creek) Solution
• The p-value is 0.0086, which is less than any reasonable value of $\alpha$.
– Therefore, we reject the null hypothesis. The promotion seems to be working to increase the proportion of women golfers.
