Statistical Inference
Business Analytics
Chapter 6: Statistical Inference
Instructor: Hamed Qahri-Saremi, PhD
Sampling Distributions
• In many applications, we are interested in the characteristics of a population.
– It is difficult, if not impossible, to analyze the entire population, so we usually do not know the population parameters.
– Instead, we make inferences about the characteristics of the population based on sample statistics from a random sample.
– *Inference: drawing conclusions on the basis of evidence and reasoning.
• Sampling techniques typically involve a tradeoff between cost and accuracy.
– Bias − the tendency of a sample statistic to systematically over- or underestimate a population
parameter.
– Selection bias − a systematic underrepresentation of certain groups from consideration for the sample.
• Regression to the mean.
LO 7.2
Sampling Methods
• Alternatives to Simple Random Sampling
Source: https://www.scribbr.com/methodology/sampling-methods/
Sampling Error
• A point estimate is a single numeric value, or a “best guess” of a population parameter, based on the
data in the sample, such as a mean.
• The sampling error is the difference between the point estimate and the true value of the population
parameter. Sampling error is unavoidable when collecting a random sample.
– There are many tables and references available online to provide a suggested sample size based
on the size of the population and the acceptable level of sampling error.
– However, businesses need to consider the cost and time required to perform the sampling.
Sampling Distributions
• The sampling distribution of a point estimate is the distribution of point estimates from
all possible samples from the population.
– Like any distribution, a sampling distribution has a mean and a standard deviation.
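The idea of a sampling distribution can be illustrated with a small simulation. The sketch below is only an approximation: it draws many random samples rather than all possible samples, and the population (10,000 roughly normal values with mean 100 and standard deviation 15), the sample size, and the function name `simulate_sample_means` are assumptions made up for illustration.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population (an assumption, not data from the slides).
population = [random.gauss(100, 15) for _ in range(10_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)


def simulate_sample_means(pop, n, reps):
    """Draw `reps` simple random samples of size n and return their means."""
    return [statistics.fmean(random.sample(pop, n)) for _ in range(reps)]


n = 30
sample_means = simulate_sample_means(population, n, reps=5_000)

# The sampling distribution has its own mean and standard deviation.
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = pop_sd / math.sqrt(n)  # standard error of the sample mean

print(f"population mean       = {pop_mean:.2f}")
print(f"mean of sample means  = {mean_of_means:.2f}")
print(f"sd of sample means    = {sd_of_means:.2f}")
print(f"sigma / sqrt(n)       = {theoretical_se:.2f}")
```

The mean of the simulated sample means should land close to the population mean, and their standard deviation close to sigma / sqrt(n).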
Random Variable
• The sample proportion ($\hat{p}$) can be assumed to follow a normal distribution if the sample size is sufficiently
large:
– $np \ge 5$ and $n(1-p) \ge 5$
• where $p$ is the estimated probability of success.
Estimation
• A confidence interval, or interval estimate, provides a range of values that, with a certain level of
confidence, contains the population parameter of interest.
– A confidence interval around the point estimate is calculated from the sample data. The interval is very likely to
contain the true value of the population parameter.
– The confidence level describes the likelihood that the confidence interval will contain the population parameter
(for example, 95%)
– A confidence interval is associated with a margin of error.
• The margin of error around the confidence interval accounts for the standard error of the estimator and the desired
confidence level.
• The confidence interval for the population mean and population proportion is constructed as:
– point estimate ± margin of error.
• To construct a confidence interval for the population mean or the population proportion, it is
essential that the sampling distributions of the sample mean $\bar{x}$ and the sample proportion $\hat{p}$ follow (approximately) a normal distribution.
Estimation: Calculating the Confidence Interval
• Specify a confidence level, usually 90% or 95% or 99%
– This determines the level of significance: $\alpha$ = 1 – the confidence level
• Use the level of significance ($\alpha$), the degrees of freedom ($n-1$), and the t-distribution
to determine the multiple of the standard error on either side of the point estimate.
– This amount is the margin of error.
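The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `mean_confidence_interval` and the sample summary (mean 50, standard deviation 10, n = 25) are hypothetical, and the critical value 1.711 is the standard t-table entry for an upper-tail area of 0.05 with 24 degrees of freedom.

```python
import math


def mean_confidence_interval(xbar, s, n, t_crit):
    """Return (lower, upper) for xbar ± t_crit * s / sqrt(n)."""
    standard_error = s / math.sqrt(n)
    margin_of_error = t_crit * standard_error
    return xbar - margin_of_error, xbar + margin_of_error


# Hypothetical sample summary (illustrative values, not from the slides).
xbar, s, n = 50.0, 10.0, 25

# 90% confidence -> alpha = 0.10, alpha / 2 = 0.05, df = n - 1 = 24.
# t_{0.05,24} = 1.711 is read from a standard t table.
t_crit = 1.711

lower, upper = mean_confidence_interval(xbar, s, n, t_crit)
print(f"90% CI: [{lower:.3f}, {upper:.3f}]")
```

A higher confidence level means a larger critical value and therefore a wider interval for the same sample.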
Estimation
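• A 90% confidence interval implies that if numerous samples of some predetermined
size (n) are drawn from a given population, then about 90% of the intervals constructed from those samples will contain the population mean.
• A confidence interval for the population mean is computed as $\bar{x} \pm t_{\alpha/2,\,df}\,\frac{s}{\sqrt{n}}$, where $df = n - 1$.
– For example, $t_{0.05,10}$ is the value with an area of 0.05 in the upper tail of the t-distribution with 10 degrees of freedom.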
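The repeated-sampling interpretation of a 90% confidence interval can be checked with a simulation. The sketch below assumes a hypothetical normal population (mean 100, standard deviation 15) and samples of size 11, so that the critical value is $t_{0.05,10} = 1.812$ from a standard t table; all names and parameter values are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

MU, SIGMA = 100.0, 15.0  # hypothetical normal population
N = 11                   # sample size, so df = 10
T_CRIT = 1.812           # t_{0.05,10} from a standard t table (90% confidence)
REPS = 2_000             # number of repeated samples


def interval_contains_mu():
    """Draw one sample, build its 90% interval, and check whether it covers MU."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    margin = T_CRIT * s / math.sqrt(N)
    return xbar - margin <= MU <= xbar + margin


coverage = sum(interval_contains_mu() for _ in range(REPS)) / REPS
print(f"share of intervals containing the population mean: {coverage:.3f}")
```

The printed share should be close to 0.90, although any single interval either contains the population mean or does not.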
Estimation
Example: in a sample of 25 ultra-green cars, we find the sample mean $\bar{x}$ (in mpg) and the sample standard deviation $s$.
• E4Q1: Construct a 90% confidence interval for the population mean, assuming that mpg
follows a normal distribution.
Estimation
• Note that $z_{\alpha/2}$ is the value with an area of $\alpha/2$ in the upper tail of the standard normal distribution.
Estimation – Confidence Interval for a Population Proportion
• Specify a confidence level, usually 90% or 95% or 99%.
• This determines the $z_{\alpha/2}$ value for the equation below (you can also use the z table).
– 90% use 1.645
– 95% use 1.96
– 99% use 2.58
• The confidence interval is $\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
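The three z values listed above can be reproduced with Python's standard library instead of a z table. This is a small sketch; the helper name `z_critical` is an assumption introduced here.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Return z_{alpha/2}, the value with upper-tail area alpha / 2."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / 2)


z_values = {level: z_critical(level) for level in (0.90, 0.95, 0.99)}
for level, z in z_values.items():
    print(f"{level:.0%} confidence: z = {z:.3f}")
```

The 99% value is about 2.576, which the list above rounds to 2.58.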
Estimation
Example: In a sample of 25 ultra-green cars, seven of the cars obtained over 100 miles per gallon (mpg).
• E5Q1: Construct a 90% confidence interval for the population proportion of all ultra-
green cars that obtain over 100 mpg.
Estimation
The 90% confidence interval for the proportion of cars that obtain over 100 mpg is between 13.2% and 42.8%.
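The result above can be reproduced with a short calculation. The sketch below uses the standard large-sample formula, sample proportion ± z × standard error, with the data from the example (7 of 25 cars) and z = 1.645 for 90% confidence; the function name `proportion_confidence_interval` is an assumption introduced here.

```python
import math


def proportion_confidence_interval(successes, n, z_crit):
    """Return (lower, upper) for p_hat ± z_crit * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z_crit * standard_error
    return p_hat - margin_of_error, p_hat + margin_of_error


# 7 of the 25 sampled cars obtained over 100 mpg; z = 1.645 for 90% confidence.
lower, upper = proportion_confidence_interval(successes=7, n=25, z_crit=1.645)
print(f"90% CI: [{lower:.3f}, {upper:.3f}]")  # prints 90% CI: [0.132, 0.428]
```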
Hypothesis Testing
• Hypothesis testing is used to resolve a conflict between two competing hypotheses about a
particular population of interest.
– It assesses the validity of an assumption using sample data.
– The null hypothesis ($H_0$) corresponds to maintaining the status quo, or business as “usual”; the alternative hypothesis ($H_A$) contradicts it.
– We never “accept” the null hypothesis: sample evidence is not enough to support that conclusion, so we either reject or do not reject $H_0$.
– We also cannot prove that the null hypothesis is true.
Hypothesis Testing
• A crucial step in a hypothesis test concerns the formulation of the two
competing hypotheses because the conclusion of the test depends on
how the hypotheses are stated.
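• For a given sample size n, we can reduce $\alpha$ (the probability of a Type I error, rejecting a true null hypothesis) only at the expense of a higher $\beta$ (the probability of a Type II error, not rejecting a false null hypothesis), and reduce $\beta$
only at the expense of a higher $\alpha$.
– The optimal choice of $\alpha$ and $\beta$ depends on the relative cost of these two types of errors, and
determining these costs is not always easy.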
Hypothesis Testing
• A hypothesis test regarding the population mean is based on the assumption that the
sample mean is (approximately) normally distributed.
1. Assume that the population mean $\mu = \mu_0$, where $\mu_0$ is the hypothesized value of the population mean in the null
hypothesis.
• That is, assume the null hypothesis is true.
2. Compute the test statistic: $t_{df} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with $df = n - 1$.
3. Then find the p-value:
• The p-value is the probability of observing a sample mean that is at least as extreme as the one
derived from the sample, assuming the null hypothesis is true.
• Its calculation depends on the form of the alternative hypothesis (lower-tail, upper-tail, or two-tailed).
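Step 2 can be sketched as follows, using the standard one-sample t statistic, (sample mean − hypothesized mean) divided by the standard error. The function name `t_statistic` and the ten ratings are hypothetical, made up purely for illustration.

```python
import math
import statistics


def t_statistic(sample, mu0):
    """Return (t, df) for a one-sample test of the mean against mu0."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    standard_error = s / math.sqrt(n)
    return (xbar - mu0) / standard_error, n - 1


# Hypothetical data (not from the slides).
ratings = [2, -1, 3, 0, 4, 1, 2, -2, 3, 1]
t_stat, df = t_statistic(ratings, mu0=0)
print(f"t = {t_stat:.3f} with df = {df}")
```

The p-value in step 3 is then obtained from the t-distribution with `df` degrees of freedom.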
Hypothesis Testing
• Finding the p-value from the test statistic:
– Use the t table to find the p-value
– Or use an online calculator that converts a t statistic to a p-value
– Or use JMP to calculate the test statistic and p-value from sample data
• For an upper-tail (one-tailed) test, use the p-value labeled Prob > t
• For a lower-tail (one-tailed) test, use the p-value labeled Prob < t
• For a two-tailed test, use the p-value labeled Prob > |t|
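For readers without a t table or JMP at hand, the three p-values can be approximated numerically. The sketch below is one possible approach, not what JMP does internally: it integrates the Student's t density with Simpson's rule using only the standard library. The function names and the example inputs (t = 1.812, df = 10) are assumptions chosen because 1.812 is the familiar table value for an upper-tail area of 0.05.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(t, df, steps=2000):
    """P(T <= t), integrating the density from 0 to |t| with Simpson's rule."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area


def p_values(t, df):
    """Return the three p-values under the labels used in the JMP report."""
    return {
        "Prob > t": 1 - t_cdf(t, df),             # upper-tail test
        "Prob < t": t_cdf(t, df),                 # lower-tail test
        "Prob > |t|": 2 * (1 - t_cdf(abs(t), df)),  # two-tailed test
    }


result = p_values(t=1.812, df=10)
for label, p in result.items():
    print(f"{label}: {p:.4f}")
```

The two-tailed p-value is twice the smaller one-tailed p-value because the t-distribution is symmetric around zero.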
Hypothesis Testing
• We define $\alpha$ as the allowed probability of making a Type I error.
– We refer to $\alpha$ as the significance level.
• Most hypothesis tests are conducted using a significance level of 1%, 5%, or 10%, using $\alpha$ = 0.01,
0.05, or 0.10, respectively.
• We generally choose a value for $\alpha$ before implementing a hypothesis test.
E6Q2:
E6Q3: The p-value is less than 0.0005 using the t table.
E6Q4: Reject the null hypothesis because the p-value is less than $\alpha = 0.05$. At the 5% significance
level, we conclude that the average study time at the university is less than the 1961
average of 24 hours per week.
Example 2 (Pizza Restaurant)
A manager of a pizza restaurant would like to know whether a new preparation method
provides better tasting pizza. Customers were asked to rate the new style pizza vs. the old
style on a scale of -10 to +10 (Pizza_Ratings.xlsx). A negative rating favors the old style. A
positive rating favors the new style. A rating around zero indicates indifference.
Example 2 (Pizza Restaurant) Solution
• The hypotheses are $H_0: \mu \le 0$ and $H_A: \mu > 0$
• This is an upper-tail test (alternative hypothesis is one-tailed of the “greater than”
variety)
• JMP instructions:
1. Analyze ► Distribution
2. Select Rating and click Y, Columns
3. On the report page, click the red triangle next to Rating, select Test Mean, and enter 0 as the
Hypothesized Mean.
Example 2 (Pizza Restaurant) Solution
• The p-value is 0.0038.
– This is less than $\alpha = 0.01$, so we reject the null hypothesis: at the 1% significance level, the mean
rating is significantly greater than the hypothesized value of
zero, which favors the new style.
• That doesn’t mean the demand will be 40, only that it’s
possible the demand could be 40.
Hypothesis Testing for Population Proportion
• A hypothesis test regarding the population proportion is based on the assumption that the
sample proportion is (approximately) normally distributed.
• Assume that the population proportion $p = p_0$, where $p_0$ is the hypothesized value of the population
proportion in the null hypothesis.
• The three forms for a hypothesis test about a population proportion are:
– (One tailed) Lower-tail test: $H_0: p \ge p_0$ and $H_A: p < p_0$
– (One tailed) Upper-tail test: $H_0: p \le p_0$ and $H_A: p > p_0$
– Two-tailed test: $H_0: p = p_0$ and $H_A: p \ne p_0$
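The three forms can be sketched with the standard large-sample z statistic for a proportion, which is the sample proportion minus the hypothesized proportion, divided by sqrt(p0 × (1 − p0) / n). This formula is the usual textbook one and is assumed here rather than quoted from the slides. The function name `proportion_z_test` and the counts (60 successes in 100 trials, hypothesized proportion 0.50) are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes, n, p0, tail):
    """Return (z, p_value) for a test of a population proportion against p0.

    tail is "lower", "upper", or "two", matching the form of the alternative.
    """
    p_hat = successes / n
    # The standard error uses p0 because the null hypothesis is assumed true.
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf(z)
    if tail == "lower":
        p_value = cdf
    elif tail == "upper":
        p_value = 1 - cdf
    elif tail == "two":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("tail must be 'lower', 'upper', or 'two'")
    return z, p_value


# Hypothetical data (not from the slides): 60 successes in 100 trials, p0 = 0.50.
z, p_upper = proportion_z_test(60, 100, 0.50, tail="upper")
_, p_two = proportion_z_test(60, 100, 0.50, tail="two")
print(f"z = {z:.3f}, upper-tail p = {p_upper:.4f}, two-tailed p = {p_two:.4f}")
```

With these made-up numbers, an upper-tail test rejects the null hypothesis at the 5% level but not at the 1% level.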
E7Q3: Because the p-value of 0.0872 is less than $\alpha = 0.10$, we reject the null hypothesis. Therefore,
at the 10% significance level, the proportion of adults who favor marijuana legalization in
Ohio differs from the national proportion of 0.57.
Example 4 (Pine Creek)
Over the past year, 20% of the players at Pine Creek golf course were women. In an effort
to increase this proportion, Pine Creek has implemented a special promotion to attract
women golfers (WomenGolf.xlsx). Develop hypotheses to assess whether the promotion
helped increase the proportion of women golfers.
Example 4 (Pine Creek) Solution
• The hypotheses are $H_0: p \le 0.20$ and $H_A: p > 0.20$
• This is a one-tailed (upper-tail) test.
• JMP instructions:
1. Analyze ► Distribution
2. Select Golfer and click Y, Columns
3. On the report page, click the red triangle next to
Golfer, select Test Probabilities, and enter 0.2 as
the Hypothesized Mean for female
Example 4 (Pine Creek) Solution
• The p-value is 0.0086, which is less than any reasonable value of $\alpha$.
– Therefore, we reject the null hypothesis. The promotion seems to be working to increase the
proportion of women golfers.