Regression Modeling
Based on the criticality of the requirement, we can choose a lower significance level of
1% as well.
Determine the Test Statistic and calculate its value for Hypothesis Testing
Hypothesis testing uses a test statistic: a numerical summary that reduces a data set to a
single value, which is then used to perform the hypothesis test.
Select the type of Hypothesis test
We choose the type of test statistic based on whether the predictor variable is quantitative or categorical.
Below are a few of the commonly used test statistics:
Type of predictor variable | Distribution type              | Desired test    | Attributes
Quantitative               | Normal distribution            | Z-Test          | Large sample size; population standard deviation known
Quantitative               | T distribution                 | T-Test          | Sample size less than 30; population standard deviation unknown
Quantitative               | Positively skewed distribution | F-Test          | When you want to compare 3 or more variables
Quantitative               | Negatively skewed distribution | NA              | Requires feature transformation to perform a hypothesis test
Categorical                | NA                             | Chi-Square test | Test of independence; goodness of fit
Z-statistic – Z Test
The Z-statistic is used when the sample follows a normal distribution. It is calculated from population
parameters such as the mean and standard deviation.
A one-sample Z-test is used when we want to compare a sample mean with a population mean.
A two-sample Z-test is used when we want to compare the means of two samples.
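As a rough illustration of the one-sample case, the sketch below computes z = (sample mean - population mean) / (population standard deviation / sqrt(n)) and a two-sided p-value using only the Python standard library. The function name, the sample values, and the population parameters are all hypothetical and chosen only for demonstration.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, pop_mean, pop_std):
    """Return (z, two-sided p-value) for a one-sample Z-test.

    Assumes the population standard deviation pop_std is known.
    """
    n = len(sample)
    standard_error = pop_std / sqrt(n)
    z = (mean(sample) - pop_mean) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical measurements; population mean 50 and standard deviation 4 are assumed.
sample = [52, 48, 55, 51, 49, 53, 50, 54, 47, 56, 52, 51, 50, 53, 49, 52]
z, p = one_sample_z_test(sample, pop_mean=50, pop_std=4)
print(f"z = {z:.3f}, p = {p:.3f}")
```

If p is smaller than the chosen significance level, we reject the null hypothesis that the sample comes from a population with the stated mean.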
T-statistic – T-Test
The T-statistic is used when the sample follows a T distribution and the population parameters are unknown.
The T distribution is similar to a normal distribution, but it has a lower peak and heavier tails.
If the sample size is less than 30 and the population parameters are not known, we use the T distribution.
Here too, we can use a one-sample T-test or a two-sample T-test.
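The two variants can be sketched with the standard library alone. This example computes only the test statistics (and the degrees of freedom for the one-sample case), not p-values. The two-sample version uses Welch's form, which does not assume equal variances; that choice, the function names, and the data are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance


def one_sample_t_statistic(sample, hypothesized_mean):
    """Return (t, degrees of freedom) for a one-sample T-test."""
    n = len(sample)
    # The sample standard deviation replaces the unknown population value.
    standard_error = stdev(sample) / sqrt(n)
    t = (mean(sample) - hypothesized_mean) / standard_error
    return t, n - 1


def two_sample_t_statistic(a, b):
    """Return Welch's t statistic for two independent samples."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical small sample (n < 30) tested against a hypothesized mean of 12.0.
weights = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
t, dof = one_sample_t_statistic(weights, hypothesized_mean=12.0)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

The resulting statistic is then compared against the critical value of the T distribution with the given degrees of freedom.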
F-statistic – F test
For samples involving three or more groups, we prefer the F-test. Performing repeated pairwise T-tests across
multiple groups increases the chance of a Type 1 error, so ANOVA is used in such cases.
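To see why repeated T-tests inflate the Type 1 error, here is a small sketch of the chance of at least one false rejection across all pairwise comparisons. It assumes the tests are independent, which is a simplification, and the function name is hypothetical.

```python
from math import comb


def familywise_error_rate(num_groups, alpha=0.05):
    """Chance of at least one Type 1 error across all pairwise tests.

    Assumes the pairwise tests are independent (a simplification).
    """
    num_pairwise_tests = comb(num_groups, 2)
    return 1 - (1 - alpha) ** num_pairwise_tests


for groups in (2, 3, 5):
    rate = familywise_error_rate(groups)
    print(f"{groups} groups: {comb(groups, 2)} tests, error rate {rate:.3f}")
```

Under this assumption, three groups need three pairwise tests, and the combined error rate at a 5% level is already about 14%.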
Analysis of variance (ANOVA) can determine whether the means of three or more groups are different. ANOVA
uses F-tests to statistically test the equality of means.
The F-statistic follows an F distribution, which is positively skewed. F distributions take only
non-negative values and are skewed right.
F = variation between the sample means / variation within the samples
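The ratio above can be computed by hand for a one-way ANOVA. In this sketch, the variation between sample means is taken as the between-group mean square and the variation within samples as the within-group mean square. The function name and the three groups are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Variation between the sample means (between-group sum of squares).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within the samples (within-group sum of squares).
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_anova_f(groups)
print(f"F = {f_value:.2f}")
```

A large F value indicates that the group means differ by more than the variation within the groups would explain.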
For negatively skewed data, we would first need to perform a feature transformation.
Chi-Square Test
For categorical variables, we perform a Chi-Square test.
The following are the two types of Chi-Square tests:
1. Chi-Square test of independence – We use this test to determine whether
or not there is a significant relationship between two categorical variables.
2. Chi-Square goodness of fit – This test helps us determine whether the sample data
correctly represents the population.
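Both tests use the same statistic, the sum of (observed - expected)^2 / expected over all cells. The sketch below computes it for each case with plain Python and applies no continuity correction. The function names and the counts are hypothetical.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


def chi_square_goodness_of_fit(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for observed vs expected counts."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


# Hypothetical 2x2 table: rows are two customer segments, columns are two product choices.
table = [[20, 30], [30, 20]]
chi2, dof = chi_square_independence(table)
print(f"independence: chi2 = {chi2:.2f}, dof = {dof}")

# Hypothetical die-roll counts compared with a fair die (20 expected per face).
gof_chi2, gof_dof = chi_square_goodness_of_fit([18, 22, 20, 20, 19, 21], [20] * 6)
print(f"goodness of fit: chi2 = {gof_chi2:.2f}, dof = {gof_dof}")
```

For one degree of freedom, the 5% critical value of the Chi-Square distribution is about 3.84, so a statistic above that would lead us to reject independence at that level.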
1) Type 1 Error – This occurs when the null hypothesis is true but we reject it. The
probability of a Type 1 error is denoted by alpha (α). This probability is also known as the
level of significance of the hypothesis test.
2) Type 2 Error – This occurs when the null hypothesis is false but we fail to reject
it. The probability of a Type 2 error is denoted by beta (β).
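The link between alpha and the Type 1 error can be checked with a small simulation. The sketch below repeatedly draws samples from a population where the null hypothesis is true, runs a two-sided Z-test on each, and counts how often the null is wrongly rejected. The function name, sample size, trial count, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def estimate_type1_error_rate(alpha=0.05, trials=4000, n=30, seed=0):
    """Estimate how often a true null hypothesis is rejected at level alpha."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # The null hypothesis is true here: population mean 0, standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) / (1.0 / sqrt(n))
        if abs(z) > critical_z:
            rejections += 1
    return rejections / trials


rate = estimate_type1_error_rate()
print(f"Estimated Type 1 error rate: {rate:.3f}")
```

The estimated rate should land close to the chosen alpha, which is why alpha is also called the level of significance.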