Parametric vs Non-parametric Test
PARAMETRIC TEST
In statistics, parametric tests are tests that make assumptions about the underlying
distribution of data.
Assumption 1: Normality
Parametric tests assume that each group is roughly normally distributed.
If the sample size of each group is small (n < 30), we can use a Shapiro-Wilk test to
determine whether each sample is normally distributed. However, if the sample sizes are large,
it's better to use a Q-Q plot to visually check whether the data are normally distributed, since
formal tests such as Shapiro-Wilk become overly sensitive to trivial deviations in large samples.
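The Q-Q idea can be sketched with only the Python standard library (the function name and sample values here are illustrative, not from the original text): each sorted observation is paired with the standard-normal quantile at the same plotting position, and points falling close to a straight line suggest approximate normality.

```python
from statistics import NormalDist

def qq_points(sample):
    """Pair each sorted observation with the standard-normal quantile
    at the same plotting position (i - 0.5) / n."""
    n = len(sample)
    nd = NormalDist()  # standard normal: mean 0, standard deviation 1
    theoretical = [nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))

# Plot these pairs: a near-straight line suggests approximate normality.
points = qq_points([1.2, 0.8, 1.0, 1.4, 0.6])
```

In practice these pairs would be drawn as a scatter plot; the straighter the line, the closer the sample is to a normal distribution.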
Assumption 2: Equal Variance
Parametric tests that compare groups typically assume that the groups have roughly equal
variances (homogeneity of variance).
Assumption 3: Independence
Parametric tests assume that the observations in each group are independent of observations
in every other group.
The easiest way to check this assumption is to verify that the data were collected using a
probability sampling method – a method in which every member of the population has an equal
probability of being selected for the sample.
Assumption 4: No Outliers
Parametric tests assume that there are no extreme outliers in any group that could adversely
affect the results of the test.
REPEATED MEASURES ANOVA
You may want to know whether a treatment for high blood pressure has an effect, i.e. whether
blood pressure changes over time under the treatment. Because the same patients are
measured repeatedly, this calls for a repeated measures analysis of variance.
But what if you have several therapies and you want to see if there is a difference between
them? You now have two factors, one for the therapy and one for the repeated measurements.
Since you now have two factors and one of the factors is a dependent (repeated) sample, you
use a two-way repeated measures analysis of variance.
CHI-SQUARE TEST
The Chi-square test is a hypothesis test used to determine whether there is a relationship
between two categorical variables.
Categorical variables are, for example, a person's gender, preferred newspaper, frequency of
television viewing, or highest level of education. So whenever you want to test whether
there is a relationship between two categorical variables, you use a chi-square test.
The chi-square test is a hypothesis test used for categorical variables with nominal or ordinal
measurement scale. The chi-square test checks whether the frequencies occurring in the
sample differ significantly from the frequencies one would expect. Thus, the observed
frequencies are compared with the expected frequencies and their deviations are examined.
• For example, forty-eight subjects are asked to express their attitude toward the
proposition "Should India join the United Nations for the Control of Peace in the world?"
by marking F (favorable), I (indifferent) or U (unfavorable). Of the members in the
group, 24 marked F, 12 marked I, and 12 marked U. Do these results indicate a significant
trend of opinion?
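Under the null hypothesis that the three answers are equally popular, each would be expected 48/3 = 16 times. A minimal sketch of the goodness-of-fit computation (the 5% critical value for 2 degrees of freedom is the standard tabulated 5.991):

```python
# Chi-square goodness-of-fit for the attitude example above.
# H0: F, I and U are equally popular, so each is expected 48 / 3 = 16 times.
observed = [24, 12, 12]
expected = [48 / 3] * 3

# chi2 = sum of (observed - expected)^2 / expected over all categories
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical = 5.991  # chi-square critical value for df = 2 at the 5% level
significant = chi2 > critical
```

Here chi2 works out to 6.0, which exceeds 5.991, so the results do indicate a significant trend of opinion at the 5% level.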
Chi-square test for Independence: This test is used when you wish to explore the relationship
between two categorical variables. Each of these variables can have two or more categories.
CORRELATION
Correlation analysis is a statistical technique that gives you information about the relationship
between variables. How strong the correlation is, is determined by the correlation coefficient,
which varies from -1 to +1. Correlation analyses can thus be used to make a statement about
the strength and direction of the relationship.
For example: You want to find out whether there is a connection between the age at which a
child speaks its first sentences and its later success at school.
Positive Correlation: The correlation is said to be positive when the values of the two
variables change in the same direction; the correlation coefficient then lies between 0 and +1,
i.e. it takes a positive value. Example: height and weight.
Negative Correlation: The correlation is said to be negative when the values of the
variables change in opposite directions. In this case, the correlation coefficient lies between -1
and 0, so it takes a negative value. Example: the price of a product and the quantity sold.
Strength of correlation
With regard to the strength of the correlation coefficient r, the following table can be used as a
guide:
|r|                Strength of correlation
0.0 ≤ |r| < 0.1    no correlation
0.1 ≤ |r| < 0.3    little correlation
0.3 ≤ |r| < 0.5    medium correlation
0.5 ≤ |r| < 0.7    high correlation
0.7 ≤ |r| ≤ 1.0    very high correlation
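The table above can be expressed as a small lookup helper (a sketch; the function name is illustrative):

```python
def correlation_strength(r):
    """Map the absolute value of r to the verbal labels from the table above."""
    a = abs(r)
    if a < 0.1:
        return "no correlation"
    if a < 0.3:
        return "little correlation"
    if a < 0.5:
        return "medium correlation"
    if a < 0.7:
        return "high correlation"
    return "very high correlation"
```

For example, correlation_strength(0.65) returns "high correlation".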
Scatter plot and correlation
A scatter diagram is a graph of plotted points where each point represents the observed values
of X and Y as a coordinate pair. It portrays the relationship between these two variables graphically.
Pearson Correlation
Pearson correlation analysis examines the relationship between two variables. For example, is
there a correlation between a person's age and salary? More specifically, we can use the
Pearson correlation coefficient to measure the linear relationship between two metric variables.
If its assumptions (metric variables, a linear relationship and, for the significance test,
normally distributed data) are not met, Spearman's rank correlation can be used instead.
Spearman's rank correlation coefficient
The Spearman rank correlation examines the relationship between two variables, being the
non-parametric counterpart of Pearson's correlation. Therefore, in this case, a normal
distribution of the data is not required.
Spearman correlation uses the ranks of the data rather than the original data, hence the name
rank correlation.
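The "Pearson on ranks" idea can be sketched in plain Python (all function names here are illustrative; ties receive the average of their ranks, as is standard for Spearman):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman = Pearson applied to the ranks of the data."""
    return pearson(ranks(x), ranks(y))
```

A perfectly monotonic but non-linear relationship (e.g. y = x²) gives a Spearman coefficient of 1 while the Pearson coefficient stays below 1, which illustrates the difference between the two measures.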
Kendall's rank correlation is a non-parametric test procedure. For the calculation of
Kendall's tau, the data need not be normally distributed and the two variables need only have
an ordinal scale level.
However, Kendall's tau should be preferred over Spearman's correlation when there is very
little data and many rank ties!
Kendall's tau is calculated as
τ = (C − D) / (n(n − 1) / 2)
where C is the number of concordant pairs and D is the number of discordant pairs.
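Counting concordant and discordant pairs can be sketched by brute force over all pairs (this is the tau-a variant, which ignores ties in the denominator; the function name is illustrative):

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (C - D) / (n(n-1)/2), counting concordant and
    discordant pairs by brute force (fine for small samples)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:      # pair ordered the same way in x and y
                concordant += 1
            elif s < 0:    # pair ordered oppositely
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For x = [1, 2, 3, 4] and y = [1, 3, 2, 4] there are 5 concordant and 1 discordant pairs out of 6, giving tau = 4/6 ≈ 0.67.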
Point Biserial Correlation
Point biserial correlation is a special case of Pearson correlation and examines the relationship
between a dichotomous variable and a metric variable.
A dichotomous variable is a variable with exactly two values, for example gender with male and
female, or smoking status with smoker and non-smoker. A metric variable is, for example, a
person's weight or a person's salary.
Let's say we want to study the correlation between the number of hours spent learning for an
exam and the exam result (pass/fail).
To calculate the point biserial correlation, we first need to convert the test score into numbers.
We can assign a value of 1 to the students who passed the test and 0 to the students who failed
the test.
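Since the point biserial coefficient is just the Pearson coefficient applied after 0/1 coding, the exam example can be sketched like this (the data values are hypothetical, invented for illustration):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; the point biserial coefficient is this same
    formula applied after coding the dichotomous variable as 0/1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

hours_studied = [2, 4, 5, 7, 8, 10]   # metric variable (hypothetical data)
passed = [0, 0, 1, 0, 1, 1]           # dichotomous: pass = 1, fail = 0
r_pb = pearson(hours_studied, passed)
```

A positive r_pb here would indicate that students who studied longer tended to pass more often.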
Partial Correlation
Partial correlation calculates the correlation between two variables while excluding the effect
of a third variable. This makes it possible to find out whether the correlation rxy between
variables x and y is actually produced by the variable z.
The partial correlation rxy,z tells how strongly the variable x correlates with the variable y once
the correlation of both variables with the variable z has been removed (partialled out).
Example:
After controlling for the effects of socially desirable responding bias, is there still a relationship
between optimism and life satisfaction?
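The first-order partial correlation can be computed directly from the three pairwise Pearson coefficients; a minimal sketch (the function name is illustrative):

```python
from math import sqrt

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y controlling for z,
    computed from the three pairwise Pearson coefficients."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

For instance, if r_xy = 0.25 while r_xz = r_yz = 0.5, the partial correlation is 0: the apparent relationship between x and y is entirely accounted for by z.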
REGRESSION ANALYSIS
Linear Regression analysis is used to create a model that describes the relationship between a
dependent variable and one or more independent variables. Depending on whether there are
one or more independent variables, a distinction is made between simple and multiple linear
regression analysis.
In linear regression, an important prerequisite is that the measurement scale of the dependent
variable is metric and a normal distribution exists. If the dependent variable is categorical, a
logistic regression is used.
In linear regression analysis, a straight line is drawn in the scatter plot. To determine this
straight line, linear regression uses the method of least squares.
The regression line can be described by the following equation:
y = b0 + b1 · x
where b0 is the intercept (the point where the line crosses the y-axis) and b1 is the slope.
Marketing example:
For a video streaming service, you want to predict how many times a month a person streams
videos. For this you are given a dataset of user data (age, income, gender, ...).
Medical example:
You want to find out which factors have an influence on the cholesterol level of patients. For
this purpose, you analyze a patient data set with cholesterol level, age, hours of sport per week
and so on.
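The least squares method for a simple linear regression can be sketched in plain Python: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means.

```python
def least_squares(x, y):
    """Ordinary least squares fit of a simple linear regression y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope: sum of deviation products divided by sum of squared x deviations
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    # intercept: the line passes through the point of means (mx, my)
    b0 = my - b1 * mx
    return b0, b1

# Sanity check on an exact line y = 2x + 1:
b0, b1 = least_squares([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
```

On data that lie exactly on a line, the fit recovers the intercept and slope exactly; on noisy data it returns the line minimizing the sum of squared residuals.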
Multiple regression should not be confused with multivariate regression. In the former case,
the influence of several independent variables on a dependent variable is examined. In the
second case, several regression models are calculated to allow conclusions to be drawn about
several dependent variables. Consequently, in a multiple regression, one dependent variable is
taken into account, whereas in a multivariate regression, several dependent variables are
analyzed.
What is the p-value?
The p-value indicates the probability that the observed result or an even more extreme result
will occur if the null hypothesis is true.
The p-value is used to decide whether the null hypothesis is rejected or retained (not rejected).
If the p-value is smaller than the defined significance level (often 5%), the null hypothesis is
rejected, otherwise not.
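As a sketch of this decision rule, here is the two-sided p-value for a standard-normal (z) test statistic, computed with the standard library (the test statistic value 1.96 is an illustrative example):

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided p-value for a standard-normal test statistic z:
    the probability of a result at least as extreme in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05                  # significance level chosen before the test
p = two_sided_p_value(1.96)   # p is approximately 0.05
reject_h0 = p < alpha
```

A z statistic of about 1.96 sits right at the boundary of the 5% level, which is why 1.96 is the familiar two-sided critical value.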
Significance level
The significance level is determined before the test. If the calculated p-value is below this
value, the null hypothesis is rejected, otherwise it is retained. As a rule, a significance level of 5
% is chosen.
The significance level thus indicates the probability of a Type I error. What does this mean? If
the significance level is 5% and the null hypothesis is rejected, there is at most a 5% probability
of rejecting a null hypothesis that is actually true, i.e. a 5% probability of making this kind of
mistake. If the significance level is reduced to 1%, the error probability is accordingly only 1%,
but it also becomes more difficult to confirm the alternative hypothesis.