
PARAMETRIC vs. NON-PARAMETRIC TEST

In statistics, parametric tests are tests that make assumptions about the underlying
distribution of data.

Assumption 1: Normality
Parametric tests assume that each group is roughly normally distributed.
If the sample size of each group is small (n < 30), we can use a Shapiro-Wilk test to determine
whether each sample is normally distributed. However, if the sample sizes are large, it is better
to use a Q-Q plot to visually check whether the data are normally distributed, because the
Shapiro-Wilk test tends to flag even trivial deviations from normality in large samples.
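As an illustration, here is a minimal sketch in Python (the sample values are made up) of how both checks could be carried out with SciPy and statsmodels:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# Hypothetical small sample (n < 30)
group = np.array([4.2, 5.1, 3.9, 4.8, 5.5, 4.0, 4.7, 5.2, 4.4, 4.9])

# Shapiro-Wilk test: H0 = the data come from a normal distribution
stat, p = stats.shapiro(group)
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p:.3f}")  # p > 0.05 -> no evidence against normality

# Q-Q plot: for larger samples, check visually whether the points follow the line
sm.qqplot(group, line="45", fit=True)
plt.show()
```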

Assumption 2: Homoscedasticity (Equal Variance)


Parametric tests assume that the variance of each group is roughly equal.
If the ratio of the largest variance to the smallest variance is less than 4, then we can assume
the variances are approximately equal. For example, suppose group 1 has a variance of 24.5 and
group 2 has a variance of 15.2. The ratio of the larger sample variance to the smaller sample
variance would be calculated as:
Ratio: 24.5 / 15.2 = 1.61
Since this ratio is less than 4, we could assume that the variances between the groups are
approximately equal.
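A quick way to apply this rule of thumb in Python (sample values made up; Levene's test is shown as an additional, more formal check):

```python
import numpy as np
from scipy import stats

group1 = np.array([12, 15, 14, 10, 18, 16, 13, 17])
group2 = np.array([11, 14, 13, 12, 15, 13, 10, 16])

# Rule of thumb: ratio of the largest to the smallest sample variance should be < 4
var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
print(f"Variance ratio: {max(var1, var2) / min(var1, var2):.2f}")

# Levene's test: H0 = the groups have equal variances
stat, p = stats.levene(group1, group2)
print(f"Levene: W = {stat:.3f}, p = {p:.3f}")  # p > 0.05 -> equal variances plausible
```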

Assumption 3: Independence
Parametric tests assume that the observations in each group are independent of observations
in every other group.
The easiest way to check this assumption is to verify that the data was collected using a
probability sampling method – a method in which every member of the population has a
known, non-zero probability of being selected for the sample.

Assumption 4: No Outliers
Parametric tests assume that there are no extreme outliers in any group that could adversely
affect the results of the test.
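One common way to screen for extreme outliers (a sketch with made-up data, using the 1.5 x IQR rule that underlies boxplots):

```python
import numpy as np

data = np.array([4.1, 4.5, 3.9, 4.8, 5.0, 4.3, 9.7])  # 9.7 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Potential outliers:", data[(data < lower) | (data > upper)])
```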

Assumption 5: Scale Data


The classical parametric tests make inferences on parameters – means, variances, and the like –
which are, by definition, numbers. To make inferences about numbers, we need data that are
also numbers. Continuous data – interval- or ratio-level data – will work just fine.
ANOVA

What is an analysis of variance?


An analysis of variance (ANOVA) tests whether statistically significant differences exist between
more than two samples.
There are different types of analysis of variance; the most common are the one-way and two-
way analysis of variance, each of which can be calculated either with or without repeated
measurements.

One-factor ANOVA: Independent Measures


The one-way analysis of variance is an extension of the t-test for independent groups. With the
t-test only a maximum of two groups can be compared; this is now extended to more than two
groups.
Does a person's place of residence (independent variable) influence his or her salary?
Non-parametric Counterpart: Kruskal-Wallis test
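A minimal sketch in Python for the place-of-residence example, with invented salary samples for three groups:

```python
from scipy import stats

# Hypothetical monthly salaries for three places of residence
city   = [3200, 3500, 3100, 3800, 3600]
suburb = [2900, 3100, 3000, 3300, 3200]
rural  = [2700, 2600, 2950, 2800, 3000]

# Parametric: one-way ANOVA for independent groups
f_stat, p_anova = stats.f_oneway(city, suburb, rural)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Non-parametric counterpart: Kruskal-Wallis test
h_stat, p_kw = stats.kruskal(city, suburb, rural)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```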

One-factor ANOVA: Repeated Measures


The one-factor analysis of variance with repeated measures is the extension of the t-test for
dependent samples for more than two groups. In a dependent sample, the measured values are
connected. For example, you might be interested to know whether therapy after a slipped disc
has an influence on the patient's perception of pain. For this purpose, you measure the pain
perception before the therapy, in the middle of the therapy and at the end of the therapy. Now
you want to know if there is a difference between the different times.
Non-parametric Counterpart: Friedman test
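A sketch of the pain-therapy example in Python, using statsmodels' AnovaRM for the parametric test and SciPy's Friedman test as the non-parametric counterpart (the pain scores and column names are invented):

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Hypothetical pain scores for 5 patients at three time points
df = pd.DataFrame({
    "patient": list(range(5)) * 3,
    "time":    ["before"] * 5 + ["middle"] * 5 + ["end"] * 5,
    "pain":    [7, 8, 6, 7, 9,  5, 6, 5, 6, 7,  3, 4, 2, 4, 5],
})

# Parametric: one-factor ANOVA with repeated measures
print(AnovaRM(df, depvar="pain", subject="patient", within=["time"]).fit())

# Non-parametric counterpart: Friedman test (one array per time point)
before = df.loc[df.time == "before", "pain"]
middle = df.loc[df.time == "middle", "pain"]
end    = df.loc[df.time == "end", "pain"]
print(stats.friedmanchisquare(before, middle, end))
```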

Two-factor ANOVA: Independent Measures


Two-way (or two factor) analysis of variance tests whether there is a difference between more
than two independent samples split between two variables or factors.
Using the two-factor analysis of variance, you can now answer three things:
• Does factor 1 have an effect on the dependent variable?
• Does factor 2 have an effect on the dependent variable?
• Is there an interaction between factor 1 and factor 2?
For example: Let's say we are interested in studying the effect of two factors, "Treatment" and
"Gender," on the response variable "Blood Pressure." In this example, we have two levels of the
"Treatment" factor (A and B) and two levels of the "Gender" factor (Male and Female). The
"Blood Pressure" measurements are recorded for each participant based on their treatment
and gender.
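A sketch of the Treatment x Gender example in Python with statsmodels (the blood pressure values are invented):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: 2 treatments x 2 genders, 3 participants per cell
df = pd.DataFrame({
    "treatment": ["A"] * 6 + ["B"] * 6,
    "gender":    (["Male"] * 3 + ["Female"] * 3) * 2,
    "bp":        [120, 125, 118, 130, 128, 131, 110, 112, 115, 125, 122, 127],
})

# Two-way ANOVA: main effect of treatment, main effect of gender,
# and the treatment x gender interaction
model = ols("bp ~ C(treatment) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```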
Two-factor ANOVA: Repeated Measures
In contrast to the two-factor analysis of variance without repeated measures, one of the
factors is formed by repeated measurements on the same subjects.
In other words, one factor is a dependent sample.
For example, if you take a sample of people with high blood pressure and measure their blood
pressure before, during and after treatment, this is a dependent sample, because the same
person is measured at different times.

You may want to know if the treatment for high blood pressure has an effect on the blood
pressure. So you want to know if blood pressure changes over time.

But what if you have different therapies and you want to see if there is a difference between
them? You now have two factors, one for the therapy and one for the repeated measurements.
Since you now have two factors and one of the factors is a dependent sample, you use a two-
way repeated measures analysis of variance.
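One way to run such a design in Python is the third-party pingouin package; the sketch below assumes therapy is a between-subjects factor and time a within-subjects factor, with invented data and column names:

```python
import pandas as pd
import pingouin as pg

# Hypothetical data: 2 therapies (between subjects), blood pressure measured
# before, during and after treatment (within subjects)
df = pd.DataFrame({
    "subject": [s for s in range(1, 7) for _ in range(3)],
    "therapy": ["A"] * 9 + ["B"] * 9,
    "time":    ["before", "during", "after"] * 6,
    "bp":      [150, 140, 130, 155, 148, 141, 160, 145, 133,
                152, 150, 149, 158, 154, 151, 149, 147, 146],
})

# Two-way repeated measures (mixed) ANOVA
print(pg.mixed_anova(data=df, dv="bp", within="time", subject="subject",
                     between="therapy"))
```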

CHI-SQUARE TEST

The Chi-square test is a hypothesis test used to determine whether there is a relationship
between two categorical variables.
Categorical variables are, for example, a person's gender, preferred newspaper, frequency of
television viewing, or their highest level of education. So whenever you want to test whether
there is a relationship between two categorical variables, you use a chi-square test.

The chi-square test is a hypothesis test used for categorical variables with nominal or ordinal
measurement scale. The chi-square test checks whether the frequencies occurring in the
sample differ significantly from the frequencies one would expect. Thus, the observed
frequencies are compared with the expected frequencies and their deviations are examined.

There are two types of chi-square test:


• Chi-square test for goodness of fit
• Chi-square test for independence
Chi-square test for goodness of fit: This test, also referred to as one sample chi-square, is often
used to compare the proportion of cases from a sample with hypothesized values. It works on
one categorical variable.

• For example, forty-eight subjects are asked to express their attitude toward the
proposition "Should India join the United Nations for the control of peace in the world?"
by marking F (favorable), I (indifferent) or U (unfavorable). Of the members in the
group, 24 marked F, 12 I, and 12 U. Do these results indicate a significant trend of
opinion?
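Under the null hypothesis of no preference, each category would be expected 48 / 3 = 16 times; a quick check in Python:

```python
from scipy import stats

observed = [24, 12, 12]   # F, I, U
expected = [16, 16, 16]   # 48 subjects spread evenly over 3 categories

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # chi2 = 6.00 with df = 2, p just below 0.05
```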

Chi-square test for Independence: This test is used when you wish to explore the relationship
between two categorical variables. Each of these variables can have two or more categories.
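A sketch of a test of independence on a made-up 2x2 contingency table (for example gender vs. preferred newspaper):

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = gender, columns = newspaper A / B
table = np.array([[30, 20],
                  [15, 35]])

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
print("Expected frequencies:\n", expected)
```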

CORRELATION

Correlation analysis is a statistical technique that gives you information about the relationship
between variables. The strength of the correlation is determined by the correlation coefficient,
which varies from -1 to +1. Correlation analyses can thus be used to make a statement about
the strength and direction of the correlation.

For example: You want to find out whether there is a connection between the age at which a
child speaks its first sentences and its later success at school.

Correlation and causality


If the correlation analysis shows that two characteristics are related to each other, it can
subsequently be checked whether one characteristic can be used to predict the other. If the
correlation mentioned in the example above is confirmed, it can then be checked, by means of
a linear regression, whether school success can be predicted from the age at which a child
speaks its first sentences.

But beware! Correlations need not be causal relationships.

With the help of correlation analysis two statements can be made:


• one about the direction
• and one about the strength
of the linear relationship between two metric or ordinally scaled variables. The direction
indicates whether the correlation is positive or negative, while the strength indicates whether
the correlation between the variables is strong or weak.

Positive Correlation: The correlation is said to be positive if the values of the two variables
change in the same direction; the correlation coefficient then lies between 0 and 1, i.e. it takes a
positive value. Example: height and weight.
Negative Correlation: The correlation is said to be negative when the values of the two
variables change in opposite directions. In this case, the correlation coefficient is between -1
and 0, so it assumes a negative value. Example: product price and sales quantity.

Strength of correlation
With regard to the strength of the correlation coefficient r, the following table can be used as a
guide:
|r|              Strength of correlation
0.0 ≤ |r| < 0.1  no correlation
0.1 ≤ |r| < 0.3  little correlation
0.3 ≤ |r| < 0.5  medium correlation
0.5 ≤ |r| < 0.7  high correlation
0.7 ≤ |r| ≤ 1.0  very high correlation
Scatter plot and correlation
A scatter diagram is a graph of observed plotted points where each point represents the values
of X and Y as a coordinate pair. It portrays the relationship between these two variables graphically.
Pearson Correlation
Pearson correlation analysis examines the relationship between two variables. For example, is
there a correlation between a person's age and salary? More specifically, we can use the
Pearson correlation coefficient to measure the linear relationship between two variables.

Calculate Pearson correlation


The Pearson correlation coefficient is calculated using the following equation.
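For two variables x and y with n paired observations, the standard formula is

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}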

Assumptions of the Pearson correlation


• To calculate the Pearson correlation coefficient, both variables must be metric. Metric
variables are, for example, a person's weight, a person's salary or electricity consumption.
• The two variables must also be normally distributed.

If the assumptions are not met, Spearman's rank correlation can be used.
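A short sketch in Python (made-up age and salary values) showing the Pearson coefficient and the Spearman fallback:

```python
from scipy import stats

age    = [23, 30, 35, 41, 47, 52, 58]
salary = [2100, 2600, 3100, 3300, 3900, 4200, 4500]

# Pearson correlation (assumes metric, normally distributed variables)
r, p = stats.pearsonr(age, salary)
print(f"Pearson r = {r:.2f}, p = {p:.4f}")

# If the assumptions are not met: Spearman's rank correlation
rho, p_s = stats.spearmanr(age, salary)
print(f"Spearman rho = {rho:.2f}, p = {p_s:.4f}")
```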
Spearman's rank correlation coefficient
The Spearman rank correlation examines the relationship between two variables, being the
non-parametric counterpart of Pearson's correlation. Therefore, in this case, a normal
distribution of the data is not required.
Spearman correlation uses the ranks of the data rather than the original data, hence the name
rank correlation.

Spearman Correlation Equation


If there are no rank ties, the following simplified equation can also be used to calculate the Spearman correlation.
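With d_i denoting the difference between the two ranks of observation i and n the number of pairs:

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}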
Kendall's Tau

Kendall's rank correlation is a non-parametric test procedure. For the calculation of
Kendall's tau, the data need not be normally distributed and the two variables only need to
have an ordinal scale level.

Kendall's tau is very similar to Spearman's rank correlation coefficient.

However, Kendall's Tau should be preferred over Spearman's correlation when there is very
little data and many rank ties!

Calculate Kendall's Tau


We can calculate Kendall's tau with the following formula:
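Assuming n paired observations (so that there are n(n - 1)/2 pairs in total):

\tau = \frac{C - D}{\tfrac{1}{2} n (n - 1)}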

Where C is the number of concordant pairs and D is the number of discordant pairs.
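In Python (a sketch with made-up ordinal ratings that contain rank ties):

```python
from scipy import stats

# Hypothetical ordinal ratings from two raters (note the rank ties)
rater1 = [1, 2, 2, 3, 3, 4, 5]
rater2 = [1, 1, 2, 3, 4, 4, 5]

tau, p = stats.kendalltau(rater1, rater2)
print(f"Kendall's tau = {tau:.2f}, p = {p:.4f}")
```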
Point Biserial Correlation

Point biserial correlation is a special case of Pearson correlation and examines the relationship
between a dichotomous variable and a metric variable.

A dichotomous variable is a variable with exactly two values, for example gender with male and
female or smoking status with smoker and non-smoker. A metric variable is, for example, a
person's weight or a person's salary.

Calculate point biserial correlation

Let's say we want to study the correlation between the number of hours spent learning for an
exam and the exam result (pass/fail).

To calculate the point biserial correlation, we first need to convert the exam result into numbers.
We can assign a value of 1 to the students who passed the test and 0 to the students who failed
the test.
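A sketch of this example in Python (hours and results invented):

```python
from scipy import stats

hours  = [2, 4, 5, 7, 8, 10, 12, 14]   # hours spent learning
passed = [0, 0, 0, 1, 0, 1, 1, 1]      # 1 = passed, 0 = failed

r_pb, p = stats.pointbiserialr(passed, hours)
print(f"Point biserial r = {r_pb:.2f}, p = {p:.4f}")
```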
Partial Correlation

Partial correlation calculates the correlation between two variables while excluding the effect
of a third variable. This makes it possible to find out whether the correlation rxy between
variables x and y is produced by the variable z.

The partial correlation rxy,z tells how strongly the variable x correlates with the variable y when
the correlation of both variables with the variable z is partialled out.

Calculate Partial Correlation


For the calculation of the partial correlation, the three correlations between the individual
variables are required. The partial correlation then results in
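r_{xy,z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)\,(1 - r_{yz}^2)}}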

• rxy = correlation between variables x and y
• rxz = correlation of the third variable z with the variable x
• ryz = correlation of the third variable z with the variable y
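A small sketch in Python that applies this formula directly (the three pairwise correlations are made-up values):

```python
from math import sqrt

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with the influence of z removed."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical pairwise correlations
print(partial_correlation(r_xy=0.55, r_xz=0.40, r_yz=0.35))
```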

Example:
After controlling for the effects of socially desirable responding bias, is there still a relationship
between optimism and life satisfaction?
REGRESSION ANALYSIS
Linear Regression analysis is used to create a model that describes the relationship between a
dependent variable and one or more independent variables. Depending on whether there are
one or more independent variables, a distinction is made between simple and multiple linear
regression analysis.

Example: Simple Linear Regression


Does the height have an influence on the weight of a person?
Example: Multiple Linear Regression
Do the height and gender have an influence on the weight of a person?

In linear regression, an important prerequisite is that the measurement scale of the dependent
variable is metric and a normal distribution exists. If the dependent variable is categorical, a
logistic regression is used.

Simple Linear Regression


The goal of a simple linear regression is to predict the value of a dependent variable based on
an independent variable. The greater the linear relationship between the independent variable
and the dependent variable, the more accurate is the prediction.
Visually, the relationship between the variables can be shown in a scatter plot. The greater the
linear relationship between the dependent and independent variables, the more the data
points lie on a straight line.

In linear regression analysis, a straight line is drawn in the scatter plot. To determine this
straight line, linear regression uses the method of least squares.
The regression line can be described by the following equation:
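In the usual notation, with b_0 as the intercept and b_1 as the slope of the line:

\hat{y} = b_0 + b_1 \cdot x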

Multiple Linear Regression


Unlike simple linear regression, multiple linear regression allows more than one independent
variable to be considered. The goal is to estimate a variable based on several other variables.
The variable to be estimated is called the dependent variable (criterion). The variables that are
used for the prediction are called independent variables (predictors).

Marketing example:
For a video streaming service, you want to predict how many times a month a person streams
videos. For this you are given a data set of user data (age, income, gender, ...).

Medical example:
You want to find out which factors have an influence on the cholesterol level of patients. For
this purpose, you analyze a patient data set with cholesterol level, age, hours of sport per week
and so on.
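A sketch of the medical example with statsmodels (the patient data are invented):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical patient data
df = pd.DataFrame({
    "cholesterol": [180, 195, 210, 240, 225, 250, 200, 230],
    "age":         [34, 41, 45, 58, 52, 63, 39, 55],
    "sport_hours": [5, 4, 3, 1, 2, 0, 6, 1],
})

# Multiple linear regression: cholesterol predicted from age and weekly sport hours
model = smf.ols("cholesterol ~ age + sport_hours", data=df).fit()
print(model.summary())
```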

Multiple Regression vs. Multivariate Regression

Multiple regression should not be confused with multivariate regression. In the former case,
the influence of several independent variables on a dependent variable is examined. In the
second case, several regression models are calculated to allow conclusions to be drawn about
several dependent variables. Consequently, in a multiple regression, one dependent variable is
taken into account, whereas in a multivariate regression, several dependent variables are
analyzed.
What is the p-value?
The p-value indicates the probability that the observed result or an even more extreme result
will occur if the null hypothesis is true.

The p-value is used to decide whether the null hypothesis is rejected or retained (not rejected).
If the p-value is smaller than the defined significance level (often 5%), the null hypothesis is
rejected, otherwise not.
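As an illustration of this decision rule (two made-up samples compared with a t-test):

```python
from scipy import stats

group_a = [5.1, 4.8, 5.6, 5.3, 4.9, 5.7]
group_b = [4.2, 4.5, 4.1, 4.8, 4.4, 4.0]

t, p = stats.ttest_ind(group_a, group_b)
alpha = 0.05  # significance level defined before the test
print(f"p = {p:.4f}")
print("Reject H0" if p < alpha else "Retain H0")
```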

Significance level
The significance level is determined before the test. If the calculated p-value is below this
value, the null hypothesis is rejected, otherwise it is retained. As a rule, a significance level of 5
% is chosen.

• p < 0.01 : very significant result.
• p < 0.05 : significant result.
• p > 0.05 : not significant result.

The significance level thus indicates the probability of a Type I error. What does this mean? If
the significance level is set at 5% and the null hypothesis is rejected, there is at most a 5%
probability of rejecting a null hypothesis that is actually true, i.e. a 5% risk of making a Type I
error. If the significance level is reduced to 1%, this error probability is accordingly only 1%, but
it also becomes more difficult to confirm the alternative hypothesis.
