


Question Asked 18th Aug, 2014

Simin Mahinrad
Northwestern University

Is linear regression valid when the outcome (dependent variable) is not normally distributed?

I am performing linear regression analysis in SPSS, and my dependent variable is not normally distributed. Could anyone tell me whether the results are valid in such a case? If not, what are the possible solutions?

Thank you in advance



SPSS Normal Distribution Linear Regression Regression Analysis Epidemiology SPSS 20



Most recent answer

Adrian Olszewski 3rd Sep, 2020


2KMM CRO

There is no need to fix normality at all. It's a matter of choosing an appropriate model that handles it naturally.

The Generalized Linear Model lets one forget about transforming the DV entirely and consign that practice to history, for the mathematical reasons I have given.

That is, of course, assuming the standard linear model isn't sufficient. Note that the standard model has no assumption about the normality of the DV itself - the assumption concerns conditional normality, i.e. normality of the residuals (equivalent terms). But if there is some kind of mean-variance relationship, if the DV is truncated, integer-valued, fractional (like 0-1 percentages) or categorical (like Likert items), or if the DV is known to be conditionally skewed, the GLM is the way to go.
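For illustration, here is a minimal sketch (Python/statsmodels, simulated data; the names df, x1, x2 and y are placeholders, not from this thread) of fitting a gamma GLM with a log link to a skewed, strictly positive outcome instead of transforming it:

```python
# Minimal sketch (not from the thread): gamma GLM with a log link as an
# alternative to transforming a positive, right-skewed outcome.
# All variable names (df, x1, x2, y) are placeholders on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
mu = np.exp(0.5 + 0.8 * df["x1"] - 0.3 * df["x2"])     # conditional mean
df["y"] = rng.gamma(shape=2.0, scale=mu / 2.0)          # skewed, strictly positive DV

ols_fit = smf.ols("y ~ x1 + x2", data=df).fit()
glm_fit = smf.glm("y ~ x1 + x2", data=df,
                  family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(ols_fit.params)   # additive effects on the raw scale
print(glm_fit.params)   # effects on the log of the conditional mean
```

No transformation of y is involved: the link function acts on the conditional mean, so interpretation stays on the original scale of the response.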

If there are also problems with heterogeneity of variance, there are various workarounds:

- marginal models: a linear model fit with generalized least squares (GLS) estimation (where you can define the correlation pattern), or GEE - combined with IPW (inverse probability weighting) when there are missing data (MAR) - which generalizes the GLM and relaxes the need for normality of the residuals. If a residual covariance structure other than unstructured is chosen for the GLS, empirical standard errors (e.g. sandwich, CR2 or CR3) have to be used rather than model-based ones.

- conditional (random-effects) models, like general and generalized mixed-effects models.

Of all the above, GEE has the fewest assumptions, but it requires more data, is less flexible, and is of course a marginal model. For the general linear model the marginal estimates match the conditional ones, but not in the generalized case (e.g. logistic regression), so one has to decide.

And if none of the above works, there are also GAMs (generalized additive models) and quantile (mixed-effects) regression.

Please look here at my diagram for reference:

https://www.dropbox.com/s/5a8w8kckyfeaix0/statistical%20models%20-%20diagram.pdf

There is also a plethora of non-parametric tests outperforming the old Kruskal-Wallis and Friedman methods, if one just needs some kind of main/interaction-effect testing in the spirit of ANOVA (which, by the way, is nothing but a sequence of likelihood tests, so it can be applied to any model for which a residual variance or deviance can be defined):

https://www.quora.com/Is-there-any-reliable-non-parametric-alternative-to-two-way-ANOVA-in-
biostatistics/answer/Adrian-Olszewski-1?ch=10&share=2dada943&srid=MByz

Please find a reference on those non-parametric methods in literature below:

https://shrib.com/?v=nc#MountainGoat8PNVVXm

Cite 1 Recommendation


All Answers (78)

Michael S Martin 18th Aug, 2014


University of Ottawa

There are whole textbook chapters on this issue, so it's hard to cover fully, but here's a short answer whose issues you may want to explore further.

Technically speaking, the results are not valid, though often, if your sample is large enough and the deviation from normality is not too big, your results should be reasonably close to what you would obtain if you weren't violating the assumptions of the test. That said, you should examine the various diagnostics that SPSS and other software offer - residual plots, leverage plots/statistics, box plots (for outliers) - to see if there are other issues in your data.

Specific to violations of normality, you can also transform your dependent variable (log and square-root transformations are common, though which is appropriate depends on the distribution of your outcome variable) and compare your results. With transformed variables the results are harder to interpret, since they are no longer in the units in which you measured the variable, so if the results are similar you'll often present the untransformed results for ease of interpretation, with a note that you compared them to those obtained with the appropriate transformation.
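A minimal sketch of that comparison, in Python/statsmodels on simulated data (the names y and x are placeholders):

```python
# Minimal sketch of the comparison described above (simulated data, placeholder
# names): fit the model on the raw DV and on a log-transformed DV, then compare.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=150)})
df["y"] = np.exp(0.2 * df["x"] + rng.normal(scale=0.5, size=150))   # right-skewed DV

raw_fit = smf.ols("y ~ x", data=df).fit()
log_fit = smf.ols("np.log(y) ~ x", data=df).fit()   # patsy applies np.log inside the formula

# If both lead to the same substantive conclusion, one might report the raw-scale
# model for interpretability and mention the transformed check in a note.
print(raw_fit.pvalues["x"], log_fit.pvalues["x"])
```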

Hope this helps

Cite 3 Recommendations

Bruce Weaver 18th Aug, 2014


Lakehead University Thunder Bay Campus

Here are a few observations.

1. The normality assumption for linear regression applies to the errors, not the outcome variable per se (and most
certainly not to the explanatory variables). The usual statement is that the errors are i.i.d. (i.e., independently and
identically distributed) as Normal with a mean of 0 and some variance. Independence and homoscedasticity are
more important assumptions than normality.

2. As George Box famously noted: “…the statistician knows…that in nature there never was a normal distribution,
there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive
results which match, to a useful approximation, those found in the real world.” (JASA, 1976, Vol. 71, 791-799)
Therefore, the normality assumption will never be exactly true when one is working with real data.

3. Non-normality of the errors will have some impact on the precise p-values of the tests on coefficients etc. But if
the distribution is not too grossly non-normal, the tests will still provide good approximations.

4. As Michael suggested, it is useful to look at diagnostics, including residual plots. But note the distinction
between residuals and errors (see link below). The former are observable, whereas the latter are not. (I would
also look at measures of influence, such as Cook's distance.)

HTH.

p.s. - Of course, depending on the nature of your outcome variable, some other form of regression may be far
more appropriate--e.g., Poisson or Negative Binomial regression for analysis of count variables.

http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics
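As a hedged illustration of the diagnostics in point 4, here is a short Python/statsmodels sketch on simulated data (names are placeholders) that draws a normal Q-Q plot of the residuals and computes Cook's distance:

```python
# Hedged sketch of the diagnostics in point 4 (simulated data, placeholder names):
# a normal Q-Q plot of the residuals plus Cook's distance as an influence measure.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = pd.DataFrame({"x": rng.normal(size=100)})
data["y"] = 1.0 + 2.0 * data["x"] + rng.standard_t(df=3, size=100)   # heavy-tailed errors

fit = smf.ols("y ~ x", data=data).fit()

sm.qqplot(fit.resid, line="45", fit=True)        # residuals against normal quantiles
plt.title("Q-Q plot of residuals")

cooks_d = fit.get_influence().cooks_distance[0]  # one value per observation
print("Largest Cook's distance:", cooks_d.max())
plt.show()
```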

Cite 64 Recommendations

Ronán Michael Conroy 18th Aug, 2014


Royal College of Surgeons in Ireland

A linear regression is valid whenever you can ask the question "What is the increase in the predicted variable for a
one-unit increase in the predictor?"

This implies that the effect of the predictor will be the same throughout the range of the predicted variable. And this
may be an untenable assumption.

But, of course, there are many linear models, and part of the solution is going to be choosing a linear model that
matches the research question. If you are counting, say, episodes of illness, then a Poisson model or related
model will be more informative than least-squares regression.
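A minimal sketch of such a count model (Python/statsmodels, simulated data; the names episodes and age are illustrative only, not from the thread):

```python
# Minimal sketch of a count model: a Poisson GLM with a log link.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({"age": rng.uniform(20, 70, size=n)})
rate = np.exp(-2.0 + 0.03 * df["age"])
df["episodes"] = rng.poisson(rate)                       # counts of illness episodes

pois_fit = smf.glm("episodes ~ age", data=df, family=sm.families.Poisson()).fit()
# exp(coefficient) is the multiplicative change in the expected count per unit of age.
print(np.exp(pois_fit.params))
```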

But first decide if the theory behind your research allows you to ask a linear question.

Cite 4 Recommendations

Harald Lang 18th Aug, 2014


KTH Royal Institute of Technology

Just to emphasize Bruce's most important message: it is the error terms that should be normal, not the dependent variable.

Also, if you have a reasonably large number of observations, then homoskedasticity (equal variances of the residuals) is more important than normality.

If you worry about heteroskedasticity (unequal variances), then employ "robust errors" (this will not influence the point estimates of the coefficients, only the standard errors and confidence intervals).
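A small sketch of the "robust errors" idea in Python/statsmodels (simulated heteroskedastic data, placeholder names; HC3 is one common choice of robust covariance):

```python
# Same OLS coefficients, heteroskedasticity-consistent (HC3) standard errors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.uniform(0, 5, size=200)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(scale=0.3 + 0.4 * df["x"])   # variance grows with x

plain = smf.ols("y ~ x", data=df).fit()
robust = smf.ols("y ~ x", data=df).fit(cov_type="HC3")

print(plain.params.equals(robust.params))   # True: identical point estimates
print(plain.bse)                            # model-based standard errors
print(robust.bse)                           # robust standard errors
```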

Cheers -- Harald

Cite 21 Recommendations

Sean Clouston 18th Aug, 2014


Stony Brook University

The basic answer is no. If you think/know the outcome is not normally distributed, then it's not okay to use OLS
(without correcting for that). The choice of alternative depends on the distribution that you have.

Cite 1 Recommendation

Seyi Ajao 18th Aug, 2014


The Federal Polytechnic, Ado-Ekiti, Nigeria.

Since the dependent variable is not normally distributed, you can transform it or use a non-parametric alternative.

Cite 1 Recommendation

Zeljko Pedisic 18th Aug, 2014


Victoria University Melbourne

Dear Simin,

It is a common misconception that the outcome variable in linear regression needs to be normally distributed. Only the residuals need to be normally distributed.

In SPSS, you can check the normality of the residuals using a histogram and a P-P plot of standardized residuals (Analyze--Regression--Plots--Standardized Residual Plots--Histogram & Normal probability plot).

Cheers,

Zeljko

Cite 4 Recommendations

Simin Mahinrad 19th Aug, 2014


Northwestern University

Thank you so much for your answers and help.

I think I'll first check the normality of my residuals...

Cite

James R Knaub 19th Aug, 2014


N/A

Simin - You will find that in perhaps most cases there is heteroscedasticity in your residuals, which is especially true in regression through the origin. People often transform the data so that they can use OLS and its hypothesis tests. In my opinion, this is unnecessary and not very useful. The two attachments here are notes on WLS regression (sorry, the notation is not good on the first page), and a letter on hypothesis tests. I think it is often better to use confidence intervals. Note that normality does not always hold there either, but the standard errors are important and can be worked with.

Article Properties of Weighted Least Squares Regression for Cutoff S...

Article Practical Interpretation of Hypothesis Tests - letter to the...
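As a rough illustration of WLS (not taken from the attached papers), a Python/statsmodels sketch in which the error standard deviation is assumed proportional to the predictor, so the weights are proportional to 1/x**2:

```python
# Rough WLS sketch on simulated heteroscedastic data (placeholder names).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=150)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)     # error SD proportional to x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weights = 1/variance (up to a constant)

print(ols_fit.bse)   # ignores the variance structure
print(wls_fit.bse)   # accounts for it in the error structure
```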

Cite 1 Recommendation

Simin Mahinrad 20th Aug, 2014


Northwestern University

Dear all,

I was just wondering: what should the P-P plot look like when the residuals are normally distributed?

Cite 1 Recommendation

Michael S Martin 20th Aug, 2014


University of Ottawa

Simin,

The P-P (or Q-Q) plot should be linear, with all the points lying along the line. If the points start curving away from the line at one end, for example, then your residuals don't follow a normal distribution.

Cite 1 Recommendation

James R Knaub 20th Aug, 2014


N/A

Simin - You can look at the distribution of residuals to study both nonlinearity and heteroscedasticity. Some
econometrics books could be helpful - say by Maddala, for instance. Also, Carroll and Ruppert, Transformation
and Weighting in Regression (I think), Chapman and Hall, 1988, later CRC Press - I think. - Anyway, that was a
good, insightful question. But I think nonlinearity and heteroscedasticity are what you are after. - Jim

Cite 2 Recommendations

Richard Anthony Champion Jr 20th Aug, 2014


City College of San Francisco

There are two major points here.

The first is that regression fits a line using a least squares criterion that minimizes residuals. This does not make
any assumptions regarding the probability distribution of the residuals.

The second point is that the assumption of normality is used to compute confidence intervals. For this to be meaningful, you need to demonstrate characteristics such as normality, identical distribution, and independence.

I suggest finding, and then plotting, the regression line and these residuals.

Then you can decide what to do next.

Cite 1 Recommendation

Islam F. Bourini 21st Aug, 2014


Al falah university

Yes, there are newer tools for non-linear relationships and non-normally distributed data.

PLS-SEM (partial least squares structural equation modeling) is considered a newer tool for exploratory studies. PLS can be used under conditions such as:

1. Exploratory research purposes.

2. Non-normal distributions.

3. Small sample sizes.

4. The theory does not fully fit the theoretical model.

5. Non-linear relationships (quadratic and polynomial relationships).

For further information i suggest to read the following papers:

Kock, N., and Lynn, G. S. (2012). Lateral Collinearity and Misleading Results in Variance-Based SEM: An
Illustration and Recommendations. Journal of the Association for Information System, 13(7), 546-580.

Hair Jr, J. F., Hult, G. T. M., Ringle, C., & Sarstedt, M. (2013). A primer on partial least squares structural equation
modeling (PLS-SEM). SAGE Publications, Incorporated.

Hair, J. F., Ringle, C. M., and Sarstedt, M. (2011). PLS-SEM: Indeed a Silver Bullet. Journal of Marketing Theory
and Practice, 19(2), 139-151.

Hair, J. F., Sarstedt, M., Ringle, C. M., and Mena, J. A. (2012). An Assessment of the Use of Partial Least Squares
Structural Equation Modeling in Marketing Research. Journal of the Academy of Marketing Science, 40, 414-433.

All the best

Cite 1 Recommendation

Julie Slater 21st Aug, 2014


Swansea University

There's a really good paper which explains this quite clearly: Xiang Li, Wanling Wong, Ecosse L. Lamoureux & Tien Y. Wong (2012), in Investigative Ophthalmology & Visual Science; the title of the paper is essentially the same as your question.

As ever in statistics there are no black and white answers to your question and you have to use your judgement
based on advice from others and your own analysis following that. Basically, I have found that it is OK to
undertake regression on non-normal DVs as long as the sample sizes are large enough - these should have been
determined by sample power calculation.

Good luck with your analysis - I hope you find a sensible solution

Cite 3 Recommendations

Raid Amin 22nd Aug, 2014


University of West Florida

I recommend using the normal quantile transformation, also called normal scores. This is a well-documented approach in the literature, and one of the first versions of it was by Van der Waerden (it can be Googled). It is available in SAS under proc rank. Basically, the raw data are arranged in order from smallest to largest, and then each observation is mapped onto a standard normal curve, so x(i) becomes z(i), where z ~ N(0,1). This is a robust procedure rather than a true nonparametric procedure, but it does the job very well. In addition to overcoming the normality issue, you also get added clarity in understanding the outcomes, since each z(i) is simply a deviation from the mean in standard units. Also, in multivariate analysis, if one of the variables is much larger in magnitude than the rest of the variables, it can dominate the analysis. Such scores will "make the variables equal in weight".

Here is a link for the background:

http://en.wikipedia.org/wiki/Van_der_Waerden_test
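A small Python sketch of the normal-scores idea on simulated data (ranks mapped to standard-normal quantiles, mirroring what the answer says proc rank provides in SAS):

```python
# Normal-scores (Van der Waerden) transformation: ranks -> standard-normal quantiles.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.exponential(scale=2.0, size=100)               # skewed raw data

ranks = stats.rankdata(y)                              # 1..n, ties averaged
normal_scores = stats.norm.ppf(ranks / (len(y) + 1))   # approximately N(0, 1) scores

print(stats.skew(y), stats.skew(normal_scores))        # skewness before vs. after
```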

Cite

Josh Makore 25th Aug, 2014


Department of Agricultural Research - Botswana

Normality only becomes important when you do statistical tests (inference). You can still do the simple linear least-squares estimation without the inference. However, the suggestions by the other writers should allow you to take the further step of testing significance.

Cite

Erik Mønness 25th Aug, 2014


Inland Norway University of Applied Sciences

When the dependent variable is a dichotomy, one is often advised to use logistic regression instead of OLS. However, the results are often practically the same anyway. For a non-statistical audience, OLS is easier to understand. See

Hellevik, O. (2009). Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity, 43, 59–74. DOI 10.1007/s11135-007-9077-3

Cite 2 Recommendations

Raid Amin 25th Aug, 2014


University of West Florida

Simin,

Could you give us more information on your experiment? Else, there will be lots of guessing going on here.

Cite

Bruce Weaver 25th Aug, 2014


Lakehead University Thunder Bay Campus

Following up on Erik's post, with logistic regression, you get the odds ratio as your summary measure, whereas
OLS linear regression with a dichotomous outcome gives you the risk difference. There's nothing wrong with that.
But note that the standard errors are probably not going to be correct. You can fix that by using a robust SE. See
the paper by Cheung (2007), for example (link below).

HTH.

http://aje.oxfordjournals.org/content/166/11/1337.full
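A brief sketch of that comparison in Python/statsmodels, on simulated data with an illustrative binary outcome (placeholder names): OLS with robust HC3 standard errors estimates the risk difference, while logistic regression gives the odds ratio.

```python
# Linear probability model with robust SEs vs. logistic regression (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({"exposed": rng.integers(0, 2, size=n)})
p = 0.2 + 0.15 * df["exposed"]                     # true risk difference = 0.15
df["event"] = rng.binomial(1, p)

lpm = smf.ols("event ~ exposed", data=df).fit(cov_type="HC3")
logit = smf.glm("event ~ exposed", data=df, family=sm.families.Binomial()).fit()

print(lpm.params["exposed"])             # risk difference (about 0.15)
print(np.exp(logit.params["exposed"]))   # odds ratio
```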

Cite 2 Recommendations

Yuanzhang Li 26th Aug, 2014


Walter Reed Army Institute of Research

Simin,

Model selection depends on the hypothesis being tested and the data structure. As you mentioned, your outcome (dependent) variable should be continuous. You do not check the distribution of the dependent variable itself; you check it in the regression process. As many answers have mentioned, it is the residuals that should be independent and identically normally distributed, not the outcome itself. You also need to check the collinearity of the independent factors, which should be independent and measured without error, and the relationship between the outcome and the independent factors, which should be linear by assumption. The whole topic is MODEL building; the process is:

1. Check the type of each cofactor; if some have many missing values or typos, correct them or leave them out. Then, for continuous factors, check the Pearson correlation coefficients; if the Pearson correlation between any two is near 1 or -1, one of them should be dropped from the multiple regression. For categorical factors, check the independence of any two; if most counts fall on the diagonal of the contingency table, one of the two categorical variables should be dropped.

2. Draw scatter plots of the continuous outcome against each quantitative independent factor to see the association: if a linear trend is shown, the factor is in; if a non-linear effect is shown, a transformation is needed; if there is no trend (the plot looks random), the independent factor can be left out.

3. The observations of the dependent variable are supposed to be independent. If the dependent variable is related to time, check the autocorrelation; if autocorrelation exists, time-series modeling (say, an autoregressive model) might be used.

4. Do a PCA to see whether there is still multicollinearity among the independent factors; if some eigenvalue is near zero, you may drop one of the factors or define a new factor (a transformation).

5. If the sample size is large enough - say, at least 10 times the number of unknown parameters - you can do multiple regression; you may use an automatic selection option, such as forward, backward, or best subsets, which will select independent factors for you.

6. Number of parameters: the intercept counts as 1, each continuous factor counts as 1, and a categorical factor with k levels counts as k-1.

7. Check for outliers by leverage, Cook's D, or residuals; if any exist, you may delete them or fit the model both with and without the outliers.

8. Check the normality of the residuals from the multivariable regression; if it is violated, apply a transformation: if variance homogeneity holds, transform some independent factor; if variance homogeneity does not hold, transform the dependent factor. This may improve your model fit.

9. You may use Akaike's information criterion, the Bayesian information criterion, or Mallows' Cp to decide how many factors should be included. Using them is better than comparing R-squared values.

10. Check for interactions among the independent factors. An interaction between two quantitative predictors means there is a joint effect: the effect of one factor varies across the levels of the other. An interaction between a quantitative factor and a categorical factor, say gender, means the effect (slope) of the quantitative factor differs between males and females.

If the outcome is categorical, you may consider logistic regression or another suitable model.
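As a small illustration of the collinearity screening in steps 1 and 4, a Python/statsmodels sketch (simulated predictors with placeholder names x1-x3) computing pairwise correlations and variance inflation factors:

```python
# Collinearity screening: pairwise correlations plus VIFs.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)     # nearly collinear with x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr())                              # Pearson correlations near +/-1 are a warning
X_const = sm.add_constant(X)
vifs = [variance_inflation_factor(X_const.values, i)
        for i in range(1, X_const.shape[1])] # skip the constant column
print(dict(zip(X.columns, vifs)))
```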

Cite 6 Recommendations

Jana A Hirsch 26th Aug, 2014


Drexel University

For sufficiently large samples, violations of normality in the outcome may not be an issue:

Diehr P, Lumley T. The importance of the normality assumption in large public health data sets. Annual Review of
Public Health. 2002;23:151-169.

"The t-test and least-squares linear regression do not require any assumption of Normal distribution in sufficiently
large samples. Previous simulations studies show that “sufficiently large” is often under 100, and even for our
extremely non-Normal medical cost data it is less than 500. This means that in public health research, where
samples are often substantially larger than this, the t-test and the linear model are useful default tools for analyzing
differences and trends in many types of data, not just those with Normal distributions. Formal statistical tests for
Normality are especially undesirable as they will have low power in the small samples where the distribution
matters and high power only in large samples where the distribution is unimportant."

http://www.annualreviews.org/doi/full/10.1146/annurev.publhealth.23.100901.140546

Cite 8 Recommendations

Kylie Rixon 27th Aug, 2014


University of the Sunshine Coast

You could try transforming the DV until it is as close to normal as you can get it, re-run your model, and then compare the results. If the transformed variable gives you results that lead to the same interpretation and conclusions about the data, then it's probably a pretty robust relationship.

Cite 1 Recommendation

Thiago Augusto Da Cunha 15th Sep, 2014


Universidade Federal do Acre

Simin! Since a normal distribution is defined by the area under the famous bell-shaped curve, it necessarily applies to continuous raw data as the dependent variable. If instead the raw data (dependent variable) are on another scale of measurement (like counts of fruit), we have to consider another distribution to analyze a possible relationship, such as a Poisson distribution. If the dependent variable you measured has binomial behavior (e.g. death or no death), you should consider a binomial distribution (or a negative binomial distribution if your data are zero-inflated).

So, if your raw data (DV) are a continuous variable, you can START the analysis. First of all, you do not need to check the assumption of normality of the raw data (as colleagues have mentioned).

I usually consider the following steps to building a model:

a) Scatter plots of the dependent variable against each independent variable. These plots will help you select potential independent variables to consider for the model. Be careful to analyze each plot in the context of biological expectation (e.g. we expect trees to increase in height as diameter increases; we do not expect a tree with a diameter of 1 meter to have a height of 5 meters);

b) A correlation analysis (Pearson correlation) will give you a measure of the mathematical relationship. However, a highly correlated explanatory variable will not necessarily enter the regression model significantly;

c) Check the assumptions of the regression for the errors.

d) other steps ...

If you use SAS System I can send you a Datajob to verify the regression assumptions.

Best regards.

Cite 1 Recommendation

Yuanzhang Li 15th Sep, 2014


Walter Reed Army Institute of Research

Thiago discussed model selection and diagnosis with you. The common process for regression is as follows:

Model selection depends on the hypothesis being tested and the data structure. As you mentioned, your outcome (dependent) variable should be continuous. You do not check the distribution of the dependent variable itself; you check it in the regression process. As I mentioned earlier, it is the residuals that should be independent and identically normally distributed, not the outcome itself. You also need to check the collinearity of the independent factors, which should be independent and measured without error, and the relationship between the outcome and the independent factors, which should be linear by assumption. The whole topic is a MODEL building and diagnosis problem; there are many ways to do it, and given your description I think it will be straightforward. Assume the dependent variable is continuous and normally distributed, as you stated.

(The ten model-building steps are the same as those listed in my earlier answer of 26th Aug, above.)

Yuanzhang li

Cite

Erik Mønness 16th Sep, 2014


Inland Norway University of Applied Sciences

Lots of good advice above.

Ordinary least squares (OLS) estimators are unbiased regardless of the error distribution (apart from the independence requirement), so the estimates are valid. Equality of variances is an issue for the t-test, but experience shows that the t-test (with enough observations!) performs reasonably even when the outcome is dichotomous. Also, regression estimates are themselves weighted means, and means tend to be nearly normally distributed even with a modest number of observations.

See, for example: Hellevik, O. (2009). Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity, 43, 59–74. DOI 10.1007/s11135-007-9077-3

Cite

Josh Makore 16th Sep, 2014


Department of Agricultural Research - Botswana

Thiago - [If you use SAS System I can send you a Datajob to verify the regression assumptions.] Please email me
the Datajob - thanks.

Simin - you did not tell us the nature of the dependent variable, i.e. whether the observations are autocorrelated (one measurement depends on the next or previous one) or not. This is the independence being referred to in the discussion; if it is not satisfied, ordinary regression will be invalid. Li says to use an autoregressive model.

To everyone: if the residuals, e_i = y_i - ŷ_i, are normally i.i.d., why are the observations, y_i, not also normally i.i.d.? The fitted value ŷ_i is fixed by the experimenter.

Cite 1 Recommendation

Mohammed Suleiman M. Gibreel 14th Oct, 2014


Imam abdulrahman Bin Faisal University

If you have a large sample (n >= 30), your model will not suffer much from normality violations; or you can transform the DV.

good luck

Cite

Athanasios Dermanis 14th May, 2015


Aristotle University of Thessaloniki

Normality has nothing to do with linear regression, except if one wants to stick to the maximum likelihood
estimation principle to justify the use of a least squares solution (and regression is such a solution) with weight
matrix being the inverse of the covariance matrix of the errors (except for a positive scalar factor).

Maximum likelihood may be attractive to some statisticians, but most applied scientists since the time of Gauss have preferred a different type of optimal estimation: the so-called Best Linear (Uniformly) Unbiased Estimation, BLUUE, or more usually BLUE.

This means that among all linear functions of the observations, we seek as the estimate of any unknown parameter (or even any linear function of the unknowns) the one which is unbiased (i.e. the mean, or expected value, of the estimate equals the true value), uniformly (i.e. whatever the true values of the unknowns are), and best. Best here means that the mean square error of the estimation is minimized. The mean square error is the expected value of the square of the difference between the estimate and the true value.

The famous Gauss-Markov theorem asserts that BLUUE (or simply BLUE) estimates are obtained by the least squares method when the weight matrix used is proportional (up to a positive scalar multiplicative factor) to the inverse of the covariance matrix of the errors, no matter what their probability distribution is.

The BLUE principle is much more attractive than the maximum likelihood one, because we minimize the mean square error, so that different estimates under repetitions of the experiment (with different error outcomes) are more concentrated, and concentrated around the true value. How can maximum likelihood beat that? Furthermore, BLUE estimation applies equally well to cases where the errors are not normally distributed, whereas maximum likelihood (if applicable) then leads to a solution different from the usual least squares (in our case, linear regression) one.

Under normality, both principles give the same answer. So why bother about normality? Not to mention that the normal distribution is a myth. Every scientist who has left the convenience of his office to go out into the field and make repeated measurements under the same conditions knows that very well.

Cite 2 Recommendations

Saima Eman 14th May, 2015


Lahore College for Women University

1- Try data transformation - then check normality

2- If standardised residuals are normal- no worries...

Cite 2 Recommendations

Deleted profile

I read the article Jana Hirsch suggested, and it seems to suggest that normality of the outcome variable is needed, but that, due to the Central Limit Theorem, with sample sizes of more than, say, 500, normality is not a problem. Others here (Bruce Weaver) say don't worry about normality, it is only the residuals that should be normally distributed. So which is it, and why?

Cite 2 Recommendations

James R Knaub 14th Jun, 2015


N/A

Pieter -

The central limit theorem is invoked when you want to look at the distribution of estimated means. Here, in this
application of regression, the salient point is the distribution of the residuals. The y-values can have any
distribution, and actually, I worked for many years, modeling establishment survey continuous data, which are very
highly skewed.

We may often want errors or residuals to be normally distributed, but Simin's dependent variable can have any
distribution.

Cheers - Jim

PS - As indicated elsewhere, heteroscedasticity is a more important consideration. It is naturally occurring and, in my long experience with this, is handled very well with weighted least squares (WLS) regression - with the heteroscedasticity thus accounted for in the error structure. (See the paper found through the link attached.)

Article Properties of Weighted Least Squares Regression for Cutoff S...

Cite 5 Recommendations

Deleted profile

Thank you James! I will have a look at the paper you attached.

Cite

James R Knaub 14th Jun, 2015


N/A

Pieter - I worked with continuous data for that modeling. People may have different points-of-view, but I did a
great deal of work with heteroscedasticity. If this is not an area in which you have much experience, you may want
to start with the Sage Pub Encyclopedia entry attached here.

I have an Australian friend who might use this expression:

Hooroo! - Jim

Article HETEROSCEDASTICITY AND HOMOSCEDASTICITY

Cite 1 Recommendation

Deleted profile
Thanks James!

Cite

Emmanuel Ekpenyong 23rd Jul, 2016


Michael Okpara University of Agriculture, Umudike

As most of my colleagues have rightly said, the assumption of normality applies to the error term, not to the dependent and independent variables.

Cite 1 Recommendation

Nisha Arora 4th Nov, 2016


Freelancer

Firstly, the assumption of normality applies to the error term/residuals and not to the variables (response & predictor). Let's review the assumptions of OLS.

Assumptions of OLS:

Error terms/residuals have expectation zero, are uncorrelated, and have equal variances [Gauss-Markov theorem]
The error term is uncorrelated with each of the predictor variables
No multicollinearity
No autocorrelation

Here, I have not written 'normality of residuals' as an assumption of OLS. Even if the residuals are not normal, the best linear unbiased estimator (BLUE) of the coefficients is still given by the ordinary least squares (OLS) estimator, but non-normality will create problems for statistical inference. That's why it is sometimes included in the assumptions.

In the case of violations of normality, statistical tests for determining whether model coefficients are significantly different from zero, and confidence intervals calculated for forecasts, may not be reliable.

Read more about it in my answer at

http://stats.stackexchange.com/questions/243705/estimate-generalized-linear-model/243718?
noredirect=1#comment463514_243718

Cite 4 Recommendations

Erik Mønness 4th Nov, 2016


Inland Norway University of Applied Sciences

OLS regression also often works fine when the dependent variable is binomial (where the variance of a proportion varies with the proportion itself). OLS is unbiased regardless of the error distribution. OLS estimates are equal to cross-table proportions and can easily be applied in multi-table situations. Logistic regression might give more correct p-values, but the difference from the OLS p-values is often small if the number of observations is not too small.

Cite

Muhammad S. Abu-Salih 17th Feb, 2017


Amman Arab University

The assumptions for regression are that the residuals are independent and identically distributed with zero mean and equal variances. These assumptions are necessary for drawing inference and computing p-values for F and t. If inference is not the goal and all that is needed is fitting an OLS line, then the above assumptions are not necessary. In addition, the CLT is enough to establish approximate normality of the mean if N > 30, and a P-P plot, Q-Q plot, and histogram with a normal curve, in addition to K-S or Shapiro tests, can be used to assess normality.

Cite

Mushtaq Ahmad 13th Apr, 2017


University of Peshawar

You can check the normality and homoscedasticity assumptions of the error term. If the variance of the residuals changes, you can transform the dependent variable using some transformation. Try Box-Cox!

Cite 1 Recommendation

Muthana A.K. Al-Zobaei Al-Zobae 5th Feb, 2018


University of Anbar

good question

Cite

Javed Iqbal 12th Mar, 2018


Institute of Business Administration Karachi

Normality of errors is required for inference, not for point estimation (thanks to the Gauss-Markov theorem). So the only concern is that, if the errors are non-normal, the tests may be misleading, especially for small sample sizes. I suggest using bootstrap-based p-values to carry out the tests.
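A minimal sketch of a case-resampling bootstrap for a regression slope, in Python/statsmodels on simulated skewed-error data (names are placeholders); it produces a percentile confidence interval, and the same resamples could be used for bootstrap p-values:

```python
# Case-resampling bootstrap for an OLS slope (simulated data, placeholder names).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 120
x = rng.uniform(0, 5, size=n)
y = 1.0 + 0.7 * x + rng.exponential(scale=1.0, size=n) - 1.0   # skewed, zero-mean errors

X = sm.add_constant(x)
slopes = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)                  # resample rows with replacement
    slopes.append(sm.OLS(y[idx], X[idx]).fit().params[1])

lo, hi = np.percentile(slopes, [2.5, 97.5])
print(f"Bootstrap 95% CI for the slope: ({lo:.3f}, {hi:.3f})")
```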

Cite 1 Recommendation

Athanasios Dermanis 12th Mar, 2018


Aristotle University of Thessaloniki

A remark on the answer of Javed Iqbal : I would fully agree with your answer, if the term "normality" is replaced
with the term "knowledge of the distribution (probability density function) of the errors". If one knows the
distribution he can still make statistical inference and employ statistical tests, even if the distribution is not the
normal one. The only problem is that he cannot use the ready made recipes of statistical textbooks which follow
the normality assumption.

Cite

Bruce Weaver 14th Mar, 2018


Lakehead University Thunder Bay Campus

Since I posted my earlier response 4 years ago, my thoughts on the normality requirement for OLS regression
have changed a bit, and I think they are probably in line with the views expressed by Javed Iqbal and Athanasios
Dermanis. I have been influenced by Jeffrey Wooldridge's book, Introductory Econometrics. What he says about
the assumptions for OLS linear regression is summarized in the attached PDF.

Note especially this excerpt (from p. 168 of the book) on slide 9:

"One practically important finding is that even without the normality assumption (Assumption MLR.6), t and F
statistics have approximately t and F distributions, at least in large sample sizes."

This suggests to me that normality of the errors is really a sufficient condition, but not a necessary condition.
Sufficient for what? Sufficient to ensure (approximate) normality of the sampling distributions of the model
parameters (i.e., the coefficients). And it is those sampling distributions, at the end of the day, that really need to
be (approximately) normal.

Cheers,

Bruce

OLS_regression_assumptions_Wooldridge.pdf · 379.76 KB

Cite 2 Recommendations

John-Kåre Vederhus 15th Mar, 2018


Sørlandet Hospital

Old post, but nonetheless: no one seems to have commented on how to interpret the normal probability plot, which @Simin Mahinrad asked about.

In the normal P-P plot, you are hoping that your points will lie in a reasonably straight diagonal line from bottom left to top right. In the scatter plot (see enclosed), you are hoping that the residuals will be roughly rectangularly distributed, with most scores concentrated in the center (around the 0 point). You can also see outliers, i.e. cases that have a standardized residual > 3 or < -3. You can also double-click the scatterplot (to open the chart editor), right-click on outliers, and choose "go to cases" to see which cases those are. You may want to run the model without the outliers to see if the model changes. (Tabachnick & Fidell, 2013, Using Multivariate Statistics.)

Best

JK

normality plot.jpg · 124.17 KB


scatterplot.jpg · 112.40 KB

Cite 1 Recommendation

Suzan Hatipoglu 20th Nov, 2018


Kettering General Hospital NHS Foundation Trust

Log-transform data and apply linear regression afterwards, I guess.

Cite 1 Recommendation

B.K. Bhaumik 22nd Nov, 2018


Variable Energy Cyclotron Centre

We can apply the generalized power (Box-Cox) transform, (x^lambda - 1)/lambda, to transform the data toward normal form.
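A small sketch of that transform using SciPy on simulated data; lambda is estimated by maximum likelihood:

```python
# Box-Cox power transform with lambda fitted by MLE (simulated, positive data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
y = rng.lognormal(mean=1.0, sigma=0.8, size=200)   # positive, right-skewed data

y_bc, lam = stats.boxcox(y)                        # applies (y**lam - 1) / lam
print("Estimated lambda:", lam)
print("Skewness before:", stats.skew(y), "after:", stats.skew(y_bc))
```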

Cite

Girdhar Agarwal 23rd Nov, 2018


University of Lucknow

The point estimates will be valid, but hypothesis tests and confidence intervals cannot be obtained in the usual way in the non-normal case. Regression diagnostics should be done before the proper analysis is performed.

Cite

Raid Amin 23rd Nov, 2018


University of West Florida

It is sufficient to check for normally distributed errors. Check if the residuals seem to be covered well by a normal
curve.

If this is not the case, you could do one of the following:

1. Bootstrap

2. Use the normal quantile transformation

3. Figure out an appropriate transformation

4. Use the raw data with caution.

5. (Anything else?)

Realize that not all published articles or texts on regression present 100% acceptable views and findings on regression. Take them as "input".

Then make up your mind.

Cite

Jaganathan Jothikumar 26th Nov, 2018


SRM Institute of Science and Technology College of Science and Humanities

Find the R-squared value. If, say, the R-squared value is 80%, it is interpreted as 80% of the variation in the dependent variable being explained by the independent variables, with only 20% unexplained.

Then there is a test for linearity of regression; use this test and come to a conclusion.

Cite

Kendra Taylor 28th Nov, 2018


Georgia Tech Research Institute

Simin Mahinrad Bruce Weaver

The article linked below, from the field of ophthalmology, where they seem to work with non-normal dependent variables often, shows a relationship between sample size and the degree to which violating the normality assumption for the dependent variable matters. The view expressed is that the dependent variable may have a conditional normal distribution across the data set.

I hope this helps.

Article Are Linear Regression Techniques Appropriate for Analysis Wh...

Cite 1 Recommendation

Abolfazl Ghoodjani 28th Nov, 2018


McGill University

It is good to use Minitab to find the best model, regardless of whether it is linear or not. Even though the distribution is not normal, the data can still be used with a linear model.

Cite

Dhritikesh Chakrabarty 8th Dec, 2018


Handique Girls' College

I do agree with Bruce Weaver.

Regarding the validity of the results of the linear regression analysis , my suggestion is as follows:

Test the significance of the deviation from linearity of the regression on the data. If the deviation is found to be insignificant, then the results can be taken as valid. On the other hand, if the deviation is found to be significant, then the data do not follow a linear regression model, which implies that the results are not valid. In that case, non-linear regression analysis will have to be performed to obtain valid results.

Cite 1 Recommendation

Girdhar Agarwal 8th Dec, 2018


University of Lucknow

As I suggested earlier, regression diagnostics (analysis based on residuals) will give the real picture. They will tell you whether linear regression is valid or not, and also give an idea of whether a quadratic or higher-degree polynomial will be required.

Cite 2 Recommendations

Dhritikesh Chakrabarty 9th Dec, 2018


Handique Girls' College

I agree with Girdhar Agarwal.

The answer given by him is, in essence, the same as the one I already gave.

Cite 1 Recommendation

Serhii Zabolotnii 9th Dec, 2018


Cherkasy State Business College

If the PDF of the errors (residuals) differs significantly from the Gaussian law, then the variance of the OLS estimates of the regression model parameters can be reduced by other estimators. One possible approach is to use higher-order statistics. For example:
Chapter Polynomial Estimation of Linear Regression Parameters for th...

Cite

Raid Amin 9th Dec, 2018


University of West Florida

I view it as important not to get cornered into one approach or methodology because some book "said so".
Statistics is a wonderful field, and its major charm to me is the fact that it can open up the world to us statisticians.
We can pick and choose what we want to analyze and in which way (as long as we stay with reasonably
acceptable and defensible choices).

In this thread we are debating and discussing options on what to do when "the outcome (dependent variable) is not normally distributed". A healthy exchange of thoughts can never hurt. This is good to do.

Cite 1 Recommendation

Nader Mohamed 11th Dec, 2018


Cairo University, Faculty of Economics & Political sciences

Hello Dr. Simin,

I think you could apply a transformation to your dependent variable, such as:

- a log transformation

- standardizing it using mean-centering

- a lag transformation

I hope my response may be useful for you

Cite 1 Recommendation

Dhritikesh Chakrabarty 11th Dec, 2018


Handique Girls' College

The suggestion provided by Nader Mohamed, as I assess, may be fruitful.

Cite 1 Recommendation

Debopam Ghosh 13th Dec, 2018


Atomic Minerals Directorate for Exploration and Research

You can try the Box-Cox family of transformations to transform the dependent variable you are dealing with into a response variable that is normally or nearly normally distributed.

Cite

Ghislain Aihounton 19th Dec, 2018


Faculty of Agronomy, University of Parakou

You can decide after examining the regression residuals and the fitted line. If log, square-root, mean-centering, or Box-Cox transformations still do not bring the residuals toward normality, I suggest you use a more suitable approach, namely GLM methods; you can then run a gamma regression, which allows you to get more reliable estimates under more flexible assumptions. Using linear regression in such a case (with non-normality) would impose normality, basically violating the OLS assumptions and yielding biased estimates. So, to conclude: although linear regression is well recognised as the best-known model (I hereby confirm it), you need to explore the data at hand by looking at the outcome distribution, and also at predicted vs. observed outcomes. After this process, you can decide on a suitable model.

Cite

Mohamed Abdallah Turky 19th Jan, 2019


Tanta University

Dear Simin Mahinrad,

Please read these articles:

https://iovs.arvojournals.org/article.aspx?articleid=2128171
https://data.library.virginia.edu/normality-assumption/

good luck

Cite

Shubham Jagtap 18th Dec, 2019


Technological University Dublin - City Campus

You can use the Spearman rank-order correlation (non-parametric) on non-normally distributed variables for the correlation analysis.

Cite

Adrian Olszewski 21st Mar, 2020


2KMM CRO

As others have said, linear regression doesn't make any distributional assumptions about the variables themselves; it's about the residuals. Nothing to add here.

But I'm going to make a note on data transformations for modelling.

I suggest avoiding variable transformations at almost any cost, except in cases where you can thoroughly and convincingly justify the reason and explain the consequences. This applies especially to Box-Cox.

What I wrote below refers to responses (dependent variable) but the interpretational part also refers to the
predictor.

1. It completely changes the formulation, and totally affects the interpretation. Only in "clean" cases will you get an interpretable outcome, like log-transformed data generated by a multiplicative process (not *any right-skewed data*!). Log, exp, reciprocal, square/cube-root, and power-of-2 or -3 transformations may be meaningful in *special scenarios*, e.g. velocity, area, volume, concentration, length (square root of area), but y^-0.67 doesn't mean anything. And most of your audience will have no idea how your response changes with the predictor unless you draw the curve. Easy with a single predictor, but what if you have more? You will need marginal effects to give some idea.

So, by transforming, you *force your variables to follow a certain distribution*. For example, log-transformation assumes your data come from a log-normal distribution. Just look at what it does to the equation. As a consequence...

2. ... it changes the model along with the errors! In our case - from additive to multiplicative, errors included. Maybe that's good, maybe not; it depends on your case.

3. It will also affect the variance along with the means - many people blindly use transformations, completely forgetting about that! Well, it can be useful if we want to stabilize the variance, BUT it changes more! For example, in the normal distribution the mean and variance are independent; in the log-normal they are not! Of course this is an idealized case; Box-Cox may return any weird exponent - guess how the model and the mean-variance relationship change then?

4. Jensen's inequality says clearly that log(E(y)) is not E(log(y)) (except with the identity link). By running a regression you are interested in modelling the conditional expectation of the response, rather than the (transformed) response itself. And remember that no transformation can handle certain response distributions properly, like counts (transforming counts makes no sense, by the way).

5. In the case of testing, it changes the null hypothesis, which is then no longer the one you meant to assess! In our case: from a shift in arithmetic means to a ratio of geometric means. Yes, you performed valid inference... on a hypothesis you never wanted to test (unless you can justify it).

You obtain a valid answer to an *unasked question*. And yes, results for log(y) may differ from the results returned by a model with a log link (e.g. gamma regression). You will have to decide which one to choose.

Sometimes there are industry guidelines, like those given by the FDA for clinical biostatistics, which advise using the log on PK data (for a good reason), but *even those guidelines* warn you against unconditional and *unjustified* transformations!

6. Your back-transformed confidence intervals may be biased. Another disease added to the collection.

7. Box-Cox, like any other transformation, does NOT guarantee the properties you need. And what then? Transform again, only complicating an already complicated situation? Not to mention that you may turn your right-skewed data into... left-skewed data, and only fall into more trouble.

I know there are many proponents of routine data transformation ("skewed data? Transform it!") on ResearchGate, having been taught this for decades or because a famous person told them to continue using it, but in light of the arguments I have collected above I strongly suggest considering the (practically always better) alternatives.

Except in the few scenarios mentioned, transformations can cause more harm than good.

It is the 21st century, and we have had plenty of models for 50 years now, including the GLM, VGLM, GAM, and tons of others. Here is my diagram showing some of them and their relationships:
https://www.dropbox.com/s/5a8w8kckyfeaix0/statistical%20models%20-%20diagram.pdf

Ghislain Aihounton mentioned the gamma GLM, which makes a good proposal.

Of course, whether it applies in your case is another matter.

Cite 3 Recommendations

Emmanuel Ekpenyong 21st Mar, 2020


Michael Okpara University of Agriculture, Umudike

Yes. The assumption of normality is only on the residuals. It has nothing to do with the dependent variable.

Cite 1 Recommendation

Erik Mønness 23rd Mar, 2020


Inland Norway University of Applied Sciences

Box-Cox may be beneficially used as a distribution generator. See https://brage.inn.no/inn-xmlui/handle/11250/134438 and also https://brage.inn.no/inn-xmlui/handle/11250/298119 (from the Canadian Journal of Forest Research).

Cite

Onyekachi Chukwu 24th Mar, 2020


Nnamdi Azikiwe University, Awka

The assumption of normality in regression is about the errors, not the variables.

Cite

Hernan Manrique 25th Mar, 2020


KU Leuven

It's more about the normality of the residuals than the distribution of your dependent variable

Cite

Ben M. Kane 12th Apr, 2020


Université du Québec en Outaouais

Hi, you can just transform your data to normalize them. There are several methods you can use.

Cite 1 Recommendation

Adrian Olszewski 12th Apr, 2020


2KMM CRO

Ben M. Kane, please find my answer five posts above on why this is generally a bad idea if used routinely. The more people cease to use it routinely, the better for science.

Cite

Mushtaq Ahmad 12th Apr, 2020


University of Peshawar

No

Cite

James R Knaub 16th Apr, 2020


N/A

Yes.

For example:

https://www.researchgate.net/publication/319914742_Quasi-Cutoff_Sampling_and_the_Classical_Ratio_Estimator_-_Application_to_Establishment_Surveys_for_Official_Statistics_at_the_US_Energy_Information_Administration_-_Historical_Development

Cite

Dhritikesh Chakrabarty 22nd Apr, 2020


Handique Girls' College

Regression analysis is a technique for determining the mathematical or statistical form of the dependence of one variable (called the dependent variable) on one or more independent variables. This dependence is described (or explained) by an equation called the regression equation. When the dependence is linear, it is described by a linear regression equation; i.e. linear regression is a means of describing or explaining, mathematically (or statistically), the linear dependence of one variable on another variable (or variables). In describing the linear dependence, it is not a necessary condition that the dependent variable be normally distributed.

Thus, in linear regression analysis, the results/findings are valid even if the dependent variable under study is not normally distributed.

Cite 2 Recommendations

W.V.A.D Karunarathne 6th May, 2020


University of Kelaniya

If the DV is not normal, a basic assumption is violated, so one option is to transform the non-normal DV into a normal DV.

Anura

Cite

Adrian Olszewski 6th May, 2020


2KMM CRO

@W.V.A.D Karunarathne There is no such assumption in linear regression (and, in general, in the General Linear Model). Also, please read my comment on why transforming the DV is one of the worst methods when better ones exist.

Cite 1 Recommendation

Muhammed Ashraful Alam 18th Jun, 2020


Ministry of Health and Family Welfare, Bangladesh

With a big data set, you can compromise on this assumption.

Cite

Loay Alwehwah 3rd Sep, 2020


Al-Quds University

Some say we have to transform, others say no; we need an expert to tell us how to fix the normality issues and the autocorrelation issues.

Cite

