BIOSTATISTICS
BEST MODELS
IN REGRESSION
Dr Minu Maria Rose
2nd Semester
JSS School of Public
Health
INTRODUCTION
• Model selection is an important part of any statistical analysis, and
indeed is central to the pursuit of science in general.
• For a good regression model, you want to include the variables that you
are specifically testing along with other variables that affect the response
in order to avoid biased results.
• Many tools for selecting the best model have been suggested in the
literature.
REGRESSION ANALYSIS
LINEAR REGRESSION
LOGISTIC REGRESSION
POLYNOMIAL REGRESSION
COX PROPORTIONAL HAZARDS REGRESSION
R SQUARED
• The most popular goodness-of-fit measure, used in linear regression.
• Square of the correlation coefficient.
• It is the proportion of the variation in Y that is accounted for by the variation in X.
• R2 varies between zero (no linear relationship) and one (perfect linear
relationship).
• R2, officially known as the coefficient of determination, is defined as the sum of
squares due to the regression divided by the total sum of squares of Y.
• A higher R-squared indicates a better fit; a lower R-squared indicates a poorer fit.
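The definition above can be checked by hand. The following is a minimal sketch, using made-up data, that fits a simple least-squares line and computes R² as 1 − SS_residual / SS_total, which for simple linear regression equals the squared correlation between X and Y:

```python
# Sketch: computing R-squared by hand for simple linear regression.
# The x and y values below are made up for illustration.
import statistics

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Fit the least-squares line y = intercept + slope * x
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # residual SS
ss_tot = sum((yi - mean_y) ** 2 for yi in y)                  # total SS
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

Because the made-up data are nearly linear, R² here comes out close to 1, illustrating a strong fit.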
ADVANTAGES OF R2
• It represents the strength of the fit (on average, your predicted values do not deviate
much from your actual data).
DISADVANTAGES OF R2
• It does not tell you whether the model is good.
• It does not tell you whether the data you’ve chosen is biased.
• It does not tell you whether you’ve chosen the correct modelling method.
The R² value ranges from 0 to 1, with higher values denoting a stronger fit and lower
values denoting a weaker one. As a rough rule of thumb, R² < 0.5 indicates a weak fit.
• If the p-value of the overall model test is less than the significance level (usually
0.05), the model explains a statistically significant amount of the variation in the outcome.
AKAIKE INFORMATION CRITERION
• The Akaike information criterion (AIC) is a mathematical method for evaluating how
well a model fits the data it was generated from.
• In statistics, AIC is used to compare different possible models and determine which one is
the best fit for the data.
• AIC estimates the quality of each model, relative to each of the other models.
• AIC deals with both the risk of overfitting and the risk of underfitting.
• When a statistical model is used to represent the process that generated the data, the
representation will almost never be exact; so some information will be lost by using the
model to represent the process. AIC estimates the relative amount of information lost by a
given model: the less information a model loses, the higher the quality of that model.
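To make model comparison with AIC concrete, here is a hedged sketch. For least-squares regression models, AIC can be written (up to an additive constant) as n·ln(RSS/n) + 2k, where n is the sample size, RSS the residual sum of squares, and k the number of estimated parameters. The RSS values and parameter counts below are made up for illustration:

```python
# Sketch: comparing two hypothetical regression models with AIC.
# For least-squares fits, AIC = n * ln(RSS/n) + 2k, dropping constants.
import math

def aic(n, rss, k):
    """AIC for a least-squares fit (additive constants dropped)."""
    return n * math.log(rss / n) + 2 * k

n = 50
aic_simple  = aic(n, rss=120.0, k=2)  # hypothetical 1-predictor model
aic_complex = aic(n, rss=118.0, k=5)  # hypothetical 4-predictor model

# The model with the lower AIC is preferred: here the small drop in
# RSS does not justify the three extra parameters.
best = "simple" if aic_simple < aic_complex else "complex"
print(best)
```

This illustrates how AIC balances overfitting against underfitting: adding predictors always lowers the RSS, but the 2k penalty term charges the model for each extra parameter.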
ROOT MEAN SQUARED ERROR (RMSE), which measures the average error made by the model
in predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean
squared error (MSE), which is the average squared difference between the observed actual outcome
values and the values predicted by the model.
So, MSE = mean((observeds - predicteds)^2) and RMSE = sqrt(MSE).
The lower the RMSE, the better the model.
RESIDUAL STANDARD ERROR (RSE), also known as the model sigma, is a variant of the RMSE
adjusted for the number of predictors in the model. The lower the RSE, the better the model. In practice,
the difference between RMSE and RSE is very small, particularly for large multivariate data.
MEAN ABSOLUTE ERROR (MAE), like the RMSE, the MAE measures the prediction error.
Mathematically, it is the average absolute difference between observed and predicted outcomes, MAE =
mean(abs(observeds - predicteds)). MAE is less sensitive to outliers compared to RMSE.
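The RMSE and MAE formulas above translate directly into code. The following is a minimal sketch, with made-up observed and predicted values, mirroring MSE = mean((observeds - predicteds)^2), RMSE = sqrt(MSE), and MAE = mean(abs(observeds - predicteds)):

```python
# Sketch: RMSE and MAE from observed vs. predicted outcomes.
# The values below are made up for illustration.
import math

observed  = [3.0, 5.0, 2.5, 7.0, 4.5]
predicted = [2.8, 5.4, 2.9, 6.5, 4.4]

errors = [o - p for o, p in zip(observed, predicted)]
mse  = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                              # root mean squared error
mae  = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

print(round(rmse, 3), round(mae, 3))
```

Note that RMSE squares each error before averaging, so one large miss inflates it more than it inflates MAE; this is why MAE is described above as less sensitive to outliers.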
Source: http://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/
Thank you