
CHOOSING THE BEST MODELS IN REGRESSION
Dr Minu Maria Rose
2nd Semester
JSS School of Public Health
INTRODUCTION
• Model selection is an important part of any statistical analysis, and
indeed is central to the pursuit of science in general.

• For a good regression model, you want to include the variables that you
are specifically testing along with other variables that affect the response
in order to avoid biased results.

• Many tools for selecting the best model have been suggested in the
literature.
REGRESSION ANALYSIS

LINEAR REGRESSION

LOGISTIC REGRESSION

POLYNOMIAL REGRESSION

COX PROPORTIONAL
HAZARDS REGRESSION
R SQUARED
• The most popular measure; used in linear regression.
• It is the square of the correlation coefficient.
• It is the proportion of the variation in Y that is accounted for by the variation in X.
• R2 varies between zero (no linear relationship) and one (perfect linear relationship).
• R2, officially known as the coefficient of determination, is defined as the sum of squares due to the regression divided by the total sum of squares of Y (corrected for its mean): R2 = SSR/SST.
• A higher R-squared indicates the model is a good fit; a lower R-squared indicates the model is not a good fit (see the sketch below).
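
A minimal R sketch of reading off R-squared, assuming a hypothetical data frame df with outcome y and predictor x:

# Fit a simple linear model and extract the coefficient of determination.
fit <- lm(y ~ x, data = df)
summary(fit)$r.squared   # proportion of variation in y explained by x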

ADVANTAGES OF R2

• It represents the strength of the fit (on average, your predicted values do not deviate
much from your actual data).
DISADVANTAGES OF R2
• It does not tell you whether the model is good,
• whether the data you’ve chosen is biased,
• or even whether you’ve chosen the correct modelling method.
The R² value ranges from 0 to 1, with higher values denoting a strong fit and lower values denoting a weak one:
R2 < 0.5 – Weak fit

0.5 < R2 < 0.8 – Moderate fit

R2 > 0.8 – Strong fit


ADJUSTED R SQUARED
• For a multiple regression model, R-squared increases or remains the same as
we add new predictors to the model.
• Adjusted R-squared eliminates this drawback.
• It only increases if the newly added predictor improves the model’s predicting
power.
• Adding irrelevant predictors to a regression model results in a decrease in the
adjusted R-squared.
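
For reference, the adjustment penalizes model size: with n observations and p predictors, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). A minimal R sketch, assuming the hypothetical df, y, and predictors x1 and x2:

# Adjusted R-squared is reported alongside R-squared by summary().
fit <- lm(y ~ x1 + x2, data = df)
summary(fit)$adj.r.squared   # penalized for the number of predictors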
CONTINUE
• Generally, you choose the models that have higher adjusted and predicted R-squared
values.
• The adjusted R-squared increases only if a new term improves the model more than
would be expected by chance, and it decreases when poor-quality predictors are added.
• The predicted R-squared is a form of cross-validation and it can also decrease.
• Cross-validation determines how well your model generalizes to other data sets by
partitioning your data.
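
A minimal sketch of k-fold cross-validation in base R, assuming a hypothetical data frame df with outcome y:

# 5-fold cross-validation: hold out each fold in turn and measure prediction error.
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(df)))
cv_rmse <- sapply(1:5, function(k) {
  fit  <- lm(y ~ ., data = df[folds != k, ])        # train on the other folds
  pred <- predict(fit, newdata = df[folds == k, ])  # predict the held-out fold
  sqrt(mean((df$y[folds == k] - pred)^2))           # fold-level RMSE
})
mean(cv_rmse)   # average out-of-sample error across folds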
PSEUDO R SQUARED
• A pseudo R-squared value between 0.2 and 0.4 indicates an excellent fit.
• These are “pseudo” R-squareds because they look like R-squared in the sense that they are
on a similar scale, ranging from 0 to 1 (though some pseudo R-squareds never achieve 0
or 1), with higher values indicating better model fit.
• But they cannot be interpreted as one would interpret an R-squared.
• Different types are: Efron’s, McFadden’s, Cox and Snell, etc.
• They are valid and useful for evaluating multiple models predicting the same outcome on
the same dataset.
Continue..
• Used when the outcome variable is nominal or ordinal such that the
coefficient of determination R2 cannot be applied as a measure for
goodness of fit.
• McFadden’s version, for example, compares log-likelihoods: R2 = 1 − [ln L̂(M_full)] / [ln L̂(M_intercept)].
• A pseudo R-squared only has meaning when compared to another
pseudo R-squared of the same type, on the same data, predicting the
same outcome. In this situation, the higher pseudo R-squared indicates
which model better predicts the outcome.
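
A minimal R sketch of McFadden’s pseudo R-squared for a logistic model, assuming a hypothetical data frame df with a binary outcome y:

# McFadden's pseudo R-squared: 1 - lnL(full model) / lnL(intercept-only model).
full <- glm(y ~ x1 + x2, data = df, family = binomial)
null <- glm(y ~ 1, data = df, family = binomial)
1 - as.numeric(logLik(full)) / as.numeric(logLik(null))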
P VALUE

• Low p-values indicate terms that are statistically significant

• “Reducing the model” refers to the practice of including all candidate
predictors in the model and then systematically removing the term with
the highest p-value, one at a time, until you are left with only significant
predictors (see the sketch below).

• If the overall p-value from the model’s F-test is less than the significance
level (usually 0.05), the model explains significantly more of the variation
in the outcome than an intercept-only model.
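
A minimal R sketch of this backward-elimination procedure, assuming hypothetical numeric predictors x1, x2, x3 and a 0.05 threshold:

# Repeatedly drop the least significant term until all remaining terms are significant.
fit <- lm(y ~ x1 + x2 + x3, data = df)
repeat {
  coefs <- summary(fit)$coefficients[-1, , drop = FALSE]   # exclude the intercept
  p <- coefs[, "Pr(>|t|)"]
  if (max(p) < 0.05) break                                 # all terms significant
  worst <- names(which.max(p))                             # highest p-value
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))  # remove it and refit
}
summary(fit)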
AKAIKE INFORMATION CRITERION
• The Akaike information criterion (AIC) is a mathematical method for evaluating how
well a model fits the data it was generated from.
• In statistics, AIC is used to compare different possible models and determine which one is
the best fit for the data.
• AIC estimates the quality of each model, relative to each of the other models.
• AIC deals with both the risk of overfitting and the risk of underfitting.
• When a statistical model is used to represent the process that generated the data, the
representation will almost never be exact; so some information will be lost by using the
model to represent the process. AIC estimates the relative amount of information lost by a
given model: the less information a model loses, the higher the quality of that model.
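
Concretely, AIC = 2k − 2 ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood; lower AIC means less information lost. A minimal R sketch comparing hypothetical candidate models:

# The candidate model with the lowest AIC is preferred.
m1 <- lm(y ~ x1, data = df)
m2 <- lm(y ~ x1 + x2, data = df)
AIC(m1, m2)   # table of degrees of freedom and AIC for each model
step(m2)      # stepwise selection that minimizes AIC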

ROOT MEAN SQUARED ERROR (RMSE), which measures the average error made by the model
in predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean
squared error (MSE), which is the average squared difference between the observed outcome
values and the values predicted by the model.
So, MSE = mean((observeds - predicteds)^2) and RMSE = sqrt(MSE).
The lower the RMSE, the better the model.
RESIDUAL STANDARD ERROR (RSE), also known as the model sigma, is a variant of the RMSE
adjusted for the number of predictors in the model. The lower the RSE, the better the model. In practice,
the difference between RMSE and RSE is very small, particularly for large multivariate data.
MEAN ABSOLUTE ERROR (MAE), like the RMSE, the MAE measures the prediction error.
Mathematically, it is the average absolute difference between observed and predicted outcomes, MAE =
mean(abs(observeds - predicteds)). MAE is less sensitive to outliers compared to RMSE.
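
A minimal R sketch computing all three metrics for a fitted model, using the hypothetical df, y, x1, and x2 from the earlier examples:

# RMSE, RSE (model sigma), and MAE for a fitted linear model.
fit  <- lm(y ~ x1 + x2, data = df)
pred <- predict(fit)
rmse <- sqrt(mean((df$y - pred)^2))   # root mean squared error
mae  <- mean(abs(df$y - pred))        # mean absolute error
rse  <- sigma(fit)                    # residual standard error
c(RMSE = rmse, RSE = rse, MAE = mae)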
Refer to this link for BIC: http://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/
Thank you
