Sta 3010 Quizzes
Perform ordinal logistic regression analysis on a dataset in R and interpret the results.
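A minimal sketch in R using polr() from the MASS package; the data frame dat and the variables satisfaction, age and income are hypothetical names used for illustration:

library(MASS)
# The outcome must be an ordered factor for ordinal logistic regression.
dat$satisfaction <- factor(dat$satisfaction,
                           levels = c("Low", "Medium", "High"),
                           ordered = TRUE)
fit <- polr(satisfaction ~ age + income, data = dat, Hess = TRUE)
summary(fit)    # coefficients are log-odds under the proportional odds assumption
exp(coef(fit))  # odds ratios for being in a higher outcome category

Each exponentiated coefficient is the multiplicative change in the odds of being at or above any given category for a one-unit increase in that predictor.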
Multinomial Logistic Regression:
Compare and contrast multinomial logistic regression with other types of regression models.
Multinomial logistic regression is used to model nominal outcome variables, estimating the log-
odds of each category relative to a baseline category as a function of predictor variables,
without specific assumptions about the relationship between categories.
Binary logistic regression, on the other hand, models the log-odds of a binary outcome (coded
as 1 vs. 0) as a linear combination of predictor variables and is used for predicting the
probability of one of two possible outcomes.
Ordinal logistic regression models ordinal outcome variables by estimating the log-odds of
being at or above each category threshold as a function of predictor variables, assuming a
natural order of categories and the proportional odds assumption.
Poisson regression is used to model count data by estimating the log of the expected count as a
linear combination of predictor variables, assuming that the mean and variance of the counts
are equal.
For count data with overdispersion, negative binomial regression is more appropriate, as it
relaxes the equidispersion assumption.
Linear regression is used to model continuous outcome variables by estimating the mean of
the outcome as a linear combination of predictor variables, assuming a linear relationship
between predictors and outcome, homoscedasticity, and normality of residuals.
Use R to fit a multinomial logistic regression model to categorical outcome data and interpret
the output.
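A minimal sketch using multinom() from the nnet package; dat and the variables choice, age and income are hypothetical names:

library(nnet)
# Make the outcome a factor and set the baseline (reference) category.
dat$choice <- relevel(factor(dat$choice), ref = "A")
fit <- multinom(choice ~ age + income, data = dat)
summary(fit)    # one set of log-odds coefficients per non-baseline category
exp(coef(fit))  # relative risk ratios versus the baseline category
# multinom() does not report p-values directly; Wald z-tests can be computed:
z <- summary(fit)$coefficients / summary(fit)$standard.errors
p <- 2 * (1 - pnorm(abs(z)))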
Discuss the assumptions underlying multinomial logistic regression and how they are
assessed.
Multinomial logistic regression rests on the following assumptions:
1. Independence of irrelevant alternatives (IIA): the relative odds of choosing one
outcome category over another do not depend on which other categories are available.
This can be tested with the Hausman-McFadden test.
2. No perfect separation: if the outcome categories are perfectly separated by the
predictors, unrealistic coefficients will be estimated and effect sizes will be greatly
exaggerated.
3. No multicollinearity: no two or more independent variables should be highly
correlated with each other. This can be checked using Variance Inflation Factor
(VIF) values; values greater than 10 indicate serious multicollinearity.
Probit Regression:
Explain the concept of probit regression and its application compared to logistic regression.
The probit model is used to model binary outcomes. It does this by assuming that the
probability of the outcome follows a cumulative standard normal distribution. The probit model
employs a probit link function: the inverse of the standard normal cumulative distribution
function, applied to the outcome probability, is modeled as a linear combination of the predictors.
In contrast, logistic regression uses the cumulative logistic distribution and a logit link function.
Specifically, the logistic model maps the linear combination of predictors to the cumulative
logistic distribution. Although both models are used for binary outcomes, they differ in how
the coefficients are interpreted.
In logistic regression, the coefficient represents the change in the log-odds of the dependent
variable (y) for a one-unit increase in the predictor variable; exponentiating the coefficient gives
the odds ratio, indicating how the odds of the outcome change with the predictor. In the probit
model, the coefficient represents the change in the z-score (the standard normal deviate) for a
one-unit increase in the predictor variable.
Implement probit regression in R for a given dataset and discuss the interpretation of results.
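A minimal sketch via glm() with a probit link; dat, y, x1 and x2 are hypothetical names:

# Probit regression: binomial family with a probit link.
fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "probit"))
summary(fit)  # each coefficient is the change in the z-score per one-unit increase
# Predicted probabilities apply the standard normal CDF to the linear predictor:
head(predict(fit, type = "response"))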
Discuss situations where probit regression might be preferred over logistic regression.
If the underlying variable that determines the binary outcome (the latent variable) is assumed
to follow a normal distribution, then probit regression is more appropriate.
Probit regression is particularly suited for experiments where the assumption of normality of
the latent variable holds true. On the other hand, logistic regression is often preferred for
observational studies.
Poisson Regression:
Define the Poisson regression model and its assumptions.
The Poisson regression model is used for response variables (Y-values) that are counts. It
models count data by estimating the log of the expected count as a linear combination of
predictor variables, assuming that the mean and variance of the counts are equal
(equidispersion).
The following assumptions are made when using the Poisson regression model:
1. The response variable consists of counts assumed to follow a Poisson distribution.
2. The observations are independent of one another.
3. The log of the expected count is a linear function of the predictors.
4. The mean and variance of the counts are equal (equidispersion).
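A minimal sketch of a Poisson regression fit with glm(); dat, counts, x1 and x2 are hypothetical names:

fit <- glm(counts ~ x1 + x2, data = dat, family = poisson(link = "log"))
summary(fit)    # coefficients are changes in the log of the expected count
exp(coef(fit))  # multiplicative effects (rate ratios) on the expected count
# A rough check of the equidispersion assumption: a ratio well above 1
# suggests overdispersion and a quasi-Poisson or negative binomial model.
deviance(fit) / df.residual(fit)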
Discuss the advantages and limitations of using log-linear models for count data analysis.
Log-linear models are advantageous in that:
1. They can analyze count data in multidimensional contingency tables, enabling us to
understand the relationships between categorical variables.
2. They are simple, easy to understand, and have the flexibility associated with ANOVA and
regression.
Their disadvantages for count data analysis are:
1. They can become highly complex with high-dimensional data.
2. Unlike quasi-Poisson regression, they cannot handle overdispersion effectively.
Kernel Smoothing:
Define kernel smoothing and its use in non-parametric regression.
Kernel smoothing is an extension of kernel density estimation to regression problems. It is
used to fit a smooth curve through the data points and can model a non-linear relationship
between the outcome variable and the predictors. It uses a kernel function to weight nearby
observations when estimating the regression function at a particular point.
Implement kernel smoothing in R to fit a smooth curve to data and interpret the bandwidth
parameter.
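A minimal sketch using ksmooth() from base R; x and y are hypothetical numeric vectors:

# Nadaraya-Watson kernel regression with a Gaussian kernel.
fit <- ksmooth(x, y, kernel = "normal", bandwidth = 2)
plot(x, y)
lines(fit, col = "red")
# The bandwidth controls the bias-variance trade-off: a small value gives a
# wiggly curve (low bias, high variance); a large value oversmooths.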
Compare kernel smoothing with parametric regression methods like linear regression.
Kernel smoothing is a non-parametric technique, while methods like linear regression are
parametric.
Kernel smoothing does not assume a specific functional form for the relationship between the
dependent and independent variables, while linear regression assumes a linear relationship
between them.
Kernel smoothing is used to fit a smooth curve through the data points, while methods like
linear regression fit a single regression line or curve to the variables.
Splines:
Explain the concept of splines and how they are used to model non-linear relationships.
Spline smoothing is a technique that involves fitting a piecewise continuous curve (spline) to the
data. It is used when the relationship between a dependent variable and an independent
variable is not captured properly by a linear model. The range of the predictor is divided into
intervals at points called knots, and each interval receives its own fit, with the pieces joined at
the knots.
Perform spline regression in R using natural splines and compare with polynomial regression.
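A minimal sketch comparing a natural spline fit with polynomial regression, using ns() from the splines package; dat, y and x are hypothetical names:

library(splines)
fit_ns   <- lm(y ~ ns(x, df = 4), data = dat)  # natural cubic spline, 4 df
fit_poly <- lm(y ~ poly(x, 4), data = dat)     # global degree-4 polynomial
summary(fit_ns)
# Overlay the two fitted curves for comparison:
ord <- order(dat$x)
plot(dat$x, dat$y)
lines(dat$x[ord], fitted(fit_ns)[ord], col = "blue")
lines(dat$x[ord], fitted(fit_poly)[ord], col = "red")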
Discuss the advantages of using splines over polynomial regression for modeling complex
curves.
The advantages of using splines over polynomial regression are:
1. While a polynomial of a given degree can only capture a limited amount of curvature
across the whole range of the data, splines can model complex nonlinear relationships
flexibly.
2. Splines provide a smooth interpolation between fixed points called knots, with a separate
piece fitted between the knots, rather than a single global polynomial bent across the
entire range.
3. They avoid many of the pitfalls associated with polynomial regression, such as
overfitting and instability, making them a preferred choice for modeling complex curves.
Generalized Additive Models (GAM):
Define GAM and explain its advantages over traditional linear models.
A GAM is a generalized linear model in which the linear predictor depends on unknown
smooth functions of some of the independent variables, and interest focuses on inference
about these smooth functions.
It has several advantages over traditional linear models:
1. GAMs address the limitation of traditional linear models, which assume a linear
relationship between the dependent and independent variables, by allowing these
relationships to be modelled flexibly through smoothing functions.
2. They can capture patterns that would be missed by traditional linear models.
Fit a GAM in R to a dataset containing non-linear relationships and interpret the results.
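A minimal sketch using gam() from the mgcv package; dat, y, x1 and x2 are hypothetical names:

library(mgcv)
fit <- gam(y ~ s(x1) + s(x2), data = dat)
summary(fit)   # an edf near 1 suggests a nearly linear effect; larger edf means more curvature
plot(fit, pages = 1)  # estimated smooth functions with confidence bands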
Compare GAM with other non-parametric regression methods like regression trees and
random forest.
GAMs are flexible, easy to interpret, and use smoothing functions, but they require careful
selection of smoothing parameters and may not capture interactions as well as tree-based
models. They are used to model nonlinear relationships between variables.
Regression trees are easy to interpret and can also model nonlinear relationships between
variables. Their disadvantages are that they can overfit and that, unlike GAMs, they lack
smoothness, producing piecewise-constant fits. They handle categorical variables naturally and
help capture interactions.
Random forests are commonly used on high-dimensional datasets with complex interactions.
They have high predictive accuracy and are robust to overfitting. Their disadvantage is that
they are less interpretable than GAMs and require heavy computational power.
Panel Data Analysis:
Define panel data and distinguish between balanced and unbalanced panel datasets.
Panel data, also known as longitudinal data or cross-sectional time-series data, is data collected
over time for the same entities, e.g. countries. This type of data combines cross-sectional
data (data collected at one point in time) with time-series data (data collected over multiple
time periods). For example, you might have data on the GDP growth rate for 10 countries
over 10 years.
A balanced panel dataset has the same number of observations for each entity across all time
periods; there are no missing values for any time period for any entity. In the example above,
the panel is balanced if each of the 10 countries has GDP growth rate data for all 10 years
without missing observations.
An unbalanced panel dataset has a different number of observations for different entities over
time; some values are missing for some time periods for some entities. The panel is unbalanced
if, say, 8 countries have GDP growth data for all 10 years while the other 2 have only 7 years
of data.
Implement fixed effects and random effects models in R for panel data and interpret the
results.
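A minimal sketch with the plm package, continuing the countries/years example above; the data frame dat and the variables gdp_growth and investment are hypothetical:

library(plm)
pdat <- pdata.frame(dat, index = c("country", "year"))
fe <- plm(gdp_growth ~ investment, data = pdat, model = "within")  # fixed effects
re <- plm(gdp_growth ~ investment, data = pdat, model = "random")  # random effects
summary(fe)
summary(re)
# Hausman test: a significant result favours the fixed effects model.
phtest(fe, re)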
Discuss the advantages of using panel data models over cross-sectional models.
The advantages of panel data models over cross-sectional models are:
1. They can control for unobserved heterogeneity by accounting for individual-specific
effects, providing more consistent estimates than cross-sectional models.
2. Unlike cross-sectional models, which only capture a snapshot at one point in time, panel
data models allow you to analyze changes over time.
3. Panel data help in identifying temporal variations and understanding how causal
impacts might fluctuate over time. This is useful for observing how short-term changes
influence outcomes.
4. By using the timing of changes within individuals over multiple periods, panel data
models can better identify causal relationships; this timing information helps establish
whether changes in one variable cause changes in another.
Generalized Estimating Equations (GEE):
Explain the concept of GEE and how it differs from traditional regression models.
GEE is a method for modeling longitudinal or clustered data. It is used with non-normal data
such as binary or count data. It differs from traditional regression models in that:
1. It models a population average: the parameter estimates describe marginal
(population-averaged) effects rather than effects conditional on a particular subject or
cluster.
2. It allows us to specify a working correlation structure for the responses within a subject
or group.
3. It is designed for simple clustering or repeated measures.
Apply GEE in R to analyze longitudinal data and interpret the population-averaged effects.
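A minimal sketch using geeglm() from the geepack package; dat, y, time, treatment and id are hypothetical names, and an exchangeable working correlation is assumed:

library(geepack)
fit <- geeglm(y ~ time + treatment, data = dat, id = id,
              family = binomial, corstr = "exchangeable")
summary(fit)  # coefficients are population-averaged (marginal) effects
QIC(fit)      # QIC can be used to compare working correlation structures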
Discuss the assumptions of GEE and how they are assessed in practice.
GEE assumes that the mean structure is correctly specified; this can be assessed through
goodness-of-fit checks or residual plots.
GEE assumes a working correlation structure for the responses within a subject or group; the
choice of structure can be assessed using criteria such as the quasi-likelihood information
criterion (QIC).
GEE assumes stationarity of the correlation over time, which can be assessed by plotting
residuals over time.