NAME: NEEMA NDANU ID: 666457

COURSE: STATISTICAL MODELLING


Quizzes
Describe the difference between logistic regression and linear regression.
Logistic regression works on the concept of probability and is used to solve classification problems: it relates one or more independent variables to a categorical dependent variable with a binary outcome such as Yes or No, 0 or 1. It uses a sigmoid (logistic) function for modelling the data, and its coefficients are estimated by maximum likelihood rather than by least squares.
Linear regression shows the linear relationship between a dependent variable (Y-axis) and independent variables (X-axis). It models relationships between continuous or numerical variables, e.g. age and salary. It uses a fixed equation, Y = B0 + B1x, and its coefficients are estimated by least squares.
Explain how the logistic regression model handles binary outcome variables.
The logistic regression model handles binary outcome variables by modelling the probability that a given observation belongs to a specific class, e.g. 0 or 1. This is how it handles both the calculation and the interpretation of the outcome:
1. It uses the logistic (sigmoid) function to map the predicted values to probabilities:
   f(x) = 1 / (1 + e^(-x)), where x = β0 + β1x1 + ⋯ + βnxn
2. The output is binary, that is, either 0 or 1, Yes or No, etc.
3. The model is based on the concept of odds; for binary logistic regression, the odds are the ratio of the probability of an event occurring to the probability of it not occurring:
   log( P(Y=1 ∣ X) / P(Y=0 ∣ X) ) = β0 + β1x1 + ⋯ + βnxn
4. The model coefficients are then estimated by maximum likelihood estimation (MLE), which finds the values that maximize the likelihood of observing the given set of data.
5. We then calculate the p-value for each coefficient and compare it to the standard 0.05 level; if it is smaller we reject H0, otherwise we fail to reject it.
6. Each coefficient is interpreted on the logit (log-odds) scale: the log-odds in favor of the dependent variable (y) are estimated to rise or fall by the coefficient value Bi for a one-unit increase in that predictor. For example, the log-odds of being in favor of gay marriage are estimated to rise by 0.15 with each year of education.
7. We can also use the odds ratio to interpret the variables, that is, how the odds of the outcome change with a one-unit increase in a predictor variable. For example, the odds of being in support of gay marriage are predicted to grow 1.17 times larger for each additional year of education.
Using an example dataset in R, demonstrate how to fit a binary logistic regression model and
interpret the coefficients.
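A minimal sketch, using the built-in mtcars data with transmission type am (0 = automatic, 1 = manual) as the binary outcome and hp and wt as illustrative predictors:

# Fit a binary logistic regression with glm() and the binomial family
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
summary(fit)      # coefficients are on the log-odds (logit) scale
exp(coef(fit))    # exponentiate to obtain odds ratios

Each coefficient is the estimated change in the log-odds of a manual transmission for a one-unit increase in that predictor; exponentiating a coefficient gives the corresponding odds ratio.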

Ordinal Logistic Regression:


Define what an ordinal logistic regression model is and how it differs from binary logistic
regression.
Ordinal logistic regression is used to model ordinal outcome variables, where the categories have a natural order. It models the log-odds of being at or above each category threshold, estimating the probability of being in a higher or equal category as a function of predictor variables. The model assumes that the relationship between each pair of outcome categories is the same (the proportional odds assumption).
Binary logistic regression is used to model binary outcome variables, that is, outcomes with two possible values. It models the log-odds of one category (usually coded as 1) relative to the other category (coded as 0) and estimates the probability of the outcome being in one of the two categories as a function of predictor variables. The model has no special assumptions beyond those of logistic regression.
Provide an example scenario where ordinal logistic regression would be appropriate.
A study is conducted at a school to investigate factors contributing to a decline in students' national exam grades. Teachers are interested in understanding the influence of various factors on the likelihood of students performing well in exams. The outcome variable is ordinal with three levels: "unlikely," "somewhat likely," and "likely". The factors to be examined are early attendance, specific lectures, teaching method, and tuition fees.

Perform ordinal logistic regression analysis on a dataset in R and interpret the results.
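A minimal sketch using MASS::polr, assuming a hypothetical data frame exams with an ordered outcome likelihood ("unlikely" < "somewhat likely" < "likely") and predictors attendance and tuition:

library(MASS)
# polr() requires the response to be an ordered factor
exams$likelihood <- factor(exams$likelihood,
                           levels = c("unlikely", "somewhat likely", "likely"),
                           ordered = TRUE)
fit <- polr(likelihood ~ attendance + tuition, data = exams, Hess = TRUE)
summary(fit)      # coefficients are cumulative log-odds across the thresholds
exp(coef(fit))    # proportional odds ratios

Under the proportional odds assumption, each exponentiated coefficient is the multiplicative change in the odds of being in a higher outcome category for a one-unit increase in that predictor, the same at every threshold.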
Multinomial Logistic Regression:
Compare and contrast multinomial logistic regression with other types of regression models.

Multinomial logistic regression is used to model nominal outcome variables, estimating the log-
odds of each category relative to a baseline category as a function of predictor variables,
without specific assumptions about the relationship between categories.

Binary logistic regression, on the other hand, models the log-odds of a binary outcome (coded
as 1 vs. 0) as a linear combination of predictor variables and is used for predicting the
probability of one of two possible outcomes.

Ordinal logistic regression models ordinal outcome variables by estimating the log-odds of
being at or above each category threshold as a function of predictor variables, assuming a
natural order of categories and the proportional odds assumption.

Poisson regression is used to model count data by estimating the log of the expected count as a
linear combination of predictor variables, assuming that the mean and variance of the counts
are equal.

For count data with overdispersion, negative binomial regression is more appropriate, as it relaxes the equidispersion assumption.

Linear regression is used to model continuous outcome variables by estimating the mean of the outcome as a linear combination of predictor variables, assuming a linear relationship between predictors and outcome, homoscedasticity, and normality of residuals.

Use R to fit a multinomial logistic regression model to categorical outcome data and interpret
the output.
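A minimal sketch using nnet::multinom on the built-in iris data, treating the three-category Species variable as a nominal outcome:

library(nnet)
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(fit)      # log-odds of each species relative to the baseline (setosa)
exp(coef(fit))    # relative risk ratios versus the baseline category

Each row of the coefficient matrix compares one category to the baseline; exponentiating gives the multiplicative change in the relative risk of that category per unit increase in the predictor.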

Discuss the assumptions underlying multinomial logistic regression and how they are
assessed.
The following are assumptions of multinomial logistic regression:
1. Independence of irrelevant alternatives, that is, the odds of preferring one outcome category over another do not depend on the presence or absence of other categories. It can be tested with the Hausman-McFadden test.
2. Non-perfect separation: if the groups of the outcome variable are perfectly separated by the predictors, then unrealistic coefficients will be estimated and effect sizes will be greatly exaggerated.
3. No multicollinearity, meaning no two or more independent variables are highly correlated with each other. It can be tested using Variance Inflation Factor (VIF) values; VIF values greater than 10 indicate significant multicollinearity.

Probit Regression:
Explain the concept of probit regression and its application compared to logistic regression.

The probit model is used to model binary outcomes. It does this by assuming that the
probability of the outcome follows a cumulative standard normal distribution. The probit model
employs a probit link function, where the inverse of the cumulative standard normal
distribution is modeled as a linear combination of the predictors.

In contrast, logistic regression uses the cumulative logistic distribution and a logit link function.
Specifically, the logistic model maps the linear combination of predictors to the cumulative
logistic distribution. Although both models are used for binary outcomes, they differ in terms of
interpreting the coefficients.

For logistic regression, the coefficient represents the change in the log odds of the dependent variable (y) for a one-unit increase in the predictor variable. Exponentiating the coefficient gives the odds ratio, indicating how the odds of the outcome change with the predictor. For the probit model, the coefficient represents the change in the z-score (the standard normal deviate) for a one-unit increase in the predictor variable.

Implement probit regression in R for a given dataset and discuss the interpretation of results.
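A minimal sketch, reusing the built-in mtcars data with am as the binary outcome, fitted with the probit link in glm():

fit <- glm(am ~ hp + wt, data = mtcars, family = binomial(link = "probit"))
summary(fit)   # each coefficient is the change in the z-score of the latent
               # normal variable for a one-unit increase in the predictor

The coefficients are not odds ratios; they shift the latent standard normal index, and predicted probabilities are recovered through the normal CDF, e.g. via predict(fit, type = "response").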

Discuss situations where probit regression might be preferred over logistic regression.
If the underlying latent variable that determines the binary outcome is assumed to follow a normal distribution, then probit regression is more appropriate.
Probit regression is particularly suited for experiments where the assumption of normality of the latent variable holds true. On the other hand, logistic regression is often preferred for observational studies.
Poisson Regression:
Define the Poisson regression model and its assumptions.
The Poisson regression model is used to model count data, that is, response variables (Y) that are counts. It estimates the log of the expected count as a linear combination of predictor variables, log(E[Y]) = β0 + β1x1 + ⋯ + βnxn, assuming that the mean and variance of the counts are equal.

The following are assumptions made when using the Poisson regression model:

1. The response Y has a Poisson distribution.


2. The dependent variable Y consists of count data: integer values that are zero or greater.
3. There are one or more independent variables, which can be measured on a continuous,
ordinal or nominal/dichotomous scale.
4. There is independence of observations, that is, each observation is independent of the other observations.
5. The mean and variance of the model are identical.

Provide an example of when Poisson regression is appropriate.


Count data such as the New York bicycle dataset, which records daily bicycle counts crossing the city's bridges, is a natural fit for Poisson regression; the "Total" column, the count across all bridges combined, would generally be the outcome variable of interest.
Perform Poisson regression analysis using R and interpret the coefficients.
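A minimal sketch, assuming a hypothetical bikes data frame with a daily count column Total and a temperature predictor Temp:

fit <- glm(Total ~ Temp, data = bikes, family = poisson)
summary(fit)      # coefficients are on the log scale
exp(coef(fit))    # multiplicative change in the expected count per unit of Temp

An exponentiated coefficient of, say, 1.05 for Temp would mean the expected daily count rises by about 5% for each additional degree.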

Negative Binomial Regression:


Describe the situations where negative binomial regression is preferred over Poisson
regression.
If the mean and variance are not equal, resulting in overdispersion (variance greater than the mean), then negative binomial regression is preferred.
If the count data contain a number of zeros exceeding what would be expected under a Poisson distribution, negative binomial regression would be preferred.
If the count data exhibit heterogeneity, negative binomial regression would be preferred over Poisson regression.
Implement negative binomial regression in R and compare the results with Poisson
regression.
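A minimal sketch comparing the two fits, reusing the hypothetical bikes data frame from above:

library(MASS)
pois <- glm(Total ~ Temp, data = bikes, family = poisson)
nb   <- glm.nb(Total ~ Temp, data = bikes)   # also estimates a dispersion parameter theta
summary(nb)
AIC(pois, nb)   # a clearly lower AIC for nb points to overdispersion

Under overdispersion the negative binomial fit typically gives similar coefficient estimates to the Poisson fit but larger, more honest standard errors.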

Discuss the interpretation of the dispersion parameter in negative binomial regression.


In negative binomial regression the overdispersion is represented by a dispersion parameter (reported as theta in R's glm.nb). It tells us about the additional variability in the count data which cannot be explained by the Poisson model. In the common textbook parameterization Var(Y) = μ + αμ² with α = 1/θ, a larger α (smaller θ) indicates more overdispersion, and as α → 0 (θ → ∞) the model reduces to the Poisson model, as there is no extra variability.
Quasi-Poisson and Log-linear Models:
Explain the purpose of using Quasi-Poisson models and how they differ from traditional
Poisson models.
Quasi-Poisson models differ from traditional Poisson models in that they accommodate overdispersion in the count variable. Although the Poisson model assumes that the mean and the variance are always equal, this is often not true in practice: there are times when the variance is greater than the mean, and in such scenarios we use the quasi-Poisson model, which assumes that the variance is a linear function of the mean.
Illustrate the application of Quasi-Poisson regression in R with an example dataset.
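A minimal sketch, again on the hypothetical bikes data frame:

fit <- glm(Total ~ Temp, data = bikes, family = quasipoisson)
summary(fit)   # reports a dispersion parameter phi; the variance is modelled
               # as phi * mean, and phi = 1 recovers the ordinary Poisson fit

The point estimates match the Poisson fit, but the standard errors are inflated by sqrt(phi), which widens confidence intervals when the data are overdispersed.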

Discuss the advantages and limitations of using log-linear models for count data analysis.
Log-linear models are advantageous in that:
1. They can analyze count data in multidimensional contingency tables, enabling us to understand the relationships between categorical variables.
2. They are simple, easy to understand, and have the flexibility associated with ANOVA and regression.
Their disadvantages for count data analysis are:
1. They can become highly complex with high-dimensional data.
2. They cannot handle overdispersion as effectively as quasi-Poisson regression.

Kernel Smoothing:
Define kernel smoothing and its use in non-parametric regression.
Kernel smoothing is an extension of kernel density estimation to regression problems. It is used to fit a smooth curve through the data points and can model a non-linear relationship between the outcome variable and the predictors. It uses a kernel function to weight nearby observations when estimating the regression function at a particular point.
Implement kernel smoothing in R to fit a smooth curve to data and interpret the bandwidth
parameter.
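A minimal sketch with base R's ksmooth() on the built-in cars data (speed versus stopping distance):

fit <- ksmooth(cars$speed, cars$dist, kernel = "normal", bandwidth = 5)
plot(cars$speed, cars$dist)   # raw scatter of the data
lines(fit)                    # overlay the kernel-smoothed curve

The bandwidth controls the bias-variance trade-off: a larger bandwidth averages over more neighbours, giving a smoother but potentially biased curve, while a smaller bandwidth follows the data more closely but is noisier.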
Compare kernel smoothing with parametric regression methods like linear regression.
Kernel smoothing is a non-parametric technique, while methods like linear regression are parametric.
Kernel smoothing does not assume any particular form for the relationship between the dependent and independent variables, while methods like linear regression assume a linear relationship between them.
Kernel smoothing is a technique used to fit a smooth curve through the data points, while methods like linear regression fit a fixed regression line (or curve) to the variables.
Splines:
Explain the concept of splines and how they are used to model non-linear relationships.
Spline smoothing is a technique that involves fitting a piecewise continuous curve (spline) to the data. It is used when the relationship between a dependent variable and an independent variable is not captured properly by a linear model. Here the dataset is divided into bins at points called knots, and each bin gets its own separate fit, with the pieces joining smoothly at the knots.
Perform spline regression in R using natural splines and compare with polynomial regression.
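A minimal sketch comparing a natural spline fit with polynomial regression on the built-in cars data:

library(splines)
ns_fit   <- lm(dist ~ ns(speed, df = 4), data = cars)   # natural cubic spline
poly_fit <- lm(dist ~ poly(speed, 4), data = cars)      # degree-4 polynomial
AIC(ns_fit, poly_fit)   # compare the two fits at similar flexibility

With comparable degrees of freedom, the natural spline is constrained to be linear beyond the boundary knots, so it tends to behave better at the edges of the data than the global polynomial.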

Discuss the advantages of using splines over polynomial regression for modeling complex
curves.
The advantages of using splines over polynomials are:
1. While polynomial regression can only capture a certain amount of curvature in a nonlinear relationship, splines can model flexible nonlinear relationships without this limitation.
2. Splines provide a smooth interpolation between fixed points called knots, whereas polynomial regression fits one global curve across the whole range of the data.
3. They are able to avoid many of the pitfalls associated with polynomial regression, such as overfitting and instability, making them a preferred choice for modeling complex curves.
Generalized Additive Models (GAM):
Define GAM and explain its advantages over traditional linear models.
A GAM is a generalized linear model in which the dependent variable depends linearly on unknown smooth functions of some independent variables, and interest focuses on inference about these smooth functions.
It has several advantages over traditional linear models:
1. GAMs address the limitation of traditional linear models, namely the assumption of a linear relationship between the dependent and independent variables, by allowing flexible modelling of these relationships through smoothing functions.
2. They are able to capture patterns in the data which can be missed by traditional linear models.
Fit a GAM in R to a dataset containing non-linear relationships and interpret the results.
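A minimal sketch with the mgcv package on the built-in mtcars data, smoothing mpg as a function of horsepower and weight:

library(mgcv)
fit <- gam(mpg ~ s(hp) + s(wt), data = mtcars)
summary(fit)   # an edf well above 1 for a smooth term signals nonlinearity
plot(fit)      # plot each estimated smooth function with confidence bands

The effective degrees of freedom (edf) of each smooth term indicate how nonlinear the fitted relationship is; an edf near 1 means the term is effectively linear.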

Compare GAM with other non-parametric regression methods like regression trees and
random forest.
GAM is flexible and relatively easy to interpret and uses smoothing functions, but its downside is that it requires careful selection of smoothing parameters, and it may not capture interactions as well as tree-based models. It is used to model non-linear relationships between the variables.
Regression trees are easy to interpret and can also model non-linear relationships between the variables, and they handle categorical variables and their interactions well. Their disadvantage is that they can overfit and lack smoothness, producing piecewise-constant fits unlike GAMs.
Random forests are commonly used on high-dimensional datasets with complex interactions. They have high predictive accuracy and are robust to overfitting. The disadvantage of this method is that it is less interpretable than GAMs and requires heavy computational power.
Panel Data Analysis:
Define panel data and distinguish between balanced and unbalanced panel datasets.
Panel data, also known as longitudinal data or cross-sectional time-series data, is data collected over time for the same entities, e.g. countries. This type of data combines cross-sectional data, which is data collected at one point in time, with time-series data, which is data collected over multiple time periods. For example, you might have data on GDP growth rates for 10 countries over 10 years.
A balanced panel dataset is when the data has the same number of observations for each entity across all time periods. This means there are no missing values for any time period for any entity in the dataset. Hence balanced data is when we have all the data for all the countries, that is, each of the 10 countries has GDP growth rate data for all 10 years without missing observations.
An unbalanced panel dataset is when the data has a different number of observations for different entities over time. This means there are missing values for some time periods for some entities in the dataset. Hence unbalanced data would be when you have 10 years of GDP growth data for 8 countries while the other 2 have only 7 years.
Implement fixed effects and random effects models in R for panel data and interpret the
results.
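A minimal sketch with the plm package, assuming a hypothetical data frame gdp with columns country, year, growth, invest and trade:

library(plm)
pdata  <- pdata.frame(gdp, index = c("country", "year"))
fixed  <- plm(growth ~ invest + trade, data = pdata, model = "within")
random <- plm(growth ~ invest + trade, data = pdata, model = "random")
summary(fixed)
phtest(fixed, random)   # Hausman test: a small p-value favors fixed effects

The fixed effects ("within") estimator uses only variation over time within each country, while the random effects estimator also uses variation between countries; the Hausman test checks whether the two sets of estimates differ systematically.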

Discuss the advantages of using panel data models over cross-sectional models.
The advantages of panel data over cross-sectional models are that:
1. They can control for unobserved heterogeneity by accounting for individual-specific effects, providing more consistent estimates than cross-sectional models.
2. Unlike cross-sectional models, which only capture a snapshot at one point in time, panel data models allow you to analyze changes over time.
3. Panel data helps in identifying temporal variations and understanding how causal impacts might fluctuate over time. This is useful for observing how short-term changes influence outcomes.
4. By using the timing of changes within individuals over multiple periods, panel data
models can better distinguish between causal relationships. This timing information
helps in identifying whether changes in one variable cause changes in another.
Generalized Estimating Equations (GEE):
Explain the concept of GEE and how it differs from traditional regression models.
GEE is a method for modeling longitudinal or clustered data. It is often used with non-normal responses such as binary or count data. It differs from other traditional regression models in that:
1. It seeks to model a population average: unlike subject-specific (mixed) models, the parameter estimates are marginal, describing the average response in the population rather than being conditional on the subject or cluster.
2. It allows us to specify a working correlation structure for the different responses within a subject or group.
3. It is used for simple clustering or repeated measures.

Apply GEE in R to analyze longitudinal data and interpret the population-averaged effects.
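A minimal sketch with the geepack package, assuming a hypothetical longitudinal data frame visits (sorted by subject) with a binary outcome, a treatment indicator, a time variable and a subject identifier id:

library(geepack)
fit <- geeglm(outcome ~ treatment + time, id = id, data = visits,
              family = binomial, corstr = "exchangeable")
summary(fit)   # robust (sandwich) standard errors; coefficients are
               # population-averaged (marginal) effects on the logit scale

Each coefficient describes how the log-odds of the outcome change per unit change in the predictor on average across the population, rather than within any single subject.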

Discuss the assumptions of GEE and how they are assessed in practice.
GEE assumes that the mean structure is correctly specified; this can be assessed through goodness-of-fit checks or residual plots.
GEE assumes a working correlation structure for the different responses within a subject or group; this can be assessed using criteria such as the quasi-likelihood under the independence model criterion (QIC).
GEE assumes stationarity, which can be assessed by plotting residuals over time.
