R Handbook - Regression For Count Data
https://rcompanion.org/handbook/J_01.html 1/13
27/03/2024, 11:09 R Handbook: Regression for Count Data
• DescTools
• ggplot2
• car
• multcompView
• emmeans
• MASS
• pscl
• rcompanion
• robust
The following commands will install these packages if they are not already installed:
if(!require(psych)){install.packages("psych")}
if(!require(hermite)){install.packages("hermite")}
if(!require(lattice)){install.packages("lattice")}
if(!require(plyr)){install.packages("plyr")}
if(!require(boot)){install.packages("boot")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(car)){install.packages("car")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(emmeans)){install.packages("emmeans")}
if(!require(MASS)){install.packages("MASS")}
if(!require(pscl)){install.packages("pscl")}
if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(robust)){install.packages("robust")}
Data = read.table(textConnection(Input),header=TRUE)
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
Histograms
library(lattice)
histogram(~ Monarchs | Garden,
data=Data,
layout=c(1,3) # columns and rows of individual plots
)
An alternative approach for data with many zeros is zero-inflated Poisson regression.
For further discussion, see the “Count data may not be appropriate for common parametric tests”
section in the Introduction to Parametric Tests chapter.
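As a rough first check for excess zeros, the observed share of zero counts can be compared with the share that a Poisson distribution with the same mean would predict. The sketch below uses a made-up vector, toy.counts, which is an assumption for illustration and not the monarch data. It is a quick diagnostic rather than a formal test, and with grouped data the comparison would be more sensible within each group.

```r
### Hypothetical counts (not the Monarchs data), chosen to have many zeros
toy.counts = c(0, 0, 0, 0, 0, 0, 2, 3, 4, 5, 6, 4)

### Observed proportion of zeros
observed.zero = mean(toy.counts == 0)

### Proportion of zeros expected under a Poisson distribution
###   with the same mean:  P(Y = 0) = exp(-lambda)
lambda        = mean(toy.counts)
expected.zero = dpois(0, lambda)

observed.zero
expected.zero
```

An observed proportion well above the expected one suggests that a zero-inflated model may be worth considering.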
Note that model assumptions and pitfalls of this approach are not discussed here. The reader is urged to
understand the assumptions of this kind of modeling before proceeding.
model.p = glm(Monarchs ~ Garden,
data=Data,
family="poisson")
library(car)
Anova(model.p,
type="II",
test="LR")
Analysis of Deviance Table (Type II tests)
       LR Chisq Df Pr(>Chisq)
Garden   66.463  2  3.697e-15 ***
library(rcompanion)
nagelkerke(model.p)
$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.387929
Cox and Snell (ML)                   0.937293
Nagelkerke (Cragg and Uhler)         0.938037

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq    p.value
      -2     -33.231 66.463 3.6967e-15
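The likelihood ratio test in this output can be checked by hand from the printed values. The sketch below only re-does the arithmetic: the statistic is twice the gain in log-likelihood of the fitted model over the null model, and the p-value is the upper tail of a chi-square distribution with degrees of freedom equal to the difference in the number of parameters. The variable names are chosen here for illustration.

```r
### Values copied from the nagelkerke output above
LogLik.diff = -33.231
Df.diff     = -2

### Likelihood ratio statistic: twice the gain in log-likelihood
Chisq = -2 * LogLik.diff

### Upper-tail chi-square probability on 2 degrees of freedom
p.value = pchisq(Chisq, df = abs(Df.diff), lower.tail = FALSE)

Chisq
p.value
```

Up to rounding of the printed log-likelihood difference, the results should agree with the Chisq and p.value columns, and with the LR Chisq reported by the Anova function for the Poisson model.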
library(multcompView)
library(multcomp)   ### Supplies the cld function in recent versions of emmeans
library(emmeans)
marginal = emmeans(model.p,
~ Garden)
pairs(marginal,
adjust="tukey")
cld(marginal,
alpha=0.05,
Letters=letters, ### Use lower-case letters for .group
adjust="tukey") ### Tukey adjustment for multiple comparisons
This example will use the glm.nb function in the MASS package. The Anova function in the car package
will be used for an analysis of deviance, and the nagelkerke function will be used to determine a p-value
and pseudo R-squared value for the model. Post-hoc analysis can be conducted with the emmeans
package.
Note that model assumptions and pitfalls of this approach are not discussed here. The reader is urged to
understand the assumptions of this kind of modeling before proceeding.
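To give a feel for what negative binomial regression adds over Poisson regression, the sketch below fits both models to a small made-up data set and compares them. The data frame Toy and its values are assumptions for illustration only and are not the monarch data. The glm.nb function is from the MASS package, and AIC is from base R.

```r
library(MASS)

### Hypothetical overdispersed counts (not the Monarchs data)
Toy = data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 8)),
  Count = c( 0,  1,  0,  5,  2,  9,  0,  3,
             2, 15,  4,  0, 22,  7,  1, 11,
            30,  5, 12, 48,  9, 21,  3, 17))

toy.p  = glm(Count ~ Group, data = Toy, family = "poisson")
toy.nb = glm.nb(Count ~ Group, data = Toy)

### theta is the negative binomial dispersion parameter;
###   variance = mu + mu^2 / theta, so a smaller theta
###   corresponds to more overdispersion
toy.nb$theta

### A lower AIC suggests the better-fitting model for these data
AIC(toy.p, toy.nb)
```

When the data are not overdispersed, the estimated theta tends to be very large and the negative binomial fit becomes nearly identical to the Poisson fit.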
library(MASS)
model.nb = glm.nb(Monarchs ~ Garden,
data=Data,
control = glm.control(maxit=10000))
library(car)
Anova(model.nb,
type="II",
test="LR")
Analysis of Deviance Table (Type II tests)
       LR Chisq Df Pr(>Chisq)
Garden   66.464  2  3.694e-15 ***
library(rcompanion)
nagelkerke(model.nb)
$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.255141
Cox and Snell (ML)                   0.776007
Nagelkerke (Cragg and Uhler)         0.778217

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq    p.value
      -2     -17.954 35.907 1.5952e-08
library(multcompView)
library(multcomp)   ### Supplies the cld function in recent versions of emmeans
library(emmeans)
marginal = emmeans(model.nb,
~ Garden)
pairs(marginal,
adjust="tukey")
cld(marginal,
alpha = 0.05,
Letters = letters, ### Use lower-case letters for .group
type = "response", ### Report emmeans in original scale
adjust = "tukey") ### Tukey adjustment for multiple comparisons
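The type = "response" option matters because Poisson models, and negative binomial models fit with the default settings of glm.nb, use a log link, so estimates are on the log scale unless they are back-transformed. The sketch below illustrates the back-transformation with base R only. The data frame Toy and the object names are assumptions for illustration and are not the monarch data.

```r
### Hypothetical data (not the Monarchs data)
Toy = data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 4)),
  Count = c(0, 1, 2, 1,   4, 6, 5, 9,   10, 14, 9, 15))

toy.model = glm(Count ~ Group, data = Toy, family = "poisson")

New = data.frame(Group = factor(c("A", "B", "C")))

### Predictions on the link (log) scale
log.scale = predict(toy.model, newdata = New, type = "link")

### Back-transforming with exp() gives the response (count) scale
response.scale = exp(log.scale)

### For a model with a single factor, these match the group means
response.scale
tapply(Toy$Count, Toy$Group, mean)
```

Differences between groups on the log scale correspond to ratios of means on the response scale, which is why pairwise comparisons from these models are often reported as ratios.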
library(pscl)
model.zi = zeroinfl(Monarchs ~ Garden,
data = Data,
dist = "poisson")
### dist = "negbin" may be used
summary(model.zi)
Call:
zeroinfl(formula = Monarchs ~ Garden | Garden, data = Data, dist = "poisson")
library(car)
Anova(model.zi,
type="II",
test="Chisq")
Analysis of Deviance Table (Type II tests)
       Df  Chisq Pr(>Chisq)
Garden  2 23.914  6.414e-06 ***
library(rcompanion)
nagelkerke(model.zi)
$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.284636
Cox and Snell (ML)                   0.797356
Nagelkerke (Cragg and Uhler)         0.800291

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq    p.value
      -4     -19.156 38.311 9.6649e-08
library(multcompView)
library(multcomp)   ### Supplies the cld function in recent versions of emmeans
library(emmeans)
marginal = emmeans(model.zi,
~ Garden)
pairs(marginal,
adjust="tukey")
cld(marginal,
alpha=0.05,
Letters=letters, ### Use lower-case letters for .group
adjust="tukey") ### Tukey adjustment for multiple comparisons
P value adjustment: tukey method for comparing a family of 3 estimates
significance level used: alpha = 0.05
### Note, emmeans are on the original measurement scale
model.rob.null = glmRob(Monarchs ~ 1,
data = Data,
family = "poisson")
anova(model.rob.null, model.rob, test="Chisq")
Quasi-Poisson regression
Quasi-Poisson regression is useful since it has a variable dispersion parameter, so that it can model
overdispersed data. It may be better than negative binomial regression in some circumstances (Verhoef and
Boveng, 2007).
At the time of writing, quasi-Poisson regression doesn’t have a complete set of support functions in R.
Using the quasipoisson family option in the glm function, the results will have the same parameter
coefficients as with the poisson option, but the inference statistics are adjusted in the summary function.
The Anova function in the car package can be used for an analysis of deviance table, and the emmeans
package can be used for post-hoc comparisons. Since the model doesn’t produce a log-likelihood value, I
don’t know of a way to produce a p-value for the model or a pseudo R-squared value for the model.
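The relationship between the poisson and quasipoisson family options can be seen directly with base R. In the sketch below, the coefficients from the two fits are the same, while the quasi-Poisson standard errors equal the Poisson standard errors multiplied by the square root of the estimated dispersion. The data frame Toy and the object names are assumptions for illustration and are not the monarch data.

```r
### Hypothetical overdispersed data (not the Monarchs data)
Toy = data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 4)),
  Count = c(0, 3, 0, 9,   2, 15, 4, 25,   30, 5, 12, 48))

toy.p  = glm(Count ~ Group, data = Toy, family = "poisson")
toy.qp = glm(Count ~ Group, data = Toy, family = "quasipoisson")

### Coefficients are the same for the two fits
coef(toy.p)
coef(toy.qp)

### Estimated dispersion: Pearson chi-square / residual df
phi = summary(toy.qp)$dispersion
phi

### Quasi-Poisson standard errors are the Poisson ones times sqrt(phi)
se.p  = summary(toy.p)$coefficients[, "Std. Error"]
se.qp = summary(toy.qp)$coefficients[, "Std. Error"]
cbind(se.p, se.qp, ratio = se.qp / se.p, sqrt.phi = sqrt(phi))
```

A dispersion estimate well above 1 indicates overdispersion relative to the Poisson model, and the wider standard errors make the tests correspondingly more conservative.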
model.qp = glm(Monarchs ~ Garden,
data=Data,
family="quasipoisson")
library(car)
Anova(model.qp,
type="II",
test="LR")
Response: Monarchs
       LR Chisq Df Pr(>Chisq)
Garden   52.286  2  4.429e-12 ***
library(multcompView)
library(multcomp)   ### Supplies the cld function in recent versions of emmeans
library(emmeans)
marginal = emmeans(model.qp,
~ Garden)
pairs(marginal,
adjust="tukey")
cld(marginal,
alpha=0.05,
Letters=letters,
adjust="tukey")
Hermite regression
The generalized Hermite distribution is a more general distribution that can handle overdispersion or
multimodality (Moriña and others, 2015). This makes generalized Hermite regression a powerful and
flexible tool for modeling count data. It is implemented with the hermite package.
Fitting models with the hermite package can be somewhat difficult. One issue is that model fitting may
fail without some parameters being specified. Often specifying an appropriate value for the m option will
help.
A further difficulty with this approach is that, at the time of writing, the package isn’t supported by the
anova function to compare models, the Anova function to test effects, or other useful functions like
emmeans for factor effects.
The hermite package is used to conduct Hermite regression. Here, the m=3 option is specified. Often the
default m=NULL can be used. In this case, however, if the m value is not specified, the function cannot
complete the model fitting, and errors are produced. Using m=2 often works. Here, m=3 was used because
it produced a model with a lower AIC than did the m=2 option.
library(hermite)
summary(model)
Coefficients:
Estimate Std. Error z value p-value
(Intercept) 0.5081083 0.3251349 1.5627612 1.181088e-01
GardenB 1.3700567 0.3641379 3.7624662 1.682461e-04
GardenC 1.9596153 0.3476326 5.6370291 1.730089e-08
dispersion.index 1.0820807 0.2877977 0.1281707 3.601681e-01
order 3.0000000 NA NA NA
(Likelihood ratio test against Poisson is reported by *z value* for *dispersion.index*)
AIC: 112.7762
However, this approach does not represent any information learned from the Hermite regression.
A second issue is that, because the dependent variable is not continuous, the distribution of the
bootstrapped confidence intervals is not likely to be continuous, and so the intervals may not be reliable.
To get confidence intervals for the medians for each group, we will use the groupwiseMedian function.
Here I used the percentile method for confidence intervals.
library(rcompanion)
annotate("text",
x = 1:3,
y = c(5, 10, 15),
label = c("Group 3", "Group 2", "Group 1"))
Omnibus test
Tabla = xtabs(Monarchs ~ Garden,
data = Data)
Tabla
Garden
A B C
14 52 94
chisq.test(Tabla)
Chi-squared test for given probabilities
chisq.test(x = observed,
p = expected)
chisq.test(x = observed,
p = expected)
Chi-squared test for given probabilities
chisq.test(x = observed,
p = expected)
Chi-squared test for given probabilities
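As a minimal sketch of how the observed and expected objects in calls like these might be set up, the example below uses the garden totals shown in the table above (A = 14, B = 52, C = 94). It assumes equal expected proportions for each garden, which is only reasonable if the sampling effort was the same in each. The object names omnibus and pair.AB are chosen here for illustration.

```r
### Garden totals from the table above
observed = c(A = 14, B = 52, C = 94)

### Assumption: equal expected proportions for each garden
expected = rep(1/3, 3)

### Omnibus goodness-of-fit test across the three gardens
omnibus = chisq.test(x = observed, p = expected)
omnibus

### A pairwise comparison, here A vs. B,
###   again with equal expected proportions
pair.AB = chisq.test(x = observed[c("A", "B")],
                     p = c(0.5, 0.5))
pair.AB
```

If several pairwise tests are conducted, their p-values would usually need an adjustment for multiple comparisons.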
Optional analysis: Vuong test to compare Poisson, negative binomial, and zero-inflated models
The Vuong test, implemented by the pscl package, can test two non-nested models. It works with negbin,
zeroinfl, and some glm model objects which are fitted to the same data.
The null hypothesis is that there is no difference between the models. The function produces three tests,
a “Raw” test, an AIC-corrected test, and a BIC-corrected test, any of which could be used.
It has been suggested that the Vuong test not be used to test for zero-inflation (Wilson, 2015).
Define models
model.p = glm(Monarchs ~ Garden,
data=Data,
family="poisson")
library(MASS)
model.nb = glm.nb(Monarchs ~ Garden,
data=Data,
control = glm.control(maxit=10000))
library(pscl)
model.zi = zeroinfl(Monarchs ~ Garden,
data = Data,
dist = "poisson")
Vuong test
library(pscl)
vuong(model.p,
model.nb,
digits = 4)
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw 0.03324988 model1 > model2 0.48674
AIC-corrected 0.03324988 model1 > model2 0.48674
BIC-corrected 0.03324988 model1 > model2 0.48674
### Positive Vuong z-statistic suggests that model 1 is superior,
### but, in this case, the difference is not significant,
### and the value of the statistic is probably too tiny to be meaningful.
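The p-value in this output can be checked against the z-statistic, since the statistic is asymptotically standard normal under the null hypothesis. The sketch below only re-does that arithmetic for the one-sided alternative that model 1 is better than model 2. The variable names are chosen here for illustration.

```r
### Raw Vuong z-statistic copied from the output above
z = 0.03324988

### One-sided p-value from the standard normal distribution
p.value = pnorm(z, lower.tail = FALSE)
p.value
```

A z-statistic this close to zero gives a p-value close to 0.5, in line with the printed value.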
vuong(model.p,
model.zi,
digits = 4)
vuong(model.nb,
model.zi,
digits = 4)
References
Moriña, D., M. Higueras, P. Puig, and M. Oliveira. 2015. Generalized Hermite Distribution
Modelling with the R Package hermite. The R Journal 7(2):263–274.
journal.r-project.org/archive/2015-2/morina-higueras-puig-etal.pdf.
help(package="hermite")
library(hermite); ?glm.hermite
library(MASS); ?glm.nb
library(pscl); ?zeroinfl
library(pscl); ?vuong
“Simple Logistic Regression” in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological
Statistics, version 1.09. rcompanion.org/rcompanion/e_06.html.
"Generalized linear model: Link function". No date. Wikipedia. Retrieved 31 Jan. 2016.
en.wikipedia.org/wiki/Generalized_linear_model#Link_function.
Verhoef, J.M. and P.L. Boveng. 2007. Quasi-Poisson vs. negative binomial regression: How should we
model overdispersed count data? Ecology 88(11): 2766–2772.
Wilson, P. 2015. The Misuse of the Vuong Test for Non-Nested Models to Test for Zero-Inflation.
Economics Letters 127: 51–53. cybermetrics.wlv.ac.uk/paperdata/misusevuong.pdf
Grace-Martin, K. No date. "Regression Models for Count Data". The Analysis Factor.
www.theanalysisfactor.com/regression-models-for-count-data/.
Grace-Martin, K. No date. "Zero-Inflated Poisson Models for Count Outcomes". The Analysis Factor.
www.theanalysisfactor.com/zero-inflated-poisson-models-for-count-outcomes/.
Citation
Mangiafico, S.S. 2016. Summary and Analysis of Extension Program Evaluation in R, version 1.20.05, revised 2023.
rcompanion.org/handbook/. (Pdf version: rcompanion.org/documents/RHandbookProgramEvaluation.pdf.)