Lecture 13: Introduction To Generalized Linear Models: 21 November 2007
Introduction
Recall that we've looked at linear models, which specify a conditional probability density $P(Y \mid X)$ of the form

$$ Y = \alpha + \beta_1 X_1 + \cdots + \beta_n X_n + \epsilon \qquad (1) $$
Linear models thus assume that the only stochastic part of the data is the normally-distributed noise around the predicted mean. Yet many (most?) types of data do not meet this assumption at all. These include:

- Continuous data in which noise is not normally distributed;
- Count data, in which the outcome is restricted to non-negative integers;
- Categorical data, where the outcome is one of a number of discrete classes.

One of the important developments in statistical theory over the past several decades has been the broadening of linear models from the classic form given in Equation (1) to encompass a much more diverse class of probability distributions. This is the class of generalized linear models (GLMs). The next section will describe, step by step, how the generalization from classic linear models is attained.
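As a concrete illustration, here is a minimal sketch with simulated, hypothetical data (not drawn from any dataset used in these notes) contrasting a continuous response with additive normal noise, where a classic linear model is appropriate, with a count response, where it is not:

set.seed(1)
x <- runif(100)
# Continuous response with normally distributed noise: the classic linear model applies.
y.continuous <- 2 + 3 * x + rnorm(100, sd = 0.5)
# Count response: non-negative integers, so additive normal noise is the wrong model.
y.count <- rpois(100, lambda = exp(0.5 + 1.5 * x))
summary(lm(y.continuous ~ x))                   # fine for the continuous case
summary(glm(y.count ~ x, family = poisson))     # a GLM handles the count case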
1.1 Generalizing the linear model
The right-hand side of Equation (1) has two components: a deterministic component determining the predicted mean, and a stochastic component expressing the noise distribution around that mean:
$$ Y = \underbrace{\alpha + \beta_1 X_1 + \cdots + \beta_n X_n}_{\text{Predicted Mean}} + \underbrace{\epsilon}_{\text{Noise} \sim N(0,\sigma^2)} \qquad (2) $$
The first step from classic linear models to generalized linear models is to break these two components apart and specify a more indirect functional relationship between them. We start with the idea that for any particular set of predictor variables $\{X_i\}$, there is a predicted mean $\mu$. The probability distribution on the response $Y$ is a function of that $\mu$.¹ We'll review here what this means for linear models, writing both the abbreviated form of the model and the resulting probability density on the response $Y$:

$$ Y = \mu + \epsilon \qquad (3) $$
$$ \epsilon \sim N(0, \sigma^2) \qquad (4) $$
$$ p(Y = y; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \qquad (5) $$

By choosing other functions to map from $\mu$ to $p(y)$, we can get to other probability distributions, such as the Poisson distribution over the non-negative integers (see Lecture 2, Section 5):

$$ P(Y = y; \mu) = \frac{e^{-\mu}\mu^y}{y!} \qquad (6) $$
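To make the role of $\mu$ concrete, here is a small sketch (with an arbitrary value of $\mu$) of how Equations (5) and (6) assign probability to an outcome $y$ in R; the normal case additionally requires a choice of $\sigma$:

mu <- 3
# Normal density at y = 2.5 for predicted mean mu (Equation 5, taking sigma = 1):
dnorm(2.5, mean = mu, sd = 1)
# Poisson probability of the count y = 2 for the same predicted mean mu (Equation 6):
dpois(2, lambda = mu)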
In the second step, we loosen the relationship between the predicted mean and the predictor variables. In the classic linear model of Equation (1), the predicted mean was a linear combination of the predictor variables. In generalized linear models, we call this linear combination $\eta$ and allow the predicted mean $\mu$ to be an invertible function of $\eta$. We call $\eta$ the linear predictor, and call the function relating $\mu$ to $\eta$ the link function:

$$ \eta = \alpha + \beta_1 X_1 + \cdots + \beta_n X_n \qquad \text{(linear predictor)} \qquad (7) $$
$$ l(\mu) = \eta \qquad \text{(link function)} \qquad (8) $$

¹There is actually a further constraint on the functional relationship between $\mu$ and $f(y)$, which I'm not going into; see McCullagh and Nelder (1989) or Venables and Ripley (2002, Chapter 7) for more details.
In classic linear regression, the link function is particularly simple: it is the identity function, so that $\eta = \mu$.

Summary: generalized linear models are a broad class of models predicting the outcome of a response as a function of some linear combination of a set of predictors. To define a GLM, you need to choose (a) a link function relating the linear predictor to the predicted mean of the response; and (b) a function defining the noise or error probability distribution around that mean. For a classical linear model, the link function is the identity function and the noise distribution is normal.
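In R, these two choices correspond directly to the family argument of glm() and its link option. The following sketch uses a small made-up data frame purely for illustration:

d <- data.frame(x = 1:20,
                y.gaus  = rnorm(20, mean = 1:20),    # continuous response
                y.count = rpois(20, lambda = 1:20))  # count response
glm(y.gaus  ~ x, data = d, family = gaussian(link = "identity"))  # the classic linear model
glm(y.count ~ x, data = d, family = poisson(link = "log"))        # a Poisson GLM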
1.2 Logistic regression
Suppose we want a GLM that models binomially distributed data from $n$ trials. We will use a slightly different formulation of the binomial distribution from what we introduced in Lecture 2: instead of viewing the response as the number of successful trials $r$, we view the response as the proportion of successful trials $\frac{r}{n}$; call this $Y$. Now, the mean number of successes for a binomial distribution is $pn$; hence the mean proportion is $p$. Thus $p$ is the predicted mean $\mu$ of our GLM. This gives us enough information to specify precisely the resulting model:

$$ P(Y = y; \mu) = \binom{n}{ny} \mu^{ny} (1-\mu)^{n(1-y)} \qquad (9) $$
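As a quick numerical check of this formulation (with made-up values of $n$, $\mu$, and $y$), the probability in Equation (9) is just the ordinary binomial probability of $ny$ successes in $n$ trials:

n <- 10; mu <- 0.3; y <- 0.4          # y is a proportion: 4 successes out of 10 trials
dbinom(n * y, size = n, prob = mu)    # binomial probability of ny successes
choose(n, n * y) * mu^(n * y) * (1 - mu)^(n * (1 - y))  # Equation (9), same value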
This should look familiar from Lecture 2, Section 2. This is part (b) of designing a GLM: choosing the distribution on $Y$ given the mean $\mu$. Having done this means that we have placed ourselves in the binomial GLM family. The other part of specifying our GLM is (a): choosing a relationship between the linear predictor $\eta$ and the mean $\mu$. Unlike the case with the classical linear model, the identity link function is not a possibility, because $\eta$ can potentially be any real number, whereas the mean proportion of successes can only vary between 0 and 1. There are many link functions that can be chosen to make this mapping valid, but here we will use the logit link function (here we replace $\mu$ with $p$ for simplicity):²

$$ \log \frac{p}{1-p} = \eta \qquad (10) $$

or equivalently,

$$ p = \frac{e^\eta}{1 + e^\eta} \qquad (11) $$
When we plunk the full form of the linear predictor from Equation (7) back in, we arrive at the final formula for logistic regression:

Logistic regression formula:
$$ p = \frac{e^{\alpha + \beta_1 X_1 + \cdots + \beta_n X_n}}{1 + e^{\alpha + \beta_1 X_1 + \cdots + \beta_n X_n}} \qquad (12) $$

This type of model is also called a logit model.
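Numerically, Equation (12) is just the inverse logit of the linear predictor, which R provides as plogis(). A minimal sketch with made-up coefficient values:

alpha <- -1; beta1 <- 2; x1 <- 0.75
eta <- alpha + beta1 * x1                 # the linear predictor (Equation 7)
p <- exp(eta) / (1 + exp(eta))            # Equation (12) written out
all.equal(p, plogis(eta))                 # TRUE: plogis() is the inverse logit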
1.3 Fitting a logistic regression model
The most common criterion by which a logistic regression model is fitted to a dataset is exactly the way that we chose the parameter estimates for a linear regression model: the method of maximum likelihood. That is, we choose the parameter estimates that give our dataset the highest likelihood. We will give a simple example using the dative dataset. The response variable here is whether the recipient was realized as an NP (i.e., the double-object construction) or as a PP (i.e., the prepositional-object construction). This corresponds to the RealizationOfRecipient variable in the dataset. There are several options in R for fitting basic logistic regression models, including glm() in the stats package and lrm() in the Design package. In this case we will use lrm(). We will start with a simple study of the effect of recipient pronominality on the dative alternation. Before fitting a model, we examine a contingency table of the outcomes of the two factors:
²Two other popular link functions for binomial GLMs are the probit link and the complementary log-log link. See Venables and Ripley (2002, Chapter 7) for more details.
> xtabs(~ PronomOfRec + RealizationOfRecipient, dative)
               RealizationOfRecipient
PronomOfRec      NP   PP
  nonpronominal  600  629
  pronominal    1814  220

So sentences with nonpronominal recipients are realized roughly equally often with DO and PO constructions; but sentences with pronominal recipients are realized nearly 90% of the time with the DO construction. We expect our model to be able to encode these findings.

It is now time to construct the model. To be totally explicit, we will choose ourselves which realization of the recipient counts as a success and which counts as a failure (although lrm() will silently make its own decision if given a factor as a response). In addition, our predictor variable is a factor, so we need to use dummy-variable encoding; we will satisfice with the R default of taking the alphabetically first factor level, nonpronominal, as the baseline level.

> response <- ifelse(dative$RealizationOfRecipient=="PP", 1, 0) # code PO realization as success, DO as failure
> lrm(response ~ PronomOfRec, dative)

Logistic Regression Model

lrm(formula = response ~ PronomOfRec, data = dative)
Frequencies of Responses
   0    1
2414  849

 Obs   Max Deriv  Model L.R.  d.f.  P  C      Dxy    Tau-a  R2     Brier
 3263  2e-12      644.08      1     0  0.746  0.492  0.19   0.263  0.154
                          Coef     S.E.     Wald Z  P
Intercept                 0.0472   0.05707    0.83  0.4082
PronomOfRec=pronominal   -2.1569   0.09140  -23.60  0.0000
The thing to pay attention to for now is the estimated coefficients for the intercept and the dummy indicator variable for a pronominal recipient. We can use these coefficients to determine the values of the linear predictor $\eta$ and the predicted mean success rate $p$ using Equations (7) and (12):

$$ \eta_{\text{nonpron}} = 0.0472 + (-2.1569) \times 0 = 0.0472 \qquad \text{(non-pronominal recipient)} \qquad (13) $$
$$ \eta_{\text{pron}} = 0.0472 + (-2.1569) \times 1 = -2.1097 \qquad \text{(pronominal recipient)} \qquad (14) $$
$$ p_{\text{nonpron}} = \frac{e^{0.0472}}{1 + e^{0.0472}} = 0.512 \qquad (15) $$
$$ p_{\text{pron}} = \frac{e^{-2.1097}}{1 + e^{-2.1097}} = 0.108 \qquad (16) $$
When we check these predicted probabilities of PO realization for non-pronominal and pronominal recipients, we see that they are equal to the proportions seen in the corresponding rows of the cross-tabulation we calculated above: $\frac{629}{629+600} = 0.512$ and $\frac{220}{220+1814} = 0.108$. This is exactly the expected behavior, because (a) we have two parameters in our model, $\alpha$ and $\beta_1$, which is enough to encode an arbitrary predicted mean for each of the cells in our current representation of the dataset; and (b) as we have seen before (Lecture 5, Section 2), the maximum-likelihood estimate for a binomial distribution is the relative-frequency estimate, that is, the observed proportion of successes.
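These hand calculations are easy to reproduce in R from the printed coefficients, using plogis() as the inverse logit; the raw proportions from the contingency table come out the same (a sketch using the rounded coefficient values above):

alpha <- 0.0472; beta.pron <- -2.1569      # coefficients from the fitted model above
plogis(alpha + beta.pron * 0)              # non-pronominal recipient: ~0.512
plogis(alpha + beta.pron * 1)              # pronominal recipient:     ~0.108
629 / (629 + 600)                          # observed proportion, non-pronominal: ~0.512
220 / (220 + 1814)                         # observed proportion, pronominal:     ~0.108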
1.4 Multiple logistic regression
Just as we were able to perform multiple linear regression for a linear model with multiple predictors, we can perform multiple logistic regression. Suppose that we want to take into account pronominality of both recipient and theme. First we conduct a complete cross-tabulation, and use apply() to get proportions of PO realization for each combination of pronominality status:

> tab <- xtabs(~ RealizationOfRecipient + PronomOfRec + PronomOfTheme, dative)
> tab
, , PronomOfTheme = nonpronominal

                      PronomOfRec
RealizationOfRecipient nonpronominal pronominal
                    NP           583       1676
                    PP           512         71

, , PronomOfTheme = pronominal

                      PronomOfRec
RealizationOfRecipient nonpronominal pronominal
                    NP            17        138
                    PP           117        149

> apply(tab, c(2,3), function(x) x[2] / sum(x))
               PronomOfTheme
PronomOfRec     nonpronominal pronominal
  nonpronominal     0.4675799  0.8731343
  pronominal        0.0406411  0.5191638

Pronominality of the theme consistently increases the probability of PO realization; pronominality of the recipient consistently increases the probability of DO realization. We can construct a logit model with independent effects of theme and recipient pronominality as follows:

> dative.lrm <- lrm(response ~ PronomOfRec + PronomOfTheme, dative)
> dative.lrm

Logistic Regression Model

lrm(formula = response ~ PronomOfRec + PronomOfTheme, data = dative)
Frequencies of Responses
   0    1
2414  849

 Obs   Max Deriv  Model L.R.  d.f.  P  C      Dxy    Tau-a  R2     Brier
 3263  1e-12      1122.32     2     0  0.827  0.654  0.252  0.427  0.131
                           Coef     S.E.     Wald Z  P
Intercept                 -0.1644  0.05999   -2.74   0.0061
PronomOfRec=pronominal    -2.8670  0.12278  -23.35   0.0000
PronomOfTheme=pronominal   2.9769  0.15069   19.75   0.0000
And once again, we can calculate the predicted mean success rates for each of the four combinations of predictor variables:

Recipient  Theme     $\eta$     $p$
nonpron    nonpron   -0.1644    0.459
pron       nonpron   -3.0314    0.046
nonpron    pron       2.8125    0.943
pron       pron      -0.0545    0.486

In this case, note that the predicted proportions of success are not the same as the observed proportions in each of the four cells. This is sensible: we cannot fit four arbitrary means with only three parameters. If we added in an interaction term, we would be able to fit four arbitrary means, and the resulting predicted proportions would be the observed proportions for the four different cells.
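The four predicted proportions in this table can be reproduced from the coefficients with plogis(); and, as a sketch of the point about interactions (assuming the same response and dative variables as above), the interaction model can be fitted by replacing + with * in the formula:

alpha <- -0.1644; b.rec <- -2.8670; b.theme <- 2.9769
plogis(alpha)                     # nonpron recipient, nonpron theme: ~0.459
plogis(alpha + b.rec)             # pron recipient,    nonpron theme: ~0.046
plogis(alpha + b.theme)           # nonpron recipient, pron theme:    ~0.943
plogis(alpha + b.rec + b.theme)   # pron recipient,    pron theme:    ~0.486
# lrm(response ~ PronomOfRec * PronomOfTheme, dative)   # adds the interaction term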
1.5 Odds, multiplicative effects, and factor weights
Let us consider the case of a dative construction in which both the recipient and theme are encoded with pronouns. In this situation, both the dummy indicator variables (indicating that the theme and recipient are pronouns) have a value of 1, and thus the linear predictor consists of the sum of three terms. From Equation (10) we can write

$$ \frac{p}{1-p} = e^{\alpha + \beta_1 + \beta_2} \qquad (17) $$
$$ \phantom{\frac{p}{1-p}} = e^{\alpha} e^{\beta_1} e^{\beta_2} \qquad (18) $$
The ratio $\frac{p}{1-p}$ is the odds of success, and in logit models the effect of any predictor variable on the response variable is multiplicative in the odds of success. If a predictor has coefficient $\beta$ in a logit model, then a unit of that predictor has a multiplicative effect of $e^\beta$ on the odds of success. Unlike the raw coefficient $\beta$, the quantity $e^\beta$ is not linearly symmetric: it falls in the range $(0, \infty)$. However, we can also perform the full reverse logit transform of Equation (11), mapping $\eta$ to $\frac{e^\eta}{1+e^\eta}$, which ranges between zero and 1, and is linearly symmetric around 0.5. The use of logistic regression with the reverse logit transform has been used in quantitative sociolinguistics since Cedergren and Sankoff (1974) (see also Sankoff and Labov, 1979), and is still in widespread use in that field. In quantitative sociolinguistics, the use of logistic regression is often called VARBRUL (variable rule) analysis, and the parameter estimates are reported in the reverse logit transform, typically being called factor weights. Tables 1 and 2 show the relationship between the components of the linear predictor, the components of the multiplicative odds, and the resulting predictions for each possible combination of our predictor variables.

Predictor                   Coefficient  Factor weight
Intercept                   -0.16        0.46
PronomOfRec=pronominal      -2.87        0.05
PronomOfTheme=pronominal     2.98        0.95

Table 1: Logistic regression coefficients and corresponding factor weights for each predictor variable in the dative dataset.

Recip.  Theme   Linear Predictor               Multiplicative odds           P(PO)
-pron   -pron   -0.16                          0.8484                        0.46
+pron   -pron   -0.16 - 2.87 = -3.03           0.85 × 0.06 = 0.049           0.046
-pron   +pron   -0.16 + 2.98 = 2.81            0.85 × 19.6 = 16.7            0.94
+pron   +pron   -0.16 - 2.87 + 2.98 = -0.05    0.85 × 0.06 × 19.63 = 0.947   0.49

Table 2: Linear predictor, multiplicative odds, and predicted values for each combination of recipient and theme pronominality in the dative dataset. In each case, the linear predictor is the log of the multiplicative odds.
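Both transformations are one-liners in R: exp() turns coefficients into multiplicative effects on the odds (the middle column of Table 2), and plogis(), the reverse logit, turns them into VARBRUL-style factor weights (a sketch using the coefficients fitted above):

coefs <- c(Intercept = -0.1644, RecPron = -2.8670, ThemePron = 2.9769)
exp(coefs)       # multiplicative odds:  ~0.85, ~0.057, ~19.6
plogis(coefs)    # factor weights (reverse logit): ~0.46, ~0.05, ~0.95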
We'll close our introduction to logistic regression with discussion of confidence intervals and model comparison.
2.1 Confidence intervals
When there are a relatively large number of observations in comparison with the number of parameters estimated, the standardized deviation of the MLE $\hat{\beta}$ for a logit model parameter $\beta$ is approximately normally distributed:

$$ \frac{\hat{\beta} - \beta}{\mathrm{StdErr}(\hat{\beta})} \overset{\text{approx}}{\sim} N(0, 1) \qquad (19) $$
This is called the Wald statistic.³ Note the close similarity with the $t$ statistic that we were able to use for classic linear regression in Lecture 11 (remember that once the $t$ distribution has a fair number of degrees of freedom, it looks a great deal like a standard normal distribution). If we look again at the output of the logit model we fitted in the previous section, we see the standard error, which allows us to construct confidence intervals on our model parameters.

                           Coef     S.E.     Wald Z  P
Intercept                 -0.1644  0.05999   -2.74   0.0061
PronomOfRec=pronominal    -2.8670  0.12278  -23.35   0.0000
PronomOfTheme=pronominal   2.9769  0.15069   19.75   0.0000
The Wald statistic can also be used for a frequentist test on the null hypothesis that an individual model parameter is 0. This is the source of the p-values given for the model parameters above.
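For instance, an approximate 95% confidence interval and a two-sided Wald p-value for the recipient-pronominality coefficient can be computed directly from the printed estimate and standard error (a sketch using the values from the table above):

est <- -2.8670; se <- 0.12278
est + c(-1, 1) * qnorm(0.975) * se    # approximate 95% confidence interval
z <- est / se                         # the Wald Z statistic
2 * pnorm(-abs(z))                    # two-sided p-value for the null hypothesis beta = 0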
2.2 Model comparison
Just as in the analysis of variance, we are often interested in conducting tests of the hypothesis that introducing several model parameters simultaneously leads to a better overall model. In this case, we cannot simply use a single Wald statistic for hypothesis testing. Instead, the most common approach is to use the likelihood-ratio test. A generalized linear model assigns a likelihood to its data as follows:

$$ \mathrm{Lik}(x; \hat{\theta}) = \prod_i P(x_i \mid \hat{\theta}) \qquad (20) $$

³It is also sometimes called the Wald Z statistic, because of the convention that standard normal variables are often denoted with a Z, and the Wald statistic is distributed approximately as a standard normal.
Now suppose that we have two classes of models, $M_0$ and $M_1$, and $M_0$ is nested inside of $M_1$ (that is, the class $M_0$ is a special case of the class $M_1$). It turns out that if the data are generated from $M_0$ (that is, if $M_0$ is the correct model), the ratio of the data likelihoods at the ML estimates for $M_0$ and $M_1$ is well-behaved. In particular, twice the log of the likelihood ratio is distributed as a $\chi^2$ random variable with degrees of freedom equal to the difference $k$ in the number of free parameters in the two models. This quantity is sometimes called the deviance:

$$ 2 \log \frac{\mathrm{Lik}_{M_1}(x)}{\mathrm{Lik}_{M_0}(x)} = 2\left[\log \mathrm{Lik}_{M_1}(x) - \log \mathrm{Lik}_{M_0}(x)\right] \sim \chi^2_k \qquad (21) $$
As an example of using the likelihood-ratio test, we will hypothesize a model in which pronominality of theme and recipient both still have additive effects, but these effects may vary depending on the modality (spoken versus written) of the dataset. We fit this model and our modality-independent model using glm(), and use anova() to calculate the likelihood ratio:
> m.0 <- glm(response ~ PronomOfRec + PronomOfTheme, dative, family="binomial")
> m.A <- glm(response ~ PronomOfRec*Modality + PronomOfTheme*Modality, dative, family="binomial")
> anova(m.0, m.A)
Analysis of Deviance Table

Model 1: response ~ PronomOfRec + PronomOfTheme
Model 2: response ~ PronomOfRec * Modality + PronomOfTheme * Modality
  Resid. Df Resid. Dev Df Deviance
1      3260    2618.74
2      3257    2609.67  3     9.07

We can look up the p-value of this deviance result in the $\chi^2_3$ distribution:

> 1 - pchisq(9.07, 3)
[1] 0.02837453

Thus there is some evidence that we should reject a model that doesn't include modality-specific effects of recipient and theme pronominality.
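The same likelihood-ratio statistic can also be computed directly from the two fitted models with logLik(), following Equation (21); this is a sketch assuming m.0 and m.A have been fitted as above:

lr <- 2 * (logLik(m.A) - logLik(m.0))                    # twice the log likelihood ratio
df <- attr(logLik(m.A), "df") - attr(logLik(m.0), "df")  # difference in free parameters
1 - pchisq(as.numeric(lr), df)                           # same p-value as above
# anova(m.0, m.A, test = "Chisq")                        # or have anova() report it directly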
Further reading
There are many places to go for reading more about generalized linear models and logistic regression in particular. The classic comprehensive reference on generalized linear models is McCullagh and Nelder (1989). For GLMs on categorical data, Agresti (2002) and the more introductory Agresti (2007) are highly recommended. For more information specific to the use of GLMs and logistic regression in R, Venables and Ripley (2002, Chapter 7), Harrell (2001, Chapters 10–12), and Maindonald and Braun (2007, Section 8.2) are all good places to look.
References
Agresti, A. (2002). Categorical Data Analysis. Wiley.

Agresti, A. (2007). An Introduction to Categorical Data Analysis. Wiley, second edition.

Cedergren, H. J. and Sankoff, D. (1974). Variable rules: Performance as a statistical reflection of competence. Language, 50(2):333–355.

Harrell, Jr., F. E. (2001). Regression Modeling Strategies. Springer.

Maindonald, J. and Braun, J. (2007). Data Analysis and Graphics Using R. Cambridge, second edition.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall, second edition.

Sankoff, D. and Labov, W. (1979). On the uses of variable rules. Language in Society, 8:189–222.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, fourth edition.