Maximum Likelihood Estimation
Linear estimators, such as the least squares and instrumental variables estimators, are only one weapon in the econometrician's arsenal. When the relationships of interest to us are not linear in their parameters, attractive linear estimators are difficult, or even impossible, to come by. This chapter expands our horizons by adding a nonlinear estimation strategy to the linear estimation strategy of earlier chapters. The chapter offers maximum likelihood estimation as a strategy for obtaining asymptotically efficient estimators and concludes by examining hypothesis testing from a large-sample perspective.

There are two pitfalls to avoid as we read this chapter. The first is to dismiss the new method because it sometimes merely reproduces familiar results. The chapter shows that if the Gauss–Markov Assumptions hold and the disturbances are normally distributed, then ordinary least squares (OLS) is also the maximum likelihood estimator. If the new method just returns us to the estimators we've already settled upon, we might think, why bother with it? However, in many settings more complex than the data-generating processes (DGPs) we have analyzed thus far, best linear unbiased estimators (BLUE) don't exist, and econometricians need alternative estimation tools. The second pitfall is to think that maximum likelihood estimators will always have the fine small-sample properties that OLS has under the Gauss–Markov Assumptions. Although maximum likelihood estimators are valuable tools when our data-generating processes grow more complex and finite-sample tools become unmanageable, the value of maximum likelihood is sometimes limited to large samples, because the small-sample properties of maximum likelihood estimators are sometimes quite unattractive.
21.1 HOW DO WE CREATE AN ESTIMATOR?
… maximum likelihood estimates. To determine the maximum likelihood estimator, we also need to assume the specific statistical distribution of the disturbances. The next example adds the assumption of normally distributed disturbances to our usual Gauss–Markov Assumptions. Thus the DGP we study is

$$Y_i = \beta X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad \mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0 \ \ (i \neq j),$$

with the $X$'s fixed across samples.
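To make this DGP concrete, here is a minimal simulation sketch in Python. Nothing in it comes from the text itself: the σ value, the fixed X's, and the true slope (set to 12 to echo the example discussed below) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_true = 12.0                      # illustrative true slope (the text's example uses beta = 12)
sigma = 5.0                           # illustrative standard deviation of the disturbances
X = np.array([1.0, 2.0, 3.0])         # X's fixed across samples (three observations)

def draw_sample():
    """Draw one sample from the DGP: Y_i = beta*X_i + eps_i, with eps_i ~ N(0, sigma^2)."""
    eps = rng.normal(loc=0.0, scale=sigma, size=X.shape)
    return beta_true * X + eps

Y = draw_sample()
print("one sample of Y:", Y)
```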
The likelihood of the three observations in a sample is the product of their individual densities,

$$f_1(Y_1)\, f_2(Y_2)\, f_3(Y_3).$$
Because we assume in this example that the observations are normally distributed with mean $\beta X_i$ and variance $\sigma^2$, each density is

$$f_i(Y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(Y_i - \beta X_i)^2/2\sigma^2},$$

so that the likelihood of the sample is the product of these three normal densities.
With an explicit expression for the likelihood of the sample data, expressed in terms of the observed $X$'s and the parameters of the DGP, we can ask what parameter values would make a given sample least surprising. Mathematically, the problem becomes to choose the estimate of $\beta$ that maximizes a particular mathematical function, as we shall now see.
To find the maximum likelihood estimate, we treat the likelihood of the sample as a function of the unknown slope,

$$g(\beta; Y_1, Y_2, Y_3) = \frac{e^{-(Y_1 - \beta X_1)^2/2\sigma^2}\; e^{-(Y_2 - \beta X_2)^2/2\sigma^2}\; e^{-(Y_3 - \beta X_3)^2/2\sigma^2}}{\left(\sqrt{2\pi\sigma^2}\right)^3},$$
and maximize $g(\beta; Y_1, Y_2, Y_3)$ with respect to $\beta$. We compute the value of $\beta$ that makes the particular sample in hand least surprising and call that value the maximum likelihood estimate. As a practical matter, econometricians usually maximize not $g(\beta; Y_1, Y_2, Y_3)$, but the natural logarithm of $g(\beta; Y_1, Y_2, Y_3)$. Working with the log of the likelihood function, $\ln[g(\beta; Y_1, Y_2, Y_3)]$, which we call $L(\beta)$, doesn't alter the solution, but it does simplify the calculus because it enables us to take the derivative of a sum rather than the derivative of a product when solving for the maximum.

Figure 21.1, Panel A, pictures the log of the likelihood function for a particular sample of data. On the horizontal axis are the values of possible guesses, $\tilde{\beta}$. On the vertical axis is the log of the likelihood of observing the sample in hand if the guess we make is the true parameter, $z = L(\tilde{\beta})$. The particular sample in question is most likely to arise when $\beta = \beta^{mle}$; $\beta^{mle}$ is the maximum likelihood estimate of $\beta$ given this sample. In Panel A, $\beta^{mle} = 15.7$; the true value of $\beta$ in this DGP is 12. Values of $\tilde{\beta}$ close to $\beta^{mle}$ are almost as likely to give rise to this particular sample as $\beta^{mle}$ is, but values further from $\beta^{mle}$ are increasingly less likely to give rise to this sample of data because the likelihood function declines as we move away from $\beta^{mle}$. Panel B of Figure 21.1 shows the log of the likelihood function for the same sample as in Panel A, superimposed over the log of the likelihood function for a different sample from the same DGP. The maximum likelihood estimate given the second sample is 12.1, which maximizes the likelihood function for that second observed sample.

Next, we use calculus to obtain the maximum likelihood estimator for $\beta$. We maximize
$$L(\beta) = \ln[g(\beta; Y_1, Y_2, Y_3)] = -\sum_{i=1}^{3} \frac{(Y_i - \beta X_i)^2}{2\sigma^2} - \ln\left[\left(\sqrt{2\pi\sigma^2}\right)^3\right] \qquad (21.1)$$
Figure 21.1 The Log of the Likelihood Function for Particular Samples. [Panel A plots $z = L(\tilde{\beta})$ against the guesses $\tilde{\beta}$ for one sample; Panel B superimposes the log-likelihood functions for two samples from the same DGP.]
with respect to $\beta$. Notice that any logarithmic transformation would have turned the products into sums, but it is the natural logarithm that made the $e$ terms disappear, because $\ln(e) = 1$. To obtain the maximum likelihood estimator for $\beta$, $\beta^{mle}$, we take the derivative of $L$ with respect to $\beta$, $dL/d\beta$, and set it to zero:

$$\frac{dL}{d\beta} = \sum_{i=1}^{3} \frac{-2X_i(Y_i - \beta^{mle} X_i)}{2\sigma^2} = 0,$$
so
$$\sum_{i=1}^{3} \frac{-X_i(Y_i - \beta^{mle} X_i)}{\sigma^2} = 0,$$
thus
$$\sum_{i=1}^{3} \left(X_i Y_i - \beta^{mle} X_i^2\right) = 0,$$
which yields

$$\sum_{i=1}^{3} X_i Y_i - \beta^{mle} \sum_{i=1}^{3} X_i^2 = 0,$$

so that

$$\beta^{mle} = \frac{\sum_{i=1}^{3} X_i Y_i}{\sum_{i=1}^{3} X_i^2}.$$

Voilà! We discover that $\beta_{g4}$ ($= \beta^{mle}$) is not only the BLUE estimator for $\beta$; it is also the maximum likelihood estimator if the disturbances are normally distributed! Conveniently, the maximum likelihood estimate of $\beta$ does not depend on the variance of the disturbances, $\sigma^2$, even though the likelihood function does depend on $\sigma^2$. We are often, but not always, so fortunate. Table 21.1 shows the regression output from Table 4.3. Notice that this table reports the log of the likelihood function for the regression in a row titled "Log likelihood." The reported regression included a constant term.

Table 21.1 The Cobb–Douglas Production Function
Dependent Variable: LOGOUTCP
Method: Least Squares
Date: 06/18/02  Time: 22:04
Sample: 1899–1922
Included observations: 24
Variables: C, LOGLABCP
Reported statistics: R-squared, Adjusted R-squared, S.E. of regression, Sum squared resid, Log likelihood, Durbin–Watson stat

Normally distributed disturbances are not the only ones for which $\beta_{g4}$ is the maximum likelihood estimator of the slope of a line through the origin. It is also the maximum likelihood estimator for a broad statistical family, the exponential distribution.
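The following sketch, with purely illustrative X, Y, and σ values, mirrors this derivation numerically: it evaluates the log-likelihood of Equation 21.1 over a grid of candidate slopes and confirms that the grid's maximizer matches the closed-form estimator $\sum X_i Y_i / \sum X_i^2$ (that is, $\beta_{g4}$).

```python
import numpy as np

def log_likelihood(beta, X, Y, sigma):
    """Equation 21.1: -sum (Y_i - beta X_i)^2 / (2 sigma^2) minus the constant term."""
    n = len(Y)
    return (-np.sum((Y - beta * X) ** 2) / (2 * sigma ** 2)
            - 0.5 * n * np.log(2 * np.pi * sigma ** 2))

# Illustrative data; in practice X, Y, and sigma come from the DGP sketched earlier.
X = np.array([1.0, 2.0, 3.0])
Y = np.array([10.5, 26.1, 33.9])
sigma = 5.0

# "Which guess makes this sample least surprising?"  Search a fine grid of candidate slopes.
grid = np.linspace(0.0, 30.0, 30001)
ll = np.array([log_likelihood(b, X, Y, sigma) for b in grid])
beta_grid_mle = grid[np.argmax(ll)]

# Closed-form maximum likelihood (and BLUE) estimator: sum(X*Y) / sum(X^2).
beta_closed_form = np.sum(X * Y) / np.sum(X ** 2)

print(beta_grid_mle, beta_closed_form)   # the two agree up to the grid's resolution
```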
Figure 21.2 The Log of the Likelihood Function for One Sample from Two DGPs
Because maximum likelihood estimation generally yields consistent and asymptotically efficient estimators, $\beta_{g4}$ is asymptotically efficient among all estimators if the disturbances in the DGP follow an exponential distribution. There are, however, distributions of the disturbances for which $\beta_{g4}$ is not the maximum likelihood estimator of $\beta$.¹

Note carefully the precise meaning of maximum likelihood. $\beta^{mle}$ is not necessarily the most likely value of $\beta$. Rather, $\beta^{mle}$ is the value of $\beta$ for which the sample we obtained is least surprising; that is, for any other value of $\beta$, this particular sample would be less likely to arise than it is if $\beta = \beta^{mle}$.
Figure 21.3 The Log of the Likelihood Function for a DGP with Both an Intercept and a Slope. [Panels A through D plot $z = L(\tilde{\beta}_0, \tilde{\beta}_1)$ against the guesses $\tilde{\beta}_0$ and $\tilde{\beta}_1$, with the maximum likelihood estimates $\beta_0^{mle}$ and $\beta_1^{mle}$ marked on each surface.]
The likelihood function in Panel A, which falls off rapidly in all directions, indicates that both $\beta_0$ and $\beta_1$ are estimated relatively precisely by maximum likelihood in this instance; we cannot change either guess much from the maximum likelihood values without making the observed data much more surprising. The maximum likelihood estimates from this one sample are, in fact, very close to 10 and 5. The much flatter likelihood function in Panel B (note the ranges of $\tilde{\beta}_0$ and $\tilde{\beta}_1$ in this panel compared to those in Panel A) indicates that both $\beta_0$ and $\beta_1$ are relatively imprecisely estimated by maximum likelihood in this instance; many values other than the maximum likelihood estimates would make the observed data only a little more surprising. The maximum likelihood estimates from this one sample are, in fact, 21.4 and 1.1. The more tunnel-like surface in Panel C indicates that $\beta_0$ is relatively precisely estimated, and $\beta_1$ is relatively imprecisely estimated, by maximum likelihood. Changing the guess of $\beta_0$ would make the data much more surprising, but changing the guess of $\beta_1$ would not.
The shape of a likelihood function also tells us about the covariance between maximum likelihood estimators. If we were to cut a slice of the surface in Panel A of Figure 21.3 along the axis at a particular value of $\tilde{\beta}_1$, the maximum of the likelihood with respect to $\tilde{\beta}_0$, given $\tilde{\beta}_1$, would be just about the same, no matter where along the $\tilde{\beta}_1$ axis the slice were cut. This implies that the maximum likelihood estimators in Panel A are almost completely uncorrelated. In contrast, in Panel D, slicing the surface at a different value of $\tilde{\beta}_1$ would yield a different $\tilde{\beta}_0$ that maximized the likelihood, given $\tilde{\beta}_1$. This implies that in Panel D the maximum likelihood estimators of the slope and intercept are quite correlated.
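That correlation pattern can be computed directly for the normal linear model, where the maximum likelihood estimators of the intercept and slope have covariance matrix $\sigma^2 (X'X)^{-1}$; this is a standard result rather than one derived in the text. The sketch below, with illustrative X values, shows the estimators are nearly uncorrelated when the X's are centered around zero (as in Panel A) and strongly correlated when the X's all lie far from zero (as in Panel D).

```python
import numpy as np

def mle_corr(x, sigma=1.0):
    """Correlation between intercept and slope MLEs implied by Var = sigma^2 (X'X)^{-1}."""
    X = np.column_stack([np.ones_like(x), x])    # design matrix: intercept column and slope column
    V = sigma ** 2 * np.linalg.inv(X.T @ X)      # covariance matrix of (b0_mle, b1_mle)
    return V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])

x_centered = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # X's centered at zero
x_shifted = x_centered + 10.0                         # same spread, but far from zero

print(mle_corr(x_centered))   # approximately 0: estimators nearly uncorrelated
print(mle_corr(x_shifted))    # close to -1: estimators strongly (negatively) correlated
```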
Estimating σ²
We can use the maximum likelihood approach to estimate $\sigma^2$ just as we use it to estimate $\beta$. To estimate $\sigma^2$, we would look at $f(Y_1, Y_2, Y_3)$ as a function of $\sigma^2$ and would maximize that function with respect to $\sigma^2$. The maximum likelihood estimator for $\sigma^2$ is the variance of the residuals, $\frac{1}{n}\sum e_i^2$. We learned in Chapter 5 that this is a biased estimator of the variance of the disturbances. Maximum likelihood estimators are frequently biased estimators, even though they are generally consistent estimators.
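A small simulation sketch of that bias follows; the sample size, true slope, σ, and X values are illustrative assumptions. Across many samples from the line-through-the-origin DGP, the average of the maximum likelihood estimate $(1/n)\sum e_i^2$ falls short of the true $\sigma^2$, while dividing by n - 1 (one estimated parameter) removes the bias.

```python
import numpy as np

rng = np.random.default_rng(0)

beta, sigma, n, reps = 12.0, 5.0, 10, 20000    # illustrative values
X = np.linspace(1.0, 10.0, n)                  # X's fixed across samples

mle_estimates, unbiased_estimates = [], []
for _ in range(reps):
    Y = beta * X + rng.normal(0.0, sigma, n)
    b = np.sum(X * Y) / np.sum(X ** 2)              # MLE of the slope (line through the origin)
    e = Y - b * X                                    # residuals
    mle_estimates.append(np.sum(e ** 2) / n)         # MLE of sigma^2: divides by n
    unbiased_estimates.append(np.sum(e ** 2) / (n - 1))   # divides by n - 1 (one estimated parameter)

print("true sigma^2:", sigma ** 2)
print("average MLE estimate:", np.mean(mle_estimates))        # systematically below sigma^2
print("average unbiased estimate:", np.mean(unbiased_estimates))
```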
We saw in Chapters 10 and 11 that robust standard error estimators are available for such cases. Similar care needs to be taken when conducting quasi-maximum likelihood estimation in more general settings.
21.2 HOW DO WE TEST HYPOTHESES?
Figure 21.4 The Log of the Likelihood for One Specific Sample: Testing $H_0$: $\beta = \beta_0$. [The vertical axis measures $z$, with $L_{max}$ and $L_0$ marked; the horizontal axis shows the guesses, with $\beta_0$ and $\beta^{mle}$ marked.]
there might be a large friend's sweater on the bed, making the sweater a weak, and therefore unappealing, test. In econometrics, we often choose among alternative statistics to test a single hypothesis. Which specific statistic provides the best test varies with circumstances. Sample size, the alternative hypotheses at hand, ease of computation, and the sensitivity of a statistic's distribution to violations of the assumptions of our DGP all play a role in determining the best test statistic in a given circumstance.
The likelihood ratio test statistic is

$$LR = 2\left[L(\beta^{mle}) - L(\beta_0)\right] = 2(L_{max} - L_0),$$

where $L(\beta)$ is the log-likelihood function defined in Equation 21.1. (Recall that the log of a ratio is a difference.) In large samples, LR has approximately the chi-square distribution with as many degrees of freedom as there are constraints ($r$ is the number of constraints; it equals one in our example). We obtain the critical value for any significance level from a chi-square table. When LR exceeds the critical value, we reject the null hypothesis at the selected level of significance. In Figure 21.4, if the difference between $L_{max}$ and $L_0$ is large enough, we reject the null hypothesis that $\beta = \beta_0$.

When might the likelihood ratio test be superior to the t-test? Not for a small sample from a DGP with normally distributed disturbances, because in that case the t-distribution holds exactly, whereas the likelihood ratio statistic may not yet have converged to its asymptotic chi-square distribution. In practice, econometricians almost always rely on the t-test to test hypotheses about a single slope coefficient. But in small samples from a DGP with a known, sharply non-normal distribution of the disturbances, the likelihood ratio test may prove better than the t-test. In large samples, the two tests tend toward yielding the same answer. Likelihood ratio tests are used extensively in studying nonlinear models and nonlinear constraints, cases in which maximum likelihood estimation is common.
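Here is a sketch of the likelihood ratio computation for a linear regression with normally distributed disturbances. The data are fake placeholders, and the concentrated Gaussian log-likelihood formula used is the standard one for OLS; this illustrates the mechanics rather than reproducing the chapter's interest rate example.

```python
import numpy as np
from scipy.stats import chi2

def ols_loglike(y, X):
    """Concentrated Gaussian log-likelihood of an OLS fit (the quantity typically reported as 'Log likelihood')."""
    n = len(y)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta_hat) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)

# Placeholder data: the unconstrained model includes x1 and x2; H0 excludes x2 (r = 1 constraint).
rng = np.random.default_rng(0)
n = 19
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)          # in this fake data x2 truly has a zero coefficient
X_u = np.column_stack([np.ones(n), x1, x2])       # unconstrained model
X_c = np.column_stack([np.ones(n), x1])           # constrained model under H0

LR = 2 * (ols_loglike(y, X_u) - ols_loglike(y, X_c))   # LR = 2(Lmax - L0)
r = 1
print("LR statistic:", LR)
print("5% chi-square critical value:", chi2.ppf(0.95, df=r))
print("p-value:", chi2.sf(LR, df=r))
```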
Table 21.2 An Unconstrained Model of Long-Term Interest Rates
Dependent Variable: Long Term Interest Rate
Method: Least Squares
Date: 08/02/03  Time: 21:20
Sample: 1982–2000
Included observations: 19
Variables: C, Real Short Rate, Expected Inflation, Change in Income, US Deficit
Reported statistics: R-squared, Adjusted R-squared, S.E. of regression, Sum squared resid, Log likelihood, Durbin–Watson stat
The 5% critical value of the chi-square distribution with one degree of freedom is 3.84, so we fail to reject the null hypothesis. The t-distribution applied to the t-statistic for $\beta_1$ in Table 21.2 leads to the same conclusion, though it need not do so. For example, in the special case of the Gauss–Markov Assumptions and normally distributed disturbances, the likelihood ratio statistic equals the square of the t-statistic. Because the disturbances are normally distributed in such cases, the t-distribution applies to the t-statistic exactly for any sample size. In small samples, the likelihood ratio statistic does not follow its asymptotic distribution, just as the t-statistic does not follow its asymptotic distribution (which is the normal distribution) in small samples. When the disturbances are normally distributed and the sample size is small, the t-test, using the exact t-distribution, is better than the likelihood ratio test.
Wald Tests
Notice that the t-test and the likelihood ratio test call for quite different computations. We can compute the t-test knowing only the results of a single regression: the unconstrained regression. That regression suffices to form the t-statistic, $(\beta_{g4} - \beta_0)/s_{\beta_{g4}}$.
Table 21.3 A Constrained Model of Long-Term Interest Rates
Dependent Variable: Long Term Interest Rate
Method: Least Squares
Date: 08/02/03  Time: 21:22
Sample: 1982–2000
Included observations: 19
Variables: C, Real Short Rate, Change in Income, US Deficit
Reported statistics: R-squared, Adjusted R-squared, S.E. of regression, Sum squared resid, Log likelihood, Durbin–Watson stat
Expected Inflation
The likelihood ratio test, in contrast, requires that we compute the log of the likelihood for both the unconstrained model ($L_{max}$) and the constrained model ($L_0$). In general, likelihood ratio tests require computing both the unconstrained and the constrained model, whereas tests based on how the hypothesized values of the parameters differ from the unconstrained estimates require computing only the unconstrained model. We call test statistics based on how much the hypothesized parameter values differ from the unconstrained estimates Wald tests. If computing the constrained model is particularly difficult (as sometimes happens when the constraints are highly nonlinear or when the alternative model is only broadly defined), Wald tests may be preferable to likelihood ratio tests because Wald tests sidestep computing the constrained model.

Figure 21.5 depicts Wald and maximum likelihood tests in a single picture. Our possible guesses, $\tilde{\beta}$, are on the horizontal axis. On the vertical axis, we measure two things: the log of the likelihood for this sample were the true parameter value equal to $\tilde{\beta}$, and the value of the Wald statistic for a particular guess, $c(\tilde{\beta})$. In our example, the Wald statistic is the square of the t-statistic, so the function $c(\tilde{\beta})$ shown in the figure is parabolic.
[Figure 21.5: the log of the likelihood and the Wald function $c(\tilde{\beta})$ for one sample, with $L_{max}$, $L_0$, $c(\beta^{mle})$, $\beta_0$, and $\beta^{mle}$ marked.]
As noted earlier, the likelihood ratio test looks at the difference between $L_{max}$ and $L_0$. The Wald test looks at the difference between the Wald function evaluated at $\beta^{mle}$, $c(\beta^{mle})$, and 0. The comparison is to 0 because, if the estimated value exactly equals the hypothesized value $\beta_0$, the t-statistic, and therefore the Wald function, will be zero.
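For a single hypothesis $H_0$: $\beta = \beta_0$, the Wald statistic $c(\beta^{mle})$ is simply the squared t-statistic, compared with a chi-square critical value with one degree of freedom. A minimal sketch, with placeholder values for the estimate and its standard error:

```python
from scipy.stats import chi2

beta_mle = 0.85    # placeholder: unconstrained estimate of the coefficient
se_beta = 0.40     # placeholder: its estimated standard error
beta_0 = 1.0       # hypothesized value under H0

t_stat = (beta_mle - beta_0) / se_beta
wald = t_stat ** 2            # c(beta_mle): zero only if the estimate equals beta_0 exactly

print("Wald statistic:", wald)
print("5% chi-square(1) critical value:", chi2.ppf(0.95, df=1))   # 3.84
print("p-value:", chi2.sf(wald, df=1))
```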
The F-statistic for testing $r$ linear constraints is

$$F = \frac{(SSR_{\text{constrained}} - SSR_{\text{unconstrained}})/r}{SSR_{\text{unconstrained}}/(n - k - 1)}.$$
Like the t-statistic, the F-statistic follows its F-distribution [with $r$ and $(n - k - 1)$ degrees of freedom] only if the underlying disturbances are normally distributed and only if the constraints among the parameters are linear in form.
However, in large samples, $rF$ is asymptotically distributed chi-square (with $r$ degrees of freedom) under the Gauss–Markov Assumptions, even if the underlying disturbances are not normally distributed and even if the constraints are nonlinear.

The Wald statistic for testing several linear hypotheses about the slope and intercept of a line is a weighted sum of the squares and cross products of the deviations of the unconstrained estimates from their hypothesized values. For example, when testing the equality of coefficients between two regression regimes, A and B, the Wald statistic, $W$, is a weighted sum of $(\beta_0^{mleA} - \beta_0^{mleB})^2$, $(\beta_1^{mleA} - \beta_1^{mleB})^2$, and $(\beta_0^{mleA} - \beta_0^{mleB})(\beta_1^{mleA} - \beta_1^{mleB})$, with the weights depending on the variances and covariances of the estimators. The Wald statistic is asymptotically distributed chi-square with $r$ degrees of freedom, where $r$ is the number of constraints (in this example, $r = 2$).

Which statistic should we use to test the hypothesis that a regression model's coefficients are the same in two time periods: the F-statistic (or its chi-square cousin, $rF$), or the Wald statistic?² The F-statistic is not computationally difficult to obtain, despite needing both the constrained and unconstrained sums of squared residuals, so there is little computational reason for preferring the Wald statistic to the F-statistic. And there may be sound reason for preferring the F-statistic over the Wald statistic. In small samples, the Wald statistic will not follow its asymptotic distribution, whereas the F-statistic will exactly follow the F-distribution if the underlying disturbances are normally distributed. Thus, the F-statistic frequently performs better than the Wald statistic in small samples.

Sometimes, though, the Wald statistic is indeed preferable to the F-statistic. For example, when testing the hypothesis that regression coefficients are the same in two time periods, the distributions of the F-statistic and its chi-square cousin depend on the Gauss–Markov Assumption of homoskedastic disturbances between the two regimes ($\sigma_A^2 = \sigma_B^2$). The Wald statistic's asymptotic distribution does not depend on such homoskedasticity. Thus, when we have substantial doubts about the homoskedasticity of the disturbances across regimes, the Wald statistic is preferable to the F-statistic.

This choice between the Wald statistic and the F-statistic when testing for changes in regime captures the elements econometricians generally face when choosing among competing tests. Which tests are easiest to perform? Which perform best in finite samples? Which are most robust to any likely violations of our maintained hypotheses?
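A sketch of the two-regime Wald statistic just described, assuming we already have each regime's estimates of the intercept and slope and their covariance matrices (the numbers below are placeholders). Because each regime's covariance matrix is estimated separately, the statistic does not lean on equal disturbance variances across the regimes.

```python
import numpy as np
from scipy.stats import chi2

# Placeholder estimates of (intercept, slope) in regimes A and B, and their covariance matrices.
b_A = np.array([10.2, 4.8])
b_B = np.array([11.5, 4.1])
V_A = np.array([[0.50, 0.02],
                [0.02, 0.10]])
V_B = np.array([[0.60, 0.03],
                [0.03, 0.12]])

d = b_A - b_B                                # deviations of the estimates from equality
W = d @ np.linalg.inv(V_A + V_B) @ d         # weighted sum of squares and cross products
r = 2                                        # two constraints: equal intercepts and equal slopes

print("Wald statistic:", W)
print("5% chi-square critical value, r = 2:", chi2.ppf(0.95, df=r))
print("p-value:", chi2.sf(W, df=r))
```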
[Figure: the slope of the log-likelihood function and the Wald function $c(\tilde{\beta})$, with $\beta^{mle}$ marked.]
maximizes the likelihood function. At the maximum, calculus tells us, the slope of the likelihood function is zero: the log of the likelihood function is flat at its peak. One measure of being far from the unconstrained maximum likelihood estimate, then, is the slope of the likelihood function when we impose the null hypothesis. If there are multiple constraints, there is a slope of the likelihood function with respect to each constraint. Lagrange multiplier tests combine the slopes of the likelihood function with respect to each constraint in an appropriate fashion to obtain a test statistic that is asymptotically distributed chi-square with $r$ degrees of freedom, where $r$ is the number of constraints.³

In Chapter 10 we encountered the White test for heteroskedasticity. The White test is a Lagrange multiplier test. Because a Lagrange multiplier test requires that we estimate only the constrained (homoskedastic) model, the White test spares us specifying the actual form of heteroskedasticity that might plague us.
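Here is a sketch of that Lagrange multiplier logic in the White test's form; the data are fake placeholders, and this follows the textbook recipe (an auxiliary regression of the squared residuals on the regressors, their squares, and their cross product, then comparing n·R² with a chi-square critical value) rather than any particular software package's implementation.

```python
import numpy as np
from scipy.stats import chi2

def r_squared(y, X):
    """R-squared of an OLS regression of y on X (X should include a constant column)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Placeholder data with two regressors and heteroskedastic disturbances.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n) * (1 + 0.5 * np.abs(x1))

# Step 1: estimate the constrained (homoskedastic) model and keep its squared residuals.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e2 = (y - X @ beta_hat) ** 2

# Step 2: auxiliary regression of the squared residuals on regressors, squares, and cross product.
Z = np.column_stack([np.ones(n), x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
lm_stat = n * r_squared(e2, Z)
df = Z.shape[1] - 1               # number of auxiliary regressors, excluding the constant

print("White LM statistic (n * R^2):", lm_stat)
print("5% critical value:", chi2.ppf(0.95, df=df))
print("p-value:", chi2.sf(lm_stat, df=df))
```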
tious than the translog production function, with its single output and several inputs. The authors use the quadratic-in-logs form to describe the production possibilities frontier of the United States, with two outputs, investment goods and consumption goods, and two inputs, capital and labor. They have four ambitious objectives that we review here. First, they test the neoclassical theory of production, which claims that firms maximize profits. If markets are competitive, the neoclassical theory says that the ratios of goods' prices equal the ratios of those goods' marginal rates of transformation. Second, they test the hypothesis that technological change can be measured by a single variable, an index called total factor productivity, that appears like another input in the translog expression. Third, they test whether interaction terms like $\ln(L_i)\ln(M_i)$ belong in the model. Finally, they test the Cobb–Douglas and CES specifications of the production frontier.

Christensen, Jorgenson, and Lau test all these hypotheses by examining the coefficients of the translog production function and two other equations derived from the translog specification and economic theory. The first additional equation expresses the value of investment goods production relative to the value of capital inputs as a linear function of the logs of the two inputs, the two outputs, and the technology index. The second additional equation expresses the value of labor inputs relative to the value of capital inputs, also as a linear function of the logs of the two inputs, the two outputs, and the technology index. The data are annual observations on the U.S. economy from 1929 to 1969. Christensen, Jorgenson, and Lau demonstrate that if the production frontier has the translog shape, then each of the hypotheses they offer implies a particular relationship among the coefficients of the two relative value regressions and the translog production frontier. Thus, the authors arrive at a complex variant
on the hypothesis tests we have already devised. They estimate not one equation, but three, and their constraints on the coefficients constrain not just the coefficients within one equation, but also the coefficients across the three equations. Christensen, Jorgenson, and Lau estimate their three equations by maximum likelihood, and their various test statistics are likelihood ratio tests that are asymptotically distributed chi-square with as many degrees of freedom as there are constraints.

What do the authors conclude? They fail to reject the neoclassical theory of production and the appropriateness of total factor productivity as a one-dimensional measure of technology. But they do reject the exclusion of interaction terms like $\ln(L_i)\ln(M_i)$, and consequently both the Cobb–Douglas and CES forms for the U.S. production frontier. They find the technological patterns of substitution between capital and labor are more complex than allowed for by either the Cobb–Douglas or the CES production function.
Final Notes
The translog production function has been used in hundreds of papers since it was introduced by Christensen, Jorgenson, and Lau. Maximum likelihood estimation offers one attractive strategy for taming the complex of equations and hypotheses that the translog presents. To obtain their maximum likelihood estimates, Christensen, Jorgenson, and Lau did not directly maximize the likelihood function. Instead, they followed an iterative computational algorithm that did not refer to the likelihood function, but that converged to the same solution as if they had maximized the likelihood directly. Contemporary econometric software packages, many of which were developed during the 1970s, usually avoid such indirect routes to maximum likelihood estimates.
Advances in computer technology made the computations undertaken by Christensen, Jorgenson, and Lau possible. Since their work in 1973, even greater strides have been made in computing technology and in the writing of econometric software programs. Students of econometrics today have access to computers far more powerful and to software far more friendly than were available to even the most sophisticated econometricians in 1973. These advances allow us to rely on more complex
estimation methods, to conduct more complex tests, and to use much larger data sets than we could in the past. Computer scientists, econometric theorists, and data gatherers in both the public and the private sectors have combined to give economists empirical tools that were undreamed of a short while ago.
When several different tests are available for a single hypothesis, econometricians balance computational ease, the burden of required assumptions, and the performance of each test in finite samples to settle on a best choice. In large samples, if the null hypothesis is true, the likelihood ratio, Wald, and Lagrange multiplier tests all tend to the same answer. However, even in large samples, the three tests can differ in their power against various alternatives.
Summary
The chapter began by introducing a strategy, maximum likelihood estimation, for constructing asymptotically efficient estimators.
Maximum likelihood estimation requires a complete specification of the distribution of variables and disturbances in the DGP, which is often an unfillable order. But when such detailed distributional information is available, maximum likelihood provides asymptotically normally distributed, asymptotically efficient estimators under quite general conditions.

The chapter then introduced three classes of hypothesis tests: likelihood ratio tests, Wald tests, and Lagrange multiplier tests. The three differ in their computational requirements, maintained hypotheses, small-sample performance, and power against various alternative hypotheses. Likelihood ratio tests require estimates of both the unconstrained and the constrained model. Wald tests only require estimates of the unconstrained model. Lagrange multiplier tests only require estimates of the constrained model. When the null hypothesis is true, all three tests tend toward the same conclusion in large samples, but in small samples the statistics may lead to conflicting conclusions, even when the null hypothesis is true.
Endnotes
1. For example, if the disturbances follow the double exponential distribution, the maximum likelihood estimator for $\beta$ is the estimator $\tilde{\beta}$ such that $(Y_i - X_i\tilde{\beta})$ has a median of zero.
2. In the special case of linear constraints on a linear regression with a DGP that satisfies the Gauss–Markov Assumptions, the Wald test statistic is exactly $rF$. Thus, in this special (but frequently encountered) case, the F-statistic is close kin (essentially an identical twin) to the Wald statistic, as well as to the likelihood ratio statistic. For this reason, some statistical packages refer to the F-test as a Wald test and contrast it with likelihood ratio tests. However, in general, the Wald statistic differs substantively from the F-statistic. For this reason, other statistical packages report both an F-statistic and a Wald test statistic when testing constraints on coefficients.
3. When the Gauss–Markov Assumptions hold and disturbances are normally distributed, the Lagrange multiplier statistic is the same as $rF$, except that in the denominator we replace the sum of squared residuals from the unconstrained regression with those from the constrained regression.
4. K. J. Arrow, Hollis Chenery, B. S. Minhas, and Robert Solow, "Capital-Labor Substitution and Economic Efficiency," Review of Economics and Statistics (August 1961): 225–250.