Logistic Regression in Data Analysis: An Overview
Maher Maalouf*
School of Industrial Engineering
University of Oklahoma
202 W. Boyd St., Room 124
Norman, OK. 73019 USA
E-mail: [email protected]
*Corresponding author
1 Introduction
Logistic Regression (LR) is one of the most important statistical and data
mining techniques employed by statisticians and researchers for the analysis and
classification of binary and proportional response data sets [2, 27, 29, 40]. Some
of the main advantages of LR are that it can naturally provide probabilities and
extend to multi-class classification problems [27, 37]. Another advantage is that
most of the methods used in LR model analysis follow the same principles used
in linear regression [31]. What's more, most of the standard unconstrained optimization techniques can be applied to obtain the LR estimates. To see why the linear model is inadequate for a binary response, consider first the linear regression model in matrix form,

y = Xβ + ε,   (1)
where ε is the error vector, and where
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}, \quad
\boldsymbol{β} = \begin{bmatrix} β_0 \\ β_1 \\ \vdots \\ β_d \end{bmatrix}, \quad \text{and} \quad
\boldsymbol{ε} = \begin{bmatrix} ε_1 \\ ε_2 \\ \vdots \\ ε_n \end{bmatrix}.   (2)
The vector β is the vector of unknown parameters, with each instance augmented by an intercept term so that x_i ← [1, x_i] and β ← [β_0, β^T]. From now on, the assumption is that the intercept is included in the vector β. Now, since y_i is a Bernoulli random variable, its probability distribution is

P(y_i) = \begin{cases} p_i, & \text{if } y_i = 1; \\ 1 - p_i, & \text{if } y_i = 0; \end{cases}   (3)

with an expected value

E(y_i) = p_i   (4)

and a variance

V(y_i) = p_i (1 - p_i).   (5)
If the linear model

y_i = x_i β + ε_i   (6)

is fit to such a binary response, it follows that

ε_i = \begin{cases} 1 - p_i, & \text{if } y_i = 1 \text{ (with probability } p_i\text{)}; \\ -p_i, & \text{if } y_i = 0 \text{ (with probability } 1 - p_i\text{)}; \end{cases}   (7)

so that the error has zero mean and a variance V(ε_i) = p_i (1 - p_i).
Since the expected value and variance of both the response and the error are not
constant (heteroskedastic), and the errors are not normally distributed, the least
squares approach cannot be applied. In addition, since yi ∈ {0, 1}, linear regression
could produce predicted values above one or below zero. Thus, when the response vector is binary, the logistic response function, shown in Figure 2, is the appropriate choice.
[Figure 2  The logistic response function: the probability P plotted against x, rising from 0 to 1 over the range −10 to 10.]
The logistic function, which links each instance x_i to the expected value of its binary outcome, is given by

E[y_i \mid x_i, β] = p_i = \frac{e^{x_i β}}{1 + e^{x_i β}} = \frac{1}{1 + e^{-x_i β}}, \quad \text{for } i = 1, \ldots, n.   (11)
The logistic (logit) transformation is the logarithm of the odds of the positive
response, and is defined as
η_i = g(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right) = x_i β.   (12)
In matrix form,

η = Xβ.   (13)

The likelihood of the n independent Bernoulli observations is

L(β) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i},   (14)

and the log-likelihood is therefore

\ln L(β) = \sum_{i=1}^{n} \left[ y_i \ln\!\left(\frac{e^{x_i β}}{1 + e^{x_i β}}\right) + (1 - y_i) \ln\!\left(\frac{1}{1 + e^{x_i β}}\right) \right].   (15)
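For concreteness, the probability model in (11) and the log-likelihood in (15) translate directly into a few lines of NumPy. The sketch below is illustrative only (it is not part of the original development) and assumes that the design matrix X already contains the intercept column of ones, so that beta includes β_0:

    import numpy as np

    def logistic_probabilities(X, beta):
        # Equation (11): p_i = 1 / (1 + exp(-x_i beta)) for every row of X
        return 1.0 / (1.0 + np.exp(-X @ beta))

    def log_likelihood(beta, X, y):
        # Equation (15): sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ]
        p = logistic_probabilities(X, beta)
        eps = 1e-12  # guards against log(0) at extreme probabilities
        return np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))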
Amemiya [3] provides formal proofs that the ML estimator for LR satisfies the ML
estimators’ desirable properties. Unfortunately, there is no closed form solution
to maximize ln L(β) with respect to β. The LR Maximum Likelihood Estimates
(MLE) are therefore obtained using numerical optimization methods, which start
with a guess and iterate to improve on that guess. One of the most commonly used
numerical methods is the Newton-Raphson method, for which both the gradient vector and the Hessian matrix are needed:
\frac{\partial}{\partial β_j} \ln L(β) = \sum_{i=1}^{n} \left[ y_i \left( \frac{x_{ij}}{1 + e^{x_i β}} \right) + (1 - y_i) \left( \frac{-x_{ij} e^{x_i β}}{1 + e^{x_i β}} \right) \right]   (16)

= \sum_{i=1}^{n} \left[ y_i x_{ij} \left( \frac{1}{1 + e^{x_i β}} \right) - (1 - y_i) x_{ij} \left( \frac{e^{x_i β}}{1 + e^{x_i β}} \right) \right]   (17)

= \sum_{i=1}^{n} \left( y_i x_{ij} (1 - p_i) - (1 - y_i) x_{ij} p_i \right)   (18)

= \sum_{i=1}^{n} x_{ij} (y_i - p_i) = 0,   (19)
where j = 0, . . . , d indexes the parameters (including the intercept). Each of the partial derivatives is then set to zero. In matrix form, equation (19) is written as
∇_β \ln L(β) = X^{T}(y - p) = 0,   (20)

where p is the vector of probabilities p_i. The log-likelihood in (15) is concave, but when the classes are (nearly) separable the ML estimates can grow without bound; adding a quadratic (ridge) penalty stabilizes the estimates, giving the regularized log-likelihood

\ln L(β) = \sum_{i=1}^{n} \left[ y_i \ln\!\left(\frac{e^{x_i β}}{1 + e^{x_i β}}\right) + (1 - y_i) \ln\!\left(\frac{1}{1 + e^{x_i β}}\right) \right] - \frac{λ}{2}\|β\|^{2}   (25)

= \sum_{i=1}^{n} \ln\!\left(\frac{e^{y_i x_i β}}{1 + e^{x_i β}}\right) - \frac{λ}{2}\|β\|^{2},   (26)

where λ > 0 is the regularization parameter and \frac{λ}{2}\|β\|^{2} is the regularization
(penalty) term. For binary outputs, the loss function, or deviance (DEV), which is also useful for measuring the goodness-of-fit of the model, is based on the negative log-likelihood and is given by [31, 42]

DEV(\hat{β}) = -2 \ln L(\hat{β}).   (27)
Minimizing the deviance DEV(β̂) given in (27) is equivalent to maximizing the log-
likelihood [31]. Recent studies have shown that the Conjugate Gradient (CG) method, when applied within the Iteratively Re-weighted Least Squares (IRLS) procedure, provides better estimates of β than other numerical methods [51, 55].
One of the most popular techniques used to find the MLE of β is the iteratively re-
weighted least squares (IRLS) method, which uses the Newton-Raphson algorithm to
solve LR score equations. Each iteration finds the Weighted Least Squares (WLS)
estimates for a given set of weights, which are used to construct a new set of
weights [26]. The gradient and the Hessian are obtained by differentiating the
regularized likelihood in (26) with respect to β, obtaining, in matrix form
∇_β \ln L(β) = X^{T}(y - p) - λβ = 0,   (28)

∇^{2}_β \ln L(β) = -\left( X^{T} V X + λ I \right),   (29)

where V is an n × n diagonal matrix with entries v_{ii} = p_i(1 - p_i) and I is a d × d identity matrix. Now that the first and second derivatives are obtained, the Newton-Raphson update formula on the (c + 1)-th iteration is given by

\hat{β}^{(c+1)} = \hat{β}^{(c)} + \left( X^{T} V X + λ I \right)^{-1} \left( X^{T}(y - p) - λ \hat{β}^{(c)} \right).   (30)

Since \hat{β}^{(c)} = (X^{T} V X + λ I)^{-1} (X^{T} V X + λ I) \hat{β}^{(c)}, then (30) can be rewritten as

\hat{β}^{(c+1)} = \left( X^{T} V X + λ I \right)^{-1} \left( X^{T} V X \hat{β}^{(c)} + X^{T}(y - p) \right)   (31)

= \left( X^{T} V X + λ I \right)^{-1} X^{T} V z^{(c)},   (32)

where z^{(c)} = X \hat{β}^{(c)} + V^{-1}(y - p) is referred to as the adjusted response [27].
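As a minimal sketch (an assumption for illustration, not the paper's implementation), a single regularized IRLS/Newton-Raphson step of (28)-(32) can be coded in NumPy with a dense linear solve standing in for the WLS solution; X is again taken to include the intercept column:

    import numpy as np

    def irls_update(beta, X, y, lam):
        # One Newton-Raphson / IRLS step: solve (X^T V X + lam I) beta_new = X^T V z
        p = 1.0 / (1.0 + np.exp(-X @ beta))                      # current p_i
        v = p * (1.0 - p)                                        # diagonal of V
        z = X @ beta + (y - p) / v                               # adjusted response z^(c)
        A = X.T @ (v[:, None] * X) + lam * np.eye(X.shape[1])    # X^T V X + lam I
        b = X.T @ (v * z)                                        # X^T V z^(c)
        return np.linalg.solve(A, b)                             # beta^(c+1), as in (32)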
Despite the advantage of the regularization parameter λ in forcing positive definiteness, if the matrix (X^T V X + λI) were dense, the iterative computation could become unacceptably slow [42]. This necessitates a trade-off between convergence speed and the accuracy of the Newton direction [44]. The method that provides such a trade-off is known as the truncated Newton method.
Each Newton step is thus equivalent to solving a regularized weighted least squares subproblem, namely minimizing

\frac{1}{2} \hat{β}^{(c+1)T} \left( X^{T} V X + λ I \right) \hat{β}^{(c+1)} - \hat{β}^{(c+1)T} \left( X^{T} V z^{(c)} \right).   (33)
Komarek and Moore [43] were the first to implement a modified linear CG to
approximate the Newton direction in solving the IRLS for LR. This technique is
called Truncated-Regularized Iteratively-Reweighted Least Squares (TR-IRLS). The
main advantage of the CG method is that it guarantees convergence in at most d
steps [44]. The TR-IRLS algorithm consists of two loops. Algorithm 1 represents
the outer loop which finds the solution to the WLS problem and is terminated
when the relative difference of deviance between two consecutive iterations is no
larger than a specified threshold ε1 . Algorithm 2 represents the inner loop, which
solves the WLS subproblems in Algorithm 1 through the linear CG method, thereby approximating the Newton direction. Algorithm 2 is terminated when the norm of the residual falls below a specified threshold or a maximum number of CG iterations is reached.
Once the ML estimates β̂ are found, classification of any given instance x_i is carried out according to the following rule:

\hat{y}_i = \begin{cases} 1, & \text{if } \hat{η}_i \geq 0 \text{ or, equivalently, } \hat{p}_i \geq 0.5; \\ 0, & \text{otherwise.} \end{cases}   (34)
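To make the two-loop structure and the decision rule concrete, the following Python sketch mimics the spirit of TR-IRLS, using SciPy's linear conjugate gradient as the inner solver and classifying with rule (34). It is an illustrative assumption rather than Komarek and Moore's reference implementation; the function names, tolerances, and dense handling of X are choices made here for brevity:

    import numpy as np
    from scipy.sparse.linalg import cg

    def tr_irls(X, y, lam=1.0, eps1=1e-4, max_outer=30, max_cg=200):
        # Outer loop (Algorithm 1): regularized IRLS, stopped on the relative
        # change in deviance; inner loop (Algorithm 2): linear CG on the WLS system.
        n, d = X.shape
        beta = np.zeros(d)
        dev_old = np.inf
        for _ in range(max_outer):
            p = 1.0 / (1.0 + np.exp(-X @ beta))
            v = np.clip(p * (1.0 - p), 1e-10, None)           # diagonal of V
            z = X @ beta + (y - p) / v                        # adjusted response
            A = X.T @ (v[:, None] * X) + lam * np.eye(d)      # X^T V X + lam I
            b = X.T @ (v * z)                                 # X^T V z
            beta, _ = cg(A, b, x0=beta, maxiter=max_cg)       # approximate Newton direction
            dev = -2.0 * np.sum(y * np.log(np.clip(p, 1e-12, 1.0))
                                + (1.0 - y) * np.log(np.clip(1.0 - p, 1e-12, 1.0)))
            if abs(dev_old - dev) <= eps1 * abs(dev):         # relative-deviance stopping rule
                break
            dev_old = dev
        return beta

    def classify(X, beta):
        # Rule (34): predict 1 when eta_hat = x_i beta >= 0, i.e. p_hat >= 0.5
        return (X @ beta >= 0.0).astype(int)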
Almost all of the conventional classification methods are based on the assumption
that the training data consist of examples drawn from the same distribution as
the testing data (or real-life data) [65, 68]. Likewise in Generalized Linear Models
(GLM), likelihood functions solved by methods such as LR are based on the
concepts of random sampling or exogenous sampling [39, 67]. To see why this is
the case [3, 10], under random sampling, the true joint distribution of y and X is
P (y|X)P (X), and the likelihood function based on n binary observations is given
by
L_{Random} = \prod_{i=1}^{n} P(y_i \mid x_i, β)\, P(x_i).   (35)
Since P(x_i) carries no information about β, it can be dropped from the maximization, and the likelihood to be maximized reduces to

L = \prod_{i=1}^{n} P(y_i \mid x_i, β).   (37)
In small samples, however, the ML estimates are biased. For a one-parameter likelihood, the second-order bias is

E[\hat{β} - β] = -\frac{1}{2n} \, \frac{i_{30} + i_{11}}{i_{20}^{2}},   (38)

where i_{30} = E\!\left[\left(\frac{\partial L}{\partial β}\right)^{3}\right], i_{11} = E\!\left[\left(\frac{\partial L}{\partial β}\right)\!\left(\frac{\partial^{2} L}{\partial β^{2}}\right)\right], and i_{20} = E\!\left[\left(\frac{\partial L}{\partial β}\right)^{2}\right], all evaluated at \hat{β}. Following King and Zeng [39], if p_i = \frac{1}{1 + e^{-β_0 + x_i}}, then the asymptotic bias of the intercept is

E[\hat{β}_0 - β_0] = -\frac{1}{n} \, \frac{E\!\left[(0.5 - \hat{p}_i)\left((1 - \hat{p}_i)^{2} y_i + \hat{p}_i^{2}(1 - y_i)\right)\right]}{\left(E\!\left[(1 - \hat{p}_i)^{2} y_i + \hat{p}_i^{2}(1 - y_i)\right]\right)^{2}}   (39)
\approx \frac{p - 0.5}{n\,p\,(1 - p)},   (40)

and the variance of \hat{β} is

V(\hat{β}) = \left[ \sum_{i=1}^{n} p_i (1 - p_i)\, x_i^{T} x_i \right]^{-1}.   (41)
The variance given in (41) is smallest when the part pi (1 − pi ), which is affected by
rare events, is closer to 0.5. This occurs when the number of ones is large enough
in the sample. However, the estimate of pi with observations related to rare events
is usually small, and hence additional ones would cause the variance to drop while
additional zeros at the expense of events would cause the variance to increase [39,
58]. The strategy is to select on y by collecting observations for which yi = 1 (the
cases), and then selecting random observations for which yi = 0 (the controls). The
objective then is to keep the variance as small as possible by keeping a balance
between the number of events (ones) and non-events (zeros) in the sample under
study. This is achieved through endogenous sampling or choice − based sampling.
Endogenous sampling occurs whenever sample selection is based on the dependent
variable (y), rather than on the independent (exogenous) variable (X).
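A small numeric illustration (not from the paper; the values of n and p below are arbitrary) shows how quickly the approximate intercept bias in (40) grows in magnitude as events become rarer for a fixed sample size:

    # Approximate bias of the intercept from (40): (p - 0.5) / (n p (1 - p))
    n = 1000
    for p in (0.5, 0.1, 0.01, 0.001):
        bias = (p - 0.5) / (n * p * (1.0 - p))
        print(f"p = {p:>6}: approximate intercept bias = {bias:+.4f}")

For example, with n = 1000 the approximation gives a bias of about −0.05 at p = 0.01 and about −0.50 at p = 0.001, while it vanishes at p = 0.5.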
However, since the objective is to derive inferences about the population
from the sample, the estimates obtained by the common likelihood using pure
endogenous sampling are inconsistent. King and Zeng [39] recommend two
methods of estimation for choice-based sampling, prior correction and weighting.
4.2 Correcting Estimates Under Endogenous Sampling
4.2.1 Prior Correction

Prior correction relies on knowledge of the population fraction of events, τ, and the observed sample fraction of events, ȳ [39]. Let s = 1 indicate that an observation is selected into the sample, let p̃ denote the population probability of an event, and let p̂ denote the corresponding probability conditional on selection. By Bayes' rule,

\hat{p} = P(y = 1 \mid s = 1) = \frac{P(s = 1 \mid y = 1)\, P(y = 1)}{P(s = 1 \mid y = 1)\, P(y = 1) + P(s = 1 \mid y = 0)\, P(y = 0)}   (42)

= \frac{\left(\frac{\bar{y}}{τ}\right) \tilde{p}}{\left(\frac{\bar{y}}{τ}\right) \tilde{p} + \left(\frac{1 - \bar{y}}{1 - τ}\right)(1 - \tilde{p})}   (43)

= \frac{ν_1 \tilde{p}}{ν_1 \tilde{p} + ν_0 (1 - \tilde{p})},   (44)

where ν_1 = \bar{y}/τ and ν_0 = (1 - \bar{y})/(1 - τ). The corresponding odds are

O = \frac{\hat{p}}{1 - \hat{p}} = \frac{ν_1 \tilde{p}}{ν_0 (1 - \tilde{p})}.   (45)

Since the logistic model fitted to the selected sample estimates \hat{p}_i = 1/(1 + e^{-x_i β}), solving (45) for the population probability yields

\tilde{p}_i = \frac{1}{1 + e^{\,\ln\left[\left(\frac{1 - τ}{τ}\right)\left(\frac{\bar{y}}{1 - \bar{y}}\right)\right] - x_i β}}, \quad \text{for } i = 1, \ldots, n.   (49)
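As a sketch (an assumption, not code from the paper), prior correction then amounts to a constant shift of the fitted linear predictor, where beta is the ordinary LR fit on the case-control sample, tau is the population fraction of events, and y_bar is the sample fraction of events:

    import numpy as np

    def prior_corrected_probabilities(X, beta, tau, y_bar):
        # Equation (49): shift x_i beta by ln[((1 - tau)/tau) * (y_bar/(1 - y_bar))]
        correction = np.log(((1.0 - tau) / tau) * (y_bar / (1.0 - y_bar)))
        return 1.0 / (1.0 + np.exp(correction - X @ beta))

Equivalently, the correction subtracts ln[((1 − τ)/τ)(ȳ/(1 − ȳ))] from the fitted intercept β̂_0 and leaves the slope estimates unchanged [39].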
4.2.2 Weighting
Under pure endogenous sampling the selection is based on y, so it is X that is modeled conditionally on y [10, 54]:

P(X \mid y, β) = \frac{f(y, X \mid β)}{P(y)}.   (51)

Since y is drawn according to the sample distribution P_s(y) rather than the population distribution P(y), the joint distribution of y and X in the sample is

f_s(y, X \mid β) = P(X \mid y, β)\, P_s(y)   (52)

= \frac{P_s(y)}{P(y)}\, P(y \mid X, β)\, P(X)   (53)

= \frac{H}{Q}\, P(y \mid X, β)\, P(X),   (54)

where \frac{H}{Q} = \frac{P_s(y)}{P(y)}. The likelihood is then
L_{Endogenous} = \prod_{i=1}^{n} \frac{H_i}{Q_i}\, P(y_i \mid x_i, β)\, P(x_i),   (55)

where \frac{H_i}{Q_i} = \left(\frac{\bar{y}}{τ}\right) y_i + \left(\frac{1 - \bar{y}}{1 - τ}\right)(1 - y_i). Therefore, when dealing with REs (rare events) and
imbalanced data, it is the likelihood in (55) that needs to be maximized [3, 67, 10,
52, 33]. Several consistent estimators of this type of likelihood have been proposed
in the literature. Amemiya [3] and Ben Akiva and Lerman [4] provide an excellent
survey of these methods.
Manski and Lerman [52] proposed the Weighted Exogenous Sampling
Maximum Likelihood (WESML), and proved that WESML yields a consistent
and asymptotically normal estimator so long as knowledge of the population
probability is available. More recently, Ramalho and Ramalho [60] extended the
work of Manski and Lerman [52] to cases where such knowledge may not be
available. Knowledge of population probability or proportions, however, can be
acquired from previous surveys or existing databases. The log-likelihood for LR
can then be rewritten as
\ln L(β \mid y, X) = \sum_{i=1}^{n} \frac{Q_i}{H_i} \ln P(y_i \mid x_i, β)   (56)

= \sum_{i=1}^{n} \frac{Q_i}{H_i} \ln\!\left(\frac{e^{y_i x_i β}}{1 + e^{x_i β}}\right)   (57)

= \sum_{i=1}^{n} w_i \ln\!\left(\frac{e^{y_i x_i β}}{1 + e^{x_i β}}\right),   (58)
where w_i = Q_i / H_i. Thus, in order to obtain consistent estimators, the likelihood is multiplied by the inverse of the sampling fractions. The intuition behind weighting is that if the proportion of events in the sample is more than that in the population, then the ratio Q/H < 1 and hence the events are given less weight, while the non-events would be given more weight if their proportion in the sample is less than
that in the population. This estimator, however, is not fully efficient, because the
information matrix equality does not hold. This is demonstrated as
-E\!\left[\frac{Q}{H} ∇^{2}_{β} \ln P(y \mid X, β)\right] \neq E\!\left[\left(\frac{Q}{H} ∇_{β} \ln P(y \mid X, β)\right)\!\left(\frac{Q}{H} ∇_{β} \ln P(y \mid X, β)\right)^{T}\right],   (59)
Let A = \frac{1}{n}\sum_{i=1}^{n} \frac{Q_i}{H_i}\, p_i (1 - p_i)\, x_i^{T} x_i and B = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{Q_i}{H_i}\right)^{2} p_i (1 - p_i)\, x_i^{T} x_i; then the asymptotic variance matrix of the estimator β is given by the sandwich estimate V(β) = A^{-1} B A^{-1} [3, 67, 52].
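A short sketch (assumed here, not the authors' code) ties the pieces together: the WESML weights w_i = Q_i/H_i built from τ and ȳ, the weighted log-likelihood (58), and the sandwich variance A^{-1}BA^{-1}:

    import numpy as np

    def wesml_weights(y, tau):
        # w_i = Q_i / H_i: tau / y_bar for events, (1 - tau) / (1 - y_bar) for non-events
        y_bar = y.mean()
        return np.where(y == 1, tau / y_bar, (1.0 - tau) / (1.0 - y_bar))

    def weighted_log_likelihood(beta, X, y, w):
        # Equation (58): each term of the log-likelihood is scaled by w_i
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        eps = 1e-12
        return np.sum(w * (y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps)))

    def sandwich_variance(beta, X, w):
        # V(beta) = A^{-1} B A^{-1}, with A and B as defined in the text above
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        v = p * (1.0 - p)
        n = X.shape[0]
        A = X.T @ ((w * v)[:, None] * X) / n        # (1/n) sum_i w_i   p_i (1 - p_i) x_i^T x_i
        B = X.T @ ((w ** 2 * v)[:, None] * X) / n   # (1/n) sum_i w_i^2 p_i (1 - p_i) x_i^T x_i
        A_inv = np.linalg.inv(A)
        return A_inv @ B @ A_inv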
Now that consistent estimators are obtained, finite-sample/rare-event bias
corrections could be applied. King and Zeng [39] extended the small-sample bias
corrections, as described by McCullagh and Nelder [53], to include the weighted
likelihood (58), and demonstrated that even with choice-based sampling, these
corrections can make a difference when the population probability of the event of
interest is low. According to McCullagh and Nelder [53], and later Cordeiro and
McCullagh [15], the bias vector bias(β̂) can be derived analytically, and the bias-corrected estimator is then

\tilde{β} = \hat{β} - \mathrm{bias}(\hat{β}).   (62)
A recent comparative simulation study by Maiti and Pradhan [50] showed that
the bias correction of McCullagh and Nelder [53], provides the smallest Mean
Squared Error (MSE) when compared to that of Firth [23] and others using
LR. Cordeiro and Barroso [14] more recently derived a third-order bias corrected
estimator and showed that in some cases it could deliver improvements in terms of
bias and MSE over the usual ML estimator and that of Cordeiro and McCullagh
[15].
Now, as mentioned earlier, LR regularization is used in the form of the ridge penalty (λ/2)||β||². When regularization is introduced, none of the coefficients is set
to zero [59], and hence the problem of infinite parameter values is avoided. In
addition, the importance of the parameter λ lies in determining the bias-variance
trade-off of an estimator [17, 49]. When λ is very small, there is less bias but
more variance. On the other hand, larger values of λ would lead to more bias
but less variance [5]. Therefore, the inclusion of regularization in the LR model
is very important to reduce any potential inefficiency. However, as regularization
carries the risk of a non-negligible bias, even asymptotically [5], the need for bias
correction becomes inevitable [48]. In sum, bias correction is needed to account for
any bias resulting from regularization, small samples, and rare events.
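In practice λ is usually tuned by cross-validation. The short sketch below (an assumption, not part of the paper) does this with scikit-learn's LogisticRegressionCV; note that scikit-learn parameterizes the ridge penalty through C = 1/λ, so small C corresponds to heavy shrinkage (more bias, less variance) and large C to the opposite:

    from sklearn.linear_model import LogisticRegressionCV

    def fit_with_cv(X, y, lambdas=(0.01, 0.1, 1.0, 10.0, 100.0)):
        # Candidate penalties are supplied as C = 1 / lambda
        Cs = [1.0 / lam for lam in lambdas]
        model = LogisticRegressionCV(Cs=Cs, cv=5, penalty="l2", scoring="neg_log_loss")
        model.fit(X, y)
        return model, 1.0 / model.C_[0]   # fitted model and the selected lambda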
The challenge remains in finding the best class distribution in the training
dataset. First, when both the events and non-events are easy to collect and both
are available, then a sample with an equal number of ones and zeros would be
generally optimum [16, 32]. Second, when the number of events in the population is
very small, the decision is then how many more non-events to collect in addition to
the events. If collecting more non-events is inexpensive, then the general judgment
is to collect as many non-events as possible. However, as the number of non-
events exceeds the number of events, the marginal contribution to the explanatory
variables’ information content starts to drop, and hence the number of zeros should
be no more than two to five times the number of ones [39].
Applying the above corrections, offered by King and Zeng [39], along with the
recommended sampling strategies, such as collecting all of the available events
and only a matching proportion of non-events, could (1) significantly decrease the
sample size under study, (2) cut data collection costs, (3) increase the rare event
probability in the sample, and (4) enable researchers to focus more on analyzing the variables.
6 Conclusions
Logistic regression provides a powerful means of modeling the dependence of a binary or multi-class response variable on one or more independent variables, which may be categorical, continuous, or both. The resulting model can be fitted using a number of methods, the most important of which is the IRLS method, which in turn is best solved using the CG method in the form of the truncated Newton method. Furthermore, with regard to imbalanced
and rare events data sets, certain sampling strategies and appropriate corrections
should be applied to the LR method. The most common correction techniques are
prior correction and weighting.
In addition, LR is adaptable to handle other data mining challenges, such
as the problems of collinearity, missing data, redundant attributes and nonlinear
separability, among others, making LR a powerful and resilient data mining
method. It is our hope that this overview of the LR method and the state-of-the-art techniques developed for it in the literature will shed further light on the method and will encourage and direct future theoretical and applied research on it.
References