Sparse Estimation With Math and R 100 Exercises For Building Logic 1st Ed 2021 9811614458 9789811614453
Sparse Estimation With Math and R 100 Exercises For Building Logic 1st Ed 2021 9811614458 9789811614453
Sparse
Estimation
with Math
and R
100 Exercises for Building Logic
Sparse Estimation with Math and R
Joe Suzuki
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
I started considering the sparse estimation problems around 2017 when I moved
from the mathematics department to statistics in Osaka University, Japan. I have
been studying information theory and graphical models for over 30 years.
The first book I found is “Statistical Learning with Sparsity” by T. Hastie, R.
Tibshirani, and M. Wainwright. I thought it was a monograph rather than a textbook
and that it would be tough for a nonexpert to read it through. However, I downloaded
more than 50 papers that were cited in the book and read them all. In fact, the book
does not instruct anything but only suggests how to study sparsity. The contrast
between statistics and convex optimization is gradually attracting me as I understand
the material.
On the other hand, it seems that the core results on sparsity have come out around
2010–2015 for research. However, I still think further possibilities and expansions
are there. This book contains all the mathematical derivations and source programs,
so graduate students can construct any procedure from scratch by getting help from
this book.
Recently, I published books “Statistical Learning with Math and R” (SLMR)
and “Statistical Learning with Math and Python” (SLMP) and will publish “Sparse
Estimation with Math and Python” (SEMP). The common idea is behind the books
(XXMR/XXMP). They not only give knowledge on statistical learning and sparse
estimation but also help build logic in your brain by following each step of the
derivations and each line of the source programs. I often meet with data scientists
engaged in machine learning and statistical analyses for research collaborations and
introduce my students to them. I recently found out that almost all of them think
that (mathematical) logic rather than knowledge and experience is the most crucial
ability for grasping the essence in their jobs. Our necessary knowledge is changing
every day and can be obtained when needed. However, logic allows us to examine
whether each item on the Internet is correct and follow any changes; we might miss
even chances without it.
v
vi Preface
1. Developing logic
To grasp the essence of the subject, we mathematically formulate and solve each
ML problem and build those programs. The SEMR instills “logic” in the minds
of the readers. The reader will acquire both the knowledge and ideas of ML, so
that even if new technology emerges, they will be able to follow the changes
smoothly. After solving the 100 problems, most of the students would say “I
learned a lot”.
2. Not just a story
If programming codes are available, you can immediately take action. It is
unfortunate when an ML book does not offer the source codes. Even if a package
is available, if we cannot see the inner workings of the programs, all we can do
is input data into those programs. In SEMR, the program codes are available
for most of the procedures. In cases where the reader does not understand the
math, the codes will help them understand what it means.
3. Not just a how to book: an academic book written by a university professor.
This book explains how to use the package and provides examples of executions
for those who are not familiar with them. Still, because only the inputs and
outputs are visible, we can only see the procedure as a black box. In this sense,
the reader will have limited satisfaction because they will not be able to obtain
the essence of the subject. SEMR intends to show the reader the heart of ML
and is more of a full-fledged academic book.
4. Solve 100 exercises: problems are improved with feedback from university
students
The exercises in this book have been used in university lectures and have been
refined based on feedback from students. The best 100 problems were selected.
Each chapter (except the exercises) explains the solutions, and you can solve
all of the exercises by reading the book.
5. Self-contained
All of us have been discouraged by phrases such as “for the details, please refer
to the literature XX”. Unless you are an enthusiastic reader or researcher, nobody
will seek out those references. In this book, we have presented the material in
such a way that consulting external references is not required. Additionally, the
proofs are simple derivations, and the complicated proofs are given in the appen-
dices at the end of each chapter. SEMR completes all discussions, including the
appendices.
6. Readers’ pages: questions, discussion, and program files. The reader can ask
any question on the book via https://bayesnet.org/books.
Acknowledgments The author wishes to thank Tianle Yang, Ryosuke Shinmura, and Tomohiro
Kamei for checking the manuscript in Japanese. This English book is largely based on the Japanese
book published by Kyoritsu Shuppan Co., Ltd. in 2020. The author would like to thank Kyoritsu
Shuppan Co., Ltd., in particular, its editorial members Mr. Tetsuya Ishii and Ms. Saki Otani. The
author also appreciates Ms. Mio Sugino, Springer, for preparing the publication and providing
advice on the manuscript.
Contents
1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Subderivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Ridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 A Comparison Between Lasso and Ridge . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7 About How to Set the Value of λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Exercises 1–20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Generalized Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.1 Generalization of Lasso in Linear Regression . . . . . . . . . . . . . . . . . . . 37
2.2 Logistic Regression for Binary Values . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3 Logistic Regression for Multiple Values . . . . . . . . . . . . . . . . . . . . . . . 46
2.4 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Appendix: Proof of Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Exercises 21–34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1 When One Group Exists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2 Proximal Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.4 Sparse Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.5 Overlap Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.6 Group Lasso with Multiple Responses . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.7 Group Lasso via Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.8 Group Lasso for the Generalized Additive Models . . . . . . . . . . . . . . . 96
Appendix: Proof of Proposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Exercises 35–46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
ix
x Contents
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Chapter 1
Linear Regression
In general statistics, we often assume that the number of samples N is greater than
the number of variables p. If this is not the case, it may not be possible to solve
for the best-fitting regression coefficients using the least squares method, or it is
too computationally costly to compare a total of 2 p models using some information
criterion.
When p is greater than N (also known as the sparse situation), even for linear
regression, it is more common to minimize, instead of the usual squared error, the
modified objective function to which a term is added to prevent the coefficients from
being too large (the so-called regularization term). If the regularization term is a
constant λ times the L1-norm (resp. L2-norm) of the coefficient vector, it is called
Lasso (resp. Ridge). In the case of Lasso, if the value of λ increases, there will be
more coefficients that go to 0, and when λ reaches a certain value, all the coefficients
will eventually become 0. In that sense, we can say that Lasso also plays a role in
model selection.
In this chapter, we examine the properties of Lasso in comparison to those of
Ridge. After that, we investigate the elastic net, a regularized regression method that
combines the advantages of both Ridge and Lasso. Finally, we consider how to select
an appropriate value of λ.
Throughout this chapter, let N ≥ 1 and p ≥ 1 be integers, and let the (i, j) element
of the matrix X ∈ R N × p and the k-th element of the vector y ∈ R N be denoted
by xi, j and yk , respectively. Using these X, y, we find the intercept β0 ∈ R and
the slope β = [β1 , . . . , β p ]T that minimize
y − β0 − X β2 . Here, the L2-norm of
N
z = [z 1 , . . . , z N ]T is denoted by z := 2
i=1 z i .
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 1
J. Suzuki, Sparse Estimation with Math and R,
https://doi.org/10.1007/978-981-16-1446-0_1
2 1 Linear Regression
First, for the sake of simplicity, we assume that the j-th column of X , ( j =
1, . . . , p),
and y have already been centered. That is, for each j = 1, . . . , p, define
N
x̄ j := N1 i=1 xi, j , and assume that x̄ j has already been subtracted from each xi, j so
N
that x̄ j = 0 is satisfied. Similarly, defining ȳ := N1 i=1 yi , we assume that ȳ was
subtracted in advance from each yi so that ȳ = 0 holds. Under this condition, one of
the parameters (β̂0 , β̂) for which we need to solve, say, β̂0 , is always 0. In particular,
∂
N p p
0= (yi − β0 − xi, j β j )2 = −2N ( ȳ − β0 − x̄ j β j ) = 2N β0
∂β0 i=1 j=1 j=1
holds. Thus, from now, without loss of generality, we may assume that the intercept
β0 is zero and use this in our further calculations.
We begin by first observing the following equality:
⎡ ⎤ ⎡ ⎤
∂
N p p
(yi − βk xi,k ) ⎥
2 ⎡ ⎤ y1 − βk x1,k ⎥
⎢ x1,1 · · · x N ,1 ⎢
⎢ ∂β1 i=1 ⎥ ⎢ ⎥
⎢ k=1 ⎥ ⎢ ⎥⎢ k=1 ⎥
⎢ .. ⎥ ⎢ ⎥⎢ .
. ⎥
⎢ ⎥ = −2 ⎢ . ⎥ ⎢ ⎥
⎢ . ⎥ ⎣ vdots . . . .. ⎦ ⎢ . ⎥
⎢ ⎥ ⎢ ⎥
⎣ ∂
N p p
2⎦ x · · · x ⎣
yN − βk x N ,k
⎦
(yi − βk xi,k ) 1, p N,p
∂β p i=1 k=1 k=1
N
p
−2 xi, j (yi − βk xi,k ).
i=1 k=1
Thus, when we set the right-hand side of (1.1) equal to 0, and if X T X is invertible,
then β becomes
β̂ = (X T X )−1 X T y. (1.2)
If we had not performed the data centering, we would still obtain the same slope
β̂, though the intercept β̂0 would be
p
β̂0 = ȳ − x̄ j β̂ j . (1.4)
j=1
1.1 Linear Regression 3
In this book, we focus more on the sparse case, i.e., when p is larger than N . In
this case, a problem arises. When N < p, the matrix X T X does not have an inverse.
In fact, since
rank(X T X ) ≤ rank(X ) ≤ min{N , p} = N < p,
X T X is singular. Moreover, when X has precisely the same two columns, rank(X ) <
p, and the inverse matrix does not exist.
On the other hand, if p is rather large, we have p independent variables to choose
from for the predictors of the variable when carrying out model selection. Thus, the
combinations are
{}, {1}, {2}, . . . , {1, . . . , p}
(that is, whether we choose each of the variables or not). This means we have to
compare a total of 2 p models. Then, to extract the proper model combination by using
an information criterion or cross-validation, the computational resources required
will grow exponentially with the number of variables p.
To deal with this kind of problem, let us consider the following. Let λ ≥ 0 be a
constant. We add a term to y − X β2 that penalizes β for being too large in size.
Specifically, we define
1
L := y − X β2 + λβ1 (1.5)
2N
or
1
L := y − X β2 + λβ22 . (1.6)
N
X ∈ R N × p and y ∈ R N into either (1.5) or (1.6), then minimize it with respect to the
slope β̂ ∈ R p (this is called Lasso or Ridge, respectively), and finally we use (1.4)
to compute β̂0 .
1.2 Subderivative
To address the minimization problem for Lasso, we need a method for optimizing
functions that are not differentiable. When we want to find the points x of the max-
ima or minima of a single-variable polynomial, say, f (x) = x 3 − 2x + 1, we can
differentiate it and find the solution to f (x) = 0. However, what should we do when
we encounter functions such as f (x) = x 2 + x + 2|x|, which contain an absolute
value? To address this, we need to extend our concept of differentiation to a more
general one.
Throughout the following claims, let us assume that f is convex [4, 6]. In general,
we say that f is convex (downward)1 if, for any 0 < α < 1 and x, y ∈ R,
holds. For instance, f (x) = |x| is convex (Fig. 1.1, left) because
is satisfied. To check this, since both sides are nonnegative, the RHS squared minus
the LHS squared gives 2α(1 − α)(|x y| − x y) ≥ 0. As another example, consider
1, x = 0
f (x) = (1.7)
0, x = 0.
Therefore, it is not convex (Fig. 1.1, right). If functions f, g are convex, then for any
β, γ ≥ 0, the function β f (x) + γg(x) has to be convex since the following holds:
1, x = 0
f (x) = |x| f (x) =
0, x = 0
1
f (x) = −1 f (x) = 1 Not Convex
0 x
Convex
Fig. 1.1 Left: f (x) = |x| is convex. However, at the origin, the derivatives from each side differ;
thus, it is not differentiable. Right: We cannot simply judge from its shape, but this function is not
convex
is a subderivative of f at x0 .
If f is differentiable at x0 , then the subderivative will be a set that contains only
one element, say, f (x0 ).2 We prove this as follows.
First, the convex function f is differentiable at x0 ; thus, it satisfies f (x) ≥
f (x0 ) + f (x0 )(x − x0 ). To see this, since f is convex,
2 In this case, we would not write it as a set { f (x0 )} but as an element f (x0 ).
6 1 Linear Regression
the right at x0 . Since f is differentiable at x0 , those two derivatives are equal; this
completes the proof.
The main interest of this book is specifically the case where f (x) = |x| and
x0 = 0. Hence, by (1.8), its subderivative is the set of z such that for any x ∈ R,
|x| ≥ zx. These values of z lie in the interval greater than or equal to −1 and less
than or equal to 1, and
is true. Let us confirm this. If for any x, |x| ≥ zx holds, then for x > 0 and x < 0,
z ≤ 1 and z ≥ −1, respectively, need to be true. Conversely, if −1 ≤ z ≤ 1, then
zx ≤ |zx| ≤ |x| is true for any arbitrary x ∈ R.
Example 1 By dividing into three cases x < 0, x = 0, and x > 0, find the values
x that attain the minimum of x 2 − 3x + |x| and x 2 + x + 2|x|. Note that for x = 0,
we can find their usual derivatives, but for f (x) = |x| at x = 0, its subderivative is
the interval [−1, 1].
x 2 − 3x + x, x ≥ 0 x 2 − 2x, x ≥ 0
x 2 − 3x + |x| = =
x 2 − 3x − x, x < 0 x 2 − 4x, x < 0
⎧
⎨ 2x − 2, x >0
(x 2 − 3x + |x|) = 2x − 3 + [−1, 1] = −3 + [−1, 1] = [−4, −2] 0, x = 0
⎩
2x − 4 < 0, x < 0.
x 2 + x + 2x, x ≥ 0 x 2 + 3x, x ≥ 0
x 2 + x + 2|x| = =
x 2 + x − 2x, x < 0 x 2 − x, x < 0
⎧
⎨ 2x + 3 > 0, x >0
(x 2 + x + 2|x|) = 2x + 1 + 2[−1, 1] = 1 + 2[−1, 1] = [−1, 3] 0, x = 0
⎩
2x − 1 < 0, x < 0.
Therefore, x 2 + x + 2|x| has a minimum at x = 0 (Fig. 1.2, right). We use the fol-
lowing code to draw the figures.
1 curve(x ^ 2 - 3 * x + abs(x), -2, 2, main = "y = x^2 - 3x + |x|")
2 points(1, -1, col = "red", pch = 16)
3 curve(x ^ 2 + x + 2 * abs(x), -2, 2, main = "y = x^2 + x + 2|x|")
4 points(0, 0, col = "red", pch = 16)
The subderivative of f (x) = |x| at x = 0 is the interval [−1, 1]. This fact sum-
marizes this chapter.
1.3 Lasso 7
y = x2 − 3x + |x| y = x2 + x + 2|x|
0 2 4 6 8 10 12
8 10
6
y
y
4
2
0
-2 -1 0 1 2 -2 -1 0 1 2
x x
Fig. 1.2 x 2 − 3x + |x| (left) has a minimum at x = 1, and x 2 + x + 2|x| (right) has a minimum at
x = 0. Neither is differentiable at x = 0. Despite not being differentiable, the point is a minimum
for the figure on the right
1.3 Lasso
1
L := y − X β2 + λβ1 (1.5)
2N
is called Lasso [28].
From the formularization of (1.5) and (1.6), we can tell that Lasso and Ridge are
the same in the sense that they both try to control the size of regression coefficients
β. However, Lasso also has the property of leaving significant coefficients as non-
negative, which is particularly beneficial in variable selection. Let us consider its
mechanism.
Note that in (1.5), the division of the first term by 2 is not essential: we would
obtain an equivalent formularization if we double the value of λ. For the sake of
simplicity, first let us assume that
1
N
1, j = k
xi, j xi,k = (1.9)
N i=1 0, j = k
1
N
holds, and let s j = xi, j yi . With this assumption, the calculations are made
N i=1
much simpler.
8 1 Linear Regression
-4 -2 0 2 4
λ=5
Sλ (x)
-10 -5 0 5 10
x
and hence becomes β j = Sλ (s j ). We plot the graph of Sλ (x) when λ = 5 in Fig. 1.3
using the code provided below:
1 soft.th = function(lambda, x) return(sign(x) * pmax(abs(x) - lambda, 0))
2 curve(soft.th(5, x), -10, 10, main = "soft.th(lambda, x)")
3 segments(-5, -4, -5, 4, lty = 5, col = "blue"); segments(5, -4, 5, 4, lty =
5, col = "blue")
4 text(-0.2, 1, "lambda = 5", cex = 1.5, col = "red")
1.3 Lasso 9
Next, let us consider the case where (1.9) is not satisfied. We rewrite (1.10) by
⎧
⎨ 1, βj > 0
1
N
0∈− xi, j (ri, j − xi, j β j ) + λ [−1, 1], β j = 0
N i=1 ⎩
−1, β j < 0.
1
N
Here, we denote yi − xi,k βk by ri, j , and let ri, j xi, j be s j . Next, fix βk
k= j
N i=1
(k = j), and update β j . We do this repeatedly from j = 1 to j = p until it converges
(coordinate descent). For example, we can implement the algorithm as follows:
1 linear.lasso = function(X, y, lambda = 0, beta = rep(0, ncol(X))) {
2 n = nrow(X); p = ncol(X)
3 res = centralize(X, y) ## centering (please refer to below code)
4 X = res$X; y = res$y
5 eps = 1; beta.old = beta
6 while (eps > 0.001) { ## wait for this loop to converge
7 for (j in 1:p) {
8 r = y - as.matrix(X[, -j]) %*% beta[-j]
9 beta[j] = soft.th(lambda, sum(r * X[, j]) / n) / (sum(X[, j] * X[, j])
/ n)
10 }
11 eps = max(abs(beta - beta.old)); beta.old = beta
12 }
13 beta = beta / res$X.sd ## restore each variable coefficient to that of
before standardized
14 beta.0 = res$y.bar - sum(res$X.bar * beta)
15 return(list(beta = beta, beta.0 = beta.0))
16 }
Thus, we may standardize the data first, then perform Lasso, and finally restore the
data. The aim of doing this is to examine our data all at once. Because the algorithm
sets all β̂ j less than or equal to λ to 0, we do not want it to be based on the indices
j = 1, . . . , p. Each j-th column of X is divided by scale[j], and consequently
the estimated β j will be larger to that extent. We then divide β j by scale[j] as
well.
20
Annual Police Funding in $/Resident
25 yrs.+ of High School
15 16 to 19 yrs. not in High School ...
18 to 24 yrs. in College
25 yrs. + in College
10
β
5
0
-5
-10
Fig. 1.4 Result of Example 2. In the case of Lasso, we see that as λ increases, the coefficients
decrease. At a certain λ, all the coefficients will be 0. The λ at which each coefficient becomes 0
varies
As we can see in Fig. 1.4, as λ increases, the absolute value of each coefficient
decreases. When λ reaches a certain value, all coefficients go to 0. In other words,
for each value of λ, the set of nonzero coefficients differs. The larger λ becomes, the
smaller the set of selected variables.
Example 3 We use the warm start method to reproduce the coefficient β for each λ
in Example 2.
1 crime = read.table("crime.txt"); X = crime[, 3:7]; y = crime[, 1]
2 coef.seq = warm.start(X, y, 200)
3 p = ncol(X); lambda.max = 200; dec = round(lambda.max / 50)
4 lambda.seq = seq(lambda.max, 1, -dec)
5 plot(log(lambda.seq), coef.seq[, 1], xlab = "log(lambda)", ylab = "
coefficients",
6 ylim = c(min(coef.seq), max(coef.seq)), type = "n")
7 for (j in 1:p) lines(log(lambda.seq), coef.seq[, j], col = j)
First, we make the value of λ large enough so that all β j ( j = 1, . . . , p) are 0 and
then gradually decrease the size of λ while performing coordinate N 2descent. Here, for
simplicity, we assume that for each j = 1, . . . , p, we have i=1 xi, j = 1; moreover,
N
the values of i=1 xi, j yi are all different. In this case, the smallest λ that makes
1 N
β j = 0 ( j = 1, . . . , p) can be calculated by λ = max xi, j yi . Particularly,
1≤ j≤ p N
i=1
for any λ larger than this formula, it will be satisfied that for all j = 1, . . . , p, β j = 0.
Then, we have that ri, j = yi (i = 1, . . . , N ) and
1
N
−λ ≤ − xi, j (ri, j − xi, j β j ) ≤ λ
N i=1
hold. Thus, when we decrease the size of λ, one of the values of j will satisfy
1 N
N
| i=1 x i, j (ri, j − x i, j β j )| = λ. Again, since βk = 0 (k = j), if we continue to
make it smaller, we still have that ri, j = yi (i = 1, . . . , N ), and thus the value of
N
| N1 i=1 xi, j yi | for that j becomes smaller than λ.
As a standard R package for Lasso, glmnet is often used [11].
Example 4 (Boston) Using the Boston dataset from the MASS package and setting
the variable of the 14th column as the target variable and the other 13 variables as
predictors, we plot a graph similar to that of the previous one (Fig. 1.5).
1.3 Lasso 13
5
coefficients that are not zero
0
Coefficient
-5
-10
-15
-5 -4 -3 -2 -1 0 1 2
log λ
1 library(glmnet); library(MASS)
2 df = Boston; x = as.matrix(df[, 1:13]); y = df[, 14]
3 fit = glmnet(x, y); plot(fit, xvar = "lambda", main = "BOSTON")
So far, we have seen how Lasso can be useful in the process of variable selection.
However, we have not explained the reason why we seek to minimize (1.5) for
β ∈ R p . Why do we not instead consider minimizing the following usual information
criterion
14 1 Linear Regression
1
y − X β2 + λβ0 (1.12)
2N
1.4 Ridge
β̂ = (X T X + N λI )−1 X T y.
Here, whenever λ > 0, even for the case N < p, we have that X T X + N λI is
nonsingular. In particular, since the matrix X T X is positive semidefinite, we have
that its eigenvalues μ1 , . . . , μ p are all nonnegative. Therefore, the eigenvalues of
X T X + N λI can be calculated by
det(X T X + N λI − t I ) = 0 =⇒ t = μ1 + N λ, . . . , μ p + N λ > 0,
Again, when all the eigenvalues are positive, their product, det(X T X + N λI ),
is also positive, which is the same as saying that X T X + N λI is nonsingular. Note
that this always holds regardless of the sizes of p, N . When N < p, the rank of
X T X ∈ R p× p is less than or equal to N , and hence the matrix is singular. Therefore,
for this case, the following conditions are equivalent:
λ > 0 ⇐⇒ X T X + N λI is nonsingular.
Example 5 We use the same U.S. crime data as that of Example 2 and perform the
following analysis. To control the size of the coefficient of each predictor, we call
the function ridge and then execute.
1 df = read.table("crime.txt")
2 x = df[, 3:7]; y = df[, 1]; p = ncol(x); lambda.seq = seq(0, 100, 0.1)
3 plot(lambda.seq, xlim = c(0, 100), ylim = c(-10, 20), xlab = "lambda", ylab =
"beta",
4 main = "each coefficient’s value for each lambda", type = "n", col = "
red")
5 for (j in 1:p) {
6 coef.seq = NULL
7 for (lambda in lambda.seq) coef.seq = c(coef.seq, ridge(X, y, lambda)$beta[
j])
8 par(new = TRUE); lines(lambda.seq, coef.seq, col = j)
9 }
10 legend("topright",
11 legend = c("annual police funding", "\% of people 25 years+ with 4 yrs
. of high school",
12 "\% of 16--19 year-olds not in highschool and not
highschool graduates",
13 "\% of people 25 years+ with at least 4 years of college"),
14 col = 1:p, lwd = 2, cex = .8)
In Fig. 1.6, we plot how each coefficient changes with the value of λ.
16 1 Linear Regression
20
Annual Police Funding in $/Resident
25 yrs.+ with 4 yrs. of High School
16 to 19 yrs. not in High School ...
15
18 to 24 yrs. in College
25 yrs.+ in College
10
β
5
0
-5
-10
0 20 40 60 80 100
λ
Fig. 1.6 The execution result of Example 5. The changes in the coefficient β with respect to λ
based on Ridge. As λ becomes larger, each coefficient decreases to 0
1 crime = read.table("crime.txt")
2 X = crime[, 3:7]
3 y = crime[, 1]
4 linear(X, y)
1 $beta
2 [1] 10.9806703 -6.0885294 5.4803042 0.3770443 5.5004712
3 $beta.0
4 [1] 489.6486
1 ridge(X, y)
1 $beta
2 [1] 10.9806703 -6.0885294 5.4803042 0.3770443 5.5004712
3 $beta.0
4 [1] 489.6486
1 ridge(X, y, 200)
1.5 A Comparison Between Lasso and Ridge 17
1 $beta
2 [1] 0.056351799 -0.019763974 0.077863094 -0.017121797 -0.007039304
3 $beta.0
4 [1] 716.4044
Next, let us compare Fig. 1.4 of Lasso to Fig. 1.6 of Ridge. We can see that they
are the same in the sense that when λ becomes larger, the absolute value of each
coefficient approaches 0. However, in the case of Lasso, when λ reaches a certain
value, one of the coefficients becomes exactly zero, and the time at which that occurs
varies for each variable. Thus, Lasso can be used for variable selection.
So far, we have shown this fact analytically, but it is also good to have an intuition
geometrically. Figures similar to those in Fig. 1.7 are widely used when one wants
to compare Lasso and Ridge.
Let p = 2 such that X ∈ R N × p is composed of two columns xi,1 , xi,2 (i =
1, . . . , N ). In the least squares process, we solve for the β1 , β2 values that minimize
N
S := (yi − β1 xi,1 − β2 xi,2 )2 . For now, let us denote them by β̂1 , β̂2 , respectively.
i=1
Here, if we let ŷi = xi,1 β̂1 + xi,2 β̂2 , we have
1
1
4
3
3
2
E
E
2
1 11
1
1
0
0
-1
-1
-1 0 1 2 3 -1 0 1 2 3 4
E E
Fig. 1.7 Each ellipse is centered at (β̂1 , β̂2 ), representing the contour line connecting all the points
that give the same value of (1.13). The rhombus in the left figure is the L1 regularization constraint
|β1 | + |β2 | ≤ C , while the circle in the right figure is the L2 regularization constraint β12 + β22 ≤ C
18 1 Linear Regression
N
N
xi,1 (yi − ŷi ) = xi,2 (yi − ŷi ) = 0.
i=1 i=1
N
holds, we can rewrite the quantity to be minimized, (yi − β1 xi,1 − β2 xi,2 )2 , as
i=1
N
N
N
N
(β1 − β̂1 )2 2
xi,1 + 2(β1 − β̂1 )(β2 − β̂2 ) xi,1 xi,2 + (β2 − β̂2 )2 2
xi,2 + (yi − ŷi )2
i=1 i=1 i=1 i=1
(1.13)
and, of course, if we let (β1 , β2 ) = (β̂1 , β̂2 ) here, we obtain the minimum (= RSS).
Thus, we can view the problems of Lasso and Ridge in the following way: the
minimization of quantities (1.5), (1.6) is equivalent to finding the values of (β1 , β2 )
that satisfy the constraints β12 + β22 ≤ C and |β1 | + |β2 | ≤ C , respectively, that also
minimize the quantity of (1.13) (here, the case where λ is large is equivalent to the
case where C, C are small).
The case of Lasso is the same as in the left panel of Fig. 1.7. The ellipses are
centered at (β̂1 , β̂2 ) and represent contours on which the values of (1.13) are the
same. We expand the size of the ellipse (the contour), and once we make contact
with the rhombus, the corresponding values of (β1 , β2 ) are the solution to Lasso.
If the rhombus is small (λ is large), it is more likely to touch only one of the four
rhombus vertices. In this case, one of the β1 , β2 values will become 0. However, in
the Ridge case, as in the right panel of Fig. 1.7, a circle replaces the Lasso rhombus;
hence, it is less likely that β1 = 0, β2 = 0 will occur.
In this case, if the least squares solution (β̂1 , β̂2 ) lies in the green zone of Fig. 1.8,
then we have either β1 = 0 or β2 = 0 as our solution. Moreover, when the rhombus is
small (λ is large), even when (β̂1 , β̂2 ) remains the same, the green zone will become
larger, which is the reason why Lasso performs well in variable selection.
We should not overlook one of the advantages of Ridge: its performance when
dealing with the case of collinearity. That is, it can handle well even the case where
the matrix of explanatory variables contains columns that are highly related. Let us
define by
1
V I F :=
1 − R 2X j |X − j
the VIF (variance inflation factor). The larger this value is, the better the j-th column
variable is explained by the other variables. Here, R 2X j |X − j denotes the coefficient of
determination squared in which X j is the target variable, and the other variables are
predictors.
1.5 A Comparison Between Lasso and Ridge 19
6
represents the area in which 1
the optimal solution would
1
satisfy either β1 = 0, or
4
β2 = 0 if the center of the
ellipses (β̂1 , β̂2 ) lies within it
2
0
-2
-2 0 2 4 6
Example 6 We compute the VIF for the Boston dataset. It shows that the ninth and
tenth variables (RAD and TAX) have strong collinearity.
1 R2 = function(x, y) {
2 y.hat = lm(y ~ x)$fitted.values; y.bar = mean(y)
3 RSS = sum((y - y.hat) ^ 2); TSS = sum((y - y.bar) ^ 2)
4 return(1 - RSS / TSS)
5 }
6 vif = function(x) {
7 p = ncol(x); values = array(dim = p)
8 for (j in 1:p) values[j] = 1 / (1 - R2(x[, -j], x[, j]))
9 return(values)
10 }
11 library(MASS); x = as.matrix(Boston); vif(x)
In usual linear regression, if the VIF is large, the estimated coefficient β̂ will
be unstable. Particularly, if two columns are precisely the same, the coefficient is
unsolvable. Moreover, for Lasso, if two columns are highly related, then generally
one of them will be estimated as 0 and the other as nonzero. However, in the Ridge
case, for λ > 0, even when the columns j, k of X are the same, the estimation is
solvable, and both of them will obtain the same value.
In particular, we find the partial derivative of
1
N p p
L= (yi − xi, j β j )2 + λ β 2j
N i=1 j=1 j=1
1 1
N p N p
βk = xi,k (yi − xi, j β j ) = xi,l (yi − xi, j β j ) = βl .
λN i=1 j=1
λN i=1 j=1
Example 7 We perform Lasso for the case where the variables X 1 , X 2 , X 3 and
X 4 , X 5 , X 6 have strong correlations. We generate N = 500 groups of data distributed
in the following way:
z 1 , z 2 , , 1 , . . . , 6 ∼ N (0, 1)
x j := z 1 + j /5, j = 1, 2, 3
x j := z 2 + j /5, j = 4, 5, 6
y := 3z 1 − 1.5z 2 + 2.
Up until now, we have discussed the pros and cons of Lasso and Ridge. This section
studies a method intended to combine the advantages of the two, i.e., the elastic net.
Specifically, the method considers the problem of finding β that minimizes
1.6 Elastic Net 21
0 2 5 5 6
1.5
X1
X2
X3
X4
1.0
X5
Coefficient X6
0.5
0.0
-0.5
0 1 2 3 4
L1 Norm
Fig. 1.9 The execution result of Example 7. The Lasso case is different from that of Ridge: related
variables are not given similar estimated coefficients. When pwe use the glmnet package, its
default horizontal axis is the L1-norm [11]. This is the value j=1 |β j |, which becomes smaller as
λ increases. Therefore, the figure here is left-/right-reversed compared to that where λ or log λ is
set as the horizontal axis.
1 1−α
L := y − X β2 + λ
2
β2 + αβ1 .
2
(1.14)
2N 2
N
Sλ ( N1 i=1 ri, j x i, j )
The β j that minimizes (1.14) for the case of Lasso (α=1) is β̂ j = 1 N 2
,
i=1 x i, j
1 N
N
i=1 ri, j x i, j
while for the case of Ridge (α = 0), it is β̂ j = N
1 N
. In general cases, it
i=1 x i, j + λ
2
N
is N
Sλα ( 1 i=1 ri, j x i, j )
β̂ j = 1 N N . (1.15)
i=1 x i, j + λ(1 − α)
2
N
This is called the elastic net. For each j = 1, . . . , p, if we find the subderivative of
(1.14) with respect to β j , we obtain
22 1 Linear Regression
⎧
p ⎨ 1, βj > 0
1
N
0∈− xi, j (yi − xi,k βk ) + λ(1 − α)β j + λα [−1, 1], β j = 0
N ⎩
i=1 k=1 −1, βj < 0
⎧ ⎫ ⎧
⎨1 ⎬ ⎨ 1, βj > 0
1
N N
⇐⇒ 0 ∈ − xi, j ri, j + xi,2 j + λ(1 − α) β j + λα [−1, 1], β j = 0
N ⎩N ⎭ ⎩
i=1 i=1 −1, βj < 0
⎧ ⎫ ⎛ ⎞
⎨1 N ⎬ 1
N
⇐⇒ xi,2 j + λ(1 − α) β = Sλα ⎝ xi, j ri, j ⎠ .
⎩N ⎭ N
i=1 i=1
1 1 2
N N
Here, let s j := xi, j ri, j , t j := x + λ(1 − α), and μ := λα. We have
N i=1 N i=1 i, j
used the fact that
⎧
⎨ 1, βj > 0
0 ∈ −s j + t j β j + μ [−1, 1], β j = 0 ⇐⇒ t j β j = Sμ (s j ) ,
⎩
−1, βj < 0
α = 0.75 α = 0.5
1.0 0 3 5 5 5 0 3 6 6 6
1.0
0.5
0.5
Coefficient
Coefficient
0.0
0.0
-0.5
-0.5
-1.0
-1.0
0 1 2 3 4 0 1 2 3 4
L1 Norm L1 Norm
α = 0.25 α=0
0 6 6 6 6 6 6 6 6 6
1.5
1.0
1.0
0.5
Coefficient
Coefficient
0.5
0.0
0.0
-0.5
-0.5
-1.0
0 1 2 3 4 0 1 2 3 4
L1 Norm L1 Norm
Fig. 1.10 The execution result of Example 8. The closer α is to 0 (the closer the model is to Ridge),
the more it is able to handle collinearity, which is in contrast to the case of α = 1 (Lasso), where
the related variable coefficients are not estimated equally
To perform Lasso, the CRAN package glmnet is often used. Until now, we have
tried to explain the theory behind calculations from scratch; however, it is also a good
idea to use the precompiled function from said package.
To set an appropriate value for λ, the method of cross-validation (CV) is often
used.3
For example, the tenfold CV for each λ divides the data into ten groups, with
nine of them used to estimate β and one used as the test data, and then evaluates the
model. Switching the test group, we can perform this evaluation ten times in total
and then calculate the mean of these evaluation values. Then, we choose the λ that
has the highest mean evaluation value. If we plug the sample data of the target and
the explanatory variables into the function cv.glmnet, it evaluates each value of
λ and returns the λ of the highest evaluation value as an output.
Example 9 We apply the function cv.glmnet to the U.S. crime dataset from
Examples 2 and 5, obtain the optimal λ, use it for the usual Lasso, and then obtain
the coefficients of β. For each λ, the function also provides the value of the least
squares of the test data and the confidence interval (Fig. 1.11). Each number above
the figure represents the number of nonzero coefficients for that λ.
1 library(glmnet)
2 df = read.table("crime.txt"); X = as.matrix(df[, 3:7]); y = as.vector(df[,
1])
3 cv.fit = cv.glmnet(X, y); plot(cv.fit)
4 lambda.min = cv.fit$lambda.min; lambda.min
1 [1] 20.03869
For the elastic net, we have to perform double loop cross-validation for both
(α, λ). The function cv.glmnet provides us the output cvm, which contains the
evaluation values from the cross-validation.
3Please refer to Chap. 3 of “100 Problems in Mathematics of Statistical Machine Learning with R”
of the same series.
1.7 About How to Set the Value of λ 25
55555555443333332111
0 1 2 3 4 5
log λ
Fig. 1.11 The execution result of Example 9. Using cv.glmnet, we obtain the evaluation value
(the sum of the squared residuals from the test data) for each λ. This is represented by the red
dots in the figure. Moreover, the extended lines above and below are confidence intervals for the
actual values. The optimal value here is approximately log λmin = 3 (λmin = 20). Each number
5, . . . , 5, 4, 4, 3, . . . , 3, 2, 1, 1, 1 at the top represents the number of nonzero coefficients
Example 10 We generate random numbers as our data and try conducting the double
loop cross-validation for (α, λ).
1 n = 500; x = array(dim = c(n, 6)); z = array(dim = c(n, 2))
2 for (i in 1:2) z[, i] = rnorm(n)
3 y = 3 * z[, 1] - 1.5 * z[, 2] + 2 * rnorm(n)
4 for (j in 1:3) x[, j] = z[, 1] + rnorm(n) / 5
5 for (j in 4:6) x[, j] = z[, 2] + rnorm(n) / 5
6 best.score = Inf
7 for (alpha in seq(0, 1, 0.01)) {
8 res = cv.glmnet(x, y, alpha = alpha)
9 lambda = res$lambda.min; min.cvm = min(res$cvm)
10 if (min.cvm < best.score) {alpha.min = alpha; lambda.min = lambda; best.
score = min.cvm}
11 }
12 alpha.min
1 [1] 0.47
1 lambda.min
1 [1] 0.05042894
1.0
Here, we use the
cross-validated optimal α as
the parameter
0.5
Coefficient
0.0
-1.5 -1.0 -0.5
0 1 2 3 4 5
L1 Norm
We plot the graph of the coefficient values for when α is optimal (in the cross-
validation sense) in Fig. 1.12.
Exercises 1–20
⎢ (yi − βk xi,k ) ⎥
2
⎡ ⎤ ⎢ y1 − βk x1,k ⎥
⎢ ∂β1 i=1 ⎥ x1,1 · · · x N ,1 ⎢ ⎥
⎢ k=1 ⎥ ⎢ k=1 ⎥
⎢ .. ⎥ ⎢ .. . . .. ⎥ ⎢ .. ⎥
⎢ . ⎥ = −2 ⎣ . . . ⎦ ⎢ . ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ x1, p · · · x N , p ⎣⎢ ⎥
⎣ ∂
N p p
⎦ ⎦
(yi − βk xi,k )2 yN − βk x N ,k
∂β p i=1 k=1 k=1
1, x = 0
(b) Show that the function f (x) = is not convex.
0, x = 0
(c) Decide whether the following functions of β ∈ R p are convex or not. In addi-
tion, provide the proofs.
1
i. y − X β2 + λβ0
2N
1
ii. y − X β2 + λβ1
2N
1
iii. y − X β2 + λβ22
N
Here, · 0 denotes the number of nonzero elements, · 1 denotes the sum of
the absolute values of all elements, and · 2 denotes the square root of the sum
of the square of each element. Moreover, let λ ≥ 0.
4. For a convex function f : R → R, assume that it is differentiable at x = x0 .
Answer the following questions.
However, this case still leaves two possibilities: either { f (x0 )} or an empty set.
5. Find the local minimum of the following functions. In addition, write the graphs
of the range −2 ≤ x ≤ 2 using the R language.
1
N p p
L := (yi − β j xi, j )2 + λ |β j |.
2N i=1 j=1 j=1
N
and let s j = N1 i=1 xi, j yi . For λ > 0, holding βk (k = j) constant, find the sub-
derivative of L with respect to β j . In addition, find the value of β j that attains the
minimum.
Hint Consider each of the following cases: β j > 0, β j = 0, β j < 0. For exam-
ple, when β j = 0, we have −β j + s j − λ[−1, 1] 0, but this is equivalent to
−λ ≤ s j ≤ λ.
7. (a) Let λ ≥ 0; define a function Sλ (x) as follows:
⎧
⎨ x − λ, x > λ
Sλ (x) := 0, |x| ≤ λ
⎩
x + λ, x < −λ.
For this function Sλ (x), by using the function (x)+ = max{x, 0} and
⎧
⎨ −1, x < 0
sign(x) = 0, x = 0
⎩
1, x > 0
p
β0 := ȳ − β j x̄ j .
j=1
1 1
n n
Here, we denote x̄ j = xi, j and ȳ j = yi . Complete the blanks in the
N i=1 N i=1
following and execute the program. In addition, by letting λ = 10, 50, 100, check
how the coefficients that once were 0 change accordingly.
To execute the following codes, please access the site https://web.stanford.edu/
~hastie/StatLearnSparsity/data.html and download the U.S. crime data.4 Then,
store them in a file named crime.txt.5
1 linear.lasso = function(X, y, lambda = 0) {
2 X = as.matrix(X); n = nrow(X); p = ncol(X)
3 res = centralize(X, y); X = res$X; y = res$y
4 eps = 1; beta = rep(0, p); beta.old = rep(0, p)
5 while (eps > 0.001) {
6 for (j in 1:p) {
7 r = ## blank (1) ##
8 beta[j] = soft.th(lambda, sum(r * X[, j]) / n)
9 }
10 eps = max(abs(beta - beta.old)); beta.old = beta
11 }
12 beta = beta / res$X.sd
13 beta.0 = ## blank (2) ##
14 return(list(beta = beta, beta.0 = beta.0))
15 }
16 crime = read.table("crime.txt"); X = crime[, 3:7]; y = crime[, 1]
17 linear.lasso(X, y, 10); linear.lasso(X, y, 50); linear.lasso(X, y, 100)
1 library(glmnet)
2 df = read.table("crime.txt"); x = as.matrix(df[, 3:7]); y = df[, 1]
3 fit = glmnet(x, y); plot(fit, xvar = "lambda")
4 Open the window, press Ctrl+A to select all, press Ctrl+C to copy, make a new file of your text
editor, and then press Ctrl+V to paste the data.
5 To execute the R program, you may need to set the working directory or provide the path to where
Following the above, conduct an analysis similar to that of the Boston dataset6
by setting the 14th column as the target variable and the other 13 as explanatory and
then write a graph in the same manner.
10. The coordinate descent starts from a large λ so that all coefficients are initially
0 and then gradually decreases the size of λ. In addition, for each iteration,
consider using the coefficients estimated with the last λ as the initial values for
the next λ iteration (warm start). We consider the following program and apply
it to U.S. crime data. Fill in the blank.
1 warm.start = function(X, y, lambda.max = 100) {
2 dec = round(lambda.max / 50); lambda.seq = seq(lambda.max, 1, -dec)
3 r = length(lambda.seq); p = ncol(X); coef.seq = matrix(nrow = r, ncol =
p)
4 coef.seq[1, ] = linear.lasso(X, y, lambda.seq[1])$beta
5 for (k in 2:r) coef.seq[k, ] = ## blank ##
6 return(coef.seq)
7 }
8 crime = read.table("crime.txt"); X = crime[, 3:7]; y = crime[, 1]
9 coef.seq = warm.start(X, y, 200)
10 p = ncol(X)
11 lambda.max = 200
12 dec = round(lambda.max / 50)
13 lambda.seq = seq(lambda.max, 1, -dec)
14 plot(log(lambda.seq), coef.seq[, 1], xlab = "log(lambda)", ylab = "
Coefficient",
15 ylim = c(min(coef.seq), max(coef.seq)), type = "n")
16 for (j in 1:p) lines(log(lambda.seq), coef.seq[, j], col = j)
1 λ 2
N p
1 λ
y − X β22 + β22 := (yi − β1 xi,1 − · · · − β p xi, p )2 + β j (cf. (1.8))
2N 2 2N 2
i=1 j=1
(Ridge regression).
13. Adding to the function linear of Exercise 1 the additional parameter λ ≥ 0,
write an R program ridge that performs the process of Exercise 1.7. Execute
that program, and check whether your results match the following examples.
1 crime = read.table("crime.txt")
2 X = crime[, 3:7]
3 y = crime[, 1]
4 linear(X, y)
1 $beta
2 [1] 10.9806703 -6.0885294 5.4803042 0.3770443 5.5004712
3 $beta.0
4 [1] 489.6486
1 $beta
2 [1] 10.9806703 -6.0885294 5.4803042 0.3770443 5.5004712
3 $beta.0
4 [1] 489.6486
Exercises 1–20 33
1 ridge(X, y, 200)
1 $beta
2 [1] 0.056351799 -0.019763974 0.077863094 -0.017121797 -0.007039304
3 $beta.0
4 [1] 716.4044
14. The code below performs an analysis of the U.S. crime data in the following
order: while changing the value of λ, it solves for coefficients using the function
ridge and plots a graph of how each coefficient changes. In the program, change
lambda.seq to log(lambda.seq), and change the horizontal axis label
from lambda to log(lambda). Then, plot the graph (and change the title (the
main part)).
1 df = read.table("crime.txt"); x = df[, 3:7]; y = df[, 1]; p = ncol(x)
2 lambda.max = 3000; lambda.seq = seq(1, lambda.max)
3 plot(lambda.seq, xlim = c(0, lambda.max), ylim = c(-12, 12),
4 xlab = "lambda", ylab = "coefficients", main = "changes in
coefficients according to lambda",
5 type = "n", col = "red") ## this 1 line
6 for (j in 1:p) {
7 coef.seq = NULL
8 for (lambda in lambda.seq) coef.seq = c(coef.seq, ridge(x, y, lambda)$
beta[j])
9 par(new = TRUE)
10 lines(lambda.seq, coef.seq, col = j) ## this 1 line
11 }
12 legend("topright",
13 legend = c("annual police funding", "\% of people 25 years+ with 4
yrs. of high school",
14 "\% of 16--19 year-olds not in highschool and not
highschool graduates",
15 "\% of people 25 years+ with at least 4 years of
college"),
16 col = 1:p, lwd = 2, cex = .8)
N
15. For given xi,1 , xi,2 , yi ∈ R (i = 1, . . . , N ), the β1 , β2 that minimize S := i=1
(yi − β1 xi,1 − β2 xi,2 )2 is denoted by β̂1 , β̂2 , and β̂1 xi,1 + β̂2 xi,2 is denoted by
ŷi , (i = 1, . . . , N ).
(a) Prove that the following two equalities hold:
N
N
i. xi,1 (yi − ŷi ) = xi,2 (yi − ŷi ) = 0.
i=1 i=1
ii. For any β1 , β2 ,
N
In addition, for any β1 , β2 , show that the quantity (yi − β1 xi,1 − β2 xi,2 )2
i=1
can be rewritten as
N
N
N
(β1 − β̂1 )2 2
xi,1 + 2(β1 − β̂1 )(β2 − β̂2 ) xi,1 xi,2 + (β2 − β̂2 )2 2
xi,2
i=1 i=1 i=1
N
+ (yi − ŷi )2 .(cf. (1.13))
i=1
N N 2 N
(b) Assume that i=1 xi,1 2
= i=1 xi,2 = 1, i=1 x i,1 x i,2 = 0. In the case of
usual least squares, we choose β1 = β̂1 , β2 = β̂2 . However, when there
is a constraint requiring |β1 | + |β2 | to be less than or equal to a value,
we have to think of the problem as choosing a point (β1 , β2 ) from a cir-
cle (of as small a radius as possible) centered at (β̂1 , β̂2 ), which has to
lie within the square of the constraint. Fix the square whose vertices are
(1, 0), (0, 1), (−1, 0), (0, −1), and fix a point (β̂1 , β̂2 ) outside the square.
Then, consider the circle centered at (β̂1 , β̂2 ). We expand the circle (increase
its radius) until it touches the square. Now, draw the range of (β̂1 , β̂2 ) such
that when the circle touches the square, one of the coordinates of the point
of contact will be 0.
(c) In (b), what would happen if the square is replaced by a circle of radius 1
(circle touches circle)?
16. A matrix X is composed of p explanatory variables with N samples each. If
it has two columns that are equal for all N samples, let λ > 0, perform Ridge
regression, and show that the estimated coefficients of the two vectors will be
equal.
Hint Note that
1
N p p
L= (yi − β0 − xi, j β j )2 + λ β 2j .
N i=1 j=1 j=1
1
N
Partially differentiating it with respect to βk , βl gives − xi,k (yi − β0 −
N i=1
p
1
N p
xi, j β j ) + λβk and − xi,l (yi − β0 − xi, j β j ) + λβl . Both of them
j=1
N i=1 j=1
have to be 0. Plug xi,k = xi,l into each of them.
17. We perform linear regression analysis with Lasso for the case where Y is the tar-
get variable and the variables X 1 , X 2 , X 3 , and X 4 , X 5 , X 6 are highly correlated.
We generate N = 500 groups of data, distributed as in the relations below, and
then apply linear regression analysis with Lasso to X ∈ R N × p , y ∈ R N .
Exercises 1–20 35
z 1 , z 2 , , 1 , . . . , 6 ∼ N (0, 1)
x j := z 1 + j /5, j = 1, 2, 3
x j := z 2 + j /5, j = 4, 5, 6
y := 3z 1 − 1.5z 2 + 2.
Fill in the blanks (1), (2) below. Plot a graph showing how each coefficient
changes with λ.
1 n = 500; x = array(dim = c(n, 6)); z = array(dim = c(n, 2))
2 for (i in 1:2) z[, i] = rnorm(n)
3 y = ## blank (1) ##
4 for (j in 1:3) x[, j] = z[, 1] + rnorm(n) / 5
5 for (j in 4:6) x[, j] = z[, 2] + rnorm(n) / 5
6 glm.fit = glmnet(## blank (2) ##); plot(glm.fit)
18. Instead of the usual Lasso or Ridge, we find the β0 , β values that minimize
1 1−α
y − β0 − X β22 + λ β22 + αβ1 (cf. (1.14)) (1.17)
2N 2
(we mean-center the X, y data, initially let β0 = 0, and restore its value later).
N
Sλ ( N1 i=1 ri, j xi, j )
The β j that minimizes (1.14) is given by β̂ j = 1 N 2
in the case
i=1 x i, j
1 N
N
ri, j xi, j
of Lasso (α = 1), by β̂ j = 1 Ni=1 2
N
in the case of Ridge (α = 0), or, in
N i=1 x i, j + λ
general cases, by
N
Sλα ( 1 i=1 ri, j x i, j )
β̂ j = 1 N N (cf. (1.15)) (1.18)
i=1 x i, j + λ(1 − α)
2
N
(the elastic net cases). By finding the subderivative of (1.17) with respect to β j
and sequentially proving the three equations below, show that (1.18) holds. Here,
x j denotes the j-th column of X .
⎧
⎨ 1, βj > 0
1
i. 0 ∈ − X T (y − X β) + λ(1 − α)β j + λα [−1, 1], β j = 0
N ⎩
−1, ⎧ β j < 0;
1
N
1 2
N ⎨ 1, βj > 0
ii. 0 ∈ − xi, j ri, j + xi, j + λ(1 − α) β j + λα [−1, 1], β j = 0
N N ⎩
i=1 i=1 −1, β j < 0;
1 2 1
N N
iii. x + λ(1 − α) β = Sλα xi, j ri, j .
N i=1 i, j N i=1
In addition, revise the function linear.lasso in Exercise 8, let its default
parameter be alpha = 1, and generalize it by changing the formula to (1.18).
36 1 Linear Regression
N
Moreover, correct the value of 1 − α by dividing it by i=1 (yi − ȳ) /N (this
2
1
y − X β2 + λβ1
2N
1
L 0 := (y − β0 − X β)T W (y − β0 − X β) .
2N
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 37
J. Suzuki, Sparse Estimation with Math and R,
https://doi.org/10.1007/978-981-16-1446-0_2
38 2 Generalized Linear Regression
⎧ N N
⎪
⎪ j=1 wi, j x j,k
⎪ i=1
⎪ X̄ k := N N , k = 1, . . . , p
⎨
i=1 j=1 wi, j
N N
⎪
⎪ j=1 wi, j y j
⎪
⎪ i=1
⎩ ȳ := N N
i=1 j=1 wi, j
xi,k ← xi,k − X̄ k , i = 1, . . . , N , k = 1, . . . , p
yi ← yi − ȳ, i = 1, . . . , N
N N
for i = 1, . . . , N , considering the weights W = (wi, j ), we have wi, j y j =
N N i=1 j=1
0 and i=1 j=1 wi, j x i,k = 0, which means that the solution
p
β̂0 = ȳ − X̄ k β̂k (2.1)
k=1
∂ L0 1
= − W (y − β0 − X β)
∂β0 N
⎡ N N N p ⎤ ⎡ ⎤
j=1 w1, j y j − β0 j=1 w1, j − j=1 w1, j k=1 x j,k βk 0
1 ⎢ .. ⎥ ⎢.⎥
=− ⎢ ⎥=⎣.⎦ .
N ⎣ N
.
N p
⎦ .
N
w y
j=1 N , j j − β0 w
j=1 N , j − w
j=1 N , j x
k=1 j,k k β 0
1
L(β) := (y − X β)T W (y − X β) + λβ1 .
2N
1
L(β) = u − V β22 + λβ1 .
2N
Let Y be a random variable that takes values in {0, 1}. We assume that for each
x ∈ R p (row vector), there exist β0 ∈ R and β = [β1 , . . . , β p ] ∈ R p such that the
probability P(Y = 1 | x) satisfies
P(Y = 1 | x)
log = β0 + xβ (2.2)
P(Y = 0 | x)
exp(β0 + xβ)
P(Y = 1 | x) = . (2.3)
1 + exp(β0 + xβ)
1.0
0
when β > 0 changes 0.2
0.5
1
0.8
2
10
P (Y = 1|x)
0.6
0.4
0.2
0.0
-10 -5 0 5 10
x
11 }
12 legend("topleft", legend = beta.seq, col = 1:length(beta.seq), lwd =
2, cex = .8)
13 par(new = FALSE)
1
P(Y = y | x) = . (2.4)
1 + exp{−y(β0 + xβ)}
1
N
L := log(1 + exp{−yi (β0 + xi β)}) . (2.5)
N i=1
∂L
Then, the vector ∇ L ∈ R p+1 whose j-th element is , j = 0, 1, . . . , p) can be
∂β j
1
N
1 ∂L vi
expressed by ∇ L = − X T u. In fact, we observe =− xi, j yi (j =
N ∂β j N i=1 1 + vi
0, 1, . . . , p). If we let
⎡v1 ⎤
··· 0
⎢ (1 + v1 )2 ⎥
⎢ .. .. .. ⎥
W =⎢
⎢ . . .
⎥
⎥ symmetric matrix
⎣ vN ⎦
0 ···
(1 + v N )2
∂2 L
then the matrix ∇ 2 L such that the ( j, k)-th element is , j, k = 0, 1, . . . , p can
∂β j βk
1 T
be expressed by ∇ 2 L = X W X . In fact, we have
N
1 1
N N
∂2 L ∂vi 1 vi
=− xi, j yi =− xi, j yi (−yi xi,k )
∂β j ∂βk N i=1 ∂βk 1 + vi N i=1 (1 + vi )2
1
N
vi
= xi, j xi,k ( j, k = 0, 1, . . . , p) ,
N i=1 (1 + vi )2
where yi2 = 1.
Next, to obtain the coefficients of logistic regression that maximize the likelihood,
we give the initial value of (β0 , β) and repeatedly apply the updates of the Newton
method
β0 β0
← − {∇ 2 L(β0 , β)}−1 ∇ L(β0 , β) (2.6)
β β
∇L(β0 , β) = 0.
to obtain the solution
β0
If we let z = X + W −1 u, then we have
β
42 2 Generalized Linear Regression
β0 β0
− {∇ 2 L(β0 , β)}−1 ∇ L(β0 , β) = + (X T W X )−1 X T u
β β
β0
= (X T W X )−1 X T W (X + W −1 u) = (X T W X )−1 X T W z , (2.7)
β
Example 12 We execute the maximum likelihood solution following the above pro-
cedure.
1 ## Data Generation
2 N = 1000; p = 2; X = matrix(rnorm(N * p), ncol = p); X = cbind(rep(1,
N), X)
3 beta = rnorm(p + 1); y = array(N); s = as.vector(X %*% beta); prob = 1
/ (1 + exp(s))
4 for (i in 1:N) {if (runif(1) > prob[i]) y[i] = 1 else y[i] = -1}
5 beta
Repeating the cycle of Newton method, the tentative estimates approach to the
true value.
1 [1] -0.8248500 -0.3305656 -0.6027963
2 [1] -0.4401921 0.2054357 0.6739074
3 [1] -0.68982062 0.06228453 0.37459366
4 [1] -0.68894021 0.07094424 0.40108031
2.2 Logistic Regression for Binary Values 43
Although the maximum likelihood method can estimate the values of β0 , β, when
p is large relative to N , the absolute values of the estimates β̂0 , β̂ may go to infinity.
Worse, the larger the value of p, the more likely the procedure diverges. For example,
suppose that the rank of X is N and that N < p. Then, for any α1 , . . . , α N > 0,
there exists (β0 , β) ∈ R p+1 such that yi (β0 + xi β) = αi > 0 (i = 1, . . . , N ). If we
multiply β0 , β by two, the likelihood strictly increases. In general, the larger the
value of p is, the more often this phenomenon occurs.
This section considers applying Lasso to obtain a reasonable solution, although
it does not maximize the likelihood. The main idea is to regard the almost zero
coefficients as zeros to choose relevant variables when p is large. To this end, we
add a regularization term to (2.5):
1
N
L := log(1 + exp{−yi (β0 + xi β)}) + λβ1 (2.8)
N i=1
and extend the Newton method as follows. We note that (2.7) is the β that minimizes
1
(z − X β)T W (z − X β) . (2.9)
2N
In fact, we observe
T
1 β0 β0 β0
∇ z−X W z−X = X WX
T
− XT Wz .
2 β β β
1
(z − X β)T W (z − X β) + λβ1 , (2.10)
2N
√
where the M in Sect. 2.1 is a diagonal matrix with diagonal elements wi , i =
1, . . . , N . In other words, we need only to repeatedly execute the steps
1. obtain W, z from β0 , β and
2. obtain β0 , β that minimize (2.10) from W, z
(proxy Newton method). Using the function W.linear.lasso in Sect. 2.1, we
need to update only ## in the program of Example 12 as follows, where we assume
that the leftmost column of X consists of N ones and X contains p + 1 columns.
1 logistic.lasso = function(X, y, lambda) {
2 p = ncol(X)
3 beta = Inf; gamma = rnorm(p)
4 while (sum((beta - gamma) ^ 2) > 0.01) {
5 beta = gamma
6 s = as.vector(X %*% beta)
7 v = as.vector(exp(-s * y))
44 2 Generalized Linear Regression
8 u = y * v / (1 + v)
9 w = v / (1 + v) ^ 2
10 z = s + u / w
11 W = diag(w)
12 gamma = W.linear.lasso(X[, 2:p], z, W, lambda = lambda)
13 print(gamma)
14 }
15 return(gamma)
16 }
1 logistic.lasso(X, y, 0.1)
1 logistic.lasso(X, y, 0.2)
1 z
2 y -1 1
3 -1 70 3
4 1 7 20
Note that y and z are the correct answer and the predictive value, respectively. In
this example, we predict the correct answer with a precision of 90 %.
On the other hand, the function glmnet used for linear regression can be used
for logistic regression [11].
We evaluate the CV values for each λ and connect the plots (Fig. 2.2). The option
cv.glmnet (default) evaluates the CV based on the binary deviation
−2 log(1 + exp{−(β̂0 + xi β̂)}) − 2 log(1 + exp{β̂0 + xi β̂})
i:yi =1 i:yi =−1
[11], where β̂ is the estimate from the training data, (x1 , y1 ), . . . , (xm , ym ) are the test
data, and we evaluate the CV by switching the training and test data several times. If
we specify the option type.measure = "class", we evaluate the CV based
on the error probability.
Using glmnet, we obtain the set of relevant genes with nonzero coefficients for
the λ that gives the best performance. It seems that the appropriate log λ is between
−4 and −3. Setting λ = 0.03, we execute the following procedure, in which we
have used beta = drop(glm$beta) rather than the matrix expression beta
= glm$beta because the former is easier to realize beta[beta != 0].
1 glm = glmnet(x, y, lambda = 0.03, family = "binomial")
2 beta = drop(glm$beta); beta[beta != 0]
46 2 Generalized Linear Regression
75 77 69 54 41 33 35 29 10 3 1 76 77 71 60 46 35 32 32 25 9 3 1
0.25
0.5 0.6 0.7 0.8 0.9 1.0 1.1
Misclassification Error
Binomial Deviance
0.20
0.15
0.10
-6 -5 -4 -3 -2 -6 -5 -4 -3 -2
log λ log λ
Fig. 2.2 The execution result of Example 15. We evaluated the dataset Breast Cancer for each λ; the
left and right figures show the CV evaluations based on the binary deviation and error probability.
glmnet shows the confidence interval by the upper and lower bounds in the graph [11]. We observe
that log λ is the best between −4 and −3. Because we evaluate the CV based on the samples, we
show the range of the best λ
When the covariates take K ≥ 2 values rather than two, the probability associated
with the logistic curves is generalized as
(k)
eβ0,k +xβ
P(Y = k | x) = K (k = 1, . . . , K ) . (2.11)
β0,l +xβ (l)
l=1 e
1
N K
exp{β0,h + xi β (h) }
L := − I (yi = h) log K .
(l)
l=1 exp{β0,l + x i β }
N i=1 h=1
1
N
∂L
=− xi, j {I (yi = k) − πk,i } ,
∂β j,k N i=1
where
2.3 Logistic Regression for Multiple Values 47
exp(β0,k + xi β (k) )
πk,i := K .
(l)
l=1 exp(β0,l + x i β )
1
N
∂2 L
= xi, j xi, j wi,k,k ,
∂β j,k β j ,k N i=1
Proposition 4 L is convex.
p
K
p
K
∂2 L
γ j,k γj k
j=1 k=1 j =1 k =1
∂β j,k ∂β j ,k
p
K p
K
1
N
= γ j,k xi, j wi,k,k xi, j γ j k
j=1 k=1 j =1 k =1
N i=1
1
N K K p p
= ( xi, j γ j,k )wi,k,k ( xi, j γ j ,k ) ≥ 0 ,
N i=1 k=1 k =1 j=1 j =1
and L is convex.
From a discussion similar to that in the previous section, we obtain a procedure that minimizes
$$L:=-\frac{1}{N}\sum_{i=1}^{N}\sum_{h=1}^{K}I(y_i=h)\log\frac{\exp\{\beta_{0,h}+x_i\beta^{(h)}\}}{\sum_{l=1}^{K}\exp\{\beta_{0,l}+x_i\beta^{(l)}\}}+\lambda\sum_{k=1}^{K}\|\beta_k\|_1\,, \qquad(2.13)$$
where
$$\lambda\sum_{k=1}^{K}\|\beta_k\|_1=\lambda\sum_{k=1}^{K}\sum_{j=1}^{p}|\beta_{j,k}|\,.$$
However, the values of β0,k are not unique. In the glmnet package, to maintain uniqueness, the values of β0,k, k = 1, . . . , K, are determined such that $\sum_{k=1}^{K}\beta_{0,k}=0$ [11]. When we obtain β0,1, . . . , β0,K, we subtract their arithmetic mean from each of them.
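In code, this normalization amounts to a single centering step; a one-line sketch (assuming beta.0 is the length-K vector of intercepts) is:

beta.0 = beta.0 - mean(beta.0)   # enforces sum(beta.0) = 0 without changing the probabilities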
Example 17 (Fisher’s Iris) For Example 16, we obtained the optimum λ via the
two CV procedures using the R package glmnet [11], which we show in Fig. 2.3.
1 library(glmnet)
2 df = iris
3 x = as.matrix(df[, 1:4]); y = as.vector(df[, 5])
4 n = length(y); u = array(dim = n)
5 for (i in 1:n) if (y[i] == "setosa") u[i] = 1 else
6 if (y[i] == "versicolor") u[i] = 2 else u[i] = 3
7 u = as.numeric(u)
8 cv = cv.glmnet(x, u, family = "multinomial")
9 cv2 = cv.glmnet(x, u, family = "multinomial", type.measure = "class")
10 par(mfrow = c(1, 2)); plot(cv); plot(cv2); par(mfrow = c(1, 1))
11 lambda = cv$lambda.min; result = glmnet(x, y, lambda = lambda, family
= "multinomial")
12 beta = result$beta; beta.0 = result$a0
13 v = rep(0, n)
14 for (i in 1:n) {
15 max.value = -Inf
16 for (j in 1:3) {
17 value = beta.0[j] + sum(beta[[j]] * x[i, ])
18 if (value > max.value) {v[i] = j; max.value = value}
19 }
20 }
21 table(u, v)
Fig. 2.3 We draw the scores obtained via glmnet for each λ, in which the left and right scores are the multinomial deviance and the error probability, respectively. We may choose the λ that minimizes the score. However, if the sample size is small, some blurring occurs; thus, the upper and lower bounds that specify the confidence interval are displayed. The figures 0, 1, 2, 3 at the top indicate how many variables are chosen as nonzero for each λ
Suppose that there exists μ > 0 such that the distribution of a random variable Y that ranges over the nonnegative integers is
$$P(Y=k)=\frac{\mu^k}{k!}e^{-\mu}\quad(k=0,1,2,\dots) \qquad(2.14)$$
(Poisson distribution). One can check that μ coincides with the mean value of Y. Hereafter, we assume that μ depends on x ∈ R^p, so that the likelihood of the N observations can be expressed by
$$\prod_{i=1}^{N}\frac{\mu_i^{y_i}}{y_i!}e^{-\mu_i} \qquad(2.15)$$
with μi := μ(xi ) = eβ0 +xi β . We consider applying Lasso when we estimate the
parameters β0 , β: minimize the negated log-likelihood
$$L(\beta_0,\beta):=-\frac{1}{N}\sum_{i=1}^{N}\{y_i(\beta_0+x_i\beta)-e^{\beta_0+x_i\beta}\}\,.$$
Similar to logistic regression, if we write $\nabla L=-\frac{1}{N}X^Tu$ and $\nabla^2L=\frac{1}{N}X^TWX$, we have
$$u=\begin{bmatrix}y_1-e^{\beta_0+x_1\beta}\\ \vdots\\ y_N-e^{\beta_0+x_N\beta}\end{bmatrix}$$
and
$$W=\begin{bmatrix}e^{\beta_0+x_1\beta}&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&e^{\beta_0+x_N\beta}\end{bmatrix}\,.$$
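The quantities u and W turn the Poisson Lasso into a sequence of weighted Lasso problems, exactly as for logistic.lasso. A minimal sketch of such a procedure (assuming W.linear.lasso from Sect. 2.1 and that the leftmost column of X consists of N ones) is:

poisson.lasso = function(X, y, lambda) {     # a sketch, not the book's listing
  p = ncol(X)
  beta = Inf; gamma = rnorm(p)
  while (sum((beta - gamma) ^ 2) > 0.01) {
    beta = gamma
    s = as.vector(X %*% beta)
    w = exp(s)                               # diagonal of W
    u = y - w                                # gradient part
    z = s + u / w                            # working response
    W = diag(w)
    gamma = W.linear.lasso(X[, 2:p], z, W, lambda = lambda)
  }
  return(gamma)
}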
Example 19 For the dataset birthwt (risk factors associated with infant birth weight) in the R package MASS, we execute Poisson regression. The meanings of the variables are listed in Table 2.1. Because the first variable (whether the birth weight is over 2.5 kg) overlaps with the last variable (the birth weight), we deleted it. We regarded the number of physician visits during the first trimester as the response and executed regression based on the other eight covariates (# samples: N = 189), where we chose as the λ value the optimum value based on the CV. As a result, we found that the mother's age (age), the mother's weight (lwt), and the mother's hypertension (ht) are the main factors.
1 library(glmnet)
2 library(MASS)
3 data(birthwt)
4 df = birthwt[, -1]
5 dy = df[, 8]
6 dx = data.matrix(df[, -8])
7 cvfit = cv.glmnet(x = dx, y = dy, family = "poisson", standardize =
TRUE)
8 coef(cvfit, s = "lambda.min")
In this section, we consider survival analysis. The problem setting is similar to the ones we considered thus far: finding the relation between covariates and the response from N tuples of the response and p covariate values. For survival analysis, we not only assume that the response Y (survival time) takes positive values y but also allow the response to take the form Y ≥ y. If an individual dies, the exact survival time is obtained. However, if the survey is terminated before the time of death (the individual changes hospitals, etc.), we regard the survival time as having exceeded the recorded time. To distinguish between the two cases, for the latter, we add the symbol +. The reason we consider the latter case is to utilize the sample: even if the survey is terminated before the time of death, some part of the data can still be used for the survival analysis.
We estimate the covariate coefficients based on past data and predict the survival
time for a future individual. If we apply Lasso to the problem, we can determine
which covariates explain survival.
Example 20 From the kidney dataset, we wish to estimate the kidney survival time. The meanings of the covariates are listed in Table 2.2. The symbol status = 0 expresses survey termination before death and means that the response takes a larger value than time. On the other hand, status = 1 expresses death and means that the response takes the same value as time. The rightmost four are the covariates.
1 library(survival)
2 data(kidney)
3 kidney
1 y = kidney$time
2 delta = kidney$status
3 Surv(y, delta)
As marked in the output, when the survey was terminated (so that the actual survival time is longer than the recorded value), the symbol + follows the time.
Fig. 2.4 The Kaplan–Meier survival curve (survival rate versus time) for each of the kidney diseases: Others, GN, AN, and PKD
or, equivalently, by
$$h(t)=-\frac{S'(t)}{S(t)}$$
(hazard function). We express the function h by the product of a baseline hazard $h_0(t)$ that does not depend on the covariates $x\in\mathbb{R}^p$ (row vector) and $\exp(x\beta)$:
$$h(t)=h_0(t)\exp(x\beta)\,.$$
Note that the variable time in Example 21 takes survival times before death
and survey termination. If we remove the times before survey termination, then the
number of samples decreases for the estimation. However, if we follow the estimation
procedure below, we can utilize the data before survey termination.
If there is no survey termination, then we estimate S(t) by
$$\hat S(t):=\frac{1}{N}\sum_{i=1}^{N}\delta(t_i>t) \qquad(2.17)$$
from the N death times $t_1\le t_2\le\dots\le t_N$, where δ(ti > t) is one for ti > t and zero otherwise. If $t_1<\dots<t_N$, the function is a step function and decreases by 1/N at each t = ti.
Next, we consider the general case in which survey terminations can occur. Let $d_i\ge 1$ ($i=1,\dots,k$) be the number of deaths at time $t_i$, and assume that $t_1<\dots<t_k$. Then, the total number of deaths is $D:=\sum_{i=1}^{k}d_i\le N$, and N − D survey terminations occur if we assume that the total number of samples is N. If we denote by $m_j$ the number of survey terminations in the interval $[t_j,t_{j+1})$, then the number $n_j$ of surviving individuals immediately before time $t_j$ can be expressed by
$$n_j:=\sum_{i=j}^{k}(d_i+m_i)\,.$$
We define the Kaplan-Meier estimate as
$$\hat S(t)=\begin{cases}1, & t<t_1\\[1ex] \displaystyle\prod_{i=1}^{l}\frac{n_i-d_i}{n_i}, & t\ge t_1\ (t_l\le t<t_{l+1})\,.\end{cases} \qquad(2.18)$$
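A direct implementation of (2.18) is straightforward; a minimal sketch (assuming observed times y and a status vector delta with 1 = death and 0 = survey termination, e.g., from the kidney data in Example 20) is:

km = function(y, delta) {
  t.death = sort(unique(y[delta == 1]))   # t_1 < ... < t_k
  surv = cumprod(sapply(t.death, function(t) {
    n.t = sum(y >= t)                     # n_j: individuals at risk just before t_j
    d.t = sum(y == t & delta == 1)        # d_j: deaths at t_j
    (n.t - d.t) / n.t
  }))
  list(time = t.death, surv = surv)       # hat S(t) = surv[l] for t_l <= t < t_{l+1}
}
km(kidney$time, kidney$status)            # compare with survfit(Surv(y, delta) ~ 1)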
We maximize the partial likelihood
$$\prod_{i:\delta_i=1}\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}} \qquad(2.19)$$
to estimate the parameter β, where $x_i\in\mathbb{R}^p$ and $y_i$ are the covariates and the time, respectively, and δi = 1 and δi = 0 express death and survey termination, respectively, for i = 1, . . . , N. We define the risk set $R_i$ to be the set of j such that $y_j\ge y_i$ and formulate the Lasso as follows [25]:
$$-\frac{1}{N}\sum_{i:\delta_i=1}\log\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}}+\lambda\|\beta\|_1\,. \qquad(2.20)$$
Let
$$L:=-\sum_{i:\delta_i=1}\log\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}}\,,$$
and let $\delta_i=1,\ j\in R_i\iff i\in C_j$.
$$\frac{\partial^2 L}{\partial\beta_k\partial\beta_l}=\sum_{i=1}^{N}\sum_{h=1}^{N}x_{i,k}x_{h,l}\sum_{j\in C_i}\frac{e^{x_i\beta}}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{I(i=h)\sum_{s\in R_j}e^{x_s\beta}-I(h\in R_j)e^{x_h\beta}\Big\}\,.$$
In particular, L is convex.
$$u_i:=\delta_i-\sum_{j\in C_i}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}$$
$$w_{i,h}:=\sum_{j\in C_i}\frac{e^{x_i\beta}}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{I(i=h)\sum_{s\in R_j}e^{x_s\beta}-I(h\in R_j)e^{x_h\beta}\Big\}\,,$$
$$\pi_{i,j}:=\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}\,.$$
Moreover, we can write $u_i=\delta_i-\sum_{j\in C_i}\pi_{i,j}$.
Based on the above discussion, we construct the function cox.lasso below. To avoid a complicated computation, we approximate the off-diagonal elements of W by zero. Because the objective function is convex, there is no problem with convergence.
1 cox.lasso = function(X, y, delta, lambda = lambda) {
2 delta[1] = 1
3 n = length(y)
4 w = array(dim = n); u = array(dim = n)
5 pi = array(dim = c(n, n))
6 p = ncol(X); beta = rnorm(p); gamma = rep(0, p)
7 while (sum((beta - gamma) ^ 2) > 10 ^ {-4}) {
8 beta = gamma
9 s = as.vector(X %*% beta)
10 v = exp(s)
11 for (i in 1:n) {for (j in 1:n) pi[i, j] = v[i] / sum(v[j:n])}
12 for (i in 1:n) {
13 u[i] = delta[i]
14 w[i] = 0
15 for (j in 1:i) if (delta[j] == 1) {
16 u[i] = u[i] - pi[i, j]
17 w[i] = w[i] + pi[i, j] * (1 - pi[i, j])
18 }
19 }
20 z = s + u / w; W = diag(w)
21 print(gamma)
22 gamma = W.linear.lasso(X, z, W, lambda = lambda)[-1]
23 }
24 return(gamma)
25 }
Example 22 We apply the dataset kidney to the function cox.lasso. The pro-
cedure converges for all values of λ, and the estimates coincide with the ones com-
puted via glmnet[11].
1 df = kidney
2 index = order(df$time)
3 df = df[index, ]
4 n = nrow(df); p = 4
5 y = as.numeric(df[[2]])
6 delta = as.numeric(df[[3]])
7 X = as.numeric(df[[4]])
8 for (j in 5:7) X = cbind(X, as.numeric(df[[j]]))
9 z = Surv(y, delta)
10 cox.lasso(X, y, delta, 0)
1 [1] 0 0 0 0
2 [1] 0.0101287 -1.7747758 -0.3887608 1.3532378
3 [1] 0.01462571 -1.69299527 -0.41598742 1.38980788
4 [1] 0.01591941 -1.66769665 -0.42331475 1.40330234
5 [1] 0.01628935 -1.66060178 -0.42528537 1.40862969
1 [1] 0 0 0 0
2 [1] 0.0000000 -0.5366227 0.0000000 0.7234343
3 [1] 0.0000000 -0.5142360 0.0000000 0.6890634
4 [1] 0.0000000 -0.5099687 0.0000000 0.6800883
The variables x, y, and delta store the expression data of p = 7,399 genes, the
times, and the death (= 1)/survey termination (= 0) for N = 240 samples. We exe-
cuted the following codes to obtain the λ that minimizes the CV. We found that only
28 coefficients of genes out of 7,399 are nonzero and concluded that the other genes
do not affect the survival time.
Moreover, for the β̂ in which only 28 elements are nonzero, we computed z = X β̂.
We decided that those variables are likely to determine the survival times and drew
the Kaplan-Meier curves for the samples with z i > 0 (1) and z i < 0 (-1) to compare
the survival times (Fig. 2.5).
1 library(ranger); library(ggplot2); library(dplyr); library(ggfortify)
2 cv.fit = cv.glmnet(x, Surv(y, delta), family = "cox")
3 fit2 = glmnet(x, Surv(y, delta), lambda = cv.fit$lambda.min, family =
"cox")
4 z = sign(drop(x %*% fit2$beta))
5 fit3 = survfit(Surv(y, delta) ~ z)
6 autoplot(fit3)
7 mean(y[z == 1])
8 mean(y[z == -1])
Appendix: Proof of Propositions
$$\frac{\partial L}{\partial\beta_{j,k}}=-\frac{1}{N}\sum_{i=1}^{N}x_{i,j}\{I(y_i=k)-\pi_{k,i}\}\,,$$
where
$$\pi_{k,i}:=\frac{\exp(\beta_{0,k}+x_i\beta^{(k)})}{\sum_{l=1}^{K}\exp(\beta_{0,l}+x_i\beta^{(l)})}\,.$$
Proof Differentiating
$$L:=-\frac{1}{N}\sum_{i=1}^{N}\sum_{h=1}^{K}I(y_i=h)\log\frac{\exp\{\beta_{0,h}+x_i\beta_h\}}{\sum_{l=1}^{K}\exp\{\beta_{0,l}+x_i\beta_l\}}
=-\frac{1}{N}\sum_{i=1}^{N}\sum_{h=1}^{K}I(y_i=h)\Big[(\beta_{0,h}+x_i\beta_h)-\log\sum_{l=1}^{K}\exp(\beta_{0,l}+x_i\beta_l)\Big]$$
by $\beta_{j,k}$ ($j=1,\dots,p$, $k=1,\dots,K$), we obtain
$$\frac{\partial L}{\partial\beta_{j,k}}=-\frac{1}{N}\sum_{i=1}^{N}x_{i,j}\sum_{h=1}^{K}I(y_i=h)\Big[I(h=k)-\frac{\exp\{\beta_{0,k}+x_i\beta_k\}}{\sum_{l=1}^{K}\exp\{\beta_{0,l}+x_i\beta_l\}}\Big]
=-\frac{1}{N}\sum_{i=1}^{N}x_{i,j}\Big[I(y_i=k)-\frac{\exp\{\beta_{0,k}+x_i\beta_k\}}{\sum_{l=1}^{K}\exp\{\beta_{0,l}+x_i\beta_l\}}\Big]
=-\frac{1}{N}\sum_{i=1}^{N}x_{i,j}\{I(y_i=k)-\pi_{k,i}\}\,.$$
The second derivative is
$$\frac{\partial^2 L}{\partial\beta_{j,k}\partial\beta_{j',k}}=\frac{1}{N}\sum_{i=1}^{N}x_{i,j}x_{i,j'}w_{i,k,k}=\frac{1}{N}\sum_{i=1}^{N}x_{i,j}x_{i,j'}\pi_{i,k}(1-\pi_{i,k})\,.$$
Proposition 3 (Gershgorin) Let $A=(a_{i,j})\in\mathbb{R}^{n\times n}$ be a symmetric matrix. If $a_{i,i}\ge\sum_{j\ne i}|a_{i,j}|$ for all $i=1,\dots,n$, then A is nonnegative definite.
Proof For any eigenvalue λ of the square matrix $A=(a_{j,k})\in\mathbb{R}^{n\times n}$, there exist an eigenvector $x=[x_1,\dots,x_n]^T$ and $1\le i\le n$ such that $x_i=1$ and $|x_j|\le 1$, $j\ne i$. Thus, from
$$a_{i,i}+\sum_{j\ne i}a_{i,j}x_j=\sum_{j=1}^{n}a_{i,j}x_j=\lambda x_i=\lambda\,,$$
we have
$$|\lambda-a_{i,i}|=\Big|\sum_{j\ne i}a_{i,j}x_j\Big|\le\sum_{j\ne i}|a_{i,j}|$$
and
$$a_{i,i}-\sum_{j\ne i}|a_{i,j}|\le\lambda\le a_{i,i}+\sum_{j\ne i}|a_{i,j}|$$
for at least one $1\le i\le n$, which means that if $a_{i,i}\ge\sum_{j\ne i}|a_{i,j}|$ for $i=1,\dots,n$, then all the eigenvalues are nonnegative, and the matrix A is nonnegative definite.
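A quick numerical illustration of Proposition 3 (not part of the proof; the matrix below is an arbitrary example) is:

set.seed(1)
B = matrix(runif(9, -1, 1), 3, 3)
A = (B + t(B)) / 2                           # symmetric matrix
diag(A) = rowSums(abs(A)) - abs(diag(A))     # set a_{i,i} = sum_{j != i} |a_{i,j}|
min(eigen(A)$values)                         # nonnegative (up to rounding error)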
$$(j-i+1)[-1,1]-(i-1)+(n-j)=[n-2j,\,n-2i+2]=[2m-2j+1,\,2m-2i+3]\supseteq[-1,1]$$
$$\frac{\partial^2 L}{\partial\beta_k\partial\beta_l}=\sum_{i=1}^{N}\sum_{h=1}^{N}x_{i,k}x_{h,l}\sum_{j\in C_i}\frac{e^{x_i\beta}}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{I(i=h)\sum_{s\in R_j}e^{x_s\beta}-I(h\in R_j)e^{x_h\beta}\Big\}\,.$$
In particular, L is convex.
Proof For $S_i=\sum_{h\in R_i}e^{x_h\beta}$, we have
$$\sum_{i:\delta_i=1}\sum_{j\in R_i}\frac{x_{j,k}e^{x_j\beta}}{S_i}=\sum_{j=1}^{N}\sum_{i\in C_j}\frac{x_{j,k}e^{x_j\beta}}{S_i}=\sum_{i=1}^{N}x_{i,k}\sum_{j\in C_i}\frac{e^{x_i\beta}}{S_j}\,.$$
Differentiating this by $\beta_l$ gives
$$\sum_{i=1}^{N}x_{i,k}\sum_{j\in C_i}\frac{1}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{x_{i,l}e^{x_i\beta}\sum_{s\in R_j}e^{x_s\beta}-e^{x_i\beta}\sum_{h\in R_j}x_{h,l}e^{x_h\beta}\Big\}
=\sum_{i=1}^{N}\sum_{h=1}^{N}x_{i,k}x_{h,l}\sum_{j\in C_i}\frac{e^{x_i\beta}}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{I(i=h)\sum_{s\in R_j}e^{x_s\beta}-I(h\in R_j)e^{x_h\beta}\Big\}\,.$$
Hence, we have
$$\frac{\partial^2 L}{\partial\beta_k\partial\beta_l}=\sum_{i=1}^{N}\sum_{h=1}^{N}x_{i,k}x_{h,l}w_{i,h}\,,$$
where
⎧
⎪ e xi β
⎪
⎪ { e xs β − I (i ∈ R j )e xh β }, h = i
⎪
⎨ ( e x r β ) 2
j∈Ci r ∈R j s∈R j
wi,h = xi β .
⎪
⎪ e xh β
⎪
⎪ − {I (h ∈ R j )e }, h =i
⎩ ( r ∈R e xr β )2
j∈Ci j
Moreover, from $i\in R_j$, such $W=(w_{i,h})$ satisfies $w_{i,i}=\sum_{h\ne i}|w_{i,h}|$ for each $i=1,\dots,N$. From Proposition 3, W is nonnegative definite. Thus, for arbitrary $[z_1,\dots,z_p]\in\mathbb{R}^p$, we have
$$\sum_{k=1}^{p}\sum_{l=1}^{p}\sum_{i=1}^{N}\sum_{h=1}^{N}z_kz_lx_{i,k}x_{h,l}w_{i,h}\ge 0\,,$$
Exercises 21–34
21. Let Y be a random variable that takes values in {0, 1}. We assume that there exist $\beta_0\in\mathbb{R}$ and $\beta\in\mathbb{R}^p$ such that the conditional probability P(Y = 1 | x) given for each $x\in\mathbb{R}^p$ can be expressed by
$$\log\frac{P(Y=1\mid x)}{P(Y=0\mid x)}=\beta_0+x\beta \qquad(\text{cf. }(2.2))\ (2.21)$$
(logistic regression).
(a) Show that (2.21) is equivalent to
$$P(Y=1\mid x)=\frac{\exp(\beta_0+x\beta)}{1+\exp(\beta_0+x\beta)}\,. \qquad(\text{cf. }(2.3))\ (2.22)$$
(b) Let p = 1 and β0 = 0. The following procedure draws the curves by com-
puting the right-hand side of (2.22) for various β ∈ R. Fill in the blanks and
execute the procedure for β = 0. How do the shapes evolve as β grows?
1 f = function(x) return(exp(beta.0 + beta * x) / (1 + exp(beta
.0 + beta * x)))
2 beta.0 = 0; beta.seq = c(0, 0.2, 0.5, 1, 2, 10)
3 m = length(beta.seq)
4 beta = beta.seq[1]
5 plot(f, xlim = c(-10, 10), ylim = c(0, 1), xlab = "x", ylab =
"y",
6 col = 1, main = "Logistic Curve")
7 for (i in 2:m) {
8 beta = beta.seq[i]
9 par(new = TRUE)
10 plot(f, xlim = c(-10, 10), ylim = c(0, 1), xlab = "", ylab =
"", axes = FALSE, col = i)
11 }
12 legend("topleft", legend = beta.seq, col = 1:length(beta.seq),
lwd = 2, cex = .8)
13 par(new = FALSE)
(c) For logistic regression (2.21), when the realizations of $x\in\mathbb{R}^p$ and $Y\in\{0,1\}$ are $(x_i,y_i)$ ($i=1,\dots,N$), the likelihood is given by $\prod_{i=1}^{N}\dfrac{e^{y_i(\beta_0+x_i\beta)}}{1+e^{\beta_0+x_i\beta}}$, and its Lasso evaluation is
$$-\frac{1}{N}\sum_{i=1}^{N}[y_i(\beta_0+x_i\beta)-\log(1+e^{\beta_0+x_i\beta})]+\lambda\|\beta\|_1\,. \qquad(2.23)$$
Show that if the response is instead coded as $Y\in\{-1,1\}$, the evaluation can be written as
$$\frac{1}{N}\sum_{i=1}^{N}\log(1+\exp\{-y_i(\beta_0+x_i\beta)\})+\lambda\|\beta\|_1\,. \qquad(\text{cf. }(2.8))\ (2.24)$$
Hereafter, we denote by X ∈ R N ×( p+1) the matrix such that the (i, j)-th element is
xi, j for j = 1, . . . , p and the leftmost column (the 0th column) is a vector consisting
of N ones, and let xi = [xi,1 , . . . , xi, p ]. We assume that the random variable Y takes
values in {−1, 1}.
22. For $L(\beta_0,\beta):=\sum_{i=1}^{N}\log\{1+\exp(-y_i(\beta_0+x_i\beta))\}$, show the following equations, where $v_i:=\exp\{-y_i(\beta_0+x_i\beta)\}$.
(a) The vector ∇L whose j-th element is $\dfrac{\partial L}{\partial\beta_j}$ ($j=0,1,\dots,p$) can be expressed by $\nabla L=-X^Tu$, where
$$u=\begin{bmatrix}\dfrac{y_1v_1}{1+v_1}\\ \vdots\\ \dfrac{y_Nv_N}{1+v_N}\end{bmatrix}\,.$$
(b) The matrix $\nabla^2L$ whose (j, k)-th element is $\dfrac{\partial^2 L}{\partial\beta_j\partial\beta_k}$ ($j,k=0,1,\dots,p$) can be expressed by $\nabla^2L=X^TWX$, where
$$W=\begin{bmatrix}\dfrac{v_1}{(1+v_1)^2}&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&\dfrac{v_N}{(1+v_N)^2}\end{bmatrix}\quad\text{(diagonal matrix)}\,.$$
23. Show that the $(\beta_0,\beta)$ that maximizes the likelihood diverges when λ = 0 in each of the following cases.
(a) N < p, and the rank of X is N.
Hint Using $\nabla L=-X^Tu$ from the previous exercise, if we multiply $-X^Tu=0$ by −X from the left, it becomes $XX^Tu=0$. Since $XX^T$ is invertible, the stationarity condition cannot hold unless u = 0. Moreover, u cannot be zero for finite $(\beta_0,\beta)$.
(b) There exist (β0 , β) such that yi (β0 + xi β) > 0 for all i = 1, . . . , N .
Hint If such (β0 , β) exist, (2β0 , 2β) will make L smaller.
24. The following procedure analyzes the dataset breastcancer, which consists
of 1,000 covariate variables (the expression data of 1,000 genes) and one binary
response (case/control) for 256 samples (58 cases and 192 control). Identify the
relevant genes that affect breast cancer.
1 df = read.csv("breastcancer.csv")
2 x = as.matrix(df[, 1:1000])
3 y = as.vector(df[, 1001])
4 cv = cv.glmnet(x, y, family = "binomial")
5 cv2 = cv.glmnet(x, y, family = "binomial", type.measure = "class")
6 par(mfrow = c(1, 2))
7 plot(cv)
8 plot(cv2)
9 par(mfrow = c(1, 1))
The function cv.glmnet evaluates the test data based on the binomial deviance
$$-\frac{1}{N}\sum_{i=1}^{N}\log P(Y=y_i\mid X=x_i)$$
if we do not specify anything, and based on the error rate if the option type.measure = "class" is given.
Fill in the blank below with the optimum λ based on the CV, run the code, and find the genes with nonzero coefficients.
1 glm = glmnet(x, y, lambda = ## Blank ##, family = "binomial")
by $\beta_{0,k}-\gamma_0+x(\beta^{(k)}-\gamma)$, but the second term changes. One can check that for $\gamma=(\gamma_1,\dots,\gamma_p)$, the $\gamma_j$ value that minimizes $\sum_{k=1}^{K}|\beta_{j,k}-\gamma_j|$ is the median of $\beta_{j,1},\dots,\beta_{j,K}$. How can we obtain the minimum Lasso evaluation from the original $\beta_{j,1},\dots,\beta_{j,K}$ ($j=1,\dots,p$)?
(c) In the glmnet package, to maintain the uniqueness of the $\beta_{0,k}$ values, they are set such that $\sum_{k=1}^{K}\beta_{0,k}=0$. After obtaining the original $\beta_{0,1},\dots,\beta_{0,K}$, what computation should be performed to obtain the unique $\beta_{0,1},\dots,\beta_{0,K}$?
26. Execute the following procedure for the Iris dataset with n = 150 and p = 4. Output the two graphs via cv.glmnet, and find the optimum λ in the sense of CV. Moreover, find the β0, β for that λ, and evaluate the fit on the 150 samples.
1 library(glmnet)
2 df = read.table("iris.txt", sep = ",")
3 x = as.matrix(df[, 1:4])
4 y = as.vector(df[, 5])
5 y = as.numeric(y == "Iris-setosa")
6 cv = cv.glmnet(x, y, family = "binomial")
7 cv2 = cv.glmnet(x, y, family = "binomial", type.measure = "class")
8 par(mfrow = c(1, 2))
9 plot(cv)
10 plot(cv2)
11 par(mfrow = c(1, 1))
12 lambda = cv$lambda.min
13 result = glmnet(x, y, lambda = lambda, family = "binomial")
14 beta = result$beta
15 beta.0 = result$a0
16 f = function(x) return(exp(beta.0 + x %*% beta))
17 z = array(dim = 150)
18 for (i in 1:150) z[i] = drop(f(x[i, ]))
19 yy = (z > 1)
20 sum(yy == y)
We evaluate the correct rate, not for K = 2 but for K = 3. Fill in the blanks and
execute it.
1 library(glmnet)
2 df = read.table("iris.txt", sep = ",")
3 x = as.matrix(df[, 1:4]); y = as.vector(df[, 5]); n = length(y); u
= array(dim = n)
4 for (i in 1:n) if (y[i] == "Iris-setosa") u[i] = 1 else
5 if (y[i] == "Iris-versicolor") u[i] = 2 else u[i] = 3
6 u = as.numeric(u)
7 cv = cv.glmnet(x, u, family = "multinomial")
8 cv2 = cv.glmnet(x, u, family = "multinomial", type.measure = "
class")
9 par(mfrow = c(1, 2)); plot(cv); plot(cv2); par(mfrow = c(1, 1))
10 lambda = cv$lambda.min; result = glmnet(x, y, lambda = lambda,
family = "multinomial")
11 beta = result$beta; beta.0 = result$a0
12 v = array(dim = n)
13 for (i in 1:n) {
14 max.value = -Inf
15 for (j in 1:3) {
16 value = ## Blank ##
17 if (value > max.value) {v[i] = j; max.value = value}
18 }
19 }
20 sum(u == v)
Hint Each beta and beta.0 is a list of length 3, and the former is a vector that
stores the coefficient values.
27. If the response takes K ≥ 2 values rather than two, the probability associated with the logistic curve is generalized as follows:
$$P(Y=k\mid x)=\frac{e^{\beta_{0,k}+x\beta^{(k)}}}{\sum_{l=1}^{K}e^{\beta_{0,l}+x\beta^{(l)}}}\quad(k=1,\dots,K)\,. \qquad(\text{cf. }(2.11))$$
$$L:=-\frac{1}{N}\sum_{i=1}^{N}\sum_{h=1}^{K}I(y_i=h)\log\frac{\exp\{\beta_{0,h}+x_i\beta^{(h)}\}}{\sum_{l=1}^{K}\exp\{\beta_{0,l}+x_i\beta^{(l)}\}}$$
$$P(Y=k)=\frac{\mu^k}{k!}e^{-\mu}\quad(k=0,1,2,\dots) \qquad(\text{cf. }(2.14))$$
can be expressed by
$$\prod_{i=1}^{N}\frac{\mu_i^{y_i}}{y_i!}e^{-\mu_i} \qquad(\text{cf. }(2.15))\ (2.26)$$
$$\frac{1}{N}L(\beta_0,\beta)+\lambda\|\beta\|_1 \qquad(\text{cf. }(2.16))\ (2.27)$$
with $L(\beta_0,\beta):=-\sum_{i=1}^{N}\{y_i(\beta_0+x_i\beta)-e^{\beta_0+x_i\beta}\}$ if we apply Lasso.
(a) How can we derive (2.27) from (2.26)?
(b) If we write ∇ L = −X T u, show that
$$u=\begin{bmatrix}y_1-e^{\beta_0+x_1\beta}\\ \vdots\\ y_N-e^{\beta_0+x_N\beta}\end{bmatrix}\,.$$
We execute Poisson regression for λ ≥ 0. Fill in the blanks and execute the procedure.
1 ## Data Generation
2 N = 1000
3 p = 7
4 beta = rnorm(p + 1)
5 X = matrix(rnorm(N * p), ncol = p)
6 X = cbind(rep(1, N), X)
7 s = X %*% beta
8 y = rpois(N, lambda = exp(s))
9 beta
10 ## Computation of the ML estimates
11 lambda = 100
12 beta = Inf
13 gamma = rnorm(p + 1)
14 while (sum((beta - gamma) ^ 2) > 0.01) {
15 beta = gamma
16 s = as.vector(X %*% beta)
17 w = ## Blank (1) ##
18 u = ## Blank (2) ##
19 z = ## Blank (3) ##
20 W = diag(w)
21 gamma = # Blank (4) #
22 print(gamma)
23 }
In the following, we consider survival analysis, particularly for the Cox model. Using the random variables T, C ≥ 0 that express the times of death and survey termination, we define Y := min{T, C}. For t ≥ 0, let S(t) be the probability of the event T > t, and define the hazard function by
$$h(t):=-\frac{S'(t)}{S(t)}$$
or, equivalently,
$$h(t):=\lim_{\delta\to 0}\frac{P(t<Y<t+\delta\mid Y\ge t)}{\delta}\,.$$
A Cox model expresses h(t) as the product of a baseline hazard $h_0(t)$ that does not depend on the covariates $x\in\mathbb{R}^p$ and $\exp(x\beta)$:
30. The variable time in Exercise 29 takes both the survival time and the time before survey termination into account. Let $t_1<t_2<\dots<t_k$ ($k\le N$) be the times at which deaths occur, $d_i$ the number of deaths at time $t_i$, and $m_j$ the number of survey terminations in the interval $[t_j,t_{j+1})$, and let
$$n_j=\sum_{i=j}^{k}(d_i+m_i)\quad(j=1,\dots,k)\,.$$
Then, the probability S(t) of the survival time T being larger than t can be estimated as (Kaplan-Meier estimate): for $t_l\le t<t_{l+1}$,
$$\hat S(t)=\begin{cases}1, & t<t_1\\[1ex] \displaystyle\prod_{i=1}^{l}\frac{n_i-d_i}{n_i}, & t\ge t_1\,.\end{cases} \qquad(\text{cf. }(2.18))$$
We maximize the partial likelihood
$$\prod_{i:\delta_i=1}\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}} \qquad(\text{cf. }(2.19))$$
for estimating the parameter β, where $x_i\in\mathbb{R}^p$ and $y_i$ are the covariates and the time, respectively, and δi = 1 and δi = 0 correspond to death and survey termination, respectively, for i = 1, . . . , N.
On the other hand, $R_i$ is the set of indices j such that $y_j\ge y_i$. We formulate the Lasso as follows:
$$-\frac{1}{N}\sum_{i:\delta_i=1}\log\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}}+\lambda\|\beta\|_1\,. \qquad(\text{cf. }(2.20))$$
Let
$$L:=-\sum_{i:\delta_i=1}\log\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}}\,.$$
Express u in $\nabla L=-X^Tu$.
Hint For $S_i=\sum_{h\in R_i}e^{x_h\beta}$, we have
$$\sum_{i:\delta_i=1}\sum_{j\in R_i}\frac{x_{j,k}e^{x_j\beta}}{S_i}=\sum_{j=1}^{N}\sum_{i\in C_j}\frac{x_{j,k}e^{x_j\beta}}{S_i}=\sum_{i=1}^{N}x_{i,k}\sum_{j\in C_i}\frac{e^{x_i\beta}}{S_j}$$
and, differentiating by $\beta_l$,
$$\sum_{i=1}^{N}x_{i,k}\sum_{j\in C_i}\frac{1}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{x_{i,l}e^{x_i\beta}\sum_{s\in R_j}e^{x_s\beta}-e^{x_i\beta}\sum_{h\in R_j}x_{h,l}e^{x_h\beta}\Big\}
=\sum_{i=1}^{N}\sum_{h=1}^{N}x_{i,k}x_{h,l}\sum_{j\in C_i}\frac{e^{x_i\beta}}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{I(i=h)\sum_{s\in R_j}e^{x_s\beta}-I(h\in R_j)e^{x_h\beta}\Big\}\,.$$
(b) Fill in the blank and draw the Kaplan-Meier curve for samples such that
xi β > 0 and for those such that xi β < 0 to distinguish them.
33. It is said that logistic regression and the support vector machine (SVM) behave similarly even when Lasso is applied.
a. The following code executes Lasso for logistic regression and the SVM for
the South Africa heart disease dataset: https://www2.stat.duke.edu/~cr173/
Sta102_Sp14/Project/heart.pdf.
Because those procedures are in separate packages glmnet and
sparseSVM, the plots are different, and no legend is available for the
first plot. Thus, we construct a graph using the coefficients output by the
packages. Output the graph for the SVM as well.
1 library(ElemStatLearn)
2 library(glmnet)
3 library(sparseSVM)
4 data(SAheart)
5 df = SAheart
6 df[, 5] = as.numeric(df[, 5])
7 x = as.matrix(df[, 1:9]); y = as.vector(df[, 10])
8 p = 9
9 binom.fit = glmnet(x, y, family = "binomial")
10 svm.fit = sparseSVM(x, y)
11 par(mfrow = c(1, 2))
12 plot(binom.fit); plot(svm.fit, xvar = "norm")
13 par(mfrow = c(1, 1))
14 ## The outputs seemed to be similar, but we are not convinced
that they are close because no legend is available.
15 ## So, we made a graph from scratch.
16 coef.binom = binom.fit$beta; coef.svm = coef(svm.fit)[2:(p +
1), ]
17 norm.binom = apply(abs(coef.binom), 2, sum); norm.binom = norm
.binom / max(norm.binom)
18 norm.svm = apply(abs(coef.svm), 2, sum); norm.svm = norm.svm /
max(norm.svm)
19 par(mfrow = c(1, 2))
20 plot(norm.binom, xlim = c(0, 1), ylim = c(min(coef.binom), max
(coef.binom)),
21 main = "Logistic Regression", xlab = "Norm", ylab = "
Coefficient", type = "n")
22 for (i in 1:p) lines(norm.binom, coef.binom[i, ], col = i)
23 legend("topleft", legend = colnames(df), col = 1:p, lwd = 2,
cex = .8)
24 par(mfrow = c(1, 1))
b. From the gene expression data of past patients with leukemia, we distinguish between acute lymphocytic leukemia (ALL) and acute myeloid leukemia (AML) for each future patient. In particular, our goal is to determine which genes should be checked to distinguish between them. To this end, we download the training data file leukemia_big.csv from the site listed below: https://web.stanford.edu/~hastie/CASI_files/DATA/leukemia.html. The data contain samples for N = 72 patients (47 ALL and 25 AML) and p = 7,128 genes: https://www.ncc.go.jp/jp/rcc/about/pediatricleukemia/index.html.
After executing the following, output the coefficients obtained via logistic regression and the SVM. In most genome data, the rows and columns are genes and samples, respectively. However, similar to the datasets we have seen thus far, here the rows and columns are instead samples and genes (hence the transpose in the code).
1 df = read.csv("http://web.stanford.edu/~hastie/CASI_files/DATA
/leukemia_big.csv")
2 dim(df)
3 names = colnames(df)
4 x = t(as.matrix(df))
5 y = as.numeric(substr(names, 1, 3) == "ALL")
6 p = 7128
7 binom.fit = glmnet(x, y, family = "binomial")
8 svm.fit = sparseSVM(x, y)
9 coef.binom = binom.fit$beta; coef.svm = coef(svm.fit)[2:(p +
1), ]
10 norm.binom = apply(abs(coef.binom), 2, sum); norm.binom = norm
.binom / max(norm.binom)
11 norm.svm = apply(abs(coef.svm), 2, sum); norm.svm = norm.svm /
max(norm.svm)
Chapter 3
Group Lasso
Group Lasso is a Lasso in which the variables are categorized into K groups k = 1, . . . , K. The pk variables $\theta_k=[\theta_{1,k},\dots,\theta_{p_k,k}]^T\in\mathbb{R}^{p_k}$ in the same group share the same λ value at which their coefficients become zero as λ increases. This chapter considers groups with nonzero and zero coefficients to be active and nonactive, respectively, for each λ. In other words, group Lasso chooses active groups rather than active variables. The active/nonactive status may differ among the groups.
Example 25 (Logistic Regression for Multiple Values) For the logistic regression
and classification discussed in Chap. 2, we might have constructed a group consisting
of k = 1, . . . , K for each covariate j = 1, . . . , p to determine which covariate plays
an important role in the classification task [27]. For example, in Fisher’s Iris dataset,
although there are p × K = 4 × 3 = 12 parameters, we may divide them into p = 4
(sepal/petal, width/length) groups that contain K = 3 iris species to observe the
p = 4 active/nonactive status. When we increase λ, K = 3 coefficients in the same
variable indexed by j become zero at once for some λ > 0. We discuss the problem
in Sect. 3.7.
$$y_i=\sum_{k=1}^{K}f_k(x_i;\theta_k)+\epsilon_i\,,$$
In this chapter, generalizing the notion of Lasso, we categorize the variables into
K groups, each of which contains pk variables (k = 1, . . . , K ), and given observa-
tions z i,k ∈ R pk (k = 1, . . . , K ), yi ∈ R (i = 1, . . . , N ), we consider the problem of
finding the
θ1 = [θ1,1 , . . . , θ p1 ,1 ]T , . . . , θ K = [θ1,K , . . . , θ p K ,K ]T
that minimize
$$\frac{1}{2}\sum_{i=1}^{N}\Big(y_i-\sum_{k=1}^{K}z_{i,k}\theta_k\Big)^2+\lambda\sum_{k=1}^{K}\|\theta_k\|_2\,, \qquad(3.1)$$
where each $z_{i,k}$ is a row vector.
Note that (3.1) is a generalization of the linear Lasso ( p1 = · · · = p K = 1, K = p)
discussed in Chap. 1. In fact, if we let xi := [z i,1 , . . . , z i, p ] ∈ R1× p (row vector) and
β := [θ1,1 , . . . , θ p,1 ]T , we have
$$\frac{1}{2}\sum_{i=1}^{N}(y_i-x_i\beta)^2+\lambda\|\beta\|_1\,.$$
In this section, we consider the case where only one group exists (K = 1 and p1 = p). Then, (3.1) can be expressed by
$$\frac{1}{2}\sum_{i=1}^{N}(y_i-z_{i,1}\theta_1)^2+\lambda\|\theta_1\|_2\,.$$
if we set y = 0, because f (x, y) = |x|, the slopes when approaching from the left
and right are different. On the other hand, the partial differentials outside the origin
are
$$f_x(x,y)=\frac{x}{\sqrt{x^2+y^2}}\,,\qquad f_y(x,y)=\frac{y}{\sqrt{x^2+y^2}}\,, \qquad(3.3)$$
We consider finding the $\beta\in\mathbb{R}^p$ that minimizes
$$\frac{1}{2}\|y-X\beta\|_2^2+\lambda\|\beta\|_2 \qquad(3.5)$$
from $X\in\mathbb{R}^{N\times p}$, $y\in\mathbb{R}^N$, and λ ≥ 0, where we denote $\|\beta\|_2:=\sqrt{\sum_{j=1}^{p}\beta_j^2}$ for $\beta=[\beta_1,\dots,\beta_p]^T\in\mathbb{R}^p$. In this chapter, we do not divide the first term by N as in (3.5); we may interpret Nλ as if it were λ.
Fig. 3.1 For p = 1, if $X^Ty$ is farther than λ from the origin, then β ≠ 0; if within λ, then β = 0. For p = 2, if $X^Ty$ is farther than λ from the origin, then β ≠ [0, 0]^T; if within λ, then β = [0, 0]^T
For p = 1 and β = 0, the condition that the subderivative contains zero is
$$-X^T(y-X\beta)+\lambda[-1,1]\ni 0\,, \qquad(3.6)$$
which for β = 0 becomes
$$|X^Ty|\le\lambda\,. \qquad(3.7)$$
(3.7) means that the line segment centered at $X^Ty$ with half-length λ contains the origin (Fig. 3.1a, b).
Suppose p = 2. When $(\beta_1,\beta_2)\ne(0,0)$, from (3.3), the subderivative of $\|\beta\|_2$ at β is $\beta/\|\beta\|_2$, and the condition that the subderivative contains zero becomes
$$-X^T(y-X\beta)+\lambda\frac{\beta}{\|\beta\|_2}=\begin{bmatrix}0\\0\end{bmatrix}\,, \qquad(3.8)$$
which means
$$X^TX\beta=X^Ty-\lambda\frac{\beta}{\|\beta\|_2}\,. \qquad(3.9)$$
Thus, the point $X^Ty$ is moved toward the origin by the length λ due to the Lasso. On the other hand, when β = 0, we have
$$\text{the solution of (3.5) is }\beta=0\iff -X^Ty+\lambda\left\{\begin{bmatrix}u\\v\end{bmatrix}:u^2+v^2\le 1\right\}\ni\begin{bmatrix}0\\0\end{bmatrix}\iff\|X^Ty\|_2\le\lambda\,,$$
which means that the disk with center $X^Ty$ and radius λ contains the origin (Fig. 3.1c, d).
For p = 1, we have the closed-form solution $S_\lambda(X^Ty)$. For p ≥ 2, however, we obtain the solution by repeating a recursive update until convergence, similar to the Newton method discussed in Chap. 2. We consider such a method for finding the solution $\beta\in\mathbb{R}^p$. Setting ν > 0, we repeat the updates
$$\gamma:=\beta+\nu X^T(y-X\beta) \qquad(3.10)$$
$$\beta=\left(1-\frac{\nu\lambda}{\|\gamma\|_2}\right)_{+}\gamma\,. \qquad(3.11)$$
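The function gr used in Example 27 below iterates exactly these two steps; a minimal sketch (the choice of the step size ν and the stopping rule are assumptions) is:

gr = function(X, y, lambda) {
  nu = 1 / max(eigen(t(X) %*% X)$values)   # step size (any 0 < nu <= 1/L works)
  p = ncol(X)
  beta = rnorm(p); beta.old = rnorm(p)
  while (max(abs(beta - beta.old)) > 0.001) {
    beta.old = beta
    gamma = beta + nu * as.vector(t(X) %*% (y - X %*% beta))        # (3.10)
    beta = max(1 - lambda * nu / sqrt(sum(gamma ^ 2)), 0) * gamma   # (3.11)
  }
  return(beta)
}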
Example 27 Artificially generating data, we execute the function gr. We show the result in Fig. 3.2, in which we observe that the p variables share the λ value at which their coefficients become zero as λ increases.
1 ## Data Generation
2 n = 100
3 p = 3
4 X = matrix(rnorm(n * p), ncol = p); beta = rnorm(p); epsilon = rnorm(n)
5 y = 0.1 * X %*% beta + epsilon
6 ## Display the Change of Coefficients
7 lambda = seq(1, 50, 0.5)
8 m = length(lambda)
9 beta = matrix(nrow = m, ncol = p)
10 for (i in 1:m) {
11 est = gr(X, y, lambda[i])
12 for (j in 1:p) beta[i, j] = est[j]
13 }
14 y.max = max(beta); y.min = min(beta)
15 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
16 xlab = "lambda", ylab = "Coefficients", type = "n")
17 for (j in 1:p) lines(lambda, beta[, j], col = j + 1)
18 legend("topright", legend = paste("Coefficients", 1:p), lwd = 2, col =
2:(p + 1))
19 segments(lambda[1], 0, lambda[m], 0)
When we alternately update (3.10) and (3.11), the sequence $\{\beta_t\}$ is not guaranteed to converge to the solution of (3.5). In the following, we show that we can always obtain the correct solution by choosing an appropriate value of ν > 0.
Suppose λ = 0. Then, the update via (3.10) and (3.11) amounts to
$$\beta_{t+1}\leftarrow\beta_t+\nu X^T(y-X\beta_t)\quad(t=0,1,\dots)$$
after setting an initial value β0. Let g(β) be (3.5) with λ = 0. Then, since $-X^T(y-X\beta)$ is the derivative of g(β) with respect to β, we can write the update as
$$\beta_{t+1}\leftarrow\beta_t-\nu\nabla g(\beta_t)\,.$$
We call this procedure a gradient method because it updates β in the direction that decreases g most steeply.
In the following, we express the second term of (3.5) by $h(\beta):=\lambda\|\beta\|_2$ and evaluate the convergence of a variant of the gradient method (the proximal gradient method), which chooses $\beta_{t+1}$ as the θ that minimizes
$$\frac{1}{2}\|\beta_t-\nu\nabla g(\beta_t)-\theta\|_2^2+\nu h(\theta)\,. \qquad(3.14)$$
$$Q(x,y):=g(y)+(x-y)^T\nabla g(y)+\frac{L}{2}\|x-y\|^2+h(x) \qquad(3.16)$$
$$p(y):=\arg\min_{x\in\mathbb{R}^p}Q(x,y) \qquad(3.17)$$
Proposition 7 (Beck and Teboulle, 2009 [3]) The sequence $\{\beta_t\}$ generated by the ISTA satisfies
$$f(\beta_k)-f(\beta_*)\le\frac{L\|\beta_0-\beta_*\|_2^2}{2k}\,,$$
where $\beta_*$ is the minimizer of f.
Fig. 3.2 We execute the function gr (Example 27) for p = 3 and p = 8 and observe that the
coefficients become zero at the same time as we increase λ
$$\gamma_{t+1}=\beta_t+\frac{\alpha_t-1}{\alpha_{t+1}}(\beta_t-\beta_{t-1})\,. \qquad(3.20)$$
12 alpha.old = alpha
13 alpha = (1 + sqrt(1 + 4 * alpha ^ 2)) / 2
14 gamma = beta + (alpha.old - 1) / alpha * (beta - beta.old)
15 }
16 return(beta)
17 }
Example 28 To compare the efficiencies of the ISTA and FISTA, we construct the
program (Fig. 3.3). We could not see a significant difference when N = 100, p =
1, 3. However, the FISTA can be applied to general optimization problems and works
efficiently for large-scale problems.2 We show the program in Exercise 39.
In the following, we apply the group Lasso procedure for one group in a cyclic
manner to obtain the group Lasso for K ≥ 1 groups. We can obtain the solution of
(3.1) using the coordinate descent method among the groups. If each of the groups
k = 1, . . . , K contains pk variables, we apply the method introduced in Sects. 3.1
and 3.2 as if p = pk .
Fig. 3.3 The difference in the convergence rate between the ISTA and FISTA (Example 28). It
seems that the difference is not significant unless the problem size is rather large
2 If there exists m > 0 such that $(\nabla f(x)-\nabla f(y))^T(x-y)\ge\frac{m}{2}\|x-y\|^2$ for arbitrary x, y, then we say that f is strongly convex. In this case, the error can be decreased exponentially w.r.t. k (Nesterov, 2007).
Note that z and theta are lists. If the number of groups is one, they are the X, β
discussed in Chap. 1.
For a single group, we have seen that the variables share the active/nonactive status for each λ ≥ 0. With multiple groups, variables in the same group share the same status, but those in different groups do not. The grouping of the variables is specified manually before executing the procedure (it is not constructed automatically).
Example 29 Suppose that we have four variables and that the first two and last two
variables constitute two separate groups. We artificially generate data and estimate
the coefficients of the four variables. When we make λ larger, we observe that the
first two and last two variables share the active/nonactive status (Fig. 3.4).
Fig. 3.4 J = 2, p1 = p2 = 2 (Example 29). If we make λ larger, we observe that each group shares the active/nonactive status
1 ## Data Generation
2 N = 100; J = 2
3 u = rnorm(n); v = u + rnorm(n)
4 s = 0.1 * rnorm(n); t = 0.1 * s + rnorm(n); y = u + v + s + t + rnorm(n)
5 z = list(); z[[1]] = cbind(u, v); z[[2]] = cbind(s, t)
6 ## Display the coefficients that change with lambda
7 lambda = seq(1, 500, 10); m = length(lambda); beta = matrix(nrow = m,
ncol = 4)
8 for (i in 1:m) {
9 est = group.lasso(z, y, lambda[i])
10 beta[i, ] = c(est[[1]][1], est[[1]][2], est[[2]][1], est[[2]][2])
11 }
12 y.max = max(beta); y.min = min(beta)
13 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
14 xlab = "lambda", ylab = "Coefficients", type = "n")
15 lines(lambda, beta[, 1], lty = 1, col = 2); lines(lambda, beta[, 2], lty
= 2, col = 2)
16 lines(lambda, beta[, 3], lty = 1, col = 4); lines(lambda, beta[, 4], lty
= 2, col = 4)
17 legend("topright", legend = c("Group1", "Group1", "Group2", "Group2"),
18 lwd = 1, lty = c(1, 2), col = c(2, 2, 4, 4))
19 segments(lambda[1], 0, lambda[m], 0)
We next consider minimizing
$$\frac{1}{2}\sum_{i=1}^{N}\Big(y_i-\sum_{k=1}^{K}z_{i,k}\theta_k\Big)^2+\lambda\sum_{k=1}^{K}\{(1-\alpha)\|\theta_k\|_2+\alpha\|\theta_k\|_1\} \qquad(3.21)$$
to obtain sparsity within each group as well as between groups [26]. We refer to this extended group Lasso as sparse group Lasso. The difference from (3.1) lies only in the second term, which introduces the parameter 0 < α < 1, as in the elastic net.
In sparse group Lasso, although the variables of the nonactive groups are nonactive, the variables in the active groups may be either active or nonactive. In this sense, sparse group Lasso extends the ordinary group Lasso and allows the active groups to have nonactive variables.
If we differentiate (3.21) by $\theta_k\in\mathbb{R}^{p_k}$, we have
$$-\sum_{i=1}^{N}z_{i,k}(r_{i,k}-z_{i,k}\theta_k)+\lambda(1-\alpha)s_k+\lambda\alpha t_k=0 \qquad(3.22)$$
with $r_{i,k}:=y_i-\sum_{l\ne k}z_{i,l}\hat\theta_l$, where $s_k,t_k\in\mathbb{R}^{p_k}$ are the subderivatives of $\|\theta_k\|_2$ and $\|\theta_k\|_1$.
Next, we derive the conditions for θk = 0 to be optimum in terms of $s_k,t_k$. The minimizer of (3.21) excluding the term with the coefficient 1 − α is
$$\theta_k=S_{\lambda\alpha}\Big(\sum_{i=1}^{N}z_{i,k}r_{i,k}\Big)\Big/\sum_{i=1}^{N}z_{i,k}^2\,,$$
and the update (3.11) becomes
$$\beta=\left(1-\frac{\nu\lambda(1-\alpha)}{\|S_{\lambda\alpha}(\gamma)\|_2}\right)_{+}S_{\lambda\alpha}(\gamma)\,, \qquad(3.23)$$
where $(u)_+:=\max\{0,u\}$. Therefore, we can construct a procedure that repeats (3.10) and (3.23) alternately. Although we omit the proof, we have the following proposition.
The actual procedure can be constructed as follows, where the three lines marked
as ## are different from the ordinary group Lasso.
1 sparse.group.lasso = function(z, y, lambda = 0, alpha = 0) {
2 J = length(z)
3 theta = list(); for (j in 1:J) theta[[j]] = rep(0, ncol(z[[j]]))
4 for (m in 1:10) {
5 for (j in 1:J) {
6 r = y; for (k in 1:J) {if (k != j) r = r - z[[k]] %*% theta[[k]]}
7 theta[[j]] = sparse.gr(z[[j]], r, lambda, alpha)
##
8 }
9 }
10 return(theta)
11 }
12
13 sparse.gr = function(X, y, lambda, alpha = 0) {
14 nu = 1 / max(2 * eigen(t(X) %*% X)$values)
15 p = ncol(X)
16 beta = rnorm(p); beta.old = rnorm(p)
17 while (max(abs(beta - beta.old)) > 0.001) {
18 beta.old = beta
19 gamma = beta + nu * t(X) %*% (y - X %*% beta)
20 delta = soft.th(lambda * alpha, gamma)
##
21 beta = max(1 - lambda * nu * (1 - alpha) / norm(delta, "2"), 0) *
delta ##
22 }
23 return(beta)
24 }
Fig. 3.5 Execution for α = 0, 0.001, 0.01, 0.1 (Example 30): the larger the value of α, the more variables become nonactive within the active groups
Example 30 We executed the sparse group Lasso for α = 0, 0.001, 0.01, 0.1 under
the same setting as in Example 29. We find that the larger the value of α, the greater
the number of variables that are nonactive in the active groups (Fig. 3.5).
We consider the case where some groups overlap; for example, the groups are {1, 2, 3}, {3, 4, 5} rather than {1, 2}, {3, 4}, {5}. For each group k = 1, . . . , K, we specify which of the coefficients $\beta_1,\dots,\beta_p$ of the variables may be nonzero. We prepare variables $\theta_1,\dots,\theta_K\in\mathbb{R}^p$, each of which is zero outside its group, such that $\beta=\sum_{k=1}^{K}\theta_k$ (see Example 31), and find the β value that minimizes
$$\frac{1}{2}\Big\|y-X\sum_{k=1}^{K}\theta_k\Big\|_2^2+\lambda\sum_{k=1}^{K}\|\theta_k\|_2$$
and differentiate L by $\beta_1,\beta_2,\beta_{3,1},\beta_{3,2},\beta_4,\beta_5$. We write the first and last three columns of $X\in\mathbb{R}^{N\times 5}$ as $X_1\in\mathbb{R}^{N\times 3}$ and $X_2\in\mathbb{R}^{N\times 3}$. If we differentiate L by γ1 and γ2 (the first and last three elements of θ1, θ2), we have
$$\frac{\partial L}{\partial\gamma_1}=-X_1^T(y-X_1\gamma_1)+\lambda\frac{\gamma_1}{\|\gamma_1\|_2}\,,\qquad
\frac{\partial L}{\partial\gamma_2}=-X_2^T(y-X_2\gamma_2)+\lambda\frac{\gamma_2}{\|\gamma_2\|_2}\,,$$
so that the optimality conditions become
$$-X_1^T(y-X\theta_1)+\lambda\frac{(\text{the first 3 of }\theta_1)}{\|\theta_1\|_2}=0\,,\qquad
-X_2^T(y-X\theta_2)+\lambda\frac{(\text{the last 3 of }\theta_2)}{\|\theta_2\|_2}=0\,.$$
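A sketch of how such overlapping groups can be set up in code (assuming the function group.lasso of Sect. 3.3; the data generation below is an arbitrary assumption) is:

n = 100
X = matrix(rnorm(n * 5), ncol = 5)
y = X %*% c(1, -1, 2, 0, 0) + rnorm(n)
z = list(); z[[1]] = X[, 1:3]; z[[2]] = X[, 3:5]    # the shared column 3 is duplicated
theta = group.lasso(z, y, lambda = 10)
beta3 = theta[[1]][3] + theta[[2]][1]               # beta_3 = beta_{3,1} + beta_{3,2}
beta.hat = c(theta[[1]][1:2], beta3, theta[[2]][2:3])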
For group Lasso with multiple responses, we find the β that minimizes $L_0(\beta)+\lambda\sum_{j=1}^{p}\|\beta_j\|_2$ with
$$L_0(\beta):=\frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{K}\Big(y_{i,k}-\sum_{j=1}^{p}x_{i,j}\beta_{j,k}\Big)^2\,. \qquad(3.24)$$
The derivative of $L_0$ with respect to $\beta_{j,k}$ is
$$\sum_{i=1}^{N}\{-x_{i,j}(r_{i,k}^{(j)}-x_{i,j}\beta_{j,k})\}\,,$$
where $r_{i,k}^{(j)}:=y_{i,k}-\sum_{l\ne j}x_{i,l}\beta_{l,k}$. Hence, the subdifferential with respect to $\beta_j$ is
$$\beta_j\sum_{i=1}^{N}x_{i,j}^2-\sum_{i=1}^{N}x_{i,j}r_i^{(j)}+\lambda\partial\|\beta_j\|_2\,,$$
where $r_i^{(j)}:=[r_{i,1}^{(j)},\dots,r_{i,K}^{(j)}]$. Therefore, the solution is
$$\hat\beta_j=\frac{1}{\sum_{i=1}^{N}x_{i,j}^2}\left(1-\frac{\lambda}{\|\sum_{i=1}^{N}x_{i,j}r_i^{(j)}\|_2}\right)_{+}\sum_{i=1}^{N}x_{i,j}r_i^{(j)}\,. \qquad(3.25)$$
Fig. 3.6 We find the relevant variables that affect HRs and RBIs via group Lasso, which shows that the SHs and the HBPs are negatively correlated with them. Even if we make λ large enough, the coefficients of H and BB do not become zero, keeping positive values, which means that they are essentially important items for HRs and RBIs
We do not have to repeat updates for each j (for each group) in this procedure but
may update only once.
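Example 32 below calls a function gr.multi.linear.lasso that performs this update; a minimal sketch based on (3.25) (the centering, the intercept handling in the first row of the returned matrix, and the number of sweeps are assumptions) is:

gr.multi.linear.lasso = function(X, Y, lambda) {
  p = ncol(X); K = ncol(Y)
  x.bar = colMeans(X); y.bar = colMeans(Y)
  X = scale(X, center = TRUE, scale = FALSE)
  Y = scale(Y, center = TRUE, scale = FALSE)
  beta = matrix(0, p, K)
  for (m in 1:20) {                                  # a few sweeps of the update (3.25)
    for (j in 1:p) {
      r = Y - X[, -j, drop = FALSE] %*% beta[-j, , drop = FALSE]   # r^(j)
      s = as.vector(t(X[, j]) %*% r)                 # sum_i x_{i,j} r_i^(j) (length K)
      beta[j, ] = max(1 - lambda / sqrt(sum(s ^ 2)), 0) * s / sum(X[, j] ^ 2)
    }
  }
  beta.0 = y.bar - as.vector(x.bar %*% beta)         # intercepts (first row)
  rbind(beta.0, beta)
}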
Example 32 From a dataset that contains nine items for the hitters of a professional baseball team,3 i.e., the number of hits (H), home runs (HRs), runs batted in (RBIs), stolen bases (SBs), bases on balls (BB), hits by pitch (HBPs), strikeouts (SOs), sacrifice hits (SHs), and double plays (DPs), we find the relevant covariates among the seven for the responses HRs and RBIs. We estimate the coefficients βj,k (j = 1, . . . , 7, k = 1, 2); because the HRs and RBIs are correlated, we make βj,1, βj,2 share the active/nonactive status for each j. We draw the change in the coefficients with λ in Fig. 3.6.
1 df = read.csv("giants_2019.csv")
2 X = as.matrix(df[, -c(2, 3)])
3 Y = as.matrix(df[, c(2, 3)])
4 lambda.min = 0; lambda.max = 200
5 lambda.seq = seq(lambda.min, lambda.max, 5)
6 m = length(lambda.seq)
7 beta.1 = matrix(0, m, 7); beta.2 = matrix(0, m, 7)
8 j = 0
9 for (lambda in lambda.seq) {
10 j = j + 1
11 beta = gr.multi.linear.lasso(X, Y, lambda)
12 for (k in 1:7) {
13 beta.1[j, k] = beta[k + 1, 1]; beta.2[j, k] = beta[k + 1, 2]
14 }
15 }
16 beta.max = max(beta.1, beta.2); beta.min = min(beta.1, beta.2)
3 The hitters of the Yomiuri Giants, Japan, each of whom had at least one opportunity to bat.
In Chap. 2, we considered logistic regression with sparse estimation for binary and
multiple values. In this section, among the pK parameters, the variables with the
coefficients β j,k (k = 1, . . . , K ) with the same j comprise a group. Variables in the
same group share the active/nonactive status, and the coordinate descent method is
applied among the p groups.
In Chap. 2, we considered minimizing
$$L_0(\beta)+\lambda\sum_{j=1}^{p}\sum_{k=1}^{K}|\beta_{j,k}|$$
with
$$L_0(\beta):=-\sum_{i=1}^{N}\left[\sum_{j=1}^{p}\sum_{k=1}^{K}y_{i,k}x_{i,j}\beta_{j,k}-\log\left(\sum_{h=1}^{K}\exp\Big(\sum_{j=1}^{p}x_{i,j}\beta_{j,h}\Big)\right)\right]\,,$$
where $y_{i,k}=\begin{cases}1, & y_i=k\\ 0, & y_i\ne k\,.\end{cases}$ In this section, we replace the last term by the L2 norm and minimize
$$L(\beta):=L_0(\beta)+\lambda\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K}\beta_{j,k}^2}$$
to choose the relevant variables for classification. From the same discussion in
Chap. 2, we have
$$\frac{\partial L_0(\beta)}{\partial\beta_{j,k}}=-\sum_{i=1}^{N}x_{i,j}(y_{i,k}-\pi_{i,k})\,,\qquad
\frac{\partial^2 L_0(\beta)}{\partial\beta_{j,k}\partial\beta_{j',k'}}=\sum_{i=1}^{N}x_{i,j}x_{i,j'}w_{i,k,k'}\,,$$
where $\pi_{i,k}=\dfrac{\exp(x_i\beta^{(k)})}{\sum_{h=1}^{K}\exp(x_i\beta^{(h)})}$ and (2.12) has been applied. Because the $w_{i,k,k'}$ are constants, from the Taylor expansion at β = γ, we derive
$$L_0(\beta)=L_0(\gamma)-\sum_{i=1}^{N}\sum_{j=1}^{p}x_{i,j}(\beta_j-\gamma_j)(y_i-\pi_i)+\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{p}\sum_{j'=1}^{p}x_{i,j}x_{i,j'}(\beta_j-\gamma_j)W_i(\beta_{j'}-\gamma_{j'})^T\,,$$
where $y_i=[y_{i,1},\dots,y_{i,K}]^T$, $\pi_i=[\pi_{i,1},\dots,\pi_{i,K}]^T$, and $W_i=(w_{i,k,k'})$ are the constants at β = γ. We observe that the Lipschitz constant is at most the t for which $tI_K-W_i$ is nonnegative definite for all i, i.e., $t:=2\max_{i,k}\pi_{i,k}(1-\pi_{i,k})$. In fact, the matrix obtained by subtracting $W_i=(w_{i,k,k'})$ from the diagonal matrix of size K with the elements $2\pi_{i,k}(1-\pi_{i,k})$ ($k=1,\dots,K$) is $Q_i=(q_{i,k,k'})$ with
$$q_{i,k,k'}=\begin{cases}\pi_{i,k}(1-\pi_{i,k}), & k=k'\\ \pi_{i,k}\pi_{i,k'}, & k\ne k'\,,\end{cases}$$
and
$$q_{i,k,k}=\pi_{i,k}(1-\pi_{i,k})=\sum_{k'\ne k}\pi_{i,k}\pi_{i,k'}=\sum_{k'\ne k}|q_{i,k,k'}|\,,$$
so that, by Proposition 3, $Q_i$ is nonnegative definite. We therefore minimize
$$-\sum_{i=1}^{N}\sum_{j=1}^{p}x_{i,j}(\beta_j-\gamma_j)(y_i-\pi_i)+\frac{t}{2}\sum_{i=1}^{N}\Big\|\sum_{j=1}^{p}x_{i,j}(\beta_j-\gamma_j)\Big\|_2^2+\lambda\sum_{j=1}^{p}\|\beta_j\|_2\,,$$
where
$$r_i^{(j)}:=\frac{y_i-\pi_i}{t}+\sum_{h=1}^{p}x_{i,h}\gamma_h-\sum_{h\ne j}x_{i,h}\beta_h\,. \qquad(3.26)$$
Example 33 For Fisher’s Iris dataset, we draw the graph in which β changes with
λ (Fig. 3.7). The same color expresses the same variable, and each contains K = 3
coefficients. We find that the sepal length plays an important role. The code is as
follows.
Fig. 3.7 For Fisher’s Iris The Cofficients that change with λ
data, we observed the
changes in the coefficients
Sepal Length
with λ (Example 33). The
1.0
Sepal Width
same color expresses the Petal Length
same variable, and each Petal Width
0.5
consists of K = 3
Coefficients
coefficients. We found that
0 50 100 150
λ
1 df = iris
2 X = cbind(df[[1]], df[[2]], df[[3]], df[[4]])
3 y = c(rep(1, 50), rep(2, 50), rep(3, 50))
4 lambda.seq = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150)
5 m = length(lambda.seq); p = ncol(X); K = length(table(y))
6 alpha = array(dim = c(m, p, K))
7 for (i in 1:m) {
8 res = gr.multi.lasso(X, y, lambda.seq[i])
9 for (j in 1:p) {for (k in 1:K) alpha[i, j, k] = res[j, k]}
10 }
11 plot(0, xlim = c(0, 150), ylim = c(min(alpha), max(alpha)), type = "n",
12 xlab = "lambda", ylab = "Coefficients", main = "Each lambda and its
Coefficients")
13 for (j in 1:p) {for (k in 1:K) lines(lambda.seq, alpha[, j, k], col = j +
1)}
14 legend("topright", legend = c("Sepal Length", "Sepal Width", "Petal
Length", "Petal Width"),
15 lwd = 2, col = 2:5)
If we express each $f_k$ by basis functions $\phi_{j,k}$ as
$$f_k(X;\theta_k)=\sum_{j=1}^{p_k}\theta_{j,k}\phi_{j,k}(X)\,,$$
the objective becomes
$$L:=\frac{1}{2}\Big\|y-\sum_{k=1}^{K}f_k(X)\Big\|^2+\lambda\sum_{k=1}^{K}\|\theta_k\|_2
=\frac{1}{2}\sum_{i=1}^{N}\Big\{y_i-\sum_{k=1}^{K}\sum_{j=1}^{p_k}\phi_{j,k}(x_i)\theta_{j,k}\Big\}^2+\lambda\sum_{k=1}^{K}\|\theta_k\|_2\,. \qquad(3.28)$$
If we set $z_{i,k}:=[\phi_{1,k}(x_i),\dots,\phi_{p_k,k}(x_i)]$, then (3.28) coincides with (3.1).
f 1 (x; α, β) = α + βx
f 2 (x; p, q, r ) = p cos x + q cos 2x + r cos 3x
Fig. 3.8 Express y = f (x) in terms of the sum of the outputs of f 1 (x), f 2 (x). We obtain the
coefficients of 1, x, cos x, cos 2x, cos 3x for each λ (left) and draw the functions f 1 (x), f 2 (x)
whose coefficients have already been obtained (right)
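A sketch of how the decomposition in Fig. 3.8 can be computed (assuming the function group.lasso of Sect. 3.3; the true function used to generate the data below is an assumption) is:

n = 100
x = runif(n, -pi, pi)
y = x + 2 * cos(x) + rnorm(n, sd = 0.1)             # hypothetical target f(x)
z = list()
z[[1]] = cbind(1, x)                                # basis of f_1: 1, x
z[[2]] = cbind(cos(x), cos(2 * x), cos(3 * x))      # basis of f_2: cos x, cos 2x, cos 3x
theta = group.lasso(z, y, lambda = 1)
f1 = z[[1]] %*% theta[[1]]; f2 = z[[2]] %*% theta[[2]]
plot(x, f1 + f2); points(x, y, col = "gray")        # compare the fit with the data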
Proposition 7 (Beck and Teboulle, 2009 [3]) The sequence $\{\beta_t\}$ generated by the ISTA satisfies
$$f(\beta_k)-f(\beta_*)\le\frac{L\|\beta_0-\beta_*\|_2^2}{2k}\,,$$
where $\beta_*$ is the minimizer of f.
Lemma 1
$$f(x)-f(p(y))\ge\frac{L}{2}\|p(y)-y\|^2+L(y-x)^T(p(y)-y)$$
Proof of Lemma 1
We note that in general, we have
$$\frac{2}{L}\{f(\beta_*)-f(\beta_{t+1})\}\ge\|\beta_{t+1}-\beta_t\|^2+2(\beta_t-\beta_*)^T(\beta_{t+1}-\beta_t)
=\langle\beta_{t+1}-\beta_t,\ \beta_{t+1}-\beta_t+2(\beta_t-\beta_*)\rangle
=\langle(\beta_{t+1}-\beta_*)-(\beta_t-\beta_*),\ (\beta_{t+1}-\beta_*)+(\beta_t-\beta_*)\rangle
=\|\beta_*-\beta_{t+1}\|^2-\|\beta_*-\beta_t\|^2\,,$$
where $\langle\cdot,\cdot\rangle$ denotes the inner product in the associated linear space. If we add these over $t=0,1,\dots,k-1$, we have
$$\frac{2}{L}\Big\{kf(\beta_*)-\sum_{t=0}^{k-1}f(\beta_{t+1})\Big\}\ge\|\beta_*-\beta_k\|^2-\|\beta_*-\beta_0\|^2\,. \qquad(3.32)$$
$$\frac{2}{L}\{f(\beta_t)-f(\beta_{t+1})\}\ge\|\beta_t-\beta_{t+1}\|^2\,.$$
If we add these after multiplying each by t on both sides over $t=0,1,\dots,k-1$, we have
$$\frac{2}{L}\sum_{t=0}^{k-1}t\{f(\beta_t)-f(\beta_{t+1})\}=\frac{2}{L}\sum_{t=0}^{k-1}\{tf(\beta_t)-(t+1)f(\beta_{t+1})+f(\beta_{t+1})\}
=\frac{2}{L}\Big\{-kf(\beta_k)+\sum_{t=0}^{k-1}f(\beta_{t+1})\Big\}\ge\sum_{t=0}^{k-1}t\|\beta_t-\beta_{t+1}\|^2\,. \qquad(3.33)$$
Adding (3.32) and (3.33), we obtain
$$\frac{2k}{L}\{f(\beta_*)-f(\beta_k)\}\ge\|\beta_*-\beta_k\|^2-\|\beta_*-\beta_0\|^2+\sum_{t=0}^{k-1}t\|\beta_t-\beta_{t+1}\|^2\ge-\|\beta_0-\beta_*\|^2\,,$$
Exercises 35–46
35. For the function $f(x,y):=\sqrt{x^2+y^2}$, answer the following questions.
(a) Find the subderivative of f(x, y) at (x, y) = (0, 0).
(b) Let p ≥ 2. For $\beta\in\mathbb{R}^p$, differentiate the L2 norm $\|\beta\|_2$ for β ≠ 0.
(c) The subderivative of two variables is defined by the set of $(u,v)\in\mathbb{R}^2$ such that
$$\frac{1}{2}\|y-X\beta\|_2^2+\lambda\|\beta\|_2\,. \qquad(\text{cf. }(3.5))$$
(a) Show that the necessary and sufficient condition for β = 0 to be a solution is $\|X^Ty\|_2\le\lambda$.
Hint If we take the subderivative, the condition of containing zero is as follows:
$$-X^T(y-X\beta)+\lambda\frac{\beta}{\|\beta\|_2}\ni\begin{bmatrix}0\\0\end{bmatrix} \qquad(\text{cf. }(3.8))$$
$$X^TX\beta=X^Ty-\lambda\frac{\beta}{\|\beta\|_2}\,. \qquad(\text{cf. }(3.9))$$
$$\gamma:=\beta+\nu X^T(y-X\beta) \qquad(\text{cf. }(3.10))$$
$$\beta=\left(1-\frac{\nu\lambda}{\|\gamma\|_2}\right)_{+}\gamma \qquad(\text{cf. }(3.11))$$
14 n = 100
15 p = 3
16 X = matrix(rnorm(n * p), ncol = p); beta = rnorm(p); epsilon = rnorm(
n)
17 y = 0.1 * X %*% beta + epsilon
18 ## Display the coefficients that change with lambda
19 lambda = seq(1, 50, 0.5)
20 m = length(lambda)
21 beta = matrix(nrow = m, ncol = p)
22 for (i in 1:m) {
23 est = gr(X, y, lambda[i])
24 for (j in 1:p) beta[i, j] = est[j]
25 }
26 y.max = max(beta); y.min = min(beta)
27 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
28 xlab = "lambda", ylab = "Coefficients", type = "n")
29 for (j in 1:p) lines(lambda, beta[, j], col = j + 1)
30 legend("topright", legend = paste("Coefficients", 1:p), lwd = 2, col
= 2:(p + 1))
31 segments(lambda[1], 0, lambda[m], 0)
$$\gamma_{t+1}=\beta_t+\frac{\alpha_t-1}{\alpha_{t+1}}(\beta_t-\beta_{t-1}) \qquad(\text{cf. }(3.20))$$
39. To compare the efficiency between the ISTA and FISTA, we construct the fol-
lowing program. Fill in the blanks, and execute the procedure to examine the
difference.
1 ## Data Generation
2 n = 100; p = 1 # p = 3
3 X = matrix(rnorm(n * p), ncol = p); beta = rnorm(p); epsilon = rnorm(
n)
4 y = 0.1 * X %*% beta + epsilon
5 lambda = 0.01
6 nu = 1 / max(eigen(t(X) %*% X)$values)
7 p = ncol(X)
8 m = 10
9 ## Performance of ISTA
10 beta = rep(1, p); beta.old = rep(0, p)
11 t = 0; val = matrix(0, m, p)
12 while (t < m) {
13 t = t + 1; val[t, ] = beta
14 beta.old = beta
15 gamma = ## Blank(1) ##
16 beta = ## Blank(2) ##
17 }
18 eval = array(dim = m)
19 val.final = val[m, ]; for (i in 1:m) eval[i] = norm(val[i, ] - val.
final, "2")
20 plot(1:m, ylim = c(0, eval[1]), type = "n",
21 xlab = "Repetitions", ylab = "L2 Error", main = "Comparison between ISTA and FISTA")
22 lines(eval, col = "blue")
23 ## Performance of FISTA
24 beta = rep(1, p); beta.old = rep(0, p)
25 alpha = 1; gamma = beta
26 t = 0; val = matrix(0, m, p)
27 while (t < m) {
28 t = t + 1; val[t, ] = beta
29 beta.old = beta
30 w = ## Blank(3) ##
31 beta = ## Blank(4) ##
32 alpha.old = alpha
33 alpha = ## Blank(5) ##
34 gamma = ## Blank(6) ##
35 }
36 val.final = val[m, ]; for (i in 1:m) eval[i] = norm(val[i, ] - val.
final, "2")
37 lines(eval, col = "red")
38 legend("topright", c("FISTA", "ISTA"), lwd = 1, col = c("red", "blue"
))
40. We consider extending the group Lasso procedure for one group to the case of K groups, similar to the coordinate descent method, to obtain the solution of
$$\frac{1}{2}\sum_{i=1}^{N}\Big(y_i-\sum_{k=1}^{K}z_{i,k}\theta_k\Big)^2+\lambda\sum_{k=1}^{K}\|\theta_k\|_2\,. \qquad(\text{cf. }(3.1))\ (3.35)$$
Fill in the blanks and construct the group Lasso. Then, execute the function in the following procedure to examine its behavior.
1 group.lasso = function(z, y, lambda = 0) {
2 J = length(z)
3 theta = list(); for (j in 1:J) theta[[j]] = rep(0, ncol(z[[j]]))
4 for (m in 1:10) {
5 for (j in 1:J) {
6 r = y; for (k in 1:J) {if (k != j) r = r - ## Blank(1) ##}
7 theta[[j]] = ## Blank(2) ##
8 }
9 }
10 return(theta)
11 }
12
13 ## Data Generation
14 n = 100; J = 2
15 u = rnorm(n); v = u + rnorm(n)
16 s = 0.1 * rnorm(n); t = 0.1 * s + rnorm(n); y = u + v + s + t + rnorm
(n)
17 z = list(); z[[1]] = cbind(u, v); z[[2]] = cbind(s, t)
18 ## Display of the Coefficients that change with lambda
19 lambda = seq(1, 500, 10); m = length(lambda); beta = matrix(nrow = m,
ncol = 4)
20 for (i in 1:m) {
21 est = group.lasso(z, y, lambda[i])
22 beta[i, ] = c(est[[1]][1], est[[1]][2], est[[2]][1], est[[2]][2])
23 }
24 y.max = max(beta); y.min = min(beta)
25 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
26 xlab = "lambda", ylab = "Coefficient", type = "n")
27 lines(lambda, beta[, 1], lty = 1, col = 2); lines(lambda, beta[, 2],
lty = 2, col = 2)
28 lines(lambda, beta[, 3], lty = 1, col = 4); lines(lambda, beta[, 4],
lty = 2, col = 4)
29 legend("topright", legend = c("Group1", "Group1", "Group2", "Group2"
),
30 lwd = 1, lty = c(1, 2), col = c(2, 2, 4, 4))
31 segments(lambda[1], 0, lambda[m], 0)
41. To obtain sparsity not just between groups but also within each group, we extend the formulation (3.35) to
$$\frac{1}{2}\sum_{i=1}^{N}\Big(y_i-\sum_{k=1}^{K}z_{i,k}\theta_k\Big)^2+\lambda\sum_{k=1}^{K}\{(1-\alpha)\|\theta_k\|_2+\alpha\|\theta_k\|_1\}\quad(0<\alpha<1) \qquad(\text{cf. }(3.21))\ (3.36)$$
(sparse group Lasso). While an active group contains only active variables in the ordinary group Lasso, it may contain nonactive variables in the sparse group Lasso. In other words, the sparse group Lasso extends the group Lasso and allows an active group to contain nonactive variables.
(a) Show that the minimizer of (3.36) excluding the term with the coefficient 1 − α is
$$S_{\lambda\alpha}\Big(\sum_{i=1}^{N}z_{i,k}r_{i,k}\Big)\,.$$
(b) Show that the necessary and sufficient condition for θk = 0 to be a solution is
$$\lambda(1-\alpha)\ge\Big\|S_{\lambda\alpha}\Big(\sum_{i=1}^{N}z_{i,k}r_{i,k}\Big)\Big\|_2\,.$$
(c) Show that the update formula for the ordinary group Lasso $\beta\leftarrow\left(1-\dfrac{\nu\lambda}{\|\gamma\|_2}\right)_{+}\gamma$ becomes
$$\beta\leftarrow\left(1-\frac{\nu\lambda(1-\alpha)}{\|S_{\lambda\alpha}(\gamma)\|_2}\right)_{+}S_{\lambda\alpha}(\gamma)\,. \qquad(\text{cf. }(3.23))$$
43. We consider the case where some groups overlap. For example, the groups are {1, 2, 3}, {3, 4, 5} rather than {1, 2}, {3, 4}, {5}. We prepare the variables $\theta_1,\dots,\theta_K\in\mathbb{R}^p$ such that $\beta=\sum_{k=1}^{K}\theta_k$ (see Example 31) and find the β value that minimizes
$$\frac{1}{2}\Big\|y-X\sum_{k=1}^{K}\theta_k\Big\|_2^2+\lambda\sum_{k=1}^{K}\|\theta_k\|_2$$
from the data $X\in\mathbb{R}^{N\times p}$, $y\in\mathbb{R}^N$. We write the first and last three columns of $X\in\mathbb{R}^{N\times 5}$ as $X_1\in\mathbb{R}^{N\times 3}$ and $X_2\in\mathbb{R}^{N\times 3}$ and let
$$\theta_1=\begin{bmatrix}\beta_1\\ \beta_2\\ \beta_{3,1}\\ 0\\ 0\end{bmatrix},\qquad
\theta_2=\begin{bmatrix}0\\ 0\\ \beta_{3,2}\\ \beta_4\\ \beta_5\end{bmatrix}\qquad(\beta_3=\beta_{3,1}+\beta_{3,2})\,.$$
44. For group Lasso with multiple responses, let
$$L(\beta):=L_0(\beta)+\lambda\sum_{j=1}^{p}\|\beta_j\|_2$$
with
$$L_0(\beta):=\frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{K}\Big(y_{i,k}-\sum_{j=1}^{p}x_{i,j}\beta_{j,k}\Big)^2\,.$$
Show that
$$\hat\beta_j=\frac{1}{\sum_{i=1}^{N}x_{i,j}^2}\left(1-\frac{\lambda}{\|\sum_{i=1}^{N}x_{i,j}r_i^{(j)}\|_2}\right)_{+}\sum_{i=1}^{N}x_{i,j}r_i^{(j)}$$
is the solution.
45. From the observations (x1 , y1 ), . . . , (x N , y N ) ∈ R × R, we regress yi by
f 1 (xi ) + f 2 (xi ). Let
f 1 (x; α, β) = α + βx
f 2 (x; p, q, r ) = p cos x + q cos 2x + r cos 3x
12 for (j in 1:p) {
13 r = R + as.matrix(X[, j]) %*% t(beta[j, ])
14 M = ## Blank(2) ##
15 beta[j, ] = sum(X[, j] ^ 2) ^ (-1) *
16 max(1 - lambda / t / sqrt(sum(M ^ 2)), 0) * M
17 R = r - as.matrix(X[, j]) %*% t(beta[j, ])
18 }
19 }
20 return(beta)
21 }
22 ## the procedure to execute
23 df = iris
24 X = cbind(df[[1]], df[[2]], df[[3]], df[[4]])
25 y = c(rep(1, 50), rep(2, 50), rep(3, 50))
26 lambda.seq = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150)
27 m = length(lambda.seq); p = ncol(X); K = length(table(y))
28 alpha = array(dim = c(m, p, K))
29 for (i in 1:m) {
30 res = gr.multi.lasso(X, y, lambda.seq[i])
31 for (j in 1:p) {for (k in 1:K) alpha[i, j, k] = res[j, k]}
32 }
33 plot(0, xlim = c(0, 150), ylim = c(min(alpha), max(alpha)), type = "n
",
34 xlab = "lambda", ylab = "Coefficient", main = "The coefficients
that change with lambda")
35 for (j in 1:p) {for (k in 1:K) lines(lambda.seq, alpha[, j, k], col =
j + 1)}
36 legend("topright", legend = c("Sepal Length", "Sepal Width", "Petal
Length", "Petal Width"),
37 lwd = 2, col = 2:5)
Chapter 4
Fused Lasso
Given observations $y_1,\dots,y_N$, the fused Lasso finds the $\theta_1,\dots,\theta_N$ that minimize
$$\frac{1}{2}\sum_{i=1}^{N}(y_i-\theta_i)^2+\lambda\sum_{i=1}^{N-1}|\theta_i-\theta_{i+1}| \qquad(4.1)$$
for given λ ≥ 0. An extension minimizes
$$\frac{1}{2}\sum_{i=1}^{N}(y_i-\theta_i)^2+\mu\sum_{i=1}^{N}|\theta_i|+\lambda\sum_{i=1}^{N-1}|\theta_i-\theta_{i+1}| \qquad(4.2)$$
for given μ, λ ≥ 0 (sparse fused Lasso). The extended setting penalizes the size of θi as well as the difference θi − θi+1. Another extension allows the observations $y_1,\dots,y_N$ to be of dimension p ≥ 1, in which case the difference θi − θi+1 is evaluated by the (p-dimensional) L2 norm.
Before looking at the inside of the fused Lasso procedure, we overview the application
of the CRAN package genlasso [2] to various data.
When the indices form a graph with edge set E, the objective generalizes to
$$\frac{1}{2}\sum_{i=1}^{N}(y_i-\theta_i)^2+\lambda\sum_{\{i,j\}\in E}|\theta_i-\theta_j|\,. \qquad(4.3)$$
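For the one-dimensional case (4.1), the genlasso package provides fusedlasso1d; a minimal sketch with artificial data (the data themselves and the choice of λ are assumptions) is:

library(genlasso)
y = c(rnorm(25, mean = 0), rnorm(25, mean = 3))     # piecewise-constant mean plus noise
res = fusedlasso1d(y)
theta = coef(res, lambda = 1)$beta                  # smoothed values theta_1, ..., theta_N
plot(y); lines(as.numeric(theta), col = "red", type = "s")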
Fig. 4.2 Execution of Example 35. Applying fused Lasso, we make the CGH (copy number variation) data smooth. The first graph shows the smoothing of the CGH data for all genes, and the three graphs below it show the smoothing for the first fifty genes with λ = 0.1, 1, 10. We observe that the greater λ is, the smoother (the less sensitive to the data) the generated line becomes
Fig. 4.3 The execution of Example 36. The number of people infected with the novel coronavirus in 2020 until June 9th is displayed in color. When λ = 150 (right), differences remain visible only for a small number of areas, such as Tokyo and Hokkaido, where the state of emergency declared in April lasted longer than in other prefectures; smaller differences can still be observed when λ = 50 (left)
1 library(genlasso)
2 library(NipponMap)
3 mat = read.table("adj.txt")
4 mat = as.matrix(mat) ## Adjacency matrix of 47 prefectures in
Japan
5 y = read.table("2020_6_9.txt")
6 y = as.numeric(y[[1]]) ## #infected with corona for each of the 47
prefectures
7
8 k = 0; u = NULL; v = NULL
9 for (i in 1:46) for (j in (i + 1):47) if (mat[i, j] == 1) {
10 k = k + 1; u = c(u, i); v = c(v, j)
11 }
12 m = length(u)
13 D = matrix(0, m, 47)
14 for (k in 1:m) {D[k, u[k]] = 1; D[k, v[k]] = -1}
15 res = fusedlasso(y, D = D)
16 z = coef(res, lambda = 50)$beta # lambda = 150
17 cc = round((10 - log(z)) * 2 - 1)
18 cols = NULL
19 for (k in 1:47) cols = c(cols, heat.colors(12)[cc[k]]) ## Colors for each
of 47 prefectures
20 JapanPrefMap(col = cols, main = "lambda = 50") ## a function to draw JP map
Example 37 Fused Lasso can penalize not only the differences between adjacent variables but also the second-order difference θi − 2θi+1 + θi+2, the third-order difference θi − 3θi+1 + 3θi+2 − θi+3, etc. Introducing differences of higher order in this way is called trend filtering.
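In genlasso, trend filtering of order k is available via trendfilter; a minimal sketch (the data and the choice of λ are assumptions) is:

library(genlasso)
y = sin(seq(0, 2 * pi, length.out = 50)) + rnorm(50, sd = 0.3)
res = trendfilter(y, ord = 3)                       # penalize third-order differences
theta = coef(res, lambda = 10)$beta
plot(y); lines(as.numeric(theta), col = "red")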
Fig. 4.4 Execution of Example 37 (output of the trend filtering procedure), in which the higher the
order k, the more degrees of freedom the curve has, and the easier it is for the curve to follow the
data
h_1(\theta_1,\theta_2) := \frac{1}{2}(y_1-\theta_1)^2 + \lambda|\theta_2-\theta_1|
but θ_2 remains. Then, the value of θ_1 when the value of θ_2 is known is
\hat\theta_1(\theta_2) = \begin{cases} y_1-\lambda, & y_1 > \theta_2+\lambda \\ \theta_2, & |y_1-\theta_2|\le\lambda \\ y_1+\lambda, & y_1 < \theta_2-\lambda \end{cases}.
\frac{1}{2}(y_1-\theta_1)^2 + \frac{1}{2}(y_2-\theta_2)^2 + \lambda|\theta_2-\theta_1| + \lambda|\theta_3-\theta_2| .
Then, the variables θ_1, θ_3 remain, but if we replace θ_1 with \hat\theta_1(\theta_2), the value of θ_2 when we know θ_3, i.e., the \hat\theta_2(\theta_3) that minimizes
h_2(\hat\theta_1(\theta_2),\theta_2,\theta_3) := \frac{1}{2}(y_1-\hat\theta_1(\theta_2))^2 + \frac{1}{2}(y_2-\theta_2)^2 + \lambda|\theta_2-\hat\theta_1(\theta_2)| + \lambda|\theta_3-\theta_2|
can be expressed as a function of θ_3.
Since θ̂1 (θ2 ) can be expressed by a function θ̂2 (θ3 ) of θ3 , we write θ̂1 (θ3 ). If we
continue this process, then θ̂1 (θ N ), . . . , θ̂ N −1 (θ N ) are obtained as a function of θ N .
If we substitute them into (4.1), then the problem reduces to the minimization of
h_N(\hat\theta_1(\theta_N),\ldots,\hat\theta_{N-1}(\theta_N),\theta_N) := \frac{1}{2}\sum_{i=1}^{N-1}(y_i-\hat\theta_i(\theta_N))^2 + \frac{1}{2}(y_N-\theta_N)^2 + \lambda\sum_{i=1}^{N-2}|\hat\theta_i(\theta_N)-\hat\theta_{i+1}(\theta_N)| + \lambda|\hat\theta_{N-1}(\theta_N)-\theta_N| .    (4.4)
It does not seem to be easy to find a function θ̂i (θi+1 ) except for i = 1. However,
there exist L i , Ui , i = 1, . . . , N − 1 such that
\hat\theta_i(\theta_{i+1}) = \begin{cases} L_i, & \theta_{i+1} < L_i \\ \theta_{i+1}, & L_i \le \theta_{i+1} \le U_i \\ U_i, & U_i < \theta_{i+1} \end{cases}.
In the actual procedure, we find the optimum value θ_N^* of θ_N and the remaining θ_i^* = \hat\theta_i(\theta_{i+1}^*), i = N−1, ..., 1, in a backward manner. For the details, see the Appendix. From the procedure, we have the following proposition.
Proposition 10 (N. Johnson, 2013 [14]) The algorithm that computes the fused
Lasso via dynamic programming completes in O(N ).
Based on the discussion above, we construct a function that computes the fused
Lasso.
1 clean = function(z) {
2 m = length(z)
3 j = 2; while (z[1] >= z[j] && j < m) j = j + 1
4 k = m - 1; while (z[m] <= z[k] && k > 1) k = k - 1
5 if (j > k) return(z[c(1, m)]) else return(z[c(1, j:k, m)])
6 }
7
8 fused = function(y, lambda = lambda) {
9 if (lambda == 0) return(y)
10 n = length(y)
11 L = array(dim = n - 1)
12 U = array(dim = n - 1)
13 G = function(i, theta) {
14 if (i == 1) theta - y[1]
15 else G(i - 1, theta) * (theta > L[i - 1] && theta < U[i - 1]) +
16 lambda * (theta >= U[i - 1]) - lambda * (theta <= L[i - 1]) + theta - y[i]
17 }
18 theta = array(dim = n)
19 L[1] = y[1] - lambda; U[1] = y[1] + lambda; z = c(L[1], U[1])
20 if (n > 2) for (i in 2:(n - 1)) {
21 z = c(y[i] - 2 * lambda, z, y[i] + 2 * lambda); z = clean(z)
22 m = length(z)
23 j = 1; while (G(i, z[j]) + lambda <= 0) j = j + 1
24 if (j == 1) {L[i] = z[m]; j = 2}
25 else L[i] = z[j - 1] - (z[j] - z[j - 1]) * (G(i, z[j - 1]) + lambda) /
26 (-G(i, z[j - 1])+G(i, z[j]))
27 k = m; while (G(i, z[k]) - lambda >= 0) k = k - 1
28 if (k == m) {U[i] = z[1]; k = m - 1}
29 else U[i] = z[k] - (z[k + 1] - z[k]) * (G(i, z[k]) - lambda) /
30 (-G(i, z[k]) + G(i, z[k + 1]))
31 z = c(L[i], z[j:k], U[i])
32 }
33 z = c(y[n] - lambda, z, y[n] + lambda); z = clean(z)
34 m = length(z)
35 j = 1; while (G(n, z[j]) <= 0 && j < m) j = j + 1
36 if (j == 1) theta[n] = z[1]
37 else theta[n] = z[j - 1] - (z[j] - z[j - 1]) * G(n, z[j - 1]) / (-G(n, z[j - 1]) + G(n, z[j]))
38 for (i in n:2) {
39 if (theta[i] < L[i - 1]) theta[i - 1] = L[i - 1]
40 if (L[i - 1] <= theta[i] && theta[i] <= U[i - 1]) theta[i - 1] = theta[i]
41 if (theta[i] > U[i - 1]) theta[i - 1] = U[i - 1]
42 }
43 return(theta)
44 }
We also consider minimizing (4.2) rather than (4.1) (sparse fused Lasso), which not only smooths the adjacent values but also regularizes θ_1, ..., θ_N to suppress their sizes.
In reality, we often consider the fused Lasso under μ = 0. Although the main reason is that we often do not care about the size of θ = [θ_1, ..., θ_N], there is another reasonable motivation: the solution under the extended setting can be obtained from that under the nonextended setting.
Proposition 11 (Friedman et al., 2007 [10]) Let θ(0) be the solution θ ∈ R^N under μ = 0. Then, the solution θ = θ(μ) under μ > 0 is given by S_μ^N(θ(0)), where S_μ^N : R^N → R^N is such that the i-th element (i = 1, ..., N) is S_μ(z_i) for z = [z_1, ..., z_N]^T (see (1.11)).
4.3 LARS
To understand the dual Lasso and path algorithms introduced in the next section, we consider the sparse estimation procedure called LARS (least angle regression; B. Efron et al., 2004 [9]). LARS performs essentially the same processing as Lasso, but because the amount of computation grows as O(p^2) in the number of variables p, Lasso is more often used in practice. Nevertheless, LARS is easy to analyze theoretically and is said to be rich in suggestions.
In linear Lasso, we identify the largest λ such that at least one coefficient is nonzero. Such a λ equals the largest absolute value λ_0 of ⟨x^{(j)}, y⟩ over the variables j when the squared loss ‖y − Xβ‖^2 (the first term) is not divided by N, where x^{(j)} is the j-th column of X ∈ R^{N×p} and y ∈ R^N.
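A tiny sketch of computing λ_0 (my own illustration; the data are synthetic, and the columns of X and y are centered so that no intercept is needed):

set.seed(1)
X = scale(matrix(rnorm(50 * 5), 50, 5), scale = FALSE)  # centered design matrix
y = rnorm(50); y = y - mean(y)
lambda.0 = max(abs(t(X) %*% y))  # the value of lambda at which the first variable activates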
LARS defines a piecewise linear vector β(λ) (Fig. 4.5): suppose we have obtained λ_0, λ_1, ..., λ_{k−1}, β_0 = 0, β_1, ..., β_{k−1}, and S = {j_1, ..., j_{k−1}} for k ≥ 1. Then,
1. Define Δ_{k−1} such that its first k elements are nonzero and the remaining p − k elements are zero.
2. Define
β(λ) = β_{k−1} + (λ_{k−1} − λ)Δ_{k−1}    (4.5)
and
r(λ) = y − Xβ(λ)    (4.6)
Fig. 4.5 LARS: Δ_{k−1}, with k nonzero and p − k zero elements, defines the piecewise linear vector β(λ) = β_{k−1} + (λ_{k−1} − λ)Δ_{k−1}, given β(λ_{k−1}) = β_{k−1}
for λ ≤ λk−1 .
3. Seek the largest λ_k ≤ λ_{k−1} and j_k ∉ S such that the absolute value of ⟨x^{(j_k)}, r(λ_k)⟩ is λ_k, and join j_k to the active set S.
4. Restrict the ranges of β(λ) and r(λ) to λ_k ≤ λ, and define β_k := β(λ_k).
Note that for the third step, if we define u_j := ⟨x_j, r_{k−1} − λ_{k−1}XΔ_{k−1}⟩ and v_j := −⟨x_j, XΔ_{k−1}⟩, where r_{k−1} := y − Xβ_{k−1}, we choose the largest t_j := u_j/(v_j ± 1) among j ∉ S as λ_k.
If we choose
\Delta_{k-1} := \begin{bmatrix} (X_S^T X_S)^{-1} X_S^T r_{k-1}/\lambda_{k-1} \\ 0 \end{bmatrix},
then the active variables continue to satisfy ⟨x^{(j)}, r(λ)⟩ = ±λ as λ decreases. In other words, in LARS, once a variable joins the active set, the active variables share the same relation ⟨x^{(j)}, r(λ)⟩ = ±λ until λ = 0.
For example, in the R language, we can construct the following procedure.
1 lars = function(X, y) {
2 X = as.matrix(X); n = nrow(X); p = ncol(X); X.bar = array(dim = p)
3 for (j in 1:p) {X.bar[j] = mean(X[, j]); X[, j] = X[, j] - X.bar[j]}
4 y.bar = mean(y); y = y - y.bar
5 scale = array(dim = p)
Example 38 We apply LARS to the U.S. crime data to which we applied Lasso in
Chapter 1. Although the scales of λ are different, they share a similar shape (Fig.
4.6). The figure is displayed via the above function and the following procedure.
1 df = read.table("crime.txt"); X = as.matrix(df[, 3:7]); y = df[, 1]
2 res = lars(X, y)
3 beta = res$beta; lambda = res$lambda
4 p = ncol(beta)
5 plot(0:8000, ylim = c(-7.5, 15), type = "n",
6 xlab = "lambda", ylab = "beta", main = "LARS (USA Crime Data)")
7 abline(h = 0)
8 for (j in 1:p) lines(lambda[1:(p)], beta[1:(p), j], col = j)
9 legend("topright",
10 legend = c("Annual Police Funding in $/Resident", "25 yrs.+ with 4 yrs. of High School",
11 "16 to 19 yrs. not in High School ...",
12 "18 to 24 yrs. in College",
13 "25 yrs.+ in College"),
14 col = 1:p, lwd = 2, cex = .8)
Among the five relations, the first three hold for any probability model. The fourth is true for Gaussian distributions by Proposition 16. We only show that the last relation holds for the Gaussian distribution. If either k ∈ A or k ∈ B, then the claim is apparent. If neither X_A ⊥⊥ X_k | X_C nor X_B ⊥⊥ X_k | X_C holds, then this contradicts X_A ⊥⊥ X_B | X_C because the distribution is Gaussian.
Fig. 4.6 We apply LARS to the U.S. crime data and observe that they share a similar shape
4.4 Dual Lasso Problem and Generalized Lasso
In this section, we solve the fused Lasso in a manner different from the dynamic programming approach. Although the method takes O(N^2) time for N observations, it can solve more general fused Lasso problems. First, we introduce the notion of the dual Lasso problem [29].
Consider the linear regression Lasso
\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|\beta\|_1 ,    (4.7)
where the first term is not divided by N (consider N to be contained in λ). Its minimization is equivalent to minimizing
\frac{1}{2}\|r\|_2^2 + \lambda\|\beta\|_1
subject to r = y − Xβ.
If we introduce a Lagrange multiplier α ∈ R^N and define
L(\beta,r,\alpha) := \frac{1}{2}\|r\|_2^2 + \lambda\|\beta\|_1 - \alpha^T(r-y+X\beta),
then minimizing it w.r.t. β and r gives
\min_{\beta\in\mathbb{R}^p}\{-\alpha^T X\beta + \lambda\|\beta\|_1\} = \begin{cases} 0, & \|X^T\alpha\|_\infty \le \lambda \\ -\infty, & \text{otherwise} \end{cases}
and
\min_r\left\{\frac{1}{2}\|r\|_2^2 - \alpha^T r\right\} = -\frac{1}{2}\alpha^T\alpha .
Hence, the dual problem maximizes α^T y − (1/2)α^Tα, i.e., it maximizes
\frac{1}{2}\{\|y\|_2^2 - \|y-\alpha\|_2^2\}    (4.8)
subject to ‖X^Tα‖_∞ ≤ λ. For the one-dimensional fused Lasso, the corresponding constraint becomes |α_i| ≤ λ for i = 1, ..., N − 1.
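As a rough numerical check of the Lasso dual feasibility condition (my own illustration, not part of the text): with α = y − Xβ̂, we should observe ‖X^T α‖_∞ ≤ λ. Since glmnet divides the squared loss by N, we pass λ/N to match (4.7); the options intercept = FALSE and standardize = FALSE keep the objective comparable.

library(glmnet)
set.seed(1)
N = 50; p = 5
X = matrix(rnorm(N * p), N, p); y = rnorm(N)
lam = 20
fit = glmnet(X, y, lambda = lam / N, intercept = FALSE, standardize = FALSE)
beta = as.vector(coef(fit))[-1]   # drop the (zero) intercept
alpha = y - X %*% beta
max(abs(t(X) %*% alpha))          # approximately <= lam; equality for active variables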
The same idea can be applied not just to the one-dimensional fused Lasso but also to trend filtering,
\frac{1}{2}\sum_{i=1}^{N}(y_i-\theta_i)^2 + \sum_{i=1}^{N-2}|\theta_i-2\theta_{i+1}+\theta_{i+2}| .
Both can be written, with γ := Dθ for an appropriate matrix D, as the minimization of
\frac{1}{2}\|y-\theta\|_2^2 + \lambda\|\gamma\|_1
over θ ∈ R^p subject to γ = Dθ. If we introduce a Lagrange multiplier α, then we have
\frac{1}{2}\|y-\theta\|_2^2 + \lambda\|\gamma\|_1 + \alpha^T(D\theta-\gamma) .
If we minimize this via θ, γ, we have
\min_\theta\left\{\frac{1}{2}\|y-\theta\|_2^2 + \alpha^T D\theta\right\} = \frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|y-D^T\alpha\|_2^2    (4.13)
and
\min_\gamma\{\lambda\|\gamma\|_1 - \alpha^T\gamma\} = \begin{cases} 0, & \|\alpha\|_\infty \le \lambda \\ -\infty, & \text{otherwise} \end{cases}.
Hence, the dual problem minimizes
\frac{1}{2}\|y-D^T\alpha\|_2^2
subject to ‖α‖_∞ ≤ λ. More generally, if the primary problem is the minimization of
\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|D\beta\|_1 ,    (4.14)
then the dual problem finds the α ∈ R^m that minimizes
\frac{1}{2}\|X(X^TX)^{-1}X^Ty - X(X^TX)^{-1}D^T\alpha\|_2^2    (4.15)
subject to ‖α‖_∞ ≤ λ.
For the solution path, we have
|\hat\alpha_i(\lambda)| = \lambda \implies |\hat\alpha_i(\lambda')| = \lambda' \quad \text{for } \lambda' < \lambda .    (4.18)
For the proof, consult the paper (R. Tibshirani and J. Taylor, 2011 [29]).
For each λ, each element of the solution α̂(λ) is at most λ. The path algorithm uses
Proposition 13. The condition (4.17) is met for the ordinary fused Lasso as in (4.10)
but does not hold for (4.12). However, the original paper considers a generalization
for dealing with the case, and the package genlasso implements the idea. This
book considers only the case where the matrix D satisfies (4.17).
First, let λ_1 be the maximum absolute value among the elements of the least squares solution α̂ of (4.16), and let i_1 be the index attaining it. We define the sequences {λ_k}, {i_k}, {s_k} for k = 2, ..., m as follows: for S := {i_1, ..., i_{k−1}}, among the elements of the α_{−S} that minimizes
\frac{1}{2}\|y - \lambda D_S^T s - D_{-S}^T\alpha_{-S}\|_2^2 ,    (4.19)
the maximum absolute value and the index i ∉ S attaining it are λ_k and i_k, respectively, and the sign of α̂(λ_k)_{i_k} is s_k. Then, we add i_k to S. For j ∉ S, we can compute α_j(λ_k) from (4.20); for j ∈ S, we have α_j(λ_k) = λ_k s_j.
For example, we can construct the following procedure.
1 fused.dual = function(y, D) {
2 m = nrow(D)
3 lambda = rep(0, m); s = rep(0, m); alpha = matrix(0, m, m)
4 alpha[1, ] = solve(D %*% t(D)) %*% D %*% y
5 for (j in 1:m) if (abs(alpha[1, j]) > lambda[1]) {
6 lambda[1] = abs(alpha[1, j])
7 index = j
8 if (alpha[1, j] > 0) s[j] = 1 else s[j] = -1
9 }
10 for (k in 2:m) {
11 U = solve(D[-index, ] %*% t(as.matrix(D[-index, , drop = FALSE])))
12 V = D[-index, ] %*% t(as.matrix(D[index, , drop = FALSE]))
13 u = U %*% D[-index, ] %*% y
14 v = U %*% V %*% s[index]
15 t = u / (v + 1)
16 for (j in 1:(m - k + 1)) if (t[j] > lambda[k]) {lambda[k] = t[j]; h = j; r = 1}
17 t = u / (v - 1)
18 for (j in 1:(m - k + 1)) if (t[j] > lambda[k]) {lambda[k] = t[j]; h = j; r = -1}
19 alpha[k, index] = lambda[k] * s[index]
20 alpha[k, -index] = u - lambda[k] * v
21 h = setdiff(1:m, index)[h]
22 if (r == 1) s[h] = 1 else s[h] = -1
23 index = c(index, h)
24 }
25 return(list(alpha = alpha, lambda = lambda))
26 }
β̂(λ) = y − D T α̂(λ)
1 fused.prime = function(y, D) {
2 res = fused.dual(y, D)
3 return(list(beta = t(y - t(D) %*% t(res$alpha)), lambda = res$lambda))
4 }
Example 41 Using the path algorithm introduced above, we draw the graph of the
coefficients α(λ)/β̂(λ) of the dual/primary problems with λ in the horizontal axes.
(Fig. 4.7).
1 p = 8; y = sort(rnorm(p)); m = p - 1; s = 2 * rbinom(m, 1, 0.5) - 1
2 D = matrix(0, m, p); for (i in 1:m) {D[i, i] = s[i]; D[i, i + 1] = -s[i]}
3 par(mfrow = c(1, 2))
4 res = fused.dual(y, D); alpha = res$alpha; lambda = res$lambda
5 lambda.max = max(lambda); m = nrow(alpha)
6 alpha.min = min(alpha); alpha.max = max(alpha)
7 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(alpha.min, alpha.max), type = "n",
8      xlab = "lambda", ylab = "alpha", main = "Dual Problem")
9 u = c(0, lambda); v = rbind(0, alpha); for (j in 1:m) lines(u, v[, j], col = j)
10 res = fused.prime(y, D); beta = res$beta
11 beta.min = min(beta); beta.max = max(beta)
12 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(beta.min, beta.max), type = "n",
13 xlab = "lambda", ylab = "beta", main = "Prime Problem")
14 w = rbind(0, beta); for (j in 1:p) lines(u, w[, j], col = j)
15 par(mfrow = c(1, 1))
We now consider the general problem (4.15) that contains a design matrix X ∈ R^{N×p} of rank p, when the matrix D satisfies (4.17). If we let ỹ := X(X^TX)^{-1}X^Ty and D̃ := D(X^TX)^{-1}X^T, the problem reduces to minimizing
\frac{1}{2}\|\tilde y - \tilde D^T\alpha\|_2^2 .
For example, we can construct the following R language function.
1 fused.dual.general = function(X, y, D) {
2 X.plus = solve(t(X) %*% X) %*% t(X)
3 D.tilde = D %*% X.plus
4 y.tilde = X %*% X.plus %*% y
5 return(fused.dual(y.tilde, D.tilde))
6 }
7 fused.prime.general = function(X, y, D) {
8 X.plus = solve(t(X) %*% X) %*% t(X)
9 D.tilde = D %*% X.plus
10 y.tilde = X %*% X.plus %*% y
11 res = fused.dual.general(X, y, D)
12 m = nrow(D); p = ncol(X)
13 beta = matrix(0, m, p)
14 for (k in 1:m) beta[k, ] = X.plus %*% (y.tilde - t(D.tilde) %*% res$alpha[k, ])
15 return(list(beta = beta, lambda = res$lambda))
16 }
Fig. 4.7 Execution of Example 41: the solution paths of the dual (left) and primary (right) problems
w.r.t. p = 8 and m = 7. We choose the one-dimensional fused Lasso for the matrix D. In both paths,
the solution paths merge as λ decreases. The solutions α ∈ Rm and β ∈ R p of the dual and primary
problems are shown by the lines of seven and eight colors, respectively
Example 42 We show the solution paths of the dual and primary problems when D is the unit matrix and when D expresses the one-dimensional fused Lasso (4.10) (Fig. 4.8). For the unit matrix, we observe the linear Lasso path that we have seen thus far. However, for (4.10), D is not the unit matrix, and the nature of the path is different. These executions are produced by the following code.
1 n = 20; p = 10; beta = rnorm(p + 1)
2 X = matrix(rnorm(n * p), n, p); y = cbind(1, X) %*% beta + rnorm(n)
3 # D = diag(p) ## Use one of the two D
4 D = array(dim = c(p - 1, p))
5 for (i in 1:(p - 1)) {D[i, ] = 0; D[i, i] = 1; D[i, i + 1] = -1}
6 par(mfrow = c(1, 2))
7 res = fused.dual.general(X, y, D); alpha = res$alpha; lambda = res$lambda
8 lambda.max = max(lambda); m = nrow(alpha)
9 alpha.min = min(alpha); alpha.max = max(alpha)
10 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(alpha.min, alpha.max), type = "n",
11      xlab = "lambda", ylab = "alpha", main = "Dual Problem")
12 u = c(0, lambda); v = rbind(0, alpha); for (j in 1:m) lines(u, v[, j], col = j)
13 res = fused.prime.general(X, y, D); beta = res$beta
14 beta.min = min(beta); beta.max = max(beta)
15 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(beta.min, beta.max), type = "n",
16 xlab = "lambda", ylab = "beta", main = "Primary Problem")
17 w = rbind(0, beta); for (j in 1:p) lines(u, w[, j], col = j)
18 par(mfrow = c(1, 1))
Fig. 4.8 Execution of Example 42. The solution paths of the dual and primary paths when the
design matrix is not the unit matrix. a When D is the unit matrix, the problem reduces to the linear
Lasso. b When D expresses the one-dimensional fused Lasso, a completely different shape appears
in the solution path
g(β1 , . . . , β p ) + λh(β1 , . . . , β p )
4.5 ADMM
We wish to find the α ∈ R^m and β ∈ R^n such that
A\alpha + B\beta = c    (4.21)
that minimize
f(\alpha) + g(\beta) .    (4.22)
In general, weak duality holds: the maximum value of the dual problem does not exceed the minimum value of the primary problem. Although the equality may or may not hold, in this case it is known that the equality holds under a regularity condition on the objective function and the constraints (Slater's condition).
In this book, we introduce a constant ρ > 0 to define the extended (augmented) Lagrangian
L_\rho(\alpha,\beta,\gamma) = f(\alpha) + g(\beta) + \gamma^T(A\alpha+B\beta-c) + \frac{\rho}{2}\|A\alpha+B\beta-c\|^2 .    (4.23)
For the (generalized) fused Lasso, if we set
L_\rho(\alpha,\beta,\gamma) := \frac{1}{2}\|y-\alpha\|_2^2 + \lambda\|\beta\|_1 + \gamma^T(D\alpha-\beta) + \frac{\rho}{2}\|D\alpha-\beta\|_2^2 ,
then we have
\frac{\partial L_\rho}{\partial\alpha} = \alpha - y + D^T\gamma_t + \rho D^T(D\alpha-\beta_t)
\frac{\partial L_\rho}{\partial\beta} = -\gamma_t + \rho(\beta - D\alpha_{t+1}) + \lambda\begin{cases} 1, & \beta>0 \\ [-1,1], & \beta=0 \\ -1, & \beta<0 \end{cases},
where the last term is understood elementwise as a subderivative.
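One possible way to fill in these updates as a function admm (a minimal sketch of mine following the skeleton of Exercise 60, with ρ = 1 and the variable names θ, γ, μ used there; the θ-update solves the linear system given by the first derivative, the γ-update soft-thresholds, and μ is the scaled dual variable):

soft.th = function(lambda, x) sign(x) * pmax(abs(x) - lambda, 0)
admm = function(y, D, lambda) {
  K = ncol(D); L = nrow(D); rho = 1
  theta.old = rnorm(K); theta = rnorm(K); gamma = rnorm(L); mu = rnorm(L)
  while (max(abs(theta - theta.old) / abs(theta.old)) > 0.001) {
    theta.old = theta
    ## theta-update: (I + rho D^T D) theta = y + D^T (rho gamma - mu)
    theta = solve(diag(K) + rho * t(D) %*% D, y + t(D) %*% (rho * gamma - mu))
    ## gamma-update: elementwise soft-thresholding of D theta + mu / rho
    gamma = soft.th(lambda / rho, D %*% theta + mu / rho)
    ## dual update
    mu = mu + rho * (D %*% theta - gamma)
  }
  return(as.vector(theta))
}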
Fig. 4.9 Solution path of the CGH data in Example 35 (coefficients plotted against λ, 0 ≤ λ ≤ 0.5)
Example 44 We apply the CGH data in Example 35 to the ADMM to obtain the
solution path of the primary problem (Fig. 4.9). For the ADMM, unlike the path
algorithm in the previous section, we obtain θ ∈ R N for a specific λ ≥ 0.
1 df = read.table("cgh.txt"); y = df[[1]][101:110]; N = length(y)
2 D = array(dim = c(N - 1, N)); for (i in 1:(N - 1)) {D[i, ] = 0; D[i, i] = 1; D[i, i + 1] = -1}
3 lambda.seq = seq(0, 0.5, 0.01); M = length(lambda.seq)
4 theta = list(); for (k in 1:M) theta[[k]] = admm(y, D, lambda.seq[k])
5 x.min = min(lambda.seq); x.max = max(lambda.seq)
6 y.min = min(theta[[1]]); y.max = max(theta[[1]])
7 plot(lambda.seq, xlim = c(x.min, x.max), ylim = c(y.min, y.max), type = "n",
8      xlab = "lambda", ylab = "Coefficients", main = "Fused Lasso Solution Path")
9 for (k in 1:N) {
10 value = NULL; for (j in 1:M) value = c(value, theta[[j]][k])
11 lines(lambda.seq, value, col = k)
12 }
S. Boyd et al., Distributed optimization and statistical learning via the alternating direction method of multipliers (2011) [5], does not assume that A, B are of full rank and does not prove the second property. D. Bertsekas et al., Convex analysis and optimization (2003) [4], proves the second property, assuming that B is the unit matrix. Neither derivation contains anything complicated, although it is lengthy. In particular, S. Boyd (2011) [5] can be easily understood.
Appendix: Proof of Propositions
Proposition 10 (N. Johnson, 2013 [14]) The algorithm that computes the fused Lasso via dynamic programming completes in O(N).
Consider
h_i(\hat\theta_1(\theta_i),\ldots,\hat\theta_{i-1}(\theta_i),\theta_i) := \frac{1}{2}\sum_{j=1}^{i-1}\{y_j-\hat\theta_j(\theta_i)\}^2 + \frac{1}{2}(y_i-\theta_i)^2 + \lambda\sum_{j=1}^{i-2}|\hat\theta_j(\theta_i)-\hat\theta_{j+1}(\theta_i)| + \lambda|\hat\theta_{i-1}(\theta_i)-\theta_i| ,
which is obtained by removing the last term \lambda|\theta_i-\theta_{i+1}^*| from
h_i(\hat\theta_1(\theta_i),\ldots,\hat\theta_{i-1}(\theta_i),\theta_i,\theta_{i+1}^*) = \frac{1}{2}\sum_{j=1}^{i-1}\{y_j-\hat\theta_j(\theta_i)\}^2 + \frac{1}{2}(y_i-\theta_i)^2 + \lambda\sum_{j=1}^{i-2}|\hat\theta_j(\theta_i)-\hat\theta_{j+1}(\theta_i)| + \lambda|\hat\theta_{i-1}(\theta_i)-\theta_i| + \lambda|\theta_i-\theta_{i+1}^*| .
We differentiate it by θ_i to define
g_i(\theta_i) := -\sum_{j=1}^{i-1}\frac{d\hat\theta_j(\theta_i)}{d\theta_i}\{y_j-\hat\theta_j(\theta_i)\} - (y_i-\theta_i) + \lambda\frac{d}{d\theta_i}\sum_{j=1}^{i-2}|\hat\theta_j(\theta_i)-\hat\theta_{j+1}(\theta_i)| + \lambda\frac{d}{d\theta_i}|\hat\theta_{i-1}(\theta_i)-\theta_i| .
As we notice later, we will have either \hat\theta_j(\theta) = \theta + a or \hat\theta_j(\theta) = b for constants a, b, which means that d\hat\theta_j(\theta_i)/d\theta_i is either zero or one for j = 1, ..., i − 1. For the other terms that contain absolute values, we cannot differentiate and instead apply the subderivative.
Thus, for each of the cases, we can obtain θ_i^* = \hat\theta_i(\theta_{i+1}^*):
1. For θ_i > θ_{i+1}^*, differentiating \lambda|\theta_i-\theta_{i+1}^*| by θ_i gives λ. Then, the θ_i value does not depend on θ_{i+1}^*. We let the solution θ_i of
\frac{d}{d\theta_i}h_i(\hat\theta_1(\theta_i),\ldots,\hat\theta_{i-1}(\theta_i),\theta_i,\theta_{i+1}^*) = g_i(\theta_i) + \lambda = 0
be \hat\theta_i(\theta_{i+1}^*) := L_i.
2. For θ_i < θ_{i+1}^*, differentiating \lambda|\theta_i-\theta_{i+1}^*| by θ_i gives −λ. Then, the θ_i value does not depend on θ_{i+1}^*. We let the solution θ_i of
\frac{d}{d\theta_i}h_i(\hat\theta_1(\theta_i),\ldots,\hat\theta_{i-1}(\theta_i),\theta_i,\theta_{i+1}^*) = g_i(\theta_i) - \lambda = 0
be \hat\theta_i(\theta_{i+1}^*) := U_i.
3. For the other cases, i.e., for L_i < θ_{i+1}^* < U_i, we have \hat\theta_i(\theta_{i+1}^*) = θ_{i+1}^*.
For example, if i = 1, then we have g_1(θ_1) = θ_1 − y_1. If θ_1 > θ_2^*, then solving g_1(θ_1) + λ = 0, we have \hat\theta_1(θ_2^*) = y_1 − λ = L_1. If θ_1 < θ_2^*, then solving g_1(θ_1) − λ = 0, we have \hat\theta_1(θ_2^*) = y_1 + λ = U_1. Furthermore, if L_1 < θ_2^* < U_1, we have \hat\theta_1(θ_2^*) = θ_2^*. Thus, we have
g_2(\theta_2) = \frac{d}{d\theta_2}\left\{\frac{1}{2}(y_1-\hat\theta_1(\theta_2))^2 + \frac{1}{2}(y_2-\theta_2)^2 + \lambda|\hat\theta_1(\theta_2)-\theta_2|\right\} = \begin{cases}\theta_2-y_2-\lambda, & \theta_2<L_1\\ 2\theta_2-y_2-y_1, & L_1\le\theta_2\le U_1\\ \theta_2-y_2+\lambda, & U_1<\theta_2\end{cases} = g_1(\theta_2)I[L_1\le\theta_2\le U_1] + \lambda I[\theta_2>U_1] - \lambda I[\theta_2<L_1] + \theta_2-y_2 ,
and, in general,
g_i(\theta_i) = g_{i-1}(\theta_i)I[L_{i-1}\le\theta_i\le U_{i-1}] + \lambda I[\theta_i>U_{i-1}] - \lambda I[\theta_i<L_{i-1}] + \theta_i - y_i .    (4.24)
On the other hand, g_i(θ_i) is piecewise linear with nonnegative slope, and the knots are L_1, ..., L_{i−1}, U_1, ..., U_{i−1}. The sum of the first three terms of (4.24) lies between −λ and λ, so the solutions of g_i(θ_i) ± λ = 0 range over y_i − 2λ ≤ θ_i ≤ y_i + 2λ. In fact, from (4.24), we have
g_i(\theta_i) \begin{cases} > \lambda, & \theta_i > y_i + 2\lambda \\ < -\lambda, & \theta_i < y_i - 2\lambda \end{cases}.
There are at most 2i knots, including these two, which we express as x_1 < ... < x_{2i}. If g_i(x_k) + λ ≤ 0 and g_i(x_{k+1}) + λ ≥ 0, then
L_i := x_k + (x_{k+1}-x_k)\frac{|g_i(x_k)+\lambda|}{|g_i(x_k)+\lambda| + |g_i(x_{k+1})+\lambda|} .
For i = N, we solve
\frac{d}{d\theta_N}h_N(\hat\theta_1(\theta_N),\ldots,\hat\theta_{N-1}(\theta_N),\theta_N) = g_N(\theta_N) = 0
rather than g_N(θ_N) ± λ = 0.
Finally, we evaluate the efficiency of the procedure. If
x1 < · · · < xr ≤ L i−1 < xr +1 < · · · < xs−1 < Ui−1 ≤ xs < · · · < x2i ,
Proposition 11 (Friedman et al., 2007 [10]) Let θ(0) be the solution θ ∈ R^N under μ = 0. Then, the solution θ = θ(μ) under μ > 0 is given by S_μ^N(θ(0)), where S_μ^N : R^N → R^N is such that the i-th element (i = 1, ..., N) is S_μ(z_i) for z = [z_1, ..., z_N]^T (see (1.11)).
such that
Moreover, since from the definition of (4.26), we observe that Sμ (x) monotonically
decreases with x ∈ R, we have
Hence, if we regard ti (0), ti (μ) as a set similar to {1}, [−1, 1], {−1}, we have
Similarly, we have
u i (0) ⊆ u i (μ) . (4.29)
From (4.26), we have θi (μ) = θi (0) ± μ and θi (μ) = 0 for the i such that |θi (0)| >
μ and for the i such that |θi (0)| ≤ μ, respectively.
For the first case, we have
Moreover, from si (0) ⊆ si (μ), we may assume si (μ) = si (0). From (4.28), (4.29),
if we substitute the ti (0), u i (0) values into ti (μ), u i (μ), the relation holds. Hence,
substituting θi (μ) = θi (0) − si (0)μ and (si (μ), ti (μ), u i (μ)) = (si (0), ti (0), u i (0))
into (4.25), from (4.25), (4.27), we have
For the latter case, the subderivative of |θ_i(μ)| should lie in [−1, 1] for the i such that |θ_i(0)| ≤ μ. Thus, if we set s_i(μ) = θ_i(0)/μ ∈ [−1, 1], then s_i(μ) is a valid subderivative of |θ_i(μ)|. Also, from (4.28), (4.29), the relation holds even if we substitute the t_i(0), u_i(0) values into t_i(μ), u_i(μ). Thus, substituting θ_i(μ) = 0 and (s_i(μ), t_i(μ), u_i(μ)) = (θ_i(0)/μ, t_i(0), u_i(0)) into f_i(μ) = 0, from (4.25), (4.27), we have
f_i(\mu) = -y_i + \mu\cdot\frac{\theta_i(0)}{\mu} + \lambda t_i(0) + \lambda u_i(0) = 0 .
We also show that if the primary problem is the minimization of
\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|D\beta\|_1 ,    (4.14)
then the dual problem finds the α ∈ R^m that minimizes
\frac{1}{2}\|X(X^TX)^{-1}X^Ty - X(X^TX)^{-1}D^T\alpha\|_2^2    (4.15)
subject to ‖α‖_∞ ≤ λ. In fact, with γ := Dβ, the primary problem is the minimization of
\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|\gamma\|_1
w.r.t. β ∈ R^p. Introducing the Lagrange multiplier α, we have
\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|\gamma\|_1 + \alpha^T(D\beta-\gamma),
and if we take the minimization of this over β, γ, then we have
\min_\beta\left\{\frac{1}{2}\|y-X\beta\|_2^2 + \alpha^TD\beta\right\} = \frac{1}{2}\|y - X(X^TX)^{-1}(X^Ty-D^T\alpha)\|_2^2 + \alpha^TD(X^TX)^{-1}(X^Ty-D^T\alpha)
and
\min_\gamma\{\lambda\|\gamma\|_1 - \alpha^T\gamma\} = \begin{cases} 0, & \|\alpha\|_\infty\le\lambda \\ -\infty, & \text{otherwise} \end{cases}.
Exercises 47–61
\frac{1}{2}\sum_{i=1}^{N}(y_i-\theta_i)^2 + \lambda\sum_{i=2}^{N}|\theta_i-\theta_{i-1}|  (cf. (4.1))    (4.30)
case and control patients that are measured by CGH (comparative genomic
hybridization). Download the data from
https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/cgh.html
fill in the blank, and execute the procedure to observe what smoothing pro-
cedure can be obtained for λ = 0.1, 1, 10, 100, 1000.
1 library(genlasso)
2 df = read.table("cgh.txt"); y = df[[1]]; N = length(y)
3 theta = ## Blank ##
4 plot(1:N, theta, lambda = 0.1, xlab = "gene #", ylab = "Copy Number Variant",
5      col = "red", type = "l")
6 points(1:N, y, col = "blue")
(b) The fused Lasso can consider the differences w.r.t. the second-order θi −
2θi+1 + θi+2 , w.r.t. the third-order θi − 3θi+1 + 3θi+2 − θi+3 , etc., as well as
w.r.t. the first order θi − θi+1 (trend filtering). Using the function
trendfilter prepared in the genlasso package, we wish to execute
trend filtering for the data sin θ (0 ≤ θ ≤ 2π) with random noise added.
Determine the value of λ appropriately, and then execute trendfilter
of the k = 1, 2, 3, 4-th order to show the graph similarly to (a).
1 library(genlasso)
2 ## Data Generation
3 N = 100; y = sin((1:N) / N * 2 * pi) + rnorm(N, sd = 0.3)
4 out = ## Blank ##
5 plot(out, lambda = lambda) ## Smoothing and Output
48. We wish to obtain the θ1 , . . . , θ N that minimize (4.30) via dynamic programming
from the observations y1 , . . . , y N .
(a) The condition of minimization w.r.t. θ1
h_1(\theta_1,\theta_2) := \frac{1}{2}(y_1-\theta_1)^2 + \lambda|\theta_2-\theta_1|
contains another variable θ_2. The optimum θ_1 when the θ_2 value is known can be written as
\hat\theta_1(\theta_2) = \begin{cases} y_1-\lambda, & y_1\ge\theta_2+\lambda \\ \theta_2, & |y_1-\theta_2|<\lambda \\ y_1+\lambda, & y_1\le\theta_2-\lambda \end{cases}.
Next, consider the minimization of
\frac{1}{2}(y_1-\theta_1)^2 + \frac{1}{2}(y_2-\theta_2)^2 + \lambda|\theta_2-\theta_1| + \lambda|\theta_3-\theta_2|
and define
h_2(\hat\theta_1(\theta_2),\theta_2,\theta_3) := \frac{1}{2}(y_1-\hat\theta_1(\theta_2))^2 + \frac{1}{2}(y_2-\theta_2)^2 + \lambda|\theta_2-\hat\theta_1(\theta_2)| + \lambda|\theta_3-\theta_2| ,
whose minimizer \hat\theta_2(\theta_3) when the θ_3 value is known can be written as a function of θ_3. Show that the problem of finding θ_1, ..., θ_N reduces to finding the θ_N value that minimizes
\frac{1}{2}\sum_{i=1}^{N-1}(y_i-\hat\theta_i(\theta_N))^2 + \frac{1}{2}(y_N-\theta_N)^2 + \lambda\sum_{i=1}^{N-2}|\hat\theta_i(\theta_N)-\hat\theta_{i+1}(\theta_N)| + \lambda|\hat\theta_{N-1}(\theta_N)-\theta_N|
(cf. (4.4)).
(b) When we obtain θ N in (a), how can we obtain the θ1 , . . . , θ N −1 values?
49. When we solve the fused Lasso via dynamic programming, the computation
time is proportional to the length N of y ∈ R N . Explain the fundamental reason
for this fact.
50. When we find the value of θi that minimizes
\frac{1}{2}\sum_{i=1}^{N}(y_i-\theta_i)^2 + \lambda\sum_{i=1}^{N}|\theta_i| + \mu\sum_{i=2}^{N}|\theta_i-\theta_{i-1}|  (cf. (4.2)) ,    (4.31)
the subgradient (stationarity) condition can be written as f_i(λ) = 0 for each i, where we assume that (s_i(0), t_i(0), u_i(0)) exists such that f_i(0) = 0.
(a) Show that −yi + θi (0) + μti (0) + μu i (0) = 0.
(b) For i, j = 1, . . . , N (i = j), show that θi (0) = θ j (0) =⇒ θi (λ) = θ j (λ)
and θi (0) < θ j (0) =⇒ θi (λ) ≤ θ j (λ).
(c) If we regard the solution ti (0), ti (λ) as the set ({1}, [−1, 1], {−1}), show that
ti (0) ⊆ ti (λ).
(d) When |θi (0)| ≥ λ, show that (si (λ), ti (λ), u i (λ)) = (si (0), ti (0), u i (0)) is
the solution of f i (λ) = 0.
(e) When |θi (0)| < λ, show that (si (λ), ti (λ), u i (λ)) = (θi (0)/λ, ti (0), u i (0))
is the solution of f i (λ) = 0.
(f) Let θ̂i (λ1 , λ2 ) (i = 1, . . . , N ) be the θi that minimizes (4.31). Show that
θ̂i (λ1 , λ2 ) = Sλ1 (θ̂i (0, λ2 )).
51. When we minimize (4.31) based on Exercise 50, what process is required to obtain the solutions for λ ≠ 0 after obtaining the solution for λ = 0? Specify the R language function.
52. In linear regression Lasso, there exists a smallest λ such that no variables are active. Let λ_0 be such a λ. In LARS, it is the maximum value of |⟨r_0, x^{(j)}⟩| over j when r_0 := y. Let S be the set of indices of the active variables; initially, S = {j} for the j attaining this maximum. In LARS, if we decrease the λ value, only the coefficients of the active variables increase. Show that if an index j becomes active at λ, then
⟨x^{(j)}, r(λ')⟩ = ±λ'
for λ' ≤ λ, where r(λ') is the residual for λ' and x^{(j)} expresses the j-th column of the matrix X.
53. We construct the LARS function as follows. Fill in the blanks, and execute the
procedure.
1 lars = function(X, y) {
2 X = as.matrix(X); n = nrow(X); p = ncol(X); X.bar = array(dim = p)
3 for (j in 1:p) {X.bar[j] = mean(X[, j]); X[, j] = X[, j] - X.bar[j]}
4 y.bar = mean(y); y = y - y.bar
5 scale = array(dim = p)
6 for (j in 1:p) {scale[j] = sqrt(sum(X[, j] ^ 2) / n); X[, j] = X[, j] / scale[j]}
7 beta = matrix(0, p + 1, p); lambda = rep(0, p + 1)
8 for (i in 1:p) {
9 lam = abs(sum(X[, i] * y))
10 if (lam > lambda[1]) {i.max = i; lambda[1] = lam}
11 }
12 r = y; index = i.max; Delta = rep(0, p)
13 for (k in 2:p) {
14 Delta[index] = solve(t(X[, index]) %*% X[, index]) %*%
15 t(X[, index]) %*% r / lambda[k - 1]
16 u = t(X[, -index]) %*% (r - lambda[k - 1] * X %*% Delta)
17 v = -t(X[, -index]) %*% (X %*% Delta)
18 t = ## Blank(1) ##
19 for (i in 1:(p - k + 1)) if (t[i] > lambda[k]) {lambda[k] = t[i]; i.max = i}
20 t = u / (v - 1)
21 for (i in 1:(p - k + 1)) if (t[i] > lambda[k]) {lambda[k] = t[i]; i.max = i}
22 j = setdiff(1:p, index)[i.max]
23 index = c(index, j)
24 beta[k, ] = ## Blank(2) ##
25 r = y - X %*% beta[k, ]
26 }
27 for (k in 1:(p + 1)) for (j in 1:p) beta[k, j] = beta[k, j] / scale[j]
54. From fused Lasso, we consider the extended formulation (generalized Lasso):
for X ∈ R N × p , y ∈ R N , β ∈ R p , and D ∈ Rm× p (m ≤ p), we minimize
\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|D\beta\|_1 .    (4.32)
Why is the ordinary linear regression Lasso a particular case of (4.32)? How
about the ordinary fused Lasso? What D should be given for the two cases
below (trend filtering)?
i. \frac{1}{2}\sum_{i=1}^{N}(y_i-\theta_i)^2 + \sum_{i=1}^{N-2}|\theta_i-2\theta_{i+1}+\theta_{i+2}|
ii. \frac{1}{2}\sum_{i=1}^{N}(y_i-\theta_i)^2 + \sum_{i=1}^{N-2}\left|\frac{\theta_{i+2}-\theta_{i+1}}{x_{i+2}-x_{i+1}} - \frac{\theta_{i+1}-\theta_i}{x_{i+1}-x_i}\right|  (cf. (4.13))    (4.33)
55. Derive the dual problem from each of the primary problems below.
(a) For X ∈ R^{N×p}, y ∈ R^N, and λ > 0, find the β ∈ R^p that minimizes
\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|\beta\|_1  (cf. (4.7))
(Primary). Under ‖X^Tα‖_∞ ≤ λ, find the α ∈ R^N that maximizes
\frac{1}{2}\{\|y\|_2^2 - \|y-\alpha\|_2^2\}
(Dual).
(b) For λ ≥ 0 and D ∈ R^{m×N}, find the θ ∈ R^N that minimizes
\frac{1}{2}\|y-\theta\|_2^2 + \lambda\|D\theta\|_1
(Primary). Under ‖α‖_∞ ≤ λ, find the α ∈ R^m that minimizes
\frac{1}{2}\|y-D^T\alpha\|_2^2
(Dual).
(c) For X ∈ R^{N×p}, y ∈ R^N, and D ∈ R^{m×p} (m ≥ 1), find the β ∈ R^p that minimizes
\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|D\beta\|_1
(Primary). Under ‖α‖_∞ ≤ λ, find the α ∈ R^m that minimizes
\frac{1}{2}\|X(X^TX)^{-1}X^Ty - X(X^TX)^{-1}D^T\alpha\|_2^2  (cf. (4.15))
(Dual).
56. Suppose that we solve the fused Lasso as a dual problem. Let λ_1 and i_1 be the largest absolute value and its index i among the elements of the least squares solution α̂ of (1/2)‖y − D^Tα‖_2^2. For k = 2, ..., m, we execute the following procedure to define the sequences {λ_k}, {i_k}, {s_k}. For S := {i_1, ..., i_{k−1}}, let λ_k and i_k be the largest absolute value and its index i among the elements of the α_{−S} that minimizes
\frac{1}{2}\|y - \lambda D_S^T s - D_{-S}^T\alpha_{-S}\|_2^2  (cf. (4.19)) .
If we set the i ∉ S-th element as a_i − λb_i, what are a_i, b_i? Moreover, how can we obtain i_k, λ_k?
57. We construct a program that obtains the solutions of the dual and primary fused
Lasso problems. Fill in the blanks, and execute the procedure.
1 fused.dual = function(y, D) {
2 m = nrow(D)
3 lambda = rep(0, m); s = rep(0, m); alpha = matrix(0, m, m)
4 alpha[1, ] = solve(D %*% t(D)) %*% D %*% y
5 for (j in 1:m) if (abs(alpha[1, j]) > lambda[1]) {
6 lambda[1] = abs(alpha[1, j])
7 index = j
8 if (alpha[1, j] > 0) ## Blank(1) ##
9 }
10 for (k in 2:m) {
11 U = solve(D[-index, ] %*% t(as.matrix(D[-index, , drop = FALSE])))
12 V = D[-index, ] %*% t(as.matrix(D[index, , drop = FALSE]))
13 u = U %*% D[-index, ] %*% y
14 v = U %*% V %*% s[index]
15 t = u / (v + 1)
16 for (j in 1:(m - k + 1)) if (t[j] > lambda[k]) {lambda[k] = t[j]; h = j; r = 1}
17 t = u / (v - 1)
18 for (j in 1:(m - k + 1)) if (t[j] > lambda[k]) {lambda[k] = t[j]; h = j; r = -1}
19 alpha[k, index] = ## Blank(2) ##
20 alpha[k, -index] = ## Blank(3) ##
21 h = setdiff(1:m, index)[h]
22 if (r == 1) s[h] = 1 else s[h] = -1
23 index = c(index, h)
24 }
25 return(list(alpha = alpha, lambda = lambda))
26 }
27 m = p - 1; D = matrix(0, m, p); for (i in 1:m) {D[i, i] = 1; D[i, i + 1] = -1}
28 fused.prime = function(y, D){
29 res = fused.dual(y, D)
30 return(list(beta = t(y - t(D) %*% t(res$alpha)), lambda = res$lambda))
31 }
32 p = 8; y = sort(rnorm(p)); m = p - 1; s = 2 * rbinom(m, 1, 0.5) - 1
33 D = matrix(0, m, p); for (i in 1:m) {D[i, i] = s[i]; D[i, i + 1] = -s[i]}
34 par(mfrow = c(1, 2))
35 res = fused.dual(y, D); alpha = res$alpha; lambda = res$lambda
36 lambda.max = max(lambda); m = nrow(alpha)
37 alpha.min = min(alpha); alpha.max = max(alpha)
38 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(alpha.min, alpha.max),
39      type = "n", xlab = "lambda", ylab = "alpha", main = "Dual Problem")
40 u = c(0, lambda); v = rbind(0, alpha); for (j in 1:m) lines(u, v[, j], col = j)
41 res = fused.prime(y, D); beta = res$beta
42 beta.min = min(beta); beta.max = max(beta)
43 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(beta.min, beta.max),
44      type = "n", xlab = "lambda", ylab = "beta", main = "Primary Problem")
45 w = rbind(0, beta); for (j in 1:p) lines(u, w[, j], col = j)
46 par(mfrow = c(1, 1))
58. If the design matrix is not the unit matrix for the generalized Lasso, then for X^+ := (X^TX)^{-1}X^T, ỹ := XX^+y, D̃ := DX^+, the problem reduces to minimizing
\frac{1}{2}\|\tilde y - \tilde D^T\alpha\|_2^2 .
Extend the functions fused.dual and fused.prime to fused.dual.general and fused.prime.general, and execute the following procedure.
to write
L_\rho(\alpha,\beta,\gamma) := f(\alpha) + g(\beta) + \gamma^T(A\alpha+B\beta-c) + \frac{\rho}{2}\|A\alpha+B\beta-c\|^2  (cf. (4.23)) .
If we repeat the updates expressed by the three equations below, we can obtain the optimum solution under some conditions (ADMM, alternating direction method of multipliers):
α_{t+1} ← the α ∈ R^m that minimizes L_ρ(α, β_t, γ_t)
β_{t+1} ← the β ∈ R^n that minimizes L_ρ(α_{t+1}, β, γ_t)
γ_{t+1} ← γ_t + ρ(Aα_{t+1} + Bβ_{t+1} − c)
For the generalized fused Lasso, we set
L_\rho(\theta,\gamma,\mu) := \frac{1}{2}\|y-\theta\|_2^2 + \lambda\|\gamma\|_1 + \mu^T(D\theta-\gamma) + \frac{\rho}{2}\|D\theta-\gamma\|_2^2 .
Answer the following questions.
(a) What are A ∈ Rd×m , B ∈ Rd×n , c ∈ Rd , f : Rm → R, and g : Rn → R?
Hint:
\frac{\partial L_\rho}{\partial\theta} = \theta - y + D^T\mu_t + \rho D^T(D\theta-\gamma_t)
\frac{\partial L_\rho}{\partial\gamma} = -\mu_t + \rho(\gamma - D\theta_{t+1}) + \lambda\begin{cases} 1, & \gamma>0 \\ [-1,1], & \gamma=0 \\ -1, & \gamma<0 \end{cases}
60. Fill in the blanks below, and construct the function admm that realizes the gen-
eralized Lasso.
1 admm = function(y, D, lambda) {
2 K = ncol(D); L = nrow(D)
3 theta.old = rnorm(K); theta = rnorm(K); gamma = rnorm(L); mu = rnorm(L)
4 rho = 1
5 while (max(abs(theta - theta.old) / theta.old) > 0.001) {
6 theta.old = theta
7 theta = ## Blank(1) ##
8 gamma = ## Blank(2) ##
9 mu = mu + ## Blank(3) ##
10 }
11 return(theta)
12 }
61. Using Exercise 60, for each of the following cases, fill in the blanks to execute
the procedure.
(a) Lasso in (4.33) (trend filtering). The data can be downloaded from
https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/
airPollution.txt
13 theta = ## Blank(2) ##
14 plot(x, theta, xlab = "Temperature (F)", ylab = "Ozon", col = "red", type = "l")
15 points(x, y, col = "blue")
(b) See the solution path merge in the fused Lasso, where we use the CGH data
in Exercise 47 (a).
1 df = read.table("cgh.txt"); y = df[[1]][101:110]; N = length(y)
2 D = array(dim = c(N - 1, N))
3 for (i in 1:(N - 1)) {D[i, ] = 0; D[i, i] = 1; D[i, i + 1] = -1}
4 lambda.seq = seq(0, 0.5, 0.01); M = length(lambda.seq)
5 theta = list(); for (k in 1:M) theta[[k]] = ## Blank(3) ##
6 x.min = min(lambda.seq); x.max = max(lambda.seq)
7 y.min = min(theta[[1]]); y.max = max(theta[[1]])
8 plot(lambda.seq, xlim = c(x.min, x.max), ylim = c(y.min, y.max), type = "n",
9      xlab = "lambda", ylab = "Coefficient", main = "Fused Lasso Solution Path")
10 for (k in 1:N) {
11 value = NULL; for (j in 1:M) value = c(value, theta[[j]][k])
12 lines(lambda.seq, value, col = k)
13 }
Chapter 5
Graphical Models
In this chapter, we examine the problem of estimating the structure of a graphical model from observations. In a graphical model, each vertex is regarded as a variable, and edges express the dependency between them (conditional independence). In particular, we assume a so-called sparse situation, in which the number of variables is large relative to the number of samples, and consider the problem of connecting as edges the vertex pairs with a certain degree of dependency or more. In the structural estimation of a graphical model assuming sparsity, the so-called undirected graph, with no edge orientation, is often used.
In this chapter, we first introduce the theory of graphical models, especially the concept of conditional independence and the separation of undirected graphs. Then, we learn the algorithms of graphical Lasso, structural estimation of the graphical model using the pseudo-likelihood, and the joint graphical Lasso.
In this chapter, we first introduce the theory of graphical models, especially the
concept of conditional independence and the separation of undirected graphs. Then,
we learn the algorithms of graphical Lasso, structural estimation of the graphical
model using the pseudo-likelihood, and joint graphical Lasso.
The graphical models dealt with in this book are related to undirected graphs. For p ≥ 1, we define V := {1, ..., p}, and E is a subset of {{i, j} | i ≠ j, i, j ∈ V}. In particular, V is called a vertex set, its elements are called vertices, E is called an edge set, and its elements are called edges. The undirected graph consisting of them is written as (V, E). For subsets A, B, C of V, when all paths connecting A and B pass through some vertex of C, C is said to separate A and B, and we write A ⊥⊥_E B | C (Fig. 5.1).
Fig. 5.1 Vertex sets A, B separated (or not) by another vertex set: the blue and green vertices are separated by the red ones in the three undirected graphs above and are not separated in the three below. In each of the three upper graphs, there is a red vertex on any path connecting green and blue vertices, while in each of the three lower graphs there is at least one path connecting green and blue vertices that does not contain any red vertex. For example, in the lower-left graph, the two left blue vertices can reach the green vertex without passing through a red vertex. The two blue and two green vertices communicate with each other without passing through a red vertex in the lower-center graph. In the lower-right graph, the two blue and one green vertices communicate without passing through a red vertex.
X A ⊥⊥ P X B | X C ⇐⇒ A ⊥⊥ E B | C . (5.1)
However, in general, such an edge set E of (V, E) may not exist for some probabilities
P.
Example 45 Let X, Y be random variables that take zeros and ones equiprobably,
and assume that Z is the residue of X + Y when divided by two. Because (X, Y )
takes four values equiprobably, we have
and find that X, Y are independent. However, each of (X, Z), (Y, Z), (X, Y, Z) takes four values equiprobably, and Z takes two values equiprobably. Thus, we have
P(X,Z)P(Y,Z) = \frac{1}{4}\cdot\frac{1}{4} \ne \frac{1}{4}\cdot\frac{1}{2} = P(X,Y,Z)P(Z) ,
which means that X, Y are not conditionally independent given Z. From the implication ⟸ in (5.1), we find that X, Z, Y cannot be connected in this order. If we know the value of Z, then we also know whether X = Y or X ≠ Y, which means that they are not independent given Z. In addition, neither (X, Z) nor (Y, Z) is conditionally independent given the remaining variable, and they should be connected as edges. Hence, we need to connect all three edges, although the graph then cannot express X ⊥⊥_P Y. Therefore, we cannot distinguish by any undirected graph between the CI relation in this example and the case where no nontrivial CI relation exists.
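A quick simulation of this example (my own check, not from the text): X, Y are fair coins and Z = (X + Y) mod 2; X and Y look independent marginally but are perfectly dependent once Z is known.

set.seed(1)
x = rbinom(10000, 1, 0.5); y = rbinom(10000, 1, 0.5); z = (x + y) %% 2
cor(x, y)                  # approximately 0: X and Y are independent
cor(x[z == 0], y[z == 0])  # equal to 1 given Z = 0: not conditionally independent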
However, if we connect all the edges, the implication ⇐= in (5.1) trivially holds.
We say that an undirected graph for probability P is a Markov network of P if the
number of edges is minimal and the implication ⇐= in (5.1) holds.
However, for Gaussian variables, we can obtain the CI properties from the covariance matrix Σ ∈ R^{p×p}. For simplicity, we assume that the means of the variables are zero. Let Θ be the inverse matrix of Σ, and write the probability density function as
f(x) = \sqrt{\frac{\det\Theta}{(2\pi)^p}}\exp\left\{-\frac{1}{2}x^T\Theta x\right\} \quad (x\in\mathbb{R}^p) .
Then, the necessary and sufficient condition for disjoint subsets A, B, C of {1, ..., p} to satisfy X_A ⊥⊥ X_B | X_C is that, for all x,
-\log\det\Sigma_{A\cup C} - \log\det\Sigma_{B\cup C} + \log\det\Sigma_{A\cup B\cup C} + \log\det\Sigma_C = x_{A\cup C}^T\Sigma_{A\cup C}^{-1}x_{A\cup C} + x_{B\cup C}^T\Sigma_{B\cup C}^{-1}x_{B\cup C} - x_{A\cup B\cup C}^T\Sigma_{A\cup B\cup C}^{-1}x_{A\cup B\cup C} - x_C^T\Sigma_C^{-1}x_C .    (5.3)
we have
X A ⊥⊥ P X B | X C , C ⊆ D =⇒ X A ⊥⊥ P X B | X D .
X A ⊥⊥ P X B | X C ⇐⇒ X B ⊥⊥ P X A |X C (5.4)
Fig. 5.2 In Proposition 17, for the Gaussian variables, the existence of each edge in the Markov
network is equivalent to the nonzero of the corresponding element in the precision matrix. The
colors red, blue, green, yellow, black, and pink in the left and right correspond
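A small numerical sketch of this correspondence (my own illustration, not from the text): for a Gaussian chain 1 − 2 − 3, the precision matrix has a zero in the (1, 3) entry, reflecting the conditional independence of X_1 and X_3 given X_2.

Theta = matrix(c(2, 1, 0,
                 1, 2, 1,
                 0, 1, 2), 3, 3)   # zero at (1,3): no edge between vertices 1 and 3
Sigma = solve(Theta)               # the corresponding covariance matrix
round(solve(Sigma), 3)             # inverting back recovers the zero pattern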
X_A ⊥⊥_P (X_B ∪ X_D) | X_C =⇒ X_A ⊥⊥_P X_B | X_C, X_A ⊥⊥_P X_D | X_C    (5.5)
X_A ⊥⊥_P X_B | (X_C ∪ X_D), X_A ⊥⊥_P X_D | (X_B ∪ X_C) =⇒ X_A ⊥⊥_P (X_B ∪ X_D) | X_C    (5.6)
X_A ⊥⊥_P X_B | X_C =⇒ X_A ⊥⊥_P X_B | (X_C ∪ X_D)    (5.7)
X_A ⊥⊥_P X_B | X_C =⇒ X_A ⊥⊥_P X_k | X_C or X_k ⊥⊥_P X_B | X_C    (5.8)
for k ∉ A ∪ B ∪ C, where A, B, C, D are disjoint subsets of V for each equation.
Given observations x_{i,j} (i = 1, ..., N; j = 1, ..., p), for \bar x_j := \frac{1}{N}\sum_{i=1}^N x_{i,j}, one might consider computing the sample covariance matrix S = (s_{j,k}) whose elements are s_{j,k} := \frac{1}{N}\sum_{i=1}^N (x_{i,j}-\bar x_j)(x_{i,k}-\bar x_k) and testing whether each element of the corresponding precision matrix is zero or not.
However, we immediately find that this process causes problems. First, the matrix S does not always have an inverse. In particular, under the sparse situation, we assume that the number p of variables is large relative to the sample size N, such as when finding which genes among p = 10,000 are associated with a disease from the gene expression data of N = 100 cases/controls. S is nonnegative definite and can be expressed as A^TA using a matrix A ∈ R^{N×p}. If p > N, then the rank of S ∈ R^{p×p} is at most N (< p), and S is singular.
Even if p < N, because we estimate from samples, it is unlikely that the elements of the estimate corresponding to A × B happen to be exactly zero, even if X_A ⊥⊥ X_B | X_C. Thus, we cannot draw the correct conclusion without applying something similar to statistical testing.
5.2 Graphical Lasso
In the previous section, we observed that under p > N, no correct estimation of the CI relations among X_1, ..., X_p is obtained even if we correctly estimate the sample covariance matrix S. In this section, using Lasso, we consider distinguishing whether each element θ_{i,j} of the precision matrix is zero or not (graphical Lasso [12]).
We first note that if we express the Gaussian density function of p variables as f(x_1, ..., x_p), the log-likelihood can be computed as
\frac{1}{N}\sum_{i=1}^N \log f(x_{i,1},\ldots,x_{i,p}) = \frac{1}{2}\log\det\Theta - \frac{p}{2}\log(2\pi) - \frac{1}{2N}\sum_{i=1}^N\sum_{j=1}^p\sum_{k=1}^p x_{i,j}\theta_{j,k}x_{i,k} = \frac{1}{2}\{\log\det\Theta - p\log(2\pi) - \mathrm{trace}(S\Theta)\} ,    (5.9)
where we have used
\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^p\sum_{k=1}^p x_{i,j}\theta_{j,k}x_{i,k} = \sum_{j=1}^p\sum_{k=1}^p \theta_{j,k}\,\frac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,k} = \sum_{j=1}^p\sum_{k=1}^p s_{j,k}\theta_{j,k} .
We then maximize the penalized log-likelihood
\frac{1}{N}\sum_{i=1}^N \log f(x_i) - \frac{\lambda}{2}\sum_{j\ne k}|\theta_{j,k}|
with respect to Θ.
\sum_{j=1}^p a_{i,j}b_{j,k} = \sum_{j=1}^p a_{i,j}(-1)^{k+j}\frac{|A_{k,j}|}{|A|} = \begin{cases} 1, & i=k \\ 0, & i\ne k \end{cases}.    (5.11)
Hereafter, we denote such a matrix B as A^{-1}. Then, we note that the differentiation of |A| w.r.t. a_{i,j} is (−1)^{i+j}|A_{i,j}|. In fact, if we let i = k in (5.11), then the coefficient of a_{i,j} in |A| is (−1)^{i+j}|A_{i,j}|. In addition, if we differentiate trace(SΘ) = \sum_i\sum_j s_{i,j}\theta_{i,j} with respect to the (k,h)-th element θ_{k,h} of Θ, it becomes the (k,h)-th element s_{k,h} of S. Moreover, if we differentiate log det Θ by θ_{k,h}, we obtain the (h,k)-th element of Θ^{-1}. Because Θ is symmetric, Θ^{-1} is also symmetric:
\Theta^T = \Theta \implies (\Theta^{-1})^T = (\Theta^T)^{-1} = \Theta^{-1} .
Setting the derivative of the objective to zero gives
\Theta^{-1} - S - \lambda\Psi = 0 ,    (5.12)
where ψ_{j,k} is the sign of θ_{j,k} when θ_{j,k} ≠ 0 and lies in [−1, 1] otherwise.
The solution can be obtained by formulating the Lagrangian of the nonnegative definite problem and applying the KKT conditions:
1. Θ ⪰ 0
2. M ⪰ 0, Θ^{-1} − S − λΨ + M = 0
3. ⟨Θ, M⟩ = 0,
where M denotes the multiplier matrix, A ⪰ 0 indicates that A is nonnegative definite, and ⟨A, B⟩ denotes the inner product trace(A^TB) of matrices A, B. Because Θ, M ⪰ 0, ⟨Θ, M⟩ = 0 means trace(ΘM) = trace(Θ^{1/2}MΘ^{1/2}) = 0, which further means that M = 0 because Θ^{1/2}MΘ^{1/2} ⪰ 0. Thus, we obtain (5.12).
Next, we obtain the solution of (5.12). Let W be the inverse matrix of Θ. Decomposing the matrices Ψ, S, Θ, W ∈ R^{p×p}, with the size of the upper-left blocks being (p−1) × (p−1), we write
\Psi = \begin{bmatrix}\Psi_{1,1} & \psi_{1,2}\\ \psi_{2,1} & \psi_{2,2}\end{bmatrix},\quad S = \begin{bmatrix}S_{1,1} & s_{1,2}\\ s_{2,1} & s_{2,2}\end{bmatrix},\quad \Theta = \begin{bmatrix}\Theta_{1,1} & \theta_{1,2}\\ \theta_{2,1} & \theta_{2,2}\end{bmatrix}
\begin{bmatrix}W_{1,1} & w_{1,2}\\ w_{2,1} & w_{2,2}\end{bmatrix}\begin{bmatrix}\Theta_{1,1} & \theta_{1,2}\\ \theta_{2,1} & \theta_{2,2}\end{bmatrix} = \begin{bmatrix}I_{p-1} & 0\\ 0 & 1\end{bmatrix},    (5.13)
where we assume θ_{2,2} > 0. If Θ is positive definite, the eigenvalues of W are the inverses of those of Θ, and W is positive definite. Indeed, multiplying Θ from both sides by the vector whose first p − 1 elements are zero and whose p-th element is one, the resulting quadratic form, i.e., θ_{2,2}, must be positive.
Then, from the upper-right part of both sides of (5.12), we have
w_{1,2} - s_{1,2} - \lambda\psi_{1,2} = 0 ,    (5.14)
where the upper-right part of Θ^{-1} is w_{1,2}. From the upper-right part of both sides of (5.13), we have
W_{1,1}\theta_{1,2} + w_{1,2}\theta_{2,2} = 0 .    (5.15)
If we put \beta = [\beta_1,\ldots,\beta_{p-1}]^T := -\frac{\theta_{1,2}}{\theta_{2,2}}, then from (5.14), (5.15), we have the following equations:
Fig. 5.3 The core part of graphical Lasso (Friedman et al., 2008)[12]. Update W1,1 , w1,2 , where
the yellow and green elements are W1,1 and w1,2 , respectively. The same value as that of w1,2 is
stored in w2,1 . In this case ( p = 4), we repeat the four steps. The diagonal elements remain the
same as those of S
\beta_j = \begin{cases} \dfrac{c_j-\lambda}{a_{j,j}}, & c_j > \lambda \\ 0, & -\lambda < c_j < \lambda \\ \dfrac{c_j+\lambda}{a_{j,j}}, & c_j < -\lambda \end{cases}.    (5.18)
In the last stage, for j = 1, ..., p, if we execute the following single cycle, we obtain the estimate of Θ.
For the data, we detect whether an edge exists using the R language function
graph.lasso, which recognizes whether each element of the matrix Theta is
zero.
1 Theta
1 graph.lasso(s)
For the data generated above, we draw the undirected graphs for λ = 0, 0.015, 0.03, 0.05 in Fig. 5.4. We observe that the larger the value of λ is, the smaller the number of edges in the graph.
In actual data processing with graphical Lasso, the glasso R package is often used (the option rho is used to specify λ). In this book, we constructed the function graph.lasso, but its processing is not optimized in finer details. In the rest of this section, we use glasso.
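A brief sketch of calling glasso directly (my own illustration on synthetic data; glasso(s, rho) returns the estimated inverse covariance matrix in the component $wi):

library(glasso)
set.seed(1)
x = matrix(rnorm(100 * 5), 100, 5)
s = var(x)
fit = glasso(s, rho = 0.1)
round(fit$wi, 3)   # estimated precision matrix; a larger rho yields more zeros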
When we draw a graph from the numerical output of the R language, for both directed and undirected graphs, the igraph R package is used. Using the package, we draw an undirected graph such that (i, j) is connected as an edge if and only if a_{i,j} ≠ 0 for a symmetric matrix A = (a_{i,j}) of size p. Moreover, we use it to draw an undirected graph from a precision matrix Θ. For example, we can construct the following function adj.
1 library(igraph)
2 adj = function(mat) {
3 p = ncol(mat); ad = matrix(0, nrow = p, ncol = p)
4 for (i in 1:(p - 1)) for (j in (i + 1):p) {
5 if (mat[i, j] == 0) ad[i, j] = 0 else ad[i, j] = 1
6 }
7 g = graph.adjacency(ad, mode = "undirected")
8 plot(g)
9 }
Fig. 5.4 In Example 47, we fix an undirected graph and generate data based on the graph. From the data, we estimate the undirected graph. We observe that the larger the value of λ is, the smaller the number of edges in the graph
Fig. 5.5 We draw the output of glasso w.r.t. the first 200 genes for λ = 0.75. For display, we
used a drawing application (Cytoscape) rather than igraph
Example 49 (Breast Cancer) We execute graphical Lasso for the dataset breast-
cancer.csv (N = 250, p = 1000). The purpose of the procedure is to find the relation
among the gene expressions of breast cancer patients. When we execute the follow-
ing program, we observe that the smaller the value of λ is, the longer the execution
time. We draw the output of glasso w.r.t. the first 200 genes out of p = 1000 for
λ = 0.75 (Fig. 5.5). The execution is carried out using the following code.
1 library(glasso); library(igraph)
2 df = read.csv("breastcancer.csv")
3 w = matrix(nrow = 250, ncol = 1000)
4 for (i in 1:1000) w[, i] = as.numeric(df[[i]])
5 x = w; s = t(x) %*% x / 250
6 fit = glasso(s, rho = 0.75); sum(fit$wi == 0)
7 y = NULL; z = NULL
8 for (i in 1:999) for (j in (i + 1):1000) if (fit$wi[i, j] != 0) {y = c(y, i); z = c(z, j)}
9 edges = cbind(y, z)
10 write.csv(edges,"edges.csv")
The edges of the undirected graph are stored in edges.csv. We input the file to
Cytoscape to obtain the output.
Although thus far we have addressed the algorithm of graphical Lasso, we have not mentioned how to set the value of λ or the correctness of graphical Lasso for a large sample size N (missing/extra edges and parameter estimation).
Ravikumar et al. (2011) [23] presented a result that theoretically guarantees the correctness of the graphical Lasso procedure: there exist a parameter α and functions f, g determined w.r.t. Σ such that when λ is f(α)/√N,
1. the probability that the maximum element of |Θ − Θ̂| exceeds g(α)/√N can be made arbitrarily small, and
2. the probability that the edge sets in Θ and Θ̂ coincide can be made arbitrarily large
as the sample size N grows, where Θ̂ is the estimate of Θ obtained by graphical Lasso.
Although the paper suggests that λ should be O(1/√N), the parameter α is not known and cannot be estimated from the N samples, which means that the λ that guarantees the correctness cannot be obtained. In addition, the theory assumes a range of α, and we cannot know from the samples whether the condition is met. Even if the λ value is known, the paper does not prove that the choice of λ is optimal. However, Ref. [23] claims that consistency can be obtained for a broad range of α for large N and that the parameter estimation improves at the rate O(1/√N).
5.3 Estimation of the Graphical Model Based on the Quasi-likelihood
If the response is a continuous or discrete variable, we can infer the relevant covariates via linear or logistic regression, respectively. Moreover, each covariate may be either discrete or continuous.
For a response X_i, if the covariates are X_k (k ∈ π_i ⊆ {1, ..., p}\{i}), we consider the subset π_i to be the parent set of X_i. Even if we obtain the optimum parent set for each response, we cannot necessarily maximize the (regularized) likelihood of the data jointly. If the graph has a tree/forest structure, and if we seek the parent set from the root to the leaves, it may be possible
Fig. 5.6 The graphical model obtained via the quasi-likelihood method from the Breast Cancer
Dataset. When all the variables are continuous (left, Example 50), and when all the variables are
binary (right, Example 51)
to obtain the exact likelihood. However, we may wonder where the root should be,
why we may artificially assume the structure, etc.
Thus, to approximately realize maximizing the likelihood, we obtain the parent
set πi (i = 1, . . . , p) of each vertex (response) independently. We may choose one
of the following:
1. j ∈ πi and i ∈ π j =⇒ Connect i, j as an edge (AND rule)
2. j ∈ πi or i ∈ π j =⇒ Connect i, j as an edge (OR rule),
where the choice of λ may be different for each response; a small sketch of the two rules follows.
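A tiny sketch of combining per-response supports into an undirected graph (my own illustration; ad[i, j] = 1 means variable j was selected as a covariate for response i, and the helper names and.rule/or.rule are hypothetical):

and.rule = function(ad) ad * t(ad)           # edge only if both directions select each other
or.rule  = function(ad) pmin(ad + t(ad), 1)  # edge if either direction selects the other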
Although this quasi-likelihood method [20] may not be theoretical, it can be
applied to general situations, in contrast to graphical Lasso:
1. the distribution over the p variables may not be Gaussian
2. some of the p variables may be discrete.
A demerit of the method is that when λ is small, particularly for large p, it is rather
time-consuming compared with graphical Lasso.
If we realize it using the R language, we may use the glmnet package: the options
family = "gaussian" (default), family = "binomial", and family
= "multinomial" are available for continuous, binary, and multiple variables,
respectively.
1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 n = 250; p = 50; w = matrix(nrow = n, ncol = p)
4 for (i in 1:p) w[, i] = as.numeric(df[[i]])
5 x = w[, 1:p]; fm = rep("gaussian", p); lambda = 0.1
6 fit = list()
7 for (j in 1:p) fit[[j]] = glmnet(x[, -j], x[, j], family = fm[j], lambda = lambda)
8 ad = matrix(0, p, p)
9 for (i in 1:p) for (j in 1:(p - 1)) {
10 k = j
11 if (j >= i) k = j + 1
12 if (fit[[i]]$beta[j] != 0) ad[i, k] = 1 else ad[i, k] = 0
13 }
14 ## AND
15 for (i in 1:(p - 1)) for (j in (i + 1):p) {
16 if (ad[i, j] != ad[j, i]) {ad[i, j] = 0; ad[j, i] = 0}
17 }
18 u = NULL; v = NULL
19 for (i in 1:(p - 1)) for (j in (i + 1):p) {
20 if (ad[i, j] == 1) {u = c(u, i); v = c(v, j)}
21 }
22 u
1 [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
2 [23] 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5
3 ......................................................................
4 [199] 30 30 31 31 31 31 31 31 31 32 32 33 33 34 34 34 35 35 36 36 36 37
5 [221] 37 37 38 38 38 38 38 40 40 42 42 43 43 43 44 44 45 45 47 47 47 49
1 v
1 [1] 2 12 13 18 22 23 24 28 41 46 48 50 4 5 10 16 18 21 23 26 27 34
2 [23] 37 43 9 10 15 19 20 21 24 29 38 50 5 8 17 18 21 23 46 9 20 21
3 ......................................................................
4 [199] 48 50 34 36 39 44 47 49 50 33 45 37 49 35 36 45 41 42 38 39 45 38
5 [221] 42 46 39 45 46 49 50 41 42 49 50 47 49 50 48 50 46 50 48 49 50 50
1 adj(ad)
2 ## OR
3 for (i in 1:(p - 1)) for (j in (i + 1):p) {
4 if (ad[i, j] != ad[j, i]) {ad[i, j] = 1; ad[j, i] = 1}
5 }
6 adj(ad)
1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 w = matrix(nrow = 250, ncol = 1000); for (i in 1:1000) w[, i] = as.numeric(df[[i]])
4 w = (sign(w) + 1) / 2 ## transforming it to binary
5 p = 50; x = w[, 1:p]; fm = rep("binomial", p); lambda = 0.15
6 fit = list()
7 for (j in 1:p) fit[[j]] = glmnet(x[, -j], x[, j], family = fm[j], lambda = lambda)
8 ad = matrix(0, nrow = p, ncol = p)
9 for (i in 1:p) for (j in 1:(p - 1)) {
10 k = j
11 if (j >= i) k = j + 1
12 if (fit[[i]]$beta[j] != 0) ad[i, k] = 1 else ad[i, k] = 0
13 }
14 for (i in 1:(p - 1)) for (j in (i + 1):p) {
15 if (ad[i, j] != ad[j, i]) {ad[i, j] = 0; ad[j, i] = 0}
16 }
17 sum(ad); adj(ad)
We may mix continuous and discrete variables by setting one of the options
"gaussian","binomial","multinomial" for each response in the array
fm of Examples 50 and 51.
Thus far, we have generated undirected graphs from the data X ∈ R N × p . Next, we
consider generating a graph from each of X 1 ∈ R N1 × p and X 2 ∈ R N2 × p (N1 + N2 =
N ) (joint graphical Lasso). We may assume a supervised learning with y ∈ {1, 2} N
such that X 1 , X 2 are generated from the rows i of X with yi = 1 and yi = 2. The
idea is to utilize the N samples when the two graphs share similar properties: if we
generate two graphs from X 1 and X 2 separately, the estimates will be poorer than
those obtained by applying the joint graphical Lasso. For example, suppose that we
generate two gene regulatory networks for the case and control expression data of
p genes. In this case, the two graphs would be similar, although some edges may be
different.
The joint graphical Lasso (JGL; Danaher-Witten, 2014) [8] extends the maximization of (5.10) to that of
\sum_{k=1}^K N_k\{\log\det\Theta_k - \mathrm{trace}(S_k\Theta_k)\} - P(\Theta_1,\ldots,\Theta_K),    (5.21)
where the penalty P is either the fused penalty
P(\Theta_1,\ldots,\Theta_K) := \lambda_1\sum_k\sum_{i\ne j}|\theta_{i,j}^{(k)}| + \lambda_2\sum_{k<k'}\sum_{i,j}|\theta_{i,j}^{(k)} - \theta_{i,j}^{(k')}|    (5.22)
or the group penalty
P(\Theta_1,\ldots,\Theta_K) := \lambda_1\sum_k\sum_{i\ne j}|\theta_{i,j}^{(k)}| + \lambda_2\sum_{i\ne j}\sqrt{\sum_k \left(\theta_{i,j}^{(k)}\right)^2} .    (5.23)
L_\rho(\Theta,Z,U) := -\sum_{k=1}^K N_k\{\log\det\Theta_k - \mathrm{trace}(S_k\Theta_k)\} + P(Z_1,\ldots,Z_K) + \rho\sum_{k=1}^K\langle U_k,\ \Theta_k-Z_k\rangle + \frac{\rho}{2}\sum_{k=1}^K\|\Theta_k-Z_k\|_F^2
Setting the derivative w.r.t. Θ_k to zero gives
-N_k(\Theta_k^{-1} - S_k) + \rho(\Theta_k - Z_k + U_k) = 0 ,
i.e.,
\Theta_k^{-1} - \frac{\rho}{N_k}\Theta_k = S_k - \rho\frac{Z_k}{N_k} + \rho\frac{U_k}{N_k} .
Eigendecomposing the right-hand side as VDV^T, the solution is Θ_k = V\tilde D V^T with
\tilde D_{j,j} = \frac{N_k}{2\rho}\left(-D_{j,j} + \sqrt{D_{j,j}^2 + 4\rho/N_k}\right) .
The second step executes fused Lasso among different classes k. For K = 2,
genlasso and other fused Lasso procedures do not work; thus, we set up the
function b.fused. If the difference between y1 and y2 is at most 2λ, then θ1 =
θ2 = (y1 + y2 )/2. Otherwise, we subtract λ from the larger of y1 , y2 and add λ to
the smaller one. The procedure to obtain λ1 from λ2 is due to the sparse fused Lasso
discussed in Sect. 4.2.
We construct the JGL for fused Lasso as follows.
1 # genlasso works only when the size is at least three
2 b.fused = function(y, lambda) {
3 if (y[1] > y[2] + 2 * lambda) {a = y[1] - lambda; b = y[2] + lambda}
4 else if (y[1] < y[2] - 2 * lambda) {a = y[1] + lambda; b = y[2] - lambda}
5 else {a = (y[1] + y[2]) / 2; b = a}
6 return(c(a, b))
7 }
8 # fused Lasso that compares not only the adjacent terms but also all adjacency values
9 fused = function(y, lambda.1, lambda.2) {
10 K = length(y)
11 if (K == 1) theta = y
12 else if (K == 2) theta = b.fused(y, lambda.2)
13 else {
14 L = K * (K - 1) / 2; D = matrix(0, nrow = L, ncol = K)
15 k = 0
16 for (i in 1:(K - 1)) for (j in (i + 1):K) {
17 k = k + 1; D[k, i] = 1; D[k, j] = -1
18 }
19 out = genlasso(y, D = D)
20 theta = coef(out, lambda = lambda.2)
21 }
22 theta = soft.th(lambda.1, theta)
23 return(theta)
24 }
25 # Joint Graphical Lasso
26 jgl = function(X, lambda.1, lambda.2) { # X is given as a list
27 K = length(X); p = ncol(X[[1]]); n = array(dim = K); S = list()
28 for (k in 1:K) {n[k] = nrow(X[[k]]); S[[k]] = t(X[[k]]) %*% X[[k]] / n[k]}
29 rho = 1; lambda.1 = lambda.1 / rho; lambda.2 = lambda.2 / rho
30 Theta = list(); for (k in 1:K) Theta[[k]] = diag(p)
31 Theta.old = list(); for (k in 1:K) Theta.old[[k]] = diag(rnorm(p))
32 U = list(); for (k in 1:K) U[[k]] = matrix(0, nrow = p, ncol = p)
33 Z = list(); for (k in 1:K) Z[[k]] = matrix(0, nrow = p, ncol = p)
34 epsilon = 0; epsilon.old = 1
35 while (abs((epsilon - epsilon.old) / epsilon.old) > 0.0001) {
36 Theta.old = Theta; epsilon.old = epsilon
37 ## Update (i)
38 for (k in 1:K) {
39 mat = S[[k]] - rho * Z[[k]] / n[k] + rho * U[[k]] / n[k]
40 svd.mat = svd(mat)
41 V = svd.mat$v
42 D = svd.mat$d
43 DD = n[k] / (2 * rho) * (-D + sqrt(D ^ 2 + 4 * rho / n[k]))
44 Theta[[k]] = V %*% diag(DD) %*% t(V)
45 }
46 ## Update (ii)
47 for (i in 1:p) for (j in 1:p) {
48 A = NULL; for (k in 1:K) A = c(A, Theta[[k]][i, j] + U[[k]][i, j])
For the JGL with the group Lasso penalty, we only need to replace update (ii). Let A_k[i,j] := Θ_k[i,j] + U_k[i,j]. Then, no update is required for i = j, and
Z_k[i,j] = \mathcal{S}_{\lambda_1/\rho}(A_k[i,j])\left(1 - \frac{\lambda_2}{\rho\sqrt{\sum_{k'=1}^K \mathcal{S}_{\lambda_1/\rho}(A_{k'}[i,j])^2}}\right)_+
for i ≠ j.
We construct the following code, in which no functions b.fused and genlasso
are required.
1 ## Replace the update (ii) by the following
2 for (i in 1:p) for (j in 1:p) {
3 A = NULL; for (k in 1:K) A = c(A, Theta[[k]][i, j] + U[[k]][i, j])
4 if (i == j) B = A
5 else {B = soft.th(lambda.1 / rho,A) *
6 max(1 - lambda.2 / rho / sqrt(norm(soft.th(lambda.1 / rho, A), "2"
) ^ 2), 0)}
7 for (k in 1:K) Z[[k]][i, j] = B[k]
8 }
We observe that the larger the value of λ1 is, the more sparse the graphs. On the other
hand, the larger the value of λ2 is, the more similar the graphs (Fig. 5.7).
Appendix: Proof of Propositions 165
8 4 2 3
9 9
6 10 7 1 5 4 2 3 8 6 10 7 1 5
4 2 3
9 8 9
6 10 7 1 5 4 2 3 8 6 10 7 1 5
(c) λ1 = 3, λ2 = 0.03
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
1 0 1 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0
2 1 1 1 1 0 0 1 0 2 0 1 1 1 0 0 1 0
3 0 0 0 1 0 0 1 3 0 0 0 1 0 1 1
4 1 1 1 0 0 1 4 1 1 1 0 0 0
5 0 1 1 1 1 5 1 0 1 0 1
6 0 1 1 0 6 0 0 1 1
7 0 1 1 7 0 1 1
8 1 1 8 1 1
9 1 9 1
3 5 1 10 3 5 1 10
4 6 7 8 2 9 4 6 7 8 2 9
Fig. 5.8 Execution of Example 52 (group Lasso, K = 2) with (λ1 , λ) = (10, 0.01)
We execute the group Lasso JGL (Fig. 5.8). From the example, little difference
can be observed from the fused Lasso JGL.
in −1
a b e−1 −e−1 bd −1
= , (5.24)
c d −ed c d + d −1 ce−1 bd −1
−1 −1
we have
⎡ ⎤−1
A A AC 0
A∪C ∗
= −1 = ⎣ C A CC C B ⎦
∗ ∗
0 BC B B
⎡ −1 ⎤
AA AC
∗⎦
= ⎣ C A CC − C B ( B B )−1 BC ,
∗ ∗
which means
AA AC
( A∪C )−1 = , (5.25)
C A CC − C B ( B B )−1 BC
Similarly, we have
−1
AA 0 AC
(C )−1 = CC − C A C B
0 B B BC
= CC − C A ( A A )−1 AC − C B ( B B )−1 BC .
Thus, we have
Appendix: Proof of Propositions 167
I ( A A )−1 AC
det = det(C )−1 . (5.26)
C A CC − C B ( B B )−1 BC
det(C )
det A∪C = . (5.27)
det A A
we have
⎡ ⎤
0 ⎡ ⎤
⎢ ⎥ 0
C A ⎥ = det ⎣ B∪C
AA
det = det ⎢
⎣
⎦ C A
⎦
0
B∪C 0 C A AA
C A
0 det A A
= det B∪C − A A 0 C A · det A A = . (5.28)
C A det B∪C
det C
det = .
det A∪C det B∪C
X A ⊥⊥ P X B | X C ⇐⇒ X B ⊥⊥ P X A |X C (5.4)
X A ⊥⊥ P (X B ∪ X D ) | X C =⇒ X A ⊥⊥ P X B |X C , X A ⊥⊥ P X D | X C (5.5)
X A ⊥⊥ P (X C ∪ X D ) | X B , X A ⊥⊥ E X D | (X B ∪ X C ) =⇒ X A ⊥⊥ E (X B ∪ X D ) | X C (5.6)
168 5 Graphical Models
X A ⊥⊥ P X B | X C =⇒ X A ⊥⊥ P X B | (X C ∪ X D ) (5.7)
X A ⊥⊥ P X B | X C =⇒ X A ⊥⊥ P X k | X C or X k ⊥⊥ E X B |X C (5.8)
for k ∈
/ A ∪ B ∪ C, where A, B, C, D are disjoint subsets of V for each equation.
X A ⊥⊥ P X B | X C ⇐⇒ X i ⊥⊥ P X j | X C i ∈ A, j ∈ B (5.29)
XA ⊥
⊥ P X B | X C , X A ⊥⊥ P X D | X C =⇒ X A ⊥
⊥ P X B | (X C ∪ X D ), X A ⊥
⊥ P X D | (X C ∪ X B )
=⇒ X A ⊥
⊥ P (X B ∪ X D ) | X C ,
where (5.6) and (5.7) have been used. Thus, it is sufficient to show
X i ⊥⊥ P X j | X S ⇐⇒ i ⊥⊥ E j | S
Exercises 62–75
62. In what graph among (a) through (f) do the red vertices separate the blue and
green vertices?
Exercises 62–75 169
63. Let X, Y be binary variables that take zeros and ones and are independent, and
let Z be the residue of the sum when divided by two. Show that X, Y are not
conditionally independent given Z . Note that the probabilities p and q of X = 1
and Y = 1 are not necessarily 0 and 5 and that we do not assume p = q.
64. For the precision matrix = (θi, j )i, j=1,2,3 , with θ12 = θ21 = 0, show
(a) Suppose that λ = 0 and N < p. Then, no inverse matrix exists for S
−1 − S − λψ = 0 (5.30)
N
p p
1
xi,s θs,t xi,t
N i=1 s=1 t=1
1 1
trace(S) = trace(X T X ) = trace(X X T )
N N
when A = X T and B = X .
(c) If we express the probability density function of the p Gaussian variables as
N
f (x1 , . . . , x p ), then the log-likelihood N1 i=1 log f (xi,1 , . . . , xi, p ) can
be written as
1
{log det − p log(2π) − trace(S)} . (cf. (5.9))
2
N
1 1
(d) For λ ≥ 0, the maximization of log f (xi ) − λ |θs,t | w.r.t. is
N i=1
2 s=t
that of
67. Let Ai, j and |B| be the submatrix that excludes the i-th row and j-th column
from matrix A ∈ R p× p and the determinant of B ∈ Rm×m (m ≤ p), respectively.
Then, we have
p
|A|, i = k
(−1)k+ j ai, j |Ak, j | = . (cf. (5.11))
0, i = k
j=1
(a) When |A| = 0, let B be the matrix whose ( j, k) element is b j,k = (−1)k+ j |
Ak, j |/|A|. Show that AB is the unit matrix. Hereafter, we write this matrix
B as A−1 .
(b) Show that if we differentiate |A| by ai, j , then it becomes (−1)i+ j |Ai, j |.
(c) Show that if we differentiate log |A| by ai, j , then it becomes (−1)i+ j |
Ai, j |/|A|, the ( j, i)-th element of A−1 .
(d) Show that if we differentiate the trace of S by the (s, t)-element θs,t of ,
it becomes the (t, s)-th element of S.
p p
Hint Differentiate trace(S) = i=1 j=1 si, j θ j,i by θs,t .
(e) Show that the that maximizes (5.31) is the solution of (5.12), where
= (ψs,t ) is ψs,t = 0 if s = t and
Exercises 62–75 171
⎧
⎨ 1, θs,t > 0
ψs,t = [−1, 1], θs,t = 0
⎩
−1, θs,t < 0
otherwise.
Hint If we differentiate log det by θs,t , it becomes the (t, s)-th element
of −1 from (d). However, because is symmetric, −1 : T = =⇒
(−1 )T = (T )−1 = −1 is symmetric as well.
and
W1,1 θ1,2 + w1,2 θ2,2 = 0 (cf. (5.15)) (5.34)
In the last stage, for j = 1, . . . , p, if we take the following one cycle, we obtain
the estimate of :
⎧
⎪ cj − λ
⎪
⎪ , cj > λ
⎪
⎨ a j, j
β j = 0, −λ < c j < λ (cf. (5.18)) (5.39)
⎪
⎪
⎪
⎪ cj + λ
⎩ , c j < −λ
a j, j
25 }
26 return(theta)
27 }
What rows of the function definition of graph.lasso are the Eqs. (5.36)–
(5.39)?
Then, we can construct an undirected graph G by connecting each s, t such that
θs,t = 0 as an edge. Generate the data for p = 5 and N = 20 based on a matrix
= (θs,t ) known a priori.
1 library(MultiRNG)
2 Theta = matrix(c( 2, 0.6, 0, 0, 0.5, 0.6, 2, -0.4,
0.3, 0, 0,
3 -0.4, 2, -0.2, 0, 0, 0.3, -0.2, 2,
-0.2, 0.5, 0,
4 0, -0.2, 2), nrow = 5)
5 Sigma = ## Blank ##; meanvec = rep(0, 5)
6 # mean: mean.vec, cov matrix: cov.mat, samples #: no.row, variable
#: d
7 # Generate the sample matrix
8 dat = draw.d.variate.normal(no.row = 20, d = 5, mean.vec = meanvec
, cov.mat = Sigma)
Moreover, execute the following code, and examine whether the precision matrix
is correctly estimated.
1 s = t(dat) %*% dat / nrow(dat); graph.lasso(s); graph.lasso(s,
lambda = 0.01)
71. The function adj defined below connects each (i, j) as an edge if and only if the
element is nonzero given a symmetric matrix of size p to construct an undirected
graph. Execute it for the breastcancer.csv data.
1 library(igraph)
2 adj = function(mat) {
3 p = ncol(mat); ad = matrix(0, nrow = p, ncol = p)
4 for (i in 1:(p - 1)) for (j in (i + 1):p) {
5 if (## Blank ##) {ad[i, j] = 0} else {ad[i, j] = 1}
6 }
7 g = graph.adjacency(ad, mode = "undirected")
8 plot(g)
9 }
10
11 library(glasso)
12 df = read.csv("breastcancer.csv")
13 w = matrix(nrow = 250, ncol = 1000)
14 for (i in 1:1000) w[, i] = as.numeric(df[[i]])
15 x = w; s = t(x) %*% x / 250
16 fit = glasso(s, rho = 1); sum(fit$wi == 0); adj(fit$wi)
17 fit = glasso(s, rho = 0.5); sum(fit$wi == 0); adj(fit$wi)
18 y = NULL; z = NULL
19 for (i in 1:999) for (j in (i + 1):1000) {
20 if (mat[i, j] != 0) {y = c(y, i); z = c(z, j)}
21 }
174 5 Graphical Models
22 cbind(y,z)
23 ## Restrict to the first 100 genes
24 x = x[, 1:100]; s = t(x) %*% x / 250
25 fit = glasso(s, rho = 0.75); sum(fit$wi == 0); adj(fit$wi)
26 y = NULL; z = NULL
27 for (i in 1:99) for (j in (i + 1):100) {
28 if (fit$wi[i, j] != 0) {y = c(y, i); z = c(z, j)}
29 }
30 cbind(y, z)
31 fit = glasso(s, rho = 0.25); sum(fit$wi == 0); adj(fit$wi)
72. The following code generates an undirected graph via the quasi-likelihood
method and the glmnet package. We examine the difference between using
the AND and OR rules, where the original data contains p = 1,000 but the exe-
cution is for the first p = 50 genes to save time. Execute the OR rule case as
well as the AND case by modifying the latter.
1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 n = 250; p = 50; w = matrix(nrow = n, ncol = p)
4 for (i in 1:p) w[, i] = as.numeric(df[[i]])
5 x = w[, 1:p]; fm = rep("gaussian", p); lambda = 0.1
6 fit = list()
7 for (j in 1:p) fit[[j]] = glmnet(x[, -j], x[, j], family = fm[j],
lambda = lambda)
8 ad = matrix(0, p, p)
9 for (i in 1:p) for (j in 1:(p - 1)) {
10 k = j
11 if (j >= i) k = j + 1
12 if (fit[[i]]$beta[j] != 0) {ad[i, k] = 1} else {ad[i, k] = 0}
13 }
14 ## AND
15 for (i in 1:(p - 1)) for (j in (i + 1):p) {
16 if (ad[i, j] != ad[i, j]) {ad[i, j] = 0; ad[j, i] = 0}
17 }
18 u = NULL; v = NULL
19 for (i in 1:(p - 1)) for (j in (i + 1):p) {
20 if (ad[i, j] == 1) {u = c(u, i); v = c(v, j)}
21 }
22 ## OR
1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 w = matrix(nrow = 250, ncol = 1000); for (i in 1:1000) w[, i] = as.
numeric(df[[i]])
4 w = (sign(w) + 1) / 2 ## Transform to Binary
5 p = 50; x = w[, 1:p]; fm = rep("binomial", p); lambda = 0.15
6 fit = list()
7 for (j in 1:p) fit[[j]] = glmnet(x[, -j], x[, j], family = fm[j],
lambda = lambda)
8 ad = matrix(0, nrow = p, ncol = p)
9 for (i in 1:p) for (j in 1:(p - 1)) {
10 k = j
11 if (j >= i) k = j + 1
12 if (fit[[i]] beta[j] != 0) {ad[i, k] = 1} else {ad[i, k] = 0}
13 }
14 for (i in 1:(p - 1)) for (j in (i + 1):p) {
15 if (ad[i, j] != ad[i, j]) {ad[i, j] = 0; ad[j, i] = 0}
16 }
17 sum(ad); adj(ad)
How can we deal with data that contain both continuous and discrete values?
73. Joint graphical Lasso (JGL) finds 1 , . . . , K that maximize
K
Nk {log det k − trace(Sk )} − P(1 , . . . , K ) (cf. (5.21)) (5.40)
i=1
−Nk (−1
k − Sk ) + ρ(k − Z k + Uk ) = 0 .
(b) We wish to obtain the optimum k in the first step. To this end, we decompose
both sides of the symmetric matrix
ρ Zk Uk
−1
k − k = Sk − ρ +ρ
Nk Nk Nk
1 1
(y1 − θ1 )2 + (y2 − θ2 )2 + |θ1 − θ2 | .
2 2
74. We construct the fused Lasso JGL. Fill in the blanks, and execute the procedure.
75. For the group Lasso, only the second step should be replaced. Let Ak [i, j] =
k [i, j] + Uk [i, j]. Then, no update is required for i = j, and
⎛ ⎞
λ2
Z k [i, j] = Sλ1 /ρ (Ak [i, j]) ⎝1 − ⎠
K
ρ k=1 Sλ1 /ρ (Ak [i, j])2
+
for i = j. We construct the code. Fill in the blank, and execute the procedure as
in the previous exercise.
1 for (i in 1:p) for (j in 1:p) {
2 A = NULL; for (k in 1:K) A = c(A, Theta[[k]][i, j] + U[[k]][i, j
])
3 if (i == j) {B = A} else {B = ## Blank ##}
4 for (k in 1:K) Z[[k]][i, j] = B[k]
5 }
Chapter 6
Matrix Decomposition
Thus far, by Lasso, we mean making the estimated coefficients of some variables
zero if the absolute values are significantly small for regression, classification, and
graphical models. In this chapter, we consider dealing with matrices. Suppose that
the given data take the form of a matrix, such as in image processing. We wish to
approximate an image by a low-rank matrix after singular decomposition. However,
the function needed to obtain the rank from a matrix is not convex. In this chapter,
we introduce matrix norms w.r.t. the singular values, such as the nuclear norm and
spectral norm, and formulate the problem as convex optimization, as we did for other
Lasso classes.
We list several formulas of matrix arithmetic, where we write the unit matrix of
size n × n as In .
i. By Z F , we denote the Frobenius norm of matrix Z , the square root of the
square sum of the elements in Z ∈ Rm×n . In general, we have
n
n
m
trace(Z T Z ) = wi,i = 2
z k,i = Z 2F .
i=1 i=1 k=1
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 179
J. Suzuki, Sparse Estimation with Math and R,
https://doi.org/10.1007/978-981-16-1446-0_6
180 6 Matrix Decomposition
l
l
m
n
m
trace(ABC) = u i,i = ai,k bk,h ch,i = vi,i = trace(BC A) .
i=1 i=1 k=1 h=1 i=1
(6.2)
iii. We can choose mutually orthogonal vectors as the basis of a vector space. If the
lengths of the vectors in the mutually orthogonal basis are one, then the basis is
said to be orthonormal. The columns u 1 , · · · , u n ∈ Rn of U ∈ Rn×n making an
orthonormal basis of the vector space Rn are equivalent to U T U = UU T =
In . Moreover, if U = [u 1 , · · · , u n ] consists of orthonormal column vectors,
for Ur = [u i(1) , · · · , u i(r ) ] ∈ Rn×r with r ≤ n and 1 ≤ i(1) ≤ · · · ≤ i(r ) ≤ n,
UrT Ur ∈ Rr ×r is the unit matrix, but Ur UrT ∈ Rn×n is not.
Z T Z V = V D2 ,
where
√ V = [v1 , .√. . , vn ] ∈ Rn×n and D is a diagonal matrix with the elements d1 :=
λ1 , . . . , dn := λn . From V V T = In , we have
Z T Z = V D2 V T . (6.3)
Suppose dn > 0, which means that D is the inverse matrix. We define U ∈ Rm×n
as
U := Z V D −1 .
U T U = (Z V D −1 )T Z V D −1 = D −1 V T Z T Z V D −1 = D −1 V T (V D 2 V T )V D −1 = In .
Z x = 0 =⇒ Z T Z x = 0
Z T Z x = 0 =⇒ x T Z T Z x = 0 =⇒ Z x2 = 0 =⇒ Z x = 0
the kernels of Z ∈ Rm×n , Z T Z ∈ Rn×n coincide, and the numbers of columns are
equal. Thus, the r value such that dr > 0, dr +1 = 0 coincides with the rank of Z .
We now consider a general case r ≤ n. Let Vr := [v1 , . . . , vr ] ∈ Rn×r and Dr ∈
r ×r
R be a diagonal matrix with the elements d1 , . . . , dr . Then, if we regard the
row vectors d1 v1T , . . . , dr vrT ∈ Rn of Dr VrT ∈ Rr ×n (rank: r ) as a basis of the rows
zi ∈ R r (i = 1, . . .T, m) of Z , we find that um×r
n
i, j ∈ R exists and is unique such that
z i = j=1 u i, j d j v j . Thus, Ur = (u i, j ) ∈ R such that Z = Ur Dr VrT is unique.
Moreover, (6.3) means that UrT Ur = Ir . In fact, let Vn−r be the submatrix of V
that excludes Vr . From VrT Vr = Ir and VrT Vn−r = O ∈ Rr ×(n−r ) , we have
Z T Z = V D 2 V T ⇐⇒ Vr Dr UrT Ur Dr VrT = V D 2 V T
⇐⇒ V T Vr Dr UrT Ur Dr VrT V = D 2
2
Dr Dr O
⇐⇒ UrT Ur Dr O =
O O O
⇐⇒ UrT Ur = Ir .
1 [,1] [,2]
2 [1,] 0 -2
3 [2,] 5 -4
4 [3,] -1 1
1 svd(Z)
182 6 Matrix Decomposition
1 $d
2 [1] 6.681937 1.533530
3 $u
4 [,1] [,2]
5 [1,] -0.1987442 0.97518046
6 [2,] -0.9570074 -0.21457290
7 [3,] 0.2112759 -0.05460347
8 $v
9 [,1] [,2]
10 [1,] -0.7477342 -0.6639982
11 [2,] 0.6639982 -0.7477342
1 svd(t(Z))
1 $d
2 [1] 6.681937 1.533530
3 $u
4 [,1] [,2]
5 [1,] -0.7477342 0.6639982
6 [2,] 0.6639982 0.7477342
7 $v
8 [,1] [,2]
9 [1,] -0.1987442 -0.97518046
10 [2,] -0.9570074 0.21457290
11 [3,] 0.2112759 0.05460347
λi ≥ 0 ⇐⇒ u i = vi = wi
λi < 0 ⇐⇒ u i = wi , vi = −wi or u i = −wi , vi = wi ,
1 [,1] [,2]
2 [1,] 0 5
3 [2,] 5 -1
1 svd(Z)
1 $d
2 [1] 5.524938 4.524938
3 $u
4 [,1] [,2]
5 [1,] 0.6710053 -0.7414525
6 [2,] -0.7414525 -0.6710053
7 $v
8 [,1] [,2]
9 [1,] -0.6710053 -0.7414525
10 [2,] 0.7414525 -0.6710053
1 eigen(Z)
1 $values
2 [1] 4.524938 -5.524938
3 $vectors
4 [,1] [,2]
5 [1,] -0.7414525 -0.6710053
6 [2,] -0.6710053 0.7414525
25000
Squared Frobenius norm
executing svd.r(z,r)
and the original z. The
horizontal and vertical axes
show r and the difference in
15000
the squared Frobenius norms
0 5000
0 50 100 150
Rank
1 svd.r = function(z, r) {
2 n = min(nrow(z), ncol(z))
3 ss = svd(z)
4 tt = ss$u %*% diag(c(ss$d[1:r], rep(0, n - r))) %*% t(ss$v)
5 return(tt)
6 }
Example 56 The following code is used to obtain another image file with a lower
rank from the original lion.jpg. In general, if an image takes the J P G form, then
each pixel takes 256 grayscale values. Moreover, if the image is in color, we have the
information for each of the blue, red, and green pixels. For example, if the display
is 480 × 360 pixels in size, then it has 480 × 360 × 3 pixels in total. The following
codes are executed for ranks r = 2, 5, 10, 20, 50, 100 (Fig. 6.2). Before executing the
procedure, we need to generate a folder /compressed under the current folder
beforehand.
6.2 Eckart-Young’s Theorem 185
rank 2 rank 10
rank 10 rank 20
Fig. 6.2 We approximate the red, green, and blue matrix expression of the image file lion.jpg by
several ranks r . We observe that the lower the rank is, the less clear the approximated image
1 library(jpeg)
2 image = readJPEG(’lion.jpg’)
3 rank.seq = c(2, 5, 10, 20, 50, 100)
4 mat = array(0, dim = c(nrow(image), ncol(image), 3))
5 for (j in rank.seq) {
6 for (i in 1:3) mat[, , i] = svd.r(image[, , i], j)
7 writeJPEG(mat, paste("compressed/lion_compressed", "_mat_rank_", j, "
.jpg", sep = ""))
8 }
For the next section, we assume the following situation. Suppose that the loca-
tions indexed by ⊆ {1, . . . , m} × {1, . . . , n} in the matrix Z ∈ Rm×n are observed.
Given z i, j ((i, j) ∈ ), we wish to contain the matrix M ∈ Rm×n with a rank of at
most r that minimizes the Frobenius norm of M − Z , i.e., the M that minimizes
186 6 Matrix Decomposition
Z − M2F + λr (6.4)
To this end, we define the symbol [, ·, ·]: [, A, B] whose (i, j)-element is
ai, j , (i, j) ∈
bi, j , (i, j) ∈
/
For example, we construct the following procedure (this process stops at a local
optimum and cannot converge to the globally optimum M that minimizes (6.4)).
The function mask takes one and zero for the observed and unobserved positions,
respectively; thus, 1 - mask takes the opposite values.
1 mat.r = function(z, mask, r) {
2 z = as.matrix(z)
3 min = Inf
4 m = nrow(z); n = ncol(z)
5 for (j in 1:5) {
6 guess = matrix(rnorm(m * n), nrow = m)
7 for (i in 1:10) guess = svd.r(mask * z + (1 - mask) * guess, r)
8 value = norm(mask * (z - guess), "F")
9 if (value < min) {min.mat = guess; min = value}
10 }
11 return(min.mat)
12 }
Example 57 To obtain the M that minimizes (6.4), we repeat the updates (6.5). In
the inner loop, the rank is approximated to r , and the unobserved value is predicted
ten times. In the outer loop, to prevent falling into a local solution, the initial value
is changed and executed five times to find the solution that minimizes the Frobenius
norm of the difference.
6.2 Eckart-Young’s Theorem 187
1 library(jpeg)
2 image = readJPEG(’lion.jpg’)
3 m = nrow(image); n = ncol(image)
4 mask = matrix(rbinom(m * n, 1, 0.5), nrow = m)
5 rank.seq = c(2, 5, 10, 20, 50, 100)
6 mat = array(0, dim = c(nrow(image), ncol(image), 3))
7 for (j in rank.seq) {
8 for (i in 1:3) mat[, , i] = mat.r(image[, , i], mask, j)
9 writeJPEG(mat, paste("compressed/lion_compressed", "_mat_rank_", j, "
.jpg", sep = ""))
10 }
The main reason for the hardness of optimization as in (6.4) is that the arithmetic
of computing the rank is not convex. We show that the proposition
For arbitrary A, B ∈ Rm×n and 0 < α < 1,
Example 58 The following is a counterexample of (6.6). The ranks in (6.6) are two
and one on the left and right sides if
1 0 0 0
α = 0.5, A= , and B = .
0 0 0 1
6.3 Norm
Then, b∗ := supa≤1 a, b satisfies the above three conditions as well. In fact, if
an element in b is nonzero, then we have a, b > 0 for any a that shares its sign with
that of b for each nonzero element of b. Thus, b∗ = 0 ⇐⇒ b = 0. Moreover, we
have
188 6 Matrix Decomposition
Proposition 19 The spectral norm M∗ of a matrix M coincides with the largest
singular value d1 (M) of M.
Proof We have
sup M x2 = sup U DV T x2 = sup DV T x2 = sup Dy2 = d1 (M) ,
x2 ≤1 x2 ≤1 x2 ≤1 y2 ≤1
M∗ := sup M, Q ,
Q∗ ≤1
where the inner product M, Q is the trace of M T Q. We consider the dual M∗ of
the spectral norm of M to be the nuclear norm.
We state an inequality that plays a significant role in deriving the conclusion of
this section.
n
Q, M ≤ Q∗ di (M) ,
i=1
and either u i (Q) = u i (M), vi (Q) = vi (M) or u i (Q) = −u i (M), vi (Q) = −vi (M)
ni = 1, . . . , r , where we performed
for each n the singular decomposition of M, Q as
M = i=1 di (M)u i (M)vi (M), Q = i=1 di (Q)u i (Q)vi (Q).
Proposition 21 The nuclear norm M∗ of matrix M coincides with the sum
n
i=1 di (M) of the singular values of M.
Q, M = U V T , U DV T = trace(V U T U DV T ) = trace(V T V U T U D)
r
= trace D = di (M),
i=1
n
n
sup Q, M ≤ sup di (M)d1 (Q) = di (M) ,
d1 (Q)≤1 d1 (Q)≤1 i=1 i=1
d1 (G) = · · · = dr (G) = 1
1 ≥ dr +1 (G) ≥ · · · ≥ dn (G) ≥ 0
either u i (G) = u i (A), vi (G) = vi (A) or
u i (G) = −u i (A), vi (G) = −vi (A) for each i = 1, . . . , r.
Square We show the necessity first. Since (6.7) needs to hold even when B is the
zero matrix, we have
190 6 Matrix Decomposition
G, A ≥ A∗ .
which means that G∗ ≥ 1. On the other hand, if we substitute B = A + u ∗ v∗T into
(6.7), where u ∗ , v∗ are the largest-singular-value left and right vectors of G, then we
have
A + u ∗ v∗T ∗ ≥ A∗ + G, u ∗ v∗T .
which means that G∗ ≤ 1. From the equality of (6.8) and Propositions 20 and
d1 (G) = G∗ , the proof of necessity is completed.
We show the sufficiency next. If we substitute M = B and Q = G into Proposi-
tion 20, when G∗ = 1, we have
B∗ ≥ G, B . (6.9)
n
From Proposition 22, the subderivative G of · ∗ at M := i=1 di u i viT is
r
n
G= u i viT + d̃i ũ i ṽiT
i=1 i=r +1
when the rank of M is r , where ũ r +1 , . . . , ũ n and ṽr +1 , . . . , ṽn are the singular vectors
associated with the singular values 1 ≥ d̃r +1 , · · · , d̃n of G.
Thus, if we differentiate both sides of (6.11) by the matrix M = (m i, j ), we have
[, M − Z , 0] + λG ,
∂L
where the values differentiated by m i, j are .
∂m i, j
Next, we show that when the rank of M is r <n, m, M∗ cannot be differentiated
by M without subderivative. If we write M = ri=1 u i di viT , for > 0, we have
r r
M + u j v Tj ∗ − M∗ i=1 di + u j v Tj ∗ − i=1 di
= =1
r r
M − u j v Tj ∗ − M∗ i=1 di − u j v Tj ∗ − i=1 di
= =1
− −
Mazumder et al. (2010) [19] showed that when starting with an arbitrary initial
value M0 and updating Mt via
until convergence, both sides of M in (6.13) coincide. This matrix M satisfies (6.12),
where Sλ (W ) denotes the matrix whose singular values di , i = 1, . . . , n, are replaced
with di − λ if di ≥ λ and zeros otherwise. Hence, singular-decompose [, Z , M]
and subtract min{λ, dk } from the singular values dk so that the obtained singular
values are nonnegative.
192 6 Matrix Decomposition
Note that (6.12) can be obtained if we set Mt+1 = Mt = M in the update (6.13).
For example, we can construct the function soft.svd that returns Z = U D V T
given Z = U DV T , where D and D are diagonal matrices with the diagonal elements
dk and dk := max{dk − λ, 0}, k = 1, . . . .n.
1 soft.svd = function(lambda, z) {
2 n = ncol(z); ss = svd(z); dd = pmax(ss$d - lambda, 0)
3 return(ss$u %*% diag(dd) %*% t(ss$v))
4 }
Example 59 For the same image as in Example 55, we construct a procedure such
that we recover the whole image when we mask some part. More precisely, we apply
the functions soft.svd and mat.lasso for p = 0.5, 0.25 and λ = 0.5, 0.1 to
recover the image (Fig. 6.3), where the larger the value of λ is, the lower the rank.
Fig. 6.3 We obtain the image that minimizes (6.11) from an image with the locations of lion.jpg
unmasked, where p = 0.25, 0.5, 0.75 are the ratios of the (unmasked) observations, λ is the param-
eter in (6.11), and the larger the value of λ is, the lower the rank
6.4 Sparse Estimation for Low-Rank Estimations 193
1 library(jpeg)
2 image = readJPEG(’lion.jpg’)
3 m = nrow(image[, , 1]); n = ncol(image[, , 1])
4 p = 0.5
5 lambda = 0.5
6 mat = array(0, dim = c(m, n, 3))
7 mask = matrix(rbinom(m * n, 1, p), ncol = n)
8 for (i in 1:3) mat[, , i] = mat.lasso(lambda, image[, , i], mask)
9 writeJPEG(mat, paste("compressed/lion_compressed", "_mat_soft.jpg", sep
= ""))
m
r
by the element a p,q of A, then it becomes − (z i,q − qi,k ak,q )qi, p , which is the
i=1 k=1
( p, q)-th element of −Q T (Z − Q A) = −Q T Z + A.
Moreover, for = Z Z T and M := Q Q T Z , we have
which means that the problem reduces to finding the value of Q that maximizes
trace(Q T Q). In fact, we derive from (6.1)
where (6.2) has been used for the first and last variables. Thus, the problem further
reduces to maximizing trace(Q T Q). Then, using U T U = UU T = I , Q T Q = I
and R := U T Q ∈ Rm×r , from
Q T Q = Q T Z Z T Q = Q T U D2U T Q = R T D2 R ,
columns to R ∈ Rm×r , then h i,i is the squared sum over the columns of R along
the
m i-th row and does not exceed 1. Because the (i, j)-th element of R T D 2 R is
2
k=1 r k,i r k, j dk , from (6.2), the trace is
m
trace(R T D 2 R) = trace(R R T D 2 ) = h i,i di2
i=1
m m r
From R T R = Ir , we have i=1 h i,i = i=1 j=1 ri, j = r . Thus, h 1,1 = · · · =
2
Proposition 20 For the matrices Q, M ∈ Rm×n , if d1 (M), . . . , dn (M) are the sin-
gular values of M, then we have
n
∗
Q, M ≤ Q di (M) ,
i=1
and either u i (Q) = u i (M), vi (Q) = vi (M) or u i (Q) = −u i (M), vi (Q) = −vi (M)
n each i = 1, . . . , r , where
for we singular-decomposed M, Q as M =
n
i=1 di (M)u i (M)vi (M) and Q = i=1 di (Q)u i (Q)vi (Q).
for i = 1, . . . , r , because the lengths of u i (M), u i (Q), vi (M), vi (Q) are ones. Note
that (6.14) means either u i (M) = u i (Q), vi (Q) = vi (M) or u i (M) = −u i (Q),
vi (Q) = −vi (M), which completes the proof.
Exercises 76–87
1If m ≥ n, then U ∈ Rm×n , D ∈ Rn×n , and V ∈ Rn×n , and if m ≤ n, then U ∈ Rm×m , D ∈ Rm×m ,
and V ∈ Rn×m .
Exercises 76–87 197
m
r
the element a p,q of A, we obtain − (z i,q − qi,k ak,q )qi, p , which is the
i=1 k=1
( p, q)-th element of −Q T (Z − Q A).
(b) Let := Z Z T . Show that
which means that minimizing Z − M2F reduces to finding the Q that max-
imizes trace(Q T Q).
Hint Z T (I − Q Q T )2F = trace((I − Q Q T )(I − Q Q T )) can be
derived from Exercise 77 (a) and becomes trace((I − Q Q T )2 ) from
Exercise 77 (b).
(c) Show that (b) further reduces to finding the orthogonal matrix R ∈ Rm×r
that maximizes trace(R T D 2 R).
Hint Show that Q T Q = Q T U D 2 U T Q and
Qorthogonal ⇐⇒ R := U T Qorthogonal.
81. The following code is the process of obtaining another image file with the same
lower rank as that of lion.jpg. Generally, in the J P G format, each pixel has
a 256 grayscale of information. It has information about the three colors blue,
red, and green. For example, if we have 480 × 360 pixels vertically and hori-
zontally, each pixel of 480 × 360 × 3 will hold one of 256 values. In the code
below, the rank is r = 2, 5, 10, 20, 50, 100. This time, we approximate the lower
ranks for red, green, and blue (Fig. 6.2). It is necessary to create a folder called
/compressed under the current folder in advance.
1 library(jpeg)
2 image = readJPEG(’lion.jpg’)
3 rank.seq = c(2, 5, 10, 20, 50, 100)
4 mat = array(0, dim = c(nrow(image), ncol(image), 3))
5 for (j in rank.seq) {
6 for (i in 1:3) mat[, , i] = svd.r(image[, , i], j)
7 writeJPEG(mat,
8 paste("compressed/lion_compressed", "_svd_rank_", j, ".
jpg", sep = ""))
9 }
4 m = nrow(z); n = ncol(z)
5 for (j in 1:5) {
6 guess = matrix(rnorm(m * n), nrow = m)
7 for (i in 1:10) guess = svd.r(mask * z + (1 - mask) * guess, r)
8 value = norm(mask * (z - guess), "F")
9 if (value < min) {min.mat = ## Blank (1) ##; min = ## Blank (2) ##}
10 }
11 return(min.mat)
12 }
13
14 library(jpeg)
15 image = readJPEG(’lion.jpg’)
16 m = nrow(image); n = ncol(image)
17 mask = matrix(rbinom(m * n, 1, 0.5), nrow = m)
18 rank.seq = c(2, 5, 10, 20, 50, 100)
19 mat = array(0, dim = c(nrow(image), ncol(image), 3))
20 for (j in rank.seq) {
21 for (i in 1:3) mat[, , i] = mat.r(image[, , i], mask, j)
22 writeJPEG(mat,
23 paste("compressed/lion_compressed", "_mat_rank_", j, ".jpg"
, sep = ""))
24 }
Mazumder et al. (2010) [19] showed that starting with an arbitrary initial value
M0 and updating Mt via
until convergence, both sides of M in (6.17) coincide. Show that this matrix
M satisfies (6.16), where Sλ (W ) denotes the matrix whose singular values di ,
i = 1, . . . , n, are replaced with di − λ if di ≥ λ and zeros otherwise.
Hint Singular-decompose [, Z , M], and subtract min{λ, dk } from the singular
values dk such that the obtained singular values are nonnegative. If we set Mt+1 =
Mt = M, then we obtain (6.16).
87. We construct the function soft.svd that returns Z = U D V T given Z =
U DV T , where D and D are diagonal matrices with the diagonal elements dk
and dk := max{dk − λ, 0}, k = 1, . . . .n. Fill in the blank.
1 soft.svd = function(lambda, z) {
2 n = ncol(z); ss = svd(z); dd = pmax(ss$d - lambda, 0)
3 return(## Blank ##)
4 }
For the program below, obtain three images, changing the values of p and λ.
1 mat.lasso = function(lambda, z, mask) {
2 z = as.matrix(z); m = nrow(z); n = ncol(z)
3 guess = matrix(rnorm(m * n), nrow = m)
4 for (i in 1:20) guess = soft.svd(lambda, mask * z + (1 - mask) *
guess)
5 return(guess)
6 }
7
8 library(jpeg)
9 image = readJPEG(’lion.jpg’)
10 m = nrow(image[, , 1]); n = ncol(image[, , 1])
11 p = 0.5
12 lambda = 0.5
13 mat = array(0, dim = c(m, n, 3))
14 mask = matrix(rbinom(m * n, 1, p), ncol = n)
15 for (i in 1:3) mat[, , i] = mat.lasso(lambda, image[, , i], mask)
16 writeJPEG(mat, paste("compressed/lion_compressed", "_mat_soft.jpg",
sep = ""))
Chapter 7
Multivariate Analysis
In this chapter, we consider sparse estimation for the problems of multivariate analy-
sis, such as principal component analysis and clustering. There are two equivalence
definitions for principal component analysis: finding orthogonal vectors that maxi-
mize the variance and finding a vector that minimizes the reconstruction error when
the dimension is reduced. This chapter first introduces the sparse estimation methods
for principal component analysis, of which SCoTLASS and SPCA are popular. In
each case, the purpose is to find a principal component vector with few nonzero
components. On the other hand, introducing sparse estimation is crucial for cluster-
ing to select variables. In particular, we consider the K-means and convex clustering
problems. The latter has the advantage of not falling into a locally optimal solution
because it becomes a problem of convex optimization.
X v2 . (7.1)
X v j 2 − μ(v j 2 − 1) ,
if we differentiate by v j , we have
XT Xvj − μjvj = 0
1
More precisely, using the sample covariance := X T X based on X and λ j :=
N
μj
, we can express
N
v j = λ j v j . (7.2)
v T X T X v − λv0
under v0 ≤ t, v2 = 1, for some integer t is not convex. On the other hand, when
we wish to add the constraint v1 ≤ t (t > 0), the formulation maximizing
v T X T X v − λv1 (7.3)
u T X v − λv1 (7.4)
μ T δ
L := −u T X v + λv1 + (u u − 1) + (v T v − 1) (7.5)
2 2
7.1 Principal Component Analysis (1): SCoTLASS 203
Xv
by u, from X v − μu = 0, u = 1, we have u = . If we substitute this into
X v2
(7.5), we obtain
δ
−X v2 + λv1 + (v T v − 1) .
2
Although (7.5) takes positive values when we differentiate it by each of u, v twice,
it is not convex w.r.t. (u, v).
√
Example 60 When N = p = 1, X > μδ, it is not convex. In fact, because L is a
bivariate function of u, v, we have
⎡ ⎤
∂2 L ∂2 L
⎢ ∂u∂v ⎥
∇2 L = ⎢
∂u 2 ⎥ = μ −X .
⎣ ∂2 L ∂2 L ⎦ −X δ
∂u∂v ∂v 2
⎧
⎨ (the j − th column of X T u) − λ + δv j = 0, (the j − th column of X T u) > λ
(the j − th column of X T u) + λ + δv j = 0, (the j − th column of X T u) < −λ ,
⎩
(the j − th column of X T u) + λ[−1, 1] 0, −λ ≤ (the j − th column of X T u) ≤ λ
i.e., δv = −S λ (X T u).
In general, when : R p × R p → R satisfies
f (β) ≤ (β, θ ), θ ∈ R p
(7.6)
f (β) = (β, β)
204 7 Multivariate Analysis
we have
f (β t ) = (β t , β t ) ≥ (β t+1 , β t ) ≥ f (β t+1 ) .
In SCoTLASS, for
Sλ (X T u)
then each value expresses at the time, and it is monotonically decreas-
Sλ (X T u)2
ing.
For example, we can construct the function SCoTLASS as follows.
1 soft.th = function(lambda, z) return(sign(z) * pmax(abs(z) - lambda, 0))
2 ## even if z is a vector, soft.th works
3 SCoTLASS = function(lambda, X) {
4 n = nrow(X); p = ncol(X); v = rnorm(p); v = v / norm(v, "2")
5 for (k in 1:200) {
6 u = X %*% v; u = u / norm(u, "2"); v = t(X) %*% u
7 v = soft.th(lambda, v); size = norm(v, "2")
8 if (size > 0) v = v / size else break
9 }
10 if (norm(v, "2") == 0) print("all the elements of v are zero"); return(
v)
11 }
Example 61 Due to nonconvexity, even when λ increases, X v2 may not be mono-
tonically decreasing even though the number of zeros in v decreases with λ. We show
the execution of the code below in Fig. 7.1.
7.1 Principal Component Analysis (1): SCoTLASS 205
50
16.0
# Nonzero Vectors
40
Variance Sum
15.5
30
15.0
1st 1st
2nd 2nd
20
3rd 3rd
4th 4th
14.5
5th 5th
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
λ λ
Fig. 7.1 Execution of Example 61. With λ, the number of nonzero vectors and the sum of the
variances decrease monotonically, If we change the initial value and execute it, these values will be
different each time. This tendency becomes more remarkable as the value of λ increases
1 ## Data Generation
2 n = 100; p = 50; X = matrix(rnorm(n * p), nrow = n); lambda.seq = 0:10
/ 10
3
4 m = 5; SS = array(dim = c(m, 11)); TT = array(dim = c(m, 11))
5 for (j in 1:m) {
6 S = NULL; T = NULL
7 for (lambda in lambda.seq) {
8 v = SCoTLASS(lambda, X); S = c(S, sum(sign(v ^ 2))); T = c(T, norm
(X %*% v, "2"))
9 }
10 SS[j, ] = S; TT[j, ] = T
11 }
12 ## Display
13 par(mfrow = c(1, 2))
14 SS.min = min(SS); SS.max = max(SS)
15 plot(lambda.seq, xlim = c(0, 1), ylim = c(SS.min, SS.max),
16 xlab = "lambda", ylab = "# of nonzero vectors")
17 for (j in 1:m) lines(lambda.seq, SS[j, ], col = j + 1)
18 legend("bottomleft", paste0(1:5, "-th"), lwd = 1, col = 2:(m + 1))
19 TT.min = min(TT); TT.max = max(TT)
20 plot(lambda.seq, xlim = c(0, 1), ylim = c(TT.min, TT.max), xlab = "
lambda", ylab = "Variance Sum")
21 for (j in 1:m) lines(lambda.seq, TT[j, ], col = j + 1)
22 legend("bottomleft", paste0(1:5, "-th"), lwd = 1, col = 2:(m + 1))
23 par(mfrow = c(1, 1))
To handle not only the first principal component but also multiple components,
the following formulation is made in SCoTLASS. We formulate maximizing
u kT X vk
206 7 Multivariate Analysis
under
vk 2 ≤ 1 , vk 1 ≤ c , u k 2 ≤ 1 , u kT u j = 0 , ( j = 1, . . . , k − 1)
⊥
k−1
for Pk−1 := I − i=1 u i u iT . In fact, given u 1 , . . . , u k−1 , we have
k−1
⊥
u Tj Pk−1 X vk = u Tj (I − u i u iT )X vk = 0 ( j = 1, . . . , k − 1) .
i=1
N
N
N
xi (I − Vm VmT )2 = xi (I − Vm VmT )2 xiT = xi (I − Vm VmT )xiT . (7.8)
i=1 i=1 i=1
N
xi Vm VmT xiT = trace(X Vm VmT X T ) = trace(VmT X T X Vm )
i=1
m
m
= v Tj X T X v j = X v j 2 (7.9)
j=1 j=1
m
m
X v j 2 − λ j (v j 2 − 1) .
j=1 j=1
is biconvex for u, v, where xi is the i-th row vector of X . In fact, if we consider the
constraint u2 = 1, we can formulate (7.10) as
1 2 1
N N N
L := xi xiT − xi vu T xiT + xi vv T xiT + λ1 v1 + λ2 v22 + μ(u T u − 1) .
N N N
i=1 i=1 i=1
We observe that if we differentiate it by u j and vk twice, then the results are nonneg-
ative when excluding the term v1 .
When we optimize w.r.t. u when v is fixed, the solution is
XT z
u=
X T z2
λ = 0.00001 λ = 0.001
-0.2 0.0 0.2 0.4 0.6 0.8
Each Element of v
Each element of v
0.5
0.0
-0.5
0 20 40 60 80 100 0 20 40 60 80 100
# Repetitions # Repetitions
Fig. 7.2 Execution of Example 62. We observed the changes of v for λ = 0.00001(Left) and
λ = 0.001(Right)
where ri, j,k = xi, j − u j h=k xi,h vh . Thus, vk is the normalized constraint such that
v2 = 1, i.e.,
⎛ ⎞
1 N
k−1
Sλ ⎝ xi,k ri, j,k u j ⎠
N i=1 j=1
vk = ⎛ ⎞2 (k = 1, . . . , p) .
1
p N k−1
Sλ ⎝ xi,h ri, j,h u j ⎠
h=1
N i=1 j=1
14 for (k in 1:p) {
15 for (i in 1:n) r[i] = sum(u * x[i, ]) - sum(u ^ 2) * sum(x[i, -k]
* v[-k])
16 S = sum(x[, k] * r) / n
17 v[k] = soft.th(lambda, S)
18 }
19 if (sum(v ^ 2) > 0.00001) v = v / sqrt(sum(v ^ 2))
20 g[h, ] = v
21 }
22 ## Graph Display
23 g.max = max(g); g.min = min(g)
24 plot(1:m, ylim = c(g.min, g.max), type = "n",
25 xlab = "# Repetition", ylab = "Each element of v", main = "lambda
= 0.001")
26 for (j in 1:p) lines(1:m, g[, j], col = j + 1)
When we consider sparse PCA not just for the first element but also for up to the
m-th element, we minimize
1
N m m m
L := xi − xi Vm UmT 22 + λ1 v j 1 + λ2 v j 2 + μ (u Tj u j − 1)
N i=1 j=1 j=1 j=1
1 1
N N K
a j (C1 , . . . , C K ) := (xi, j − xi , j )2 − (xi, j − xi , j )2
N i=1 i =1 k=1
N k i∈Ck i ∈Ck
(7.11)
under the constraint w2 ≤ 1, w1 ≤ s, w ≥ 0 (all the elements are nonnegative)
for s > 0. The problem is to remove unnecessary variables by weighting for j =
1, . . . , p and to make the interpretation of the clustering easier. To solve the problem,
they proposed repeating the following updates [31].
p
1. fixing w1 , . . . , w p , find the C1 , . . . , C K that maximize j=1 w j a j (C1 , . . . , C K )
p
2. fixing C1 , . . . , C K , find the w1 , . . . , w p that maximize j=1 w j a j (C1 , . . . , C K )
The first term in (7.11) is constant, and the second term can be written as
1
(xi, j − xi , j )2 = 2 (xi − x̄k )2
Nk i∈C i ∈C i∈C
k k k
1
; thus, to obtain the optimum C1 , . . . , C N when the weights w1 , . . . , w p are fixed,
we may execute the following procedure.
1 k.means = function(X, K, weights = w) {
2 n = nrow(X); p = ncol(X)
3 y = sample(1:K, n, replace = TRUE); center = array(dim = c(K, p))
4 for (h in 1:10) {
5 for (k in 1:K) {
6 if (sum(y[] == k) == 0) center[k, ] = Inf else
7 for (j in 1:p) center[k, j] = mean(X[y[] == k, j])
8 }
9 for (i in 1:n) {
10 S.min = Inf
11 for (k in 1:K) {
12 if (center[k, 1] == Inf) break
13 S = sum((X[i, ] - center[k, ]) ^ 2 * w)
14 if (S < S.min) {S.min = S; y[i] = k}
15 }
16 }
17 }
18 return(y)
19 }
Example 63 Generate data artificially (N = 1000, p = 2), and execute the function
k.means for the weights 1 : 1 and 1 : 100 (Fig. 7.3).
1 ## Data Generation
2 K = 10; p = 2; n = 1000; X = matrix(rnorm(p * n), nrow = n, ncol = p)
3 w = c(1, 1); y = k.means(X, K, w)
4 ## Display Output
5 plot(-3:3, -3:3, xlab = "x", ylab = "y", type = "n")
6 points(X[, 1], X[, 2], col = y + 1)
3
2
2
1
1
0
0
y
y
-1
-1
-2
-2
-3
-3
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3
x x
Fig. 7.3 We execute the function k.means for the weights 1 : 1 (left) and 1 : 100 (right). For case
on the right whose weight of the second element is large, due to the larger penalty, the generated
cluster is horizontally long
Comparing them, if we make the second element large, then we tend to obtain
horizontal clusters.
Proposition 23 Let s > 0 and a ∈ R p with nonnegative elements. Then, the w with
w2 = 1, w1 ≤ s that maximizes w T a can be written as
Sλ (a)
w= (7.12)
Sλ (a)2
for some λ ≥ 0.
10
11 w.a = function(a, s) {
12 w = rep(1, p)
13 a = a / sqrt(sum(a ^ 2))
14 if (sum(a) < s) return(a)
15 p = length(a)
16 lambda = max(a) / 2
17 delta = lambda / 2
18 for (h in 1:10) {
19 for (j in 1:p) w[j] = soft.th(lambda, a[j])
20 ww = sqrt(sum(w ^ 2))
21 if (ww == 0) w = 0 else w = w / ww
22 if (sum(w) > s) lambda = lambda + delta else lambda = lambda -
delta
23 delta = delta / 2
24 }
25 return(w)
26 }
27
28 comp.a = function(X, y) {
29 n = nrow(X); p = ncol(X); a = array(dim = p)
30 for (j in 1:p) {
31 a[j] = 0
32 for (i in 1:n) for (h in 1:n) a[j] = a[j] + (X[i, j] - X[h, j]) ^
2 / n
33 for (k in 1:K) {
34 S = 0
35 index = which(y == k)
36 if (length(index) == 0) break
37 for (i in index) for (h in index) S = S + (X[i, j] - X[h, j]) ^
2
38 a[j] = a[j] - S / length(index)
39 }
40 }
41 return(a)
42 }
1 $w
2 [1] 0.00659343 0.00000000 0.74827158 0.00000000
3 [5] 0.00000000 0.00000000 0.00000000 0.65736124
4 [9] 0.00000000 0.08900768
5 $y
6 [1] 2 2 3 2 2 5 2 5 2 3 1 3 2 4 3 2 1 5 5 4 1 1 2 2
7 [25] 2 3 3 4 2 4 2 2 3 1 1 4 2 3 2 5 2 2 1 1 4 1 4 2
8 [49] 3 1 5 4 5 3 5 4 4 1 3 3 2 3 3 3 3 5 2 1 3 4 3 3
9 [73] 5 4 4 3 5 2 2 5 3 4 3 3 4 1 2 1 3 3 5 5 1 2 4 4
10 [97] 5 4 2 2
7.3 K -Means Clustering 213
Because K -means clustering is not convex, we cannot hope to obtain the optimum
solution by repeating the functions k.means, a.comp, and w.a. However, we can
find the relevant variables for clustering.
1
N
xi − u i 2 + γ wi, j u i − u j 2
2 i=1 i< j
for γ > 0, where wi, j is the constant determined from xi , x j , such as exp(−xi −
x j 22 ). We refer to this clustering method as convex clustering. If u i = u j , we regard
the samples indexed by i, j as being in the same cluster [7].
To solve via the ADMM, for U ∈ R N × p , V ∈ R N ×N × p , and ∈ R N ×N × p , we
define the extended Lagrangian as
1
L ν (U, V, ) := xi − u i 22 + γ wi, j vi, j + λi, j , vi, j − u i + u j
2
i∈V (i, j)∈E (i, j)∈E
ν
+ vi, j − u i + u j 22 , (7.13)
2
(i, j)∈E
Proposition 24 Suppose that the edge set E is {(i, j) | i < j, i, j ∈ V }, i.e., all the
vertices are connected. If we fix V, and denote
yi := xi + (λi, j + νvi, j ) − (λ j,i + νv j,i ) , (7.14)
j:(i, j)∈E j:( j,i)∈E
1
σ (v) + u − v22
2
by proxσ (u), (7.13) is minimized w.r.t. V when
1
vi, j = proxσ · (u i − u j − λi, j ), (7.16)
ν
γ wi, j
where σ := .
ν
Finally, the update of the Lagrange coefficients is as follows:
30 ## Update v
31 update.v = function(u, lambda) {
32 v = array(dim = c(n, n, d))
33 for (i in 1:(n - 1)) for (j in (i + 1):n) {
34 v[i, j, ] = prox(u[i, ] - u[j, ] - lambda[i, j, ] / nu, gamma * w[
i, j] / nu)
35 }
36 return(v)
37 }
38 ## Update lambda
39 update.lambda = function(u, v, lambda) {
40 for (i in 1:(n - 1)) for (j in (i + 1):n) {
41 lambda[i, j, ] = lambda[i, j, ] + nu * (v[i, j, ] - u[i, ] + u[j,
])
42 }
43 return(lambda)
44 }
45 ## Repeats the updates of u,v,lambda for the max_iter times
46 convex.cluster = function() {
47 v = array(rnorm(n * n * d), dim = c(n, n, d))
48 lambda = array(rnorm(n * n * d), dim = c(n, n, d))
49 for (iter in 1:max_iter) {
50 u = update.u(v, lambda); v = update.v(u, lambda); lambda = update.
lambda(u, v, lambda)
51 }
52 return(list(u = u, v = v))
53 }
3
2
2
Second Element
Second Element
1
1
0
0
-1
-1
-2
-2
-3
-3
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3
First Element First Element
Fig. 7.4 Execution of Example 65. We executed convex clustering with γ and dd being (1, 0.5)
and (10, 0.5) for the left and right cases, respectively. Compared with K -means clustering, convex
clustering often constructs clusters with distant samples (left)
216 7 Multivariate Analysis
vertices without realizing the minimum intracluster distribution and the maximum
intercluster distribution of the cluster samples.
1 ## Data Generation
2 n = 50; d = 2; x = matrix(rnorm(n * d), n, d)
3 ## Convex Clustering
4 w = ww(x, 1, dd = 0.5)
5 gamma=1 # gamma = 10
6 nu = 1; max_iter = 1000; v = convex.cluster()$v
7 ## Adjacency Matrix
8 a = array(0, dim = c(n, n))
9 for (i in 1:(n - 1)) for (j in (i + 1):n) {
10 if (sqrt(sum(v[i, j, ] ^ 2)) < 1 / 10 ^ 4) {a[i, j] = 1; a[j, i] =
1}
11 }
12 ## Display Figure
13 k = 0
14 y = rep(0, n)
15 for (i in 1:n) {
16 if (y[i] == 0) {
17 k = k + 1
18 y[i] = k
19 if (i < n) for (j in (i + 1):n) if (a[i, j] == 1) y[j] = k
20 }
21 }
22 plot(0, xlim = c(-3, 3), ylim = c(-3, 3), type = "n", main = "gamma =
10")
23 points(x[, 1], x[, 2], col = y + 1)
The clustering will not be uniquely determined even if the cluster size is given, but
the cluster size will be biased. Therefore, we use the parameter dd to limit the size of
a particular cluster. Therefore, this method guarantees the convexity and efficiency
of the calculation, but considering the tuning effort and clustering criteria, it is not
always the best method.
Additionally, as for the sparse K -means clustering we considered in Sect. 7.3,
for the convex clustering in this section, sparse convex clustering that penalizes
extra variables is proposed in (7.13). We consider the following method to be sparse
convex clustering (B. Wang et al., 2018) [30]. The purpose is to remove unnecessary
variables in the clustering, make its interpretation easier, and save the computation.
Let x ( j) , u ( j) ∈ R N ( j = 1, . . . , p) be the column vectors of X, U ∈ R N × p .
We rewrite
p the first term of (7.13) using the column vectors and add the penalty
( j)
term j=1 r j u 2 for unnecessary variables. For U ∈ R N × p , V ∈ R N ×N × p ,
N ×N × p
∈R , we define the extended Lagrangian for the ADMM as
1 ( j)
p p
L ν (U, V, ) := x − u ( j) 22 + γ1 wi,k vi,k + γ2 r j u ( j) 2
2 j=1 (i,k)∈E j=1
ν
+ λi,k , vi,k − u i + u k + vi,k − u i + u k 22 ,
(i,k)∈E
2 (i,k)∈E
(7.18)
7.4 Convex Clustering 217
where r j is the weight for the penalty and u 1 , . . . , u N and u (1) , . . . , u ( p) are the row
and column vectors of U ∈ R N × p , respectively.
The optimization w.r.t. U when V, are fixed is given as follows.
1 −1 ( j)
G y − Gu ( j) 22 + γ2 r j u ( j) 2
2
Proposition 23 Let s > 0 and a ∈ R p with nonnegative elements. Then, the w for
w2 = 1 and w1 ≤ s that maximizes w T a can be written as
Sλ (a)
w=
Sλ (a)2
Appendix: Proof of Proposition 219
gamma = 1 gamma = 10
2
2
1
1
Coefficients
Coefficient
0
0
-1
-1
-2
-2
2 4 6 8 10 2 4 6 8 10
gamma.2 gamma.2
Fig. 7.5 Fixing γ1 = 1 (left) and γ1 = 100 (right), changing the value of γ2 , we observed the
changes in the value of u i, j for a specific sample (i = 5). For γ = 10, the absolute value of the
coefficient is reduced compared to that for γ = 1
for some λ ≥ 0.
−a j +λ1 w j + λ2 − μ j = 0 (7.19)
μjwj = 0 (7.20)
wj ≥ 0 (7.21)
for j = 1, . . . , p and
wT w ≤ 1 (7.22)
p
wj ≤ s (7.23)
j=1
λ1 (w T w − 1) = 0 (7.24)
⎛ ⎞
p
λ2 ⎝ w j − s⎠ = 0 . (7.25)
j=1
220 7 Multivariate Analysis
Proposition 24 Suppose that the edge set E is {(i, j) | i < j, i, j ∈ V }, i.e., all the
vertices are connected. If we fix V, and denote
yi := xi + (λi, j + νvi, j ) − (λ j,i + νv j,i ) (7.14)
j:(i, j)∈E j:( j,i)∈E
If we differentiate f (U ) by u i , we have
−(xi − u i ) − (λi, j + νvi, j − ν(u i − u j )) + (λi, j + νvi, j − ν(u j − u i )) = 0 ,
j∈V :(i, j)∈E j∈V :( j,i)∈E
which is
ui − ν (u j − u i ) = xi + (λi, j + νvi, j ) − (λi, j + νvi, j ) .
j=i j:(i, j)∈E j:( j,i)∈E
Thus, we have
(N ν + 1)u i − ν u j = yi . (7.27)
j∈V
then from (7.27)(7.28) and j∈V yj = j∈V xj,
yi + ν j∈V xj
ui =
1 + Nν
is the optimum.
1 −1 ( j)
G y − Gu ( j) 22 + γ2 r j u ( j) 2
2
Proof If we differentiate all the terms except the third term in (7.18) by the j-th
element of u i (the i-th element of a j ), then from (7.27), we have
(N ν + 1)u i, j − ν u k, j − yi, j , (7.29)
k∈V
where
yi, j := xi, j + (λi,k, j + νvi,k, j ) − (λi,k, j + νvi,k, j ) .
k:(i,k)∈E k:(k,i)∈E
1 −1 ( j)
G y − Gu ( j) 22 + γ2 r j u ( j) 2 (7.30)
2
for each j = 1, . . . , p.
Exercises 88–100
N
In the following, it is assumed that X ∈ R N × p is centered, i.e., X = (xi, j ), i=1 xi, j =
0 ( j = 1, . . . , p), and xi represents a row vector.
88. V1 ∈ R p , which has a length of one and maximizes X v, is called the first
principal component vector. In a vector that is orthogonal from the first principal
component vector to the i − 1 principal component vector and has a length of
one, the Vi ∈ R p that maximizes X vi is called the i principal component
vector. We wish to find it for i = 1, . . . , m (m ≤ p).
(a) Show that the first principal component vector is the eigenvector of :=
X T X/N .
(b) Show that eigenvectors are orthogonal when the eigenvalues are distinct.
Moreover, show that eigenvectors can be orthogonalized. If the eigenvalues
are the same, what should we do?
(c) To obtain the first m principal component vectors, we choose the ones with
the largest m eigenvalues and normalize them. Why can we ignore the con-
dition “orthogonal from the first principal component vector to the i − 1
principal component vector”?
89. For the principal component vector Vm := [v1 , . . . , vm ] defined in Exercise 88,
show that (I − Vm VmT )2 = I − Vm VmT . In addition, show that the Vm defined
in Exercise 88 and the Vm ∈ R N timesm that minimizes the reconstruction error
N
i=1 x i (I − Vm Vm )x i coincide.
T T
90. When we find v ∈ R for v2 = 1 that maximizes X v2 , we may restrict the
p
for u ∈ R N .
Hint Because
⎡ 2 L is ⎤a bivariate function of u, v, we have
∂ L ∂2 L
⎢ ∂u 2 ∂u∂v ⎥ μ −X
∇2 L = ⎢ ⎥
⎣ ∂ 2 L ∂ 2 L ⎦ = −X δ . If the determinant μδ − X is
2
∂u∂v ∂v 2
negative, it contains a negative eigenvalue.
(X v)T (X v )
(b) If we define f (v) := −X v2 + λv1 , (v, v ) := − +
X v 2
λv1 , then show that majorizes f at v using Schwartz’s inequality.
(c) Suppose that we choose v 0 arbitrarily and generate v 0 , v 1 , . . . via v t+1 :=
arg maxv2 =1 (v, v t ). Express the right-hand side value as X, v, λ. Show
Sλ (X T u)
also that it coincides with .
Sλ (X T u)2
93. For the procedures below (SCoTLASS), fill in the blank, and execute for the
sample.
1 soft.th = function(lambda, z) return(sign(z) * pmax(abs(z) -
lambda, 0))
2 ## Even if z is a vector, soft.th works
3 SCoTLASS = function(lambda, X) {
4 n = nrow(X); p = ncol(X); v = rnorm(p); v = v / norm(v, "2")
5 for (k in 1:200) {
6 u = X %*% v; u = u / norm(u, "2"); v = ## Blank ##
7 v = soft.th(lambda, v); size = norm(v, "2")
8 if (size > 0) v = v / size else break
9 }
10 if (norm(v, "2") == 0) print("All the elements of v are zeros");
return(v)
11 }
12 ## Sample
Exercises 88–100 225
94. We formulate a sparse PCA problem different from SCoTLASS (SPCA, sparse
principal component analysis):
1
N
min xi − xi vu T 22 + λ1 v1 + λ2 v22 (cf. 7.8)
u,v∈R p ,u2 =1 N i=1
(7.34)
For this method, when we fix u and optimize w.r.t. v, we can apply the elastic
net procedure. Show each of the following.
(a) The function (7.34) is biconvex w.r.t. u, v.
Hint When we consider the constraint u2 = 1, (7.34) can be written as
1 2 T T 1 T T
N N N
L := xi xiT − u xi xi v + v xi xi v + λ1 v1 + λ2 v2 + μ(u T u − 1) .
N N N
i=1 i=1 i=1
XT z
u= ,
X T z2
where z 1 = v T x1 , . . . , z N = v T x N .
Hint Differentiating L w.r.t. u j , show that the differentiated vector w.r.t.
2 T
N
j = 1, . . . , p satisfies − x xi v + 2μu = 0.
N i=1 i
95. We construct the SPCA process using the R language. The process is repeated to
observe how each element of the vector v changes with the number of iterations.
Fill in the blanks, execute the process, and display the output as a graph.
226 7 Multivariate Analysis
1 ## Data Generation
2 n = 100; p = 5; x = matrix(rnorm(n * p), ncol = p)
3 ## Compute u,v
4 lambda = 0.001; m = 100
5 g = array(dim = c(m, p))
6 for (j in 1:p) x[, j] = x[, j] - mean(x[, j])
7 for (j in 1:p) x[, j] = x[, j] / sqrt(sum(x[, j] ^ 2))
8 r = rep(0, n)
9 v = rnorm(p)
10 for (h in 1:m) {
11 z = x %*% v
12 u = as.vector(t(x) %*% z)
13 if (sum(u ^ 2) > 0.00001) u = u / sqrt(sum(u ^ 2))
14 for (k in 1:p) {
15 for (i in 1:n) r[i] = sum(u * x[i, ]) - sum(u ^ 2) * sum(x[i,
-k] * v[-k])
16 S = sum(x[, k] * r) / n
17 v[k] = ## Blank (1) ##
18 }
19 if (sum(v ^ 2) > 0.00001) v = ## Blank (2) ##
20 g[h, ] = v
21 }
22 ## Display Graph
23 g.max = max(g); g.min = min(g)
24 plot(1:m, ylim = c(g.min, g.max), type = "n",
25 xlab = "# Repetitions", ylab = "Each element of v", main = "
lambda = 0.001")
26 for (j in 1:p) lines(1:m, g[, j], col = j + 1)
96. It is not easy to perform sparse PCA on general multiple components instead
of the first principal component, and the following is required for SCoTLASS:
maximize u kT X vk w.r.t. u k , vk under
vk 2 ≤ 1 , vk 1 ≤ c , u k 2 ≤ 1 , u kT u j = 0 ( j = 1, . . . , k − 1) .
⊥
k−1
where Pk−1 := I − i=1 u i u iT .
⊥
k−1
Hint When u 1 , . . . , u k−1 are given, show that u j Pk−1 X vk = u j (I − i=1 ui
u iT )X vk = 0 ( j = 1, . . . , k − 1) and that the u k that maximizes L = u kT X vk −
μ(u kT u k − 1) is u k = X vk /(u kT X vk ).
97. K -means clustering is a problem of finding the disjoint subsets C1 , . . . , C K of
{1, . . . , N } that minimize
K
xi − x̄k 22
k=1 i∈Ck
Exercises 88–100 227
from x1 , . . . , x N ∈ R p and the positive integer K , where x̄k is the arithmetic mean
of the samples whose indices are in Ck . Witten–Tibshirani (2010) proposed the
formulation that maximizes the weighted sum
⎧ ⎫
p ⎨1
N
N K
1 2 ⎬
wh di,2 j,h − d (7.35)
⎩N Nk i, j∈C i, j,h ⎭
h=1 i=1 j=1 k=1 k
(a) The following is a general K -means clustering procedure. Fill in the blanks,
and execute it.
1 k.means = function(X, K, weights = w) {
2 n = nrow(X); p = ncol(X)
3 y = sample(1:K, n, replace = TRUE)
4 center = array(dim = c(K, p))
5 for (h in 1:10) {
6 for (k in 1:K) {
7 if (sum(y[] == k) == 0) center[k, ] = Inf else
8 for (j in 1:p) center[k, j] = ## Blank (1) ##
9 }
10 for (i in 1:n) {
11 S.min = Inf
12 for (k in 1:K) {
13 if (center[k, 1] == Inf) break
14 S = sum((X[i, ] - center[k, ]) ^ 2 * w)
15 if (S < S.min) {S.min = S; ## Blank(2) ##}
16 }
17 }
18 }
19 return(y)
20 }
21 ## Data Generation
22 K = 10; p = 2; n = 1000; X = matrix(rnorm(p * n), nrow = n,
ncol = p)
23 w = c(1, 1); y = k.means(X, K, w)
24 ## Display Output
25 plot(-3:3, -3:3, xlab = "x", ylab = "y", type = "n")
26 points(X[, 1], X[, 2], col = y + 1)
p
(b) To obtain w1 , . . . , w p that maximizes j=1 w j a j with
1 1
N N K
a j (C1 , . . . , C K ) := (xi, j − xi ,j)
2
− (xi, j − xi ,j)
2
(cf. (7.11))
N Nk
i=1 i =1 k=1 i∈Ck i ∈Ck
p
given C1 , . . . , C K , we find the value of w that maximizes h=1 wh ah under
w2 = 1, w1 ≤ s, and w ≥ 0 given a ∈ R p with nonnegative elements.
We define λ > 0 as zero and w such that w1 = s for the cases where
w1 < s and w1 = s. Suppose that a ∈ R p has nonnegative elements.
228 7 Multivariate Analysis
98. The following is a sparse K -means procedure based on Exercise 97, to which
the functions below have been added. Fill in the blanks, and execute it.
1 sparse.k.means = function(X, K, s) {
2 p = ncol(X); w = rep(1, p)
3 for (h in 1:10) {
4 y = k.means(## Blank (1) ##)
5 a = comp.a(## Blank (2) ##)
6 w = w.a(## Blank (3) ##)
7 }
8 return(list(w = w, y = y))
9 }
10 comp.a = function(X, y) {
11 n = nrow(X); p = ncol(X); a = array(dim = p)
12 for (j in 1:p) {
13 a[j] = 0
14 for (i in 1:n) for (h in 1:n) a[j] = a[j] + (X[i, j] - X[h, j
]) ^ 2 / n
15 for (k in 1:K) {
16 S = 0
17 index = which(y == k)
18 if (length(index) == 0) break
19 for (i in index) for (h in index) S = S + (X[i, j] - X[h, j
]) ^ 2
20 a[j] = a[j] - S / length(index)
21 }
22 }
23 return(a)
24 }
Exercises 88–100 229
1
L ν (U, V, ) := xi − u i 22 + γ wi, j vi, j + λi, j , vi, j − u i + u j
2
i∈V (i, j)∈E (i, j)∈E
ν
+ vi, j − u i + u j 22 (cf. (7.13))
2
(i, j)∈E
26 }
27 update.v = function(u, lambda) {
28 v = array(dim = c(n, n, d))
29 for (i in 1:(n - 1)) for (j in (i + 1):n) {
30 v[i, j, ] = prox(u[i, ] - u[j, ] - lambda[i, j, ] / nu, gamma
* w[i, j] / nu)
31 }
32 return(v)
33 }
34 update.lambda = function(u, v, lambda) {
35 for (i in 1:(n - 1)) for (j in (i + 1):n) {
36 lambda[i, j, ] = lambda[i, j, ] + nu * (v[i, j, ] - u[i, ] + u
[j, ])
37 }
38 return(lambda)
39 }
40 ## Repeat the updates of u,v,lambda for max_iter times
41 convex.cluster = function() {
42 v = array(rnorm(n * n * d), dim = c(n, n, d))
43 lambda = array(rnorm(n * n * d), dim = c(n, n, d))
44 for (iter in 1:max_iter) {
45 u = ## Blank (1) ##
46 v = ## Blank (2) ##
47 lambda = ## Blank (3) ##
48 }
49 return(list(u = u, v = v))
50 }
51 ## Data Generation
52 n = 50; d = 2; x = matrix(rnorm(n * d), n, d)
53 ## Convex Clustering
54 w = ww(x, 1, dd = 1); gamma = 10; nu = 1; max_iter = 1000; v =
convex.cluster()$v
55 ## Adjacency Matrix
56 a = array(0, dim = c(n, n))
57 for (i in 1:(n - 1)) for (j in (i + 1):n) {
58 if (sqrt(sum(v[i, j, ] ^ 2)) < 1 / 10 ^ 4) {a[i, j] = 1; a[j, i]
= 1}
59 }
60 ## Display
61 k = 0
62 y = rep(0, n)
63 for (i in 1:n) {
64 if (y[i] == 0) {
65 k = k + 1
66 y[i] = k
67 if (i < n) for (j in (i + 1):n) if (a[i, j] == 1) y[j] = k
68 }
69 }
70 plot(0, xlim = c(-3, 3), ylim = c(-3, 3), type = "n", main = "
gamma = 10")
71 points(x[, 1], x[, 2], col = y + 1)
1 ( j)
p p
L ν (U, V, ) := x − u ( j) 22 + γ1 wi,k vi,k + γ2 r j u ( j) 2
2
j=1 (i,k)∈E j=1
ν
+ λi,k , vi,k − u i + u k + vi,k − u i + u k 22 (cf. (7.18))
2
(i,k)∈E (i,k)∈E
1. Alizadeh, A., Eisen, M., Davis, R.E., Ma, C., Lossos, I., Rosenwal, A., Boldrick, J., Sabet, H.,
Tran, T., Yu, X., Pwellm, J., Marti, G., Moore, T., Hudsom, J., Lu, L., Lewis, D., Tibshirani,
R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, K., Levy, R., Wilson,
W., Greve, M., Byrd, J., Botstein, D., Brown, P., Staudt, L.: Identification of molecularly and
clinically distinct subtypes of diffuse large B cell lymphoma by gene expression profiling.
Nature 403, 503–511 (2000)
2. Arnold, T., Tibshirani, R.: genlasso: path algorithm for generalized lasso problems. R package
version 1.5
3. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM J. Imaging Sci. 2, 183–202 (2009)
4. Bertsekas, D.: Convex Analysis and Optimization. Athena Scientific, Nashua (2003)
5. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical
learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1),
1–124 (2011)
6. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge
(2004)
7. Chi, E.C., Lange, K.: Splitting methods for convex clustering. J. Comput. Graph. Stat. (online
access) (2014)
8. Danaher, P., Witten, D.: The joint graphical lasso for inverse covariance estimation across
multiple classes. J. R. Stat. Soc. Ser. B 76(2), 373–397 (2014)
9. Efron, B., Hastie, T., Johnstone, I., Tibshirani, B.: Least angle regression. Ann. Stat. 32(2),
407–499 (2004)
10. Friedman, J., Hastie, T., Hoefling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann.
Appl. Stat. 1(2), 302–332 (2007)
11. Friedman, J., Hastie, T., Simon, N., Tibshirani, R.: glmnet: lasso and elastic-net regularized
generalized linear models. R package version 4.0 (2015)
12. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical
lasso. Biostatistics 9, 432–441 (2008)
13. Jacob, L., Obozinski, G., Vert, J.-P.: Group lasso with overlap and graph lasso. In: Proceeding
of the 26th International Conference on Machine Learning, Montreal, Canada (2009)
14. Johnson, N.: A dynamic programming algorithm for the fused lasso and ‘0-segmentation. J.
Comput. Graph. Stat. 22(2), 246–260 (2013)
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 233
J. Suzuki, Sparse Estimation with Math and R,
https://doi.org/10.1007/978-981-16-1446-0
234 Bibliography
15. Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based
on the lasso. J. Comput. Graph. Stat. 12, 531–547 (2003)
16. Suzuki, J.: Statistical Learning with Math and R. Springer, Berlin (2020)
17. Kawano, S., Matsui,H., Hirose, K.: Statistical Modeling for Sparse Estimation. Kyoritsu-
Shuppan, Tokyo (2018) (in Japanese)
18. Lauritzen, S.L.: Graphical Models. Oxford University Press, Oxford (1996)
19. Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large
incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
20. Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the lasso.
Ann. Stat. 34, 14361462 (2006)
21. Mota, J., Xavier, J., Aguiar, P., Püschel, M.: A proof of convergence for the alternating direction
method of multipliers applied to polyhedral-constrained functions. Optimization and Control,
Mathematics arXiv (2011)
22. Nesterov, Y.: Gradient methods for minimizing composite objective function. Technical Report
76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain
(UCL) (2007)
23. Ravikumar, P., Liu, H., Lafferty, J., Wasserman, L.: Sparse additive models. J. R. Stat. Soc.
Ser. B 71(5), 1009–1030 (2009)
24. Ravikumar, P., Wainwright, M.J., Raskutti, G., Yu, B.: High-dimensional covariance estimation
by minimizing 1-penalized logdeterminant divergence. Electron. J. Stat. 5, 935–980 (2011)
25. Simon, N., Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for Cox’s proportional
hazards model via coordinate descent. J. Stat. Softw. 39(5), 1–13 (2011)
26. Simon, N., Friedman, J., Hastie, T., Tibshirani, R.: A sparse-group lasso. J. Comput. Graph.
Stat. 22(2), 231–245 (2013)
27. Simon, N., Friedman, J., Hastie, T.: A blockwise descent algorithm for group-penalized mul-
tiresponse and multinomial regression. Computation, Mathematics arXiv (2013)
28. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58,
267–288 (1996)
29. Tibshirani, R., Taylor, J.: The solution path of the generalized lasso. Ann. Stat. 39(3), 1335–
1371 (2011)
30. Wang, B., Zhang, Y., Sun, W., Fang, Y.: Sparse convex clustering. J. Comput. Graph. Stat.
27(2), 393–403 (2018)
31. Witten, D., Tibshirani, R.: A framework for feature selection in clustering. J. Am. Stat. Assoc.
105(490), 713–726 (2010)
32. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph.
Stat. 15(2), 265–286 (2006)