
Joe Suzuki

Sparse Estimation with Math and R
100 Exercises for Building Logic

Graduate School of Engineering Science
Osaka University
Toyonaka, Osaka, Japan

ISBN 978-981-16-1445-3 ISBN 978-981-16-1446-0 (eBook)


https://doi.org/10.1007/978-981-16-1446-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

I started considering sparse estimation problems around 2017, when I moved from the mathematics department to the statistics department at Osaka University, Japan. I have been studying information theory and graphical models for over 30 years.
The first book I found was "Statistical Learning with Sparsity" by T. Hastie, R. Tibshirani, and M. Wainwright. I thought it was a monograph rather than a textbook and that it would be tough for a nonexpert to read through. However, I downloaded more than 50 papers that were cited in the book and read them all. In fact, the book does not instruct anything but only suggests how to study sparsity. The contrast between statistics and convex optimization gradually attracted me as I came to understand the material.
On the other hand, it seems that the core research results on sparsity appeared around 2010–2015. However, I still think further possibilities and expansions remain. This book contains all the mathematical derivations and source programs, so graduate students can construct any procedure from scratch with the help of this book.
Recently, I published the books "Statistical Learning with Math and R" (SLMR) and "Statistical Learning with Math and Python" (SLMP) and will publish "Sparse Estimation with Math and Python" (SEMP). A common idea lies behind these books (XXMR/XXMP). They not only convey knowledge of statistical learning and sparse estimation but also help build logic in your brain by following each step of the derivations and each line of the source programs. I often meet data scientists engaged in machine learning and statistical analyses for research collaborations and introduce my students to them. I recently found out that almost all of them think that (mathematical) logic, rather than knowledge and experience, is the most crucial ability for grasping the essence of their jobs. The knowledge we need changes every day and can be obtained when needed. However, logic allows us to examine whether each item on the Internet is correct and to follow any changes; without it, we might even miss opportunities.


What Makes SEMR Unique?

I have summarized the features of this book as follows:

1. Developing logic
To grasp the essence of the subject, we mathematically formulate and solve each
ML problem and build those programs. The SEMR instills “logic” in the minds
of the readers. The reader will acquire both the knowledge and ideas of ML, so
that even if new technology emerges, they will be able to follow the changes
smoothly. After solving the 100 problems, most of the students would say “I
learned a lot”.
2. Not just a story
If programming codes are available, you can immediately take action. It is
unfortunate when an ML book does not offer the source codes. Even if a package
is available, if we cannot see the inner workings of the programs, all we can do
is input data into those programs. In SEMR, the program codes are available
for most of the procedures. In cases where the reader does not understand the
math, the codes will help them understand what it means.
3. Not just a how-to book: an academic book written by a university professor.
   A book that merely explains how to use a package and provides examples of executions may help those who are not familiar with it. Still, because only the inputs and outputs are visible, we can only see the procedure as a black box. In this sense, the reader will have limited satisfaction because they will not be able to obtain the essence of the subject. SEMR intends to show the reader the heart of ML and is more of a full-fledged academic book.
4. Solve 100 exercises: problems are improved with feedback from university
students
The exercises in this book have been used in university lectures and have been
refined based on feedback from students. The best 100 problems were selected.
Each chapter (except the exercises) explains the solutions, and you can solve
all of the exercises by reading the book.
5. Self-contained
All of us have been discouraged by phrases such as “for the details, please refer
to the literature XX”. Unless you are an enthusiastic reader or researcher, nobody
will seek out those references. In this book, we have presented the material in such a way that consulting external references is not required. Additionally, the proofs are simple derivations, and the complicated proofs are given in the appendices at the end of each chapter. SEMR completes all discussions, including the appendices.
6. Readers’ pages: questions, discussion, and program files. The reader can ask
any question on the book via https://bayesnet.org/books.

Osaka, Japan Joe Suzuki


March 2021

Acknowledgments The author wishes to thank Tianle Yang, Ryosuke Shinmura, and Tomohiro
Kamei for checking the manuscript in Japanese. This English book is largely based on the Japanese
book published by Kyoritsu Shuppan Co., Ltd. in 2020. The author would like to thank Kyoritsu
Shuppan Co., Ltd., in particular, its editorial members Mr. Tetsuya Ishii and Ms. Saki Otani. The
author also appreciates Ms. Mio Sugino, Springer, for preparing the publication and providing
advice on the manuscript.
Contents

1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Subderivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Ridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 A Comparison Between Lasso and Ridge . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7 About How to Set the Value of λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Exercises 1–20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Generalized Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.1 Generalization of Lasso in Linear Regression . . . . . . . . . . . . . . . . . . . 37
2.2 Logistic Regression for Binary Values . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3 Logistic Regression for Multiple Values . . . . . . . . . . . . . . . . . . . . . . . 46
2.4 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Appendix: Proof of Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Exercises 21–34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1 When One Group Exists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2 Proximal Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.4 Sparse Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.5 Overlap Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.6 Group Lasso with Multiple Responses . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.7 Group Lasso via Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.8 Group Lasso for the Generalized Additive Models . . . . . . . . . . . . . . . 96
Appendix: Proof of Proposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Exercises 35–46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100


4 Fused Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


4.1 Applications of Fused Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2 Solving Fused Lasso via Dynamic Programming . . . . . . . . . . . . . . . . 114
4.3 LARS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.4 Dual Lasso Problem and Generalized Lasso . . . . . . . . . . . . . . . . . . . . 119
4.5 ADMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Appendix: Proof of Proposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Exercises 47–61 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.1 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.2 Graphical Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.3 Estimation of the Graphical Model based on the Quasi-likelihood . . . . . . . . . . . . . . . . . . . . . 158
5.4 Joint Graphical Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Appendix: Proof of Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Exercises 62–75 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6 Matrix Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.1 Singular Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.2 Eckart-Young’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.3 Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.4 Sparse Estimation for Low-Rank Estimations . . . . . . . . . . . . . . . . . . . 190
Appendix: Proof of Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Exercises 76–87 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7 Multivariate Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.1 Principal Component Analysis (1): SCoTLASS . . . . . . . . . . . . . . . . . 201
7.2 Principal Component Analysis (2): SPCA . . . . . . . . . . . . . . . . . . . . . . 206
7.3 K -Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
7.4 Convex Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Appendix: Proof of Proposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Exercises 88–100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Chapter 1
Linear Regression

In general statistics, we often assume that the number of samples N is greater than
the number of variables p. If this is not the case, it may not be possible to solve
for the best-fitting regression coefficients using the least squares method, or it is
too computationally costly to compare a total of 2^p models using some information
criterion.
When p is greater than N (also known as the sparse situation), even for linear
regression, it is more common to minimize, instead of the usual squared error, the
modified objective function to which a term is added to prevent the coefficients from
being too large (the so-called regularization term). If the regularization term is a
constant λ times the L1-norm (resp. L2-norm) of the coefficient vector, it is called
Lasso (resp. Ridge). In the case of Lasso, if the value of λ increases, there will be
more coefficients that go to 0, and when λ reaches a certain value, all the coefficients
will eventually become 0. In that sense, we can say that Lasso also plays a role in
model selection.
In this chapter, we examine the properties of Lasso in comparison to those of
Ridge. After that, we investigate the elastic net, a regularized regression method that
combines the advantages of both Ridge and Lasso. Finally, we consider how to select
an appropriate value of λ.

1.1 Linear Regression

Throughout this chapter, let N ≥ 1 and p ≥ 1 be integers, and let the (i, j) element of the matrix X ∈ R^{N×p} and the k-th element of the vector y ∈ R^N be denoted by x_{i,j} and y_k, respectively. Using these X, y, we find the intercept β₀ ∈ R and the slope β = [β₁, ..., β_p]^T that minimize ‖y − β₀ − Xβ‖². Here, the L2-norm of z = [z₁, ..., z_N]^T is denoted by

$$\|z\| := \sqrt{\sum_{i=1}^N z_i^2}.$$


First, for the sake of simplicity, we assume that the j-th column of X (j = 1, ..., p) and y have already been centered. That is, for each j = 1, ..., p, define $\bar{x}_j := \frac{1}{N}\sum_{i=1}^N x_{i,j}$, and assume that $\bar{x}_j$ has already been subtracted from each $x_{i,j}$ so that $\bar{x}_j = 0$ is satisfied. Similarly, defining $\bar{y} := \frac{1}{N}\sum_{i=1}^N y_i$, we assume that $\bar{y}$ was subtracted in advance from each $y_i$ so that $\bar{y} = 0$ holds. Under this condition, one of the parameters $(\hat{\beta}_0, \hat{\beta})$ for which we need to solve, namely $\hat{\beta}_0$, is always 0. In particular,

$$0 = \frac{\partial}{\partial \beta_0}\sum_{i=1}^N\Bigl(y_i - \beta_0 - \sum_{j=1}^p x_{i,j}\beta_j\Bigr)^2 = -2N\Bigl(\bar{y} - \beta_0 - \sum_{j=1}^p \bar{x}_j\beta_j\Bigr) = 2N\beta_0$$

holds. Thus, from now on, without loss of generality, we may assume that the intercept β₀ is zero and use this in our further calculations.
We begin by first observing the following equality:
$$\begin{bmatrix} \dfrac{\partial}{\partial\beta_1}\sum_{i=1}^N\Bigl(y_i-\sum_{k=1}^p\beta_k x_{i,k}\Bigr)^2\\ \vdots\\ \dfrac{\partial}{\partial\beta_p}\sum_{i=1}^N\Bigl(y_i-\sum_{k=1}^p\beta_k x_{i,k}\Bigr)^2 \end{bmatrix} = -2\begin{bmatrix} x_{1,1}&\cdots&x_{N,1}\\ \vdots&\ddots&\vdots\\ x_{1,p}&\cdots&x_{N,p} \end{bmatrix} \begin{bmatrix} y_1-\sum_{k=1}^p\beta_k x_{1,k}\\ \vdots\\ y_N-\sum_{k=1}^p\beta_k x_{N,k} \end{bmatrix} = -2X^T(y-X\beta). \qquad (1.1)$$

In particular, the j-th element of each side can be rewritten as

$$-2\sum_{i=1}^N x_{i,j}\Bigl(y_i - \sum_{k=1}^p\beta_k x_{i,k}\Bigr).$$

Thus, when we set the right-hand side of (1.1) equal to 0, and if X^T X is invertible, then β becomes

$$\hat{\beta} = (X^TX)^{-1}X^Ty. \qquad (1.2)$$

For the case where p = 1, writing the elements of the single column of X as x₁, ..., x_N, we see that

$$\hat{\beta} = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2}. \qquad (1.3)$$

If we had not performed the data centering, we would still obtain the same slope β̂, though the intercept β̂₀ would be

$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^p \bar{x}_j\hat{\beta}_j. \qquad (1.4)$$

Here, x̄ j ( j = 1, . . . , p) and ȳ are the means before data centering.


We can implement the above using the R language as follows:
inner.prod = function(x, y) return(sum(x * y))

linear = function(X, y) {
  n = nrow(X); p = ncol(X)
  X = as.matrix(X); x.bar = array(dim = p)
  for (j in 1:p) x.bar[j] = mean(X[, j])
  for (j in 1:p) X[, j] = X[, j] - x.bar[j]          ## mean-centering for X
  y = as.vector(y); y.bar = mean(y); y = y - y.bar   ## mean-centering for y
  beta = as.vector(solve(t(X) %*% X) %*% t(X) %*% y)
  beta.0 = y.bar - sum(x.bar * beta)
  return(list(beta = beta, beta.0 = beta.0))
}

In this book, we focus more on the sparse case, i.e., when p is larger than N. In this case, a problem arises. When N < p, the matrix X^T X does not have an inverse. In fact, since

$$\mathrm{rank}(X^TX) \le \mathrm{rank}(X) \le \min\{N, p\} = N < p,$$

X^T X is singular. Moreover, when X contains two identical columns, rank(X) < p, and the inverse matrix does not exist.
On the other hand, if p is rather large, we must choose which of the p explanatory variables to include as predictors when carrying out model selection. Thus, the candidate subsets are

{}, {1}, {2}, ..., {1, ..., p}

(that is, whether we choose each of the variables or not). This means we have to compare a total of 2^p models. Then, to extract the proper model combination by using an information criterion or cross-validation, the computational resources required will grow exponentially with the number of variables p.
To deal with this kind of problem, let us consider the following. Let λ ≥ 0 be a constant. We add to ‖y − Xβ‖² a term that penalizes β for being too large in size. Specifically, we define

$$L := \frac{1}{2N}\|y - X\beta\|^2 + \lambda\|\beta\|_1 \qquad (1.5)$$

or

$$L := \frac{1}{N}\|y - X\beta\|^2 + \lambda\|\beta\|_2^2. \qquad (1.6)$$

Our task now is to find the minimizer β ∈ R^p of one of the above quantities. Here, for β = [β₁, ..., β_p]^T, $\|\beta\|_1 := \sum_{j=1}^p |\beta_j|$ is the L1-norm and $\|\beta\|_2 := \sqrt{\sum_{j=1}^p \beta_j^2}$ is the L2-norm of β. To be more specific, we plug mean-centered X ∈ R^{N×p} and y ∈ R^N into either (1.5) or (1.6), then minimize it with respect to the slope β ∈ R^p (this is called Lasso or Ridge, respectively), and finally we use (1.4) to compute β̂₀.

1.2 Subderivative

To address the minimization problem for Lasso, we need a method for optimizing
functions that are not differentiable. When we want to find the points x of the maxima or minima of a single-variable polynomial, say, f(x) = x³ − 2x + 1, we can differentiate it and find the solutions to f′(x) = 0. However, what should we do when
we encounter functions such as f (x) = x 2 + x + 2|x|, which contain an absolute
value? To address this, we need to extend our concept of differentiation to a more
general one.
Throughout the following claims, let us assume that f is convex [4, 6]. In general,
we say that f is convex (downward)1 if, for any 0 < α < 1 and x, y ∈ R,

f (αx + (1 − α)y) ≤ α f (x) + (1 − α) f (y)

holds. For instance, f (x) = |x| is convex (Fig. 1.1, left) because

|αx + (1 − α)y| ≤ α|x| + (1 − α)|y|

is satisfied. To check this, since both sides are nonnegative, the RHS squared minus the LHS squared gives 2α(1 − α)(|xy| − xy) ≥ 0. As another example, consider

$$f(x) = \begin{cases} 1, & x \ne 0\\ 0, & x = 0. \end{cases} \qquad (1.7)$$

This function satisfies the following:

f (α · 0 + (1 − α) · 1) = 1 > 1 − α = α f (0) + (1 − α) f (1)

Therefore, it is not convex (Fig. 1.1, right). If functions f, g are convex, then for any β, γ ≥ 0, the function βf(x) + γg(x) is also convex, since the following holds:

$$\beta f(\alpha x + (1-\alpha)y) + \gamma g(\alpha x + (1-\alpha)y) \le \alpha\beta f(x) + (1-\alpha)\beta f(y) + \alpha\gamma g(x) + (1-\alpha)\gamma g(y) = \alpha\{\beta f(x)+\gamma g(x)\} + (1-\alpha)\{\beta f(y)+\gamma g(y)\}.$$

1 Throughout this book, when we refer to convex, we mean convex downward.


Fig. 1.1 Left: f(x) = |x| is convex; however, at the origin, the derivatives from each side differ, and thus it is not differentiable there. Right: f(x) = 1 (x ≠ 0), f(0) = 0; we cannot simply judge from its shape, but this function is not convex

Next, for any convex function f : R → R, fix x₀ ∈ R arbitrarily. We call the set of all z ∈ R that satisfy

$$f(x) \ge f(x_0) + z(x - x_0) \quad \text{for all } x \in \mathbb{R} \qquad (1.8)$$

the subderivative of f at x₀.
If f is differentiable at x₀, then the subderivative is a set that contains only one element, namely f′(x₀).² We prove this as follows.
First, we show that if the convex function f is differentiable at x₀, then it satisfies f(x) ≥ f(x₀) + f′(x₀)(x − x₀). To see this, since f is convex,

f (αx + (1 − α)x0 ) ≤ α f (x) + (1 − α) f (x0 ).

This can be rewritten as

$$f(x) \ge f(x_0) + \frac{f(x_0 + \alpha(x - x_0)) - f(x_0)}{\alpha(x - x_0)}\,(x - x_0).$$

In fact, whether x < x₀ or x > x₀, we have that

$$\lim_{\alpha \downarrow 0} \frac{f(x_0 + \alpha(x - x_0)) - f(x_0)}{\alpha(x - x_0)} = f'(x_0)$$

holds. Thus, the above inequality is true.


Next, when the convex function f is differentiable at x₀, we can show that f′(x₀) is the one and only value of z that satisfies (1.8). In particular, when x > x₀, for (1.8) to be satisfied, we need (f(x) − f(x₀))/(x − x₀) ≥ z. Similarly, when x < x₀, for (1.8) to be satisfied, we need (f(x) − f(x₀))/(x − x₀) ≤ z. Thus, z needs to be larger than or equal to the derivative on the left and, at the same time, less than or equal to the derivative on the right at x₀. Since f is differentiable at x₀, those two derivatives are equal; this completes the proof.

² In this case, we would not write it as a set {f′(x₀)} but as the element f′(x₀).
The main interest of this book is specifically the case where f(x) = |x| and x₀ = 0. Hence, by (1.8), its subderivative is the set of z such that |x| ≥ zx for any x ∈ R. These values of z form the interval [−1, 1]; that is,

for any arbitrary x, |x| ≥ zx ⟺ |z| ≤ 1

is true. Let us confirm this. If |x| ≥ zx holds for any x, then for x > 0 and x < 0, we need z ≤ 1 and z ≥ −1, respectively. Conversely, if −1 ≤ z ≤ 1, then zx ≤ |zx| ≤ |x| is true for any arbitrary x ∈ R.
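As a quick numerical illustration (a minimal sketch, not part of the original text), the following R code draws f(x) = |x| together with the lines zx for several values of z in [−1, 1]; every such line stays below |x|, which is exactly condition (1.8) at x₀ = 0.

## every line z * x with |z| <= 1 lies below f(x) = |x|,
## so each such z satisfies (1.8) at x0 = 0
curve(abs(x), -2, 2, lwd = 2, ylab = "f(x)")
for (z in seq(-1, 1, 0.25)) abline(0, z, col = "blue", lty = 3)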

Example 1 By dividing into the three cases x < 0, x = 0, and x > 0, find the values of x that attain the minima of x² − 3x + |x| and x² + x + 2|x|. Note that for x ≠ 0, we can find their usual derivatives, but for f(x) = |x| at x = 0, the subderivative is the interval [−1, 1].

$$x^2 - 3x + |x| = \begin{cases} x^2 - 3x + x, & x \ge 0\\ x^2 - 3x - x, & x < 0 \end{cases} = \begin{cases} x^2 - 2x, & x \ge 0\\ x^2 - 4x, & x < 0 \end{cases}$$

$$(x^2 - 3x + |x|)' = \begin{cases} 2x - 2, & x > 0\\ 2x - 3 + [-1, 1] = -3 + [-1, 1] = [-4, -2] \not\ni 0, & x = 0\\ 2x - 4 < 0, & x < 0. \end{cases}$$

Therefore, x² − 3x + |x| has a minimum at x = 1 (Fig. 1.2, left).

$$x^2 + x + 2|x| = \begin{cases} x^2 + x + 2x, & x \ge 0\\ x^2 + x - 2x, & x < 0 \end{cases} = \begin{cases} x^2 + 3x, & x \ge 0\\ x^2 - x, & x < 0 \end{cases}$$

$$(x^2 + x + 2|x|)' = \begin{cases} 2x + 3 > 0, & x > 0\\ 2x + 1 + 2[-1, 1] = 1 + 2[-1, 1] = [-1, 3] \ni 0, & x = 0\\ 2x - 1 < 0, & x < 0. \end{cases}$$

Therefore, x² + x + 2|x| has a minimum at x = 0 (Fig. 1.2, right). We use the following code to draw the figures.

curve(x ^ 2 - 3 * x + abs(x), -2, 2, main = "y = x^2 - 3x + |x|")
points(1, -1, col = "red", pch = 16)
curve(x ^ 2 + x + 2 * abs(x), -2, 2, main = "y = x^2 + x + 2|x|")
points(0, 0, col = "red", pch = 16)

The subderivative of f(x) = |x| at x = 0 is the interval [−1, 1]. This fact is central to this chapter.
Fig. 1.2 x² − 3x + |x| (left) has a minimum at x = 1, and x² + x + 2|x| (right) has a minimum at x = 0. Neither is differentiable at x = 0. Despite not being differentiable there, x = 0 is the minimum point in the figure on the right

1.3 Lasso

As stated in Sect. 1.1, the method that considers the minimization of

$$L := \frac{1}{2N}\|y - X\beta\|^2 + \lambda\|\beta\|_1 \qquad (1.5)$$

is called Lasso [28].
From the formulations of (1.5) and (1.6), we can tell that Lasso and Ridge are the same in the sense that they both try to control the size of the regression coefficients β. However, Lasso also has the property of setting small coefficients exactly to zero while leaving the significant ones nonzero, which is particularly beneficial in variable selection. Let us consider its mechanism.
Note that in (1.5), the division of the first term by 2 is not essential: we would obtain an equivalent formulation if we doubled the value of λ. For the sake of simplicity, first let us assume that

$$\frac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,k} = \begin{cases} 1, & j = k\\ 0, & j \ne k \end{cases} \qquad (1.9)$$

holds, and let $s_j = \frac{1}{N}\sum_{i=1}^N x_{i,j}y_i$. With this assumption, the calculations are made much simpler.

Fig. 1.3 Shape of the function Sλ(x) when λ = 5

Solving for the subderivative of L with respect to β_j gives

$$0 \in -\frac{1}{N}\sum_{i=1}^N x_{i,j}\Bigl(y_i - \sum_{k=1}^p x_{i,k}\beta_k\Bigr) + \lambda\begin{cases} 1, & \beta_j > 0\\ [-1,1], & \beta_j = 0\\ -1, & \beta_j < 0, \end{cases} \qquad (1.10)$$

which means that

$$0 \in \begin{cases} -s_j + \beta_j + \lambda, & \beta_j > 0\\ -s_j + \beta_j + \lambda[-1,1], & \beta_j = 0\\ -s_j + \beta_j - \lambda, & \beta_j < 0. \end{cases}$$

Thus, we have that

$$\beta_j = \begin{cases} s_j - \lambda, & s_j > \lambda\\ 0, & -\lambda \le s_j \le \lambda\\ s_j + \lambda, & s_j < -\lambda. \end{cases}$$

Here, the RHS can be rewritten using the following function:

$$S_\lambda(x) = \begin{cases} x - \lambda, & x > \lambda\\ 0, & -\lambda \le x \le \lambda\\ x + \lambda, & x < -\lambda, \end{cases} \qquad (1.11)$$

and hence becomes β_j = S_λ(s_j). We plot the graph of S_λ(x) when λ = 5 in Fig. 1.3 using the code provided below:
soft.th = function(lambda, x) return(sign(x) * pmax(abs(x) - lambda, 0))
curve(soft.th(5, x), -10, 10, main = "soft.th(lambda, x)")
segments(-5, -4, -5, 4, lty = 5, col = "blue"); segments(5, -4, 5, 4, lty = 5, col = "blue")
text(-0.2, 1, "lambda = 5", cex = 1.5, col = "red")

Next, let us consider the case where (1.9) is not satisfied. We rewrite (1.10) as

$$0 \in -\frac{1}{N}\sum_{i=1}^N x_{i,j}(r_{i,j} - x_{i,j}\beta_j) + \lambda\begin{cases} 1, & \beta_j > 0\\ [-1,1], & \beta_j = 0\\ -1, & \beta_j < 0. \end{cases}$$

Here, we denote $y_i - \sum_{k \ne j} x_{i,k}\beta_k$ by $r_{i,j}$, and let $s_j$ be $\frac{1}{N}\sum_{i=1}^N r_{i,j}x_{i,j}$. Next, we fix β_k (k ≠ j) and update β_j. We do this repeatedly from j = 1 to j = p until convergence (coordinate descent). For example, we can implement the algorithm as follows:
linear.lasso = function(X, y, lambda = 0, beta = rep(0, ncol(X))) {
  n = nrow(X); p = ncol(X)
  res = centralize(X, y)          ## centering (please refer to the code below)
  X = res$X; y = res$y
  eps = 1; beta.old = beta
  while (eps > 0.001) {           ## iterate this loop until convergence
    for (j in 1:p) {
      r = y - as.matrix(X[, -j]) %*% beta[-j]
      beta[j] = soft.th(lambda, sum(r * X[, j]) / n) / (sum(X[, j] * X[, j]) / n)
    }
    eps = max(abs(beta - beta.old)); beta.old = beta
  }
  beta = beta / res$X.sd          ## restore each coefficient to the scale before standardization
  beta.0 = res$y.bar - sum(res$X.bar * beta)
  return(list(beta = beta, beta.0 = beta.0))
}

Note that here, after we obtain the value of β, we use x̄ j ( j = 1, . . . , p) and ȳ


to calculate the value of β0 . The following centralize function performs data
centering and returns a list of five results.
centralize = function(X, y, standardize = TRUE) {
  X = as.matrix(X)
  n = nrow(X); p = ncol(X)
  X.bar = array(dim = p)                 ## mean of each column of X
  X.sd = array(dim = p)                  ## standard deviation of each column of X
  for (j in 1:p) {
    X.bar[j] = mean(X[, j])
    X[, j] = X[, j] - X.bar[j]           ## centering each column of X
    X.sd[j] = sqrt(var(X[, j]))
    if (standardize == TRUE) X[, j] = X[, j] / X.sd[j]  ## standardize each column of X
  }
  if (is.matrix(y)) {                    ## the case when y is a matrix
    K = ncol(y)
    y.bar = array(dim = K)               ## mean of each column of y
    for (k in 1:K) {
      y.bar[k] = mean(y[, k])
      y[, k] = y[, k] - y.bar[k]         ## centering y
    }
  } else {                               ## the case when y is a vector
    y.bar = mean(y)
    y = y - y.bar
  }
  return(list(X = X, y = y, X.bar = X.bar, X.sd = X.sd, y.bar = y.bar))
}

Thus, we may standardize the data first, then perform Lasso, and finally restore the scale. The aim of doing this is to treat all the variables on an equal footing: because the algorithm sets to 0 every β̂_j whose corresponding quantity is less than or equal to λ, we do not want this decision to depend on the scale of the individual columns j = 1, ..., p. Each j-th column of X is divided by scale[j], and consequently the estimated β_j becomes larger to that extent; we therefore divide β_j by scale[j] afterwards as well.

Example 2 Putting the U.S. crime data from https://web.stanford.edu/~hastie/StatLearnSparsity/data.html into the text file crime.txt, we set the crime rate per 1 million residents as the target variable and then select appropriate explanatory variables from the list below by performing Lasso.

Column  Cov./Res.   Definition of Variable
1       response    crime rate per 1 million residents
2                   (we currently do not use this)
3       covariate   annual police funding
4       covariate   % of people 25 years+ with 4 yrs. of high school education
5       covariate   % of 16–19-year-old persons not in high school and not high school graduates
6       covariate   % of 18–24-year-old persons in college
7       covariate   % of people 25 years+ with at least 4 years of college education
We call the function linear.lasso and execute as described below:

df = read.table("crime.txt")
x = df[, 3:7]; y = df[, 1]; p = ncol(x); lambda.seq = seq(0, 200, 0.1)
plot(lambda.seq, xlim = c(0, 200), ylim = c(-10, 20), xlab = "lambda", ylab = "beta",
     main = "each coefficient's value for each lambda", type = "n", col = "red")
for (j in 1:p) {
  coef.seq = NULL
  for (lambda in lambda.seq) coef.seq = c(coef.seq, linear.lasso(x, y, lambda)$beta[j])
  par(new = TRUE); lines(lambda.seq, coef.seq, col = j)
}
legend("topright",
       legend = c("annual police funding", "% of people 25 years+ with 4 yrs. of high school",
                  "% of 16-19 year-olds not in high school and not high school graduates",
                  "% of people 25 years+ with at least 4 years of college"),
       col = 1:p, lwd = 2, cex = .8)

Fig. 1.4 Result of Example 2. In the case of Lasso, we see that as λ increases, the coefficients decrease. At a certain λ, all the coefficients become 0. The λ at which each coefficient becomes 0 varies

As we can see in Fig. 1.4, as λ increases, the absolute value of each coefficient
decreases. When λ reaches a certain value, all coefficients go to 0. In other words,
for each value of λ, the set of nonzero coefficients differs. The larger λ becomes, the
smaller the set of selected variables.

When performing coordinate descent, it is quite common to begin with a λ large enough that every coefficient is zero and then decrease λ gradually. This method is called a warm start; it exploits the fact that when we want to calculate the coefficients β for all values of λ, we can improve the computational performance by setting the initial value of β to the β estimated at the previous λ. For example, we can write the program as follows:
warm.start = function(X, y, lambda.max = 100) {
  dec = round(lambda.max / 50); lambda.seq = seq(lambda.max, 1, -dec)
  r = length(lambda.seq); p = ncol(X); coef.seq = matrix(nrow = r, ncol = p)
  coef.seq[1, ] = linear.lasso(X, y, lambda.seq[1])$beta
  for (k in 2:r) coef.seq[k, ] = linear.lasso(X, y, lambda.seq[k], coef.seq[(k - 1), ])$beta
  return(coef.seq)
}

Example 3 We use the warm start method to reproduce the coefficients β for each λ in Example 2.

crime = read.table("crime.txt"); X = crime[, 3:7]; y = crime[, 1]
coef.seq = warm.start(X, y, 200)
p = ncol(X); lambda.max = 200; dec = round(lambda.max / 50)
lambda.seq = seq(lambda.max, 1, -dec)
plot(log(lambda.seq), coef.seq[, 1], xlab = "log(lambda)", ylab = "coefficients",
     ylim = c(min(coef.seq), max(coef.seq)), type = "n")
for (j in 1:p) lines(log(lambda.seq), coef.seq[, j], col = j)

First, we make the value of λ large enough so that all β_j (j = 1, ..., p) are 0 and then gradually decrease λ while performing coordinate descent. Here, for simplicity, we assume that for each j = 1, ..., p we have $\sum_{i=1}^N x_{i,j}^2 = 1$; moreover, the values of $\sum_{i=1}^N x_{i,j}y_i$ are all different. In this case, the smallest λ that makes β_j = 0 (j = 1, ..., p) can be calculated by

$$\lambda = \max_{1 \le j \le p}\left|\frac{1}{N}\sum_{i=1}^N x_{i,j}y_i\right|.$$

In particular, for any λ larger than this value, β_j = 0 is satisfied for all j = 1, ..., p. Then, we have that r_{i,j} = y_i (i = 1, ..., N) and

$$-\lambda \le -\frac{1}{N}\sum_{i=1}^N x_{i,j}(r_{i,j} - x_{i,j}\beta_j) \le \lambda$$

hold. Thus, when we decrease λ, one value of j will first satisfy $\bigl|\frac{1}{N}\sum_{i=1}^N x_{i,j}(r_{i,j} - x_{i,j}\beta_j)\bigr| = \lambda$. Again, since β_k = 0 (k ≠ j), if we continue to make λ smaller, we still have r_{i,j} = y_i (i = 1, ..., N), and thus the value of $\bigl|\frac{1}{N}\sum_{i=1}^N x_{i,j}y_i\bigr|$ for that j becomes larger than λ, so that β_j turns nonzero.
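As a minimal sketch (not from the original text; it assumes the crime.txt file of Example 2 and the centralize and linear.lasso functions defined above), we can compute this smallest λ and check that the Lasso coefficients are indeed all zero just above it:

df = read.table("crime.txt"); X = df[, 3:7]; y = df[, 1]
res = centralize(X, y)                                    ## standardized columns, centered y
lambda.max = max(abs(t(res$X) %*% res$y)) / nrow(res$X)   ## max_j |(1/N) sum_i x_{i,j} y_i|
lambda.max
linear.lasso(X, y, lambda.max * 1.01)$beta                ## slightly above the threshold: all zero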
As a standard R package for Lasso, glmnet is often used [11].

Example 4 (Boston) Using the Boston dataset from the MASS package and setting
the variable of the 14th column as the target variable and the other 13 variables as
predictors, we plot a graph similar to that of the previous one (Fig. 1.5).

Fig. 1.5 The execution result of Example 4. The numbers at the top represent the number of estimated coefficients that are not zero

Column Variable Definition of Variable


1 CRIM per capita crime rate by town
2 ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3 INDUS proportion of nonretail business acres per town
4 CHAS Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5 NOX nitric oxide concentration (parts per 10 million)
6 RM average number of rooms per dwelling
7 AGE proportion of owner-occupied units built prior to 1940
8 DIS weighted distances to five Boston employment centers
9 RAD index of accessibility to radial highways
10 TAX full-value property-tax rate per $10,000
11 PTRATIO pupil-teacher ratio by town
12 BLACK proportion of blacks by town
13 LSTAT % lower status of the population
14 MEDV median value of owner-occupied homes in multiples of $1,000

library(glmnet); library(MASS)
df = Boston; x = as.matrix(df[, 1:13]); y = df[, 14]
fit = glmnet(x, y); plot(fit, xvar = "lambda", main = "BOSTON")

So far, we have seen how Lasso can be useful in the process of variable selection. However, we have not explained why we seek to minimize (1.5) over β ∈ R^p. Why do we not instead consider minimizing the usual information criterion

$$\frac{1}{2N}\|y - X\beta\|^2 + \lambda\|\beta\|_0 \qquad (1.12)$$

for β? Here, ‖·‖₀ represents the number of nonzero elements of the vector.


Lasso, as well as Ridge, which is to be discussed in the next section, has the advantage that the minimization is convex. For a convex function, a local minimum is also a global minimum, so the optimal solution can be found efficiently. On the other hand, the minimization of (1.12) requires time exponential in the number of variables p. In particular, since the function in (1.7) is not convex, ‖β‖₀ cannot be convex either. Because of this, we need to consider instead the minimization of (1.5). An optimization problem becomes meaningful only after there exists an effective search algorithm.
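To make the contrast concrete, the following sketch (not from the original text; the value of λ is arbitrary, and crime.txt is the file of Example 2) minimizes (1.12) by brute force, fitting least squares on every one of the 2^p subsets of the p = 5 predictors; this is feasible only because p is tiny here.

df = read.table("crime.txt"); X = as.matrix(df[, 3:7]); y = df[, 1]; p = ncol(X)
lambda = 1000; best = Inf
for (m in 0:(2^p - 1)) {
  S = which(bitwAnd(m, 2^(0:(p - 1))) > 0)                 ## subset of variables encoded by m
  r = if (length(S) == 0) y - mean(y) else lm(y ~ X[, S])$residuals
  score = sum(r^2) / (2 * nrow(X)) + lambda * length(S)    ## the criterion (1.12)
  if (score < best) {best = score; best.S = S}
}
best.S                                                     ## the selected variables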

1.4 Ridge

In Sect. 1.1, for X ∈ R^{N×p} and y ∈ R^N, we assumed that the matrix X^T X is invertible and, based on this, showed that the β that minimizes the squared error ‖y − Xβ‖² is given by β̂ = (X^TX)^{-1}X^Ty.
First, when N ≥ p, the possibility that X^T X is singular is not that high, though we may have another problem instead: the confidence intervals become wide when the determinant is close to zero. To cope with this, we let λ ≥ 0 be a constant and add λ times the squared norm of β to the squared error. That is, the method that considers the minimization of

$$L := \frac{1}{N}\|y - X\beta\|^2 + \lambda\|\beta\|_2^2 \qquad (1.6)$$

is commonly used. This method is called Ridge. Differentiating L with respect to β gives

$$0 = -\frac{2}{N}X^T(y - X\beta) + 2\lambda\beta.$$

If X^T X + NλI is nonsingular, we obtain

$$\hat{\beta} = (X^TX + N\lambda I)^{-1}X^Ty.$$

Here, whenever λ > 0, even in the case N < p, X^T X + NλI is nonsingular. In particular, since the matrix X^T X is positive semidefinite, its eigenvalues μ₁, ..., μ_p are all nonnegative. Therefore, the eigenvalues of X^T X + NλI can be calculated from

$$\det(X^TX + N\lambda I - tI) = 0 \implies t = \mu_1 + N\lambda, \dots, \mu_p + N\lambda > 0,$$

and thus all of them are positive.
Again, when all the eigenvalues are positive, their product, det(X^T X + NλI), is also positive, which is the same as saying that X^T X + NλI is nonsingular. Note that this always holds regardless of the sizes of p and N. When N < p, the rank of X^T X ∈ R^{p×p} is at most N, and hence X^T X itself is singular. Therefore, in this case, the following conditions are equivalent:

$$\lambda > 0 \iff X^TX + N\lambda I \text{ is nonsingular.}$$
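A quick numerical check of this equivalence (a sketch with arbitrary sizes, not part of the original text):

n = 10; p = 20; lambda = 0.5
X = matrix(rnorm(n * p), n, p)                         ## N < p, so X^T X is singular
min(eigen(t(X) %*% X)$values)                          ## (numerically) zero
min(eigen(t(X) %*% X + n * lambda * diag(p))$values)   ## at least n * lambda > 0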

As an example of the Ridge case, we can write the following program:

ridge = function(X, y, lambda = 0) {
  X = as.matrix(X); p = ncol(X); n = length(y)
  res = centralize(X, y); X = res$X; y = res$y
  ## the Ridge computation itself is only the next line
  beta = drop(solve(t(X) %*% X + n * lambda * diag(p)) %*% t(X) %*% y)
  beta = beta / res$X.sd     ## restore the coefficients to the scale before standardization
  beta.0 = res$y.bar - sum(res$X.bar * beta)
  return(list(beta = beta, beta.0 = beta.0))
}

Example 5 We use the same U.S. crime data as that of Example 2 and perform the following analysis. To control the size of the coefficient of each predictor, we call the function ridge and then execute.

df = read.table("crime.txt")
x = df[, 3:7]; y = df[, 1]; p = ncol(x); lambda.seq = seq(0, 100, 0.1)
plot(lambda.seq, xlim = c(0, 100), ylim = c(-10, 20), xlab = "lambda", ylab = "beta",
     main = "each coefficient's value for each lambda", type = "n", col = "red")
for (j in 1:p) {
  coef.seq = NULL
  for (lambda in lambda.seq) coef.seq = c(coef.seq, ridge(x, y, lambda)$beta[j])
  par(new = TRUE); lines(lambda.seq, coef.seq, col = j)
}
legend("topright",
       legend = c("annual police funding", "% of people 25 years+ with 4 yrs. of high school",
                  "% of 16-19 year-olds not in high school and not high school graduates",
                  "% of people 25 years+ with at least 4 years of college"),
       col = 1:p, lwd = 2, cex = .8)

In Fig. 1.6, we plot how each coefficient changes with the value of λ.

Fig. 1.6 The execution result of Example 5: the changes in the coefficients β with respect to λ based on Ridge. As λ becomes larger, each coefficient decreases toward 0

crime = read.table("crime.txt")
X = crime[, 3:7]
y = crime[, 1]
linear(X, y)

$beta
[1] 10.9806703 -6.0885294  5.4803042  0.3770443  5.5004712
$beta.0
[1] 489.6486

ridge(X, y)

$beta
[1] 10.9806703 -6.0885294  5.4803042  0.3770443  5.5004712
$beta.0
[1] 489.6486

ridge(X, y, 200)

$beta
[1]  0.056351799 -0.019763974  0.077863094 -0.017121797 -0.007039304
$beta.0
[1] 716.4044

1.5 A Comparison Between Lasso and Ridge

Next, let us compare Fig. 1.4 of Lasso to Fig. 1.6 of Ridge. We can see that they
are the same in the sense that when λ becomes larger, the absolute value of each
coefficient approaches 0. However, in the case of Lasso, when λ reaches a certain
value, one of the coefficients becomes exactly zero, and the time at which that occurs
varies for each variable. Thus, Lasso can be used for variable selection.
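As a small sketch confirming this on the crime data of Example 2 (not from the original text; the value λ = 100 is arbitrary, and the linear.lasso and ridge functions defined earlier are assumed):

df = read.table("crime.txt"); X = df[, 3:7]; y = df[, 1]
linear.lasso(X, y, lambda = 100)$beta   ## some coefficients are exactly 0
ridge(X, y, lambda = 100)$beta          ## all coefficients shrink but stay nonzero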
So far, we have shown this fact analytically, but it is also good to have an intuition
geometrically. Figures similar to those in Fig. 1.7 are widely used when one wants
to compare Lasso and Ridge.
Let p = 2, so that X ∈ R^{N×2} is composed of two columns x_{i,1}, x_{i,2} (i = 1, ..., N). In the least squares process, we solve for the values of β₁, β₂ that minimize $S := \sum_{i=1}^N (y_i - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2$. For now, let us denote them by β̂₁, β̂₂, respectively. Here, if we let $\hat{y}_i = x_{i,1}\hat{\beta}_1 + x_{i,2}\hat{\beta}_2$, we have

Fig. 1.7 Each ellipse is centered at (β̂₁, β̂₂) and represents a contour line connecting all the points that give the same value of (1.13). The rhombus in the left figure is the L1 regularization constraint |β₁| + |β₂| ≤ C′, while the circle in the right figure is the L2 regularization constraint β₁² + β₂² ≤ C


$$\sum_{i=1}^N x_{i,1}(y_i - \hat{y}_i) = \sum_{i=1}^N x_{i,2}(y_i - \hat{y}_i) = 0.$$

However, since for any β₁, β₂,

$$y_i - \beta_1 x_{i,1} - \beta_2 x_{i,2} = y_i - \hat{y}_i - (\beta_1 - \hat{\beta}_1)x_{i,1} - (\beta_2 - \hat{\beta}_2)x_{i,2}$$

holds, we can rewrite the quantity to be minimized, $\sum_{i=1}^N (y_i - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2$, as

$$(\beta_1 - \hat{\beta}_1)^2\sum_{i=1}^N x_{i,1}^2 + 2(\beta_1 - \hat{\beta}_1)(\beta_2 - \hat{\beta}_2)\sum_{i=1}^N x_{i,1}x_{i,2} + (\beta_2 - \hat{\beta}_2)^2\sum_{i=1}^N x_{i,2}^2 + \sum_{i=1}^N (y_i - \hat{y}_i)^2 \qquad (1.13)$$

and, of course, if we let (β₁, β₂) = (β̂₁, β̂₂) here, we obtain the minimum (= RSS).
Thus, we can view the problems of Lasso and Ridge in the following way: the minimization of (1.5) and (1.6) is equivalent to finding the values of (β₁, β₂) that minimize the quantity (1.13) subject to the constraints |β₁| + |β₂| ≤ C′ and β₁² + β₂² ≤ C, respectively (here, the case where λ is large corresponds to the case where C, C′ are small).
The case of Lasso is the same as in the left panel of Fig. 1.7. The ellipses are
centered at (β̂1 , β̂2 ) and represent contours on which the values of (1.13) are the
same. We expand the size of the ellipse (the contour), and once we make contact
with the rhombus, the corresponding values of (β1 , β2 ) are the solution to Lasso.
If the rhombus is small (λ is large), it is more likely to touch only one of the four
rhombus vertices. In this case, one of the β1 , β2 values will become 0. However, in
the Ridge case, as in the right panel of Fig. 1.7, a circle replaces the Lasso rhombus;
hence, it is less likely that β1 = 0 or β2 = 0 will occur.
In this case, if the least squares solution (β̂1 , β̂2 ) lies in the green zone of Fig. 1.8,
then we have either β1 = 0 or β2 = 0 as our solution. Moreover, when the rhombus is
small (λ is large), even when (β̂1 , β̂2 ) remains the same, the green zone will become
larger, which is the reason why Lasso performs well in variable selection.
We should not overlook one of the advantages of Ridge: its performance when dealing with collinearity. That is, it handles well even the case where the matrix of explanatory variables contains columns that are highly related. Let us define the VIF (variance inflation factor) by

$$VIF := \frac{1}{1 - R^2_{X_j \mid X_{-j}}},$$

where $R^2_{X_j \mid X_{-j}}$ denotes the coefficient of determination when X_j is the target variable and the other variables are the predictors. The larger this value is, the better the j-th column variable is explained by the other variables.

Fig. 1.8 The green zone represents the area in which the optimal solution satisfies either β₁ = 0 or β₂ = 0 if the center of the ellipses (β̂₁, β̂₂) lies within it

Example 6 We compute the VIF for the Boston dataset. It shows that the ninth and tenth variables (RAD and TAX) have strong collinearity.

R2 = function(x, y) {
  y.hat = lm(y ~ x)$fitted.values; y.bar = mean(y)
  RSS = sum((y - y.hat) ^ 2); TSS = sum((y - y.bar) ^ 2)
  return(1 - RSS / TSS)
}
vif = function(x) {
  p = ncol(x); values = array(dim = p)
  for (j in 1:p) values[j] = 1 / (1 - R2(x[, -j], x[, j]))
  return(values)
}
library(MASS); x = as.matrix(Boston); vif(x)

 [1] 1.831537 2.352186 3.992503 1.095223 4.586920 2.260374 3.100843
 [8] 4.396007 7.808198 9.205542 1.993016 1.381463 3.581585 3.855684

In usual linear regression, if the VIF is large, the estimated coefficient β̂ will
be unstable. Particularly, if two columns are precisely the same, the coefficient is
unsolvable. Moreover, for Lasso, if two columns are highly related, then generally
one of them will be estimated as 0 and the other as nonzero. However, in the Ridge
case, for λ > 0, even when the columns j, k of X are the same, the estimation is
solvable, and both of them will obtain the same value.
In particular, we take the partial derivatives of

$$L = \frac{1}{N}\sum_{i=1}^N\Bigl(y_i - \sum_{j=1}^p x_{i,j}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^p \beta_j^2$$

with respect to β_k and β_l and set them equal to 0:

$$0 = \begin{cases} \displaystyle -\frac{1}{N}\sum_{i=1}^N x_{i,k}\Bigl(y_i - \sum_{j=1}^p x_{i,j}\beta_j\Bigr) + \lambda\beta_k\\[3mm] \displaystyle -\frac{1}{N}\sum_{i=1}^N x_{i,l}\Bigl(y_i - \sum_{j=1}^p x_{i,j}\beta_j\Bigr) + \lambda\beta_l. \end{cases}$$

Then, plugging $x_{i,k} = x_{i,l}$ into each gives

$$\beta_k = \frac{1}{\lambda N}\sum_{i=1}^N x_{i,k}\Bigl(y_i - \sum_{j=1}^p x_{i,j}\beta_j\Bigr) = \frac{1}{\lambda N}\sum_{i=1}^N x_{i,l}\Bigl(y_i - \sum_{j=1}^p x_{i,j}\beta_j\Bigr) = \beta_l.$$
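The following small simulation illustrates this (a sketch, not from the original text; it assumes the ridge and linear.lasso functions defined earlier): with two identical columns, Ridge assigns them equal coefficients, whereas Lasso typically keeps one and sets the other to zero.

set.seed(1)
n = 100; z = rnorm(n); y = 2 * z + rnorm(n)
X = cbind(z, z, rnorm(n))                 ## the first two columns are identical
ridge(X, y, lambda = 0.5)$beta            ## the first two coefficients coincide
linear.lasso(X, y, lambda = 0.5)$beta     ## one of the two is (typically) zero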

Example 7 We perform Lasso for the case where the variables X₁, X₂, X₃ and X₄, X₅, X₆ have strong correlations. We generate N = 500 samples distributed in the following way:

$$z_1, z_2, \epsilon, \epsilon_1, \dots, \epsilon_6 \sim N(0, 1)$$
$$x_j := z_1 + \epsilon_j/5, \quad j = 1, 2, 3$$
$$x_j := z_2 + \epsilon_j/5, \quad j = 4, 5, 6$$
$$y := 3z_1 - 1.5z_2 + 2\epsilon.$$

Then, we apply linear regression analysis with Lasso to X ∈ R^{N×p}, y ∈ R^N. We plot how the coefficients change relative to the value of λ in Fig. 1.9. Naturally, one might expect that similar coefficient values should be given to each of the related variables, though it turns out that Lasso does not behave in this way.

n = 500; x = array(dim = c(n, 6)); z = array(dim = c(n, 2))
for (i in 1:2) z[, i] = rnorm(n)
y = 3 * z[, 1] - 1.5 * z[, 2] + 2 * rnorm(n)
for (j in 1:3) x[, j] = z[, 1] + rnorm(n) / 5
for (j in 4:6) x[, j] = z[, 2] + rnorm(n) / 5
glm.fit = glmnet(x, y); plot(glm.fit)
legend("topleft", legend = c("X1", "X2", "X3", "X4", "X5", "X6"), col = 1:6, lwd = 2, cex = .8)

1.6 Elastic Net

Up until now, we have discussed the pros and cons of Lasso and Ridge. This section
studies a method intended to combine the advantages of the two, i.e., the elastic net.
Specifically, the method considers the problem of finding β that minimizes

Fig. 1.9 The execution result of Example 7. The Lasso case is different from that of Ridge: related variables are not given similar estimated coefficients. When we use the glmnet package, its default horizontal axis is the L1-norm [11], i.e., the value Σⱼ|βⱼ|, which becomes smaller as λ increases. Therefore, the figure is left-right reversed compared to one in which λ or log λ is the horizontal axis


$$L := \frac{1}{2N}\|y - X\beta\|^2 + \lambda\Bigl\{\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\Bigr\}. \qquad (1.14)$$

The β_j that minimizes (1.14) in the case of Lasso (α = 1) is

$$\hat{\beta}_j = \frac{S_\lambda\bigl(\frac{1}{N}\sum_{i=1}^N r_{i,j}x_{i,j}\bigr)}{\frac{1}{N}\sum_{i=1}^N x_{i,j}^2},$$

while in the case of Ridge (α = 0) it is

$$\hat{\beta}_j = \frac{\frac{1}{N}\sum_{i=1}^N r_{i,j}x_{i,j}}{\frac{1}{N}\sum_{i=1}^N x_{i,j}^2 + \lambda}.$$

In the general case, it is

$$\hat{\beta}_j = \frac{S_{\lambda\alpha}\bigl(\frac{1}{N}\sum_{i=1}^N r_{i,j}x_{i,j}\bigr)}{\frac{1}{N}\sum_{i=1}^N x_{i,j}^2 + \lambda(1-\alpha)}. \qquad (1.15)$$

This is called the elastic net. For each j = 1, ..., p, if we find the subderivative of (1.14) with respect to β_j, we obtain

$$0 \in -\frac{1}{N}\sum_{i=1}^N x_{i,j}\Bigl(y_i - \sum_{k=1}^p x_{i,k}\beta_k\Bigr) + \lambda(1-\alpha)\beta_j + \lambda\alpha\begin{cases} 1, & \beta_j > 0\\ [-1,1], & \beta_j = 0\\ -1, & \beta_j < 0 \end{cases}$$

$$\iff 0 \in -\frac{1}{N}\sum_{i=1}^N x_{i,j}r_{i,j} + \Bigl\{\frac{1}{N}\sum_{i=1}^N x_{i,j}^2 + \lambda(1-\alpha)\Bigr\}\beta_j + \lambda\alpha\begin{cases} 1, & \beta_j > 0\\ [-1,1], & \beta_j = 0\\ -1, & \beta_j < 0 \end{cases}$$

$$\iff \Bigl\{\frac{1}{N}\sum_{i=1}^N x_{i,j}^2 + \lambda(1-\alpha)\Bigr\}\beta_j = S_{\lambda\alpha}\Bigl(\frac{1}{N}\sum_{i=1}^N x_{i,j}r_{i,j}\Bigr).$$

Here, let $s_j := \frac{1}{N}\sum_{i=1}^N x_{i,j}r_{i,j}$, $t_j := \frac{1}{N}\sum_{i=1}^N x_{i,j}^2 + \lambda(1-\alpha)$, and $\mu := \lambda\alpha$. We have used the fact that

$$0 \in -s_j + t_j\beta_j + \mu\begin{cases} 1, & \beta_j > 0\\ [-1,1], & \beta_j = 0\\ -1, & \beta_j < 0 \end{cases} \iff t_j\beta_j = S_\mu(s_j),$$

where $S_\lambda : \mathbb{R}^p \to \mathbb{R}^p$ denotes the function that applies $S_\lambda$ to each element.


Then, we can write the program for the elastic net based on (1.15), as shown in the following. Here, we added a parameter α on the line marked with #, and the only essential change lies in the three lines marked with ##.
linear.lasso = function(X, y, lambda = 0, beta = rep(0, ncol(X)), alpha = 1) {          #
  X = as.matrix(X); n = nrow(X); p = ncol(X); X.bar = array(dim = p)
  for (j in 1:p) {X.bar[j] = mean(X[, j]); X[, j] = X[, j] - X.bar[j]}
  y.bar = mean(y); y = y - y.bar
  scale = array(dim = p)
  for (j in 1:p) {scale[j] = sqrt(sum(X[, j] ^ 2) / n); X[, j] = X[, j] / scale[j]}
  eps = 1; beta.old = beta
  while (eps > 0.001) {
    for (j in 1:p) {
      r = y - as.matrix(X[, -j]) %*% beta[-j]
      beta[j] = soft.th(lambda * alpha,                                                 ##
                        sum(r * X[, j]) / n) / (sum(X[, j] * X[, j]) / n +              ##
                        lambda * (1 - alpha))                                           ##
    }
    eps = max(abs(beta - beta.old)); beta.old = beta
  }
  for (j in 1:p) beta[j] = beta[j] / scale[j]
  beta.0 = y.bar - sum(X.bar * beta)
  return(list(beta = beta, beta.0 = beta.0))
}

Fig. 1.10 The execution result of Example 8. The closer α is to 0 (the closer the model is to Ridge), the better it handles collinearity, in contrast to the case of α = 1 (Lasso), where the coefficients of related variables are not estimated equally

Example 8 If we add the additional parameter alpha = 0, 0.25, 0.5, 0.75 to glm.fit = glmnet(x, y) of Example 7, we obtain the graphs of Fig. 1.10. As α approaches 0, we can observe how the coefficients of related variables become closer to one another. This outcome reveals how Ridge responds to collinearity.

1.7 About How to Set the Value of λ

To perform Lasso, the CRAN package glmnet is often used. Until now, we have
tried to explain the theory behind calculations from scratch; however, it is also a good
idea to use the precompiled function from said package.
To set an appropriate value for λ, the method of cross-validation (CV) is often
used.3
For example, the tenfold CV for each λ divides the data into ten groups, with
nine of them used to estimate β and one used as the test data, and then evaluates the
model. Switching the test group, we can perform this evaluation ten times in total
and then calculate the mean of these evaluation values. Then, we choose the λ that
has the highest mean evaluation value. If we plug the sample data of the target and
the explanatory variables into the function cv.glmnet, it evaluates each value of
λ and returns the λ of the highest evaluation value as an output.
Example 9 We apply the function cv.glmnet to the U.S. crime dataset from
Examples 2 and 5, obtain the optimal λ, use it for the usual Lasso, and then obtain
the coefficients of β. For each λ, the function also provides the value of the least
squares of the test data and the confidence interval (Fig. 1.11). Each number above
the figure represents the number of nonzero coefficients for that λ.
library(glmnet)
df = read.table("crime.txt"); X = as.matrix(df[, 3:7]); y = as.vector(df[, 1])
cv.fit = cv.glmnet(X, y); plot(cv.fit)
lambda.min = cv.fit$lambda.min; lambda.min

[1] 20.03869

fit = glmnet(X, y, lambda = lambda.min); fit$beta

5 x 1 sparse Matrix of class "dgCMatrix"
          s0
V3  9.656911
V4 -2.527286
V5  3.229431
V6  .
V7  .

For the elastic net, we have to perform double loop cross-validation for both
(α, λ). The function cv.glmnet provides us the output cvm, which contains the
evaluation values from the cross-validation.

3Please refer to Chap. 3 of “100 Problems in Mathematics of Statistical Machine Learning with R”
of the same series.

Fig. 1.11 The execution result of Example 9. Using cv.glmnet, we obtain the evaluation value (the sum of the squared residuals on the test data) for each λ, represented by the red dots in the figure. The bars extending above and below are confidence intervals for the actual values. The optimal value here is approximately log λmin = 3 (λmin ≈ 20). Each number 5, ..., 5, 4, 4, 3, ..., 3, 2, 1, 1, 1 at the top represents the number of nonzero coefficients

Example 10 We generate random numbers as our data and conduct the double loop cross-validation over (α, λ).

n = 500; x = array(dim = c(n, 6)); z = array(dim = c(n, 2))
for (i in 1:2) z[, i] = rnorm(n)
y = 3 * z[, 1] - 1.5 * z[, 2] + 2 * rnorm(n)
for (j in 1:3) x[, j] = z[, 1] + rnorm(n) / 5
for (j in 4:6) x[, j] = z[, 2] + rnorm(n) / 5
best.score = Inf
for (alpha in seq(0, 1, 0.01)) {
  res = cv.glmnet(x, y, alpha = alpha)
  lambda = res$lambda.min; min.cvm = min(res$cvm)
  if (min.cvm < best.score) {alpha.min = alpha; lambda.min = lambda; best.score = min.cvm}
}
alpha.min

[1] 0.47

lambda.min

[1] 0.05042894

glmnet(x, y, alpha = alpha.min, lambda = lambda.min)$beta



Fig. 1.12 Estimates of the coefficients of the six variables from Example 7. Here, we use the cross-validated optimal α as the parameter

6 x 1 sparse Matrix of class "dgCMatrix"
           s0
V1  1.0562856
V2  0.9382231
V3  0.9258483
V4 -0.3443867
V5  .
V6  .

glm.fit = glmnet(x, y, alpha = alpha.min)
plot(glm.fit)

We plot the graph of the coefficient values for when α is optimal (in the cross-
validation sense) in Fig. 1.12.

Exercises 1–20

In the following exercises, we estimate the intercept β₀ and the coefficients β = (β₁, ..., β_p) from N sets of p explanatory variables and a target variable: (x_{1,1}, ..., x_{1,p}, y₁), ..., (x_{N,1}, ..., x_{N,p}, y_N). We subtract $\bar{x}_j = \frac{1}{N}\sum_{i=1}^N x_{i,j}$ (j = 1, ..., p) from each x_{i,j} and $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$ from each y_i so that each of them has mean 0, and then estimate β (the estimated value is denoted by β̂). Finally, we let $\hat{\beta}_0 = \bar{y} - \sum_{j=1}^p \bar{x}_j\hat{\beta}_j$. Again, we let $X = (x_{i,j})_{1\le i\le N,\,1\le j\le p} \in \mathbb{R}^{N\times p}$ and $y = (y_i)_{1\le i\le N} \in \mathbb{R}^N$.

1. Prove that the following two equalities hold:

$$\begin{bmatrix} \dfrac{\partial}{\partial\beta_1}\sum_{i=1}^N\Bigl(y_i-\sum_{k=1}^p\beta_k x_{i,k}\Bigr)^2\\ \vdots\\ \dfrac{\partial}{\partial\beta_p}\sum_{i=1}^N\Bigl(y_i-\sum_{k=1}^p\beta_k x_{i,k}\Bigr)^2 \end{bmatrix} = -2\begin{bmatrix} x_{1,1}&\cdots&x_{N,1}\\ \vdots&\ddots&\vdots\\ x_{1,p}&\cdots&x_{N,p} \end{bmatrix} \begin{bmatrix} y_1-\sum_{k=1}^p\beta_k x_{1,k}\\ \vdots\\ y_N-\sum_{k=1}^p\beta_k x_{N,k} \end{bmatrix} = -2X^T(y-X\beta). \quad (\text{cf. } (1.1))$$

Moreover, when X^T X is invertible, prove that the value of β that minimizes $\|y - X\beta\|_2^2 := \sum_{i=1}^N\bigl(y_i - \sum_{k=1}^p\beta_k x_{i,k}\bigr)^2$ is given by $\hat{\beta} = (X^TX)^{-1}X^Ty$. Here, for $z = [z_1, \dots, z_N]^T \in \mathbb{R}^N$, we define $\|z\|_2 := \sqrt{z_1^2 + \cdots + z_N^2}$. Then, write a program for a function called linear in the R language. This function accepts a matrix X ∈ R^{N×p} and a vector y ∈ R^N as inputs and calculates the estimated intercept β̂₀ ∈ R and the estimated slope β̂ ∈ R^p as outputs. Please fill in blanks (1) and (2) below. Here, the function centralize, which mean-centers X and y, is used; in the third and fourth lines below, X and y have already been centered.

linear = function(X, y) {
  n = nrow(X); p = ncol(X)
  res = centralize(X, y, standardize = FALSE)
  X = res$X; y = res$y
  beta = as.vector(## blank (1) ##)
  beta.0 = ## blank (2) ##
  return(list(beta = beta, beta.0 = beta.0))
}

2. For a (downward) convex function $f : \mathbb{R} \to \mathbb{R}$ and $x_0 \in \mathbb{R}$, consider the values $z \in \mathbb{R}$ such that
$$f(x) \ge f(x_0) + z(x - x_0) \quad (x \in \mathbb{R}). \quad \text{(cf. (1.8))}$$

The set of z ∈ R (subderivative) is denoted by ∂ f (x0 ). Show that the following


function f is convex. In addition, at x0 = 0, find ∂ f (x0 ).
(a) f (x) = x
(b) f (x) = |x|

Hint For any 0 < α < 1 and x, y ∈ R, if f (αx + (1 − α)y) ≤ α f (x) + (1 −


α) f (y) holds, then we say that f is convex. For (b), show that |x| ≥ zx (x ∈
R) ⇐⇒ −1 ≤ z ≤ 1.
3. (a) If the functions g(x), h(x) are convex, show that for any β, γ ≥ 0, the function
βg(x) + γh(x) is convex.
28 1 Linear Regression

(b) Show that the function $f(x) = \begin{cases} 1, & x \neq 0\\ 0, & x = 0\end{cases}$ is not convex.
(c) Decide whether the following functions of $\beta \in \mathbb{R}^p$ are convex or not. In addition, provide the proofs.
   i. $\frac{1}{2N}\|y - X\beta\|^2 + \lambda\|\beta\|_0$
   ii. $\frac{1}{2N}\|y - X\beta\|^2 + \lambda\|\beta\|_1$
   iii. $\frac{1}{N}\|y - X\beta\|^2 + \lambda\|\beta\|_2^2$
Here, $\|\cdot\|_0$ denotes the number of nonzero elements, $\|\cdot\|_1$ denotes the sum of the absolute values of all elements, and $\|\cdot\|_2$ denotes the square root of the sum of the square of each element. Moreover, let $\lambda \ge 0$.
4. For a convex function f : R → R, assume that it is differentiable at x = x0 .
Answer the following questions.

(a) Show that $f(x) \ge f(x_0) + f'(x_0)(x - x_0)$.

Hint From $f(\alpha x + (1-\alpha)x_0) \le \alpha f(x) + (1-\alpha)f(x_0)$, derive the following formula and take the limit $\alpha \to 0$:
$$f(x) \ge f(x_0) + \frac{f(x_0 + \alpha(x - x_0)) - f(x_0)}{\alpha(x - x_0)}(x - x_0).$$

(b) Show that the set $\partial f(x_0)$ contains only the derivative of $f$ at $x = x_0$ as its element.
Hint In the last part, you may carry out a calculation similar to the following:
$$f(x) \ge f(x_0) + z(x - x_0) \quad (x \in \mathbb{R})$$
$$\Longrightarrow \begin{cases} \dfrac{f(x) - f(x_0)}{x - x_0} \ge z, & x > x_0\\[2mm] \dfrac{f(x) - f(x_0)}{x - x_0} \le z, & x < x_0 \end{cases}$$
$$\Longrightarrow \lim_{x \to x_0 - 0}\frac{f(x) - f(x_0)}{x - x_0} \le z \le \lim_{x \to x_0 + 0}\frac{f(x) - f(x_0)}{x - x_0}.$$
However, this case still leaves two possibilities: either $\{f'(x_0)\}$ or an empty set.
5. Find the local minimum of the following functions. In addition, draw the graphs over the range $-2 \le x \le 2$ using the R language.
(a) $f(x) = x^2 - 3x + |x|$ and
(b) $f(x) = x^2 + x + 2|x|$.
Hint Divide it into three cases: $x < 0$, $x = 0$, $x > 0$.
6. The $(i,j)$ entry of $X \in \mathbb{R}^{N\times p}$ and the $i$-th element of $y \in \mathbb{R}^N$ are denoted by $x_{i,j}$ and $y_i$, respectively. Let $\lambda \ge 0$; we want to find $\beta_1, \ldots, \beta_p$ that minimize

$$L := \frac{1}{2N}\sum_{i=1}^N\Big(y_i - \sum_{j=1}^p \beta_j x_{i,j}\Big)^2 + \lambda\sum_{j=1}^p |\beta_j|.$$

Here, we assume that
$$\frac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,k} = \begin{cases} 1, & j = k\\ 0, & j \neq k\end{cases} \qquad (1.16)$$
and let $s_j = \frac{1}{N}\sum_{i=1}^N x_{i,j}y_i$. For $\lambda > 0$, holding $\beta_k$ ($k \neq j$) constant, find the subderivative of $L$ with respect to $\beta_j$. In addition, find the value of $\beta_j$ that attains the minimum.
Hint Consider each of the following cases: $\beta_j > 0$, $\beta_j = 0$, $\beta_j < 0$. For example, when $\beta_j = 0$, we have $-\beta_j + s_j - \lambda[-1,1] \ni 0$, but this is equivalent to $-\lambda \le s_j \le \lambda$.
7. (a) Let $\lambda \ge 0$; define a function $S_\lambda(x)$ as follows:
$$S_\lambda(x) := \begin{cases} x - \lambda, & x > \lambda\\ 0, & |x| \le \lambda\\ x + \lambda, & x < -\lambda.\end{cases}$$
For this function $S_\lambda(x)$, by using the function $(x)_+ = \max\{x, 0\}$ and
$$\mathrm{sign}(x) = \begin{cases} -1, & x < 0\\ 0, & x = 0\\ 1, & x > 0,\end{cases}$$
show that it can be re-expressed as $S_\lambda(x) = \mathrm{sign}(x)(|x| - \lambda)_+$.


(b) Write a program in the R language that performs (a). Name it soft.th.
Then, execute the following code to see whether it works correctly.
1 curve(soft.th(5, x), -10, 10)

Hint In the R language, if we define the function by dividing it into cases (if-else) or by using max, curve will not plot it because the function is not vectorized. Use sign, abs, and pmax instead.
8. For the cases where the assumption (1.16) of Exercise 6 is not satisfied, if we fix $\beta_k$ ($k \neq j$) and then update
$$\begin{cases} r_i := y_i - \displaystyle\sum_{k \neq j}\beta_k x_{i,k} & (i = 1, \ldots, n)\\[3mm] \beta_j := S_\lambda\Big(\dfrac{1}{n}\displaystyle\sum_{i=1}^n x_{i,j}r_i\Big)\Big/\Big(\dfrac{1}{n}\displaystyle\sum_{i=1}^n x_{i,j}^2\Big) & \end{cases}$$

for $j = 1, \ldots, p$ sequentially, by repeating the process, we can solve for $\beta_1, \ldots, \beta_p$. Moreover, to calculate the intercept, initially center the data; then, the intercept can be calculated by letting
$$\beta_0 := \bar y - \sum_{j=1}^p \beta_j\bar x_j.$$
Here, we denote $\bar x_j = \frac{1}{N}\sum_{i=1}^N x_{i,j}$ and $\bar y = \frac{1}{N}\sum_{i=1}^N y_i$. Complete the blanks in the following and execute the program. In addition, by letting $\lambda = 10, 50, 100$, check how the coefficients change and which of them become 0.
To execute the following codes, please access the site https://web.stanford.edu/
~hastie/StatLearnSparsity/data.html and download the U.S. crime data.4 Then,
store them in a file named crime.txt.5
1 linear.lasso = function(X, y, lambda = 0) {
2 X = as.matrix(X); n = nrow(X); p = ncol(X)
3 res = centralize(X, y); X = res$X; y = res$y
4 eps = 1; beta = rep(0, p); beta.old = rep(0, p)
5 while (eps > 0.001) {
6 for (j in 1:p) {
7 r = ## blank (1) ##
8 beta[j] = soft.th(lambda, sum(r * X[, j]) / n)
9 }
10 eps = max(abs(beta - beta.old)); beta.old = beta
11 }
12 beta = beta / res$X.sd
13 beta.0 = ## blank (2) ##
14 return(list(beta = beta, beta.0 = beta.0))
15 }
16 crime = read.table("crime.txt"); X = crime[, 3:7]; y = crime[, 1]
17 linear.lasso(X, y, 10); linear.lasso(X, y, 50); linear.lasso(X, y, 100)

9. glmnet is an R package commonly used to perform Lasso. The following


code performs the regression analysis with sparsity to the U.S. crime data using
glmnet.

1 library(glmnet)
2 df = read.table("crime.txt"); x = as.matrix(df[, 3:7]); y = df[, 1]
3 fit = glmnet(x, y); plot(fit, xvar = "lambda")

4 Open the window, press Ctrl+A to select all, press Ctrl+C to copy, make a new file of your text
editor, and then press Ctrl+V to paste the data.
5 To execute the R program, you may need to set the working directory or provide the path to where

the dataset file is located.



Following the above, conduct a similar analysis of the Boston dataset⁶, setting the 14th column as the target variable and the other 13 columns as explanatory variables, and then draw a graph in the same manner.

⁶ Please install the MASS package.

Column Variable Definition of Variable


1 CRIM per capita crime rate by town
2 ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3 INDUS proportion of nonretail business acres per town
4 CHAS Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5 NOX nitric oxide concentration (parts per 10 million)
6 RM average number of rooms per dwelling
7 AGE proportion of owner-occupied units built prior to 1940
8 DIS weighted distances to five Boston employment centers
9 RAD index of accessibility to radial highways
10 TAX full-value property-tax rate per $10,000
11 PTRATIO pupil-teacher ratio by town
12 BLACK proportion of blacks by town
13 LSTAT % lower status of the population
14 MEDV median value of owner-occupied homes in multiples of $1,000

10. The coordinate descent starts from a large λ so that all coefficients are initially
0 and then gradually decreases the size of λ. In addition, for each iteration,
consider using the coefficients estimated with the last λ as the initial values for
the next λ iteration (warm start). We consider the following program and apply
it to U.S. crime data. Fill in the blank.
1 warm.start = function(X, y, lambda.max = 100) {
2 dec = round(lambda.max / 50); lambda.seq = seq(lambda.max, 1, -dec)
3 r = length(lambda.seq); p = ncol(X); coef.seq = matrix(nrow = r, ncol = p)
4 coef.seq[1, ] = linear.lasso(X, y, lambda.seq[1])$beta
5 for (k in 2:r) coef.seq[k, ] = ## blank ##
6 return(coef.seq)
7 }
8 crime = read.table("crime.txt"); X = crime[, 3:7]; y = crime[, 1]
9 coef.seq = warm.start(X, y, 200)
10 p = ncol(X)
11 lambda.max = 200
12 dec = round(lambda.max / 50)
13 lambda.seq = seq(lambda.max, 1, -dec)
14 plot(log(lambda.seq), coef.seq[, 1], xlab = "log(lambda)", ylab = "Coefficient",
15 ylim = c(min(coef.seq), max(coef.seq)), type = "n")
16 for (j in 1:p) lines(log(lambda.seq), coef.seq[, j], col = j)

Next, setting $\lambda$ large enough for all initial coefficients to be 0, we perform the coordinate descent. For each $j$, show that when $\sum_{i=1}^N x_{i,j}^2 = 1$ and the values $\sum_{i=1}^N x_{i,j}y_i$ are different for all $j = 1, \ldots, p$, the smallest $\lambda$ that makes every variable coefficient 0 is given by
$$\lambda = \max_{1\le j\le p}\left|\frac{1}{N}\sum_{i=1}^N x_{i,j}y_i\right|.$$
Moreover, execute
the following code:
1 X = as.matrix(X); y = as.vector(y); cv = cv.glmnet(X, y); plot(cv)

What is the meaning of the number n?


11. Denote the eigenvalues of X T X by γ1 , . . . , γ p . Using them, derive the condition
where the inverse of X T X does not exist. In addition, show that the eigenvalues of
X T X + N λI are given by γ1 + N λ, . . . , γ p + N λ. Moreover, whenever λ > 0
is satisfied, show that X T X + N λI always has an inverse. Here, we can use the
fact, without proof, that the eigenvalues of a positive semidefinite matrix are all
nonnegative.
12. Let $\lambda \ge 0$. Find the $\beta$ that minimizes
$$\frac{1}{2N}\|y - X\beta\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2 := \frac{1}{2N}\sum_{i=1}^N (y_i - \beta_1 x_{i,1} - \cdots - \beta_p x_{i,p})^2 + \frac{\lambda}{2}\sum_{j=1}^p \beta_j^2 \quad \text{(cf. (1.8))}$$
(Ridge regression).
13. Adding to the function linear of Exercise 1 the additional parameter λ ≥ 0,
write an R program ridge that performs the process of Exercise 1.7. Execute
that program, and check whether your results match the following examples.
1 crime = read.table("crime.txt")
2 X = crime[, 3:7]
3 y = crime[, 1]
4 linear(X, y)

1 $beta
2 [1] 10.9806703 -6.0885294 5.4803042 0.3770443 5.5004712
3 $beta.0
4 [1] 489.6486

The output of the function ridge should be as follows:


1 ridge(X, y)

1 $beta
2 [1] 10.9806703 -6.0885294 5.4803042 0.3770443 5.5004712
3 $beta.0
4 [1] 489.6486

1 ridge(X, y, 200)

1 $beta
2 [1] 0.056351799 -0.019763974 0.077863094 -0.017121797 -0.007039304
3 $beta.0
4 [1] 716.4044

14. The code below performs an analysis of the U.S. crime data in the following
order: while changing the value of λ, it solves for coefficients using the function
ridge and plots a graph of how each coefficient changes. In the program, change
lambda.seq to log(lambda.seq), and change the horizontal axis label
from lambda to log(lambda). Then, plot the graph (and change the title (the
main part)).
1 df = read.table("crime.txt"); x = df[, 3:7]; y = df[, 1]; p = ncol(x)
2 lambda.max = 3000; lambda.seq = seq(1, lambda.max)
3 plot(lambda.seq, xlim = c(0, lambda.max), ylim = c(-12, 12),
4      xlab = "lambda", ylab = "coefficients", main = "changes in coefficients according to lambda",
5      type = "n", col = "red") ## this 1 line
6 for (j in 1:p) {
7   coef.seq = NULL
8   for (lambda in lambda.seq) coef.seq = c(coef.seq, ridge(x, y, lambda)$beta[j])
9   par(new = TRUE)
10   lines(lambda.seq, coef.seq, col = j) ## this 1 line
11 }
12 legend("topright",
13        legend = c("annual police funding", "% of people 25 years+ with 4 yrs. of high school",
14                   "% of 16--19 year-olds not in highschool and not highschool graduates",
15                   "% of people 25 years+ with at least 4 years of college"),
16        col = 1:p, lwd = 2, cex = .8)

15. For given $x_{i,1}, x_{i,2}, y_i \in \mathbb{R}$ ($i = 1, \ldots, N$), the $\beta_1, \beta_2$ that minimize $S := \sum_{i=1}^N (y_i - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2$ are denoted by $\hat\beta_1, \hat\beta_2$, and $\hat\beta_1 x_{i,1} + \hat\beta_2 x_{i,2}$ is denoted by $\hat y_i$ ($i = 1, \ldots, N$).
(a) Prove that the following two equalities hold:
   i. $\displaystyle\sum_{i=1}^N x_{i,1}(y_i - \hat y_i) = \sum_{i=1}^N x_{i,2}(y_i - \hat y_i) = 0.$
   ii. For any $\beta_1, \beta_2$,
   $$y_i - \beta_1 x_{i,1} - \beta_2 x_{i,2} = y_i - \hat y_i - (\beta_1 - \hat\beta_1)x_{i,1} - (\beta_2 - \hat\beta_2)x_{i,2}.$$




In addition, for any $\beta_1, \beta_2$, show that the quantity $\sum_{i=1}^N (y_i - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2$ can be rewritten as
$$(\beta_1 - \hat\beta_1)^2\sum_{i=1}^N x_{i,1}^2 + 2(\beta_1 - \hat\beta_1)(\beta_2 - \hat\beta_2)\sum_{i=1}^N x_{i,1}x_{i,2} + (\beta_2 - \hat\beta_2)^2\sum_{i=1}^N x_{i,2}^2 + \sum_{i=1}^N (y_i - \hat y_i)^2. \quad \text{(cf. (1.13))}$$
(b) Assume that $\sum_{i=1}^N x_{i,1}^2 = \sum_{i=1}^N x_{i,2}^2 = 1$ and $\sum_{i=1}^N x_{i,1}x_{i,2} = 0$. In the case of
usual least squares, we choose β1 = β̂1 , β2 = β̂2 . However, when there
is a constraint requiring |β1 | + |β2 | to be less than or equal to a value,
we have to think of the problem as choosing a point (β1 , β2 ) from a cir-
cle (of as small a radius as possible) centered at (β̂1 , β̂2 ), which has to
lie within the square of the constraint. Fix the square whose vertices are
(1, 0), (0, 1), (−1, 0), (0, −1), and fix a point (β̂1 , β̂2 ) outside the square.
Then, consider the circle centered at (β̂1 , β̂2 ). We expand the circle (increase
its radius) until it touches the square. Now, draw the range of (β̂1 , β̂2 ) such
that when the circle touches the square, one of the coordinates of the point
of contact will be 0.
(c) In (b), what would happen if the square is replaced by a circle of radius 1
(circle touches circle)?
16. A matrix X is composed of p explanatory variables with N samples each. If
it has two columns that are equal for all N samples, let λ > 0, perform Ridge
regression, and show that the estimated coefficients of the two vectors will be
equal.
Hint Note that
$$L = \frac{1}{N}\sum_{i=1}^N\Big(y_i - \beta_0 - \sum_{j=1}^p x_{i,j}\beta_j\Big)^2 + \lambda\sum_{j=1}^p \beta_j^2.$$
Partially differentiating it with respect to $\beta_k$ and $\beta_l$ gives $-\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,k}\Big(y_i - \beta_0 - \sum_{j=1}^p x_{i,j}\beta_j\Big) + \lambda\beta_k$ and $-\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,l}\Big(y_i - \beta_0 - \sum_{j=1}^p x_{i,j}\beta_j\Big) + \lambda\beta_l$. Both of them have to be 0. Plug $x_{i,k} = x_{i,l}$ into each of them.
17. We perform linear regression analysis with Lasso for the case where Y is the tar-
get variable and the variables X 1 , X 2 , X 3 , and X 4 , X 5 , X 6 are highly correlated.
We generate N = 500 groups of data, distributed as in the relations below, and
then apply linear regression analysis with Lasso to X ∈ R N × p , y ∈ R N .

$$z_1, z_2, \epsilon, \epsilon_1, \ldots, \epsilon_6 \sim N(0, 1)$$
$$x_j := z_1 + \epsilon_j/5, \quad j = 1, 2, 3$$
$$x_j := z_2 + \epsilon_j/5, \quad j = 4, 5, 6$$
$$y := 3z_1 - 1.5z_2 + 2\epsilon.$$

Fill in the blanks (1), (2) below. Plot a graph showing how each coefficient
changes with λ.
1 n = 500; x = array(dim = c(n, 6)); z = array(dim = c(n, 2))
2 for (i in 1:2) z[, i] = rnorm(n)
3 y = ## blank (1) ##
4 for (j in 1:3) x[, j] = z[, 1] + rnorm(n) / 5
5 for (j in 4:6) x[, j] = z[, 2] + rnorm(n) / 5
6 glm.fit = glmnet(## blank (2) ##); plot(glm.fit)

18. Instead of the usual Lasso or Ridge, we find the $\beta_0, \beta$ values that minimize
$$\frac{1}{2N}\|y - \beta_0 - X\beta\|_2^2 + \lambda\left\{\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right\} \quad \text{(cf. (1.14))} \quad (1.17)$$
(we mean-center the $X, y$ data, initially let $\beta_0 = 0$, and restore its value later). The $\beta_j$ that minimizes (1.14) is given by $\hat\beta_j = \dfrac{S_\lambda\big(\frac{1}{N}\sum_{i=1}^N r_{i,j}x_{i,j}\big)}{\frac{1}{N}\sum_{i=1}^N x_{i,j}^2}$ in the case of Lasso ($\alpha = 1$), by $\hat\beta_j = \dfrac{\frac{1}{N}\sum_{i=1}^N r_{i,j}x_{i,j}}{\frac{1}{N}\sum_{i=1}^N x_{i,j}^2 + \lambda}$ in the case of Ridge ($\alpha = 0$), or, in general cases, by
$$\hat\beta_j = \frac{S_{\lambda\alpha}\big(\frac{1}{N}\sum_{i=1}^N r_{i,j}x_{i,j}\big)}{\frac{1}{N}\sum_{i=1}^N x_{i,j}^2 + \lambda(1-\alpha)} \quad \text{(cf. (1.15))} \quad (1.18)$$

(the elastic net cases). By finding the subderivative of (1.17) with respect to β j
and sequentially proving the three equations below, show that (1.18) holds. Here,
x j denotes the j-th column of X .

i. $0 \in -\dfrac{1}{N}x_j^T(y - X\beta) + \lambda(1-\alpha)\beta_j + \lambda\alpha\begin{cases} 1, & \beta_j > 0\\ [-1,1], & \beta_j = 0\\ -1, & \beta_j < 0;\end{cases}$

ii. $0 \in -\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,j}r_{i,j} + \left(\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,j}^2 + \lambda(1-\alpha)\right)\beta_j + \lambda\alpha\begin{cases} 1, & \beta_j > 0\\ [-1,1], & \beta_j = 0\\ -1, & \beta_j < 0;\end{cases}$

iii. $\left(\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,j}^2 + \lambda(1-\alpha)\right)\beta_j = S_{\lambda\alpha}\left(\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,j}r_{i,j}\right).$
In addition, revise the function linear.lasso in Exercise 8, let its default
parameter be alpha = 1, and generalize it by changing the formula to (1.18).

Moreover, correct the value of $1 - \alpha$ by dividing it by $\sum_{i=1}^N (y_i - \bar y)^2/N$ (this normalization exists in glmnet).


19. To the function glm.fit = glmnet(x, y) in Exercise 17, add the option alpha = 0.3 and see the output. What are the differences between alpha = 0 (Ridge) and alpha = 1 (Lasso)?
20. Even for the case of the elastic net, we need to select the optimal value for α. In the program provided below, the output cvm of the cv.glmnet function contains the squared errors for each λ with respect to a specific value of α. For each α, the program records the smallest squared error attained over λ, compares these values across α, and returns the α with the smallest squared error. Once we obtain the optimal α, plot a figure similar to that of Exercise 17.
1 alpha = seq(0.01, 0.99, 0.01); m = length(alpha); mse = array(dim = m)
2 for (i in 1:m) {cvg = cv.glmnet(x, y, alpha = alpha[i]); mse[i] = min(cvg$cvm)}
3 best.alpha = alpha[which.min(mse)]; best.alpha
4 cva = cv.glmnet(x, y, alpha = best.alpha)
5 best.lambda = cva$lambda.min; best.lambda
Chapter 2
Generalized Linear Regression

In this chapter, we consider the so-called generalized linear regression, which


includes logistic regression (binary and multiple cases), Poisson regression, and Cox
regression. We can formulate these problems in terms of maximizing the likelihood
and solve them by applying the Newton method: differentiate the log-likelihood by
the parameters to be estimated, and solve the equation such that the differentiated
value is zero.
In general, the solution of the maximum likelihood method may not have a finite
value and may not converge even if the Newton method is applied. However, if we
regularize the log-likelihood via Lasso, the larger the value of λ is, the more likely
it converges, and we can choose the relevant variables.

2.1 Generalization of Lasso in Linear Regression

In Chap. 1, centralizing X ∈ R N × p and y ∈ R N and eliminating the intercept β0 ∈ R,


we obtain the β value that minimizes

$$\frac{1}{2N}\|y - X\beta\|^2 + \lambda\|\beta\|_1$$

for given λ ≥ 0. In this section, letting W ∈ R N ×N be nonnegative definite, we obtain


the Lasso solution of the extended linear regression.

$$L_0 := \frac{1}{2N}(y - \beta_0 - X\beta)^T W(y - \beta_0 - X\beta).$$

To this end, we centralize each column of X and y and eliminate β0 . In particular, if


we centralize

$$\bar X_k := \frac{\sum_{i=1}^N\sum_{j=1}^N w_{i,j}x_{j,k}}{\sum_{i=1}^N\sum_{j=1}^N w_{i,j}}, \quad k = 1, \ldots, p, \qquad \bar y := \frac{\sum_{i=1}^N\sum_{j=1}^N w_{i,j}y_j}{\sum_{i=1}^N\sum_{j=1}^N w_{i,j}}$$
$$x_{i,k} \leftarrow x_{i,k} - \bar X_k, \quad i = 1, \ldots, N, \ k = 1, \ldots, p$$
$$y_i \leftarrow y_i - \bar y, \quad i = 1, \ldots, N$$
then, considering the weights $W = (w_{i,j})$, we have $\sum_{i=1}^N\sum_{j=1}^N w_{i,j}y_j = 0$ and $\sum_{i=1}^N\sum_{j=1}^N w_{i,j}x_{j,k} = 0$, which means that the solution
$$\hat\beta_0 = \bar y - \sum_{k=1}^p \bar X_k\hat\beta_k \tag{2.1}$$
of the equation obtained by adding all the rows of
$$\frac{\partial L_0}{\partial\beta_0} = -\frac{1}{N}W(y - \beta_0 - X\beta) = -\frac{1}{N}\begin{bmatrix}\sum_{j=1}^N w_{1,j}y_j - \beta_0\sum_{j=1}^N w_{1,j} - \sum_{j=1}^N w_{1,j}\sum_{k=1}^p x_{j,k}\beta_k\\ \vdots\\ \sum_{j=1}^N w_{N,j}y_j - \beta_0\sum_{j=1}^N w_{N,j} - \sum_{j=1}^N w_{N,j}\sum_{k=1}^p x_{j,k}\beta_k\end{bmatrix} = \begin{bmatrix}0\\ \vdots\\ 0\end{bmatrix}$$
is zero. Once we obtain the centralization, we minimize

$$L(\beta) := \frac{1}{2N}(y - X\beta)^T W(y - X\beta) + \lambda\|\beta\|_1.$$

If we execute the Cholesky decomposition as W = M T M and define V := M X ,


u := M y, we may write

$$L(\beta) = \frac{1}{2N}\|u - V\beta\|_2^2 + \lambda\|\beta\|_1.$$

From the obtained β̂ and (2.1), we obtain β̂0 .


We show a sample R program that realizes the procedure as follows:
1 W.linear.lasso = function(X, y, W, lambda = 0) {
2 n = nrow(X); p = ncol(X); X.bar = array(dim = p)
3 for (k in 1:p) {
4 X.bar[k] = sum(W %*% X[, k]) / sum(W)
5 X[, k] = X[, k] - X.bar[k]
6 }
7 y.bar = sum(W %*% y) / sum(W); y = y - y.bar
8 L = chol(W)
9 # L = sqrt(W)
2.1 Generalization of Lasso in Linear Regression 39

10 u = as.vector(L %*% y); V = L %*% X


11 beta = linear.lasso(V, u, lambda)$beta
12 beta.0 = y.bar - sum(X.bar * beta)
13 return(c(beta.0, beta))
14 }

The matrix X, an input to the function W.linear.lasso, has p columns, corre-


sponding to the number of variables. The output of the function is not a list but a
vector of length p + 1 consisting of the intercept and slope in this order.
The Cholesky decomposition requires $O(N^3)$ execution time for a matrix of size $N$, and an error may occur when $N$ is too large. However, in the problems addressed in this chapter, the matrix $W$ is diagonal with positive diagonal elements. In these cases, we may replace L = chol(W) with the elementwise square root L = sqrt(W) (the commented-out line in the code above), which gives the same result for a diagonal $W$. Moreover, because we consider a sparse situation, we may assume that $N$ is not so large compared with $p$.

2.2 Logistic Regression for Binary Values

Let Y be a random variable that takes values in {0, 1}. We assume that for each
x ∈ R p (row vector), there exist β0 ∈ R and β = [β1 , . . . , β p ] ∈ R p such that the
probability P(Y = 1 | x) satisfies

$$\log\frac{P(Y = 1\mid x)}{P(Y = 0\mid x)} = \beta_0 + x\beta \tag{2.2}$$

(logistic regression). Then, we may write (2.2) as

$$P(Y = 1\mid x) = \frac{\exp(\beta_0 + x\beta)}{1 + \exp(\beta_0 + x\beta)}. \tag{2.3}$$

Example 11 Let p = 1 and β0 = 0. We display the shapes of the distributions that


are given by the right-hand side of (2.3) for various β ∈ R. The larger the value of
β > 0, the more suddenly P(Y = 1 | x) grows at x = 0 from almost zero to almost
one. We show the curves in Fig. 2.1.
1 f = function(x) return(exp(beta.0 + beta * x) / (1 + exp(beta.0 + beta * x)))
2 beta.0 = 0; beta.seq = c(0, 0.2, 0.5, 1, 2, 10)
3 m = length(beta.seq)
4 beta = beta.seq[1]
5 plot(f, xlim = c(-10, 10), ylim = c(0, 1), xlab = "x", ylab = "y",
6 col = 1, main = "Logistic Curve")
7 for (i in 2:m) {
8 beta = beta.seq[i]
9 par(new = TRUE)
10 plot(f, xlim = c(-10, 10), ylim = c(0, 1), xlab = "", ylab = "", axes = FALSE, col = i)

Fig. 2.1 Execution of Example 11. Setting β0 = 0, we draw the curves of (2.3) when β > 0 changes. [Plot: P(Y = 1|x) versus x, for β = 0, 0.2, 0.5, 1, 2, 10]

11 }
12 legend("topleft", legend = beta.seq, col = 1:length(beta.seq), lwd =
2, cex = .8)
13 par(new = FALSE)

On the other hand, if we change the variable Y as Y → 2Y − 1 such that Y takes


values in {−1, 1}, then the distribution can be expressed by

$$P(Y = y\mid x) = \frac{1}{1 + \exp\{-y(\beta_0 + x\beta)\}}. \tag{2.4}$$

In particular, if the realizations of $x \in \mathbb{R}^p$ and $Y \in \{-1, 1\}$ in (2.2) are $(x_i, y_i)$, $i = 1, \ldots, N$, then the likelihood is given by $\prod_{i=1}^N \dfrac{1}{1 + e^{-y_i(\beta_0 + x_i\beta)}}$, which is the product of $P(Y = 1\mid x_i) = \dfrac{e^{\beta_0 + x_i\beta}}{1 + e^{\beta_0 + x_i\beta}} = \dfrac{1}{1 + e^{-(\beta_0 + x_i\beta)}}$ and $P(Y = -1\mid x_i) = \dfrac{1}{1 + e^{\beta_0 + x_i\beta}}$ for $i = 1, \ldots, N$.
For simplicity, in the following, we identify the maximization of the likelihood and the minimization of the minus log-likelihood $L$:
$$L := \frac{1}{N}\sum_{i=1}^N \log(1 + \exp\{-y_i(\beta_0 + x_i\beta)\}). \tag{2.5}$$

Let vi := exp{−yi (β0 + xi β)} for i = 1, . . . , N , and


$$u := \begin{bmatrix}\dfrac{y_1v_1}{1+v_1}\\ \vdots\\ \dfrac{y_Nv_N}{1+v_N}\end{bmatrix}.$$
Then, the vector $\nabla L \in \mathbb{R}^{p+1}$ whose $j$-th element is $\dfrac{\partial L}{\partial\beta_j}$ ($j = 0, 1, \ldots, p$) can be expressed by $\nabla L = -\dfrac{1}{N}X^Tu$. In fact, we observe $\dfrac{\partial L}{\partial\beta_j} = -\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,j}y_i\dfrac{v_i}{1+v_i}$ ($j = 0, 1, \ldots, p$). If we let
$$W = \begin{bmatrix}\dfrac{v_1}{(1+v_1)^2} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \dfrac{v_N}{(1+v_N)^2}\end{bmatrix} \quad \text{(symmetric matrix)},$$
then the matrix $\nabla^2 L$ such that the $(j,k)$-th element is $\dfrac{\partial^2 L}{\partial\beta_j\partial\beta_k}$, $j, k = 0, 1, \ldots, p$, can be expressed by $\nabla^2 L = \dfrac{1}{N}X^TWX$. In fact, we have
$$\frac{\partial^2 L}{\partial\beta_j\partial\beta_k} = -\frac{1}{N}\sum_{i=1}^N x_{i,j}y_i\frac{\partial}{\partial\beta_k}\frac{v_i}{1+v_i} = -\frac{1}{N}\sum_{i=1}^N x_{i,j}y_i\frac{v_i}{(1+v_i)^2}(-y_ix_{i,k}) = \frac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,k}\frac{v_i}{(1+v_i)^2} \quad (j, k = 0, 1, \ldots, p),$$
where $y_i^2 = 1$.
Next, to obtain the coefficients of logistic regression that maximize the likelihood, we give the initial value of $(\beta_0, \beta)$ and repeatedly apply the updates of the Newton method
$$\begin{bmatrix}\beta_0\\ \beta\end{bmatrix} \leftarrow \begin{bmatrix}\beta_0\\ \beta\end{bmatrix} - \{\nabla^2 L(\beta_0, \beta)\}^{-1}\nabla L(\beta_0, \beta) \tag{2.6}$$
to obtain the solution of $\nabla L(\beta_0, \beta) = 0$.
If we let $z = X\begin{bmatrix}\beta_0\\ \beta\end{bmatrix} + W^{-1}u$, then we have
   
$$\begin{bmatrix}\beta_0\\ \beta\end{bmatrix} - \{\nabla^2 L(\beta_0, \beta)\}^{-1}\nabla L(\beta_0, \beta) = \begin{bmatrix}\beta_0\\ \beta\end{bmatrix} + (X^TWX)^{-1}X^Tu = (X^TWX)^{-1}X^TW\Big(X\begin{bmatrix}\beta_0\\ \beta\end{bmatrix} + W^{-1}u\Big) = (X^TWX)^{-1}X^TWz, \tag{2.7}$$

which means that we need only to repeatedly execute the steps


1. obtain W, z from β0 , β and
2. obtain β0 , β from W, z.

Example 12 We execute the maximum likelihood solution following the above pro-
cedure.
1 ## Data Generation
2 N = 1000; p = 2; X = matrix(rnorm(N * p), ncol = p); X = cbind(rep(1, N), X)
3 beta = rnorm(p + 1); y = array(N); s = as.vector(X %*% beta); prob = 1 / (1 + exp(s))
4 for (i in 1:N) {if (runif(1) > prob[i]) y[i] = 1 else y[i] = -1}
5 beta

1 [1] -0.5859092 0.1610445 0.4134176

1 ## Computation of the ML solution


2 beta = Inf; gamma = rnorm(p + 1)
3 while (sum((beta - gamma) ^ 2) > 0.001) {
4 beta = gamma
5 s = as.vector(X %*% beta)
6 v = exp(-s * y)
7 u = y * v / (1 + v)
8 w = v / (1 + v) ^ 2
9 z = s + u / w
10 W = diag(w)
11 gamma = as.vector(solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% z)
##
12 print(gamma)
13 }
14 beta ## The true value that we wish to estimate

1 [1] -0.68982062 0.06228453 0.37459366

Repeating the cycle of the Newton method, the tentative estimates approach the true value.
1 [1] -0.8248500 -0.3305656 -0.6027963
2 [1] -0.4401921 0.2054357 0.6739074
3 [1] -0.68982062 0.06228453 0.37459366
4 [1] -0.68894021 0.07094424 0.40108031

Although the maximum likelihood method can estimate the values of β0 , β, when
p is large relative to N , the absolute values of the estimates β̂0 , β̂ may go to infinity.
Worse, the larger the value of p, the more likely the procedure diverges. For example,
suppose that the rank of X is N and that N < p. Then, for any α1 , . . . , α N > 0,
there exists (β0 , β) ∈ R p+1 such that yi (β0 + xi β) = αi > 0 (i = 1, . . . , N ). If we
multiply β0 , β by two, the likelihood strictly increases. In general, the larger the
value of p is, the more often this phenomenon occurs.
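Indeed, each likelihood term equals $1/(1 + e^{-y_i(\beta_0 + x_i\beta)}) = 1/(1 + e^{-\alpha_i})$, which is strictly increasing in $\alpha_i$; replacing $(\beta_0, \beta)$ by $(2\beta_0, 2\beta)$ doubles every $\alpha_i$ and hence strictly increases every factor of the likelihood.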
This section considers applying Lasso to obtain a reasonable solution, although
it does not maximize the likelihood. The main idea is to regard the almost zero
coefficients as zeros to choose relevant variables when p is large. To this end, we
add a regularization term to (2.5):

$$L := \frac{1}{N}\sum_{i=1}^N \log(1 + \exp\{-y_i(\beta_0 + x_i\beta)\}) + \lambda\|\beta\|_1 \tag{2.8}$$

and extend the Newton method as follows. We note that (2.7) is the $\beta$ that minimizes
$$\frac{1}{2N}(z - X\beta)^T W(z - X\beta). \tag{2.9}$$
In fact, we observe
$$\nabla\left\{\frac{1}{2}\Big(z - X\begin{bmatrix}\beta_0\\ \beta\end{bmatrix}\Big)^T W\Big(z - X\begin{bmatrix}\beta_0\\ \beta\end{bmatrix}\Big)\right\} = X^TWX\begin{bmatrix}\beta_0\\ \beta\end{bmatrix} - X^TWz.$$

After centralizing $X, y$ as introduced in Sect. 2.1, we minimize
$$\frac{1}{2N}(z - X\beta)^T W(z - X\beta) + \lambda\|\beta\|_1, \tag{2.10}$$
where the $M$ in Sect. 2.1 is a diagonal matrix with diagonal elements $\sqrt{w_i}$, $i = 1, \ldots, N$. In other words, we need only to repeatedly execute the steps
1. obtain W, z from β0 , β and
2. obtain β0 , β that minimize (2.10) from W, z
(proxy Newton method). Using the function W.linear.lasso in Sect. 2.1, we
need to update only ## in the program of Example 12 as follows, where we assume
that the leftmost column of X consists of N ones and X contains p + 1 columns.
1 logistic.lasso = function(X, y, lambda) {
2 p = ncol(X)
3 beta = Inf; gamma = rnorm(p)
4 while (sum((beta - gamma) ^ 2) > 0.01) {
5 beta = gamma
6 s = as.vector(X %*% beta)
7 v = as.vector(exp(-s * y))
44 2 Generalized Linear Regression

8 u = y * v / (1 + v)
9 w = v / (1 + v) ^ 2
10 z = s + u / w
11 W = diag(w)
12 gamma = W.linear.lasso(X[, 2:p], z, W, lambda = lambda)
13 print(gamma)
14 }
15 return(gamma)
16 }

Example 13 Generating data, we examine the behavior of the function


logistic.lasso.
1 N = 100; p = 2; X = matrix(rnorm(N * p), ncol = p); X = cbind(rep(1, N), X)
2 beta = rnorm(p + 1); y = array(N); s = as.vector(X %*% beta); prob = 1 / (1 + exp(s))
3 for (i in 1:N) {if (runif(1) > prob[i]) y[i] = 1 else y[i] = -1}
4 logistic.lasso(X, y, 0)

1 [1] -0.4066920 1.0055186 0.7396638

1 logistic.lasso(X, y, 0.1)

1 [1] -0.3759565 0.6798822 0.4313101

1 logistic.lasso(X, y, 0.2)

1 [1] -0.3489496 0.4094120 0.1710077

Example 14 After estimating parameters via logistic regression, we classify each


new data point that follows the same distribution. Based on the estimated β, we evaluate whether or not the exponent X %*% beta.est in (2.2) is positive.
1 ## Data Generation
2 N = 100; p = 2; X = matrix(rnorm(N * p), ncol = p); X = cbind(rep(1, N), X)
3 beta = 10 * rnorm(p + 1); y = array(N); s = as.vector(X %*% beta); prob = 1 / (1 + exp(s))
4 for (i in 1:N) {if (runif(1) > prob[i]) y[i] = 1 else y[i] = -1}
5 ## Parameter Estimation
6 beta.est = logistic.lasso(X, y, 0.1)
7 ## Classification
8 for (i in 1:N) {if (runif(1) > prob[i]) y[i] = 1 else y[i] = -1}
9 z = sign(X %*% beta.est) ## If the exponent is positive, then z=+1,
otherwise z=-1
10 table(y, z)

1 z
2 y -1 1
3 -1 70 3
4 1 7 20

Note that y and z are the correct answer and the predictive value, respectively. In
this example, we predict the correct answer with a precision of 90 %.

On the other hand, the function glmnet used for linear regression can be used
for logistic regression [11].

Example 15 The following program analyzes the dataset breastcancer, which


consists of 250 samples (58 cases and 192 control expression data points) w.r.t.
1,001 variables: 1,000 genes (X : covariates) and a case/control (Y : response) of
breast cancer.
1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 ## Put breastcancer.csv in the current directory
4 x = as.matrix(df[, 1:1000])
5 y = as.vector(df[, 1001])
6 cv = cv.glmnet(x, y, family = "binomial")
7 cv2 = cv.glmnet(x, y, family = "binomial", type.measure = "class")
8 par(mfrow = c(1, 2))
9 plot(cv)
10 plot(cv2)
11 par(mfrow = c(1, 1))

We evaluate the CV values for each λ and connect the plots (Fig. 2.2). The option cv.glmnet (default) evaluates the CV based on the binomial deviance
$$2\sum_{i: y_i = 1}\log(1 + \exp\{-(\hat\beta_0 + x_i\hat\beta)\}) + 2\sum_{i: y_i = -1}\log(1 + \exp\{\hat\beta_0 + x_i\hat\beta\})$$
[11], where $\hat\beta$ is the estimate from the training data, $(x_1, y_1), \ldots, (x_m, y_m)$ are the test data, and we evaluate the CV by switching the training and test data several times. If we specify the option type.measure = "class", we evaluate the CV based on the error probability.
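As an illustration of this criterion (a sketch; the function name and arguments are hypothetical, and we use the convention $y \in \{-1, 1\}$), the deviance can be computed directly from the estimates $\hat\beta_0, \hat\beta$ and test data:
1 ## Sketch: binomial deviance (-2 x log-likelihood) on test data with y in {-1, 1}
2 binomial.deviance = function(beta.0, beta, x.test, y.test) {
3   s = beta.0 + as.vector(x.test %*% beta)  ## linear predictor
4   2 * sum(log(1 + exp(-y.test * s)))
5 }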
Using glmnet, we obtain the set of relevant genes with nonzero coefficients for
the λ that gives the best performance. It seems that the appropriate log λ is between
−4 and −3. Setting λ = 0.03, we execute the following procedure, in which we
have used beta = drop(glm$beta) rather than the matrix expression beta
= glm$beta because the former is easier to realize beta[beta != 0].
1 glm = glmnet(x, y, lambda = 0.03, family = "binomial")
2 beta = drop(glm$beta); beta[beta != 0]

[Fig. 2.2 plots: Binomial Deviance versus log λ (left) and Misclassification Error versus log λ (right), with the number of nonzero coefficients shown along the top]

Fig. 2.2 The execution result of Example 15. We evaluated the breast cancer dataset for each λ; the left and right figures show the CV evaluations based on the binomial deviance and the error probability, respectively. glmnet shows the confidence interval by the upper and lower bounds in the graph [11]. We observe that the best log λ lies between −4 and −3. Because the CV evaluation is based on samples, we show the range of the best λ

2.3 Logistic Regression for Multiple Values

When the response takes one of $K \ge 2$ values rather than two, the probability associated with the logistic curves is generalized as
$$P(Y = k\mid x) = \frac{e^{\beta_{0,k} + x\beta^{(k)}}}{\sum_{l=1}^K e^{\beta_{0,l} + x\beta^{(l)}}} \quad (k = 1, \ldots, K). \tag{2.11}$$

Thus far, we have regarded β ∈ R p as a column vector. In this section, we regard


β ∈ R p×K as a matrix, β j = [β j,1 , . . . , β j,K ] ∈ R K as a row vector, and β (k) =
[β1,k , . . . , β p,k ] ∈ R p as a column vector. Given the observation (x1 , y1 ), . . . ,
(x N , y N ) ∈ R p × {1, . . . , K } we can evaluate the negated log-likelihood

$$L := -\frac{1}{N}\sum_{i=1}^N\sum_{h=1}^K I(y_i = h)\log\frac{\exp\{\beta_{0,h} + x_i\beta^{(h)}\}}{\sum_{l=1}^K\exp\{\beta_{0,l} + x_i\beta^{(l)}\}}.$$

Similar to binary logistic regression, we calculate ∇ L and ∇ 2 L.

Proposition 1 The differential of $L$ is given by the column vector
$$\frac{\partial L}{\partial\beta_{j,k}} = -\frac{1}{N}\sum_{i=1}^N x_{i,j}\{I(y_i = k) - \pi_{k,i}\},$$
where
$$\pi_{k,i} := \frac{\exp(\beta_{0,k} + x_i\beta^{(k)})}{\sum_{l=1}^K\exp(\beta_{0,l} + x_i\beta^{(l)})}.$$

For the proof, see the appendix.

Proposition 2 The twice differential of $L$ is given by the matrix
$$\frac{\partial^2 L}{\partial\beta_{j,k}\partial\beta_{j',k'}} = \frac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,j'}w_{i,k,k'},$$
where $w_{i,k,k'}$ is defined as
$$w_{i,k,k'} := \begin{cases}\pi_{i,k}(1 - \pi_{i,k}), & k = k'\\ -\pi_{i,k}\pi_{i,k'}, & k \neq k'.\end{cases} \tag{2.12}$$

For the proof, see the appendix.

Proposition 3 (Gershgorin) Let $A = (a_{i,j}) \in \mathbb{R}^{n\times n}$ be a symmetric matrix. If $a_{i,i} \ge \sum_{j\neq i}|a_{i,j}|$ for all $i = 1, \ldots, n$, then $A$ is nonnegative definite.

For the proof, see the appendix.
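As a quick numerical illustration (a sketch with an arbitrarily chosen matrix), a symmetric matrix satisfying the condition of Proposition 3 indeed has nonnegative eigenvalues:
1 ## Sketch: checking Proposition 3 (Gershgorin) numerically
2 A = matrix(c(2, -1, 1, -1, 3, 2, 1, 2, 4), nrow = 3)
3 all(diag(A) >= rowSums(abs(A)) - abs(diag(A)))  ## the condition a_ii >= sum_{j != i} |a_ij|
4 eigen(A)$values                                 ## all nonnegative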

Proposition 4 $L$ is convex.

Proof $W_i = (w_{i,k,k'}) \in \mathbb{R}^{K\times K}$ is nonnegative definite. In fact, from Proposition 2, we have
$$w_{i,k,k} = \pi_{i,k}(1 - \pi_{i,k}) = \sum_{k'\neq k}\pi_{i,k}\pi_{i,k'} = \sum_{k'\neq k}|w_{i,k,k'}|$$
for all $k = 1, \ldots, K$, and the condition of Proposition 3 is satisfied. Therefore, for any $\gamma = (\gamma_{j,k}) \in \mathbb{R}^{p\times K}$, we have
$$\sum_{j=1}^p\sum_{k=1}^K\sum_{j'=1}^p\sum_{k'=1}^K \gamma_{j,k}\frac{\partial^2 L}{\partial\beta_{j,k}\partial\beta_{j',k'}}\gamma_{j',k'} = \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^p\sum_{k=1}^K\sum_{j'=1}^p\sum_{k'=1}^K \gamma_{j,k}x_{i,j}w_{i,k,k'}x_{i,j'}\gamma_{j',k'} = \frac{1}{N}\sum_{i=1}^N\sum_{k=1}^K\sum_{k'=1}^K\Big(\sum_{j=1}^p x_{i,j}\gamma_{j,k}\Big)w_{i,k,k'}\Big(\sum_{j'=1}^p x_{i,j'}\gamma_{j',k'}\Big) \ge 0,$$
and $L$ is convex. $\square$

From a similar discussion as in the previous section, we have the procedure that minimizes
$$L := -\frac{1}{N}\sum_{i=1}^N\sum_{h=1}^K I(y_i = h)\log\frac{\exp\{\beta_{0,h} + x_i\beta^{(h)}\}}{\sum_{l=1}^K\exp\{\beta_{0,l} + x_i\beta^{(l)}\}} + \lambda\sum_{k=1}^K\sum_{j=1}^p|\beta_{j,k}|. \tag{2.13}$$
The second term in (2.8) is extended as
$$\lambda\sum_{k=1}^K\|\beta^{(k)}\|_1 = \lambda\sum_{k=1}^K\sum_{j=1}^p|\beta_{j,k}|.$$

Because the computation is rather complicated, we often approximate the Taylor expansion such that the nondiagonal elements of $\nabla^2 L$ are zero for $k \neq k'$. Because the objective function is convex, there will be no problem with the convergence.
We note that the value of (2.11) does not change even if we subtract $\gamma_0 + x\gamma$ from all the exponents in the numerator and denominator for any $\gamma_0 \in \mathbb{R}$, $\gamma \in \mathbb{R}^p$. Thus, the first term of (2.13) does not change even if we replace $\beta_{0,k} + x\beta^{(k)}$ by $\beta_{0,k} - \gamma_0 + x(\beta^{(k)} - \gamma)$. However, the second term changes. For $\gamma = (\gamma_1, \ldots, \gamma_p)$, the $\gamma_j$ that minimizes $\sum_{k=1}^K|\beta_{j,k} - \gamma_j|$ is the median of $\beta_{j,1}, \ldots, \beta_{j,K}$.

Proposition 5 For an ordered sequence $a_1 \le \cdots \le a_n$ of finite length, $x = a_{m+1}$ and $x = (a_m + a_{m+1})/2$ minimize $f(x) = \sum_{i=1}^n|x - a_i|$ when $n = 2m+1$ and $n = 2m$, respectively. Thus, the median of $a_1, \ldots, a_n$ minimizes the function $f$.

For the proof, see the appendix.


By minimizing (2.13), the condition is satisfied (the median is chosen).
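As a quick numerical check of Proposition 5 (a sketch with arbitrarily chosen values):
1 ## Sketch: the median minimizes the sum of absolute deviations
2 a = c(-1.2, 0.3, 0.5, 2.0, 3.1)
3 f = function(x) sum(abs(x - a))
4 xs = seq(-2, 4, 0.01)
5 xs[which.min(sapply(xs, f))]  ## approximately 0.5
6 median(a)                     ## 0.5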
Based on the above discussion, we execute the following procedure, in which the quantities v, u, w, z of logistic.lasso are computed for each k = 1, . . . , K.
1 multi.lasso = function(X, y, lambda) {
2 X = as.matrix(X)
3 p = ncol(X)
4 n = nrow(X)
5 K = length(table(y))
6 beta = matrix(1, nrow = K, ncol = p)
7 gamma = matrix(0, nrow = K, ncol = p)
8 while (norm(beta - gamma, "F") > 0.1) {
9 gamma = beta
10 for (k in 1:K) {
11 r = 0
12 for (h in 1:K) {if (k != h) r = r + exp(as.vector(X %*% beta[h, ]))}
13 v = exp(as.vector(X %*% beta[k, ])) / r
14 u = as.numeric(y == k) - v / (1 + v)
15 w = v / (1 + v) ^ 2
16 z = as.vector(X %*% beta[k, ]) + u / w
17 beta[k, ] = W.linear.lasso(X[, 2:p], z, diag(w), lambda = lambda)
18 print(beta[k, ])
19 }
20 for (j in 1:p) {
21 med = median(beta[, j])
2.3 Logistic Regression for Multiple Values 49

22 for (h in 1:K) beta[h, j] = beta[h, j] - med


23 }
24 }
25 return(beta)
26 }

However, the value of $\beta_{0,k}$ is not unique. In the glmnet package, to maintain uniqueness, the values of $\beta_{0,k}$, $k = 1, \ldots, K$, are determined such that $\sum_{k=1}^K \beta_{0,k} = 0$ [11]. When we obtain $\beta_{0,1}, \ldots, \beta_{0,K}$, we subtract their arithmetic mean from them.
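For instance (a minimal sketch with hypothetical intercepts):
1 ## Sketch: enforcing sum(beta.0) = 0 by subtracting the arithmetic mean
2 beta.0 = c(0.5, -0.2, 0.9)   ## intercepts for K = 3 classes (hypothetical)
3 beta.0 - mean(beta.0)        ## now sums to zero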

Example 16 (Fisher’s Iris) Using the function multi.lasso, we applied multi-


ple value logistic regression to Fisher’s Iris dataset to obtain the maximum likelihood
estimates β̂ ∈ R p×K and the predictions X β̂. We found that all β̂2,k are zeros, which
is because for each j, β j,k = 0 for only one k such that β j,k is the median. As a
result, for the first 50, the next 50, and the last 50 rows, the “1”, “2”, and “3” have
the largest values of X β̂, respectively.
1 df = iris
2 x = matrix(0, 150, 4); for (j in 1:4) x[, j] = df[[j]]
3 X = cbind(1, x)
4 y = c(rep(1, 50), rep(2, 50), rep(3, 50))
5 beta = multi.lasso(X, y, 0.01)
6 X %*% t(beta)

Example 17 (Fisher’s Iris) For Example 16, we obtained the optimum λ via the
two CV procedures using the R package glmnet [11], which we show in Fig. 2.3.
1 library(glmnet)
2 df = iris
3 x = as.matrix(df[, 1:4]); y = as.vector(df[, 5])
4 n = length(y); u = array(dim = n)
5 for (i in 1:n) if (y[i] == "setosa") u[i] = 1 else
6 if (y[i] == "versicolor") u[i] = 2 else u[i] = 3
7 u = as.numeric(u)
8 cv = cv.glmnet(x, u, family = "multinomial")
9 cv2 = cv.glmnet(x, u, family = "multinomial", type.measure = "class")
10 par(mfrow = c(1, 2)); plot(cv); plot(cv2); par(mfrow = c(1, 1))
11 lambda = cv$lambda.min; result = glmnet(x, y, lambda = lambda, family = "multinomial")
12 beta = result$beta; beta.0 = result$a0
13 v = rep(0, n)
14 for (i in 1:n) {
15 max.value = -Inf
16 for (j in 1:3) {
17 value = beta.0[j] + sum(beta[[j]] * x[i, ])
18 if (value > max.value) {v[i] = j; max.value = value}
19 }
20 }
21 table(u, v)

[Fig. 2.3 plots: Multinomial Deviance versus log λ (left) and Misclassification Error versus log λ (right), with the number of nonzero variables shown along the top]

Fig. 2.3 We draw the scores obtained via glmnet for each λ, in which the left and right scores
are the multiple deviance and error probability, respectively. We may choose the λ that minimizes
the score. However, if the sample size is small, some blurring occurs. Thus, the upper and lower
bounds that specify the confidence interval are displayed. The figures 0, 1, 2, 3 in the upper locations
indicate how many variables are chosen as nonzero for the λ chosen.


2.4 Poisson Regression

Suppose that there exists μ > 0 such that the distribution of a random variable Y that
ranges over the nonnegative integers is

$$P(Y = k) = \frac{\mu^k}{k!}e^{-\mu} \quad (k = 0, 1, 2, \ldots) \tag{2.14}$$

(Poisson distribution). One can check that μ coincides with the mean value of Y .
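Indeed, a one-line verification of the mean:
$$E[Y] = \sum_{k=0}^\infty k\,\frac{\mu^k}{k!}e^{-\mu} = \mu\sum_{k=1}^\infty\frac{\mu^{k-1}}{(k-1)!}e^{-\mu} = \mu e^{\mu}e^{-\mu} = \mu.$$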
Hereafter, we assume that μ depends on x ∈ R p and can be expressed by

$$\mu(x) = E[Y\mid X = x] = e^{\beta_0 + x\beta} \quad (x \in \mathbb{R}^p).$$

Given the observations $(x_1, y_1), \ldots, (x_N, y_N)$, the likelihood is
$$\prod_{i=1}^N\frac{\mu_i^{y_i}}{y_i!}e^{-\mu_i} \tag{2.15}$$

with $\mu_i := \mu(x_i) = e^{\beta_0 + x_i\beta}$. We consider applying Lasso when we estimate the parameters $\beta_0, \beta$: minimize the negated log-likelihood
$$L(\beta_0, \beta) := -\frac{1}{N}\sum_{i=1}^N\{y_i(\beta_0 + x_i\beta) - e^{\beta_0 + x_i\beta}\}$$

with the regularization term added:
$$L(\beta_0, \beta) + \lambda\|\beta\|_1. \tag{2.16}$$

Similar to logistic regression, if we write $\nabla L = -\frac{1}{N}X^Tu$ and $\nabla^2 L = X^TWX$, we have
$$u = \begin{bmatrix}y_1 - e^{\beta_0 + x_1\beta}\\ \vdots\\ y_N - e^{\beta_0 + x_N\beta}\end{bmatrix}$$
and
$$W = \begin{bmatrix}e^{\beta_0 + x_1\beta} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & e^{\beta_0 + x_N\beta}\end{bmatrix}.$$

For example, we may construct the procedure as follows:


1 poisson.lasso = function(X, y, lambda) {
2 beta = rnorm(p + 1); gamma = rnorm(p + 1)
3 while (sum((beta - gamma) ^ 2) > 0.0001) {
4 beta = gamma
5 s = as.vector(X %*% beta)
6 w = exp(s)
7 u = y - w
8 z = s + u / w
9 W = diag(w)
10 gamma = W.linear.lasso(X[, 2:(p + 1)], z, W, lambda)
11 print(gamma)
12 }

13 return(gamma)
14 }

Example 18 We examined sparse Poisson regression using artificial data.


1 n = 100; p = 3
2 beta = rnorm(p + 1)
3 X = matrix(rnorm(n * p), ncol = p); X = cbind(1, X)
4 s = as.vector(X %*% beta)
5 y = rpois(n, lambda = exp(s))
6 beta
7 poisson.lasso(X, y, 0.2)

The function poisson.lasso constructed in this book requires a long time to execute. For practical purposes, we might want to use the function glmnet(X, y, family = "poisson") [11].

Example 19 For the dataset birthwt in the R package MASS for the birth rate, we
executed Poisson regression. The meanings of the variables are listed in Table 2.1.
Because the first variable (whether the birth weight is over 2.5 kg) overlaps with the
last variable (the birth weight), we deleted it. We regarded the number of physician
visits during the first trimester as the response and executed regression based on
the other eight covariates (# samples: N = 189), where we chose as the λ value the

Table 2.1 The meanings of the variables in the birthwt dataset

Column  Variable name  Meaning
1  low  Indicator of birth weight less than 2.5 kg (0 = higher, 1 = lower)
2  age  Mother's age in years
3  lwt  Mother's weight in pounds at last menstrual period
4  race  Mother's race (1 = white, 2 = black, 3 = other)
5  smoke  Smoking status during pregnancy (0 = no, 1 = yes)
6  ptl  Number of previous premature labors
7  ht  History of hypertension (0 = no, 1 = yes)
8  ui  Presence of uterine irritability (0 = no, 1 = yes)
9  ftv  Number of physician visits during the first trimester
10  bwt  Birth weight in grams

optimum value based on the CV. As a result, we found that mother’s age (age),
mother’s weight (lwt), and mother’s hypertension (ht) are the main factors.
1 library(glmnet)
2 library(MASS)
3 data(birthwt)
4 df = birthwt[, -1]
5 dy = df[, 8]
6 dx = data.matrix(df[, -8])
7 cvfit = cv.glmnet(x = dx, y = dy, family = "poisson", standardize = TRUE)
8 coef(cvfit, s = "lambda.min")

1 9 x 1 sparse Matrix of class "dgCMatrix"


2 1
3 (Intercept) -1.180159594
4 age 0.030900888
5 lwt 0.001653569
6 race .
7 smoke .
8 ptl .
9 ht -0.007016192
10 ui .
11 bwt

2.5 Survival Analysis

In this section, we consider survival analysis. The problem setting is similar to the
ones we considered thus far: finding the relation between covariates and the response
from N tuples of the response and p covariate values. For survival analysis, we not
only assume that the response Y (survival time) takes positive values y but also allow
the response to take the form Y ≥ y. If an individual dies, the exact survival time is
obtained. However, if the survey is terminated before the time of death (the individual
changes hospitals, etc.), we regard the survival time to have exceeded the time. To
distinguish between the two cases, for the latter, we add the symbol +. The reason
we consider the latter case is to utilize the sample: even if the survey is terminated
before the time of death, some part of the data can be used to estimate the survival
analysis.
We estimate the covariate coefficients based on past data and predict the survival
time for a future individual. If we apply Lasso to the problem, we can determine
which covariates explain survival.

Example 20 From the kidney dataset, we wish to estimate the kidney survival
time. The meanings of the covariates are listed in Table 2.2. The symbol status
= 0 expresses survey termination before death and means that the response takes

Table 2.2 The meaning of the variables in the kidney dataset


Column Variable Meaning
1 Id Patient ID
2 Time Time
3 Status 0: survey termination, 1: death
4 Age Age
5 Sex Sex (male: 1, female: 2)
6 Disease 0: GN, 1: AN, 2: PKD, 3: Other
7 Frail Frailty estimate

a larger value than time. On the other hand, status = 1 expresses death and means that the response takes exactly the value of time. The rightmost four columns are the covariates.
1 library(survival)
2 data(kidney)
3 kidney

1 id time status age sex disease frail


2 1 1 8 1 28 1 Other 2.3
3 2 1 16 1 28 1 Other 2.3
4 3 2 23 1 48 2 GN 1.9
5 4 2 13 0 48 2 GN 1.9
6 ........................................

1 y = kidney$time
2 delta = kidney$status
3 Surv(y, delta)

1 [1] 8 16 23 13+ 22 28 447 318 30 12 24 245 7


2 [14] 9 511 30 53 196 15 154 7 333 141 8+ 96 38
3 [27] 149+ 70+ 536 25+ 17 4+ 185 177 292 114 22+ 159+ 15
4 [40] 108+ 152 562 402 24+ 13 66 39 46+ 12 40 113+ 201
5 [53] 132 156 34 30 2 25 130 26 27 58 5+ 43 152
6 [66] 30 190 5+ 119 8 54+ 16+ 6+ 78 63 8+

As marked in the output above, when the survey was terminated (so that the survival time is only known to be longer than the recorded value), the symbol + follows the time.

Now, we consider the theoretical framework of survival analysis (Cox model). By $P(t < Y < t + \delta \mid Y \ge t)$, we express the probability that the survival time lies between $t$ and $t + \delta$ ($t < Y < t + \delta$) given the event that the individual is alive at time $t$ ($Y \ge t$). By $S(t)$, we denote the probability (survival function) that the survival time $Y$ is at least $t$ (the individual is alive at time $t$). In survival analysis, we define the function by

Fig. 2.4 The Kaplan-Meier curve for each of the kidney diseases (Others, GN, AN, PKD). [Plot: Survival Rate versus Time]

$$h(t) := \lim_{\delta\to 0}\frac{P(t < Y < t + \delta\mid Y \ge t)}{\delta}$$
or, equivalently, by
$$h(t) = -\frac{S'(t)}{S(t)}$$
(hazard function); the equivalence follows because $P(t < Y < t + \delta\mid Y \ge t) = \{S(t) - S(t+\delta)\}/S(t)$, so dividing by $\delta$ and letting $\delta \to 0$ gives $-S'(t)/S(t)$. We express the function $h$ by the product of $h_0(t)$, which does not depend on the covariates $x \in \mathbb{R}^p$ (row vector), and $\exp(x\beta)$:
$$h(t) = h_0(t)\exp(x\beta),$$

where the values of β ∈ R p are unknown and need to be estimated. In particular, we


regard that the intercept β0 is contained in h 0 (t) and assume β0 = 0.
Given observations (x1 , y1 ), . . . , (x N , y N ) ∈ R p × R≥0 , we estimate β ∈ R p that
maximizes the likelihood of h(t), where the function h 0 (t) is not used because it does
not depend on β. If p is large, i.e., under sparse situations, similar to generalized
linear regression, we can apply the L1 regularization.
The main issue of survival analysis is to identify the covariates that determine the
survival time.

Example 21 From the kidney dataset in the R package survival, we depicted


the survival rates as time passes in Fig. 2.4 for the diseases AN, GN, and PKD. The
graph is stepwise because we made estimations from data. We show how to obtain
this graph later.
1 fit = survfit(Surv(time, status) ~ disease, data = kidney)
2 plot(fit, xlab = "time", ylab = "survival rate", col = c("red", "green", "blue", "black"))

3 legend(300, 0.8, legend = c("Other", "GN", "AN", "PKD"),


4 lty = 1, col = c("red", "green", "blue", "black"))

Note that the variable time in Example 21 takes survival times before death
and survey termination. If we remove the times before survey termination, then the
number of samples decreases for the estimation. However, if we follow the estimation
procedure below, we can utilize the data before survey termination.
If there is no survey termination, then we estimate $S(t)$ by
$$\hat S(t) := \frac{1}{N}\sum_{i=1}^N\delta(t_i > t) \tag{2.17}$$
from $N$ death times $t_1 \le t_2 \le \cdots \le t_{N-1} \le t_N$, where $\delta(t_i > t)$ is one for $t_i > t$ and zero otherwise. If $t_1 < \cdots < t_N$, the function is stepwise and decreases by $1/N$ at each $t = t_i$.
Next, we consider the general case in which a survey termination can occur. Let $d_i \ge 1$ ($i = 1, \ldots, k$) be the number of deaths at time $t_i$, and assume that $t_1 < \cdots < t_k$. Then, the total number of deaths is $D := \sum_{i=1}^k d_i \le N$, and $N - D$ survey terminations occur if we assume that the total number of samples is $N$. If we denote by $m_j$ the number of survey terminations in the interval $[t_j, t_{j+1})$, then the number $n_j$ of surviving individuals immediately before time $t_j$ can be expressed by
$$n_j := \sum_{i=j}^k(d_i + m_i).$$
We define the Kaplan-Meier estimate as
$$\hat S(t) = \begin{cases} 1, & t < t_1\\ \displaystyle\prod_{i=1}^{l}\frac{n_i - d_i}{n_i}, & t \ge t_1\end{cases} \tag{2.18}$$

for $t_l \le t < t_{l+1}$, $l = 1, \ldots, k$. Since $n_k = d_k + m_k$, if $m_k = 0$, then $\hat S(t) = 0$ for $t > t_k$. Otherwise, if $m_k > 0$, then $\hat S(t) > 0$ for $t > t_k$.
If there is no survey termination, then $n_j - d_j = n_{j+1}$ from $n_j = \sum_{i=j}^k d_i$, which means that (2.18) becomes
$$\hat S(t) = \frac{n_2}{n_1}\times\frac{n_3}{n_2}\times\cdots\times\frac{n_{l+1}}{n_l} = \frac{n_{l+1}}{n_1} = \frac{n_{l+1}}{N}$$
and coincides with (2.17).
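As a small illustration (a sketch with made-up times and censoring indicators; it assumes the survival package is available), (2.18) can be computed directly and compared with survfit:
1 ## Sketch: computing the Kaplan-Meier estimate (2.18) by hand on toy data
2 library(survival)
3 y = c(2, 3, 3, 5, 8, 8, 10, 12); delta = c(1, 1, 0, 1, 1, 0, 1, 0)  ## 1 = death, 0 = termination
4 km = function(t) {
5   s = 1
6   for (u in sort(unique(y[delta == 1]))) if (u <= t) {
7     n.u = sum(y >= u); d.u = sum(y == u & delta == 1)  ## at risk and deaths at u
8     s = s * (n.u - d.u) / n.u
9   }
10   return(s)
11 }
12 sapply(c(2, 5, 10), km)
13 summary(survfit(Surv(y, delta) ~ 1), times = c(2, 5, 10))$surv  ## should coincide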


Cox (1972) proposed maximizing the partial likelihood

 e xi β
 xjβ
(2.19)
i:δi =1 j∈Ri e

to estimate the parameter β, where xi ∈ R p and yi are covariates and time, respec-
tively, and δi = 1 and δi = 0 express death and survey termination, respectively, for
i = 1, . . . , N .
We define the risk set $R_i$ to be the set of $j$ such that $y_j \ge y_i$ and formulate the Lasso as follows [25]:
$$-\frac{1}{N}\sum_{i:\delta_i = 1}\log\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}} + \lambda\|\beta\|_1. \tag{2.20}$$
i:δi =1

To obtain the solution, we compute $u, W$ similarly to logistic regression and Poisson regression. In the following, we define
$$L := -\sum_{i:\delta_i = 1}\log\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}},$$
and define $C_j$ by: $\delta_i = 1,\ j \in R_i \iff i \in C_j$.
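For intuition (a small sketch with hypothetical sorted times), the index sets can be enumerated directly:
1 ## Sketch: risk sets R_i = {j: y_j >= y_i} and C_j = {i: delta_i = 1, j in R_i}
2 y = c(3, 5, 7, 9); delta = c(1, 0, 1, 1)  ## sorted times; 1 = death, 0 = termination
3 R = lapply(1:4, function(i) which(y >= y[i]))
4 C = lapply(1:4, function(j) which(delta == 1 & sapply(R, function(r) j %in% r)))
5 R; C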

Proposition 6 The once and twice differentials of $L$ are
$$\frac{\partial L}{\partial\beta_k} = -\sum_{i=1}^N x_{i,k}\left\{\delta_i - \sum_{j\in C_i}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}\right\}$$
$$\frac{\partial^2 L}{\partial\beta_k\partial\beta_l} = \sum_{i=1}^N\sum_{h=1}^N x_{i,k}x_{h,l}\sum_{j\in C_i}\frac{e^{x_i\beta}}{(\sum_{r\in R_j}e^{x_r\beta})^2}\Big\{I(i = h)\sum_{s\in R_j}e^{x_s\beta} - I(h\in R_j)e^{x_h\beta}\Big\}.$$
In particular, $L$ is convex.

For the proof, see the appendix.


Hence, $u = (u_i) \in \mathbb{R}^N$ and $W = (w_{i,h}) \in \mathbb{R}^{N\times N}$ such that $\nabla L = -X^Tu$ and $\nabla^2 L = X^TWX$ are given as follows:
$$u_i := \delta_i - \sum_{j\in C_i}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}$$
$$w_{i,h} := \sum_{j\in C_i}\frac{e^{x_i\beta}}{(\sum_{r\in R_j}e^{x_r\beta})^2}\Big\{I(i = h)\sum_{s\in R_j}e^{x_s\beta} - I(h\in R_j)e^{x_h\beta}\Big\},$$
where $x_1, \ldots, x_N \in \mathbb{R}^p$ are the rows of $X \in \mathbb{R}^{N\times p}$. In particular, $j \in C_i$ means $h \in R_j$ when $i = h$, and if we let
$$\pi_{i,j} := \frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}},$$
then the diagonal elements of $W$ are
$$w_i := \sum_{j\in C_i}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}\left(1 - \frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}\right) = \sum_{j\in C_i}\pi_{i,j}(1 - \pi_{i,j}).$$
Moreover, we can write $u_i = \delta_i - \sum_{j\in C_i}\pi_{i,j}$.
Based on the above discussion, we construct the function cox.lasso below. To
avoid a complicated computation, we approximate the off-diagonal elements of W
to be zero. Because the objective function is convex, there will be no problem with
convergence.
1 cox.lasso = function(X, y, delta, lambda = lambda) {
2 delta[1] = 1
3 n = length(y)
4 w = array(dim = n); u = array(dim = n)
5 pi = array(dim = c(n, n))
6 beta = rnorm(p); gamma = rep(0, p)
7 while (sum((beta - gamma) ^ 2) > 10 ^ {-4}) {
8 beta = gamma
9 s = as.vector(X %*% beta)
10 v = exp(s)
11 for (i in 1:n) {for (j in 1:n) pi[i, j] = v[i] / sum(v[j:n])}
12 for (i in 1:n) {
13 u[i] = delta[i]
14 w[i] = 0
15 for (j in 1:i) if (delta[j] == 1) {
16 u[i] = u[i] - pi[i, j]
17 w[i] = w[i] + pi[i, j] * (1 - pi[i, j])
18 }
19 }
20 z = s + u / w; W = diag(w)
21 print(gamma)
22 gamma = W.linear.lasso(X, z, W, lambda = lambda)[-1]
23 }
24 return(gamma)
25 }

Example 22 We apply the dataset kidney to the function cox.lasso. The pro-
cedure converges for all values of λ, and the estimates coincide with the ones com-
puted via glmnet[11].
1 df = kidney
2 index = order(df$time)
3 df = df[index, ]

4 n = nrow(df); p = 4
5 y = as.numeric(df[[2]])
6 delta = as.numeric(df[[3]])
7 X = as.numeric(df[[4]])
8 for (j in 5:7) X = cbind(X, as.numeric(df[[j]]))
9 z = Surv(y, delta)
10 cox.lasso(X, y, delta, 0)

1 [1] 0 0 0 0
2 [1] 0.0101287 -1.7747758 -0.3887608 1.3532378
3 [1] 0.01462571 -1.69299527 -0.41598742 1.38980788
4 [1] 0.01591941 -1.66769665 -0.42331475 1.40330234
5 [1] 0.01628935 -1.66060178 -0.42528537 1.40862969

1 cox.lasso(X, y, delta, 0.1)

1 [1] 0 0 0 0
2 [1] 0.00000000 -1.04944510 -0.08990115 1.00822550
3 [1] 0.00000000 -0.98175893 -0.06107446 0.97534148
4 [1] 0.00000000 -0.96078476 -0.05449001 0.96180929
5 [1] 0.00000000 -0.95475614 -0.05306296 0.95673222

1 cox.lasso(X, y, delta, 0.2)

1 [1] 0 0 0 0
2 [1] 0.0000000 -0.5366227 0.0000000 0.7234343
3 [1] 0.0000000 -0.5142360 0.0000000 0.6890634
4 [1] 0.0000000 -0.5099687 0.0000000 0.6800883

1 glmnet(X, z, family = "cox", lambda = 0.1)$beta

1 4 x 1 sparse Matrix of class "dgCMatrix"


2 s0
3 X .
4 -0.87359015
5 -0.05659599
6 0.92923820

Example 23 We downloaded the survival time data of patients with lymphoma, i.e., 1846-2568-2-SP.rda (Alizadeh, 2000) [1], from https://www.jstatsoft.org/rt/suppFileMetadata/v039i05/0/722 and executed the following procedure:
1 library(survival)
2 load("LymphomaData.rda"); attach("LymphomaData.rda")
3 names(patient.data); x = t(patient.data$x)
4 y = patient.data$time; delta = patient.data$status; Surv(y, delta)

Fig. 2.5 To examine how much the 28 genes with nonzero coefficients affect the survival time, we drew the Kaplan-Meier curves for the sample sets with either positive or negative z := Xβ̂ to find a significant difference in the survival times between them

The variables x, y, and delta store the expression data of p = 7,399 genes, the
times, and the death (= 1)/survey termination (= 0) for N = 240 samples. We exe-
cuted the following codes to obtain the λ that minimizes the CV. We found that only
28 coefficients of genes out of 7,399 are nonzero and concluded that the other genes
do not affect the survival time.
Moreover, for the β̂ in which only 28 elements are nonzero, we computed z = X β̂.
We decided that those variables are likely to determine the survival times and drew
the Kaplan-Meier curves for the samples with z i > 0 (1) and z i < 0 (-1) to compare
the survival times (Fig. 2.5).
1 library(ranger); library(ggplot2); library(dplyr); library(ggfortify)
2 cv.fit = cv.glmnet(x, Surv(y, delta), family = "cox")
3 fit2 = glmnet(x, Surv(y, delta), lambda = cv.fit$lambda.min, family = "cox")
4 z = sign(drop(x %*% fit2$beta))
5 fit3 = survfit(Surv(y, delta) ~ z)
6 autoplot(fit3)
7 mean(y[z == 1])
8 mean(y[z == -1])

Appendix: Proof of Propositions

Proposition 1 The differential of $L$ is given by the column vector
$$\frac{\partial L}{\partial\beta_{j,k}} = -\frac{1}{N}\sum_{i=1}^N x_{i,j}\{I(y_i = k) - \pi_{k,i}\},$$
where
$$\pi_{k,i} := \frac{\exp(\beta_{0,k} + x_i\beta^{(k)})}{\sum_{l=1}^K\exp(\beta_{0,l} + x_i\beta^{(l)})}.$$

Proof Differentiating
$$L := -\frac{1}{N}\sum_{i=1}^N\sum_{h=1}^K I(y_i = h)\log\frac{\exp\{\beta_{0,h} + x_i\beta^{(h)}\}}{\sum_{l=1}^K\exp\{\beta_{0,l} + x_i\beta^{(l)}\}} = -\frac{1}{N}\sum_{i=1}^N\sum_{h=1}^K I(y_i = h)\left[(\beta_{0,h} + x_i\beta^{(h)}) - \log\sum_{l=1}^K\exp(\beta_{0,l} + x_i\beta^{(l)})\right]$$
by $\beta_{j,k}$ ($j = 1, \ldots, p$, $k = 1, \ldots, K$), we obtain
$$\frac{\partial L}{\partial\beta_{j,k}} = -\frac{1}{N}\sum_{i=1}^N\sum_{h=1}^K x_{i,j}I(y_i = h)\left[I(h = k) - \frac{\exp\{\beta_{0,k} + x_i\beta^{(k)}\}}{\sum_{l=1}^K\exp\{\beta_{0,l} + x_i\beta^{(l)}\}}\right] = -\frac{1}{N}\sum_{i=1}^N x_{i,j}\left[I(y_i = k) - \frac{\exp\{\beta_{0,k} + x_i\beta^{(k)}\}}{\sum_{l=1}^K\exp\{\beta_{0,l} + x_i\beta^{(l)}\}}\right] = -\frac{1}{N}\sum_{i=1}^N x_{i,j}\{I(y_i = k) - \pi_{k,i}\}. \quad\square$$

Proposition 2 The twice differential of $L$ is given by the matrix
$$\frac{\partial^2 L}{\partial\beta_{j,k}\partial\beta_{j',k'}} = \frac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,j'}w_{i,k,k'},$$
where $w_{i,k,k'}$ is defined as
$$w_{i,k,k'} := \begin{cases}\pi_{i,k}(1 - \pi_{i,k}), & k = k'\\ -\pi_{i,k}\pi_{i,k'}, & k \neq k'.\end{cases} \tag{2.12}$$

Proof For $k' = k$, differentiating by $\beta_{j',k}$ ($j' = 1, \ldots, p$), we obtain
$$\frac{\partial^2 L}{\partial\beta_{j,k}\partial\beta_{j',k}} = \frac{\partial}{\partial\beta_{j',k}}\left[-\frac{1}{N}\sum_{i=1}^N x_{i,j}\{I(y_i = k) - \pi_{k,i}\}\right] = \frac{1}{N}\sum_{i=1}^N x_{i,j}\frac{\partial\pi_{k,i}}{\partial\beta_{j',k}}$$
$$= \frac{1}{N}\sum_{i=1}^N x_{i,j}\frac{x_{i,j'}\exp(\beta_{0,k} + x_i\beta^{(k)})\{\sum_{l=1}^K\exp(\beta_{0,l} + x_i\beta^{(l)})\} - x_{i,j'}\{\exp(\beta_{0,k} + x_i\beta^{(k)})\}^2}{\{\sum_{l=1}^K\exp(\beta_{0,l} + x_i\beta^{(l)})\}^2} = \frac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,j'}\pi_{i,k}(1 - \pi_{i,k}).$$
For $k' \neq k$, differentiating by $\beta_{j',k'}$ ($j' = 1, \ldots, p$), we obtain
$$\frac{\partial^2 L}{\partial\beta_{j,k}\partial\beta_{j',k'}} = \frac{1}{N}\sum_{i=1}^N x_{i,j}\frac{\partial\pi_{k,i}}{\partial\beta_{j',k'}} = -\frac{1}{N}\sum_{i=1}^N x_{i,j}\frac{\exp(\beta_{0,k} + x_i\beta^{(k)})\,x_{i,j'}\exp(\beta_{0,k'} + x_i\beta^{(k')})}{\{\sum_{l=1}^K\exp(\beta_{0,l} + x_i\beta^{(l)})\}^2} = -\frac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,j'}\pi_{i,k}\pi_{i,k'}. \quad\square$$

Proposition 3 (Gershgorin) Let $A = (a_{i,j}) \in \mathbb{R}^{n\times n}$ be a symmetric matrix. If $a_{i,i} \ge \sum_{j\neq i}|a_{i,j}|$ for all $i = 1, \ldots, n$, then $A$ is nonnegative definite.

Proof For any eigenvalue $\lambda$ of the square matrix $A = (a_{j,k}) \in \mathbb{R}^{n\times n}$, there exist an eigenvector $x = [x_1, \ldots, x_n]^T$ and $1 \le i \le n$ such that $x_i = 1$ and $|x_j| \le 1$, $j \neq i$. Thus, from
$$a_{i,i} + \sum_{j\neq i}a_{i,j}x_j = \sum_{j=1}^n a_{i,j}x_j = \lambda x_i = \lambda,$$
we have
$$|\lambda - a_{i,i}| = \Big|\sum_{j\neq i}a_{i,j}x_j\Big| \le \sum_{j\neq i}|a_{i,j}|$$
and
$$a_{i,i} - \sum_{j\neq i}|a_{i,j}| \le \lambda \le a_{i,i} + \sum_{j\neq i}|a_{i,j}|,$$
at least for one $1 \le i \le n$, which means that if $a_{i,i} \ge \sum_{j\neq i}|a_{i,j}|$ for $i = 1, \ldots, n$, then all the eigenvalues are nonnegative, and the matrix $A$ is nonnegative definite. $\square$

Proposition 5 For an ordered sequence $a_1 \le \cdots \le a_n$ of finite length, $x = a_{m+1}$ and $x = (a_m + a_{m+1})/2$ minimize $f(x) = \sum_{i=1}^n|x - a_i|$ when $n = 2m+1$ and $n = 2m$, respectively. Thus, the median of $a_1, \ldots, a_n$ minimizes the function $f$.

Proof If $n = 2m+1$ and $a_{i-1} < a_i = \cdots = a_{m+1} = \cdots = a_j < a_{j+1}$, $i \le m+1 \le j$, then the subderivative of $f$ at $a_{m+1}$ is
$$(j - i + 1)[-1, 1] - (i - 1) + (n - j) = [n - 2j, n - 2i + 2] = [2m - 2j + 1, 2m - 2i + 3] \supseteq [-1, 1]$$
and contains zero. If $n = 2m$ and $a_m < a_{m+1}$, then the subderivative of $f$ at $(a_m + a_{m+1})/2$ is zero. Finally, if $n = 2m$ and $a_{i-1} < a_i = \cdots = a_m = a_{m+1} = \cdots = a_j < a_{j+1}$, $i \le m$, $m+1 \le j$, then the subderivative of $f$ at $(a_m + a_{m+1})/2$ is
$$(j - i + 1)[-1, 1] - (i - 1) + (n - j) = [n - 2j, n - 2i + 2] = [2m - 2j, 2m - 2i + 2] \supseteq [-2, 2]$$
and contains zero. Thus, for any case, $x$ minimizes $f(x)$. $\square$


Proposition 6 The once and twice differentials of $L$ are
$$\frac{\partial L}{\partial\beta_k} = -\sum_{i=1}^N x_{i,k}\left\{\delta_i - \sum_{j\in C_i}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}\right\}$$
$$\frac{\partial^2 L}{\partial\beta_k\partial\beta_l} = \sum_{i=1}^N\sum_{h=1}^N x_{i,k}x_{h,l}\sum_{j\in C_i}\frac{e^{x_i\beta}}{(\sum_{r\in R_j}e^{x_r\beta})^2}\Big\{I(i = h)\sum_{s\in R_j}e^{x_s\beta} - I(h\in R_j)e^{x_h\beta}\Big\}.$$
In particular, $L$ is convex.

Proof Since, for $S_i = \sum_{h\in R_i}e^{x_h\beta}$, we have
$$\sum_{i:\delta_i=1}\frac{\sum_{j\in R_i}x_{j,k}e^{x_j\beta}}{S_i} = \sum_{j=1}^N\sum_{i\in C_j}\frac{x_{j,k}e^{x_j\beta}}{S_i} = \sum_{i=1}^N x_{i,k}\sum_{j\in C_i}\frac{e^{x_i\beta}}{S_j},$$
each element of $\nabla L = -X^Tu$ is derived as follows:
$$\frac{\partial L}{\partial\beta_k} = -\sum_{i:\delta_i=1}\left\{x_{i,k} - \frac{\sum_{j\in R_i}x_{j,k}e^{x_j\beta}}{\sum_{h\in R_i}e^{x_h\beta}}\right\} = -\sum_{i=1}^N x_{i,k}\left\{\delta_i - \sum_{j\in C_i}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}\right\}.$$
On the other hand, each element of $\nabla^2 L = X^TWX$ is derived as follows:
$$\frac{\partial^2 L}{\partial\beta_k\partial\beta_l} = \sum_{i=1}^N x_{i,k}\sum_{j\in C_i}\frac{\partial}{\partial\beta_l}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}$$
$$= \sum_{i=1}^N x_{i,k}\sum_{j\in C_i}\frac{1}{(\sum_{r\in R_j}e^{x_r\beta})^2}\Big\{x_{i,l}e^{x_i\beta}\sum_{s\in R_j}e^{x_s\beta} - e^{x_i\beta}\sum_{h\in R_j}x_{h,l}e^{x_h\beta}\Big\}$$
$$= \sum_{i=1}^N\sum_{h=1}^N x_{i,k}x_{h,l}\sum_{j\in C_i}\frac{e^{x_i\beta}}{(\sum_{r\in R_j}e^{x_r\beta})^2}\Big\{I(i = h)\sum_{s\in R_j}e^{x_s\beta} - I(h\in R_j)e^{x_h\beta}\Big\}.$$
Hence, we have
$$\frac{\partial^2 L}{\partial\beta_k\partial\beta_l} = \sum_{i=1}^N\sum_{h=1}^N x_{i,k}x_{h,l}w_{i,h},$$
where
$$w_{i,h} = \begin{cases}\displaystyle\sum_{j\in C_i}\frac{e^{x_i\beta}}{(\sum_{r\in R_j}e^{x_r\beta})^2}\Big\{\sum_{s\in R_j}e^{x_s\beta} - I(i\in R_j)e^{x_i\beta}\Big\}, & h = i\\[3mm] -\displaystyle\sum_{j\in C_i}\frac{e^{x_i\beta}}{(\sum_{r\in R_j}e^{x_r\beta})^2}I(h\in R_j)e^{x_h\beta}, & h \neq i.\end{cases}$$
Moreover, since $i \in R_j$ for $j \in C_i$, such $W = (w_{i,h})$ satisfies $w_{i,i} = \sum_{h\neq i}|w_{i,h}|$ for each $i = 1, \ldots, N$. From Proposition 3, $W$ is nonnegative definite. Thus, for arbitrary $[z_1, \ldots, z_p] \in \mathbb{R}^p$, we have
$$\sum_{k=1}^p\sum_{l=1}^p\sum_{i=1}^N\sum_{h=1}^N z_kz_l x_{i,k}x_{h,l}w_{i,h} \ge 0,$$
so that $L$ is convex (its Hessian is nonnegative definite). $\square$

Exercises 21–34

21. Let Y be a random variable that takes values in {0, 1}. We assume that there exist
β0 ∈ R and β ∈ R p such that the conditional probability P(Y = 1 | x) given for
each x ∈ R p can be expressed by

P(Y = 1 | x)
log = β0 + xβ (cf. (2.2)) (2.21)
P(Y = 0 | x)

(logistic regression).

(a) Prove that (2.21) can be expressed by

exp(β0 + xβ)
P(Y = 1 | x) = (cf. (2.3)). (2.22)
1 + exp(β0 + xβ)

(b) Let p = 1 and β0 = 0. The following procedure draws the curves by com-
puting the right-hand side of (2.22) for various β ∈ R. Fill in the blanks and
execute the procedure for β = 0. How do the shapes evolve as β grows?
1 f = function(x) return(exp(beta.0 + beta * x) / (1 + exp(beta
.0 + beta * x)))
2 beta.0 = 0; beta.seq = c(0, 0.2, 0.5, 1, 2, 10)
3 m = length(beta.seq)
4 beta = beta.seq[1]
5 plot(f, xlim = c(-10, 10), ylim = c(0, 1), xlab = "x", ylab =
"y",
6 col = 1, main = "Logistic Curve")
7 for (i in 2:m) {
8 beta = beta.seq[i]
9 par(new = TRUE)
10 plot(f, xlim = c(-10, 10), ylim = c(0, 1), xlab = "", ylab =
"", axes = FALSE, col = i)
11 }
12 legend("topleft", legend = beta.seq, col = 1:length(beta.seq),
lwd = 2, cex = .8)
13 par(new = FALSE)

(c) For logistic regression (2.21), when the realizations of $x\in\mathbb{R}^p$ and $Y\in\{0,1\}$ are $(x_i,y_i)$ ($i=1,\ldots,N$), the likelihood is given by $\displaystyle\prod_{i=1}^N\frac{e^{y_i(\beta_0+x_i\beta)}}{1+e^{\beta_0+x_i\beta}}$. In its Lasso evaluation

$$-\frac{1}{N}\sum_{i=1}^N\big[y_i(\beta_0+x_i\beta)-\log(1+e^{\beta_0+x_i\beta})\big]+\lambda\|\beta\|_1\,, \qquad (2.23)$$

$Y$ takes values in $\{0,1\}$. However, an alternative expression as in (2.24) is often used, in which $Y$ takes values in $\{-1,1\}$:

$$\frac{1}{N}\sum_{i=1}^N\log\big(1+\exp\{-y_i(\beta_0+x_i\beta)\}\big)+\lambda\|\beta\|_1 \quad (\text{cf. }(2.8))\,. \qquad (2.24)$$

Show that if we replace $y_i=0$ with $y_i=-1$, then (2.23) is equivalent to (2.24).
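As a sanity check (our addition), the equivalence can be verified numerically in R; the data below are randomly generated and only the unpenalized first terms of (2.23) and (2.24) are compared.
1 ## Numerical check that (2.23) with y in {0,1} equals (2.24) with y in {-1,1}
2 N = 10; p = 2; beta.0 = 0.5; beta = c(1, -1)
3 x = matrix(rnorm(N * p), ncol = p); y01 = rbinom(N, 1, 0.5); ypm = 2 * y01 - 1
4 s = beta.0 + as.vector(x %*% beta)
5 loss.23 = -mean(y01 * s - log(1 + exp(s)))        # first term of (2.23)
6 loss.24 = mean(log(1 + exp(-ypm * s)))            # first term of (2.24)
7 all.equal(loss.23, loss.24)                       # TRUE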

Hereafter, we denote by X ∈ R N ×( p+1) the matrix such that the (i, j)-th element is
xi, j for j = 1, . . . , p and the leftmost column (the 0th column) is a vector consisting
of N ones, and let xi = [xi,1 , . . . , xi, p ]. We assume that the random variable Y takes
values in {−1, 1}.


22. For $L(\beta_0,\beta) := \sum_{i=1}^N\log\{1+\exp(-y_i(\beta_0+x_i\beta))\}$, show the following, where $v_i := \exp\{-y_i(\beta_0+x_i\beta)\}$.
(a) The vector $\nabla L$ whose $j$-th element is $\dfrac{\partial L}{\partial\beta_j}$ ($j=0,1,\ldots,p$) can be expressed by $\nabla L = -X^T u$, where
$$u = \Big[\frac{y_1 v_1}{1+v_1},\ \ldots,\ \frac{y_N v_N}{1+v_N}\Big]^T.$$
(b) The matrix $\nabla^2 L$ whose $(j,k)$-th element is $\dfrac{\partial^2 L}{\partial\beta_j\partial\beta_k}$ ($j,k=0,1,\ldots,p$) can be expressed by $\nabla^2 L = X^T W X$, where $W$ is the diagonal matrix with diagonal elements
$$\frac{v_i}{(1+v_i)^2}\quad (i=1,\ldots,N).$$

Moreover, setting λ = 0, we construct the procedure that estimates the logistic


regression coefficients as follows. First, we give an initial value of (β0 , β) and
update the value via the Newton method until it converges. Execute the following,
and examine the convergence. In particular, observe that the procedure tends to
diverge as we make p larger.
1 ## Data Generation
2 N = 1000; p = 2; X = matrix(rnorm(N * p), ncol = p); X = cbind(rep
(1, N), X)
3 beta = rnorm(p + 1); y = array(N); s = as.vector(X %*% beta); prob
= 1 / (1 + exp(s))
4 for (i in 1:N) {if (runif(1) > prob[i]) y[i] = 1 else y[i] = -1}
5 beta
6 ## Maximum Likelihood Computation
7 beta = Inf; gamma = rnorm(p + 1)
8 while (sum((beta - gamma) ^ 2) > 0.001) {
9 beta = gamma
10 s = as.vector(X %*% beta)
11 v = exp(-s * y)
12 u = y * v / (1 + v)
13 w = v / (1 + v) ^ 2
14 z = s + u / w
15 W = diag(w)
16 gamma = as.vector(solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% z)
17 print(gamma)
18 }

23. Show that the $(\beta_0,\beta)$ that maximizes the likelihood diverges when $\lambda = 0$ in each of the following cases.
(a) $N < p$, and the rank of $X$ is $N$.
Hint In Exercise 22, if we multiply $\nabla L = -X^T u = 0$ by $-X$ from the left, it becomes $XX^T u = 0$. Since $XX^T$ is invertible, unless $u = 0$, the stationary condition $XX^T u = 0$ cannot be satisfied. Moreover, $u$ cannot be zero for finite $(\beta_0,\beta)$.
(b) There exist $(\beta_0,\beta)$ such that $y_i(\beta_0+x_i\beta) > 0$ for all $i=1,\ldots,N$.
Hint If such $(\beta_0,\beta)$ exist, then $(2\beta_0, 2\beta)$ makes $L$ smaller.
24. The following procedure analyzes the dataset breastcancer, which consists
of 1,000 covariate variables (the expression data of 1,000 genes) and one binary
response (case/control) for 256 samples (58 cases and 192 control). Identify the
relevant genes that affect breast cancer.
1 df = read.csv("breastcancer.csv")
2 x = as.matrix(df[, 1:1000])
3 y = as.vector(df[, 1001])
4 cv = cv.glmnet(x, y, family = "binomial")
5 cv2 = cv.glmnet(x, y, family = "binomial", type.measure = "class")
6 par(mfrow = c(1, 2))
7 plot(cv)
8 plot(cv2)
9 par(mfrow = c(1, 1))

The function cv.glmnet evaluates the test data based on the binomial deviance

$$-\frac{1}{N}\sum_{i=1}^N\log P(Y=y_i\mid X=x_i)$$

if we do not specify anything, and based on the error rate if the option type.measure = "class" is given.
Fill in the optimal λ based on the CV, complete the following code, and find the genes with nonzero coefficients.
1 glm = glmnet(x, y, lambda = ## Blank ##, family = "binomial")

Hint Use beta = drop(glm$beta) rather than beta = glm$beta.


Then, it will be easier to construct a code similar to beta[beta != 0].
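One possible completion (our sketch, not the book's official answer) uses the CV-optimal λ from the cv object computed above and then lists the genes with nonzero coefficients.
1 ## One way to fill the blank: use the lambda minimizing the CV deviance
2 glm = glmnet(x, y, lambda = cv$lambda.min, family = "binomial")
3 beta = drop(glm$beta)       # drop() turns the sparse column into a named vector
4 beta[beta != 0]             # genes (columns of x) with nonzero coefficients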
25. Logistic regression can be extended as follows if the response takes $K\ge 2$ values rather than binary values.

$$P(Y=k\mid x) = \frac{e^{\beta_{0,k}+x\beta^{(k)}}}{\sum_{l=1}^K e^{\beta_{0,l}+x\beta^{(l)}}}\quad(k=1,\ldots,K). \quad(\text{cf. }(2.11)) \qquad (2.25)$$

(a) Show that (2.25) remains correct even if we subtract $\gamma_0+x\gamma$ from the exponents in the numerator and denominator for any $\gamma_0\in\mathbb{R}$, $\gamma\in\mathbb{R}^p$.
(b) The second term of (2.24) can be extended as $\lambda\sum_{k=1}^K\|\beta^{(k)}\|_1 = \lambda\sum_{k=1}^K\sum_{j=1}^p|\beta_{j,k}|$. The first term does not change even if we replace $\beta_{0,k}+x\beta^{(k)}$ by $\beta_{0,k}-\gamma_0+x(\beta^{(k)}-\gamma)$, but the second term changes. One can check that for $\gamma=(\gamma_1,\ldots,\gamma_p)$, the $\gamma_j$ value that minimizes $\sum_{k=1}^K|\beta_{j,k}-\gamma_j|$ is the median of $\beta_{j,1},\ldots,\beta_{j,K}$. How can we obtain the minimum Lasso evaluation from the original $\beta_{j,1},\ldots,\beta_{j,K}$ ($j=1,\ldots,p$)?
(c) In the glmnet package, to maintain the uniqueness of the $\beta_{0,k}$ values, they are set so that $\sum_{k=1}^K\beta_{0,k}=0$. After obtaining the original $\beta_{0,1},\ldots,\beta_{0,K}$, what computation should be performed to obtain the unique $\beta_{0,1},\ldots,\beta_{0,K}$?
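The following R fragment sketches the two adjustments (our illustration, under the assumption that beta is a hypothetical p × K matrix of original coefficients and beta.0 a length-K vector of original intercepts): subtracting the row-wise median addresses (b), and subtracting the mean enforces the sum-to-zero constraint of (c).
1 ## Hypothetical beta (p x K) and beta.0 (length K), for illustration only
2 p = 4; K = 3
3 beta = matrix(rnorm(p * K), p, K); beta.0 = rnorm(K)
4 ## (b) subtract the median of beta[j, 1], ..., beta[j, K] for each j
5 beta.adj = beta - apply(beta, 1, median)
6 ## (c) subtract the mean so that the K intercepts sum to zero
7 beta.0.adj = beta.0 - mean(beta.0)
8 sum(beta.0.adj)   # 0 (up to rounding)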
26. Execute the following procedure for the Iris dataset with n = 150 and p = 4.
Output the two graphs via cv.glmnet, and find the optimum λ in the sense of
CV. Moreover, find the β0, β for that λ, and evaluate the fitted model on the 150 samples.
1 library(glmnet)
2 df = read.table("iris.txt", sep = ",")
3 x = as.matrix(df[, 1:4])
4 y = as.vector(df[, 5])
5 y = as.numeric(y == "Iris-setosa")
6 cv = cv.glmnet(x, y, family = "binomial")
7 cv2 = cv.glmnet(x, y, family = "binomial", type.measure = "class")
8 par(mfrow = c(1, 2))
9 plot(cv)
10 plot(cv2)
11 par(mfrow = c(1, 1))
12 lambda = cv$lambda.min
13 result = glmnet(x, y, lambda = lambda, family = "binomial")
14 beta = result$beta
15 beta.0 = result$a0
16 f = function(x) return(exp(beta.0 + x %*% beta))
17 z = array(dim = 150)
18 for (i in 1:150) z[i] = drop(f(x[i, ]))
19 yy = (z > 1)
20 sum(yy == y)

We evaluate the correct rate, not for K = 2 but for K = 3. Fill in the blanks and
execute it.
1 library(glmnet)
2 df = read.table("iris.txt", sep = ",")
3 x = as.matrix(df[, 1:4]); y = as.vector(df[, 5]); n = length(y); u
= array(dim = n)
4 for (i in 1:n) if (y[i] == "Iris-setosa") u[i] = 1 else
5 if (y[i] == "Iris-versicolor") u[i] = 2 else u[i] = 3
6 u = as.numeric(u)
7 cv = cv.glmnet(x, u, family = "multinomial")
8 cv2 = cv.glmnet(x, u, family = "multinomial", type.measure = "
class")
9 par(mfrow = c(1, 2)); plot(cv); plot(cv2); par(mfrow = c(1, 1))
10 lambda = cv$lambda.min; result = glmnet(x, y, lambda = lambda,
family = "multinomial")
11 beta = result$beta; beta.0 = result$a0
12 v = array(dim = n)
13 for (i in 1:n) {
14 max.value = -Inf
15 for (j in 1:3) {

16 value = ## Blank ##
17 if (value > max.value) {v[i] = j; max.value = value}
18 }
19 }
20 sum(u == v)

Hint Each beta and beta.0 is a list of length 3, and the former is a vector that
stores the coefficient values.
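A possible completion of the blank (ours), assuming beta[[j]] holds the coefficient vector for class j and beta.0 behaves as a length-3 vector of intercepts (use beta.0[[j]] if it is a list): compute the linear predictor for class j.
1 ## One way to fill the blank: the linear predictor of class j for sample i
2 value = beta.0[j] + as.numeric(x[i, ] %*% beta[[j]])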
27. If the response takes $K\ge 2$ values rather than two, the probability associated with the logistic curve is generalized as follows:

$$P(Y=k\mid x) = \frac{e^{\beta_{0,k}+x\beta^{(k)}}}{\sum_{l=1}^K e^{\beta_{0,l}+x\beta^{(l)}}}\quad(k=1,\ldots,K). \quad(\text{cf. }(2.11))$$

First, given observations $(x_1,y_1),\ldots,(x_N,y_N)\in\mathbb{R}^p\times\{1,\ldots,K\}$, we can compute the negated log-likelihood

$$L := -\frac{1}{N}\sum_{i=1}^N\sum_{h=1}^K I(y_i=h)\log\frac{\exp\{\beta_{0,h}+x_i\beta^{(h)}\}}{\sum_{l=1}^K\exp\{\beta_{0,l}+x_i\beta^{(l)}\}}\,.$$

Using the fact that the second derivative of $L$ is

$$\frac{\partial^2 L}{\partial\beta_{j,k}\partial\beta_{j',k'}} = \begin{cases}\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,j}x_{i,j'}\pi_{i,k}(1-\pi_{i,k}), & k'=k\\[2ex] -\dfrac{1}{N}\displaystyle\sum_{i=1}^N x_{i,j}x_{i,j'}\pi_{i,k}\pi_{i,k'}, & k'\neq k\end{cases}\,,$$

we write it as $\dfrac{1}{N}\sum_{i=1}^N x_{i,j}x_{i,j'}w_{i,k,k'}$. Show that the matrix $W_i=(w_{i,k,k'})\in\mathbb{R}^{K\times K}$ for each $i=1,2,\ldots,N$ is nonnegative definite.
28. We assume that the parameter $\mu := E[Y] > 0$ of the Poisson distribution

$$P(Y=k) = \frac{\mu^k}{k!}e^{-\mu}\quad(k=0,1,2,\ldots) \quad(\text{cf. }(2.14))$$

can be expressed by

$$\mu(x) = E[Y\mid X=x] = e^{\beta_0+x\beta}\quad(x\in\mathbb{R}^p)\,.$$

Then, given observations $(x_1,y_1),\ldots,(x_N,y_N)$, the likelihood is

$$\prod_{i=1}^N\frac{\mu_i^{y_i}}{y_i!}e^{-\mu_i} \quad(\text{cf. }(2.15)) \qquad (2.26)$$

with $\mu_i := \mu(x_i) = e^{\beta_0+x_i\beta}$. To obtain $\beta_0,\beta$, we minimize

$$\frac{1}{N}L(\beta_0,\beta) + \lambda\|\beta\|_1 \quad(\text{cf. }(2.16)) \qquad (2.27)$$

with $L(\beta_0,\beta) := -\sum_{i=1}^N\{y_i(\beta_0+x_i\beta) - e^{\beta_0+x_i\beta}\}$ if we apply Lasso.
(a) How can we derive (2.27) from (2.26)?
(b) If we write $\nabla L = -X^T u$, show that
$$u = \big[y_1 - e^{\beta_0+x_1\beta},\ \ldots,\ y_N - e^{\beta_0+x_N\beta}\big]^T\,.$$
(c) If we write $\nabla^2 L = X^T W X$, show that $W$ is the diagonal matrix with diagonal elements $e^{\beta_0+x_i\beta}$ ($i=1,\ldots,N$).

We execute Poisson regression for λ ≥ 0. Fill in the blank and execute the pro-
cedure.
1 ## Data Generation
2 N = 1000
3 p = 7
4 beta = rnorm(p + 1)
5 X = matrix(rnorm(N * p), ncol = p)
6 X = cbind(rep(1, N), X)
7 s = X %*% beta
8 y = rpois(N, lambda = exp(s))
9 beta
10 ## Computation of the ML estimates
11 lambda = 100
12 beta = Inf
13 gamma = rnorm(p + 1)
14 while (sum((beta - gamma) ^ 2) > 0.01) {
15 beta = gamma
16 s = as.vector(X %*% beta)
17 w = ## Blank (1) ##
18 u = ## Blank (2) ##
19 z = ## Blank (3) ##
20 W = diag(w)
21 gamma = # Blank (4) #
22 print(gamma)
23 }
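One consistent way to fill blanks (1)–(4) (our sketch, mirroring the Newton update used for logistic regression above and the u, W derived in (b), (c); these particular blanks do not use λ, so they give the unpenalized step):
1 ## Possible completion of blanks (1)-(4); W = diag(w) is already in the listing
2 w = exp(s)          # (1) diagonal of W: e^{beta_0 + x_i beta}
3 u = y - exp(s)      # (2) score: y_i - e^{beta_0 + x_i beta}
4 z = s + u / w       # (3) working response for the Newton step
5 gamma = as.vector(solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% z)  # (4) weighted LS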

In the following, we consider survival analysis, particularly for the Cox model.
Using the random variables T, C ≥ 0 that express death and survey termination, we
define Y := min{T, C}. For t ≥ 0, let S(t) be the probability of the event T > t:

$$h(t) := -\frac{S'(t)}{S(t)}$$

or equivalently,

$$h(t) := \lim_{\delta\to 0}\frac{P(t<T<t+\delta\mid T\ge t)}{\delta}\,.$$

A Cox model expresses the hazard as the product of a baseline hazard $h_0(t)$, which does not depend on the covariates, and a factor that depends only on the covariate $x\in\mathbb{R}^p$:

$$h(t) = h_0(t)\exp(x\beta)\,.$$

In particular, we regard the constant $\beta_0$ as contained in $h_0(t)$ and assume $\beta_0=0$. Given $(x_1,y_1),\ldots,(x_N,y_N)\in\mathbb{R}^p\times\mathbb{R}_{\ge 0}$, we estimate the $\beta\in\mathbb{R}^p$ that maximizes the likelihood of $h(t)$. However, $h_0$ is not needed to compute the estimate because it cancels in the partial likelihood. When $p$ is large, i.e., under sparse situations, L1 regularization is considered similarly to the other regression models.
29. We estimate the survival time of kidney patients from the kidney dataset.

Column Variable Meaning


1 id patient ID
2 time time
3 status 0: survey termination, 1: death
4 age age
5 sex sex (male: 1, female: 2)
6 disease 0: GN, 1: AN, 2: PKD, 3: Other
7 frail frailty estimate

We first execute the following code:


1 library(survival)
2 data(kidney)
3 names(kidney)
4 y = kidney$time
5 delta = kidney$status
6 Surv(y, delta)

(a) What procedure does the function Surv execute?


(b) We draw the survival time curves for each kidney disease. Replace the curves
for diseases with those for sex, and add labels and a legend.
1 fit = survfit(Surv(time, status) ~ disease, data = kidney)
2 plot(fit, xlab = "Time", ylab = "Error Rate", col = c("red", "
green", "blue", "black"))
3 legend(300, 0.8, legend = c("Others", "GN", "AN", "PKD"),
4 lty = 1, col = c("red", "green", "blue", "black"))
5 ## Execute the below as well.
6 library(ranger); library(ggplot2); library(dplyr); library(
ggfortify); autoplot(fit)

30. The variable time in Exercise 29 takes both the survival time and the time before survey termination into account. Let $t_1<t_2<\cdots<t_k$ ($k\le N$) be the ordered survival times (excluding the survey terminations). If there are $d_i$ deaths at time $t_i$ for $i=1,\ldots,k$, then the total number of deaths is $D=\sum_{j=1}^k d_j$. If no survey termination occurs for the $N$ samples, we have $D=N$. If we denote by $m_j$ ($j=1,\ldots,k$) the number of survey terminations in the interval $[t_j,t_{j+1})$, then the number (the size of the risk set) of survivors immediately before $t_j$ is

$$n_j = \sum_{i=j}^k(d_i+m_i)\quad(j=1,\ldots,k)\,.$$

Then, the probability $S(t)$ of the survival time $T$ being larger than $t$ can be estimated as (Kaplan–Meier estimate): for $t_l\le t<t_{l+1}$,

$$\hat S(t) = \begin{cases}1, & t<t_1\\ \displaystyle\prod_{i=1}^l\frac{n_i-d_i}{n_i}, & t\ge t_1\end{cases}\,. \quad(\text{cf. }(2.18))$$

What do the estimates become when no survey termination occurs during $t_l\le t<t_{l+1}$, $l=1,\ldots,k$?
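To make the definition concrete, here is a small hand-rolled computation of Ŝ(t) (our sketch), compared against survfit from the survival package on the kidney data introduced in Exercise 29.
1 ## A hand-rolled Kaplan-Meier estimate (sketch), using y and delta as in Exercise 29
2 library(survival)
3 data(kidney)
4 y = kidney$time; delta = kidney$status
5 t.death = sort(unique(y[delta == 1]))                     # t_1 < ... < t_k
6 n.risk = sapply(t.death, function(t) sum(y >= t))         # risk-set sizes n_j
7 d = sapply(t.death, function(t) sum(y == t & delta == 1)) # deaths d_j
8 S.hat = cumprod((n.risk - d) / n.risk)                    # product over i <= l
9 fit = survfit(Surv(y, delta) ~ 1)                         # reference computation
10 head(cbind(t.death, S.hat))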
31. Cox (1972) proposed maximizing the partial likelihood function

$$\prod_{i:\delta_i=1}\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}} \quad(\text{cf. }(2.19))$$

for estimating the parameter $\beta$, where $x_i\in\mathbb{R}^p$ and $y_i$ are the covariates and time, respectively, and $\delta_i=1$ and $\delta_i=0$ correspond to death and survey termination, respectively, for $i=1,\ldots,N$. Here, $R_i$ is the set of indices $j$ such that $y_j\ge y_i$. We formulate the Lasso as follows:

$$-\frac{1}{N}\sum_{i:\delta_i=1}\log\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}} + \lambda\|\beta\|_1\,. \quad(\text{cf. }(2.20))$$

To obtain the solution, similarly to logistic and Poisson regression, we compute $u$, $W$, where $L$ is defined as follows:

$$L := -\sum_{i:\delta_i=1}\log\frac{e^{x_i\beta}}{\sum_{j\in R_i}e^{x_j\beta}}\,.$$

(a) Let $j\in R_i,\ \delta_i=1 \iff i\in C_j$. Show

$$\frac{\partial L}{\partial\beta_k} = -\sum_{i:\delta_i=1}\Big\{x_{i,k}-\frac{\sum_{j\in R_i}x_{j,k}e^{x_j\beta}}{\sum_{h\in R_i}e^{x_h\beta}}\Big\} = -\sum_{i=1}^N x_{i,k}\Big\{\delta_i-\sum_{j\in C_i}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}\Big\}\,.$$

Express $u$ in $\nabla L = -X^T u$.
Hint For $S_i = \sum_{h\in R_i}e^{x_h\beta}$, we have
$$\sum_{i:\delta_i=1}\sum_{j\in R_i}\frac{x_{j,k}e^{x_j\beta}}{S_i} = \sum_{j=1}^N\sum_{i\in C_j}\frac{x_{j,k}e^{x_j\beta}}{S_i} = \sum_{i=1}^N x_{i,k}\sum_{j\in C_i}\frac{e^{x_i\beta}}{S_j}\,.$$

(b) Each element of $\nabla^2 L = X^T W X$ can be expressed by

$$\frac{\partial^2 L}{\partial\beta_k\partial\beta_l} = \sum_{i=1}^N x_{i,k}\sum_{j\in C_i}\frac{\partial}{\partial\beta_l}\frac{e^{x_i\beta}}{\sum_{h\in R_j}e^{x_h\beta}}$$
$$= \sum_{i=1}^N x_{i,k}\sum_{j\in C_i}\frac{1}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{x_{i,l}e^{x_i\beta}\sum_{s\in R_j}e^{x_s\beta}-e^{x_i\beta}\sum_{h\in R_j}x_{h,l}e^{x_h\beta}\Big\}$$
$$= \sum_{i=1}^N\sum_{h=1}^N x_{i,k}x_{h,l}\sum_{j\in C_i}\frac{e^{x_i\beta}}{\big(\sum_{r\in R_j}e^{x_r\beta}\big)^2}\Big\{I(i=h)\sum_{s\in R_j}e^{x_s\beta}-I(h\in R_j)e^{x_h\beta}\Big\}\,.$$

Find the diagonal elements of $W$.
Hint When $i=h$, $j\in C_i$ implies $h\in R_j$.
32. For the data 1846-2568-2-SP.rda (Alizadeh, 2000), https://www.jstatsoft.org/rt/suppFileMetadata/v039i05/0/722, of malignant lymphoma survival times, answer the following:
(a) Download the data and execute the following procedure:
1 library(survival)
2 load("LymphomaData.rda"); attach("LymphomaData.rda")
3 names(patient.data); x = t(patient.data$x)
4 y = patient.data$time; delta = patient.data$status; Surv(y,
delta)

The variables x, y, and delta store the expression data of p = 7,399


genes, times, and death (= 1) or survey termination (= 0). The sample size is
N = 240. Execute the following code, find the λ that minimizes the CV, and
find how many genes have nonzero coefficients among the 7,399. Moreover,
output cv.fit.
1 cv.fit = cv.glmnet(x, Surv(y, delta), family = "cox")

(b) Fill in the blank and draw the Kaplan-Meier curve for samples such that
xi β > 0 and for those such that xi β < 0 to distinguish them.

1 fit2 = glmnet(x, Surv(y, delta), lambda = cv.fit$lambda.min,


family = "cox")
2 z = sign(drop(x %*% fit2$beta))
3 fit3 = survfit(Surv(y, delta) ~ ## Blank ##)
4 autoplot(fit3)
5 mean(y[z == 1])
6 mean(y[z == -1])
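A natural completion (ours) stratifies the Kaplan–Meier curves by the sign z of the linear predictor, so the blank is simply z:
1 ## One way to fill the blank: Kaplan-Meier curves stratified by the sign z
2 fit3 = survfit(Surv(y, delta) ~ z)
3 autoplot(fit3)   # assumes ggfortify is loaded, as in Exercise 29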

33. It is assumed that logistic regression and the support vector machine (SVM) are
similar even when Lasso is applied.
a. The following code executes Lasso for logistic regression and the SVM for
the South Africa heart disease dataset: https://www2.stat.duke.edu/~cr173/
Sta102_Sp14/Project/heart.pdf.
Because those procedures are in separate packages glmnet and
sparseSVM, the plots are different, and no legend is available for the
first plot. Thus, we construct a graph using the coefficients output by the
packages. Output the graph for the SVM as well.
1 library(ElemStatLearn)
2 library(glmnet)
3 library(sparseSVM)
4 data(SAheart)
5 df = SAheart
6 df[, 5] = as.numeric(df[, 5])
7 x = as.matrix(df[, 1:9]); y = as.vector(df[, 10])
8 p = 9
9 binom.fit = glmnet(x, y, family = "binomial")
10 svm.fit = sparseSVM(x, y)
11 par(mfrow = c(1, 2))
12 plot(binom.fit); plot(svm.fit, xvar = "norm")
13 par(mfrow = c(1, 1))
14 ## The outputs seemed to be similar, but we are not convinced
that they are close because no legend is available.
15 ## So, we made a graph from scratch.
16 coef.binom = binom.fit$beta; coef.svm = coef(svm.fit)[2:(p +
1), ]
17 norm.binom = apply(abs(coef.binom), 2, sum); norm.binom = norm
.binom / max(norm.binom)
18 norm.svm = apply(abs(coef.svm), 2, sum); norm.svm = norm.svm /
max(norm.svm)
19 par(mfrow = c(1, 2))
20 plot(norm.binom, xlim = c(0, 1), ylim = c(min(coef.binom), max
(coef.binom)),
21 main = "Logistic Regression", xlab = "Norm", ylab = "
Coefficient", type = "n")
22 for (i in 1:p) lines(norm.binom, coef.binom[i, ], col = i)
23 legend("topleft", legend = colnames(df), col = 1:p, lwd = 2,
cex = .8)
24 par(mfrow = c(1, 1))

b. From the gene expression data of past patients with leukemia, we distinguish between acute lymphocytic leukemia (ALL) and acute myeloid leukemia (AML) for each future patient. In particular, our goal is to determine which genes should be checked to distinguish between them. To this end, we download the training data file leukemia_big.csv from the site listed below:
https://web.stanford.edu/~hastie/CASI_files/DATA/leukemia.html. The data contain samples for N = 72 patients (47 ALL and 25 AML) and p = 7,128 genes:
https://www.ncc.go.jp/jp/rcc/about/pediatric_leukemia/index.html.
After executing the following, output the coefficients obtained via logistic regression and the SVM. In most genome data, the rows and columns are genes and samples, respectively. However, similar to the datasets we have seen thus far, the rows and columns here are instead samples and genes.
1 df = read.csv("http://web.stanford.edu/~hastie/CASI_files/DATA
/leukemia_big.csv")
2 dim(df)
3 names = colnames(df)
4 x = t(as.matrix(df))
5 y = as.numeric(substr(names, 1, 3) == "ALL")
6 p = 7128
7 binom.fit = glmnet(x, y, family = "binomial")
8 svm.fit = sparseSVM(x, y)
9 coef.binom = binom.fit$beta; coef.svm = coef(svm.fit)[2:(p +
1), ]
10 norm.binom = apply(abs(coef.binom), 2, sum); norm.binom = norm
.binom / max(norm.binom)
11 norm.svm = apply(abs(coef.svm), 2, sum); norm.svm = norm.svm /
max(norm.svm)
Chapter 3
Group Lasso

Group Lasso is Lasso in which the variables are categorized into K groups k =
1, . . . , K. The $p_k$ variables $\theta_k = [\theta_{1,k},\ldots,\theta_{p_k,k}]^T\in\mathbb{R}^{p_k}$ in the same group share
the same value of λ at which their coefficients become zero as we increase
λ. This chapter considers groups with nonzero and zero coefficients to be
active and nonactive, respectively, for each λ. In other words, group Lasso chooses
active groups rather than active variables. The active and nonactive status may be
different among the groups.

Example 24 (Multiple Responses) For the linear regression discussed in Chap. 1,


we consider only one response. However, in this section, we also consider the case
where K responses exist. If there are p covariates, we require pK coefficients.
We consider the situation such that for each j = 1, . . . , p, the K coefficients share
the active/nonactive status [27]. For example, suppose that in a baseball league, the
numbers of HRs (home runs) and RBIs (runs batted in) are responses and the numbers
of hits and walks are covariates. Then, the correlation between HRs and RBIs is
strong, and we anticipate that the solution paths of p = 2 covariates over λ ≥ 0 are
similar between the K = 2 coefficients. We discuss the problem in Sect. 3.6.

Example 25 (Logistic Regression for Multiple Values) For the logistic regression
and classification discussed in Chap. 2, we might have constructed a group consisting
of k = 1, . . . , K for each covariate j = 1, . . . , p to determine which covariate plays
an important role in the classification task [27]. For example, in Fisher’s Iris dataset,
although there are p × K = 4 × 3 = 12 parameters, we may divide them into p = 4
(sepal/petal, width/length) groups that contain K = 3 iris species to observe the
p = 4 active/nonactive status. When we increase λ, K = 3 coefficients in the same
variable indexed by j become zero at once for some λ > 0. We discuss the problem
in Sect. 3.7.


Example 26 (Generalized Additive Model) Similar to the linear regression dis-


cussed in Chap. 1, we are given the data X ∈ R N × p , y ∈ R N .
We consider the case where for each i = 1, . . . , N , we can write


$$y_i = \sum_{k=1}^K f_k(x_i;\theta_k) + \epsilon_i\,,$$

where $f_k$ contains $p_k$ parameters $\theta_k = [\theta_{1,k},\ldots,\theta_{p_k,k}]\in\mathbb{R}^{p_k}$ and the noise $\epsilon_i$ follows a Gaussian distribution with zero mean and unknown variance $\sigma^2>0$. If we apply the problem to Lasso, because we decide whether each function $f_k$ should be contained, we need to decide whether each $\theta_k$, rather than each of $\theta_{j,k}$, $j=1,\ldots,p_k$, should be active as a group [23]. We discuss the problem in Sect. 3.8.

In this chapter, generalizing the notion of Lasso, we categorize the variables into K groups, each of which contains $p_k$ variables ($k=1,\ldots,K$), and given observations $z_{i,k}\in\mathbb{R}^{p_k}$ ($k=1,\ldots,K$), $y_i\in\mathbb{R}$ ($i=1,\ldots,N$), we consider the problem of finding the

$$\theta_1=[\theta_{1,1},\ldots,\theta_{p_1,1}]^T,\ \ldots,\ \theta_K=[\theta_{1,K},\ldots,\theta_{p_K,K}]^T$$

that minimize

$$\frac{1}{2}\sum_{i=1}^N\Big(y_i-\sum_{k=1}^K z_{i,k}\theta_k\Big)^2 + \lambda\sum_{k=1}^K\|\theta_k\|_2\,. \qquad (3.1)$$

Although similar to the previous chapters, we require data preprocessing such as centralization and normalization in this chapter. In addition, we skip the details when we deal with numerical data generated according to the standard Gaussian distribution, except in Sect. 3.6. We write $\|\theta_k\|_2 := \sqrt{\sum_{j=1}^{p_k}\theta_{j,k}^2}$ and regard $z_{i,k}$ as a row vector.
Note that (3.1) is a generalization of the linear Lasso ($p_1=\cdots=p_K=1$, $K=p$) discussed in Chap. 1. In fact, if we let $x_i := [z_{i,1},\ldots,z_{i,p}]\in\mathbb{R}^{1\times p}$ (row vector) and $\beta := [\theta_{1,1},\ldots,\theta_{1,p}]^T$, we have

$$\frac{1}{2}\sum_{i=1}^N(y_i-x_i\beta)^2 + \lambda\|\beta\|_1\,.$$

3.1 When One Group Exists

In this section, we consider the case where only one group exists: $K=1$ and $p_1=p$. Then, (3.1) can be expressed by

$$\frac{1}{2}\sum_{i=1}^N(y_i - z_{i,1}\theta_1)^2 + \lambda\|\theta_1\|_2\,.$$

We proceed with the discussion by replacing $p_1$ with $p$, $z_{i,1}\in\mathbb{R}^p$ with $x_i=[x_{i,1},\ldots,x_{i,p}]$ (row vector), and $\theta_1\in\mathbb{R}^p$ with $\beta=[\beta_1,\ldots,\beta_p]^T$. In particular, we find the optimum β via a method different from coordinate descent. Although the second term was $\lambda\|\beta\|_1=\lambda\sum_{j=1}^p|\beta_j|$ in Chap. 1, it is $\lambda\|\beta\|_2=\lambda\sqrt{\sum_{j=1}^p\beta_j^2}$ in this section, which also differs from the Ridge penalty $\lambda\|\beta\|_2^2=\lambda\sum_{j=1}^p\beta_j^2$.
For the function
$$f(x,y) := \sqrt{x^2+y^2}\,, \qquad (3.2)$$
if we set $y=0$, because $f(x,0)=|x|$, the slopes when approaching from the left and from the right are different. On the other hand, the partial derivatives outside the origin are
$$f_x(x,y) = \frac{x}{\sqrt{x^2+y^2}}\,,\qquad f_y(x,y) = \frac{y}{\sqrt{x^2+y^2}}\,, \qquad (3.3)$$
and they are continuous.


We defined the subderivative of a one-variable function in Chap. 1. Similarly, we define the subderivative of $f:\mathbb{R}^2\to\mathbb{R}$ at $(x_0,y_0)\in\mathbb{R}^2$ as the set of $(u,v)\in\mathbb{R}^2$ such that
$$f(x,y) \ge f(x_0,y_0) + u(x-x_0) + v(y-y_0) \qquad (3.4)$$
for all $(x,y)\in\mathbb{R}^2$. As for one-variable functions, if $f$ is differentiable at $(x_0,y_0)$, the subderivative $(u,v)$ is the derivative at $(x_0,y_0)$.
As we did in Chap. 1, we consider the subderivative at the origin. First, from the definition (3.4), the subderivative of (3.2) at $(x_0,y_0)=(0,0)$ is the set of $(u,v)\in\mathbb{R}^2$ such that
$$\sqrt{x^2+y^2} \ge ux+vy$$
for all $(x,y)\in\mathbb{R}^2$. If we let $x=r\cos\theta$, $y=r\sin\theta$ and $u=s\cos\phi$, $v=s\sin\phi$ ($s\ge 0$, $0\le\phi<2\pi$), we require $r\ge rs\cos(\theta-\phi)$ for arbitrary $r\ge 0$, $0\le\theta<2\pi$. Thus, if $(u,v)$ is outside the unit circle ($s>1$), then it is not included in the subderivative. Conversely, if it is inside the unit circle ($s\le 1$), the inequality holds. Thus, the disk $\{(u,v)\in\mathbb{R}^2\mid u^2+v^2\le 1\}$ is the subderivative.
In the following, we consider finding the value of β that minimizes
$$\frac{1}{2}\|y-X\beta\|_2^2 + \lambda\|\beta\|_2 \qquad (3.5)$$
from $X\in\mathbb{R}^{N\times p}$ and $y\in\mathbb{R}^N$, $\lambda\ge 0$, where we denote $\|\beta\|_2 := \sqrt{\sum_{j=1}^p\beta_j^2}$ for $\beta=[\beta_1,\ldots,\beta_p]^T\in\mathbb{R}^p$. In this chapter, we do not divide the first term by N as in (3.5). We may interpret Nλ as if it were λ.


Fig. 3.1 For p = 1, if $X^Ty$ is at a distance greater than λ from the origin, then β ≠ 0; if within λ, then β = 0.
For p = 2, if $X^Ty$ is at a distance greater than λ from the origin, then β ≠ [0, 0]^T; if within λ, then β = [0, 0]^T

For (3.5), if p = 1, the equation stating that the subderivative contains 0 is written as

$$-X^T(y-X\beta) + \lambda[-1,1] \ni 0\,. \qquad (3.6)$$

If β ≠ 0, as we discussed in Chap. 1, under the assumption that $X^TX$ is the unit matrix, we have $\beta = S_\lambda(X^Ty)$. If β = 0, substituting β = 0 into (3.6), we have

$$|X^Ty| \le \lambda\,. \qquad (3.7)$$

(3.7) means that the line segment with center $X^Ty$ and half-width λ contains the origin (Fig. 3.1a, b).
Suppose p = 2. When $(\beta_1,\beta_2)\neq(0,0)$, from (3.3), the subderivative of $\|\beta\|_2$ at β is $\beta/\|\beta\|_2$, and the equation stating that the subderivative contains zero becomes

$$-X^T(y-X\beta) + \lambda\frac{\beta}{\|\beta\|_2} \ni \begin{bmatrix}0\\0\end{bmatrix}\,, \qquad (3.8)$$

which means

$$X^TX\beta = X^Ty - \lambda\frac{\beta}{\|\beta\|_2}\,. \qquad (3.9)$$

Thus, the original point X T y approaches the origin by the length λ due to Lasso. On
the other hand, when β = 0, we have
   
u 0
The solution of(3.5) is β = 0 ⇐⇒ −X y + λ T
u +v ≤1 
2 2
v 0
⇐⇒ X T y2 ≤ λ ,

which means that the disk with a center and radius of X T y and λ, respectively,
contains the origin (Fig. 3.1c, d).
For p = 1, we have the formula Sλ (X T y). However, for p ≥ 2, we wait for
the convergence similar to the Newton method discussed in Chap. 2—Repeating a
recursive equation. We consider a method for finding the solution β ∈ R p .
Setting ν > 0, we repeat the updates

$$\gamma := \beta + \nu X^T(y-X\beta) \qquad (3.10)$$
$$\beta = \Big(1 - \frac{\nu\lambda}{\|\gamma\|_2}\Big)_+\gamma \qquad (3.11)$$

until β ∈ R^p converges, where we denote $(u)_+ := \max\{0,u\}$.
We give a validation of the method and specify the value of ν below.
The procedure is constructed as follows.
1 gr = function(X, y, lambda) {
2 nu = 1 / max(eigen(t(X) %*% X)$values)
3 p = ncol(X)
4 beta = rep(1, p); beta.old = rep(0, p)
5 while (max(abs(beta - beta.old)) > 0.001) {
6 beta.old = beta
7 gamma = beta + nu * t(X) %*% (y - X %*% beta)
8 beta = max(1 - lambda * nu / norm(gamma, "2"), 0) * gamma
9 }
10 return(beta)
11 }

Example 27 Artificially generating data, we execute the function gr. We show the
result in Fig. 3.2, in which we observe that the p variables share the λ value in which
the coefficients become zero when we increase λ.
1 ## Data Generation
2 n = 100
3 p = 3
4 X = matrix(rnorm(n * p), ncol = p); beta = rnorm(p); epsilon = rnorm(n)
5 y = 0.1 * X %*% beta + epsilon
6 ## Display the Change of Coefficients
7 lambda = seq(1, 50, 0.5)
8 m = length(lambda)
9 beta = matrix(nrow = m, ncol = p)
10 for (i in 1:m) {
11 est = gr(X, y, lambda[i])
12 for (j in 1:p) beta[i, j] = est[j]

13 }
14 y.max = max(beta); y.min = min(beta)
15 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
16 xlab = "lambda", ylab = "Coefficients", type = "n")
17 for (j in 1:p) lines(lambda, beta[, j], col = j + 1)
18 legend("topright", legend = paste("Coefficients", 1:p), lwd = 2, col =
2:(p + 1))
19 segments(lambda[1], 0, lambda[m], 0)

3.2 Proximal Gradient Method

When we alternately update (3.10) and (3.11), the sequence {βt } is not guaranteed
to converge to the solution of (3.5). In the following, we show that we can always
obtain the correct solution by choosing an appropriate value of ν > 0.
Suppose λ = 0. Then, we note that the update via (3.10) and (3.11) amounts to

βt+1 ← βt + ν X T (y − X βt ) (t = 0, 1, . . .)

after setting an initial value β0 . Let g(β) be (3.5) with λ = 0. Then, since −X T (y −
X β) is the derivative of g(β) by β, we can write the update as

βt+1 ← βt − ν∇g(βt ) . (3.12)

We call the procedure a gradient method because it updates β such that the decrease
in g is maximized when it minimizes g.
In the following, we express the second term of (3.5) by $h(\beta) := \lambda\|\beta\|_2$ and evaluate the convergence of a variant of the gradient method (proximal gradient)

$$\beta_{t+1} \leftarrow \beta_t - \nu\{\nabla g(\beta_t) + \partial h(\beta)\}$$

for λ > 0, where ∂ denotes the subgradient. If we use the function

$$\mathrm{prox}_h(z) := \arg\min_{\theta\in\mathbb{R}^p}\Big\{\frac{1}{2}\|z-\theta\|_2^2 + h(\theta)\Big\}$$

(proxy operation), we can express the update as

$$\beta_{t+1} \leftarrow \mathrm{prox}_{\nu h}\big(\beta_t - \nu\nabla g(\beta_t)\big)\,. \qquad (3.13)$$

In fact, (3.13) substitutes into $\beta_{t+1}$ the $\theta\in\mathbb{R}^p$ that minimizes

$$\frac{1}{2}\|\beta_t - \nu\nabla g(\beta_t) - \theta\|_2^2 + \nu h(\theta)\,, \qquad (3.14)$$

which can be examined by differentiating (3.14) by θ.


The above procedure gives a general method that finds the minimum value of the
function f expressed by the sum of convex functions g, h with g differentiable. We
can apply (3.10) and (3.11) to the standard optimization problem, in which γ and β
are updated via γ ← βt − ν∇g(βt ) and β ← proxνh (γ), respectively.
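To make the correspondence concrete, here is a minimal R sketch (our addition) of one proximal gradient step for g(β) = ½‖y − Xβ‖² and h(β) = λ‖β‖₂; the function names prox.nu.h and prox.step are ours, and prox_{νh} is exactly the shrinkage factor of (3.11).
1 ## One proximal-gradient step for g = 0.5 * ||y - X beta||^2, h = lambda * ||beta||_2
2 prox.nu.h = function(z, nu, lambda) max(1 - nu * lambda / norm(z, "2"), 0) * z
3 prox.step = function(beta, X, y, nu, lambda) {
4   gamma = beta + nu * t(X) %*% (y - X %*% beta)   # gradient step (3.10)
5   prox.nu.h(gamma, nu, lambda)                    # proxy operation (3.11)
6 }
Iterating prox.step with ν set to the inverse of the maximum eigenvalue of X^T X reproduces the function gr given above.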
Finally, by repeatedly applying (3.13), we find an appropriate ν by which we
obtain β = β ∗ that minimizes f = g + h : R p → R. We note that there exists a
constant L such that

(x − y)T ∇ 2 g(z)(x − y) ≤ Lx − y22 (3.15)

for arbitrary x, y, z ∈ R p . We refer to this L as a Lipschitz constant . In fact, if an


orthogonal matrix P diagonalizes X T X , from ∇ 2 g(z) = X T X , P T (L I − X T X )P is
nonnegative definite, and the maximum eigenvalue of X T X is a Lipschitz constant.
In the following, we make the following general assumption: g, h are convex, g
is differentiable, and L > 0 is a Lipschitz constant. We define

L
Q(x, y) := g(y) + (x − y)T ∇g(y) + x − y2 + h(x) (3.16)
2
p(y) := argmin x∈R p Q(x, y) (3.17)

for arbitrary x, y ∈ R p . Then, the update (3.13) can be expressed by

βt+1 ← p(βt ) . (3.18)

This method is called the ISTA (iterative shrinkage-thresholding algorithm), and we


can show that the precision of the procedure βt+1 ← p(βt ) is at most O(k −1 ) for k
repetitions.
Let ν := 1/L.

Proposition 7 (Beck and Teboulle, 2009 [3]) The sequence $\{\beta_t\}$ generated by the ISTA satisfies
$$f(\beta_k) - f(\beta_*) \le \frac{L\|\beta_0-\beta_*\|_2^2}{2k}\,,$$
where β* is the optimum solution.

For the proof, see the Appendix.


Next, we modify the ISTA. Using the sequence $\{\alpha_t\}$ with

$$\alpha_1 = 1,\qquad \alpha_{t+1} := \frac{1+\sqrt{1+4\alpha_t^2}}{2}\,, \qquad (3.19)$$

we generate $\{\gamma_t\}$ as well as $\{\beta_t\}$: setting $\gamma_1=\beta_0\in\mathbb{R}^p$, we update the sequences via $\beta_t = p(\gamma_t)$ and


Fig. 3.2 We execute the function gr (Example 27) for p = 3 and p = 8 and observe that the
coefficients become zero at the same time as we increase λ

$$\gamma_{t+1} = \beta_t + \frac{\alpha_t-1}{\alpha_{t+1}}(\beta_t-\beta_{t-1})\,. \qquad (3.20)$$

This method is called the FISTA (fast iterative shrinkage-thresholding algorithm).


Then, we achieve an improvement in the performance over that of the ISTA. The
convergence rate is at most O(k −2 ) for k repetitions.
Proposition 8 (Beck and Teboulle, 2009 [3]) The sequence $\{\beta_t\}$ generated by the FISTA satisfies
$$f(\beta_k) - f(\beta_*) \le \frac{L\|\beta_0-\beta_*\|_2^2}{(k+1)^2}\,,$$
where β* is the optimum value.¹


For the proof, see Page 193 of Beck and Teboulle (2009). Using Lemma 1, Proposi-
tion 8 can be obtained from lengthy but plain formula transformations. If we replace
the function gr in the group Lasso with the FISTA, we obtain the following proce-
dure. If we execute the same procedure as in Example 27, we obtain the same shape
as in Fig. 3.2.
1 fista = function(X, y, lambda) {
2 nu = 1 / max(eigen(t(X) %*% X)$values)
3 p = ncol(X)
4 alpha = 1
5 beta = rep(1, p); beta.old = rep(1, p)
6 gamma = beta
7 while (max(abs(beta - beta.old)) > 0.001) {
8 print(beta)
9 beta.old = beta
10 w = gamma + nu * t(X) %*% (y - X %*% gamma)
11 beta = max(1 - lambda * nu / norm(w, "2"), 0) * w

1 Nesterov’s accelerated gradient method (2007) [22].



12 alpha.old = alpha
13 alpha = (1 + sqrt(1 + 4 * alpha ^ 2)) / 2
14 gamma = beta + (alpha.old - 1) / alpha * (beta - beta.old)
15 }
16 return(beta)
17 }

Example 28 To compare the efficiencies of the ISTA and FISTA, we construct the
program (Fig. 3.3). We could not see a significant difference when N = 100, p =
1, 3. However, the FISTA can be applied to general optimization problems and works
efficiently for large-scale problems.2 We show the program in Exercise 39.
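The comparison program appears in Exercise 39; the following is a condensed sketch (ours, not the book's program) of how one might record the L2 error of the two iterations per repetition, using the shrinkage of (3.11) and the momentum (3.19)–(3.20); compare.ista.fista is our own function name and gr is the function defined above.
1 ## A condensed sketch comparing ISTA and FISTA (L2 error vs. # repetitions)
2 compare.ista.fista = function(X, y, lambda, n.iter = 10) {
3   nu = 1 / max(eigen(t(X) %*% X)$values); p = ncol(X)
4   shrink = function(z) max(1 - nu * lambda / norm(z, "2"), 0) * z
5   beta.star = gr(X, y, lambda)             # reference solution via the function gr above
6   b1 = rep(0, p); b2 = rep(0, p); g2 = b2; alpha = 1
7   err = matrix(0, n.iter, 2)
8   for (t in 1:n.iter) {
9     b1 = shrink(b1 + nu * t(X) %*% (y - X %*% b1))       # ISTA step
10     b2.old = b2
11     b2 = shrink(g2 + nu * t(X) %*% (y - X %*% g2))       # FISTA step
12     alpha.old = alpha; alpha = (1 + sqrt(1 + 4 * alpha ^ 2)) / 2
13     g2 = b2 + (alpha.old - 1) / alpha * (b2 - b2.old)    # momentum (3.20)
14     err[t, ] = c(sqrt(sum((b1 - beta.star) ^ 2)), sqrt(sum((b2 - beta.star) ^ 2)))
15   }
16   err
17 }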

3.3 Group Lasso

In the following, we apply the group Lasso procedure for one group in a cyclic
manner to obtain the group Lasso for K ≥ 1 groups. We can obtain the solution of
(3.1) using the coordinate descent method among the groups. If each of the groups
k = 1, . . . , K contains pk variables, we apply the method introduced in Sects. 3.1
and 3.2 as if p = pk .


Fig. 3.3 The difference in the convergence rate between the ISTA and FISTA (Example 28). It
seems that the difference is not significant unless the problem size is rather large

² If there exists m > 0 such that $(\nabla f(x)-\nabla f(y))^T(x-y) \ge \frac{m}{2}\|x-y\|^2$ for arbitrary $x,y\in\mathbb{R}^p$, then we say that $f:\mathbb{R}^p\to\mathbb{R}$ is strongly convex. In this case, the error can be exponentially decreased w.r.t. k (Nesterov, 2007).

When we apply the coordinate descent method, we compute the residual

$$r_{i,k} = y_i - \sum_{h\neq k} z_{i,h}\theta_h \quad (i=1,\ldots,N)\,.$$

For example, we can construct the following procedure.


1 group.lasso = function(z, y, lambda = 0) {
2 J = length(z)
3 theta = list(); for (j in 1:J) theta[[j]] = rep(0, ncol(z[[j]]))
4 for (m in 1:10) {
5 for (j in 1:J) {
6 r = y; for (k in 1:J) {if (k != j) r = r - z[[k]] %*% theta[[k]]}
7 theta[[j]] = gr(z[[j]], r, lambda) # fista(X, r, lambda) is ok
8 }
9 }
10 return(theta)
11 }

Note that z and theta are lists. If the number of groups is one, they are the X, β
discussed in Chap. 1.
For a single group, we have seen that the variables share the active/nonactive status
for λ ≥ 0. Variables in the same group share the same status for multiple groups, but
those in different groups do not. We manually input the variables into the procedure
for grouping before executing it (the grouping is not automatically constructed).

Example 29 Suppose that we have four variables and that the first two and last two
variables constitute two separate groups. We artificially generate data and estimate
the coefficients of the four variables. When we make λ larger, we observe that the
first two and last two variables share the active/nonactive status (Fig. 3.4).

Fig. 3.4 Group Lasso with J = 2, p1 = p2 = 2 (Example 29). If we make λ larger, we observe that each group shares the active/nonactive status

1 ## Data Generation
2 n = 100; J = 2
3 u = rnorm(n); v = u + rnorm(n)
4 s = 0.1 * rnorm(n); t = 0.1 * s + rnorm(n); y = u + v + s + t + rnorm(n)
5 z = list(); z[[1]] = cbind(u, v); z[[2]] = cbind(s, t)
6 ## Display the coefficients that change with lambda
7 lambda = seq(1, 500, 10); m = length(lambda); beta = matrix(nrow = m,
ncol = 4)
8 for (i in 1:m) {
9 est = group.lasso(z, y, lambda[i])
10 beta[i, ] = c(est[[1]][1], est[[1]][2], est[[2]][1], est[[2]][2])
11 }
12 y.max = max(beta); y.min = min(beta)
13 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
14 xlab = "lambda", ylab = "Coefficients", type = "n")
15 lines(lambda, beta[, 1], lty = 1, col = 2); lines(lambda, beta[, 2], lty
= 2, col = 2)
16 lines(lambda, beta[, 3], lty = 1, col = 4); lines(lambda, beta[, 4], lty
= 2, col = 4)
17 legend("topright", legend = c("Group1", "Group1", "Group2", "Group2"),
18 lwd = 1, lty = c(1, 2), col = c(2, 2, 4, 4))
19 segments(lambda[1], 0, lambda[m], 0)

3.4 Sparse Group Lasso

In the following, we extend the formulation of (3.1) to

$$\frac{1}{2}\sum_{i=1}^N\Big(y_i-\sum_{k=1}^K z_{i,k}\theta_k\Big)^2 + \lambda\sum_{k=1}^K\big\{(1-\alpha)\|\theta_k\|_2 + \alpha\|\theta_k\|_1\big\} \qquad (3.21)$$

to contain sparsity within the group as well as between groups [26]. We refer to this
extended group Lasso as sparse group Lasso. The difference lies only in the second
term from (3.1), introducing the parameter 0 < α < 1, as done in the elastic net.
In sparse group Lasso, although the variables of the nonactive groups are non-
active, the variables in the active groups may be either active or nonactive. In this
sense, sparse group Lasso extends the ordinary group Lasso and allows the active
groups to have nonactive variables.
If we differentiate both sides of (3.21) by $\theta_k\in\mathbb{R}^{p_k}$, we have

$$-\sum_{i=1}^N z_{i,k}(r_{i,k}-z_{i,k}\theta_k) + \lambda(1-\alpha)s_k + \lambda\alpha t_k = 0 \qquad (3.22)$$

with $r_{i,k} := y_i - \sum_{l\neq k}z_{i,l}\hat\theta_l$, where $s_k,t_k\in\mathbb{R}^{p_k}$ are the subderivatives of $\|\theta_k\|_2$, $\|\theta_k\|_1$.
Next, we derive the conditions for $\theta_k=0$ to be optimum in terms of $s_k,t_k$. The minimizer of (3.21) without the term with the coefficient $1-\alpha$ is

$$\theta_k = S_{\lambda\alpha}\Big(\sum_{i=1}^N z_{i,k}r_{i,k}\Big)\Big/\sum_{i=1}^N z_{i,k}^2\,,$$

which means that a necessary condition for $\theta_k=0$ to be a solution of (3.22) is

$$\lambda(1-\alpha) \ge \Big\|S_{\lambda\alpha}\Big(\sum_{i=1}^N z_{i,k}r_{i,k}\Big)\Big\|_2\,.$$

Then, (3.11), which holds for α = 0, is extended to

$$\beta = \Big(1 - \frac{\nu\lambda(1-\alpha)}{\|S_{\lambda\alpha}(\gamma)\|_2}\Big)_+ S_{\lambda\alpha}(\gamma)\,, \qquad (3.23)$$

where $(u)_+ := \max\{0,u\}$. Therefore, we can construct a procedure that repeats (3.10) and (3.23) alternately.
Although we omit the proof, we have the following proposition.

Proposition 9 For arbitrary $0<\nu<1/L$, the $\beta\in\mathbb{R}^p$ minimizing (3.21) and the $\beta\in\mathbb{R}^p$ that is a solution of (3.10), (3.23) are equivalent.

The actual procedure can be constructed as follows, where the three lines marked
as ## are different from the ordinary group Lasso.
1 sparse.group.lasso = function(z, y, lambda = 0, alpha = 0) {
2 J = length(z)
3 theta = list(); for (j in 1:J) theta[[j]] = rep(0, ncol(z[[j]]))
4 for (m in 1:10) {
5 for (j in 1:J) {
6 r = y; for (k in 1:J) {if (k != j) r = r - z[[k]] %*% theta[[k]]}
7 theta[[j]] = sparse.gr(z[[j]], r, lambda, alpha)
##
8 }
9 }
10 return(theta)
11 }
12
13 sparse.gr = function(X, y, lambda, alpha = 0) {
14 nu = 1 / max(2 * eigen(t(X) %*% X)$values)
15 p = ncol(X)
16 beta = rnorm(p); beta.old = rnorm(p)
17 while (max(abs(beta - beta.old)) > 0.001) {
18 beta.old = beta
19 gamma = beta + nu * t(X) %*% (y - X %*% beta)
20 delta = soft.th(lambda * alpha, gamma)
##
21 beta = max(1 - lambda * nu * (1 - alpha) / norm(delta, "2"), 0) *
delta ##
22 }
23 return(beta)
24 }


Fig. 3.5 We observe that the larger the value of α, the greater the number of variables that are
active/nonactive in the group (Example 30)

Example 30 We executed the sparse group Lasso for α = 0, 0.001, 0.01, 0.1 under
the same setting as in Example 29. We find that the larger the value of α, the greater
the number of variables that are nonactive in the active groups (Fig. 3.5).

3.5 Overlap Lasso

We consider the case where some groups overlap. For example, the groups are {1, 2, 3}, {3, 4, 5} rather than {1, 2}, {3, 4}, {5}.
For groups k = 1, ..., K, among the coefficients β1, ..., βp of the variables, we specify for each group which coefficients it is allowed to use (the others are set to zero). We prepare the variables $\theta_1,\ldots,\theta_K\in\mathbb{R}^p$ such that $\beta = \sum_{k=1}^K\theta_k$ (see Example 31) and find the β that minimizes

$$\frac{1}{2}\Big\|y - X\sum_{k=1}^K\theta_k\Big\|_2^2 + \lambda\sum_{k=1}^K\|\theta_k\|_2$$

from the data X ∈ R N × p , y ∈ R N .

Example 31 Suppose that we have five variables with coefficients β1, ..., β5 and two groups θ1, θ2, that only the third variable (with coefficient β3) overlaps between the two groups, and that θ1 and θ2 additionally contain β1, β2 and β4, β5, respectively. We write

$$\theta_1 = \begin{bmatrix}\beta_1\\\beta_2\\\beta_{3,1}\\0\\0\end{bmatrix},\qquad \theta_2 = \begin{bmatrix}0\\0\\\beta_{3,2}\\\beta_4\\\beta_5\end{bmatrix}\qquad(\beta_3 = \beta_{3,1}+\beta_{3,2})\,,$$

and differentiate L by $\beta_1,\beta_2,\beta_{3,1},\beta_{3,2},\beta_4,\beta_5$. We write the first and last three columns of $X\in\mathbb{R}^{N\times 5}$ as $X_1\in\mathbb{R}^{N\times 3}$ and $X_2\in\mathbb{R}^{N\times 3}$. If we differentiate L by γ1 and γ2 (the first and last three elements of θ1, θ2), we have

$$\frac{\partial L}{\partial\gamma_1} = -X_1^T(y-X_1\gamma_1) + \lambda\frac{\gamma_1}{\|\gamma_1\|_2}$$
$$\frac{\partial L}{\partial\gamma_2} = -X_2^T(y-X_2\gamma_2) + \lambda\frac{\gamma_2}{\|\gamma_2\|_2}\,.$$

Thus, (3.8) amounts to

$$-X_1^T(y-X\theta_1) + \lambda\frac{(\text{the first 3 of }\theta_1)}{\|\theta_1\|_2} = 0$$
$$-X_2^T(y-X\theta_2) + \lambda\frac{(\text{the last 3 of }\theta_2)}{\|\theta_2\|_2} = 0\,.$$

If we set $\theta_j=0$, we have $\|X_1^Ty\|_2\le\lambda$ and $\|X_2^Ty\|_2\le\lambda$, which means that we can optimize θ1 and θ2 independently.
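Although this section gives no listing, a minimal sketch (ours) is to duplicate the overlapping column of X and reuse the function group.lasso from Sect. 3.3; β3 is then recovered as β3,1 + β3,2. The data and the value λ = 10 below are arbitrary.
1 ## A minimal overlap-Lasso sketch: duplicate the overlapping column (variable 3)
2 n = 100
3 X = matrix(rnorm(n * 5), ncol = 5)
4 y = X %*% c(1, 1, 2, -1, -1) + rnorm(n)
5 z = list()
6 z[[1]] = X[, c(1, 2, 3)]      # group {1, 2, 3}
7 z[[2]] = X[, c(3, 4, 5)]      # group {3, 4, 5} (column 3 duplicated)
8 theta = group.lasso(z, y, lambda = 10)
9 beta3 = theta[[1]][3] + theta[[2]][1]   # overlapping coefficient beta3 = beta3.1 + beta3.2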

3.6 Group Lasso with Multiple Responses

From the observations $X\in\mathbb{R}^{N\times p}$, $y\in\mathbb{R}^{N\times K}$, we find the $\beta\in\mathbb{R}^{p\times K}$ that minimizes

$$L(\beta) := L_0(\beta) + \lambda\sum_{j=1}^p\|\beta_j\|_2$$

with

$$L_0(\beta) := \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^K\Big(y_{i,k}-\sum_{j=1}^p x_{i,j}\beta_{j,k}\Big)^2\,. \qquad (3.24)$$

In Chap. 1, we considered only the case of K = 1. If we extend the number of responses to K as $y_i=[y_{i,1},\ldots,y_{i,K}]$, we assume that for $j=1,\ldots,p$, the K coefficients $\beta_j=[\beta_{j,1},\ldots,\beta_{j,K}]$ (row vector) share the active/nonactive status. Setting $r_{i,k}^{(j)} := y_{i,k}-\sum_{h\neq j}x_{i,h}\beta_{h,k}$, if we differentiate (3.24) by $\beta_{j,k}$, we have

$$\sum_{i=1}^N\big\{-x_{i,j}(r_{i,k}^{(j)}-x_{i,j}\beta_{j,k})\big\}\,,$$

which means that if we differentiate L(β) by $\beta_j$, we have

$$\beta_j\sum_{i=1}^N x_{i,j}^2 - \sum_{i=1}^N x_{i,j}r_i^{(j)} + \lambda\partial\|\beta_j\|_2\,,$$

where $r_i^{(j)} := [r_{i,1}^{(j)},\ldots,r_{i,K}^{(j)}]$. Therefore, the solution is

$$\hat\beta_j = \frac{1}{\sum_{i=1}^N x_{i,j}^2}\Big(1 - \frac{\lambda}{\|\sum_{i=1}^N x_{i,j}r_i^{(j)}\|_2}\Big)_+\sum_{i=1}^N x_{i,j}r_i^{(j)}\,. \qquad (3.25)$$

From the above discussion, if we consider centralization and normalization, we


can construct the following procedure.
1 gr.multi.linear.lasso = function(X, Y, lambda) {
2 n = nrow(X); p = ncol(X); K = ncol(Y)
3 ## Centralization:the function centralize is defined in Chapter 1
4 res = centralize(X, Y)
5 X = res$X
6 Y = res$y
7 ## Computation of Coefficients
8 beta = matrix(rnorm(p * K), p, K); gamma = matrix(0, p, K)
9 while (norm(beta - gamma, "F") / norm(beta, "F") > 10 ^ (-2)) {
10 gamma = beta ## Store the beta values (for comparison)
11 R = Y - X %*% beta
12 for (j in 1:p) {
13 r = R + as.matrix(X[, j]) %*% t(beta[j, ])
14 M = t(X[, j]) %*% r
15 beta[j, ] = sum(X[, j] ^ 2) ^ (-1) * max(1 - lambda / sqrt(sum(M ^
2)), 0) * M
16 R = r - as.matrix(X[, j]) %*% t(beta[j, ])
17 }
18 }
19 ## Intercept
20 for (j in 1:p) beta[j, ] = beta[j, ] / res$X.sd[j]
21 beta.0 = res$y.bar - as.vector(res$X.bar %*% beta)
22 return(rbind(beta.0, beta))
23 }

Fig. 3.6 We find the relevant variables that affect HRs and RBIs via group Lasso ("Hitters with many HRs and RBIs"). For each color, the solid and dotted lines express HRs and RBIs, respectively. SHs have negative coefficients, which shows that SHs and (HRs, RBIs) have negative correlations. Even if we make λ large enough, the coefficients of H and BB do not become zero, keeping positive values, which means that they are essentially important items for HRs and RBIs

We do not have to repeat updates for each j (for each group) in this procedure but
may update only once.
Example 32 From a dataset that contains nine items of a professional baseball
team,3 i.e., the number of hits (H), home runs (HRs), runs batted in (RBIs), stolen
bases (SBs), base on balls (BB), hits by pitch (HBPs), strike outs (SOs), sacrifice
hits (SHs), and double plays (DPs), we find the relevant covariates among the seven
for the responses HRs and RBIs. In general, we require estimation of the coefficients
β j,k ( j = 1, . . . , 7, k = 1, 2), because the HRs and RBIs are correlated, and we
make β j,1 , β j,2 share the active/nonactive status for each j. We draw the change in
the coefficients with λ in Fig. 3.6.
1 df = read.csv("giants_2019.csv")
2 X = as.matrix(df[, -c(2, 3)])
3 Y = as.matrix(df[, c(2, 3)])
4 lambda.min = 0; lambda.max = 200
5 lambda.seq = seq(lambda.min, lambda.max, 5)
6 m = length(lambda.seq)
7 beta.1 = matrix(0, m, 7); beta.2 = matrix(0, m, 7)
8 j = 0
9 for (lambda in lambda.seq) {
10 j = j + 1
11 beta = gr.multi.linear.lasso(X, Y, lambda)
12 for (k in 1:7) {
13 beta.1[j, k] = beta[k + 1, 1]; beta.2[j, k] = beta[k + 1, 2]
14 }
15 }
16 beta.max = max(beta.1, beta.2); beta.min = min(beta.1, beta.2)

3 The hitters of the Yomiuri Giants, Japan, each of which has at least one opportunity to bat.

17 plot(0, xlim = c(lambda.min, lambda.max), ylim = c(beta.min, beta.max),


18 xlab = "lambda", ylab = "Coefficients", main = "Hitters with many HR
and RBI")
19 for (k in 1:7) {
20 lines(lambda.seq, beta.1[, k], lty = 1, col = k + 1)
21 lines(lambda.seq, beta.2[, k], lty = 2, col = k + 1)
22 }
23 legend("bottomright", c("H", "SB", "BB", "SH", "SO", "HBP", "DP"),
24 lty = 2, col = 2:8)

3.7 Group Lasso via Logistic Regression

In Chap. 2, we considered logistic regression with sparse estimation for binary and
multiple values. In this section, among the pK parameters, the variables with the
coefficients β j,k (k = 1, . . . , K ) with the same j comprise a group. Variables in the
same group share the active/nonactive status, and the coordinate descent method is
applied among the p groups.
In Chap. 2, we considered minimizing

$$L_0(\beta) + \lambda\sum_{j=1}^p\sum_{k=1}^K|\beta_{j,k}|$$

with

$$L_0(\beta) := -\sum_{i=1}^N\Big[\sum_{k=1}^K y_{i,k}\,x_i\beta^{(k)} - \log\Big(\sum_{h=1}^K\exp\big(x_i\beta^{(h)}\big)\Big)\Big]\,,$$

where $x_i\beta^{(k)} = \sum_{j=1}^p x_{i,j}\beta_{j,k}$ and $y_{i,k}=1$ if $y_i=k$ and $y_{i,k}=0$ otherwise. In this section, we replace the last term by the L2 norm and minimize

$$L(\beta) := L_0(\beta) + \lambda\sum_{j=1}^p\sqrt{\sum_{k=1}^K\beta_{j,k}^2}$$

to choose the relevant variables for classification. From the same discussion as in Chap. 2, we have

$$\frac{\partial L_0(\beta)}{\partial\beta_{j,k}} = -\sum_{i=1}^N x_{i,j}(y_{i,k}-\pi_{i,k})$$
$$\frac{\partial^2 L_0(\beta)}{\partial\beta_{j,k}\partial\beta_{j',k'}} = \sum_{i=1}^N x_{i,j}x_{i,j'}w_{i,k,k'}\,,$$
where $\pi_{i,k} = \dfrac{\exp(x_i\beta^{(k)})}{\sum_{h=1}^K\exp(x_i\beta^{(h)})}$ and (2.12) has been applied. Because the $w_{i,k,k'}$ are constants, from the Taylor expansion at β = γ, we derive

$$L_0(\beta) = L_0(\gamma) - \sum_{i=1}^N\sum_{j=1}^p x_{i,j}(\beta_j-\gamma_j)(y_i-\pi_i) + \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^p\sum_{j'=1}^p x_{i,j}x_{i,j'}(\beta_j-\gamma_j)W_i(\beta_{j'}-\gamma_{j'})^T\,,$$

where $y_i=[y_{i,1},\ldots,y_{i,K}]^T$, $\pi_i=[\pi_{i,1},\ldots,\pi_{i,K}]^T$, and $W_i=(w_{i,k,k'})$ are the constants at β = γ. We observe that the Lipschitz constant is at most t such that

$$(\beta-\gamma)W_i(\beta-\gamma)^T \le t\|\beta-\gamma\|^2\,,$$

i.e., $t := 2\max_{i,k}\pi_{i,k}(1-\pi_{i,k})$. In fact, the matrix obtained by subtracting $W_i=(w_{i,k,k'})$ from the diagonal matrix of size K with elements $2\pi_{i,k}(1-\pi_{i,k})$ ($k=1,\ldots,K$) is $Q_i=(q_{i,k,k'})$, with

$$q_{i,k,k'} = \begin{cases}\pi_{i,k}(1-\pi_{i,k}), & k'=k\\ \pi_{i,k}\pi_{i,k'}, & k'\neq k\end{cases}\,,$$

and

$$q_{i,k,k} = \pi_{i,k}(1-\pi_{i,k}) = \sum_{k'\neq k}\pi_{i,k}\pi_{i,k'} = \sum_{k'\neq k}|q_{i,k,k'}|$$

for each $k=1,\ldots,K$. Thus, from Proposition 3, the matrix $Q_i$ is nonnegative definite.
Hence, from (3.16), (3.17), to minimize $L(\beta) = L_0(\beta) + \lambda\sum_{j=1}^p\|\beta_j\|_2$ with $\|\beta_j\|_2 = \sqrt{\sum_{k=1}^K\beta_{j,k}^2}$, we only need to minimize

$$-\sum_{i=1}^N\sum_{j=1}^p x_{i,j}(\beta_j-\gamma_j)(y_i-\pi_i) + \frac{t}{2}\sum_{i=1}^N\Big\|\sum_{j=1}^p x_{i,j}(\beta_j-\gamma_j)\Big\|_2^2 + \lambda\sum_{j=1}^p\|\beta_j\|_2\,,$$

in other words, to minimize⁴

$$\frac{1}{2}\sum_{i=1}^N\Big\|\sum_{j=1}^p x_{i,j}(\beta_j-\gamma_j) - \frac{y_i-\pi_i}{t}\Big\|_2^2 + \frac{\lambda}{t}\sum_{j=1}^p\|\beta_j\|_2\,.$$

If we differentiate the first term by $\beta_j$, we have

$$\sum_{i=1}^N\Big\{x_{i,j}^2(\beta_j-\gamma_j) + \sum_{h\neq j}x_{i,j}x_{i,h}(\beta_h-\gamma_h) - x_{i,j}\frac{y_i-\pi_i}{t}\Big\} = \beta_j\sum_{i=1}^N x_{i,j}^2 - \sum_{i=1}^N x_{i,j}r_i^{(j)}\,,$$

where

$$r_i^{(j)} := \frac{y_i-\pi_i}{t} + \sum_{h=1}^p x_{i,h}\gamma_h - \sum_{h\neq j}x_{i,h}\beta_h\,. \qquad (3.26)$$

Thus, $\beta_j$ can be expressed by

$$\beta_j := \frac{1}{\sum_{i=1}^N x_{i,j}^2}\Big(1 - \frac{\lambda/t}{\|\sum_{i=1}^N x_{i,j}r_i^{(j)}\|_2}\Big)_+\sum_{i=1}^N x_{i,j}r_i^{(j)}\,. \qquad (3.27)$$

⁴ It is not essential to assume that β, γ are elements of R^p in (3.16), (3.17).

We suppose that one parameter is updated in each step from γ1 , . . . , γ p to


β1 , . . . , β p and construct the procedure to repeat (3.26), (3.27) for j = 1, . . . , p
as follows.
1 gr.multi.lasso = function(X, y, lambda) {
2 n = nrow(X); p = ncol(X); K = length(table(y))
3 beta = matrix(1, p, K)
4 gamma = matrix(0, p, K)
5 Y = matrix(0, n, K); for (i in 1:n) Y[i, y[i]] = 1
6 while (norm(beta - gamma, "F") > 10 ^ (-4)) {
7 gamma = beta
8 eta = X %*% beta
9 P = exp(eta); for (i in 1:n) P[i, ] = P[i, ] / sum(P[i, ])
10 t = 2 * max(P * (1 - P))
11 R = (Y - P) / t
12 for (j in 1:p) {
13 r = R + as.matrix(X[, j]) %*% t(beta[j, ])
14 M = t(X[, j]) %*% r
15 beta[j, ] = sum(X[, j] ^ 2) ^ (-1) * max(1 - lambda / t / sqrt(sum(
M ^ 2)), 0) * M
16 R = r - as.matrix(X[, j]) %*% t(beta[j, ])
17 }
18 }
19 return(beta)
20 }

Example 33 For Fisher’s Iris dataset, we draw the graph in which β changes with
λ (Fig. 3.7). The same color expresses the same variable, and each contains K = 3
coefficients. We find that the sepal length plays an important role. The code is as
follows.

Fig. 3.7 For Fisher's Iris data, we observed the changes in the coefficients with λ (Example 33). The same color expresses the same variable, and each consists of K = 3 coefficients. We found that the sepal length plays an important role

1 df = iris
2 X = cbind(df[[1]], df[[2]], df[[3]], df[[4]])
3 y = c(rep(1, 50), rep(2, 50), rep(3, 50))
4 lambda.seq = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150)
5 m = length(lambda.seq); p = ncol(X); K = length(table(y))
6 alpha = array(dim = c(m, p, K))
7 for (i in 1:m) {
8 res = gr.multi.lasso(X, y, lambda.seq[i])
9 for (j in 1:p) {for (k in 1:K) alpha[i, j, k] = res[j, k]}
10 }
11 plot(0, xlim = c(0, 150), ylim = c(min(alpha), max(alpha)), type = "n",
12 xlab = "lambda", ylab = "Coefficients", main = "Each lambda and its
Coefficients")
13 for (j in 1:p) {for (k in 1:K) lines(lambda.seq, alpha[, j, k], col = j +
1)}
14 legend("topright", legend = c("Sepal Length", "Sepal Width", "Petal
Length", "Petal Width"),
15 lwd = 2, col = 2:5)

3.8 Group Lasso for the Generalized Additive Models

We consider Example 26 again. We prepare the basis functions $\phi_{j,k}:\mathbb{R}^p\to\mathbb{R}$ ($j=1,\ldots,p_k$, $k=1,\ldots,K$) in advance. From the observations $X\in\mathbb{R}^{N\times p}$, $y\in\mathbb{R}^N$, we find the $\theta_{j,k}$ ($j=1,\ldots,p_k$) that minimize the squared error between the residual $y-\sum_{h\neq k}f_h(X)\in\mathbb{R}^N$ and

$$f_k(X;\theta_k) = \sum_{j=1}^{p_k}\theta_{j,k}\phi_{j,k}(X)$$

for $k=1,\ldots,K$. We repeat this over $k=1,\ldots,K$ cyclically until convergence. Estimating a function f in this way is called backfitting (Hastie–Tibshirani, 1990). To introduce Lasso, we formulate as follows (Ravikumar et al., 2009). For λ > 0, if we write

$$L := \frac{1}{2}\Big\|y-\sum_{k=1}^K f_k(X)\Big\|^2 + \lambda\sum_{k=1}^K\|\theta_k\|_2$$
$$\ \ = \frac{1}{2}\sum_{i=1}^N\Big\{y_i-\sum_{k=1}^K\sum_{j=1}^{p_k}\phi_{j,k}(x_i)\theta_{j,k}\Big\}^2 + \lambda\sum_{k=1}^K\|\theta_k\|_2 \qquad (3.28)$$

and $z_{i,k} := [\phi_{1,k}(x_i),\ldots,\phi_{p_k,k}(x_i)]$, then (3.28) coincides with (3.1).

Example 34 From the observations $(x_1,y_1),\ldots,(x_N,y_N)\in\mathbb{R}\times\mathbb{R}$, we regress $y_i$ on $f_1(x_i)+f_2(x_i)$. Setting

$$f_1(x;\alpha,\beta) = \alpha+\beta x$$
$$f_2(x;p,q,r) = p\cos x + q\cos 2x + r\cos 3x$$

and J = 2, p1 = 2, and p2 = 3, from

$$z_1 = \begin{bmatrix}1 & x_1\\ \vdots & \vdots\\ 1 & x_N\end{bmatrix},\qquad z_2 = \begin{bmatrix}\cos x_1 & \cos 2x_1 & \cos 3x_1\\ \vdots & \vdots & \vdots\\ \cos x_N & \cos 2x_N & \cos 3x_N\end{bmatrix}\,,$$

Fig. 3.8 Express y = f (x) in terms of the sum of the outputs of f 1 (x), f 2 (x). We obtain the
coefficients of 1, x, cos x, cos 2x, cos 3x for each λ (left) and draw the functions f 1 (x), f 2 (x)
whose coefficients have already been obtained (right)

we seek to obtain θ1 = [α, β]T , θ2 = [ p, q, r ]T .


To this end, we made the following program and obtained the coefficients as in
Fig. 3.8, left. We then decomposed the function f (x) into f 1 (x) and f 2 (x) (Fig. 3.8,
right).
1 ## Data Generation
2 n = 100; J = 2; x = rnorm(n); y = x + cos(x); z = list()
3 z[[1]] = cbind(rep(1, n), x)
4 z[[2]] = cbind(cos(x), cos(2 * x), cos(3 * x))
5 ## Display the Change of the Coefficients
6 lambda = seq(1, 200, 5); m = length(lambda); beta = matrix(nrow = m, ncol
= 5)
7 for (i in 1:m) {
8 est = group.lasso(z, y, lambda[i])
9 beta[i, ] = c(est[[1]][1], est[[1]][2], est[[2]][1], est[[2]][2], est
[[2]][3])
10 }
11 y.max = max(beta); y.min = min(beta)
12 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
13 xlab = "lambda", ylab = "Coefficients", type = "n")
14 lines(lambda, beta[, 1], lty = 1, col = 2); lines(lambda, beta[, 2], lty
= 2, col = 2)
15 lines(lambda, beta[, 3], lty = 1, col = 4); lines(lambda, beta[, 4], lty
= 2, col = 4)
16 lines(lambda, beta[, 5], lty = 3, col = 4)
17 legend("topright", legend = c("1", "x", "cos x", "cos 2x", "cos 3x"),
18 lwd = 1, lty = c(1, 2, 1, 2, 3), col = c(2, 2, 4, 4, 4))
19 segments(lambda[1], 0, lambda[m], 0)
20
21 i = 5  # lambda[5] is used
22 f.1 = function(x) beta[i, 1] + beta[i, 2] * x
23 f.2 = function(x) beta[i, 3] * cos(x) + beta[i, 4] * cos(2 * x) + beta[i,
5] * cos(3 * x)
24 f = function(x) f.1(x) + f.2(x)
25 curve(f.1(x), -5, 5, col = "red", ylab = "Function Value")
26 curve(f.2(x), -5, 5, col = "blue", add = TRUE)
27 curve(f(x), -5, 5, add = TRUE)
28 legend("topleft", legend = c("f = f.1 + f.2", "f.1", "f.2"),
29 col = c(1, "red", "blue"), lwd = 1)

Appendix: Proof of Proposition

Proposition 7 (Beck and Teboulle, 2009 [3]) The sequence $\{\beta_t\}$ generated by the ISTA satisfies
$$f(\beta_k) - f(\beta_*) \le \frac{L\|\beta_0-\beta_*\|_2^2}{2k}\,,$$
where β* is the optimum solution.



Proof: Before showing Proposition 7, we prove the following lemma:

Lemma 1
$$f(x) - f(p(y)) \ge \frac{L}{2}\|p(y)-y\|^2 + L(y-x)^T(p(y)-y)$$
Proof of Lemma 1
We note that in general, we have

$$f(x) \le Q(x,y) \qquad (3.29)$$

for arbitrary $x,y\in\mathbb{R}^p$. In fact, g is convex and differentiable; from Taylor's theorem, we have

$$g(x) = g(y) + (x-y)^T\nabla g(y) + \frac{1}{2}(x-y)^T\nabla^2 g(y+\theta(x-y))(x-y) \le g(y) + (x-y)^T\nabla g(y) + \frac{L}{2}\|x-y\|_2^2$$

for some $0\le\theta\le 1$. Then, the condition that the subderivative of Q(x,y) with respect to x contains 0 is expressed as

$$\nabla g(y) + L(p(y)-y) + \gamma(y) = 0\,, \qquad (3.30)$$

where γ(y) is a subderivative of h(x) at x = p(y). From the convexity of g, h, we can write

$$g(x) \ge g(y) + (x-y)^T\nabla g(y)$$
$$h(x) \ge h(p(y)) + (x-p(y))^T\gamma(y)\,.$$

From these and (3.30), we have the inequality

$$f(x) - Q(p(y),y) = g(x) + h(x) - Q(p(y),y)$$
$$\ge g(y)+(x-y)^T\nabla g(y)+h(p(y))+(x-p(y))^T\gamma(y) - \Big\{g(y)+(p(y)-y)^T\nabla g(y)+\frac{L}{2}\|p(y)-y\|^2+h(p(y))\Big\}$$
$$= -\frac{L}{2}\|p(y)-y\|^2 + (x-p(y))^T(\nabla g(y)+\gamma(y))$$
$$= -\frac{L}{2}\|p(y)-y\|^2 + L(x-y+y-p(y))^T(y-p(y))$$
$$= \frac{L}{2}\|p(y)-y\|^2 + L(x-y)^T(y-p(y))\,. \qquad (3.31)$$

Thus, from (3.29), (3.31), Lemma 1 follows. $\square$
Proof of Proposition 7 If we substitute x := β∗ , y := βt in Lemma 1 from p(βt ) =
βt+1 , we have

2
{ f (β∗ ) − f (βt+1 )} ≥ βt+1 − βt 2 + 2(βt − β∗ )T (βt+1 − βt )
L
= βt+1 − βt , βt+1 − βt + 2(βt − β∗ )
= (βt+1 − β∗ ) − (βt − β∗ ), (βt+1 − β∗ ) + (βt − β∗ )
= β∗ − βt+1 2 − β∗ − βt 2 ,

where ·, · denotes the inner product in the associated linear space. If we add them
over t = 0, 1, . . . , k − 1, we have

2 
k−1
{k f (β∗ ) − f (βt+1 )} ≥ β∗ − βk 2 − β∗ − β0 2 . (3.32)
L t=0

Next, if we substitute x := βt , y := βt in Lemma 1, then we have

2
{ f (βt ) − f (βt+1 )} ≥ βt − βt+1 2 .
L
If we add them after multiplying each by t on both sides over t = 0, 1, . . . , k − 1,
we have

$$\begin{aligned}
\frac{2}{L}\sum_{t=0}^{k-1} t\{ f(\beta_t) - f(\beta_{t+1}) \} &= \frac{2}{L}\sum_{t=0}^{k-1}\{ t f(\beta_t) - (t+1) f(\beta_{t+1}) + f(\beta_{t+1}) \}\\
&= \frac{2}{L}\left\{ -k f(\beta_k) + \sum_{t=0}^{k-1} f(\beta_{t+1}) \right\} \ge \sum_{t=0}^{k-1} t\|\beta_t - \beta_{t+1}\|^2 . \qquad (3.33)
\end{aligned}$$

Finally, adding both sides of (3.32) and (3.33), we obtain
$$\frac{2k}{L}\{ f(\beta_*) - f(\beta_k) \} \ge \|\beta_* - \beta_k\|^2 - \|\beta_* - \beta_0\|^2 + \sum_{t=0}^{k-1} t\|\beta_t - \beta_{t+1}\|^2 \ge -\|\beta_0 - \beta_*\|^2 ,$$

which proves Proposition 7. 

Exercises 35–46

34. For the function $f(x, y) := \sqrt{x^2 + y^2}$, answer the following questions.
(a) Find the subderivative of f (x, y) at (x, y) = (0, 0).
(b) Let p ≥ 2. For β ∈ R p , differentiate the L2 norm $\|\beta\|_2$ for β ≠ 0.
(c) The subderivative of two variables is defined by the set (u, v) ∈ R2 such that

f (x, y) ≥ f (x0 , y0 ) + u(x − x0 ) + v(y − y0 ) (cf.(3.4))

for (x, y) ∈ R2 . Find the subderivative of f (x, y) at (x, y) = (0, 0).


Hint Let x = r cos θ, y = r sin θ, u = s cos φ, and v = s sin φ (s ≥ 0, 0 ≤
φ < 2π). Then, r ≥ r s cos(θ − φ) is required for all r ≥ 0, 0 ≤ θ < 2π.
35. Let X ∈ R N × p , y ∈ R N , and λ > 0. We wish to find β ∈ R p that minimizes

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_2 . \qquad (\text{cf. } (3.5))$$
(a) Show that the necessary and sufficient condition for β = 0 to be a solution
is X T y2 ≤ λ.
Hint If we take the subderivative, the condition of containing zero is as follows:
$$-X^T(y - X\beta) + \lambda\frac{\beta}{\|\beta\|_2} \ni 0 \qquad (\text{cf. } (3.8))$$

Substitute β = 0 into the equation.


(b) Show that if a solution β ≠ 0 exists, then β satisfies
$$X^T X\beta = X^T y - \lambda\frac{\beta}{\|\beta\|_2} . \qquad (\text{cf. } (3.9))$$

36. Let ν > 0. Given an initial value of β ∈ R p , we repeat

γ := β + ν X T (y − X β) (cf. (3.10))

$$\beta = \left(1 - \frac{\nu\lambda}{\|\gamma\|_2}\right)_+ \gamma \qquad (\text{cf. } (3.11))$$

to obtain the optimal solution of β ∈ R p to Exercise 35, where (u)+ :=
max{0, u}. As ν, we chose the inverse of the maximum eigenvalue of X T X and
constructed the following procedure. Fill in the blanks and define the function
gr. Then, examine the function using the procedure that follows.
1 gr = function(X, y, lambda) {
2 nu = 1 / max(2 * eigen(t(X) %*% X)$values)
3 p = ncol(X)
4 beta = rep(1, p); beta.old = rep(0, p)
5 while (max(abs(beta - beta.old)) > 0.001) {
6 beta.old = beta
7 gamma = ## Blank(1) ##
8 beta = ## Blank(2) ##
9 }
10 return(beta)
11 }
12
13 ## Data Generation

14 n = 100
15 p = 3
16 X = matrix(rnorm(n * p), ncol = p); beta = rnorm(p); epsilon = rnorm(
n)
17 y = 0.1 * X %*% beta + epsilon
18 ## Display the coefficients that change with lambda
19 lambda = seq(1, 50, 0.5)
20 m = length(lambda)
21 beta = matrix(nrow = m, ncol = p)
22 for (i in 1:m) {
23 est = gr(X, y, lambda[i])
24 for (j in 1:p) beta[i, j] = est[j]
25 }
26 y.max = max(beta); y.min = min(beta)
27 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
28 xlab = "lambda", ylab = "Coefficients", type = "n")
29 for (j in 1:p) lines(lambda, beta[, j], col = j + 1)
30 legend("topright", legend = paste("Coefficients", 1:p), lwd = 2, col
= 2:(p + 1))
31 segments(lambda[1], 0, lambda[m], 0)

37. We call L a Lipschitz constant if it satisfies

$$(x - y)^T \nabla^2 g(z)(x - y) \le L\|x - y\|_2^2 \qquad (\text{cf. } (3.15)) \qquad (3.34)$$

for arbitrary x, y, z ∈ R p . Show that when $g(z) = \frac{1}{2}\|y - Xz\|^2$, the Lipschitz
constant is at most the maximum eigenvalue of X T X .
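As a numerical illustration (a minimal sketch, not a substitute for the proof asked above), the following R code checks the bound for randomly generated data; the variable names are arbitrary.

## For g(z) = (1/2)||y - Xz||^2 the Hessian is t(X) %*% X, so the quadratic
## form (x - y)^T (t(X) %*% X) (x - y) should never exceed
## max_eigenvalue * ||x - y||_2^2.
n = 50; p = 5
X = matrix(rnorm(n * p), n, p)
H = t(X) %*% X                    # Hessian of g
L = max(eigen(H)$values)          # candidate Lipschitz constant
for (trial in 1:1000) {
  d = rnorm(p)                    # d plays the role of x - y
  stopifnot(t(d) %*% H %*% d <= L * sum(d^2) + 1e-8)
}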
38. We modify Exercise 36 (ISTA) as follows to obtain a better performance (FISTA).
Let {αt } be the sequence such that α1 = 1 and $\alpha_{t+1} := (1 + \sqrt{1 + 4\alpha_t^2})/2$. We
generate {βt } and {γt } such that β0 ∈ R p , γ1 = β0 , βt = p(γt ), and

$$\gamma_{t+1} = \beta_t + \frac{\alpha_t - 1}{\alpha_{t+1}}(\beta_t - \beta_{t-1}) \qquad (\text{cf. } (3.20))$$

(fast iterative shrinkage-thresholding algorithm, FISTA).


(a) Prove αt ≥ (t + 1)/2 based on induction.
(b) Fill in the blanks in the program below, replace the function gr in Exercise 36
with fista, and examine whether the same execution result is obtained.
1 fista = function(X, y, lambda) {
2 nu = 1 / max(2 * eigen(t(X) %*% X)$values)
3 p = ncol(X)
4 alpha = 1
5 beta = rnorm(p); beta.old = rnorm(p)
6 gamma = beta
7 while (max(abs(beta - beta.old)) > 0.001) {
8 print(beta)
9 beta.old = beta
10 w = ## Blank(1) ##
11 beta = ## Blank(2) ##
12 alpha.old = alpha
13 alpha = (1 + sqrt(1 + 4 * alpha ^ 2)) / 2

14 gamma = beta + (alpha.old - 1) / alpha * (beta - beta.old)


15 }
16 return(beta)
17 }

39. To compare the efficiency between the ISTA and FISTA, we construct the fol-
lowing program. Fill in the blanks, and execute the procedure to examine the
difference.
1 ## Data Generation
2 n = 100; p = 1 # p = 3
3 X = matrix(rnorm(n * p), ncol = p); beta = rnorm(p); epsilon = rnorm(
n)
4 y = 0.1 * X %*% beta + epsilon
5 lambda = 0.01
6 nu = 1 / max(eigen(t(X) %*% X)$values)
7 p = ncol(X)
8 m = 10
9 ## Performance of ISTA
10 beta = rep(1, p); beta.old = rep(0, p)
11 t = 0; val = matrix(0, m, p)
12 while (t < m) {
13 t = t + 1; val[t, ] = beta
14 beta.old = beta
15 gamma = ## Blank(1) ##
16 beta = ## Blank(2) ##
17 }
18 eval = array(dim = m)
19 val.final = val[m, ]; for (i in 1:m) eval[i] = norm(val[i, ] - val.
final, "2")
20 plot(1:m, ylim = c(0, eval[1]), type = "n",
21 xlab = "Repetitions", ylab = "L2 Error", main = "Comparison
bewteen ISTA and FISTA")
22 lines(eval, col = "blue")
23 ## Performance of FISTA
24 beta = rep(1, p); beta.old = rep(0, p)
25 alpha = 1; gamma = beta
26 t = 0; val = matrix(0, m, p)
27 while (t < m) {
28 t = t + 1; val[t, ] = beta
29 beta.old = beta
30 w = ## Blank(3) ##
31 beta = ## Blank(4) ##
32 alpha.old = alpha
33 alpha = ## Blank(5) ##
34 gamma = ## Blank(6) ##
35 }
36 val.final = val[m, ]; for (i in 1:m) eval[i] = norm(val[i, ] - val.
final, "2")
37 lines(eval, col = "red")
38 legend("topright", c("FISTA", "ISTA"), lwd = 1, col = c("red", "blue"
))

40. We consider extending the group Lasso procedure for one group to the case of
K groups, similar to the procedure for the coordinate decent method, to obtain
the solution of

$$\frac{1}{2}\sum_{i=1}^{N}\left(y_i - \sum_{k=1}^{K} z_{i,k}\theta_k\right)^2 + \lambda\sum_{k=1}^{K}\|\theta_k\|_2 \qquad (\text{cf. } (3.1)) . \qquad (3.35)$$

Fill in the blanks and construct the group Lasso. Then, execute the function for
the following procedure to examine the behavior.
1 group.lasso = function(z, y, lambda = 0) {
2 J = length(z)
3 theta = list(); for (j in 1:J) theta[[j]] = rep(0, ncol(z[[j]]))
4 for (m in 1:10) {
5 for (j in 1:J) {
6 r = y; for (k in 1:J) {if (k != j) r = r - ## Blank(1) ##}
7 theta[[j]] = ## Blank(2) ##
8 }
9 }
10 return(theta)
11 }
12
13 ## Data Generation
14 n = 100; J = 2
15 u = rnorm(n); v = u + rnorm(n)
16 s = 0.1 * rnorm(n); t = 0.1 * s + rnorm(n); y = u + v + s + t + rnorm
(n)
17 z = list(); z[[1]] = cbind(u, v); z[[2]] = cbind(s, t)
18 ## Display of the Coefficients that change with lambda
19 lambda = seq(1, 500, 10); m = length(lambda); beta = matrix(nrow = m,
ncol = 4)
20 for (i in 1:m) {
21 est = group.lasso(z, y, lambda[i])
22 beta[i, ] = c(est[[1]][1], est[[1]][2], est[[2]][1], est[[2]][2])
23 }
24 y.max = max(beta); y.min = min(beta)
25 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
26 xlab = "lambda", ylab = "Coefficient", type = "n")
27 lines(lambda, beta[, 1], lty = 1, col = 2); lines(lambda, beta[, 2],
lty = 2, col = 2)
28 lines(lambda, beta[, 3], lty = 1, col = 4); lines(lambda, beta[, 4],
lty = 2, col = 4)
29 legend("topright", legend = c("Group1", "Group1", "Group2", "Group2"
),
30 lwd = 1, lty = c(1, 2), col = c(2, 2, 4, 4))
31 segments(lambda[1], 0, lambda[m], 0)

41. To contain sparsity not just within a group but also between groups, we extend
the formulation (3.35) to

$$\frac{1}{2}\sum_{i=1}^{N}\left(y_i - \sum_{k=1}^{K} z_{i,k}\theta_k\right)^2 + \lambda\sum_{k=1}^{K}\{(1-\alpha)\|\theta_k\|_2 + \alpha\|\theta_k\|_1\} \quad (0 < \alpha < 1) \qquad (\text{cf. } (3.21)) \qquad (3.36)$$

(sparse group Lasso). While the active group contains only active variables in
the ordinary group Lasso, it may contain nonactive variables in the sparse group
Lasso. In other words, the sparse group Lasso extends the group Lasso and allows
the active group to contain nonactive variables.

(a) Show that the minimum value except the second term of (3.36) is


$$S_{\lambda\alpha}\left(\sum_{i=1}^{N} z_{i,k} r_{i,k}\right) .$$

(b) Show that the necessary and sufficient condition for θk = 0 to be a solution is
$$\lambda(1-\alpha) \ge \left\| S_{\lambda\alpha}\left(\sum_{i=1}^{N} z_{i,k} r_{i,k}\right) \right\|_2 .$$

(c) Show that the update formula for the ordinary group Lasso
$$\beta \leftarrow \left(1 - \frac{\nu\lambda}{\|\gamma\|_2}\right)_+ \gamma$$
becomes
$$\beta \leftarrow \left(1 - \frac{\nu\lambda(1-\alpha)}{\|S_{\lambda\alpha}(\gamma)\|_2}\right)_+ S_{\lambda\alpha}(\gamma) \qquad (\text{cf. } (3.23))$$
in the extended setting.


42. We construct a function that realizes the sparse group Lasso. Fill in the blanks,
and execute the procedure in Exercise 40 to examine the validity.
1 sparse.group.lasso = function(z, y, lambda = 0, alpha = 0) {
2 J = length(z)
3 theta = list(); for (j in 1:J) theta[[j]] = rep(0, ncol(z[[j]]))
4 for (m in 1:10) {
5 for (j in 1:J) {
6 r = y; for (k in 1:J) {if (k != j) r = r - z[[k]] %*% theta[[k
]]}
7 theta[[j]] = ## Blank(1) ##
8 }
9 }
10 return(theta)
11 }
12
13 sparse.gr = function(X, y, lambda, alpha = 0) {
14 nu = 1 / max(2 * eigen(t(X) %*% X)$values)
15 p = ncol(X)
16 beta = rnorm(p); beta.old = rnorm(p)
17 while (max(abs(beta - beta.old)) > 0.001) {
18 beta.old = beta
19 gamma = beta + nu * t(X) %*% (y - X %*% beta)
20 delta = ## Blank(2) ##
21 beta = ## Blank(3) ##
22 }
23 return(beta)
24 }

43. We consider the case where some groups overlap. For example, the groups
are {1, 2, 3}, {3, 4, 5} rather than {1, 2}, {3, 4}, {5}. We prepare the variables

θ1 , . . . , θK ∈ R p such that $\beta = \sum_{k=1}^{K} \theta_k \in \mathbb{R}^p$ (see Example 31) and find the β
value that minimizes
$$\frac{1}{2}\left\|y - X\sum_{k=1}^{K}\theta_k\right\|_2^2 + \lambda\sum_{k=1}^{K}\|\theta_k\|_2$$

from the data X ∈ R N × p , y ∈ R N . We write the submatrices consisting of the first three
and the last three columns of X ∈ R N ×5 as X1 ∈ R N ×3 and X2 ∈ R N ×3 , and let
$$\theta_1 = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_{3,1} \\ 0 \\ 0 \end{bmatrix}, \quad \theta_2 = \begin{bmatrix} 0 \\ 0 \\ \beta_{3,2} \\ \beta_4 \\ \beta_5 \end{bmatrix} \quad (\beta_3 = \beta_{3,1} + \beta_{3,2}) .$$

Express the conditions equivalent to θ1 = 0, θ2 = 0 in terms of θ1 , θ2 , X 1 ,


X 2 , y, λ, respectively.
Hint Differentiate the equation to be minimized by β1 , β2 , β3,1 , β3,2 , β4 , β5 .
44. From the observations X ∈ R N × p and y ∈ R N ×K , we find the β ∈ R p×K that minimizes
$$L(\beta) := L_0(\beta) + \lambda\sum_{j=1}^{p}\|\beta_j\|_2$$
with
$$L_0(\beta) := \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{K}\left(y_{i,k} - \sum_{j=1}^{p} x_{i,j}\beta_{j,k}\right)^2 .$$
Show that
$$\hat{\beta}_j = \frac{1}{\sum_{i=1}^{N} x_{i,j}^2}\left(1 - \frac{\lambda}{\left\|\sum_{i=1}^{N} x_{i,j}\, r_i^{(j)}\right\|_2}\right)_+ \sum_{i=1}^{N} x_{i,j}\, r_i^{(j)}$$
is the solution, where $r_i^{(j)} \in \mathbb{R}^K$ denotes the residual of the i-th sample with the j-th variable excluded.
45. From the observations (x1 , y1 ), . . . , (x N , y N ) ∈ R × R, we regress yi by
f 1 (xi ) + f 2 (xi ). Let

f 1 (x; α, β) = α + βx
f 2 (x; p, q, r ) = p cos x + q cos 2x + r cos 3x

and J = 2, p1 = 2, p2 = 3. To this end, we obtain θ1 = [α, β]T , θ2 = [ p, q, r ]T


from
$$z_1 = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}, \quad z_2 = \begin{bmatrix} \cos x_1 & \cos 2x_1 & \cos 3x_1 \\ \vdots & \vdots & \vdots \\ \cos x_N & \cos 2x_N & \cos 3x_N \end{bmatrix} .$$

Fill in the blanks, and execute the procedure.


1 ## Data Generation
2 n = 100; J = 2; x = rnorm(n); y = x + cos(x); z = list()
3 z[[1]] = cbind(rep(1, n), x)
4 z[[2]] = cbind(## Blank(1) ##)
5 ## Display the coefficients that change with lambda
6 lambda = seq(1, 200, 5); m = length(lambda); beta = matrix(nrow = m,
ncol = 5)
7 for (i in 1:m) {
8 est = ## Blank(2) ##
9 beta[i, ] = c(est[[1]][1], est[[1]][2], est[[2]][1], est[[2]][2],
est[[2]][3])
10 }
11 y.max = max(beta); y.min = min(beta)
12 plot(lambda[1]:lambda[m], ylim = c(y.min, y.max),
13 xlab = "lambda", ylab = "Coefficients", type = "n")
14 lines(lambda, beta[, 1], lty = 1, col = 2); lines(lambda, beta[, 2],
lty = 2, col = 2)
15 lines(lambda, beta[, 3], lty = 1, col = 4); lines(lambda, beta[, 4],
lty = 2, col = 4)
16 lines(lambda, beta[, 5], lty = 3, col = 4)
17 legend("topright", legend = c("1", "x", "cos x", "cos 2x", "cos 3x"),
18 lwd = 1, lty = c(1, 2, 1, 2, 3), col = c(2, 2, 4, 4, 4))
19 segments(lambda[1], 0, lambda[m], 0)
20
21 i = 5 # the lambda[5] value is used
22 f.1 = function(x) beta[i, 1] + beta[i, 2] * x
23 f.2 = function(x) beta[i, 3] * cos(x) + beta[i, 4] * cos(2 * x) +
beta[i, 5] * cos(3 * x)
24 f = function(x) f.1(x) + f.2(x)
25 curve(f.1(x), -5, 5, col = "red", ylab = "Function Value")
26 curve(f.2(x), -5, 5, col = "blue", add = TRUE)
27 curve(f(x), -5, 5, add = TRUE)
28 legend("topleft", legend = c("f = f.1 + f.2", "f.1", "f.2"),
29 col = c(1, "red", "blue"), lwd = 1)

46. The following function gr.multi.lasso estimates the coefficients of logis-


tic regression with multiple values. Fill in the blanks, and examine whether the
procedure runs.
1 gr.multi.lasso = function(X, y, lambda) {
2 n = nrow(X); p = ncol(X); K = length(table(y))
3 beta = matrix(1, p, K)
4 gamma = matrix(0, p, K)
5 Y = matrix(0, n, K); for (i in 1:n) Y[i, y[i]] = 1
6 while (norm(beta - gamma, "F") > 10 ^ (-4)) {
7 gamma = beta
8 eta = X %*% beta
9 P = ## Blank(1) ##
10 t = 2 * max(P * (1 - P))
11 R = (Y - P) / t

12 for (j in 1:p) {
13 r = R + as.matrix(X[, j]) %*% t(beta[j, ])
14 M = ## Blank(2) ##
15 beta[j, ] = sum(X[, j] ^ 2) ^ (-1) *
16 max(1 - lambda / t / sqrt(sum(M ^ 2)), 0) * M
17 R = r - as.matrix(X[, j]) %*% t(beta[j, ])
18 }
19 }
20 return(beta)
21 }
22 ## the procedure to execute
23 df = iris
24 X = cbind(df[[1]], df[[2]], df[[3]], df[[4]])
25 y = c(rep(1, 50), rep(2, 50), rep(3, 50))
26 lambda.seq = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150)
27 m = length(lambda.seq); p = ncol(X); K = length(table(y))
28 alpha = array(dim = c(m, p, K))
29 for (i in 1:m) {
30 res = gr.multi.lasso(X, y, lambda.seq[i])
31 for (j in 1:p) {for (k in 1:K) alpha[i, j, k] = res[j, k]}
32 }
33 plot(0, xlim = c(0, 150), ylim = c(min(alpha), max(alpha)), type = "n
",
34 xlab = "lambda", ylab = "Coefficient", main = "The coefficients
that change with lambda")
35 for (j in 1:p) {for (k in 1:K) lines(lambda.seq, alpha[, j, k], col =
j + 1)}
36 legend("topright", legend = c("Sepal Length", "Sepal Width", "Petal
Length", "Petal Width"),
37 lwd = 2, col = 2:5)
Chapter 4
Fused Lasso

Fused Lasso is the problem of finding the θ1 , . . . , θ N that minimize

$$\frac{1}{2}\sum_{i=1}^{N}(y_i - \theta_i)^2 + \lambda\sum_{i=1}^{N-1}|\theta_i - \theta_{i+1}| \qquad (4.1)$$

given observations y1 , . . . , y N and a constant λ > 0.


The outputs θ1 , . . . , θ N are, respectively, close to y1 , . . . , y N , and θi , θi+1 share
the same value if yi , yi+1 have close values (Fig. 4.1). Then, for i = 1, . . . , N − 1,
the larger the value of λ > 0, the larger the penalty on θi ≠ θi+1 , which means that
θi = θi+1 becomes more likely, in particular when yi , yi+1 are close. If λ is infinitely large,
we have θ1 = · · · = θ N . We extend the formulation to finding the θ1 , . . . , θ N that
minimize
$$\frac{1}{2}\sum_{i=1}^{N}(y_i - \theta_i)^2 + \mu\sum_{i=1}^{N}|\theta_i| + \lambda\sum_{i=1}^{N-1}|\theta_i - \theta_{i+1}| \qquad (4.2)$$

for given μ, λ ≥ 0 (sparse fused Lasso). The extended setting penalizes the size
of θi as well as the difference θi − θi+1 . Another extension allows the observations
y1 , . . . , y N to be p-dimensional ( p ≥ 1), in which case the difference θi − θi+1 is
evaluated by the ( p-dimensional) L2 norm.
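To make the two penalty terms concrete, the following minimal R sketch (ours, not part of the original procedures; the function name fused.objective is arbitrary) evaluates the objective (4.2) for a candidate θ; setting mu = 0 recovers (4.1).

## Evaluate the sparse fused Lasso objective (4.2) for a candidate theta;
## mu = 0 gives the ordinary fused Lasso objective (4.1).
fused.objective = function(y, theta, lambda, mu = 0) {
  N = length(y)
  0.5 * sum((y - theta)^2) + mu * sum(abs(theta)) +
    lambda * sum(abs(theta[1:(N - 1)] - theta[2:N]))
}
## Example: a constant theta removes the difference penalty entirely.
y = c(1, 1.2, 0.9, 3, 3.1)
fused.objective(y, rep(mean(y), 5), lambda = 1)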

4.1 Applications of Fused Lasso

Before looking at the inside of the fused Lasso procedure, we overview the application
of the CRAN package genlasso [2] to various data.


Fig. 4.1 Fused Lasso. If the dimension is one, adjacent observations yi are close, and they share
the same θi , such as {y2 , y3 , y4 } and {y5 , y6 } above

Example 35 We analyze the comparative genomic hybridization (CGH) data via


fused Lasso.
https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/cgh.html
CGH is a molecular cytogenetic method for analyzing copy number variations
(CNVs) relative to the ploidy level in the DNA of a test sample compared to
a reference sample without culturing cells. We make the CNVs smooth for λ =
0.1, 1, 10, 100 (Fig. 4.2).
1 library(genlasso)
2 df = read.table("cgh.txt"); y = df[[1]]; N = length(y)
3 out = fusedlasso1d(y)
4 plot(out, lambda = 0.1, xlab = "gene #", ylab = "Copy Number Variation",
5 main = "gene # 1-1000")

For the multidimensional case, let V = {1, 2, . . . , N } and E (a subset of {{i, j} |


i, j ∈ V, i = j}) be the vertex and edge sets, respectively, and assume that yi is at
the vertex i. We formulate the fused Lasso problem as minimizing the quantity

1 
N
(yi − θi )2 + λ |θi − θ j | (4.3)
2 i=1 {i, j}∈E

for a given λ ≥ 0. We consider whether a pair i, j ∈ V such that {i, j} ∈ E are


connected via the fused Lasso procedure. If the edges are connected in one dimension,
it is the original fused Lasso (Fig. 4.3).
Example 36 For the novel coronavirus that spread in Japan in 2020, the number
of infected people by prefecture up to June 9, 2020 is shown. First, consider the
connectivity of fused Lasso when prefectures are adjacent to each other. For λ = 50,
we observe the difference in the number of infected people, but for λ = 150 or more,
only the areas where the infection has spread, such as Tokyo and Hokkaido, are
emphasized. In the program, from the adjacency matrix (one if prefectures i, j are
adjacent and zero otherwise), a matrix D with 47 columns and with as many rows as
there are adjacent pairs is generated; in each row, one of the two adjacent prefectures
takes the value 1 and the other the value −1 (the
multidimensional fused Lasso is described in detail in Sect. 4.4).

Fig. 4.2 Execution of Example 35. Applying fused Lasso, we make the CGH data smooth. For the
first graph, we draw the smoothing of the CGH for all genes, and for the three graphs below it, we
draw the smoothing for the first fifty genes and λ = 0.1, 1, 10. We observe that the greater λ is, the
more smooth (the less sensitive) the generated line

Fig. 4.3 The execution of Example 36. The number of people infected with the novel coronavirus in
2020 until June 9th is displayed in color. When λ = 150 (right), a difference between a small number
of areas can be observed, such as Tokyo and Hokkaido, where the state of emergency declared in
April lasted longer than in other prefectures. However, small differences can be observed when
λ = 50 (left)

1 library(genlasso)
2 library(NipponMap)
3 mat = read.table("adj.txt")
4 mat = as.matrix(mat) ## Adjacency matrix of 47 prefectures in
Japan
5 y = read.table("2020_6_9.txt")
6 y = as.numeric(y[[1]]) ## #infected with corona for each of the 47
prefectures
7
8 k = 0; u = NULL; v = NULL
9 for (i in 1:46) for (j in (i + 1):47) if (mat[i, j] == 1) {
10 k = k + 1; u = c(u, i); v = c(v, j)
11 }
12 m = length(u)
13 D = matrix(0, m, 47)
14 for (k in 1:m) {D[k, u[k]] = 1; D[k, v[k]] = -1}
15 res = fusedlasso(y, D = D)
16 z = coef(res, lambda = 50)$beta # lambda = 150
17 cc = round((10 - log(z)) * 2 - 1)
18 cols = NULL
19 for (k in 1:47) cols = c(cols, heat.colors(12)[cc[k]]) ## Colors for each
of 47 prefectures
20 JapanPrefMap(col = cols, main = "lambda = 50") ## a function to draw JP map

Example 37 Fused Lasso can show not only the difference between the adjacencies
of variables but also the second-order difference θi − 2θi+1 + θi+2 , the third-order
difference θi − 3θi+1 + 3θi+2 − θi+3 , etc. Differences can be introduced in multiple
layers, which is called trend filtering.

Fig. 4.4 Execution of Example 37 (output of the trend filtering procedure), in which the higher the
order k, the more degrees of freedom the curve has, and the easier it is for the curve to follow the
data

Using the function trendfilter prepared by genlasso, we smooth the sin θ,


0 ≤ θ ≤ 2π, data with added noise via a trend filtering of the order k = 1, 2, 3, 4.
This output is shown in Fig. 4.4.
1 library(genlasso)
2 n = 50; y = sin(1:n / n * 2 * pi) + rnorm(n) ## Data Generation
3 out = trendfilter(y, ord = 3); k = 1 # k = 2, 3, 4
4 plot(out, lambda = k, main = paste("k = ", k))

4.2 Solving Fused Lasso via Dynamic Programming

There are several ways to solve the fused Lasso problem.


We consider the solution based on dynamic programming (N. Johnson, 2013) [14].
To find θ1 , . . . , θ N that minimize (4.1), we obtain the condition w.r.t. θ1 such that
the optimum θ2 , . . . , θ N is obtained. In this case, we need to minimize

1
h 1 (θ1 , θ2 ) := (y1 − θ1 )2 + λ|θ2 − θ1 |
2
but θ2 remains. Then, the value of θ1 when the value of θ2 is known is

$$\hat{\theta}_1(\theta_2) = \begin{cases} y_1 - \lambda, & y_1 > \theta_2 + \lambda \\ \theta_2, & |y_1 - \theta_2| \le \lambda \\ y_1 + \lambda, & y_1 < \theta_2 - \lambda . \end{cases}$$

The optimum condition w.r.t. θ2 is the minimization of

$$\frac{1}{2}(y_1 - \theta_1)^2 + \frac{1}{2}(y_2 - \theta_2)^2 + \lambda|\theta_2 - \theta_1| + \lambda|\theta_3 - \theta_2| .$$

Then, the variables θ1 , θ3 remain, but if we replace θ1 with θ̂1 (θ2 ), the value of θ2
when we know θ3 , i.e., θ̂2 (θ3 ) that minimizes

$$h_2(\hat{\theta}_1(\theta_2), \theta_2, \theta_3) := \frac{1}{2}(y_1 - \hat{\theta}_1(\theta_2))^2 + \frac{1}{2}(y_2 - \theta_2)^2 + \lambda|\theta_2 - \hat{\theta}_1(\theta_2)| + \lambda|\theta_3 - \theta_2|$$
can be expressed as a function of θ3 .
Since θ̂2 (θ3 ) is a function of θ3 , θ̂1 (θ2 ) can also be expressed as a function of θ3 , which we write θ̂1 (θ3 ).
continue this process, then θ̂1 (θ N ), . . . , θ̂ N −1 (θ N ) are obtained as a function of θ N .
If we substitute them into (4.1), then the problem reduces to the minimization of

$$\begin{aligned}
&h_N(\hat{\theta}_1(\theta_N), \ldots, \hat{\theta}_{N-1}(\theta_N), \theta_N, \theta_N)\\
&:= \frac{1}{2}\sum_{i=1}^{N-1}(y_i - \hat{\theta}_i(\theta_N))^2 + \frac{1}{2}(y_N - \theta_N)^2 + \lambda\sum_{i=1}^{N-2}|\hat{\theta}_i(\theta_N) - \hat{\theta}_{i+1}(\theta_N)| + \lambda|\hat{\theta}_{N-1}(\theta_N) - \theta_N|
\end{aligned} \qquad (4.4)$$

w.r.t. θ N . Suppose that we have successfully obtained the optimum solution θ N =


θ∗N of (4.4). Then, θ∗N −1 := θ̂ N −1 (θ∗N ) is obtained from the value of θ∗N , θ∗N −2 =
θ̂ N −2 (θ∗N −1 ) is obtained from θ∗N −1 , and so on, to obtain the θ1∗ , . . . , θ∗N that mini-
mize (4.1). We postpone solving the equations w.r.t. θ1 , . . . , θ N −1 forward and first
solve θ N −1 = θ∗N −1 , . . . , θ1 = θ1∗ backward after obtaining the solution θ N = θ∗N .
This general strategy is called dynamic programming.

It does not seem to be easy to find a function θ̂i (θi+1 ) except for i = 1. However,
there exist L i , Ui , i = 1, . . . , N − 1 such that

$$\hat{\theta}_i(\theta_{i+1}) = \begin{cases} L_i, & \theta_{i+1} < L_i \\ \theta_{i+1}, & L_i \le \theta_{i+1} \le U_i \\ U_i, & U_i < \theta_{i+1} . \end{cases}$$

In the actual procedure, we find the optimum value θ∗N of θ N and the remaining
θi∗ = θ̂i (θ∗i+1 ), i = N − 1, . . . , 1, in a backward manner. For the details, see the
Appendix. From the procedure, we have the following proposition.
Appendix. From the procedure, we have the following proposition.
Proposition 10 (N. Johnson, 2013 [14]) The algorithm that computes the fused
Lasso via dynamic programming completes in O(N ).
Based on the discussion above, we construct a function that computes the fused
Lasso.
1 clean = function(z) {
2 m = length(z)
3 j = 2; while (z[1] >= z[j] && j < m) j = j + 1
4 k = m - 1; while (z[m] <= z[k] && k > 1) k = k - 1
5 if (j > k) return(z[c(1, m)]) else return(z[c(1, j:k, m)])
6 }
7
8 fused = function(y, lambda = lambda) {
9 if (lambda == 0) return(y)
10 n = length(y)
11 L = array(dim = n - 1)
12 U = array(dim = n - 1)
13 G = function(i, theta) {
14 if (i == 1) theta - y[1]
15 else G(i - 1, theta) * (theta > L[i - 1] && theta < U[i - 1]) +
16 lambda * (theta >= U[i - 1]) - lambda * (theta <= L[i - 1]) + theta -
y[i]
17 }
18 theta = array(dim = n)
19 L[1] = y[1] - lambda; U[1] = y[1] + lambda; z = c(L[1], U[1])
20 if (n > 2) for (i in 2:(n - 1)) {
21 z = c(y[i] - 2 * lambda, z, y[i] + 2 * lambda); z = clean(z)
22 m = length(z)
23 j = 1; while (G(i, z[j]) + lambda <= 0) j = j + 1
24 if (j == 1) {L[i] = z[m]; j = 2}
25 else L[i] = z[j - 1] - (z[j] - z[j - 1]) * (G(i, z[j - 1]) + lambda) /
26 (-G(i, z[j - 1])+G(i, z[j]))
27 k = m; while (G(i, z[k]) - lambda >= 0) k = k - 1
28 if (k == m) {U[i] = z[1]; k = m - 1}
29 else U[i] = z[k] - (z[k + 1] - z[k]) * (G(i, z[k]) - lambda) /
30 (-G(i, z[k]) + G(i, z[k + 1]))
31 z = c(L[i], z[j:k], U[i])
32 }
33 z = c(y[n] - lambda, z, y[n] + lambda); z = clean(z)
34 m = length(z)
35 j = 1; while (G(n, z[j]) <= 0 && j < m) j = j + 1
36 if (j == 1) theta[n] = z[1]
37 else theta[n] = z[j - 1] - (z[j] - z[j - 1]) * G(n, z[j - 1]) /(-G(n, z[j
- 1]) + G(n, z[j]))

38 for (i in n:2) {
39 if (theta[i] < L[i - 1]) theta[i - 1] = L[i - 1]
40 if (L[i - 1] <= theta[i] && theta[i] <= U[i - 1]) theta[i - 1] = theta[i
]
41 if (theta[i] > U[i - 1]) theta[i - 1] = U[i - 1]
42 }
43 return(theta)
44 }

We also consider minimizing (4.2) rather than (4.1) (sparse Fused Lasso), which
not only smooths the adjacent data but also regularizes θ1 , . . . , θ N to suppress the
sizes.
In reality, we often consider the fused Lasso under μ = 0. This is largely because
we do not care about the size of θ = [θ1 , . . . , θ N ], but there is another reason:
the solution under the extended setting can be obtained from that under
the nonextended setting.

Proposition 11 (Friedman et al., 2007 [10]) Let θ(0) be the solution θ ∈ R N under
μ = 0. Then, the solution θ = θ(μ) under μ > 0 is given by S μN (θ(0)), where SμN :
R N → R N is such that the i = 1, . . . , N -th element is Sμ (z i ) for z = [z 1 , . . . , z N ]T
(see (1.11)).

For the proof, see the Appendix.
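As a minimal illustration of Proposition 11 (a sketch of ours, assuming the function fused defined above), the sparse fused Lasso solution for μ > 0 is obtained by soft-thresholding each element of the μ = 0 solution:

## Sketch of Proposition 11: obtain theta(mu) from theta(0) by applying the
## soft-thresholding operator S_mu componentwise (uses fused() defined above).
soft = function(mu, z) sign(z) * pmax(abs(z) - mu, 0)   # componentwise S_mu
sparse.fused = function(y, lambda, mu) soft(mu, fused(y, lambda))

y = c(0.1, 0.2, 1.5, 1.6, -0.3)
sparse.fused(y, lambda = 0.5, mu = 0.2)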

4.3 LARS

To understand the Lasso duality algorithm introduced in the next section, consider
the sparse estimation procedure called LARS (least angle regression; B. Efron
et al., 2004 [9]). LARS performs processing similar to Lasso, but since the amount of
computation is O( p 2 ) in the number of variables p, Lasso is more often used in practice.
However, LARS is easy to analyze theoretically and is said to be rich in suggestions.
In linear Lasso, we identify the largest λ such that at least one coefficient is
nonzero. Such a λ is the largest absolute value λ0 of ⟨x ( j) , y⟩ over the variables j,
if the squared loss ‖y − Xβ‖2 (the first term) is not divided by N , where x ( j) is the
j-th column of X ∈ R N × p , and y ∈ R N .
LARS defines a piecewise linear vector β(λ) (Fig. 4.5): suppose we have
obtained λ0 , λ1 , . . . , λk−1 , β0 = 0, β1 , . . . , βk−1 , and S = { j1 , . . . , jk−1 } for k ≥ 1. Then,
1. Define Δk−1 ∈ R p such that the first k elements are nonzero and the rest p − k are zero.
2. Define
$$\beta(\lambda) = \beta_{k-1} + (\lambda_{k-1} - \lambda)\Delta_{k-1} \qquad (4.5)$$
and
$$r(\lambda) = y - X\beta(\lambda) \qquad (4.6)$$
Fig. 4.5 LARS: Δk−1 with k nonzero and p − k zero elements defines the piecewise linear vector
β(λ) = βk−1 + (λk−1 − λ)Δk−1 , given β(λk−1 ) = βk−1

for λ ≤ λk−1 .
3. Seek the largest λk ≤ λk−1 and jk ∉ S such that the absolute value of
⟨x ( jk ) , r (λk )⟩ is λk , and join jk to the active set S.
4. Restrict the ranges of β(λ) and r (λ) to λk ≤ λ, and define βk := β(λk ).
Note that for the third step, if we define $u_j := \langle x_j,\ r_{k-1} - \lambda_{k-1} X\Delta_{k-1}\rangle$ and
$v_j := -\langle x_j,\ X\Delta_{k-1}\rangle$, where rk−1 := y − Xβk−1 , we choose the largest
$t_j := u_j/(v_j \pm 1)$ among j ∉ S as λk .
If we choose
$$\Delta_{k-1} := \begin{bmatrix} (X_S^T X_S)^{-1} X_S^T r_{k-1}/\lambda_{k-1} \\ 0 \end{bmatrix},$$
then we have ⟨x ( jk ) , r (λ)⟩ = ±λ for λ ≤ λk , for k = 0, 1, . . . , p − 1, where X S ∈ R N ×k
is the sub-matrix of X that consists of the columns x ( j) with j ∈ S.
In fact, from (4.6), if we define HS := X S (X ST X S )−1 X ST , then
$$r(\lambda) = r_k - \left(1 - \frac{\lambda}{\lambda_k}\right)H_S r_k = (I - H_S)r_k + \frac{\lambda}{\lambda_k}H_S r_k$$
holds. Moreover, from ⟨x ( j) , rk ⟩ = ±λk and (x ( j) )T HS = (HS x ( j) )T = (x ( j) )T , we have
$$\langle x^{(j)}, r(\lambda)\rangle = \left\langle x^{(j)}, (I - H_S)r_k + \frac{\lambda}{\lambda_k}H_S r_k\right\rangle = 0 + \frac{\lambda}{\lambda_k}\langle x^{(j)}, r_k\rangle = \pm\lambda .$$

In other words, in LARS, once a variable joins the active set, it maintains the
same relation ⟨x ( j) , r (λ)⟩ = ±λ until λ = 0.
For example, in the R language, we can construct the following procedure.
1 lars = function(X, y) {
2 X = as.matrix(X); n = nrow(X); p = ncol(X); X.bar = array(dim = p)
3 for (j in 1:p) {X.bar[j] = mean(X[, j]); X[, j] = X[, j] - X.bar[j]}
4 y.bar = mean(y); y = y - y.bar
5 scale = array(dim = p)

6 for (j in 1:p) {scale[j] = sqrt(sum(X[, j] ^ 2) / n); X[, j] = X[, j] /


scale[j]}
7 beta = matrix(0, p + 1, p); lambda = rep(0, p + 1)
8 for (i in 1:p) {
9 lam = abs(sum(X[, i] * y))
10 if (lam > lambda[1]) {i.max = i; lambda[1] = lam}
11 }
12 r = y; index = i.max; Delta = rep(0, p)
13 for (k in 2:p) {
14 Delta[index] = solve(t(X[, index]) %*% X[, index]) %*%
15 t(X[, index]) %*% r / lambda[k - 1]
16 u = t(X[, -index]) %*% (r - lambda[k - 1] * X %*% Delta)
17 v = -t(X[, -index]) %*% (X %*% Delta)
18 t = u / (v + 1)
19 for (i in 1:(p - k + 1)) if (t[i] > lambda[k]) {lambda[k] = t[i]; i.max
= i}
20 t = u / (v - 1)
21 for (i in 1:(p - k + 1)) if (t[i] > lambda[k]) {lambda[k] = t[i]; i.max
= i}
22 j = setdiff(1:p, index)[i.max]
23 index = c(index, j)
24 beta[k, ] = beta[k - 1, ] + (lambda[k - 1] - lambda[k]) * Delta
25 r = y - X %*% beta[k, ]
26 }
27 for (k in 1:(p + 1)) for (j in 1:p) {beta[k, j] = beta[k, j] / scale[j]}
28 return(list(beta = beta, lambda = lambda))
29 }

Example 38 We apply LARS to the U.S. crime data to which we applied Lasso in
Chapter 1. Although the scales of λ are different, they share a similar shape (Fig.
4.6). The figure is displayed via the above function and the following procedure.
1 df = read.table("crime.txt"); X = as.matrix(df[, 3:7]); y = df[, 1]
2 res = lars(X, y)
3 beta = res$beta; lambda = res$lambda
4 p = ncol(beta)
5 plot(0:8000, ylim = c(-7.5, 15), type = "n",
6 xlab = "lambda", ylab = "beta", main = "LARS (USA Crime Data)")
7 abline(h = 0)
8 for (j in 1:p) lines(lambda[1:(p)], beta[1:(p), j], col = j)
9 legend("topright",
10 legend = c("Annual Police Funding in \$/Resident", "25 yrs.+ with 4
yrs. of High School",
11 "16 to 19 yrs. not in High School ...",
12 "18 to 24 yrs. in College",
13 "25 yrs.+ in College"),
14 col = 1:p, lwd = 2, cex = .8)



Fig. 4.6 We apply LARS to the U.S. crime data and observe that they share a similar shape

4.4 Dual Lasso Problem and Generalized Lasso

In this section, we solve fused Lasso in a manner different from the method based on
dynamic programming. Although the method takes O(N 2 ) time for N observations,
it can solve more general fused Lasso problems. First, we introduce the notion of the
dual Lasso problem [29].

Example 39 Given observations X ∈ R N × p , y ∈ R N , and parameter λ > 0, we find


the β ∈ R p that minimizes

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \qquad (4.7)$$
(linear regression Lasso), where the first term is not divided by N (consider that N
is contained in λ), which is the minimization of

$$\frac{1}{2}\|r\|_2^2 + \lambda\|\beta\|_1$$

with r := y − X β. Moreover, if we regard α ∈ R N as a Lagrange coefficient vector,


the problem reduces to the minimization of

1
L(β, r, α) := r 2
2 +λ β 1 − αT (r − y + X β)
2
w.r.t. β, r . Thus, if we minimize it w.r.t. β, r , we have

0, XT α ∞ ≤ λ
minp {−αT X β + λ β 1 } =
β∈R −∞, otherwise
1 1
min r 2
2 − αT r = − αT α ,
r 2 2

where X T α ∞ is the maximum of |x Tj α|, in which x j is the j-th column of X .


Therefore, the minimization of (4.7) w.r.t. β, r and the maximization of

$$\frac{1}{2}\left\{\|y\|_2^2 - \|y - \alpha\|_2^2\right\} \qquad (4.8)$$

over α ∈ R N under ‖X T α‖∞ ≤ λ, i.e., the minimization of $\frac{1}{2}\|y - \alpha\|_2^2$, are equivalent.
In the following, the former and latter are said to be the prime and dual problems.
This dual problem involves choosing the α value such that X T α ∞ ≤ λ (α sur-
rounded by a polyhedron consisting of p pairs of parallel surfaces) and that minimizes
the distance from y.
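As a quick sketch of this geometry (ours, for illustration only), when X is the unit matrix the dual problem is simply the projection of y onto the box {α : ‖α‖∞ ≤ λ}, and β̂ = y − α̂ recovers componentwise soft-thresholding:

## When X = I, the dual solution is the projection of y onto [-lambda, lambda]^N,
## and beta = y - alpha reproduces the soft-thresholding operator.
lambda = 1; y = c(-3, -0.5, 0.2, 2.5)
alpha = pmin(pmax(y, -lambda), lambda)   # projection onto the box
beta = y - alpha
beta                                     # equals sign(y) * pmax(abs(y) - lambda, 0)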

Fused Lasso gives an example. For y1 , . . . , y N ∈ R and λ > 0, we wish to mini-


mize
$$\frac{1}{2}\|y - \theta\|_2^2 + \lambda\|D\theta\|_1 , \qquad (4.9)$$

where D = (Di, j ) ∈ R(N −1)×N is



$$D_{i,j} = \begin{cases} 1, & j = i \\ -1, & j = i + 1 \\ 0, & \text{otherwise} \end{cases} \qquad (4.10)$$

for i = 1, . . . , N − 1.
The same idea can be applied not just to the one-dimensional fused Lasso but also
to trend filtering
$$\frac{1}{2}\sum_{i=1}^{N}(y_i - \theta_i)^2 + \lambda\sum_{i=1}^{N-2}|\theta_i - 2\theta_{i+1} + \theta_{i+2}|$$

and its extension


$$\frac{1}{2}\sum_{i=1}^{N}(y_i - \theta_i)^2 + \lambda\sum_{i=1}^{N-2}\left|\frac{\theta_{i+2} - \theta_{i+1}}{x_{i+2} - x_{i+1}} - \frac{\theta_{i+1} - \theta_i}{x_{i+1} - x_i}\right| \qquad (4.11)$$

by modifying D ∈ R (N −2)×N as
$$D_{i,j} = \begin{cases} 1, & j = i \\ -2, & j = i + 1 \\ 1, & j = i + 2 \\ 0, & \text{otherwise} , \end{cases} \qquad D_{i,j} = \begin{cases} \dfrac{1}{x_{i+1} - x_i}, & j = i \\[2mm] -\left(\dfrac{1}{x_{i+1} - x_i} + \dfrac{1}{x_{i+2} - x_{i+1}}\right), & j = i + 1 \\[2mm] \dfrac{1}{x_{i+2} - x_{i+1}}, & j = i + 2 \\ 0, & \text{otherwise} . \end{cases} \qquad (4.12)$$
Moreover, for the multidimensional case, if the edge set in (4.3) consists of m elements
{i 1 , j1 }, . . . , {i m , jm }, then we can construct the matrix D ∈ R m×N such that
Dk,ik = 1 and Dk, jk = −1 for 1 ≤ i k < jk ≤ N and the other N − 2 entries of row k are zero, for
k = 1, . . . , m.
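For illustration, the following R sketch (ours; the function names are arbitrary) constructs the first-order, second-order, and graph-based difference matrices D described above:

## First-order differences (ordinary one-dimensional fused Lasso), (4.10):
D.first = function(N) {
  D = matrix(0, N - 1, N)
  for (i in 1:(N - 1)) {D[i, i] = 1; D[i, i + 1] = -1}
  D
}
## Second-order differences (trend filtering), left-hand matrix in (4.12):
D.second = function(N) {
  D = matrix(0, N - 2, N)
  for (i in 1:(N - 2)) {D[i, i] = 1; D[i, i + 1] = -2; D[i, i + 2] = 1}
  D
}
## Graph version: one row per edge {i_k, j_k}, as in Example 36.
D.graph = function(N, edges) {           # edges: m x 2 matrix of vertex pairs
  D = matrix(0, nrow(edges), N)
  for (k in 1:nrow(edges)) {D[k, edges[k, 1]] = 1; D[k, edges[k, 2]] = -1}
  D
}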
Example 40 We transform the fused Lasso problem into its dual version to solve it.
If we let D ∈ Rm× p and γ = Dθ, then (4.9) becomes the minimization of

$$\frac{1}{2}\|y - \theta\|_2^2 + \lambda\|\gamma\|_1$$
over θ ∈ R p . If we introduce a Lagrange multiplier α, then we have

$$\frac{1}{2}\|y - \theta\|_2^2 + \lambda\|\gamma\|_1 + \alpha^T(D\theta - \gamma) .$$
If we minimize this via θ, γ, we have

$$\min_{\theta}\left\{\frac{1}{2}\|y - \theta\|_2^2 + \alpha^T D\theta\right\} = \frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|y - D^T\alpha\|_2^2 \qquad (4.13)$$
$$\min_{\gamma}\{\lambda\|\gamma\|_1 - \alpha^T\gamma\} = \begin{cases} 0, & \|\alpha\|_\infty \le \lambda \\ -\infty, & \text{otherwise} . \end{cases}$$

Thus, the dual problem minimizes

$$\frac{1}{2}\|y - D^T\alpha\|_2^2$$

under α ∞ ≤ λ. If we obtain the solution α̂ of α, then by substituting α̂ into the left-


hand side of (4.13) to be minimized and differentiating by θ, we have θ̂ = y − D T α̂
and obtain the value of θ.
The prime and dual problems of the generalized Lasso can be extended to the
following:
Proposition 12 For X ∈ R N × p , y ∈ R N , and D ∈ Rm× p (m ≥ 1), if the prime prob-
lem involves finding the β ∈ R p that minimizes

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|D\beta\|_1 , \qquad (4.14)$$

then the dual problem involves finding the α ∈ R m that minimizes
$$\frac{1}{2}\|X(X^TX)^{-1}X^Ty - X(X^TX)^{-1}D^T\alpha\|_2^2 \qquad (4.15)$$

under ‖α‖∞ ≤ λ, where we assume that X T X is nonsingular. If we obtain the solution
α̂ of α, we obtain β as β̂ = (X T X )−1 (X T y − D T α̂).
For the proof, see the Appendix.
We consider solving those problems via the path algorithm (R. Tibshirani and J.
Taylor, 2013). The procedure solves the dual problem of the generalized Lasso and
is what the CRAN package genlasso, used in the first section, implements.
For simplicity, we assume that X in (4.15) is the unit matrix and that D D T is
nonsingular. Then, for each λ ≥ 0, we consider the procedure to find the value of α
that minimizes
1
y − D T α 22 (4.16)
2
under α ∞ ≤ λ.
To this end, we note the mathematical property below.
Proposition 13 (R. Tibshirani and J. Taylor, 2013) If D ∈ Rm× p satisfies

$$(DD^T)_{i,i} \ge \sum_{j \ne i} |(DD^T)_{i,j}| \quad (i = 1, \ldots, m) , \qquad (4.17)$$
we have
$$|\alpha_i(\lambda)| = \lambda \;\Longrightarrow\; |\alpha_i(\lambda')| = \lambda' , \quad \lambda' < \lambda . \qquad (4.18)$$

For the proof, consult the paper (R. Tibshirani and J. Taylor, 2011 [29]).
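A quick numerical check (ours, for illustration) confirms that the one-dimensional fused Lasso matrix D of (4.10) satisfies (4.17):

## Each diagonal entry of D D^T should dominate the sum of the absolute
## off-diagonal entries in its row (diagonal dominance, condition (4.17)).
N = 10
D = matrix(0, N - 1, N); for (i in 1:(N - 1)) {D[i, i] = 1; D[i, i + 1] = -1}
M = D %*% t(D)
all(diag(M) >= rowSums(abs(M)) - diag(M))   # TRUE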
For each λ, each element of the solution α̂(λ) is at most λ. The path algorithm uses
Proposition 13. The condition (4.17) is met for the ordinary fused Lasso as in (4.10)
but does not hold for (4.12). However, the original paper considers a generalization
for dealing with the case, and the package genlasso implements the idea. This
book considers only the case where the matrix D satisfies (4.17).
First, let λ1 and i 1 be the maximum absolute value among the elements and
the element i in the least squares solution α̂(1) of (4.16). We define the sequences
{λk }, {i k }, {sk } for k = 2, . . . , m as follows: For S := {i 1 , . . . , i k−1 }, let λk and i k = i
be the maximum absolute value among the elements and the element i ∈ / S in α−S
that minimizes
$$\frac{1}{2}\|y - \lambda D_S^T s - D_{-S}^T\alpha_{-S}\|_2^2 . \qquad (4.19)$$

If α̂ik = λk , then we have sik = 1; otherwise, sik = −1. Here, DS ∈ R k× p , αS ∈ R k
consist of the rows (elements) of D, α that correspond to S, and D−S ∈ R (m−k)× p , α−S ∈ R m−k
consist of the other rows (elements) of D, α.
consists of the other rows of D, α.
If we differentiate (4.19) by α−S to minimize it, we have
$$\alpha_k(\lambda) := \{D_{-S} D_{-S}^T\}^{-1} D_{-S}(y - \lambda D_S^T s) . \qquad (4.20)$$

We let ai − λbi be the element of the right-hand side of (4.20) for i ∉ S, λk be the
λ such that ai − λbi = ±λ, and i k be the corresponding i. If ai − λk bi = λk , then sk = 1;
otherwise, sk = −1. If we let
$$t_i := \frac{a_i}{b_i \pm 1} = \frac{\text{the } i\text{-th element of } \{D_{-S}D_{-S}^T\}^{-1}D_{-S}\,y}{\text{the } i\text{-th element of } \{D_{-S}D_{-S}^T\}^{-1}D_{-S}D_S^T s \pm 1} ,$$
then the maximum value and its i are λk and i k , respectively, where the sign of
α̂(λk )ik is sk . Then, we add i k to S. For j ∉ S, we can compute αk (λk ) from (4.20).
For j ∈ S, we have αk (λk ) j = λk s j .
For example, we can construct the following procedure.
1 fused.dual = function(y, D) {
2 m = nrow(D)
3 lambda = rep(0, m); s = rep(0, m); alpha = matrix(0, m, m)
4 alpha[1, ] = solve(D %*% t(D)) %*% D %*% y
5 for (j in 1:m) if (abs(alpha[1, j]) > lambda[1]) {
6 lambda[1] = abs(alpha[1, j])
7 index = j
8 if (alpha[1, j] > 0) s[j] = 1 else s[j] = -1
9 }
10 for (k in 2:m) {
11 U = solve(D[-index, ] %*% t(as.matrix(D[-index, , drop = FALSE])))
12 V = D[-index, ] %*% t(as.matrix(D[index, , drop = FALSE]))
13 u = U %*% D[-index, ] %*% y
14 v = U %*% V %*% s[index]
15 t = u / (v + 1)
16 for (j in 1:(m - k + 1)) if (t[j] > lambda[k]) {lambda[k] = t[j]; h = j;
r = 1}
17 t = u / (v - 1)
18 for (j in 1:(m - k + 1)) if (t[j] > lambda[k]) {lambda[k] = t[j]; h = j;
r = -1}
19 alpha[k, index] = lambda[k] * s[index]
20 alpha[k, -index] = u - lambda[k] * v
21 h = setdiff(1:m, index)[h]
22 if (r == 1) s[h] = 1 else s[h] = -1
23 index = c(index, h)
24 }
25 return(list(alpha = alpha, lambda = lambda))
26 }

If we choose the following setting, we have the ordinary (one-dimensional) fused


Lasso.
1 m = p - 1; D = matrix(0, m, p); for (i in 1:m) {D[i, i] = 1; D[i, i + 1] =
-1}

Moreover, the solution of the primary problem can be obtained via

β̂(λ) = y − D T α̂(λ)

1 fused.prime = function(y, D) {
2 res = fused.dual(y, D)
3 return(list(beta = t(y - t(D) %*% t(res$alpha)), lambda = res$lambda))
4 }

Example 41 Using the path algorithm introduced above, we draw the graph of the
coefficients α(λ)/β̂(λ) of the dual/primary problems with λ in the horizontal axes.
(Fig. 4.7).
1 p = 8; y = sort(rnorm(p)); m = p - 1; s = 2 * rbinom(m, 1, 0.5) - 1
2 D = matrix(0, m, p); for (i in 1:m) {D[i, i] = s[i]; D[i, i + 1] = -s[i]}
3 par(mfrow = c(1, 2))
4 res = fused.dual(y, D); alpha = res$alpha; lambda = res$lambda
5 lambda.max = max(lambda); m = nrow(alpha)
6 alpha.min = min(alpha); alpha.max = max(alpha)
7 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(alpha.min, alpha.max),
type = "n",
8 xlab = "lambda", ylab = "alpha", main = "Dual Problem")
9 u = c(0, lambda); v = rbind(0, alpha); for (j in 1:m) lines(u, v[, j], col =
j)
10 res = fused.prime(y, D); beta = res$beta
11 beta.min = min(beta); beta.max = max(beta)
12 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(beta.min, beta.max),
type = "n",
13 xlab = "lambda", ylab = "beta", main = "Prime Problem")
14 w = rbind(0, beta); for (j in 1:p) lines(u, w[, j], col = j)
15 par(mfrow = c(1, 1))

We now consider the general problem (4.15) that contains the design matrix
X ∈ R N × p with the rank p when the matrix D satisfies (4.17).
If we let ỹ := X (X T X )−1 X T y, D̃ := D(X T X )−1 X T , the problem reduces to
minimizing
$$\frac{1}{2}\|\tilde{y} - \tilde{D}^T\alpha\|_2^2 .$$
For example, we can construct the following R language function.
1 fused.dual.general = function(X, y, D) {
2 X.plus = solve(t(X) %*% X) %*% t(X)
3 D.tilde = D %*% X.plus
4 y.tilde = X %*% X.plus %*% y
5 return(fused.dual(y.tilde, D.tilde))
6 }
7 fused.prime.general = function(X, y, D) {
8 X.plus = solve(t(X) %*% X) %*% t(X)
9 D.tilde = D %*% X.plus
10 y.tilde = X %*% X.plus %*% y
11 res = fused.dual.general(X, y, D)
12 m = nrow(D)
13 beta = matrix(0, m, p)
14 for (k in 1:m) beta[k, ] = X.plus %*% (y.tilde - t(D.tilde) %*% res$alpha[
k, ])
15 return(list(beta = beta, lambda = res$lambda))
16 }


Fig. 4.7 Execution of Example 41: the solution paths of the dual (left) and primary (right) problems
w.r.t. p = 8 and m = 7. We choose the one-dimensional fused Lasso for the matrix D. In both paths,
the solution paths merge as λ decreases. The solutions α ∈ Rm and β ∈ R p of the dual and primary
problems are shown by the lines of seven and eight colors, respectively

Example 42 For the cases where D is the unit matrix and where D expresses the one-dimensional
fused Lasso (4.10), we show the solution paths of the dual and primary problems (Fig. 4.8).
For the unit matrix, we observe the linear Lasso path that we have seen thus far.
However, for (4.10), the design matrix is not the unit matrix, and the nature of the
path is different. Those executions are due to the following problem.
1 n = 20; p = 10; beta = rnorm(p + 1)
2 X = matrix(rnorm(n * p), n, p); y = cbind(1, X) %*% beta + rnorm(n)
3 # D = diag(p) ## Use one of the two D
4 D = array(dim = c(p - 1, p))
5 for (i in 1:(p - 1)) {D[i, ] = 0; D[i, i] = 1; D[i, i + 1] = -1}
6 par(mfrow = c(1, 2))
7 res = fused.dual.general(X, y, D); alpha = res$alpha; lambda = res$lambda
8 lambda.max = max(lambda); m = nrow(alpha)
9 alpha.min = min(alpha); alpha.max = max(alpha)
10 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(alpha.min, alpha.max),
type = "n",
11 xlab = "lambda", ylab = "alpha", main = "Dual Problem")
12 u = c(0, lambda); v = rbind(0, alpha); for (j in 1:m) lines(u, v[, j], col =
j)
13 res = fused.prime.general(X, y, D); beta = res$beta
14 beta.min = min(beta); beta.max = max(beta)
15 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(beta.min, beta.max),
type = "n",
16 xlab = "lambda", ylab = "beta", main = "Primary Problem")
17 w = rbind(0, beta); for (j in 1:p) lines(u, w[, j], col = j)
18 par(mfrow = c(1, 1))

In general, suppose that we minimize a function as




Fig. 4.8 Execution of Example 42. The solution paths of the dual and primary paths when the
design matrix is not the unit matrix. a When D is the unit matrix, the problem reduces to the linear
Lasso. b When D expresses the one-dimensional fused Lasso, a completely different shape appears
in the solution path

g(β1 , . . . , β p ) + λh(β1 , . . . , β p )

for convex g, h with g being differentiable. Then, if the second term is separable into p
convex one-variable functions such as $h(\beta_1, \ldots, \beta_p) = \sum_{j=1}^{p} h_j(\beta_j)$, it is known
that the coordinate descent update obtained by differentiating the objective function by each β j
converges to the optimum value (Tseng, 1988).
However, the primary problem of fused Lasso does not satisfy separability, and
we cannot apply the coordinate descent method. Thus, we consider the dynamic
programming method and the dual problems. We consider another way to solve the
fused Lasso problem.

4.5 ADMM

This section considers a different approach that is applicable to general problems to


solve the fused Lasso.
Let A ∈ Rd×m , B ∈ Rd×n , and c ∈ Rd . We assume that f : Rm → R and g :
R → R are convex and that f is differentiable. We formulate the problem of finding
n

α ∈ Rm , β ∈ Rn such that

Aα + Bβ = c (4.21)
f (α) + g(β) that minimizes (4.22)

via the Lagrange multiplier
$$L(\alpha, \beta, \gamma) := f(\alpha) + g(\beta) + \gamma^T(A\alpha + B\beta - c) \to \min \quad (\gamma \in \mathbb{R}^d : \text{undetermined constant}) .$$

In general, we have
$$\inf_{\alpha, \beta} L(\alpha, \beta, \gamma) \le L(\alpha, \beta, \gamma) \le \sup_{\gamma} L(\alpha, \beta, \gamma) .$$

Then, the minimum and maximum values of the primary and dual problems coincide,
respectively. Although in general the equality in

$$\sup_{\gamma} \inf_{\alpha, \beta} L(\alpha, \beta, \gamma) \le \inf_{\alpha, \beta} \sup_{\gamma} L(\alpha, \beta, \gamma)$$

may or may not hold, in this case it is known to hold because the problem is convex
and the constraint is affine (Slater's condition).
In this book, we introduce a constant ρ > 0 to define the extended Lagrangian
$$L_\rho(\alpha, \beta, \gamma) = f(\alpha) + g(\beta) + \gamma^T(A\alpha + B\beta - c) + \frac{\rho}{2}\|A\alpha + B\beta - c\|^2 . \qquad (4.23)$$

Setting the initial values α0 ∈ R m , β0 ∈ R n , γ0 ∈ R d , we repeat the following steps
for t = 0, 1, 2, . . . (ADMM, alternating direction method of multipliers):
1. let αt+1 be the α that minimizes L ρ (α, βt , γt )
2. let βt+1 be the β that minimizes L ρ (αt+1 , β, γt )
3. γt+1 ← γt + ρ(Aαt+1 + Bβt+1 − c)
We show that the solution of (4.21), (4.22) can be obtained by repeating the above
three steps. The ADMM is a general method used to solve convex optimization
problems, not just for fused Lasso, and in the following chapters, we continue to
apply the ADMM to other classes of sparse estimation problems.
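For instance (a minimal sketch of ours, not taken from the examples below), the linear regression Lasso of (4.7) fits this template with A = I , B = −I , c = 0, f (α) = ½‖y − Xα‖², and g(β) = λ‖β‖₁, and the three steps become:

## Minimal ADMM sketch for (1/2)||y - X alpha||_2^2 + lambda ||beta||_1
## subject to alpha - beta = 0 (A = I, B = -I, c = 0); illustrative only.
admm.lasso = function(X, y, lambda, rho = 1, iter = 500) {
  p = ncol(X)
  soft = function(kappa, z) sign(z) * pmax(abs(z) - kappa, 0)
  XtX = t(X) %*% X; Xty = t(X) %*% y
  alpha = rep(0, p); beta = rep(0, p); gamma = rep(0, p)
  for (t in 1:iter) {
    alpha = solve(XtX + rho * diag(p), Xty + rho * beta - gamma)  # Step 1
    beta = soft(lambda / rho, alpha + gamma / rho)                # Step 2
    gamma = gamma + rho * (alpha - beta)                          # Step 3
  }
  beta
}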

Example 43 If we apply the ADMM to the generalized Lasso with the extended Lagrangian

$$L_\rho(\alpha, \beta, \gamma) := \frac{1}{2}\|y - \alpha\|_2^2 + \lambda\|\beta\|_1 + \gamma^T(D\alpha - \beta) + \frac{\rho}{2}\|D\alpha - \beta\|^2 ,$$

then we have
$$\frac{\partial L_\rho}{\partial \alpha} = \alpha - y + D^T\gamma_t + \rho D^T(D\alpha - \beta_t)$$
$$\frac{\partial L_\rho}{\partial \beta} \ni -\gamma_t + \rho(\beta - D\alpha_{t+1}) + \lambda \begin{cases} 1, & \beta > 0 \\ [-1, 1], & \beta = 0 \\ -1, & \beta < 0 , \end{cases}$$

which means that the update formula is as follows:
$$\begin{cases}
\alpha_{t+1} \leftarrow (I + \rho D^T D)^{-1}(y + D^T(\rho\beta_t - \gamma_t))\\
\beta_{t+1} \leftarrow S_\lambda(\rho D\alpha_{t+1} + \gamma_t)/\rho\\
\gamma_{t+1} \leftarrow \gamma_t + \rho(D\alpha_{t+1} - \beta_{t+1}) ,
\end{cases}$$

where A ∈ R d×m , B ∈ R d×n , c ∈ R d , f : R m → R, and g : R n → R are, respectively,
$A = D$, $B = -I$, $c = 0$, $f(\alpha) = \frac{1}{2}\|y - \alpha\|^2$, and $g(\beta) = \lambda\|\beta\|_1$.
To realize the generalized Lasso, we can, for example, construct the following
function.
1 admm = function(y, D, lambda) {
2 K = ncol(D); L = nrow(D)
3 theta.old = rnorm(K); theta = rnorm(K); gamma = rnorm(L); mu = rnorm(L)
4 rho = 1
5 while (max(abs(theta - theta.old) / theta.old) > 0.001) {
6 theta.old = theta
7 theta = solve(diag(K) + rho * t(D) %*% D) %*% (y + t(D) %*% (rho * gamma
- mu))
8 gamma = soft.th(lambda, rho * D %*% theta + mu) / rho
9 mu = mu + rho * (D %*% theta - gamma)
10 }
11 return(theta)
12 }

Fig. 4.9 Execution of Example 44. We obtain the solution path of the CGH data in Example 35

Example 44 We apply the CGH data in Example 35 to the ADMM to obtain the
solution path of the primary problem (Fig. 4.9). For the ADMM, unlike the path
algorithm in the previous section, we obtain θ ∈ R N for a specific λ ≥ 0.
1 df = read.table("cgh.txt"); y = df[[1]][101:110]; N = length(y)
2 D = array(dim = c(N - 1, N)); for (i in 1:(N - 1)) {D[i, ] = 0; D[i, i] = 1;
D[i, i + 1] = -1}
3 lambda.seq = seq(0, 0.5, 0.01); M = length(lambda.seq)
4 theta = list(); for (k in 1:M) theta[[k]] = admm(y, D, lambda.seq[k])
5 x.min = min(lambda.seq); x.max = max(lambda.seq)
6 y.min = min(theta[[1]]); y.max = max(theta[[1]])
7 plot(lambda.seq, xlim = c(x.min, x.max), ylim = c(y.min, y.max), type = "n",
8 xlab = "lambda", ylab = "Coefficients", main = "Fused Lasso Solution
Path")
9 for (k in 1:N) {
10 value = NULL; for (j in 1:M) value = c(value, theta[[j]][k])
11 lines(lambda.seq, value, col = k)
12 }

It is known that the ADMM satisfies the following theoretical properties.


Proposition 14 Suppose that the functions f, g are convex, that f is differentiable,
and that the ranks of A, B are m, n. If the optimum solution (α∗ , β∗ ) of (4.21), (4.22)
exists, the sequences {αt }, {βt }, {γt } satisfy the following properties:
1. The sequence { pt } with pt := f (αt ) + g(βt ) converges to the optimum p∗ .
2. The sequences {αt }, {βt } respectively converge to the unique values α∗ , β∗ .
3. The sequence {γt } converges to a unique γ∗ that is the dual solution of (4.21),
(4.22).
For the proof of Proposition 14, see the paper by Joao F. C. Mota, 2011 [21]. Similar
proofs can be found in the following: S. Boyd et al., Distributed optimization and
statistical learning via the alternating direction method of multipliers (2011) [5] does not
assume the full rank of A, B and does not prove the second property. D. Bertsekas et al.,
Convex analysis and optimization (2003) [4] proves the second property, assuming
that B is the unit matrix. The proofs do not contain anything complicated, although the
derivation is lengthy. In particular, S. Boyd (2011) [5] can be easily understood.

Appendix: Proof of Proposition

Proposition 10 (N. Johnson, 2013 [14]) The algorithm that computes the fused
Lasso via dynamic programming completes in O(N ).

Proof In the following, for i = 1, . . . , N − 1, we consider how to obtain θi = θi∗
when we know that θi+1 = θ∗i+1 is optimum. For the quantity
$$\frac{1}{2}\sum_{j=1}^{i-1}\{y_j - \hat{\theta}_j(\theta_i)\}^2 + \frac{1}{2}(y_i - \theta_i)^2 + \lambda\sum_{j=1}^{i-2}|\hat{\theta}_j(\theta_i) - \hat{\theta}_{j+1}(\theta_i)| + \lambda|\hat{\theta}_{i-1}(\theta_i) - \theta_i|$$


that is obtained by removing the last term λ|θi − θ∗i+1 | in
$$\begin{aligned}
&h_i(\hat{\theta}_1(\theta_i), \ldots, \hat{\theta}_{i-1}(\theta_i), \theta_i, \theta^*_{i+1})\\
&= \frac{1}{2}\sum_{j=1}^{i-1}\{y_j - \hat{\theta}_j(\theta_i)\}^2 + \frac{1}{2}(y_i - \theta_i)^2 + \lambda\sum_{j=1}^{i-2}|\hat{\theta}_j(\theta_i) - \hat{\theta}_{j+1}(\theta_i)| + \lambda|\hat{\theta}_{i-1}(\theta_i) - \theta_i| + \lambda|\theta_i - \theta^*_{i+1}| ,
\end{aligned}$$

we differentiate by θi to define


$$g_i(\theta_i) := -\sum_{j=1}^{i-1}\frac{d\hat{\theta}_j(\theta_i)}{d\theta_i}\{y_j - \hat{\theta}_j(\theta_i)\} - (y_i - \theta_i) + \lambda\sum_{j=1}^{i-2}\frac{d}{d\theta_i}|\hat{\theta}_j(\theta_i) - \hat{\theta}_{j+1}(\theta_i)| + \lambda\frac{d}{d\theta_i}|\hat{\theta}_{i-1}(\theta_i) - \theta_i| .$$

As we notice later, we will have either θ̂ j (θ) = θ + a or θ̂ j (θ) = b for constants
a, b, which means that $d\hat{\theta}_j(\theta_i)/d\theta_i$ is either zero or one for j = 1, . . . , i − 1. For
the other terms that contain absolute values, we cannot differentiate, so we apply the
subderivative.
Thus, for each of the cases, we can obtain θi∗ = θ̂i (θi+1 ∗
):
1. For θi > θ∗i+1 , if we differentiate λ|θi − θ∗i+1 | by θi , it becomes λ. Then, the θi
value does not depend on θ∗i+1 . We let the solution θi of
$$\frac{d}{d\theta_i} h_i(\hat{\theta}_1(\theta_i), \ldots, \hat{\theta}_{i-1}(\theta_i), \theta_i, \theta^*_{i+1}) = g_i(\theta_i) + \lambda = 0$$
be θ̂i (θ∗i+1 ) := L i .

2. For θi < θ∗i+1 , if we differentiate λ|θi − θ∗i+1 | by θi , it becomes −λ. Then, the θi
value does not depend on θ∗i+1 . We let the solution θi of
$$\frac{d}{d\theta_i} h_i(\hat{\theta}_1(\theta_i), \ldots, \hat{\theta}_{i-1}(\theta_i), \theta_i, \theta^*_{i+1}) = g_i(\theta_i) - \lambda = 0$$
be θ̂i (θ∗i+1 ) := Ui .
3. For the other cases, i.e., for L i < θ∗i+1 < Ui , we have θ̂i (θ∗i+1 ) = θ∗i+1 .
For example, if i = 1, then we have g1 (θ1 ) = θ1 − y1 . If θ1 > θ2∗ , then solving
g1 (θ1 ) + λ = 0, we have θ̂1 (θ2∗ ) = y1 − λ = L 1 . If θ1 < θ2∗ , then solving g1 (θ1 ) −
λ = 0, we have θ̂1 (θ2∗ ) = y1 + λ = U1 . Furthermore, if L 1 < θ2∗ < U1 , we have
θ̂1 (θ2∗ ) = θ2∗ . Thus, we have

L 1 = θ̂1 (θ2∗ ) > θ2∗ =⇒ λ|θ2∗ − θ1 (θ2∗ )| = −λ(θ2∗ − L 1 )


U1 = θ̂1 (θ2∗ ) < θ2∗ =⇒ λ|θ2∗ − θ1 (θ2∗ )| = λ(θ2∗ − U1 )
L 1 ≤ θ2∗ ≤ U1 ⇐⇒ θ̂1 (θ2∗ ) = θ2∗ ,

which means that



$$\hat{\theta}_1(\theta_2) := \begin{cases} L_1 = y_1 - \lambda, & \theta_2 < L_1 \\ \theta_2, & L_1 \le \theta_2 \le U_1 \\ U_1 = y_1 + \lambda, & U_1 < \theta_2 . \end{cases}$$

Therefore, we can write g2 (θ2 ) as follows:

$$\begin{aligned}
g_2(\theta_2) &= \frac{d}{d\theta_2}\left[\frac{1}{2}(y_1 - \hat{\theta}_1(\theta_2))^2 + \frac{1}{2}(y_2 - \theta_2)^2 + \lambda|\hat{\theta}_1(\theta_2) - \theta_2|\right]\\
&= \begin{cases} \theta_2 - y_2 - \lambda, & \theta_2 < L_1 \\ 2\theta_2 - y_2 - y_1, & L_1 \le \theta_2 \le U_1 \\ \theta_2 - y_2 + \lambda, & U_1 < \theta_2 \end{cases}\\
&= g_1(\theta_2)I[L_1 \le \theta_2 \le U_1] + \lambda I[\theta_2 > U_1] - \lambda I[\theta_2 < L_1] + \theta_2 - y_2 ,
\end{aligned}$$

where if the condition A is true, then I [A] = 1; otherwise, I [A] = 0. Similarly, we


solve
$$0 = \frac{d}{d\theta_2} h_2(\hat{\theta}_1(\theta_2), \theta_2, \theta_3^*) = \begin{cases} g_2(\theta_2) + \lambda, & \theta_2 > \theta_3^* \\ g_2(\theta_2) - \lambda, & \theta_2 < \theta_3^* \end{cases}$$

and we obtain L 2 , U2 . Moreover, for i = 2, . . . , N , we have

gi (θi ) = gi−1 (θi )I [L i−1 ≤ θi ≤ Ui−1 ] + λI [θi > Ui−1 ] (4.24)


− λI [θi < L i−1 ] + θi − yi .

On the other hand, gi (θi ) is a piecewise linear function with nonnegative slopes, and the
knots are
L 1 , . . . , L i−1 , U1 , . . . , Ui−1 .

Then, the first three terms of gi (θi ) are respectively between −λ and λ, and the
solution of gi (θi ) ± λ = 0 ranges over yi − 2λ ≤ θi ≤ yi + 2λ. In fact, from (4.24),
we have
$$g_i(\theta_i) \begin{cases} > \lambda, & \theta_i > y_i + 2\lambda \\ < -\lambda, & \theta_i < y_i - 2\lambda . \end{cases}$$

There are at most 2i knots containing the two, which we express as x1 < · · · < x2i .
If gi (xk ) + λ ≤ 0 and gi (xk+1 ) + λ ≥ 0, then

$$L_i := x_k + (x_{k+1} - x_k)\frac{|g_i(x_k) + \lambda|}{|g_i(x_k) + \lambda| + |g_i(x_{k+1}) + \lambda|}$$

is the solution. Similarly, we let Ui be the θi such that gi (θi ) − λ = 0. In particular,
for i = N , there is no θ N +1 , and no last term exists as in h 2 , . . . , h N −1 . Thus, we
obtain θ N such that

$$0 = \frac{d}{d\theta_N} h_N(\hat{\theta}_1(\theta_N), \ldots, \hat{\theta}_{N-1}(\theta_N), \theta_N, \theta_N) = g_N(\theta_N)$$

rather than g N (θ N ) ± λ = 0.
Finally, we evaluate the efficiency of the procedure. If

x1 < · · · < xr ≤ L i−1 < xr +1 < · · · < xs−1 < Ui−1 ≤ xs < · · · < x2i ,

we may exclude x1 , . . . , xr and xs , . . . , x2i from the search of L i , Ui . In fact, from


(4.24), we find that the solution of gi (θi ) ± λ = 0 does not depend on U j above Ui−1
and L j below L i−1 , j = 1, . . . , i − 2, which means that gi is constructed without
them and that yi − 2λ, L i−1 , xr +1 , . . . , xs−1 , Ui−1 , yi + 2λ are the knots.
Moreover, the knots used to search for L i , Ui are used to search for L j , U j ( j =
i + 1, . . . , N − 1). However, if we remove a knot, then it will not be used for future
searches. In addition, at most four knots are added each time, which means that at
most 4N knots are added in total. Thus, the number of knots that are excluded is
at most 4N in total as well if we search for L i , Ui from the outside. Hence, the
computation is proportional to at most N .
Finally, if we apply the following step
$$\theta_i^* := \begin{cases} U_i, & \theta_{i+1}^* > U_i \\ \theta_{i+1}^*, & L_i \le \theta_{i+1}^* \le U_i \\ L_i, & \theta_{i+1}^* < L_i \end{cases}$$

for i = N − 1, . . . , 1, we obtain $\{\theta_i^*\}_{i=1}^N$. 

Proposition 11 (Friedman et al., 2007 [10]) Let θ(0) be the solution θ ∈ R N under
μ = 0. Then, the solution θ = θ(μ) under μ > 0 is given by SμN (θ(0)), where SμN :
R N → R N is such that the i = 1, . . . , N -th element is Sμ (z i ) for z = [z 1 , . . . , z N ]T
(see (1.11)).

Proof Differentiating (4.2) by θi (i = 2, . . . , N − 1) and equating it to zero, we


have

−yi + θi (μ) + μ∂|θi | + λ∂|θi − θi−1 | + λ∂|θi − θi+1 | = 0 (i = 2, . . . , N − 1) .

Thus, it is sufficient to show that for μ ≥ 0 and each i = 2, . . . , N − 1, there exist


$$s_i(\mu) = \begin{cases} 1, & \theta_i > 0 \\ [-1,1], & \theta_i = 0 \\ -1, & \theta_i < 0 \end{cases}, \quad t_i(\mu) = \begin{cases} 1, & \theta_i > \theta_{i-1} \\ [-1,1], & \theta_i = \theta_{i-1} \\ -1, & \theta_i < \theta_{i-1} \end{cases}, \quad u_i(\mu) = \begin{cases} 1, & \theta_i > \theta_{i+1} \\ [-1,1], & \theta_i = \theta_{i+1} \\ -1, & \theta_i < \theta_{i+1} \end{cases}$$

such that

f i (μ) := −yi + θi (μ) + μsi (μ) + λti (μ) + λu i (μ) = 0 (4.25)


θi (μ) = Sμ (θi (0)) , (4.26)

where we write θi (μ) simply as θi .


Notably, we may assume that (si (0), ti (0), u i (0)) exists such that f i (0) = 0. In
fact, θi (0) satisfies f i (0) = 0, and

f i (0) = −yi + θi (0) + λti (0) + λu i (0) = 0 . (4.27)

Moreover, since from the definition in (4.26) we observe that Sμ (x) is monotonically
non-decreasing in x ∈ R, we have

θi (0) = θ j (0) =⇒ θi (μ) = Sμ (θi (0)) = Sμ (θ j (0)) = θ j (μ)


θi (0) < θ j (0) =⇒ θi (μ) = Sμ (θi (0)) ≤ Sμ (θ j (0)) = θ j (μ)

for μ ≥ 0 and each i, j = 1, . . . , N (i = j). Hence,



$$t_i(0) = \begin{cases} 1, & \theta_i(0) > \theta_{i-1}(0) \\ [-1,1], & \theta_i(0) = \theta_{i-1}(0) \\ -1, & \theta_i(0) < \theta_{i-1}(0) \end{cases}
\;\Longrightarrow\;
t_i(\mu) = \begin{cases} 1 \text{ or } [-1,1], & \theta_i(0) > \theta_{i-1}(0) \\ [-1,1], & \theta_i(0) = \theta_{i-1}(0) \\ -1 \text{ or } [-1,1], & \theta_i(0) < \theta_{i-1}(0) . \end{cases}$$

Hence, if we regard ti (0), ti (μ) as a set similar to {1}, [−1, 1], {−1}, we have

ti (0) ⊆ ti (μ) . (4.28)

Similarly, we have
u i (0) ⊆ u i (μ) . (4.29)

From (4.26), we have θi (μ) = θi (0) ± μ and θi (μ) = 0 for the i such that |θi (0)| >
μ and for the i such that |θi (0)| ≤ μ, respectively.
For the first case, we have

θi (0) > μ ⇐⇒ θi (μ) = θi (0) − μ ⇐⇒ si (0) = 1


.
θi (0) < −μ ⇐⇒ θi (μ) = θi (0) + μ ⇐⇒ si (0) = −1

Moreover, from si (0) ⊆ si (μ), we may assume si (μ) = si (0). From (4.28), (4.29),
if we substitute the ti (0), u i (0) values into ti (μ), u i (μ), the relation holds. Hence,
substituting θi (μ) = θi (0) − si (0)μ and (si (μ), ti (μ), u i (μ)) = (si (0), ti (0), u i (0))
into (4.25), from (4.25), (4.27), we have

f i (μ) = −yi + θi (0) − si (0)μ + si (0)μ + λti (0) + λu i (0) = 0 .

For the latter case, the subderivative of si (0) should be [−1, 1] for the i such
that |θi (0)| ≤ μ. Thus, if we input si (μ) = θi (0)/μ ∈ [−1, 1], si (μ) becomes a
subdifferential of |θi (μ)|. Also from (4.28), (4.29), the relation holds even if we
substitute the ti (0), u i (0) values into ti (μ), u i (μ). Thus, substituting θi (μ) = 0
and (si (μ), ti (μ), u i (μ)) = (θi (0)/μ, ti (0), u i (0)) f i (μ) = 0, from (4.25), (4.27), we
have
θi (0)
f i (μ) = −yi + μ · + λti (0) + λu i (0) = 0 .
μ

Proposition 12 For X ∈ R N × p , y ∈ R N , and D ∈ Rm× p (m ≥ 1), if the prime prob-


lem finds the β ∈ R p that minimizes

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|D\beta\|_1 , \qquad (4.14)$$
then the dual problem finds the α ∈ R m that minimizes
$$\frac{1}{2}\|X(X^TX)^{-1}X^Ty - X(X^TX)^{-1}D^T\alpha\|_2^2 \qquad (4.15)$$
under ‖α‖∞ ≤ λ, where we assume that X T X is nonsingular. If we obtain the solution
α̂ of α, we obtain β as β̂ = (X T X )−1 (X T y − D T α̂).
Proof: If we let D ∈ Rm× p , γ = Dβ, then the primary problem can be regarded as
the minimization of

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\gamma\|_1$$
w.r.t. β ∈ R p . Introducing the Lagrange multiplier α, we have

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\gamma\|_1 + \alpha^T(D\beta - \gamma) ,$$
and if we take the minimization of this over β, γ, then we have

$$\min_{\beta}\left\{\frac{1}{2}\|y - X\beta\|_2^2 + \alpha^T D\beta\right\}
= \frac{1}{2}\|y - X(X^TX)^{-1}(X^Ty - D^T\alpha)\|_2^2 + \alpha^T D(X^TX)^{-1}(X^Ty - D^T\alpha)$$

$$\min_{\gamma}\{\lambda\|\gamma\|_1 - \alpha^T\gamma\} = \begin{cases} 0, & \|\alpha\|_\infty \le \lambda \\ -\infty, & \text{otherwise} \end{cases} .$$

From −X^T(y − Xβ) + D^Tα = 0, we have β = (X^TX)^{-1}(X^Ty − D^Tα). Thus, the minimum value can be written as

$$\frac{1}{2}\|(I - X(X^TX)^{-1}X^T)y + X(X^TX)^{-1}D^T\alpha\|_2^2 + \alpha^T D(X^TX)^{-1}(X^Ty - D^T\alpha)$$
$$\sim \frac{1}{2}\{X(X^TX)^{-1}D^T\alpha\}^T X(X^TX)^{-1}D^T\alpha + \alpha^T D(X^TX)^{-1}X^Ty - \alpha^T D(X^TX)^{-1}D^T\alpha$$
$$\sim -\frac{1}{2}\|X(X^TX)^{-1}X^Ty - X(X^TX)^{-1}D^T\alpha\|_2^2 ,$$

where A ∼ B denotes the equivalence relation in which A − B is a constant that does not depend on α.


The dual problem amounts to the maximization of this value, i.e., the minimization of

$$\frac{1}{2}\|X(X^TX)^{-1}X^Ty - X(X^TX)^{-1}D^T\alpha\|_2^2$$

under ‖α‖∞ ≤ λ. □
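The primal–dual relation can be checked numerically for the special case X = I (the fused Lasso), where β̂ = y − D^Tα̂. The sketch below is only an illustration: it solves the box-constrained dual with the quadprog package and compares the recovered primal solution with genlasso; the package choice and the particular λ are assumptions made for this example.

```r
# Illustrative check of the dual formulation when X = I (fused Lasso).
library(quadprog); library(genlasso)
set.seed(1)
p <- 8; y <- rnorm(p)
m <- p - 1; D <- matrix(0, m, p)
for (i in 1:m) {D[i, i] <- 1; D[i, i + 1] <- -1}
out <- fusedlasso1d(y)
lambda <- max(out$lambda) / 2               # a value inside the computed path
# Dual: minimize (1/2) || y - t(D) %*% alpha ||^2 subject to ||alpha||_inf <= lambda
Dmat <- D %*% t(D); dvec <- D %*% y
Amat <- cbind(diag(m), -diag(m)); bvec <- rep(-lambda, 2 * m)
alpha <- solve.QP(Dmat, dvec, Amat, bvec)$solution
theta.dual   <- y - t(D) %*% alpha          # primal solution recovered from the dual
theta.primal <- coef(out, lambda = lambda)$beta
max(abs(theta.dual - theta.primal))         # should be close to zero
```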

Exercises 47–61

47. Execute the following fused Lasso procedures.


(a) We wish to obtain the smooth output θ = (θ1 , . . . , θ N ) that minimizes

$$\frac{1}{2}\sum_{i=1}^N (y_i - \theta_i)^2 + \lambda \sum_{i=2}^N |\theta_i - \theta_{i-1}| \qquad (\text{cf. } (4.1)) \qquad (4.30)$$

from the observations y = (y1 , . . . , y N ), where λ ≥ 0 is a parameter that


controls the size of the output. The following procedure smooths the data
of the copy number ratio of a gene in the chromosomal region between

case and control patients that are measured by CGH (comparative genomic
hybridization). Download the data from
https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/cgh.html
fill in the blank, and execute the procedure to observe what smoothing pro-
cedure can be obtained for λ = 0.1, 1, 10, 100, 1000.
1 library(genlasso)
2 df = read.table("cgh.txt"); y = df[[1]]; N = length(y)
3 theta = ## Blank ##
4 plot(1:N, theta, lambda = 0.1, xlab = "gene #", ylab = "Copy Number Variant",
5      col = "red", type = "l")
6 points(1:N, y, col = "blue")

(b) The fused Lasso can consider the differences w.r.t. the second-order θi −
2θi+1 + θi+2 , w.r.t. the third-order θi − 3θi+1 + 3θi+2 − θi+3 , etc., as well as
w.r.t. the first order θi − θi+1 (trend filtering). Using the function
trendfilter prepared in the genlasso package, we wish to execute
trend filtering for the data sin θ (0 ≤ θ ≤ 2π) with random noise added.
Determine the value of λ appropriately, and then execute trendfilter
of the k = 1, 2, 3, 4-th order to show the graph similarly to (a).
1 library(genlasso)
2 ## Data Generation
3 N = 100; y = sin((1:N) / N * 2 * pi) + rnorm(N, sd = 0.3)
4 out = ## Blank ##
5 plot(out, lambda = lambda) ## Smoothing and Output

48. We wish to obtain the θ1 , . . . , θ N that minimize (4.30) via dynamic programming
from the observations y1 , . . . , y N .
(a) The condition of minimization w.r.t. θ1

1
h 1 (θ1 , θ2 ) := (y1 − θ1 )2 + λ|θ2 − θ1 |
2
contains another variable θ2 . The optimum θ1 when the θ2 value is known
can be written as

$$\hat\theta_1(\theta_2) = \begin{cases} y_1 - \lambda, & y_1 \ge \theta_2 + \lambda \\ \theta_2, & |y_1 - \theta_2| < \lambda \\ y_1 + \lambda, & y_1 \le \theta_2 - \lambda \end{cases} .$$

The condition of minimization w.r.t. θ2

$$\frac{1}{2}(y_1 - \theta_1)^2 + \frac{1}{2}(y_2 - \theta_2)^2 + \lambda|\theta_2 - \theta_1| + \lambda|\theta_3 - \theta_2|$$

contains other variables θ1 , θ3 . However, if we replace θ1 with θ̂1 (θ2 ), it can


be written as

$$h_2(\hat\theta_1(\theta_2), \theta_2, \theta_3) := \frac{1}{2}(y_1 - \hat\theta_1(\theta_2))^2 + \frac{1}{2}(y_2 - \theta_2)^2 + \lambda|\theta_2 - \hat\theta_1(\theta_2)| + \lambda|\theta_3 - \theta_2| ,$$

whose minimizer over θ2 can be written as a function θ̂2(θ3) of θ3 when the θ3 value is given. Show that the problem of finding θ1, . . . , θN reduces to finding the θN value that minimizes
$$\frac{1}{2}\sum_{i=1}^{N-1}(y_i - \hat\theta_i(\theta_N))^2 + \frac{1}{2}(y_N - \theta_N)^2 + \lambda\sum_{i=1}^{N-2}|\hat\theta_i(\theta_N) - \hat\theta_{i+1}(\theta_N)| + \lambda|\hat\theta_{N-1}(\theta_N) - \theta_N| \qquad (\text{cf. } (4.4)) .$$
(b) When we obtain θ N in (a), how can we obtain the θ1 , . . . , θ N −1 values?
49. When we solve the fused Lasso via dynamic programming, the computation
time is proportional to the length N of y ∈ R N . Explain the fundamental reason
for this fact.
50. When we find the value of θi that minimizes

$$\frac{1}{2}\sum_{i=1}^N (y_i - \theta_i)^2 + \lambda\sum_{i=1}^N |\theta_i| + \mu\sum_{i=2}^N |\theta_i - \theta_{i-1}| \qquad (\text{cf. } (4.2)) , \qquad (4.31)$$

without loss of generality, it is sufficient to obtain only {θi } under λ = 0. In fact,


for the solution θ(0) when λ = 0, Sλ (θ(0)) is the solution for general λ. In the
following, for y1 , y2 , . . . , y N −1 , y N ∈ R, λ, μ ≥ 0, we show the existence of
$$s_i(\lambda) = \begin{cases} 1, & \theta_i > 0 \\ [-1,1], & \theta_i = 0 \\ -1, & \theta_i < 0 \end{cases}, \quad
t_i(\lambda) = \begin{cases} 1, & \theta_i > \theta_{i-1} \\ [-1,1], & \theta_i = \theta_{i-1} \\ -1, & \theta_i < \theta_{i-1} \end{cases}, \quad
u_i(\lambda) = \begin{cases} 1, & \theta_i > \theta_{i+1} \\ [-1,1], & \theta_i = \theta_{i+1} \\ -1, & \theta_i < \theta_{i+1} \end{cases}$$

such that

f i (λ) := −yi + θi (λ) + λsi (λ) + μti (λ) + μu i (λ) = 0 (i = 2, . . . , N − 1)


(cf. (4.25)),

θi (λ) = Sλ (θi (0)) (cf. (4.26))

where we assume that (si (0), ti (0), u i (0)) exists such that f i (0) = 0.
(a) Show that −yi + θi (0) + μti (0) + μu i (0) = 0.
(b) For i, j = 1, . . . , N (i ≠ j), show that θi(0) = θj(0) =⇒ θi(λ) = θj(λ) and θi(0) < θj(0) =⇒ θi(λ) ≤ θj(λ).
(c) If we regard the solution ti (0), ti (λ) as the set ({1}, [−1, 1], {−1}), show that
ti (0) ⊆ ti (λ).

(d) When |θi (0)| ≥ λ, show that (si (λ), ti (λ), u i (λ)) = (si (0), ti (0), u i (0)) is
the solution of f i (λ) = 0.
(e) When |θi (0)| < λ, show that (si (λ), ti (λ), u i (λ)) = (θi (0)/λ, ti (0), u i (0))
is the solution of f i (λ) = 0.
(f) Let θ̂i (λ1 , λ2 ) (i = 1, . . . , N ) be the θi that minimizes (4.31). Show that
θ̂i (λ1 , λ2 ) = Sλ1 (θ̂i (0, λ2 )).
51. When we minimize (4.31) based on Exercise 50, what process is required to obtain the solutions for λ ≠ 0 after obtaining the solution for λ = 0? Specify the R language function.
52. In linear regression Lasso, there exists a smallest λ such that no variables are active. Let λ0 be such a λ. In LARS, it is the maximum value of |⟨r0, xj⟩| when r0 := y. Let S be the set of indices of the active variables. Initially, we have S = {j}. In LARS, if we decrease the λ value, only the coefficients of the active variables increase. Show that if an index j becomes active at λ′, then

$$\langle x^{(j)}, r(\lambda)\rangle = \pm\lambda$$

for λ ≤ λ′, where r(λ) is the residual for λ and x^{(j)} expresses the j-th column of the matrix X.
53. We construct the LARS function as follows. Fill in the blanks, and execute the
procedure.
1 lars = function(X, y) {
2 X = as.matrix(X); n = nrow(X); p = ncol(X); X.bar = array(dim = p)
3 for (j in 1:p) {X.bar[j] = mean(X[, j]); X[, j] = X[, j] - X.bar[j]}
4 y.bar = mean(y); y = y - y.bar
5 scale = array(dim = p)
6 for (j in 1:p) {scale[j] = sqrt(sum(X[, j] ^ 2) / n); X[, j] = X[, j] / scale[j]}
7 beta = matrix(0, p + 1, p); lambda = rep(0, p + 1)
8 for (i in 1:p) {
9 lam = abs(sum(X[, i] * y))
10 if (lam > lambda[1]) {i.max = i; lambda[1] = lam}
11 }
12 r = y; index = i.max; Delta = rep(0, p)
13 for (k in 2:p) {
14 Delta[index] = solve(t(X[, index]) %*% X[, index]) %*%
15 t(X[, index]) %*% r / lambda[k - 1]
16 u = t(X[, -index]) %*% (r - lambda[k - 1] * X %*% Delta)
17 v = -t(X[, -index]) %*% (X %*% Delta)
18 t = ## Blank(1) ##
19 for (i in 1:(p - k + 1)) if (t[i] > lambda[k]) {lambda[k] = t[i]; i.max = i}
20 t = u / (v - 1)
21 for (i in 1:(p - k + 1)) if (t[i] > lambda[k]) {lambda[k] = t[i]; i.max = i}
22 j = setdiff(1:p, index)[i.max]
23 index = c(index, j)
24 beta[k, ] = ## Blank(2) ##
25 r = y - X %*% beta[k, ]
26 }
27 for (k in 1:(p + 1)) for (j in 1:p) beta[k, j] = beta[k, j] / scale[j]

28 return(list(beta = beta, lambda = lambda))


29 }
30 df = read.table("crime.txt"); X = as.matrix(df[, 3:7]); y = df[, 1]
31 res = lars(X, y)
32 beta = res$beta; lambda = res$lambda
33 p = ncol(beta)
34 plot(0:8000, ylim = c(-7.5, 15), type = "n",
35 xlab = "lambda", ylab = "beta", main = "LARS(USA Crime Data)")
36 abline(h = 0); for (j in 1:p)
37 lines(lambda[1:(p)], beta[1:(p), j], col = j)
38 legend("topright",
39 legend = c("annual police funding", "people 25 years+ with 4 yrs. of high school",
40            "16--19 year-olds not in highschool and not highschool graduates",
41            "18--24 year-olds in college",
42            "people 25 years+ with at least 4 years of college"),
43 col = 1:p, lwd = 2, cex = .8)

54. From fused Lasso, we consider the extended formulation (generalized Lasso):
for X ∈ R N × p , y ∈ R N , β ∈ R p , and D ∈ Rm× p (m ≤ p), we minimize

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|D\beta\|_1 . \qquad (4.32)$$
Why is the ordinary linear regression Lasso a particular case of (4.32)? How
about the ordinary fused Lasso? What D should be given for the two cases
below (trend filtering)?

$$\text{i.}\quad \frac{1}{2}\sum_{i=1}^N (y_i - \theta_i)^2 + \sum_{i=1}^{N-2}|\theta_i - 2\theta_{i+1} + \theta_{i+2}|$$

$$\text{ii.}\quad \frac{1}{2}\sum_{i=1}^N (y_i - \theta_i)^2 + \sum_{i=1}^{N-2}\left|\frac{\theta_{i+2} - \theta_{i+1}}{x_{i+2} - x_{i+1}} - \frac{\theta_{i+1} - \theta_i}{x_{i+1} - x_i}\right| \qquad (\text{cf. } (4.13)) \qquad (4.33)$$

55. Derive the dual problem from each of the primary problems below.
(a) For X ∈ R^{N×p}, y ∈ R^N, and λ > 0, find the β ∈ R^p that minimizes
$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \qquad (\text{cf. } (4.7))$$
(Primary). Under ‖X^Tα‖∞ ≤ λ, find the α ∈ R^N that maximizes
$$\frac{1}{2}\{\|y\|_2^2 - \|y - \alpha\|_2^2\}$$
(Dual).
(b) For λ ≥ 0 and D ∈ R^{m×N}, find the θ ∈ R^N that minimizes
$$\frac{1}{2}\|y - \theta\|_2^2 + \lambda\|D\theta\|_1$$
(Primary). Under ‖α‖∞ ≤ λ, find the α ∈ R^m that minimizes
$$\frac{1}{2}\|y - D^T\alpha\|_2^2$$
(Dual).
(c) For X ∈ R^{N×p}, y ∈ R^N, and D ∈ R^{m×p} (m ≥ 1), find the β ∈ R^p that minimizes
$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|D\beta\|_1$$
(Primary). Under ‖α‖∞ ≤ λ, find the α ∈ R^m that minimizes
$$\frac{1}{2}\|X(X^TX)^{-1}X^Ty - X(X^TX)^{-1}D^T\alpha\|_2^2 \qquad (\text{cf. } (4.15))$$
(Dual).
56. Suppose that we solve fused Lasso as a dual problem. Let λ1 and i1 be the largest absolute value and the corresponding index i among the elements of the solution α̂(1) that minimizes ½‖y − D^Tα‖²₂. For k = 2, . . . , m, we execute the following procedure to define
the sequences {λk }, {i k }, {sk }. For S := {i 1 , . . . , i k−1 }, let λk and i k be the largest
absolute value and its element i among the elements of the solution α−S that
minimizes
$$\frac{1}{2}\|y - \lambda D_S^T s - D_{-S}^T\alpha_{-S}\|_2^2 \qquad (\text{cf. } (4.19)) .$$

If α̂ik = λk , then sik = 1 and sik = −1 otherwise, where DS ∈ Rk× p , αS ∈ Rk


consists of the rows that correspond to S and D−S ∈ R(m−k)× p , α−S ∈ Rm−k
consists of the other rows. If we differentiate it by α−S to minimize it, the solution
is
$$\alpha_k(\lambda) := \{D_{-S}D_{-S}^T\}^{-1}D_{-S}(y - \lambda D_S^T s) \qquad (\text{cf. } (4.20)) .$$

If we set the i ∉ S-th element as ai − λbi, what are ai, bi? Moreover, how can
we obtain i k , λk ?
57. We construct a program that obtains the solutions of the dual and primary fused
Lasso problems. Fill in the blanks, and execute the procedure.
1 fused.dual = function(y, D) {
2 m = nrow(D)
3 lambda = rep(0, m); s = rep(0, m); alpha = matrix(0, m, m)
4 alpha[1, ] = solve(D %*% t(D)) %*% D %*% y
5 for (j in 1:m) if (abs(alpha[1, j]) > lambda[1]) {
6 lambda[1] = abs(alpha[1, j])
7 index = j
8 if (alpha[1, j] > 0) ## Blank(1) ##
9 }

10 for (k in 2:m) {
11 U = solve(D[-index, ] %*% t(as.matrix(D[-index, , drop = FALSE])))
12 V = D[-index, ] %*% t(as.matrix(D[index, , drop = FALSE]))
13 u = U %*% D[-index, ] %*% y
14 v = U %*% V %*% s[index]
15 t = u / (v + 1)
16 for (j in 1:(m - k + 1)) if (t[j] > lambda[k]) {lambda[k] = t[j]; h = j; r = 1}
17 t = u / (v - 1)
18 for (j in 1:(m - k + 1)) if (t[j] > lambda[k]) {lambda[k] = t[j]; h = j; r = -1}
19 alpha[k, index] = ## Blank(2) ##
20 alpha[k, -index] = ## Blank(3) ##
21 h = setdiff(1:m, index)[h]
22 if (r == 1) s[h] = 1 else s[h] = -1
23 index = c(index, h)
24 }
25 return(list(alpha = alpha, lambda = lambda))
26 }
27 m = p - 1; D = matrix(0, m, p); for (i in 1:m) {D[i, i] = 1; D[i, i + 1] = -1}
28 fused.prime = function(y, D){
29 res = fused.dual(y, D)
30 return(list(beta = t(y - t(D) %*% t(res$alpha)), lambda = res$lambda))
31 }
32 p = 8; y = sort(rnorm(p)); m = p - 1; s = 2 * rbinom(m, 1, 0.5) - 1
33 D = matrix(0, m, p); for (i in 1:m) {D[i, i] = s[i]; D[i, i + 1] = -s[i]}
34 par(mfrow = c(1, 2))
35 res = fused.dual(y, D); alpha = res$alpha; lambda = res$lambda
36 lambda.max = max(lambda); m = nrow(alpha)
37 alpha.min = min(alpha); alpha.max = max(alpha)
38 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(alpha.min, alpha.max),
39      type = "n", xlab = "lambda", ylab = "alpha", main = "Dual Problem")
40 u = c(0, lambda); v = rbind(0, alpha); for (j in 1:m) lines(u, v[, j], col = j)
41 res = fused.prime(y, D); beta = res$beta
42 beta.min = min(beta); beta.max = max(beta)
43 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(beta.min, beta.max),
44      type = "n", xlab = "lambda", ylab = "beta", main = "Primary Problem")
45 w = rbind(0, beta); for (j in 1:p) lines(u, w[, j], col = j)
46 par(mfrow = c(1, 1))

58. If the design matrix is not the unit matrix for the generalized Lasso, then for X^+ := (X^TX)^{-1}X^T, ỹ := XX^+y, D̃ := DX^+, the problem reduces to minimizing
$$\frac{1}{2}\|\tilde y - \tilde D^T\alpha\|_2^2 .$$
Extend the functions fused.dual and fused.prime to fused.dual.
general and fused.prime.general, and execute the following
procedure.

1 n = 20; p = 10; beta = rnorm(p + 1)


2 X = matrix(rnorm(n * p), n, p); y = cbind(1, X) %*% beta + rnorm(n)
3 # D = diag(p) ## Use either
4 D = array(dim = c(p - 1, p))
5 for (i in 1:(p - 1)) {D[i, ] = 0; D[i, i] = 1; D[i, i + 1] = -1}
6 par(mfrow = c(1, 2))
7 res = fused.dual.general(X, y, D); alpha = res$alpha; lambda = res$
lambda
8 lambda.max = max(lambda); m = nrow(alpha)
9 alpha.min = min(alpha); alpha.max = max(alpha)
10 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(alpha.min, alpha.
max),
11 type = "n", xlab = "lambda", ylab = "alpha", main = "Dual Problem")
12 u = c(0, lambda); v = rbind(0, alpha); for (j in 1:m) lines(u, v[, j],
col = j)
13 res = fused.prime.general(X, y, D); beta = res$beta
14 beta.min = min(beta); beta.max = max(beta)
15 plot(0:lambda.max, xlim = c(0, lambda.max), ylim = c(beta.min, beta.max
),
16 type = "n", xlab = "lambda", ylab = "beta", main = "Primary Problem
")
17 w = rbind(0, beta); for (j in 1:p) lines(u, w[, j], col = j)
18 par(mfrow = c(1, 1))

59. Set A ∈ Rd×m , B ∈ Rd×n , and c ∈ Rd . We assume that f : Rm → R and g :


Rn → R are convex and that f is differentiable. We formulate the problem that
minimizes f (α) + g(β) under Aα + Bβ = c by adding a constant ρ > 0 to the
Lagrange multiplier

f (α) + g(β) + γ T (Aα + Bβ − c) → min (γ ∈ Rd )

to write
$$L_\rho(\alpha, \beta, \gamma) := f(\alpha) + g(\beta) + \gamma^T(A\alpha + B\beta - c) + \frac{\rho}{2}\|A\alpha + B\beta - c\|^2 \qquad (\text{cf. } (4.23)) .$$
If we repeat the updates expressed by the three equations, we can obtain the
optimum solution under some conditions (ADMM, alternating direction method
of multipliers).

⎨ αt+1 ← α ∈ Rm that minimizes L ρ (α, βt , γt )
βt+1 ← β ∈ Rn that minimizes L ρ (αt+1 , β, γt )

γt+1 ← γt + ρ(Aαt+1 + Bβt+1 − c)

Suppose that we apply similar updates to the generalized Lasso:

$$L_\rho(\theta, \gamma, \mu) := \frac{1}{2}\|y - \theta\|_2^2 + \lambda\|\gamma\|_1 + \mu^T(D\theta - \gamma) + \frac{\rho}{2}\|D\theta - \gamma\|^2$$
Answer the following questions.
(a) What are A ∈ Rd×m , B ∈ Rd×n , c ∈ Rd , f : Rm → R, and g : Rn → R?

(b) Show that the updates are given as follows:

$$\begin{cases} \theta_{t+1} \leftarrow (I + \rho D^TD)^{-1}(y + D^T(\rho\gamma_t - \mu_t)) \\ \gamma_{t+1} \leftarrow S_\lambda(\rho D\theta_{t+1} + \mu_t)/\rho \\ \mu_{t+1} \leftarrow \mu_t + \rho(D\theta_{t+1} - \gamma_{t+1}) \end{cases}$$

Hint

$$\frac{\partial L_\rho}{\partial\theta} = \theta - y + D^T\mu_t + \rho D^T(D\theta - \gamma_t)$$

$$\frac{\partial L_\rho}{\partial\gamma} = -\mu_t + \rho(\gamma - D\theta_{t+1}) + \lambda\begin{cases} 1, & \gamma > 0 \\ [-1,1], & \gamma = 0 \\ -1, & \gamma < 0 \end{cases}$$

60. Fill in the blanks below, and construct the function admm that realizes the gen-
eralized Lasso.
1 admm = function(y, D, lambda) {
2 K = ncol(D); L = nrow(D)
3 theta.old = rnorm(K); theta = rnorm(K); gamma = rnorm(L); mu = rnorm(L
)
4 rho = 1
5 while (max(abs(theta - theta.old) / theta.old) > 0.001) {
6 theta.old = theta
7 theta = ## Blank(1) ##
8 gamma = ## Blank(2) ##
9 mu = mu + ## Blank(3) ##
10 }
11 return(theta)
12 }

61. Using Exercise 60, for each of the following cases, fill in the blanks to execute
the procedure.
(a) Lasso in (4.33) (trend filtering). The data can be downloaded from
https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/airPollution.txt

1 df = read.table("airpolution.txt", header = TRUE) ## Vector Input


2 index = order(df[[3]])
3 y = df[[1]][index]; N = length(y)
4 x = df[[3]] + rnorm(N) * 0.01
5 # The original data contains tied integer values, so some perturbation was added
6 x = x[index] ## Setting Matrix D
7 D = matrix(0, ncol = N, nrow = N - 2)
8 for (i in 1:(N - 2)) D[i, ] = 0
9 for (i in 1:(N - 2)) D[i, i] = 1 / (x[i + 1] - x[i])
10 for (i in 1:(N - 2)) D[i, i + 1] = -1 / (x[i + 1] - x[i]) - 1 / (x[i
+ 2] - x[i + 1])
11 for (i in 1:(N - 2)) D[i, i + 2] = ## Blank(1) ##
12 ## Computation of theta

13 theta = ## Blank(2) ##
14 plot(x, theta, xlab = "Temperature (F)", ylab = "Ozon", col = "red",
type = "l")
15 points(x, y, col = "blue")

(b) See the solution path merge in the fused Lasso, where we use the CGH data
in Exercise 47 (a).
1 df = read.table("cgh.txt"); y = df[[1]][101:110]; N = length(y)
2 D = array(dim = c(N - 1, N)); for (i in 1:(N - 1)) {D[i, ] = 0; D[i, i] =
3 1; D[i, i + 1] = -1}
4 lambda.seq = seq(0, 0.5, 0.01); M = length(lambda.seq)
5 theta = list(); for (k in 1:M) theta[[k]] = ## Blank(3) ##
6 x.min = min(lambda.seq); x.max = max(lambda.seq)
7 y.min = min(theta[[1]]); y.max = max(theta[[1]])
8 plot(lambda.seq, xlim = c(x.min, x.max), ylim = c(y.min, y.max),
type = "n",
9 xlab = "lambda", ylab = "Coefficient", main = "Fused Lasso
Solution Path")
10 for (k in 1:N) {
11 value = NULL; for (j in 1:M) value = c(value, theta[[j]][k])
12 lines(lambda.seq, value, col = k)
13 }
Chapter 5
Graphical Models

In this chapter, we examine the problem of estimating the structure of a graphical model from observations. In a graphical model, each vertex is regarded as a variable, and the edges express the dependencies between them (conditional independence). In particular, we assume a so-called sparse situation in which the number of variables is larger than the number of samples. We consider the problem of connecting, as edges, the vertex pairs with at least a certain degree of dependency. In the structural estimation of a graphical model assuming sparsity, the so-called undirected graph, with no edge orientation, is often used.
In this chapter, we first introduce the theory of graphical models, especially the
concept of conditional independence and the separation of undirected graphs. Then,
we learn the algorithms of graphical Lasso, structural estimation of the graphical
model using the pseudo-likelihood, and joint graphical Lasso.

5.1 Graphical Models

The graphical model dealt with in this book is related to undirected graphs. For
p ≥ 1, we define V := {1, . . . , p}, and E is a subset of {{i, j} | i ≠ j, i, j ∈ V}.¹ In particular, V is called a vertex set, its elements are called vertices, E is called an edge set, and its elements are called edges. The undirected graph consisting of them is written as (V, E). For subsets A, B, C of V, when all paths connecting A and B pass through some vertex of C, C is said to separate A and B, which is written as A ⊥⊥_E B | C (Fig. 5.1).

1 We do not distinguish between {1, 2} and {2, 1}.
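Separation can be checked mechanically: remove the vertices in C and test whether any path between A and B survives. The sketch below is only an illustration (the function name and the example path graph are ours, not from the text) and uses the igraph package, which also appears later in this chapter.

```r
# Check whether C separates A and B in an undirected graph (illustrative sketch).
library(igraph)
separates <- function(g, A, B, C) {
  g2 <- delete_vertices(g, as.character(C))            # remove the separating set
  d <- distances(g2, v = as.character(A), to = as.character(B))
  all(is.infinite(d))                                   # TRUE if no A-B path remains
}
# Path graph 1 - 2 - 3 - 4 - 5: vertex 3 separates {1, 2} from {4, 5}
g <- graph_from_edgelist(cbind(as.character(1:4), as.character(2:5)), directed = FALSE)
separates(g, A = c(1, 2), B = c(4, 5), C = 3)   # TRUE
separates(g, A = 1, B = 3, C = 5)               # FALSE: removing 5 keeps 1 - 2 - 3
```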


The red vertices separate the blue and green vertices.

The red vertices do not separate the blue and green vertices.

Fig. 5.1 Vertex sets A, B are separated by another vertex set. The blue and green vertices are
separated and not separated by red ones above and below the three undirected graphs, respectively.
There is a red vertex on any path connecting green and blue vertices in each of the three above
graphs, while there is at least one path connecting green and blue vertices that do not contain any
red vertex in each of the three below graphs. For example, in the left-below graph, the two left blue
vertices can reach the green vertex without passing through the red vertex. The two blue and two
green vertices communicate with each other without passing through the red vertex in the center-
below graph. In the right-below graph, the two blue and one green vertices communicate without
passing through the red vertex

In the following, we identify p random variables X 1 , . . . , X p and vertices 1, . . . , p


and express their conditional independence properties by the undirected graph
(V, E).
For variables X, Y , we write the probabilities of the events associated with
X, Y, (X, Y ) as P(X ), P(Y ), and P(X, Y ), respectively. We say that X, Y are inde-
pendent and write X ⊥⊥ P Y if P(X )P(Y ) = P(X, Y ). Similarly, we write the prob-
ability of the event associated with (X, Y, Z ) as P(X, Y, Z ). We say that X, Y are
conditionally independent given Z and write X ⊥⊥ P Y | Z if P(X, Z )P(Y, Z ) =
P(X, Y, Z )P(Z ).
A graphical model expresses multiple conditional independence (CI) relations by
the following graph: for disjoint subsets X A , X B , X C of {X 1 , . . . , X p }, the edges are
connected such that

X A ⊥⊥ P X B | X C ⇐⇒ A ⊥⊥ E B | C . (5.1)

However, in general, such an edge set E of (V, E) may not exist for some probabilities
P.
Example 45 Let X, Y be random variables that take zeros and ones equiprobably,
and assume that Z is the residue of X + Y when divided by two. Because (X, Y )
takes four values equiprobably, we have

P(X )P(Y ) = P(X, Y )

and find that X, Y are independent. However, each of (X, Z ), (Y, Z ), (X, Y, Z ) takes
four values equiprobably, and Z takes two values equiprobably. Thus, we have

$$P(X, Z)P(Y, Z) = \frac{1}{4}\cdot\frac{1}{4} \neq \frac{1}{4}\cdot\frac{1}{2} = P(X, Y, Z)P(Z) ,$$
which means that X, Y are not conditionally independent given Z. From the implication ⇐= in (5.1), we find that the graph cannot connect X, Z, Y in this order. If we know the value of Z, then we also know whether X = Y or X ≠ Y, which means that they are not independent given Z. In addition, neither (X, Z) nor (Y, Z) is conditionally independent given the remaining variable, and they should be connected as edges. Hence, we need to connect all three edges, although the graph cannot express X ⊥⊥_P Y. Therefore, we cannot distinguish by any undirected graph between the CI relation in this example and the case where no nontrivial CI relation exists.
However, if we connect all the edges, the implication ⇐= in (5.1) trivially holds.
We say that an undirected graph for probability P is a Markov network of P if the
number of edges is minimal and the implication ⇐= in (5.1) holds.
However, for Gaussian variables, we can obtain the CI properties from the covariance matrix Σ ∈ R^{p×p}. For simplicity, we assume that the means of the variables are zeros. Let Θ be the inverse matrix of Σ, and write the probability density function as
$$f(x) = \sqrt{\frac{\det\Theta}{(2\pi)^p}}\exp\left(-\frac{1}{2}x^T\Theta x\right) \quad (x \in \mathbb{R}^p) .$$

Then, the necessary and sufficient condition for disjoint subsets A, B, C of {1, . . . , p} to satisfy the conditional independence X_A ⊥⊥_P X_B | X_C is

$$f_{A\cup C}(x_{A\cup C}) f_{B\cup C}(x_{B\cup C}) = f_{A\cup B\cup C}(x_{A\cup B\cup C}) f_C(x_C) \qquad (5.2)$$

for arbitrary x ∈ R^p, which is equivalent to

$$-\log\det\Sigma_{A\cup C} - \log\det\Sigma_{B\cup C} + \log\det\Sigma_{A\cup B\cup C} + \log\det\Sigma_C$$
$$= x_{A\cup C}^T\Theta_{A\cup C}x_{A\cup C} + x_{B\cup C}^T\Theta_{B\cup C}x_{B\cup C} - x_{A\cup B\cup C}^T\Theta_{A\cup B\cup C}x_{A\cup B\cup C} - x_C^T\Theta_C x_C \qquad (5.3)$$

for arbitrary x ∈ R^p, where x_S and Σ_S are the elements of x and Σ that correspond to a set S ⊆ {1, . . . , p}, and Θ_S := (Σ_S)^{-1} is the precision matrix of the marginal distribution over S. Among x_{A∪B∪C}^T Θ_{A∪B∪C} x_{A∪B∪C} on the right-hand side, the values of x_i θ_{i,j} x_j (i ∈ A, j ∈ B) should be zeros because such cross terms are missing from the other terms of (5.3). Thus, we require the elements Θ_{AB}, Θ_{BA} of Θ indexed by i ∈ A, j ∈ B and i ∈ B, j ∈ A to be zero. On the other hand, if Θ_{AB} = Θ_{BA} = 0, then both sides of (5.3) are zero (Proposition 15).

Proposition 15 (Lauritzen, 1996 [18])

$$\Theta_{AB} = \Theta_{BA}^T = 0 \;\Longrightarrow\; \det\Sigma_{A\cup C}\det\Sigma_{B\cup C} = \det\Sigma_{A\cup B\cup C}\det\Sigma_C$$

For the proof, see the Appendix.
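Before turning to the example, the determinant identity can be confirmed numerically for a concrete Θ with Θ_{AB} = 0. The sketch below is only an illustration (the particular Θ and the reading Σ = Θ^{-1} are our choices), with A = {1}, B = {2}, C = {3}.

```r
# Numerical check of Proposition 15 (illustrative sketch).
Theta <- matrix(c(2, 0, 1,
                  0, 2, 1,
                  1, 1, 2), nrow = 3)       # theta_{1,2} = theta_{2,1} = 0
Sigma <- solve(Theta)
lhs <- det(Sigma[c(1, 3), c(1, 3)]) * det(Sigma[c(2, 3), c(2, 3)])
rhs <- det(Sigma) * det(Sigma[3, 3, drop = FALSE])
c(lhs, rhs)                                  # the two values coincide
```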


Example 46 Because

$$\Theta = \begin{bmatrix} \theta_{1,1} & 0 & \theta_{1,3} \\ 0 & \theta_{2,2} & \theta_{2,3} \\ \theta_{1,3} & \theta_{2,3} & \theta_{3,3} \end{bmatrix}
\quad\text{implies}\quad
\Sigma = \frac{1}{\det\Theta}\begin{bmatrix} \theta_{2,2}\theta_{3,3} - \theta_{2,3}^2 & \theta_{1,3}\theta_{2,3} & -\theta_{1,3}\theta_{2,2} \\ \theta_{1,3}\theta_{2,3} & \theta_{1,1}\theta_{3,3} - \theta_{1,3}^2 & -\theta_{2,3}\theta_{1,1} \\ -\theta_{1,3}\theta_{2,2} & -\theta_{2,3}\theta_{1,1} & \theta_{1,1}\theta_{2,2} \end{bmatrix} ,$$

we have

$$\det\Sigma_{\{1,3\}}\det\Sigma_{\{2,3\}}
= \frac{1}{(\det\Theta)^2}\{(\theta_{2,2}\theta_{3,3} - \theta_{2,3}^2)\theta_{1,1}\theta_{2,2} - \theta_{1,3}^2\theta_{2,2}^2\}\cdot\frac{1}{(\det\Theta)^2}\{(\theta_{1,1}\theta_{3,3} - \theta_{1,3}^2)\theta_{1,1}\theta_{2,2} - \theta_{2,3}^2\theta_{1,1}^2\}$$
$$= \frac{1}{(\det\Theta)^4}\theta_{1,1}\theta_{2,2}(\theta_{1,1}\theta_{2,2}\theta_{3,3} - \theta_{1,1}\theta_{2,3}^2 - \theta_{2,2}\theta_{1,3}^2)^2
= \frac{\theta_{1,1}\theta_{2,2}}{(\det\Theta)^2} = \det\Sigma_{\{3\}}\det\Sigma_{\{1,2,3\}} .$$

Hence, if X_1, . . . , X_p are Gaussian, we have Θ_{AB} = Θ_{BA}^T = 0 ⟺ (5.3). Thus, we have the following proposition.

Proposition 16 If X_1, . . . , X_p are Gaussian, we have

$$\Theta_{AB} = \Theta_{BA}^T = 0 \iff \text{there exists } C \subseteq \{1, 2, \ldots, p\} \text{ such that } X_A \perp\!\!\!\perp_P X_B \mid X_C .$$

In particular, for D ∩ (A ∪ B) = ∅, we have

$$X_A \perp\!\!\!\perp_P X_B \mid X_C ,\; C \subseteq D \;\Longrightarrow\; X_A \perp\!\!\!\perp_P X_B \mid X_D .$$

As we have seen in Example 45, X ⊥⊥ P Y does not mean that X ⊥⊥ P Y | Z .


However, for the Gaussian distributions, from Proposition 16, no such example exists.
The proposition is useful in this chapter (Fig. 5.2).
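For Gaussian variables, the correspondence between a zero entry of Θ and conditional independence can also be checked through the conditional covariance Σ_{S} − Σ_{S,C}Σ_{C,C}^{-1}Σ_{C,S}, which equals (Θ_{S,S})^{-1} for S = A ∪ B and C the remaining variables. The sketch below is only an illustration with a Θ of our own choosing.

```r
# Conditional independence of X1, X2 given (X3, X4) when theta_{1,2} = 0 (sketch).
Theta <- matrix(c(2, 0, 0.5, 0.3,
                  0, 2, 0.4, 0.6,
                  0.5, 0.4, 2, 0.2,
                  0.3, 0.6, 0.2, 2), nrow = 4)
Sigma <- solve(Theta)
A <- 1; B <- 2; C <- c(3, 4)
cond <- Sigma[c(A, B), c(A, B)] -
  Sigma[c(A, B), C] %*% solve(Sigma[C, C]) %*% Sigma[C, c(A, B)]
cond[1, 2]                       # off-diagonal of the conditional covariance: ~0
solve(Theta[c(A, B), c(A, B)])   # equals the conditional covariance matrix
```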

Proposition 17 (Pearl and Paz, 1985) Suppose that we construct an undirected graph (V, E) such that {i, j} ∈ E ⟺ X_i ⊥̸⊥_P X_j | X_S for i, j ∈ V and S ⊆ V. Then, X_A ⊥⊥_P X_B | X_C ⟺ A ⊥⊥_E B | C for all disjoint subsets A, B, C of V if and only if the probability P satisfies the following conditions:

$$X_A \perp\!\!\!\perp_P X_B \mid X_C \iff X_B \perp\!\!\!\perp_P X_A \mid X_C \qquad (5.4)$$
5.1 Graphical Models 149

[Fig. 5.2: Markov network (left) and precision matrix Θ (right)]

Fig. 5.2 In Proposition 17, for the Gaussian variables, the existence of each edge in the Markov
network is equivalent to the nonzero of the corresponding element in the precision matrix. The
colors red, blue, green, yellow, black, and pink in the left and right correspond

$$X_A \perp\!\!\!\perp_P (X_B \cup X_D) \mid X_C \;\Longrightarrow\; X_A \perp\!\!\!\perp_P X_B \mid X_C ,\; X_A \perp\!\!\!\perp_P X_D \mid X_C \qquad (5.5)$$

$$X_A \perp\!\!\!\perp_P X_B \mid (X_C \cup X_D) ,\; X_A \perp\!\!\!\perp_P X_D \mid (X_B \cup X_C) \;\Longrightarrow\; X_A \perp\!\!\!\perp_P (X_B \cup X_D) \mid X_C \qquad (5.6)$$

$$X_A \perp\!\!\!\perp_P X_B \mid X_C \;\Longrightarrow\; X_A \perp\!\!\!\perp_P X_B \mid (X_C \cup X_D) \qquad (5.7)$$

$$X_A \perp\!\!\!\perp_P X_B \mid X_C \;\Longrightarrow\; X_A \perp\!\!\!\perp_P X_k \mid X_C \text{ or } X_k \perp\!\!\!\perp_P X_B \mid X_C \qquad (5.8)$$

for k ∉ A ∪ B ∪ C, where A, B, C, D are disjoint subsets of V for each equation.
/ A ∪ B ∪ C, where A, B, C, D are disjoint subsets of V for each equation.

For the proof, see the Appendix.


We claim that the Gaussian P satisfies (5.4)–(5.8), which means that if we construct an undirected graph (V, E) such that {i, j} ∈ E ⟺ X_i ⊥̸⊥_P X_j | X_S for i, j ∈ V and S ⊆ V, then X_A ⊥⊥_P X_B | X_C ⟺ A ⊥⊥_E B | C.
Among the five relations, the first three hold for any probability model. The fourth is true for Gaussian distributions from Proposition 16. We only show that the last relation holds for the Gaussian distribution. For (5.8), if either k ∈ A or k ∈ B, then the claim is apparent. If neither X_A ⊥⊥_P X_k | X_C nor X_B ⊥⊥_P X_k | X_C holds, then it contradicts X_A ⊥⊥_P X_B | X_C because the distribution is Gaussian.
In this chapter, we consider estimating the edge set E of the Markov network
from the observation X ∈ R N × p consisting of N tuples of the p variable X 1 , . . . , X p .
According to Proposition 17, we need to estimate only the precision matrix , the
inverse of the covariance matrix. If we write each element of the observation X as
150 5 Graphical Models

N
1
xi, j , then for x̄ j := xi, j , one might consider computing the sample covariance
N i=1
N
1
matrix S = (si, j ) whose elements are si, j := (xi, j − x̄ j )2 and testing whether
N i=1
each is zero or not.
However, we immediately find that this process causes problems. First, the matrix S does not always have an inverse. In particular, under the sparse situation, we assume that the number p of variables is large relative to the sample size N, such as when finding which genes among p = 10,000 affect a disease from the gene expression data of N = 100 cases/controls. S is nonnegative definite and can be expressed as A^TA using a matrix A ∈ R^{N×p}. If p > N, then the rank of S ∈ R^{p×p} is at most N (< p), and S is singular.
Even if p < N, because we estimate Θ from samples, it is unlikely that all the estimated elements corresponding to Θ_{AB} happen to be exactly zero, even if X_A ⊥⊥ X_B | X_C. Thus, we cannot draw the correct conclusion without applying something akin to statistical testing.
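The rank deficiency mentioned here is easy to observe. The sketch below is only an illustration: it builds a sample covariance matrix with p > N and shows that its rank is at most N, so solve() fails.

```r
# With p > N the sample covariance matrix is singular (illustrative sketch).
set.seed(1)
N <- 20; p <- 50
X <- matrix(rnorm(N * p), N, p)
S <- t(X) %*% X / N
qr(S)$rank                         # at most N (= 20), far below p (= 50)
res <- try(solve(S), silent = TRUE)
class(res)                         # "try-error": S has no inverse
```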

5.2 Graphical Lasso

In the previous section, under p > N , we observe that no correct estimation of the
CI relations among X 1 , . . . , X p is obtained even if we correctly estimate the sample
covariance matrix S. In this section, using Lasso, we consider distinguishing whether
each element i, j is zero or not (graphical Lasso [12]).
We first note that if we express the Gaussian density function of p variables as
f  (x1 , . . . , x p ), the log-likelihood can be computed as

N
1
log f  (xi,1 , . . . , xi, p )
N i=1
N p p
1 p 1
= log det  − log(2π) − xi, j θ j,k xi,k
2 2 2N i=1 j=1 k=1
1
= {log det  − p log(2π) − trace(S)} , (5.9)
2
where we have used
N p p p p N p p
1 1
xi, j θ j,k xi,k = θ j,k xi, j xi,k = s j,k θ j,k .
N i=1 j=1 k=1 j=1 k=1
N i=1 j=1 k=1

Next, we L1-regularize (5.9) with λ ≥ 0. Note that the maximization of

$$\frac{1}{N}\sum_{i=1}^N \log f_\Theta(x_i) - \frac{1}{2}\lambda\sum_{j\neq k}|\theta_{j,k}|$$

w.r.t. nonnegative definite Θ is equivalent to the maximization of

$$\log\det\Theta - \mathrm{trace}(S\Theta) - \lambda\sum_{j\neq k}|\theta_{j,k}| . \qquad (5.10)$$

In the following, for simplicity, we write the determinant of a matrix A as either |A| or det A.
In general, if we denote by A_{i,j} the matrix obtained by removing the i-th row and j-th column from a matrix A ∈ R^{p×p}, then we have

$$\sum_{j=1}^p (-1)^{k+j} a_{i,j}|A_{k,j}| = \begin{cases} |A|, & i = k \\ 0, & i \neq k \end{cases} . \qquad (5.11)$$

Suppose |A| ≠ 0. If B is the matrix whose (j, k) element is b_{j,k} = (−1)^{k+j}|A_{k,j}|/|A|, then AB is the unit matrix. In fact, from (5.11), we have

$$\sum_{j=1}^p a_{i,j}b_{j,k} = \sum_{j=1}^p a_{i,j}(-1)^{k+j}\frac{|A_{k,j}|}{|A|} = \begin{cases} 1, & i = k \\ 0, & i \neq k \end{cases} .$$

Hereafter, we denote such a matrix B as A^{-1}. Then, we note that the derivative of |A| w.r.t. a_{i,j} is (−1)^{i+j}|A_{i,j}|. In fact, we observe that if we let i = k in (5.11), then the coefficient of a_{i,j} in |A| is (−1)^{i+j}|A_{i,j}|. In addition, if we differentiate trace(SΘ) = Σ_{i=1}^p Σ_{j=1}^p s_{i,j}θ_{i,j} with respect to the (k, h)-th element θ_{k,h} of Θ, it becomes the (k, h)-th element s_{k,h} of S. Moreover, if we differentiate log det Θ by θ_{k,h}, we have

$$\frac{\partial\log|\Theta|}{\partial\theta_{k,h}} = \frac{1}{|\Theta|}\frac{\partial|\Theta|}{\partial\theta_{k,h}} = \frac{1}{|\Theta|}(-1)^{k+h}|\Theta_{k,h}| ,$$

which is the (h, k)-th element of Θ^{-1}. Because Θ is symmetric, Θ^{-1} is also symmetric:

$$\Theta^T = \Theta \;\Longrightarrow\; (\Theta^{-1})^T = (\Theta^T)^{-1} = \Theta^{-1} .$$

Thus, the Θ that maximizes (5.10) is the solution of

$$\Theta^{-1} - S - \lambda\Psi = 0 , \qquad (5.12)$$

where Ψ = (ψ_{j,k}) satisfies ψ_{j,k} = 0 when j = k and

$$\psi_{j,k} = \begin{cases} 1, & \theta_{j,k} > 0 \\ [-1,1], & \theta_{j,k} = 0 \\ -1, & \theta_{j,k} < 0 \end{cases}$$

otherwise.
The solution can be obtained by formulating the Lagrangian of the nonnegative definite problem and applying the KKT conditions:

$$-\log\det\Theta + \mathrm{trace}(S\Theta) + \lambda\sum_{j\neq k}|\theta_{j,k}| + \langle\Gamma, \Theta\rangle \to \min$$

1. Θ ⪰ 0
2. Γ ⪰ 0, Θ^{-1} − S − λΨ + Γ = 0 (Ψ ∈ ∂Σ_{j≠k}|θ_{j,k}|)
3. ⟨Γ, Θ⟩ = 0,

where A ⪰ 0 indicates that A is nonnegative definite and ⟨A, B⟩ denotes the inner product trace(A^TB) of matrices A, B. Because Γ, Θ ⪰ 0, ⟨Γ, Θ⟩ = 0 means trace(ΓΘ) = trace(Γ^{1/2}ΘΓ^{1/2}) = 0, which further means that ΓΘ = 0 (and hence Γ = 0, since Θ is nonsingular) because Γ^{1/2}ΘΓ^{1/2} ⪰ 0. Thus, we obtain (5.12).
Next, we obtain the solution Θ of (5.12). Let W be the inverse matrix of Θ. Decomposing the matrices Ψ, S, Θ, W ∈ R^{p×p} so that the size of the upper-left block is (p − 1) × (p − 1), we write

$$\Psi = \begin{bmatrix} \Psi_{1,1} & \psi_{1,2} \\ \psi_{2,1} & \psi_{2,2} \end{bmatrix}, \quad S = \begin{bmatrix} S_{1,1} & s_{1,2} \\ s_{2,1} & s_{2,2} \end{bmatrix}, \quad \Theta = \begin{bmatrix} \Theta_{1,1} & \theta_{1,2} \\ \theta_{2,1} & \theta_{2,2} \end{bmatrix}$$

$$\begin{bmatrix} W_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{bmatrix}\begin{bmatrix} \Theta_{1,1} & \theta_{1,2} \\ \theta_{2,1} & \theta_{2,2} \end{bmatrix} = \begin{bmatrix} I_{p-1} & 0 \\ 0 & 1 \end{bmatrix} , \qquad (5.13)$$

where we assume θ_{2,2} > 0. If W is positive definite, the eigenvalues of Θ are the inverses of those of W, and Θ is positive definite. Thus, if we multiply Θ on both sides by the vector whose first p − 1 elements are zero and whose last element is one, the resulting value (quadratic form) is positive, which justifies θ_{2,2} > 0.
Then, from the upper-right part of both sides of (5.12), we have

$$w_{1,2} - s_{1,2} - \lambda\psi_{1,2} = 0 , \qquad (5.14)$$

where the upper-right part of Θ^{-1} is w_{1,2}. From the upper-right part of both sides of (5.13), we have

$$W_{1,1}\theta_{1,2} + w_{1,2}\theta_{2,2} = 0 . \qquad (5.15)$$

If we put β = [β_1, . . . , β_{p-1}]^T := −θ_{1,2}/θ_{2,2}, then from (5.14), (5.15), we have the following equations:
5.2 Graphical Lasso 153


Fig. 5.3 The core part of graphical Lasso (Friedman et al., 2008)[12]. Update W1,1 , w1,2 , where
the yellow and green elements are W1,1 and w1,2 , respectively. The same value as that of w1,2 is
stored in w2,1 . In this case ( p = 4), we repeat the four steps. The diagonal elements remain the
same as those of S

$$W_{1,1}\beta - s_{1,2} + \lambda\phi_{1,2} = 0 \qquad (5.16)$$

$$w_{1,2} = W_{1,1}\beta , \qquad (5.17)$$

where the j-th element of φ_{1,2} ∈ R^{p-1} is

$$\begin{cases} 1, & \beta_j > 0 \\ [-1,1], & \beta_j = 0 \\ -1, & \beta_j < 0 \end{cases} .$$

In fact, each element of ψ_{1,2} ∈ R^{p-1} is determined by the sign (positive/negative/zero) of θ_{1,2}, and a value contained in [−1, 1] remains in [−1, 1] even if we flip its sign. From θ_{2,2} > 0, the signs of the elements of β and θ_{1,2} are opposite, and we have ψ_{1,2} = −φ_{1,2}.
We find the solution of (5.12) as follows: First, we find the solution β of (5.16),
substitute it into (5.17), and obtain w1,2 , which is the same as w2,1 due to symmetry.
We repeat the process, changing the positions of W1,1 , w1,2 . For j = 1, . . . , p, W1,1
is regarded as W , except in the j-th row and j-th column; w1,2 is the W of the j-th
column, except for the j-th element (Fig. 5.3).
If we input A := W_{1,1} ∈ R^{(p-1)×(p-1)}, b := s_{1,2} ∈ R^{p-1}, and c_j := b_j − Σ_{k≠j} a_{j,k}β_k, then each β_j that satisfies (5.16) is computed as

$$\beta_j = \begin{cases} \dfrac{c_j - \lambda}{a_{j,j}}, & c_j > \lambda \\ 0, & -\lambda < c_j < \lambda \\ \dfrac{c_j + \lambda}{a_{j,j}}, & c_j < -\lambda \end{cases} . \qquad (5.18)$$

In the last stage, for j = 1, . . . , p, if we take the following one cycle, we obtain the estimate of Θ:

$$\theta_{2,2} = [w_{2,2} - w_{1,2}^T\beta]^{-1} \qquad (5.19)$$

$$\theta_{1,2} = -\beta\theta_{2,2} , \qquad (5.20)$$

where (5.19) is derived from (5.13) and β = −θ_{1,2}/θ_{2,2}:

$$\theta_{2,2}(w_{2,2} - w_{1,2}^T\beta) = \theta_{2,2}w_{2,2} - \theta_{2,2}w_{1,2}^T(-\theta_{1,2}/\theta_{2,2}) = 1 .$$

For example, we may construct the following procedure.


1 inner.prod = function(x, y) return(sum(x * y)) ## already appeared
2 soft.th = function(lambda, x) return(sign(x) * pmax(abs(x) - lambda, 0)) ## already appeared
3 graph.lasso = function(s, lambda = 0) {
4 W = s; p = ncol(s); beta = matrix(0, nrow = p - 1, ncol = p)
5 beta.out = beta; eps.out = 1
6 while (eps.out > 0.01) {
7 for (j in 1:p) {
8 a = W[-j, -j]; b = s[-j, j]
9 beta.in = beta[, j]; eps.in = 1
10 while (eps.in > 0.01) {
11 for (h in 1:(p - 1)) {
12 cc = b[h] - inner.prod(a[h, -h], beta[-h, j])
13 beta[h, j] = soft.th(lambda, cc) / a[h, h]
14 }
15 eps.in = max(beta[, j] - beta.in); beta.in = beta[, j]
16 }
17 W[-j, j] = W[-j, -j] %*% beta[, j]
18 }
19 eps.out = max(beta - beta.out); beta.out = beta
20 }
21 theta = matrix(nrow = p, ncol = p)
22 for (j in 1:p) {
23 theta[j, j] = 1 / (W[j, j] - W[j, -j] %*% beta[, j])
24 theta[-j, j] = -beta[, j] * theta[j, j]
25 }
26 return(theta)
27 }

Example 47 We construct an undirected graph by connecting the pairs s, t with θ_{s,t} ≠ 0 as edges. To this end, we generate data consisting of N tuples of p = 5 variables based on Θ = (θ_{s,t}), the values of which are known a priori.
1 library(MultiRNG)
2 Theta = matrix(c( 2, 0.6, 0, 0, 0.5, 0.6, 2, -0.4, 0.3, 0,
3                   0, -0.4, 2, -0.2, 0, 0, 0.3, -0.2, 2, -0.2,
4                   0.5, 0, 0, -0.2, 2), nrow = 5)
5 Sigma = solve(Theta)
6 meanvec = rep(0, 5)
7 dat = draw.d.variate.normal(no.row = 20, d = 5, mean.vec = meanvec, cov.mat = Sigma)
8 # average: mean.vec, cov matrix: cov.mat, sample #: no.row, variable #: d
9 s = t(dat) %*% dat / nrow(dat)

For the data, we detect whether an edge exists using the R language function
graph.lasso, which recognizes whether each element of the matrix Theta is
zero.
1 Theta

1 [,1] [,2] [,3] [,4] [,5]


2 [1,] 2.0 0.6 0.0 0.0 0.5
3 [2,] 0.6 2.0 -0.4 0.3 0.0
4 [3,] 0.0 -0.4 2.0 -0.2 0.0
5 [4,] 0.0 0.3 -0.2 2.0 -0.2
6 [5,] 0.5 0.0 0.0 -0.2 2.0

1 graph.lasso(s)

1 [,1] [,2] [,3] [,4] [,5]


2 [1,] 2.1854523 0.49532753 -0.1247222 -0.04480949 0.46042784
3 [2,] 0.4962869 1.75439746 -0.2578557 0.17696543 -0.04818041
4 [3,] -0.1258535 -0.25806533 2.0279864 -0.47076829 0.13461843
5 [4,] -0.0435247 0.17742249 -0.4710045 2.22329991 -0.45533867
6 [5,] 0.4595971 -0.04867342 0.1350065 -0.45560256 2.20517785

1 graph.lasso(s, lambda = 0.015)

1 [,1] [,2] [,3] [,4] [,5]


2 [1,] 2.12573980 0.4372959 -0.06516687 0.0000000 0.39343910
3 [2,] 0.43823782 1.7088250 -0.17041729 0.1011818 0.00000000
4 [3,] -0.06538542 -0.1708783 1.97559922 -0.3728077 0.04907338
5 [4,] 0.00000000 0.1010347 -0.37528158 2.1517477 -0.34531269
6 [5,] 0.39473905 0.0000000 0.04930174 -0.3453602 2.14001693

1 graph.lasso(s, lambda = 0.03)

1 [,1] [,2] [,3] [,4] [,5]


2 [1,] 2.0764031 0.37682657 -0.0002649865 0.00000000 0.3254736
3 [2,] 0.3769137 1.67661814 -0.0963712162 0.03264331 0.0000000
4 [3,] 0.0000000 -0.09711355 1.9435147152 -0.29372726 0.0000000
5 [4,] 0.0000000 0.03199325 -0.2935709011 2.10377862 -0.2659885
6 [5,] 0.3271177 0.00000000 0.0000000000 -0.26606666 2.0965524

1 graph.lasso(s, lambda = 0.05)



1 [,1] [,2] [,3] [,4] [,5]


2 [1,] 2.0259641 0.30684893 0.00000000 0.0000000 0.2421536
3 [2,] 0.3068663 1.64928815 -0.03096349 0.0000000 0.0000000
4 [3,] 0.0000000 -0.03157345 1.91889676 -0.2112410 0.0000000
5 [4,] 0.0000000 0.00000000 -0.21143915 2.0642104 -0.1808793
6 [5,] 0.2423494 0.00000000 0.00000000 -0.1809381 2.0549861

For the data generated above, we draw the undirected graphs for λ = 0, 0.015, 0.03, 0.05 in Fig. 5.4. We observe that the larger the value of λ is, the smaller the number of edges in the graph.

In actual data processing with graphical Lasso, the glasso R package is often used (the option rho is used to specify λ). In this book, we constructed the function graph.lasso, but the process is not optimized for finer details. In the rest of this section, we use glasso.
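If the glasso package is available, its output can be compared with the graph.lasso function constructed above. The sketch below is only an illustration: we assume that setting penalize.diagonal = FALSE makes the glasso objective match the one used by graph.lasso (which does not penalize the diagonal), so small discrepancies remain only because of the convergence thresholds.

```r
# Compare our graph.lasso with the glasso package (illustrative sketch).
library(glasso)
fit <- glasso(s, rho = 0.03, penalize.diagonal = FALSE)
max(abs(fit$wi - graph.lasso(s, lambda = 0.03)))   # small, up to convergence tolerance
```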

Example 48 We can execute the same procedure as in Example 47 by glasso.


1 library(glasso)
2 solve(s); glasso(s, rho = 0); glasso(s, rho = 0.015)
3 glasso(s, rho = 0.030); glasso(s, rho = 0.045)

When we use a graph of the numerical data output of the R language, for both
directed and undirected graphs, the igraph R package is used. Using the package,
we draw an undirected graph such that (i, j) are connected if and only if ai, j = 0
from a symmetric matrix A = (ai, j ) of size p. Moreover, we use it to draw an
undirected graph from a precision matrix . For example, we can construct the
following function adj.
1 library(igraph)
2 adj = function(mat) {
3 p = ncol(mat); ad = matrix(0, nrow = p, ncol = p)
4 for (i in 1:(p - 1)) for (j in (i + 1):p) {
5 if (mat[i, j] == 0) ad[i, j] = 0 else ad[i, j] = 1
6 }
7 g = graph.adjacency(ad, mode = "undirected")
8 plot(g)
9 }

[Fig. 5.4: estimated undirected graphs for λ = 0, 0.015, 0.03, 0.05]

Fig. 5.4 In Example 47, we fix an undirected graph and generate data based on the graph. From the data, we estimate the undirected graph. We observe that the larger the value of λ is, the smaller the number of edges in the graph

Fig. 5.5 We draw the output of glasso w.r.t. the first 200 genes for λ = 0.75. For display, we
used a drawing application (Cytoscape) rather than igraph

However, it is not easy to draw a graph with hundreds of vertices on a display at once. In particular, in the igraph package, the vertices and edges overlap even if the vertex size is shrunk. Thus, it may be appropriate to use the R procedure only for obtaining the adjacency matrix and to use another drawing software program for displaying the graph.² Using such application software is even better if the display is used for a paper submission, an important presentation, etc. It may be appropriate to say that the task of structure estimation of a graphical model is to obtain the mathematical description of the graph rather than to display the graph itself.

Example 49 (Breast Cancer) We execute graphical Lasso for the dataset breast-
cancer.csv (N = 250, p = 1000). The purpose of the procedure is to find the relation
among the gene expressions of breast cancer patients. When we execute the follow-
ing program, we observe that the smaller the value of λ is, the longer the execution
time. We draw the output of glasso w.r.t. the first 200 genes out of p = 1000 for
λ = 0.75 (Fig. 5.5). The execution is carried out using the following code.

2 Cytoscape (https://cytoscape.org/), etc.



1 library(glasso); library(igraph)
2 df = read.csv("breastcancer.csv")
3 w = matrix(nrow = 250, ncol = 1000)
4 for (i in 1:1000) w[, i] = as.numeric(df[[i]])
5 x = w; s = t(x) %*% x / 250
6 fit = glasso(s, rho = 0.75); sum(fit$wi == 0)
7 y = NULL; z = NULL
8 for (i in 1:999) for (j in (i + 1):1000) if (fit$wi[i, j] != 0) {y = c(y, i); z = c(z, j)}
9 edges = cbind(y, z)
10 write.csv(edges,"edges.csv")

The edges of the undirected graph are stored in edges.csv. We input the file to
Cytoscape to obtain the output.
Although thus far we have addressed the algorithm of graphical Lasso, we have not mentioned how to set the value of λ or the correctness of graphical Lasso for a large sample size N (missing/extra edges and parameter estimation).
Ravikumar et al. 2011 [23] presented a result that theoretically guarantees the correctness of the graphical Lasso procedure: there exist a parameter α and functions f, g determined w.r.t. Σ such that when λ is f(α)/√N,
1. the probability that the maximum element of Θ − Θ̂ is not upper bounded by g(α)/√N can be made arbitrarily small, and
2. the probability that the edge sets in Θ and Θ̂ coincide can be made arbitrarily large
as the sample size N grows, where Θ̂ is the estimate of Θ that is obtained by graphical Lasso.
Although the paper suggests that λ should be O(1/√N), the parameter α is not known and cannot be estimated from the N samples, which means that the λ that guarantees the correctness cannot be obtained. In addition, the theory assumes a range of α, and we cannot know from the samples whether the condition is met. Even if the λ value is known, the paper does not prove that the choice of λ is optimal. However, Ref. [23] claims that consistency can be obtained for a broad range of α for large N and that the parameter estimation improves with O(1/√N).
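The qualitative behavior described above can be explored with a small simulation. The sketch below is only an illustration: the constant in λ = 1/√N, the particular Θ, and the use of MASS::mvrnorm are arbitrary choices, and the printed counts of estimated zeros will vary from run to run.

```r
# How support recovery behaves as N grows when lambda = O(1/sqrt(N)) (illustrative sketch).
library(MASS); library(glasso)
Theta <- matrix(c(2, 0.6, 0, 0.6, 2, 0.5, 0, 0.5, 2), nrow = 3)
Sigma <- solve(Theta)
set.seed(1)
for (N in c(50, 200, 800)) {
  X <- mvrnorm(N, rep(0, 3), Sigma)
  S <- t(X) %*% X / N
  fit <- glasso(S, rho = 1 / sqrt(N))
  cat("N =", N, " estimated zeros in wi:", sum(fit$wi == 0), "\n")
}
```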

5.3 Estimation of the Graphical Model based on the


Quasi-likelihood

Whether the response is a continuous or a discrete variable, we can infer the relevant covariates via linear or logistic regression. Moreover, each covariate may be either discrete or continuous.
For a response X i , if the covariates are X k (k ∈ πi ⊆ {1, . . . , p}\{i}), we consider
the subset πi to be the parent set of X i . Even if we obtain the optimum parent set, we
cannot maximize the (regularized) likelihood of the data. If the graph has a tree/forest
structure, and if we seek the parent set from the root to the leaves, it may be possible

[Fig. 5.6: two estimated graphs over the first 50 variables]
Fig. 5.6 The graphical model obtained via the quasi-likelihood method from the Breast Cancer
Dataset. When all the variables are continuous (left, Example 50), and when all the variables are
binary (right, Example 51)

to obtain the exact likelihood. However, we may wonder where the root should be,
why we may artificially assume the structure, etc.
Thus, to approximately realize maximizing the likelihood, we obtain the parent
set πi (i = 1, . . . , p) of each vertex (response) independently. We may choose one
of the following:
1. j ∈ πi and i ∈ π j =⇒ Connect i, j as an edge (AND rule)
2. j ∈ πi or i ∈ π j =⇒ Connect i, j as an edge (OR rule),
where the choice of λ may be different for each response.
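Combining the per-response supports into an undirected graph is then a purely symbolic step, as in the short sketch below (only an illustration; the matrix pa is a hypothetical 0/1 matrix whose (i, j) entry indicates j ∈ πi).

```r
# Combine parent sets into an adjacency matrix by the AND / OR rules (sketch).
combine <- function(pa, rule = c("AND", "OR")) {
  rule <- match.arg(rule)
  if (rule == "AND") ad <- pa * t(pa)        # edge if j in pi_i AND i in pi_j
  else ad <- 1 * ((pa + t(pa)) > 0)          # edge if j in pi_i OR i in pi_j
  diag(ad) <- 0
  ad
}
pa <- matrix(c(0, 1, 0,
               1, 0, 1,
               0, 0, 0), nrow = 3, byrow = TRUE)   # hypothetical supports
combine(pa, "AND"); combine(pa, "OR")
```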
Although this quasi-likelihood method [20] may not be theoretical, it can be
applied to general situations, in contrast to graphical Lasso:
1. the distribution over the p variables may not be Gaussian
2. some of the p variables may be discrete.
A demerit of the method is that when λ is small, particularly for large p, it is rather
time-consuming compared with graphical Lasso.
If we realize it using the R language, we may use the glmnet package: the options
family = "gaussian" (default), family = "binomial", and family
= "multinomial" are available for continuous, binary, and multiple variables,
respectively.

Example 50 We generate a graph using the quasi-likelihood method via glmnet


for the Breast Cancer Dataset. In the following, we examine the difference between
the AND and OR rules. Although the data contain 1,000 variables, we execute the
first p = 50 variables. Then, the difference cannot be seen between the AND and OR
rules. For λ = 0.1, there are 243 edges out of the p( p − 1)/2 = 1,225 candidates
(Fig. 5.6, left).

1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 n = 250; p = 50; w = matrix(nrow = n, ncol = p)
4 for (i in 1:p) w[, i] = as.numeric(df[[i]])
5 x = w[, 1:p]; fm = rep("gaussian", p); lambda = 0.1
6 fit = list()
7 for (j in 1:p) fit[[j]] = glmnet(x[, -j], x[, j], family = fm[j],
lambda = lambda)
8 ad = matrix(0, p, p)
9 for (i in 1:p) for (j in 1:(p - 1)) {
10 k = j
11 if (j >= i) k = j + 1
12 if (fit[[i]]$beta[j] != 0) ad[i, k] = 1 else ad[i, k] = 0
13 }
14 ## AND
15 for (i in 1:(p - 1)) for (j in (i + 1):p) {
16 if (ad[i, j] != ad[j, i]) {ad[i, j] = 0; ad[j, i] = 0}
17 }
18 u = NULL; v = NULL
19 for (i in 1:(p - 1)) for (j in (i + 1):p) {
20 if (ad[i, j] == 1) {u = c(u, i); v = c(v, j)}
21 }
22 u

1 [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
2 [23] 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5
3 ......................................................................
4 [199] 30 30 31 31 31 31 31 31 31 32 32 33 33 34 34 34 35 35 36 36 36 37
5 [221] 37 37 38 38 38 38 38 40 40 42 42 43 43 43 44 44 45 45 47 47 47 49

1 v

1 [1] 2 12 13 18 22 23 24 28 41 46 48 50 4 5 10 16 18 21 23 26 27 34
2 [23] 37 43 9 10 15 19 20 21 24 29 38 50 5 8 17 18 21 23 46 9 20 21
3 ......................................................................
4 [199] 48 50 34 36 39 44 47 49 50 33 45 37 49 35 36 45 41 42 38 39 45 38
5 [221] 42 46 39 45 46 49 50 41 42 49 50 47 49 50 48 50 46 50 48 49 50 50

1 adj(ad)
2 ## OR
3 for (i in 1:(p - 1)) for (j in (i + 1):p) {
4 if (ad[i, j] != ad[j, i]) {ad[i, j] = 1; ad[j, i] = 1}
5 }
6 adj(ad)

Example 51 We construct an undirected graph from N samples of p binary variables


(the sign ± of the values in breastcancer.csv).

1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 w = matrix(nrow = 250, ncol = 1000); for (i in 1:1000) w[, i] = as.numeric(df[[i]])
4 w = (sign(w) + 1) / 2 ## transforming it to binary
5 p = 50; x = w[, 1:p]; fm = rep("binomial", p); lambda = 0.15
6 fit = list()
7 for (j in 1:p) fit[[j]] = glmnet(x[, -j], x[, j], family = fm[j],
lambda = lambda)
8 ad = matrix(0, nrow = p, ncol = p)
9 for (i in 1:p) for (j in 1:(p - 1)) {
10 k = j
11 if (j >= i) k = j + 1
12 if (fit[[i]]$beta[j] != 0) ad[i, k] = 1 else ad[i, k] = 0
13 }
14 for (i in 1:(p - 1)) for (j in (i + 1):p) {
15 if (ad[i, j] != ad[j, i]) {ad[i, j] = 0; ad[j, i] = 0}
16 }
17 sum(ad); adj(ad)

We may mix continuous and discrete variables by setting one of the options
"gaussian","binomial","multinomial" for each response in the array
fm of Examples 50 and 51.

5.4 Joint Graphical Lasso

Thus far, we have generated undirected graphs from the data X ∈ R N × p . Next, we
consider generating a graph from each of X 1 ∈ R N1 × p and X 2 ∈ R N2 × p (N1 + N2 =
N ) (joint graphical Lasso). We may assume a supervised learning with y ∈ {1, 2} N
such that X 1 , X 2 are generated from the rows i of X with yi = 1 and yi = 2. The
idea is to utilize the N samples when the two graphs share similar properties: if we
generate two graphs from X 1 and X 2 separately, the estimates will be poorer than
those obtained by applying the joint graphical Lasso. For example, suppose that we
generate two gene regulatory networks for the case and control expression data of
p genes. In this case, the two graphs would be similar, although some edges may be
different.
The joint graphical Lasso (JGL; Danaher–Witten, 2014) [8] extends the maximization of (5.10) to that of

$$\sum_{k=1}^K N_k\{\log\det\Theta_k - \mathrm{trace}(S_k\Theta_k)\} - P(\Theta_1, \ldots, \Theta_K) , \qquad (5.21)$$

for y ∈ {1, 2, . . . , K } (K ≥ 2).


For the P(Θ_1, . . . , Θ_K) term, we choose either the fused Lasso or the group Lasso option. Their penalties are, respectively,

$$P(\Theta_1, \ldots, \Theta_K) := \lambda_1\sum_k\sum_{i\neq j}|\theta_{i,j}^{(k)}| + \lambda_2\sum_{k<k'}\sum_{i,j}|\theta_{i,j}^{(k)} - \theta_{i,j}^{(k')}| \qquad (5.22)$$

$$P(\Theta_1, \ldots, \Theta_K) := \lambda_1\sum_k\sum_{i\neq j}|\theta_{i,j}^{(k)}| + \lambda_2\sum_{i\neq j}\sqrt{\sum_k \left(\theta_{i,j}^{(k)}\right)^2} , \qquad (5.23)$$

where the indices range over k = 1, . . . , K and i, j = 1, . . . , p. Equations (5.22), (5.23) share the first term, which makes the terms with small absolute values zero, similar to the Lasso procedures that we have seen.
In JGL, to obtain the solution of (5.21), we apply the ADMM: We define the extended Lagrangian as

$$L_\rho(\Theta, Z, U) := -\sum_{k=1}^K N_k\{\log\det\Theta_k - \mathrm{trace}(S_k\Theta_k)\} + P(Z_1, \ldots, Z_K) + \rho\sum_{k=1}^K\langle U_k, \Theta_k - Z_k\rangle + \frac{\rho}{2}\sum_{k=1}^K\|\Theta_k - Z_k\|_F^2$$

and repeat the following three steps:
i. Θ^{(t)} ← argmin_Θ L_ρ(Θ, Z^{(t−1)}, U^{(t−1)})
ii. Z^{(t)} ← argmin_Z L_ρ(Θ^{(t)}, Z, U^{(t−1)})
iii. U^{(t)} ← U^{(t−1)} + ρ(Θ^{(t)} − Z^{(t)}).
More precisely, setting ρ > 0 and letting Θ_k be the unit matrix and Z_k, U_k be zero matrices for k = 1, . . . , K, we repeat the above three steps. For the first step, differentiating L_ρ(Θ, Z, U) by Θ, from a discussion similar to that for graphical Lasso, we have

$$-N_k(\Theta_k^{-1} - S_k) + \rho(\Theta_k - Z_k + U_k) = 0 .$$

If we decompose both sides (a symmetric matrix) of

$$\Theta_k^{-1} - \frac{\rho}{N_k}\Theta_k = S_k - \rho\frac{Z_k}{N_k} + \rho\frac{U_k}{N_k}$$

as VDV^T (D: a diagonal matrix), then since Θ_k and VDV^T share the same eigenvectors, we can write Θ_k = VD̃V^T, Θ_k^{-1} = VD̃^{-1}V^T. Thus, for the diagonal elements, we have

$$\frac{1}{\tilde D_{j,j}} - \frac{\rho}{N_k}\tilde D_{j,j} = D_{j,j} ,$$

i.e.,

$$\tilde D_{j,j} = \frac{N_k}{2\rho}\left(-D_{j,j} + \sqrt{D_{j,j}^2 + 4\rho/N_k}\right) .$$

Therefore, the optimum Θ_k can be written as VD̃V^T using this D̃.
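The scalar relation behind update (i) can be verified directly. The sketch below is only an illustration with arbitrarily chosen numbers: it confirms that the D̃_{j,j} above satisfies 1/D̃_{j,j} − (ρ/N_k)D̃_{j,j} = D_{j,j}.

```r
# Check the diagonal update used in step (i) of the joint graphical Lasso (sketch).
Nk <- 50; rho <- 1
D <- c(-0.7, 0.2, 1.5)                              # arbitrary eigenvalues
D.tilde <- Nk / (2 * rho) * (-D + sqrt(D^2 + 4 * rho / Nk))
1 / D.tilde - rho / Nk * D.tilde                    # recovers D
```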



The second step executes fused Lasso among the different classes k. For K = 2, genlasso and other fused Lasso procedures do not work; thus, we set up the function b.fused: if the difference between y1 and y2 is at most 2λ, then θ1 = θ2 = (y1 + y2)/2; otherwise, we subtract λ from the larger of y1, y2 and add λ to the smaller one. The procedure that obtains the solution for λ1 from the solution with λ1 = 0 is the sparse fused Lasso discussed in Sect. 4.2.
1 # genlasso works only when the size is at least three
2 b.fused = function(y, lambda) {
3 if (y[1] > y[2] + 2 * lambda) {a = y[1] - lambda; b = y[2] + lambda}
4 else if (y[1] < y[2] - 2 * lambda) {a = y[1] + lambda; b = y[2] - lambda}
5 else {a = (y[1] + y[2]) / 2; b = a}
6 return(c(a, b))
7 }
8 # fused Lasso that compares not only the adjacency terms but also all
adjacency values
9 fused = function(y, lambda.1, lambda.2) {
10 K = length(y)
11 if (K == 1) theta = y
12 else if (K == 2) theta = b.fused(y, lambda.2)
13 else {
14 L = K * (K - 1) / 2; D = matrix(0, nrow = L, ncol = K)
15 k = 0
16 for (i in 1:(K - 1)) for (j in (i + 1):K) {
17 k = k + 1; D[k, i] = 1; D[k, j] = -1
18 }
19 out = genlasso(y, D = D)
20 theta = coef(out, lambda = lambda.2)
21 }
22 theta = soft.th(lambda.1, theta)
23 return(theta)
24 }
25 # Joint Graphical Lasso
26 jgl = function(X, lambda.1, lambda.2) { # X is given as a list
27 K = length(X); p = ncol(X[[1]]); n = array(dim = K); S = list()
28 for (k in 1:K) {n[k] = nrow(X[[k]]); S[[k]] = t(X[[k]]) %*% X[[k]] / n[k]}
29 rho = 1; lambda.1 = lambda.1 / rho; lambda.2 = lambda.2 / rho
30 Theta = list(); for (k in 1:K) Theta[[k]] = diag(p)
31 Theta.old = list(); for (k in 1:K) Theta.old[[k]] = diag(rnorm(p))
32 U = list(); for (k in 1:K) U[[k]] = matrix(0, nrow = p, ncol = p)
33 Z = list(); for (k in 1:K) Z[[k]] = matrix(0, nrow = p, ncol = p)
34 epsilon = 0; epsilon.old = 1
35 while (abs((epsilon - epsilon.old) / epsilon.old) > 0.0001) {
36 Theta.old = Theta; epsilon.old = epsilon
37 ## Update (i)
38 for (k in 1:K) {
39 mat = S[[k]] - rho * Z[[k]] / n[k] + rho * U[[k]] / n[k]
40 svd.mat = svd(mat)
41 V = svd.mat$v
42 D = svd.mat$d
43 DD = n[k] / (2 * rho) * (-D + sqrt(D ^ 2 + 4 * rho / n[k]))
44 Theta[[k]] = V %*% diag(DD) %*% t(V)
45 }
46 ## Update (ii)
47 for (i in 1:p) for (j in 1:p) {
48 A = NULL; for (k in 1:K) A = c(A, Theta[[k]][i, j] + U[[k]][i, j])

49 if (i == j) B = fused(A, 0, lambda.2) else B = fused(A, lambda.1,


lambda.2)
50 for (k in 1:K) Z[[k]][i, j] = B[k]
51 }
52 ## Update (iii)
53 for (k in 1:K) U[[k]] = U[[k]] + Theta[[k]] - Z[[k]]
54 ## Test Convergence
55 epsilon = 0
56 for (k in 1:K) {
57 epsilon.new = max(abs(Theta[[k]] - Theta.old[[k]]))
58 if (epsilon.new > epsilon) epsilon = epsilon.new
59 }
60 }
61 return(Z)
62 }

For the JGL with the group Lasso, we may replace update (ii). Let A_k[i, j] := Θ_k[i, j] + U_k[i, j]. Then, no update is required for i = j, and

$$Z_k[i, j] = S_{\lambda_1/\rho}(A_k[i, j])\left(1 - \frac{\lambda_2}{\rho\sqrt{\sum_{k=1}^K S_{\lambda_1/\rho}(A_k[i, j])^2}}\right)_+$$

for i ≠ j.
We construct the following code, in which no functions b.fused and genlasso
are required.
1 ## Replace the update (ii) by the following
2 for (i in 1:p) for (j in 1:p) {
3 A = NULL; for (k in 1:K) A = c(A, Theta[[k]][i, j] + U[[k]][i, j])
4 if (i == j) B = A
5 else {B = soft.th(lambda.1 / rho,A) *
6 max(1 - lambda.2 / rho / sqrt(norm(soft.th(lambda.1 / rho, A), "2"
) ^ 2), 0)}
7 for (k in 1:K) Z[[k]][i, j] = B[k]
8 }

Example 52 For K = 2, adding noise to the data X ∈ R^{N×p}, we generate similar data X′ ∈ R^{N×p}. For the data X, X′ and the parameters (λ1, λ2) = (10, 0.05), (10, 0.10), (3, 0.03), we execute the fused Lasso JGL with the following code.
1 ## Data Generation and Execution
2 p = 10; K = 2; N = 100; n = array(dim = K); for (k in 1:K) n[k] = N /
K
3 X = list(); X[[1]] = matrix(rnorm(n[k] * p), ncol = p)
4 for (k in 2:K) X[[k]] = X[[k - 1]] + matrix(rnorm(n[k] * p) * 0.1,
ncol = p)
5 ## Change the lambda.1,lambda.2 values to execute
6 Theta = jgl(X, 3, 0.01)
7 par(mfrow = c(1, 2)); adj(Theta[[1]]); adj(Theta[[2]])

We observe that the larger the value of λ1 is, the more sparse the graphs. On the other
hand, the larger the value of λ2 is, the more similar the graphs (Fig. 5.7).

[Fig. 5.7 panels: (a) λ1 = 10, λ2 = 0.05; (b) λ1 = 10, λ2 = 0.10; (c) λ1 = 3, λ2 = 0.03 (adjacency matrices of the two estimated graphs)]

Fig. 5.7 Execution of Example 52 (fused Lasso, K = 2) with (λ1, λ2) = (10, 0.05), (10, 0.10), (3, 0.03). We observe that the larger the value of λ1 is, the sparser the graphs, and that the larger the value of λ2 is, the more similar the graphs


Fig. 5.8 Execution of Example 52 (group Lasso, K = 2) with (λ1, λ2) = (10, 0.01)

We execute the group Lasso JGL (Fig. 5.8). From the example, little difference
can be observed from the fused Lasso JGL.

Appendix: Proof of Propositions

Proposition 15 (Lauritzen, 1996 [18])

$$\Theta_{AB} = \Theta_{BA}^T = 0 \;\Longrightarrow\; \det\Sigma_{A\cup C}\det\Sigma_{B\cup C} = \det\Sigma_{A\cup B\cup C}\det\Sigma_C$$


Proof: Applying

$$a = \begin{bmatrix} \Theta_{AA} & \Theta_{AC} \\ \Theta_{CA} & \Theta_{CC} \end{bmatrix}, \quad b = \begin{bmatrix} 0 \\ \Theta_{CB} \end{bmatrix}, \quad c = \begin{bmatrix} 0 & \Theta_{BC} \end{bmatrix}, \quad d = \Theta_{BB} ,$$

$$e = a - bd^{-1}c = \begin{bmatrix} \Theta_{AA} & \Theta_{AC} \\ \Theta_{CA} & \Theta_{CC} \end{bmatrix} - \begin{bmatrix} 0 \\ \Theta_{CB} \end{bmatrix}\Theta_{BB}^{-1}\begin{bmatrix} 0 & \Theta_{BC} \end{bmatrix} = \begin{bmatrix} \Theta_{AA} & \Theta_{AC} \\ \Theta_{CA} & \Theta_{CC} - \Theta_{CB}(\Theta_{BB})^{-1}\Theta_{BC} \end{bmatrix} ,$$

in

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \begin{bmatrix} e^{-1} & -e^{-1}bd^{-1} \\ -d^{-1}ce^{-1} & d^{-1} + d^{-1}ce^{-1}bd^{-1} \end{bmatrix} , \qquad (5.24)$$

we have

$$\begin{bmatrix} \Sigma_{A\cup C} & * \\ * & * \end{bmatrix} = \Theta^{-1} = \begin{bmatrix} \Theta_{AA} & \Theta_{AC} & 0 \\ \Theta_{CA} & \Theta_{CC} & \Theta_{CB} \\ 0 & \Theta_{BC} & \Theta_{BB} \end{bmatrix}^{-1} = \begin{bmatrix} \begin{bmatrix} \Theta_{AA} & \Theta_{AC} \\ \Theta_{CA} & \Theta_{CC} - \Theta_{CB}(\Theta_{BB})^{-1}\Theta_{BC} \end{bmatrix}^{-1} & * \\ * & * \end{bmatrix} ,$$

which means

$$(\Sigma_{A\cup C})^{-1} = \begin{bmatrix} \Theta_{AA} & \Theta_{AC} \\ \Theta_{CA} & \Theta_{CC} - \Theta_{CB}(\Theta_{BB})^{-1}\Theta_{BC} \end{bmatrix} , \qquad (5.25)$$

where (5.24) is due to

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} e^{-1} & -e^{-1}bd^{-1} \\ -d^{-1}ce^{-1} & d^{-1}(I + ce^{-1}bd^{-1}) \end{bmatrix}
= \begin{bmatrix} ae^{-1} - bd^{-1}ce^{-1} & -ae^{-1}bd^{-1} + bd^{-1}(I + ce^{-1}bd^{-1}) \\ ce^{-1} - ce^{-1} & -ce^{-1}bd^{-1} + I + ce^{-1}bd^{-1} \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} .$$

Similarly, we have

$$(\Sigma_C)^{-1} = \Theta_{CC} - \begin{bmatrix} \Theta_{CA} & \Theta_{CB} \end{bmatrix}\begin{bmatrix} \Theta_{AA} & 0 \\ 0 & \Theta_{BB} \end{bmatrix}^{-1}\begin{bmatrix} \Theta_{AC} \\ \Theta_{BC} \end{bmatrix} = \Theta_{CC} - \Theta_{CA}(\Theta_{AA})^{-1}\Theta_{AC} - \Theta_{CB}(\Theta_{BB})^{-1}\Theta_{BC} .$$

Thus, we have

$$\det\begin{bmatrix} I & (\Theta_{AA})^{-1}\Theta_{AC} \\ \Theta_{CA} & \Theta_{CC} - \Theta_{CB}(\Theta_{BB})^{-1}\Theta_{BC} \end{bmatrix} = \det(\Sigma_C)^{-1} . \qquad (5.26)$$

Furthermore, from the identity

$$\begin{bmatrix} (\Theta_{AA})^{-1} & 0 \\ 0 & I \end{bmatrix}\begin{bmatrix} \Theta_{AA} & \Theta_{AC} \\ \Theta_{CA} & \Theta_{CC} - \Theta_{CB}(\Theta_{BB})^{-1}\Theta_{BC} \end{bmatrix} = \begin{bmatrix} I & (\Theta_{AA})^{-1}\Theta_{AC} \\ \Theta_{CA} & \Theta_{CC} - \Theta_{CB}(\Theta_{BB})^{-1}\Theta_{BC} \end{bmatrix}$$

and (5.25), (5.26), we have

$$\det\Sigma_{A\cup C} = \frac{\det\Sigma_C}{\det\Theta_{AA}} . \qquad (5.27)$$

On the other hand, from

$$\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = \det\left(\begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} I & O \\ -d^{-1}c & I \end{bmatrix}\right) = \det(a - bd^{-1}c)\det d ,$$

we have

$$\det\Theta = \det\begin{bmatrix} \Theta_{B\cup C} & \begin{bmatrix} 0 \\ \Theta_{CA} \end{bmatrix} \\ \begin{bmatrix} 0 & \Theta_{AC} \end{bmatrix} & \Theta_{AA} \end{bmatrix}
= \det\left(\Theta_{B\cup C} - \begin{bmatrix} 0 \\ \Theta_{CA} \end{bmatrix}(\Theta_{AA})^{-1}\begin{bmatrix} 0 & \Theta_{AC} \end{bmatrix}\right)\cdot\det\Theta_{AA} = \frac{\det\Theta_{AA}}{\det\Sigma_{B\cup C}} . \qquad (5.28)$$

By multiplying both sides of (5.27) and (5.28), we have

$$\det\Theta = \frac{\det\Sigma_C}{\det\Sigma_{A\cup C}\det\Sigma_{B\cup C}} .$$

From det Θ = (det Σ)^{-1} = (det Σ_{A∪B∪C})^{-1}, this proves the proposition. □

Proposition 17 (Pearl and Paz, 1985) Suppose that we construct an undirected graph (V, E) such that {i, j} ∈ E ⟺ X_i ⊥̸⊥_P X_j | X_S for i, j ∈ V and S ⊆ V. Then, X_A ⊥⊥_P X_B | X_C ⟺ A ⊥⊥_E B | C for all disjoint subsets A, B, C of V if and only if the probability P satisfies the following conditions:

$$X_A \perp\!\!\!\perp_P X_B \mid X_C \iff X_B \perp\!\!\!\perp_P X_A \mid X_C \qquad (5.4)$$

$$X_A \perp\!\!\!\perp_P (X_B \cup X_D) \mid X_C \;\Longrightarrow\; X_A \perp\!\!\!\perp_P X_B \mid X_C ,\; X_A \perp\!\!\!\perp_P X_D \mid X_C \qquad (5.5)$$

$$X_A \perp\!\!\!\perp_P X_B \mid (X_C \cup X_D) ,\; X_A \perp\!\!\!\perp_P X_D \mid (X_B \cup X_C) \;\Longrightarrow\; X_A \perp\!\!\!\perp_P (X_B \cup X_D) \mid X_C \qquad (5.6)$$

$$X_A \perp\!\!\!\perp_P X_B \mid X_C \;\Longrightarrow\; X_A \perp\!\!\!\perp_P X_B \mid (X_C \cup X_D) \qquad (5.7)$$

$$X_A \perp\!\!\!\perp_P X_B \mid X_C \;\Longrightarrow\; X_A \perp\!\!\!\perp_P X_k \mid X_C \text{ or } X_k \perp\!\!\!\perp_P X_B \mid X_C \qquad (5.8)$$

for k ∉ A ∪ B ∪ C, where A, B, C, D are disjoint subsets of V for each equation.

Proof: Note that the first four conditions imply

X_A ⊥⊥_P X_B | X_C ⇐⇒ X_i ⊥⊥_P X_j | X_C , i ∈ A, j ∈ B                          (5.29)

because the converse of (5.5) also holds:

X_A ⊥⊥_P X_B | X_C , X_A ⊥⊥_P X_D | X_C =⇒ X_A ⊥⊥_P X_B | (X_C ∪ X_D) , X_A ⊥⊥_P X_D | (X_C ∪ X_B)
                                        =⇒ X_A ⊥⊥_P (X_B ∪ X_D) | X_C ,

where (5.7) and (5.6) have been used. Thus, it is sufficient to show

X_i ⊥⊥_P X_j | X_S ⇐⇒ i ⊥⊥_E j | S

for i, j ∈ V and S ⊆ V. From the construction of the undirected graph, ⇐= is
obvious.
For =⇒, if |S| = p − 2, the claim holds because (V, E) was constructed so.
Suppose that the claim holds for |S| = r ≤ p − 2 and that the size of S′ is |S′| =
r − 1, and let k ∉ S′ ∪ {i, j}.
1. From (5.7), we have i ⊥⊥_P j | (S′ ∪ k).
2. From (5.8), we have either i ⊥⊥_P k | S′ or j ⊥⊥_P k | S′, which means that we have
   i ⊥⊥_P k | (S′ ∪ j) from (5.7).
3. Since the sizes of S′ ∪ j and S′ ∪ k are r, by the induction hypothesis, from the first
   and second items, we have i ⊥⊥_E j | (S′ ∪ k) and i ⊥⊥_E k | (S′ ∪ j).
4. From (5.6) and the third item, we have i ⊥⊥_E j | S′,
which establishes =⇒. □

Exercises 62–75

62. In what graph among (a) through (f) do the red vertices separate the blue and
green vertices?
(a) (b) (c)  (d) (e) (f)
[Figure: six undirected graphs (a)–(f), each with red, blue, and green vertices, are shown here in the original.]

63. Let X, Y be binary variables that take zeros and ones and are independent, and
    let Z be the residue of their sum when divided by two. Show that X, Y are not
    conditionally independent given Z. Note that the probabilities p and q of X = 1
    and Y = 1 are not necessarily 0.5 and that we do not assume p = q.
64. For the precision matrix Θ = (θ_{i,j})_{i,j=1,2,3} with θ_{12} = θ_{21} = 0 and Σ := Θ^{-1}, show

    det Σ_{1,2,3} det Σ_{3} = det Σ_{1,3} det Σ_{2,3} ,

    where by Σ_S, we mean the submatrix of Σ that consists of rows and columns
    with indices S ⊆ {1, 2, 3}.
65. Suppose that we express the probability density function of the p Gaussian
    variables with mean zero and precision matrix Θ by

    $$f(x) = \sqrt{\frac{\det\Theta}{(2\pi)^p}}\,\exp\!\left(-\frac{1}{2}x^T\Theta x\right)\quad (x \in \mathbb{R}^p)$$

    and let A, B, C be disjoint subsets of {1, . . . , p}. Show that if

    f_{A∪C}(x_{A∪C}) f_{B∪C}(x_{B∪C}) = f_{A∪B∪C}(x_{A∪B∪C}) f_C(x_C)   (cf. (5.2))

    for arbitrary x ∈ R^p, then θ_{i,j} = 0 for i ∈ A, j ∈ B and i ∈ B, j ∈ A.


66. Suppose that X ∈ R^{N×p} consists of N samples, each of which has been generated
    according to the p-variate Gaussian distribution N(0, Σ) (Σ ∈ R^{p×p}), and
    let S := (1/N) X^T X, Θ := Σ^{-1}. Show the following statements.

    (a) Suppose that λ = 0 and N < p. Then, no inverse matrix exists for S in

        Θ^{-1} − S − λΨ = 0                                           (5.30)

        and no maximum likelihood estimate of Θ exists.
        Hint The rank of S is at most N, and S ∈ R^{p×p}.
    (b) The trace of SΘ with Θ = (θ_{s,t}) can be written as

        (1/N) Σ_{i=1}^N Σ_{s=1}^p Σ_{t=1}^p x_{i,s} θ_{s,t} x_{i,t} .

        Hint If the multiplications AB and BA of matrices A, B are defined, they
        have an equal trace, and

        trace(SΘ) = (1/N) trace(X^T X Θ) = (1/N) trace(X Θ X^T)

        when A = X^T and B = XΘ.
    (c) If we express the probability density function of the p Gaussian variables as
        f_Θ(x_1, . . . , x_p), then the log-likelihood (1/N) Σ_{i=1}^N log f_Θ(x_{i,1}, . . . , x_{i,p}) can
        be written as

        (1/2) {log det Θ − p log(2π) − trace(SΘ)} .                    (cf. (5.9))

    (d) For λ ≥ 0, the maximization of (1/N) Σ_{i=1}^N log f_Θ(x_i) − (λ/2) Σ_{s≠t} |θ_{s,t}| w.r.t. Θ is
        that of

        log det Θ − trace(SΘ) − λ Σ_{s≠t} |θ_{s,t}|                   (cf. (5.10)) . (5.31)

67. Let A_{i,j} and |B| be the submatrix that excludes the i-th row and j-th column
    from matrix A ∈ R^{p×p} and the determinant of B ∈ R^{m×m} (m ≤ p), respectively.
    Then, we have

    Σ_{j=1}^p (−1)^{k+j} a_{i,j} |A_{k,j}| = |A| (i = k),  0 (i ≠ k) .      (cf. (5.11))

    (a) When |A| ≠ 0, let B be the matrix whose (j, k) element is b_{j,k} = (−1)^{k+j} |A_{k,j}| / |A|.
        Show that AB is the unit matrix. Hereafter, we write this matrix B as A^{-1}.
    (b) Show that if we differentiate |A| by a_{i,j}, then it becomes (−1)^{i+j} |A_{i,j}|.
    (c) Show that if we differentiate log |A| by a_{i,j}, then it becomes (−1)^{i+j} |A_{i,j}| / |A|,
        the (j, i)-th element of A^{-1}.
    (d) Show that if we differentiate the trace of SΘ by the (s, t)-element θ_{s,t} of Θ,
        it becomes the (t, s)-th element of S.
        Hint Differentiate trace(SΘ) = Σ_{i=1}^p Σ_{j=1}^p s_{i,j} θ_{j,i} by θ_{s,t}.
    (e) Show that the Θ that maximizes (5.31) is the solution of (5.12), where
        Ψ = (ψ_{s,t}) satisfies ψ_{s,t} = 0 if s = t and

        ψ_{s,t} = 1 (θ_{s,t} > 0),  [−1, 1] (θ_{s,t} = 0),  −1 (θ_{s,t} < 0)

        otherwise.
        Hint If we differentiate log det Θ by θ_{s,t}, it becomes the (t, s)-th element
        of Θ^{-1} from (d). However, because Θ is symmetric, Θ^{-1} is symmetric as well:
        Θ^T = Θ =⇒ (Θ^{-1})^T = (Θ^T)^{-1} = Θ^{-1}.

68. Suppose that we have

    $$\Psi = \begin{bmatrix}\Psi_{1,1} & \psi_{1,2}\\ \psi_{2,1} & \psi_{2,2}\end{bmatrix},\quad
    S = \begin{bmatrix}S_{1,1} & s_{1,2}\\ s_{2,1} & s_{2,2}\end{bmatrix},\quad
    \Theta = \begin{bmatrix}\Theta_{1,1} & \theta_{1,2}\\ \theta_{2,1} & \theta_{2,2}\end{bmatrix},$$

    $$\begin{bmatrix}W_{1,1} & w_{1,2}\\ w_{2,1} & w_{2,2}\end{bmatrix}
    \begin{bmatrix}\Theta_{1,1} & \theta_{1,2}\\ \theta_{2,1} & \theta_{2,2}\end{bmatrix}
    = \begin{bmatrix}I_{p-1} & 0\\ 0 & 1\end{bmatrix} \quad \text{(cf. (5.13))} \tag{5.32}$$

    for Ψ, Θ, S ∈ R^{p×p} and W such that W Θ = I, where the upper-left part is
    (p − 1) × (p − 1) and we assume that θ_{2,2} > 0.
    (a) Derive
        w_{1,2} − s_{1,2} − λ ψ_{1,2} = 0                              (cf. (5.14)) (5.33)
        and
        W_{1,1} θ_{1,2} + w_{1,2} θ_{2,2} = 0                          (cf. (5.15)) (5.34)
        from the upper-right parts of (5.30) and (5.32), respectively.
        Hint The upper-right part of Θ^{-1} is w_{1,2}.
    (b) Let β = [β_1, . . . , β_{p−1}]^T := −θ_{1,2} / θ_{2,2}. Show that from (5.33), (5.34), the two
        equations

        W_{1,1} β − s_{1,2} + λ φ_{1,2} = 0                            (cf. (5.16)) (5.35)

        w_{1,2} = W_{1,1} β                                            (cf. (5.17)) (5.36)

        are obtained, where the j-th element of φ_{1,2} ∈ R^{p−1} is 1 (β_j > 0), [−1, 1] (β_j = 0),
        −1 (β_j < 0).
        Hint The sign of each element of ψ_{1,2} ∈ R^{p−1} is determined by the sign of
        θ_{1,2}. From θ_{2,2} > 0, the signs of β and θ_{1,2} are opposite; thus, ψ_{1,2} = −φ_{1,2}.
69. We obtain the solution of (5.30) in the following way: find the solution of (5.35)
w.r.t. β, substitute it into (5.36), and obtain w1,2 , which is the same as w2,1 due
to symmetry.

    We repeat the process, changing the positions of W_{1,1}, w_{1,2}. For j = 1, . . . , p,
    W_{1,1} is regarded as W except in the j-th row and j-th column, and w_{1,2} is
    regarded as the j-th column of W except for the j-th element.
    In the last stage, for j = 1, . . . , p, if we take the following one cycle, we obtain
    the estimate of Θ:

    θ_{2,2} = [w_{2,2} − w_{1,2}^T β]^{-1}                             (cf. (5.19)) (5.37)

    θ_{1,2} = −β θ_{2,2}                                               (cf. (5.20)) (5.38)

    (a) Let A := W_{1,1} ∈ R^{(p−1)×(p−1)}, b := s_{1,2} ∈ R^{p−1}, and c_j := b_j −
        Σ_{k≠j} a_{j,k} β_k. Show that each β_j that satisfies (5.35) can be computed via

        $$\beta_j = \begin{cases}\dfrac{c_j - \lambda}{a_{j,j}}, & c_j > \lambda\\[4pt] 0, & -\lambda \le c_j \le \lambda\\[4pt] \dfrac{c_j + \lambda}{a_{j,j}}, & c_j < -\lambda\end{cases} \quad \text{(cf. (5.18))} \tag{5.39}$$
(b) Derive (5.37) from (5.32).

70. We construct the graphical Lasso as follows.


1 library(glasso)
2 solve(s); glasso(s, rho=0); glasso(s, rho = 0.01)

Execute each step, and compare the results.


1 inner.prod = function(x, y) return(sum(x * y))
2 soft.th = function(lambda, x) return(sign(x) * pmax(abs(x) - lambda, 0))
3 graph.lasso = function(s, lambda = 0) {
4   W = s; p = ncol(s); beta = matrix(0, nrow = p - 1, ncol = p)
5   beta.out = beta; eps.out = 1
6   while (eps.out > 0.01) {
7     for (j in 1:p) {
8       a = W[-j, -j]; b = s[-j, j]
9       beta.in = beta[, j]; eps.in = 1
10       while (eps.in > 0.01) {
11         for (h in 1:(p - 1)) {
12           cc = b[h] - inner.prod(a[h, -h], beta[-h, j])
13           beta[h, j] = soft.th(lambda, cc) / a[h, h]
14         }
15         eps.in = max(beta[, j] - beta.in); beta.in = beta[, j]
16       }
17       W[-j, j] = W[-j, -j] %*% beta[, j]
18     }
19     eps.out = max(beta - beta.out); beta.out = beta
20   }
21   theta = matrix(nrow = p, ncol = p)
22   for (j in 1:p) {
23     theta[j, j] = 1 / (W[j, j] - W[j, -j] %*% beta[, j])
24     theta[-j, j] = -beta[, j] * theta[j, j]
25   }
26   return(theta)
27 }

What rows of the function definition of graph.lasso are the Eqs. (5.36)–
(5.39)?
Then, we can construct an undirected graph G by connecting each s, t such that
θs,t = 0 as an edge. Generate the data for p = 5 and N = 20 based on a matrix
 = (θs,t ) known a priori.
1 library(MultiRNG)
2 Theta = matrix(c(   2,  0.6,    0,    0,  0.5,
3                   0.6,    2, -0.4,  0.3,    0,    0, -0.4,    2, -0.2,    0,
4                     0,  0.3, -0.2,    2, -0.2,  0.5,    0,    0, -0.2,    2), nrow = 5)
5 Sigma = ## Blank ##; meanvec = rep(0, 5)
6 # mean: mean.vec, cov matrix: cov.mat, samples #: no.row, variable #: d
7 # Generate the sample matrix
8 dat = draw.d.variate.normal(no.row = 20, d = 5, mean.vec = meanvec, cov.mat = Sigma)

Moreover, execute the following code, and examine whether the precision matrix
Θ is correctly estimated.
1 s = t(dat) %*% dat / nrow(dat); graph.lasso(s); graph.lasso(s, lambda = 0.01)

71. The function adj defined below connects each (i, j) as an edge if and only if the
element is nonzero given a symmetric matrix of size p to construct an undirected
graph. Execute it for the breastcancer.csv data.
1 library(igraph)
2 adj = function(mat) {
3   p = ncol(mat); ad = matrix(0, nrow = p, ncol = p)
4   for (i in 1:(p - 1)) for (j in (i + 1):p) {
5     if (## Blank ##) {ad[i, j] = 0} else {ad[i, j] = 1}
6   }
7   g = graph.adjacency(ad, mode = "undirected")
8   plot(g)
9 }
10
11 library(glasso)
12 df = read.csv("breastcancer.csv")
13 w = matrix(nrow = 250, ncol = 1000)
14 for (i in 1:1000) w[, i] = as.numeric(df[[i]])
15 x = w; s = t(x) %*% x / 250
16 fit = glasso(s, rho = 1); sum(fit$wi == 0); adj(fit$wi)
17 fit = glasso(s, rho = 0.5); sum(fit$wi == 0); adj(fit$wi)
18 y = NULL; z = NULL
19 for (i in 1:999) for (j in (i + 1):1000) {
20   if (fit$wi[i, j] != 0) {y = c(y, i); z = c(z, j)}
21 }
22 cbind(y, z)
23 ## Restrict to the first 100 genes
24 x = x[, 1:100]; s = t(x) %*% x / 250
25 fit = glasso(s, rho = 0.75); sum(fit$wi == 0); adj(fit$wi)
26 y = NULL; z = NULL
27 for (i in 1:99) for (j in (i + 1):100) {
28   if (fit$wi[i, j] != 0) {y = c(y, i); z = c(z, j)}
29 }
30 cbind(y, z)
31 fit = glasso(s, rho = 0.25); sum(fit$wi == 0); adj(fit$wi)

72. The following code generates an undirected graph via the quasi-likelihood
method and the glmnet package. We examine the difference between using
the AND and OR rules, where the original data contains p = 1,000 but the exe-
cution is for the first p = 50 genes to save time. Execute the OR rule case as
well as the AND case by modifying the latter.
1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 n = 250; p = 50; w = matrix(nrow = n, ncol = p)
4 for (i in 1:p) w[, i] = as.numeric(df[[i]])
5 x = w[, 1:p]; fm = rep("gaussian", p); lambda = 0.1
6 fit = list()
7 for (j in 1:p) fit[[j]] = glmnet(x[, -j], x[, j], family = fm[j], lambda = lambda)
8 ad = matrix(0, p, p)
9 for (i in 1:p) for (j in 1:(p - 1)) {
10   k = j
11   if (j >= i) k = j + 1
12   if (fit[[i]]$beta[j] != 0) {ad[i, k] = 1} else {ad[i, k] = 0}
13 }
14 ## AND
15 for (i in 1:(p - 1)) for (j in (i + 1):p) {
16   if (ad[i, j] != ad[j, i]) {ad[i, j] = 0; ad[j, i] = 0}
17 }
18 u = NULL; v = NULL
19 for (i in 1:(p - 1)) for (j in (i + 1):p) {
20   if (ad[i, j] == 1) {u = c(u, i); v = c(v, j)}
21 }
22 ## OR

We execute the quasi-likelihood method for the signs ± of breastcancer.csv.


1 library(glmnet)
2 df = read.csv("breastcancer.csv")
3 w = matrix(nrow = 250, ncol = 1000); for (i in 1:1000) w[, i] = as.numeric(df[[i]])
4 w = (sign(w) + 1) / 2 ## Transform to Binary
5 p = 50; x = w[, 1:p]; fm = rep("binomial", p); lambda = 0.15
6 fit = list()
7 for (j in 1:p) fit[[j]] = glmnet(x[, -j], x[, j], family = fm[j], lambda = lambda)
8 ad = matrix(0, nrow = p, ncol = p)
9 for (i in 1:p) for (j in 1:(p - 1)) {
10   k = j
11   if (j >= i) k = j + 1
12   if (fit[[i]]$beta[j] != 0) {ad[i, k] = 1} else {ad[i, k] = 0}
13 }
14 for (i in 1:(p - 1)) for (j in (i + 1):p) {
15   if (ad[i, j] != ad[j, i]) {ad[i, j] = 0; ad[j, i] = 0}
16 }
17 sum(ad); adj(ad)

How can we deal with data that contain both continuous and discrete values?
73. Joint graphical Lasso (JGL) finds Θ_1, . . . , Θ_K that maximize

    Σ_{k=1}^K N_k {log det Θ_k − trace(S_k Θ_k)} − P(Θ_1, . . . , Θ_K)          (cf. (5.21)) (5.40)

    given X ∈ R^{N×p}, y ∈ {1, 2, . . . , K}^N (K ≥ 2). Suppose that P(Θ_1, . . . , Θ_K)
    expresses the fused Lasso penalty

    P(Θ_1, . . . , Θ_K) := λ_1 Σ_k Σ_{i≠j} |θ_{i,j}^{(k)}| + λ_2 Σ_{k<k′} Σ_{i,j} |θ_{i,j}^{(k)} − θ_{i,j}^{(k′)}| ,   (cf. (5.22))

    where the indices in the sums range over k = 1, . . . , K and i, j = 1, . . . , p. In JGL, to
    obtain the solution of (5.40), we apply the ADMM that was introduced in Chap. 4.
    We define the extended Lagrangian as

    L_ρ(Θ, Z, U) := − Σ_{k=1}^K N_k {log det Θ_k − trace(S_k Θ_k)}
                    + P(Z_1, . . . , Z_K) + ρ Σ_{k=1}^K ⟨U_k, Θ_k − Z_k⟩ + (ρ/2) Σ_{k=1}^K ∥Θ_k − Z_k∥_F^2   (cf. (5.23))

    and repeat the following steps:
    i.   Θ^{(t)} ← argmin_Θ L_ρ(Θ, Z^{(t−1)}, U^{(t−1)})
    ii.  Z^{(t)} ← argmin_Z L_ρ(Θ^{(t)}, Z, U^{(t−1)})
    iii. U^{(t)} ← U^{(t−1)} + ρ(Θ^{(t)} − Z^{(t)})

    More precisely, setting ρ > 0 and letting Θ_k and Z_k, U_k be the unit and zero
    matrices for k = 1, . . . , K, respectively, repeat the above three steps.
    (a) Show that if we differentiate (5.40) by Θ_k in the first step, then we have

        −N_k(Θ_k^{-1} − S_k) + ρ(Θ_k − Z_k + U_k) = 0 .

    (b) We wish to obtain the optimum Θ_k in the first step. To this end, we decompose
        the symmetric matrix on both sides of

        Θ_k^{-1} − (ρ/N_k) Θ_k = S_k − ρ Z_k / N_k + ρ U_k / N_k

        as V D V^T and obtain D̃ such that

        D̃_{j,j} = (N_k / (2ρ)) (−D_{j,j} + √(D_{j,j}^2 + 4ρ/N_k))

        from the diagonal matrix D. Show that V D̃ V^T is the optimum Θ_k.
    (c) In the second step, for K = 2, we require a fused Lasso procedure for two
        values. Let y_1, y_2 be the two sets of data. Derive the θ_1, θ_2 that minimize

        (1/2)(y_1 − θ_1)^2 + (1/2)(y_2 − θ_2)^2 + λ|θ_1 − θ_2| .
2 2
74. We construct the fused Lasso JGL. Fill in the blanks, and execute the procedure.

1 ## Fused Lasso for size 2
2 b.fused = function(y, lambda) {
3   if (y[1] > y[2] + 2 * lambda) {a = y[1] - lambda; b = y[2] + lambda}
4   else if (y[1] < y[2] - 2 * lambda) {a = y[1] + lambda; b = y[2] - lambda}
5   else {a = (y[1] + y[2]) / 2; b = a}
6   return(c(a, b))
7 }
8 ## Fused Lasso that compares all the adjacency values
9 fused = function(y, lambda.1, lambda.2) {
10   K = length(y)
11   if (K == 1) {theta = y}
12   else if (K == 2) {theta = b.fused(y, lambda.2)}
13   else {
14     L = K * (K - 1) / 2; D = matrix(0, nrow = L, ncol = K)
15     k = 0
16     for (i in 1:(K - 1)) for (j in (i + 1):K) {
17       k = k + 1; ## Blank (1) ##
18     }
19     out = genlasso(y, D = D)
20     theta = coef(out, lambda = lambda.2)
21   }
22   theta = soft.th(lambda.1, theta)
23   return(theta)
24 }
25 ## Joint Graphical Lasso
26 jgl = function(X, lambda.1, lambda.2) {  # X is given as a list
27   K = length(X); p = ncol(X[[1]]); n = array(dim = K); S = list()
28   for (k in 1:K) {n[k] = nrow(X[[k]]); S[[k]] = t(X[[k]]) %*% X[[k]] / n[k]}
29   rho = 1; lambda.1 = lambda.1 / rho; lambda.2 = lambda.2 / rho
30   Theta = list(); for (k in 1:K) Theta[[k]] = diag(p)
31   Theta.old = list(); for (k in 1:K) Theta.old[[k]] = diag(rnorm(p))
32   U = list(); for (k in 1:K) U[[k]] = matrix(0, nrow = p, ncol = p)
33   Z = list(); for (k in 1:K) Z[[k]] = matrix(0, nrow = p, ncol = p)
34   epsilon = 0; epsilon.old = 1
35   while (abs((epsilon - epsilon.old) / epsilon.old) > 0.0001) {
36     Theta.old = Theta; epsilon.old = epsilon
37     ## Update (i)
38     for (k in 1:K) {
39       mat = S[[k]] - rho * Z[[k]] / n[k] + rho * U[[k]] / n[k]
40       svd.mat = svd(mat)
41       V = svd.mat$v
42       D = svd.mat$d
43       DD = ## Blank (2) ##
44       Theta[[k]] = ## Blank (3) ##
45     }
46     ## Update (ii)
47     for (i in 1:p) for (j in 1:p) {
48       A = NULL; for (k in 1:K) A = c(A, Theta[[k]][i, j] + U[[k]][i, j])
49       if (i == j) {B = fused(A, 0, lambda.2)} else {B = fused(A, lambda.1, lambda.2)}
50       for (k in 1:K) Z[[k]][i, j] = B[k]
51     }
52     ## Update (iii)
53     for (k in 1:K) U[[k]] = U[[k]] + Theta[[k]] - Z[[k]]
54     ## test convergence
55     epsilon = 0
56     for (k in 1:K) {
57       epsilon.new = max(abs(Theta[[k]] - Theta.old[[k]]))
58       if (epsilon.new > epsilon) epsilon = epsilon.new
59     }
60   }
61   return(Z)
62 }
63
64 ## Data Generation and Execute
65 p = 10; K = 2; N = 100; n = array(dim = K); for (k in 1:K) n[k] = N / K
66 X = list(); X[[1]] = matrix(rnorm(n[k] * p), ncol = p)
67 for (k in 2:K) X[[k]] = X[[k - 1]] + matrix(rnorm(n[k] * p) * 0.1, ncol = p)
68 ## Change lambda.1, lambda.2 and Execute
69 Theta = jgl(X, 3, 0.01)
70 par(mfrow = c(1, 2)); adj(Theta[[1]]); adj(Theta[[2]])

75. For the group Lasso, only the second step should be replaced. Let A_k[i, j] =
    Θ_k[i, j] + U_k[i, j]. Then, no update is required for i = j, and

    $$Z_k[i, j] = S_{\lambda_1/\rho}(A_k[i, j])\left(1 - \frac{\lambda_2}{\rho\sqrt{\sum_{k=1}^K S_{\lambda_1/\rho}(A_k[i, j])^2}}\right)_{\!+}$$

    for i ≠ j. We construct the code. Fill in the blank, and execute the procedure as
    in the previous exercise.
    1 for (i in 1:p) for (j in 1:p) {
    2   A = NULL; for (k in 1:K) A = c(A, Theta[[k]][i, j] + U[[k]][i, j])
    3   if (i == j) {B = A} else {B = ## Blank ##}
    4   for (k in 1:K) Z[[k]][i, j] = B[k]
    5 }
Chapter 6
Matrix Decomposition

Thus far, by Lasso, we mean making the estimated coefficients of some variables
zero if the absolute values are significantly small for regression, classification, and
graphical models. In this chapter, we consider dealing with matrices. Suppose that
the given data take the form of a matrix, such as in image processing. We wish to
approximate an image by a low-rank matrix after singular decomposition. However,
the function needed to obtain the rank from a matrix is not convex. In this chapter,
we introduce matrix norms w.r.t. the singular values, such as the nuclear norm and
spectral norm, and formulate the problem as convex optimization, as we did for other
Lasso classes.
We list several formulas of matrix arithmetic, where we write the unit matrix of
size n × n as In .
i. By ∥Z∥_F, we denote the Frobenius norm of matrix Z, the square root of the
   square sum of the elements in Z ∈ R^{m×n}. In general, we have

   ∥Z∥_F^2 = trace(Z^T Z) .                                           (6.1)

   In fact, because the (i, j)-th element w_{i,j} of Z^T Z is Σ_{k=1}^m z_{k,i} z_{k,j}, we have

   trace(Z^T Z) = Σ_{i=1}^n w_{i,i} = Σ_{i=1}^n Σ_{k=1}^m z_{k,i}^2 = ∥Z∥_F^2 .

ii. For l, m, n ≥ 1 and matrices A = (a_{i,j}) ∈ R^{l×m}, B = (b_{i,j}) ∈ R^{m×n}, C =
    (c_{i,j}) ∈ R^{n×l}, the (i, j)-th elements of ABC and BCA are u_{i,j} := Σ_{k=1}^m Σ_{h=1}^n
    a_{i,k} b_{k,h} c_{h,j} and v_{i,j} := Σ_{k=1}^n Σ_{h=1}^l b_{i,k} c_{k,h} a_{h,j}. If we take their traces, we have

    trace(ABC) = Σ_{i=1}^l u_{i,i} = Σ_{i=1}^l Σ_{k=1}^m Σ_{h=1}^n a_{i,k} b_{k,h} c_{h,i} = Σ_{i=1}^m v_{i,i} = trace(BCA) .   (6.2)
iii. We can choose mutually orthogonal vectors as the basis of a vector space. If the
lengths of the vectors in the mutually orthogonal basis are one, then the basis is
said to be orthonormal. The columns u 1 , · · · , u n ∈ Rn of U ∈ Rn×n making an
orthonormal basis of the vector space Rn are equivalent to U T U = UU T =
In . Moreover, if U = [u 1 , · · · , u n ] consists of orthonormal column vectors,
for Ur = [u i(1) , · · · , u i(r ) ] ∈ Rn×r with r ≤ n and 1 ≤ i(1) ≤ · · · ≤ i(r ) ≤ n,
UrT Ur ∈ Rr ×r is the unit matrix, but Ur UrT ∈ Rn×n is not.
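
Formulas (6.1) and (6.2) are easy to confirm numerically. The following lines are a small check that is not part of the original text; the dimensions l, m, n are arbitrary choices.

## Check (6.1) and (6.2) on random matrices (sketch)
set.seed(1)
l = 3; m = 4; n = 5
Z = matrix(rnorm(m * n), nrow = m)
norm(Z, "F")^2 - sum(diag(t(Z) %*% Z))               # (6.1): essentially zero
A = matrix(rnorm(l * m), nrow = l); B = matrix(rnorm(m * n), nrow = m)
C = matrix(rnorm(n * l), nrow = n)
sum(diag(A %*% B %*% C)) - sum(diag(B %*% C %*% A))  # (6.2): essentially zero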

6.1 Singular Decomposition

Let m ≥ n and Z ∈ R^{m×n}. If we write the eigenvalues and eigenvectors of the nonnega-
tive definite matrix Z^T Z ∈ R^{n×n} as λ_1 ≥ · · · ≥ λ_n ∈ R and v_1, . . . , v_n ∈ R^n, respec-
tively, we have
Z^T Z v_i = λ_i v_i ,

where v_1, . . . , v_n ∈ R^n have unit length. We assume that if some eigenvalues
coincide, we orthogonalize the corresponding eigenvectors, which means that v_1, . . . , v_n ∈ R^n comprise an
orthonormal basis. Then, we have

Z^T Z V = V D^2 ,

where V = [v_1, . . . , v_n] ∈ R^{n×n} and D is a diagonal matrix with the elements d_1 :=
√λ_1, . . . , d_n := √λ_n. From V V^T = I_n, we have

Z^T Z = V D^2 V^T .                                                    (6.3)

Suppose d_n > 0, which means that D has an inverse. We define U ∈ R^{m×n} as

U := Z V D^{-1} .

Then, Z can be written as Z = U D V^T, which we refer to as the singular decompo-
sition of Z. From (6.3) and V^T V = I_n, we have

U^T U = (Z V D^{-1})^T Z V D^{-1} = D^{-1} V^T Z^T Z V D^{-1} = D^{-1} V^T (V D^2 V^T) V D^{-1} = I_n .

The ranks of Z^T Z and Z coincide. In fact, for arbitrary x ∈ R^n, we have

Z x = 0 =⇒ Z^T Z x = 0
Z^T Z x = 0 =⇒ x^T Z^T Z x = 0 =⇒ ∥Z x∥^2 = 0 =⇒ Z x = 0 ;

thus, the kernels of Z ∈ R^{m×n} and Z^T Z ∈ R^{n×n} coincide, and the numbers of columns are
equal, so the ranks are equal. Thus, the r such that d_r > 0, d_{r+1} = 0 coincides with the rank of Z.
We now consider the general case r ≤ n. Let V_r := [v_1, . . . , v_r] ∈ R^{n×r} and D_r ∈
R^{r×r} be a diagonal matrix with the elements d_1, . . . , d_r. Then, if we regard the
row vectors d_1 v_1^T, . . . , d_r v_r^T ∈ R^n of D_r V_r^T ∈ R^{r×n} (rank: r) as a basis of the rows
z_i ∈ R^n (i = 1, . . . , m) of Z, we find that u_{i,j} ∈ R exists and is unique such that
z_i^T = Σ_{j=1}^r u_{i,j} d_j v_j^T. Thus, U_r = (u_{i,j}) ∈ R^{m×r} such that Z = U_r D_r V_r^T is unique.
Moreover, (6.3) means that U_r^T U_r = I_r. In fact, let V_{n−r} be the submatrix of V
that excludes V_r. From V_r^T V_r = I_r and V_r^T V_{n−r} = O ∈ R^{r×(n−r)}, we have

$$Z^T Z = V D^2 V^T \iff V_r D_r U_r^T U_r D_r V_r^T = V D^2 V^T
\iff V^T V_r D_r U_r^T U_r D_r V_r^T V = D^2
\iff \begin{bmatrix}D_r\\ O\end{bmatrix} U_r^T U_r \begin{bmatrix}D_r & O\end{bmatrix}
= \begin{bmatrix}D_r^2 & O\\ O & O\end{bmatrix}
\iff U_r^T U_r = I_r .$$

Thus, the decomposition is unique up to the choice of the basis of the eigenspaces of
V (we often write this as Z = Σ_{i=1}^r d_i u_i v_i^T).
If Z ∈ R^{m×n} (m < n), after obtaining Ū ∈ R^{n×m}, D ∈ R^{m×m}, and V̄ ∈ R^{m×m}
such that Z^T = Ū D V̄^T, we rewrite it as Z = (Ū D V̄^T)^T = V̄ D Ū^T. If we write Z
as Z = U D V^T, then U = V̄ ∈ R^{m×m}, D ∈ R^{m×m}, and V = Ū ∈ R^{n×m}, so that V is
not square.
Example 53 For $Z = \begin{bmatrix}0 & -2\\ 5 & -4\\ -1 & 1\end{bmatrix}$, we execute the singular decompositions of Z
and Z^T using the following code. If we transpose Z, then U and V are switched.

1 Z = matrix(c(0, 5, -1, -2, -4, 1), nrow = 3); Z

1 [,1] [,2]
2 [1,] 0 -2
3 [2,] 5 -4
4 [3,] -1 1

1 svd(Z)

1 $d
2 [1] 6.681937 1.533530
3 $u
4 [,1] [,2]
5 [1,] -0.1987442 0.97518046
6 [2,] -0.9570074 -0.21457290
7 [3,] 0.2112759 -0.05460347
8 $v
9 [,1] [,2]
10 [1,] -0.7477342 -0.6639982
11 [2,] 0.6639982 -0.7477342

1 svd(t(Z))

1 $d
2 [1] 6.681937 1.533530
3 $u
4 [,1] [,2]
5 [1,] -0.7477342 0.6639982
6 [2,] 0.6639982 0.7477342
7 $v
8 [,1] [,2]
9 [1,] -0.1987442 -0.97518046
10 [2,] -0.9570074 0.21457290
11 [3,] 0.2112759 0.05460347

Example 54 If Z ∈ R^{n×n} is symmetric, the eigenvalues of Z^T Z = Z^2 are the
squares of the eigenvalues of Z. In fact, from

det(Z^2 − t I) = det(Z − √t I) det(Z + √t I) ,

the solutions λ_1, . . . , λ_n for ±√t of det(Z − √t I) = 0, det(Z + √t I) = 0 are the
eigenvalues of Z, and the solutions for t ≥ 0 of det(Z^2 − t I) = 0 are their squares. For
t < 0, Z^2 − t I is positive definite, and det(Z^2 − t I) > 0.
On the other hand, if we singular-decompose Z, then D is a diagonal matrix with
the elements |λ_1|, . . . , |λ_n| (the absolute values of the eigenvalues of Z). Thus, if we
compare the eigenvalue decomposition Z = W D W^T and the singular value decompo-
sition Z = U D V^T, then we have

λ_i ≥ 0 ⇐⇒ u_i = v_i = w_i
λ_i < 0 ⇐⇒ u_i = w_i, v_i = −w_i or u_i = −w_i, v_i = w_i ,

where w_i ∈ R^n is the eigenvector associated with u_i, v_i. If Z is nonnegative defi-
nite, these decompositions coincide. In the following, we execute the singular value and
eigenvalue decompositions of the matrix $Z = \begin{bmatrix}0 & 5\\ 5 & -1\end{bmatrix}$.

1 Z = matrix(c(0, 5, 5, -1), nrow = 2); Z

1 [,1] [,2]
2 [1,] 0 5
3 [2,] 5 -1

1 svd(Z)

1 $d
2 [1] 5.524938 4.524938
3 $u
4 [,1] [,2]
5 [1,] 0.6710053 -0.7414525
6 [2,] -0.7414525 -0.6710053
7 $v
8 [,1] [,2]
9 [1,] -0.6710053 -0.7414525
10 [2,] 0.7414525 -0.6710053

1 eigen(Z)

1 $values
2 [1] 4.524938 -5.524938
3 $vectors
4 [,1] [,2]
5 [1,] -0.7414525 -0.6710053
6 [2,] -0.6710053 0.7414525

6.2 Eckart-Young’s Theorem

We herein state Eckart-Young's theorem, which plays a crucial role when approximating
a matrix by one of lower rank.

Proposition 18 (Eckart-Young's Theorem) Let m ≤ n. We are given the observation
Z = U D V^T ∈ R^{m×n}, where U ∈ R^{m×m}, V ∈ R^{n×m}, and the rank of Z is m. The
M ∈ R^{m×n} with a rank of at most r that minimizes ∥Z − M∥_F is given by U D_r V^T,
where D_r ∈ R^{m×m} is obtained by making the elements d_{r+1}, · · · , d_m in D ∈ R^{m×m}
zero.
For the proof, see the Appendix.
We can construct a function svd.r that approximates a matrix z by rank r as
follows. It sets to zero all the diagonal elements of D except the first r.
Fig. 6.1 We compare the matrix obtained after executing svd.r(z, r) with the original z. The
horizontal and vertical axes show r and the difference in the squared Frobenius norms

1 svd.r = function(z, r) {
2 n = min(nrow(z), ncol(z))
3 ss = svd(z)
4 tt = ss$u %*% diag(c(ss$d[1:r], rep(0, n - r))) %*% t(ss$v)
5 return(tt)
6 }

Example 55 Executing the function svd.r, we examine the behavior. We show


the output in Fig. 6.1.
1 m = 200; n = 150; z = matrix(rnorm(m * n), nrow = m)
2 F.norm = NULL; for (r in 1:n) {m = svd.r(z, r); F.norm = c(F.norm, norm
(z - m, "F") ^ 2)}
3 plot(1:n, F.norm, type = "l", xlab = "Rank", ylab = "Squared Frobenius
Norm")

Example 56 The following code is used to obtain another image file with a lower
rank from the original lion.jpg. In general, if an image takes the JPG form, then
each pixel takes 256 grayscale values. Moreover, if the image is in color, we have this
information for each of the blue, red, and green channels. For example, if the display
is 480 × 360 pixels in size, then it has 480 × 360 × 3 values in total. The following
code is executed for ranks r = 2, 5, 10, 20, 50, 100 (Fig. 6.2). Before executing the
procedure, we need to generate a folder /compressed under the current folder
beforehand.
rank 2, rank 5, rank 10, rank 20, rank 50, rank 100

Fig. 6.2 We approximate the red, green, and blue matrix expression of the image file lion.jpg by
several ranks r. We observe that the lower the rank is, the less clear the approximated image

1 library(jpeg)
2 image = readJPEG(’lion.jpg’)
3 rank.seq = c(2, 5, 10, 20, 50, 100)
4 mat = array(0, dim = c(nrow(image), ncol(image), 3))
5 for (j in rank.seq) {
6 for (i in 1:3) mat[, , i] = svd.r(image[, , i], j)
7 writeJPEG(mat, paste("compressed/lion_compressed", "_mat_rank_", j, "
.jpg", sep = ""))
8 }

For the next section, we assume the following situation. Suppose that the loca-
tions indexed by Ω ⊆ {1, . . . , m} × {1, . . . , n} in the matrix Z ∈ R^{m×n} are observed.
Given z_{i,j} ((i, j) ∈ Ω), we wish to obtain the matrix M ∈ R^{m×n} with a rank of at
most r that minimizes the Frobenius norm of M − Z, i.e., the M that minimizes

∥Z − M∥_F^2 + λ r                                                     (6.4)

for some λ > 0.
If we fix the rank r (which is equivalent to fixing some λ), the problem is to find
the matrix M with rank r that minimizes

Σ_{(i,j)∈Ω} (z_{i,j} − m_{i,j})^2 .

To this end, we define the symbol [Ω, ·, ·]: [Ω, A, B] is the matrix whose (i, j)-element is

a_{i,j} for (i, j) ∈ Ω ,   b_{i,j} for (i, j) ∉ Ω

for matrices A = (a_{i,j}), B = (b_{i,j}) ∈ R^{m×n}.
We write the matrix obtained by approximating a matrix A by rank r via singular
decomposition as svd(A, r). At first, we randomly give the matrix M and repeat the
updates
M ← svd([Ω, Z, M], r) .                                               (6.5)

For example, we construct the following procedure (this process may stop at a local
optimum and is not guaranteed to converge to the globally optimal M that minimizes (6.4)).
The function mask takes one and zero for the observed and unobserved positions,
respectively; thus, 1 - mask takes the opposite values.
1 mat.r = function(z, mask, r) {
2 z = as.matrix(z)
3 min = Inf
4 m = nrow(z); n = ncol(z)
5 for (j in 1:5) {
6 guess = matrix(rnorm(m * n), nrow = m)
7 for (i in 1:10) guess = svd.r(mask * z + (1 - mask) * guess, r)
8 value = norm(mask * (z - guess), "F")
9 if (value < min) {min.mat = guess; min = value}
10 }
11 return(min.mat)
12 }

Example 57 To obtain the M that minimizes (6.4), we repeat the updates (6.5). In
the inner loop, the rank is approximated to r , and the unobserved value is predicted
ten times. In the outer loop, to prevent falling into a local solution, the initial value
is changed and executed five times to find the solution that minimizes the Frobenius
norm of the difference.

1 library(jpeg)
2 image = readJPEG(’lion.jpg’)
3 m = nrow(image); n = ncol(image)
4 mask = matrix(rbinom(m * n, 1, 0.5), nrow = m)
5 rank.seq = c(2, 5, 10, 20, 50, 100)
6 mat = array(0, dim = c(nrow(image), ncol(image), 3))
7 for (j in rank.seq) {
8 for (i in 1:3) mat[, , i] = mat.r(image[, , i], mask, j)
9 writeJPEG(mat, paste("compressed/lion_compressed", "_mat_rank_", j, "
.jpg", sep = ""))
10 }

The main reason for the hardness of optimization as in (6.4) is that the arithmetic
of computing the rank is not convex. We show that the proposition
For arbitrary A, B ∈ Rm×n and 0 < α < 1,

rank(α A + (1 − α)B) ≤ α · rank A + (1 − α) rank B (6.6)

is false, which means that the optimization of (6.4) is not convex.

Example 58 The following is a counterexample of (6.6). The left and right sides of (6.6)
are two and one, respectively, if

$$\alpha = 0.5,\quad A = \begin{bmatrix}1 & 0\\ 0 & 0\end{bmatrix},\quad \text{and}\quad B = \begin{bmatrix}0 & 0\\ 0 & 1\end{bmatrix}.$$
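
The counterexample can also be confirmed in R; the sketch below is not in the original text and simply computes the ranks on the left and right sides of (6.6) via qr().

## Example 58: the rank is not a convex function (sketch)
A = matrix(c(1, 0, 0, 0), nrow = 2)
B = matrix(c(0, 0, 0, 1), nrow = 2)
alpha = 0.5
qr(alpha * A + (1 - alpha) * B)$rank            # 2 (left side of (6.6))
alpha * qr(A)$rank + (1 - alpha) * qr(B)$rank   # 1 (right side of (6.6))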

6.3 Norm

We consider a function ∥·∥ : R^n → R to be a norm if

∥αa∥ = |α| ∥a∥
∥a∥ = 0 ⇐⇒ a = 0
∥a + b∥ ≤ ∥a∥ + ∥b∥

for α ∈ R, a, b ∈ R^n. The L1 and L2 norms that we have considered in this book
satisfy these conditions. We note that a norm is a convex function. In fact, from the first and
last conditions, for 0 ≤ α ≤ 1, we have

∥αa + (1 − α)b∥ ≤ ∥αa∥ + ∥(1 − α)b∥ = α∥a∥ + (1 − α)∥b∥ .

Then, ∥b∥_* := sup_{∥a∥≤1} ⟨a, b⟩ satisfies the above three conditions as well. In fact, if
an element in b is nonzero, then we have ⟨a, b⟩ > 0 for any a that shares its sign with
that of b for each nonzero element of b. Thus, ∥b∥_* = 0 ⇐⇒ b = 0. Moreover, we
have

sup_{∥a∥≤1} ⟨a, b + c⟩ ≤ sup_{∥a∥≤1} ⟨a, b⟩ + sup_{∥a∥≤1} ⟨a, c⟩ .

We regard such a norm ∥·∥_* as the dual norm of ∥·∥.
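
As an illustration that is not in the original text, the dual of the vector L1 norm is the L∞ norm: sup_{∥a∥_1≤1} ⟨a, b⟩ = max_i |b_i|, and the supremum is attained at a signed coordinate vector. The following sketch checks this for a random b.

## The dual of the L1 norm is the L-infinity norm (sketch)
set.seed(1)
b = rnorm(5)
i = which.max(abs(b)); a = rep(0, 5); a[i] = sign(b[i])  # maximizer on the L1 unit ball
sum(abs(a))          # its L1 norm is one
sum(a * b)           # equals max(abs(b)), the L-infinity norm of b
max(abs(b))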


Additionally, for matrices, we can similarly define a norm and its dual norm. In
the following, by d_i(M), we denote the i-th singular value of a matrix M ∈ R^{m×n}.
Let m ≥ n. We consider

∥M∥_2 := sup_{∥x∥_2 ≤ 1} ∥M x∥_2

to be the spectral norm of a matrix M, which is also a norm:

sup_{∥x∥_2 ≤ 1} ∥(M_1 + M_2) x∥_2 ≤ sup_{∥x∥_2 ≤ 1} ∥M_1 x∥_2 + sup_{∥x∥_2 ≤ 1} ∥M_2 x∥_2 .

Proposition 19 The spectral norm ∥M∥_2 of a matrix M coincides with the largest
singular value d_1(M) of M.

Proof We have

sup_{∥x∥_2 ≤ 1} ∥M x∥_2 = sup_{∥x∥_2 ≤ 1} ∥U D V^T x∥_2 = sup_{∥x∥_2 ≤ 1} ∥D V^T x∥_2 = sup_{∥y∥_2 ≤ 1} ∥D y∥_2 = d_1(M) ,

where we used ∥U D V^T x∥_2^2 = x^T V D U^T U D V^T x = ∥D V^T x∥_2^2 and y^T y =
x^T V V^T x = x^T x for y = V^T x (y = [1, 0, . . . , 0]^T satisfies the last equality). □
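
Proposition 19 can be confirmed numerically. The following sketch is not in the original text; it relies on the fact that base R's norm(M, "2") returns the spectral norm and svd(M)$d[1] returns the largest singular value.

## Numerical check of Proposition 19 (sketch)
set.seed(1)
M = matrix(rnorm(7 * 5), nrow = 7)
norm(M, "2") - max(svd(M)$d)   # essentially zero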

We define the dual of the spectral norm for a matrix M ∈ R^{m×n} as

∥M∥_* := sup_{∥Q∥_2 ≤ 1} ⟨M, Q⟩ ,

where the inner product ⟨M, Q⟩ is the trace of M^T Q. We consider this dual ∥M∥_* of
the spectral norm of M to be the nuclear norm.
We state an inequality that plays a significant role in deriving the conclusion of
this section.

Proposition 20 For matrices Q, M ∈ R^{m×n}, if d_1(M), . . . , d_n(M) are the singular
values of M, then we have

⟨Q, M⟩ ≤ ∥Q∥_2 Σ_{i=1}^n d_i(M) ,

where the equality holds if and only if

d_1(Q) = · · · = d_r(Q) ≥ d_{r+1}(Q) ≥ · · · ≥ d_n(Q)

and either u_i(Q) = u_i(M), v_i(Q) = v_i(M) or u_i(Q) = −u_i(M), v_i(Q) = −v_i(M)
for each i = 1, . . . , r, where we performed the singular decompositions of M, Q as
M = Σ_{i=1}^n d_i(M) u_i(M) v_i(M)^T and Q = Σ_{i=1}^n d_i(Q) u_i(Q) v_i(Q)^T.

For the proof, see the Appendix.

Proposition 21 The nuclear norm ∥M∥_* of a matrix M coincides with the sum
Σ_{i=1}^n d_i(M) of the singular values of M.

Proof If the singular decomposition of M is M = U D V^T, from (6.2), for Q =
U V^T, we have

⟨Q, M⟩ = ⟨U V^T, U D V^T⟩ = trace(V U^T U D V^T) = trace(V^T V U^T U D)
       = trace D = Σ_{i=1}^n d_i(M) ,

where d_i(M) is the i-th singular value of M. Since the maximum singular value of Q
is one, it is sufficient to show that sup_{d_1(Q) ≤ 1} ⟨Q, M⟩ ≤ Σ_{i=1}^n d_i(M). However, from
Proposition 20, we have

sup_{d_1(Q) ≤ 1} ⟨Q, M⟩ ≤ sup_{d_1(Q) ≤ 1} Σ_{i=1}^n d_i(M) d_1(Q) = Σ_{i=1}^n d_i(M) ,

which completes the proof. □
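
The proof suggests a direct numerical check, which is not part of the original text: with Q = U V^T from the singular decomposition of M, the inner product ⟨Q, M⟩ equals the sum of the singular values of M, and the spectral norm of Q is one.

## Numerical check of Proposition 21 (sketch)
set.seed(1)
M = matrix(rnorm(6 * 4), nrow = 6)
ss = svd(M); Q = ss$u %*% t(ss$v)
sum(diag(t(Q) %*% M)) - sum(ss$d)   # <Q, M> minus the sum of singular values: essentially zero
norm(Q, "2")                        # the spectral norm of Q is one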

In particular, the nuclear norm ∥·∥_* is convex. Note that Proposition 20 can be
written as ∥M∥_* ∥Q∥_2 ≥ ⟨M, Q⟩.
Next, we consider the set of G ∈ R^{m×n} such that

∥B∥_* ≥ ∥A∥_* + ⟨G, B − A⟩                                           (6.7)

for arbitrary B ∈ R^{m×n} to be the subderivative of ∥·∥_* at A ∈ R^{m×n}.

Proposition 22 Let r ≥ 1 be the rank of A. The necessary and sufficient condition
for G = Σ_{i=1}^n d_i(G) u_i(G) v_i(G)^T to be a subderivative of the nuclear norm ∥·∥_*
at matrix A ∈ R^{m×n} is as follows:

d_1(G) = · · · = d_r(G) = 1
1 ≥ d_{r+1}(G) ≥ · · · ≥ d_n(G) ≥ 0
either u_i(G) = u_i(A), v_i(G) = v_i(A) or
u_i(G) = −u_i(A), v_i(G) = −v_i(A) for each i = 1, . . . , r.

Proof We show the necessity first. Since (6.7) needs to hold even when B is the
zero matrix, we have

⟨G, A⟩ ≥ ∥A∥_* .

Combining this with Propositions 20 and 21, we have

∥A∥_* ∥G∥_2 ≥ ∥A∥_* ,                                                (6.8)

which means that ∥G∥_2 ≥ 1. On the other hand, if we substitute B = A + u_* v_*^T into
(6.7), where u_*, v_* are the largest-singular-value left and right vectors of G, then we
have

∥A + u_* v_*^T∥_* ≥ ∥A∥_* + ⟨G, u_* v_*^T⟩ .

Because ∥·∥_* is convex (triangle inequality), we have

∥A∥_* + ∥u_* v_*^T∥_* ≥ ∥A∥_* + ∥G∥_2 ,

which means that ∥G∥_2 ≤ 1. From the equality of (6.8) and Proposition 20 and
d_1(G) = ∥G∥_2, the proof of necessity is completed.
We show the sufficiency next. If we substitute M = B and Q = G into Proposi-
tion 20, when ∥G∥_2 = 1, we have

∥B∥_* ≥ ⟨G, B⟩ .                                                     (6.9)

In addition, under the derived necessary condition, we have

⟨G, A⟩ = trace( Σ_{i=1}^r v_i(A) d_i u_i(A)^T Σ_{j=1}^n u_j(G) v_j(G)^T ) = Σ_{j=1}^r d_j = ∥A∥_* .   (6.10)

Hence, (6.9) and (6.10) imply (6.7), which completes the proof. □

6.4 Sparse Estimation for Low-Rank Estimations


In the previous sections, we showed that the nuclear norm ∥M∥_* = Σ_{i=1}^n d_i(M) of
M is convex and derived its subderivative. In the following, we obtain the matrix
M ∈ R^{m×n} that minimizes

L := (1/2) Σ_{(i,j)∈Ω} (z_{i,j} − m_{i,j})^2 + λ ∥M∥_*                        (6.11)

for Z ∈ R^{m×n}, Ω ⊆ {(i, j) | i = 1, . . . , m, j = 1, . . . , n}, and λ > 0 [19].
From Proposition 22, the subderivative G of ∥·∥_* at M := Σ_{i=1}^r d_i u_i v_i^T is

G = Σ_{i=1}^r u_i v_i^T + Σ_{i=r+1}^n d̃_i ũ_i ṽ_i^T

when the rank of M is r, where ũ_{r+1}, . . . , ũ_n and ṽ_{r+1}, . . . , ṽ_n are the singular vectors
associated with the singular values 1 ≥ d̃_{r+1}, · · · , d̃_n of G.
Thus, if we differentiate both sides of (6.11) by the matrix M = (m_{i,j}), we have

[Ω, M − Z, 0] + λ G ,

where the values differentiated by m_{i,j} are ∂L/∂m_{i,j}.
Next, we show that when the rank of M is r < m, n, ∥M∥_* cannot be differentiated
by M without the subderivative. If we write M = Σ_{i=1}^r u_i d_i v_i^T, for ε > 0, we have

(∥M + ε u_j v_j^T∥_* − ∥M∥_*) / ε = (Σ_{i=1}^r d_i + ε − Σ_{i=1}^r d_i) / ε = 1
(∥M − ε u_j v_j^T∥_* − ∥M∥_*) / (−ε) = (Σ_{i=1}^r d_i − ε − Σ_{i=1}^r d_i) / (−ε) = 1

for j = 1, . . . , r. On the other hand, because d_j = 0 for j = r + 1, . . . , n, if we
subtract ε u_j v_j^T from M, then because the singular values cannot be negative, one of
the singular vectors is multiplied by −1, and we have d_j = ε. If we add ε u_j v_j^T to
M, we have d_j = ε.
Suppose that we have observations Z ∈ R^{m×n} only for the positions
Ω ⊆ {1, . . . , m} × {1, . . . , n} and that we take the subderivative of (6.11) w.r.t.
M = (m_{i,j}) to obtain the following equation:

[Ω, M − Z, 0] + λ ( Σ_{i=1}^r u_i v_i^T + Σ_{k=r+1}^n d_k ũ_k ṽ_k^T ) = 0 .       (6.12)

Mazumder et al. (2010) [19] showed that when starting with an arbitrary initial
value M_0 and updating M_t via

M_{t+1} ← S_λ([Ω, Z, M_t])                                           (6.13)

until convergence, both sides of (6.13) coincide. This matrix M satisfies (6.12),
where S_λ(W) denotes the matrix whose singular values d_i, i = 1, . . . , n, are replaced
with d_i − λ if d_i ≥ λ and zeros otherwise. Hence, we singular-decompose [Ω, Z, M]
and subtract min{λ, d_k} from the singular values d_k so that the obtained singular
values are nonnegative.

Note that (6.12) can be obtained if we set M_{t+1} = M_t = M in the update (6.13).
For example, we can construct the function soft.svd that returns Z′ = U D′ V^T
given Z = U D V^T, where D and D′ are diagonal matrices with the diagonal elements
d_k and d′_k := max{d_k − λ, 0}, k = 1, . . . , n.
1 soft.svd = function(lambda, z) {
2 n = ncol(z); ss = svd(z); dd = pmax(ss$d - lambda, 0)
3 return(ss$u %*% diag(dd) %*% t(ss$v))
4 }

Then, the update (6.13) can be constructed as follows.


1 mat.lasso = function(lambda, z, mask) {
2 z = as.matrix(z); m = nrow(z); n = ncol(z)
3 guess = matrix(rnorm(m * n), nrow = m)
4 for (i in 1:20) guess = soft.svd(lambda, mask * z + (1 - mask) *
guess)
5 return(guess)
6 }

Example 59 For the same image as in Example 55, we construct a procedure such
that we recover the whole image when we mask some part. More precisely, we apply
the functions soft.svd and mat.lasso for p = 0.5, 0.25 and λ = 0.5, 0.1 to
recover the image (Fig. 6.3), where the larger the value of λ is, the lower the rank.

p = 0.25, λ = 0.5; p = 0.50, λ = 0.5; p = 0.75, λ = 0.1; p = 0.50, λ = 0.1

Fig. 6.3 We obtain the image that minimizes (6.11) from an image with the locations Ω of lion.jpg
unmasked, where p = 0.25, 0.5, 0.75 are the ratios of the (unmasked) observations, λ is the param-
eter in (6.11), and the larger the value of λ is, the lower the rank

1 library(jpeg)
2 image = readJPEG(’lion.jpg’)
3 m = nrow(image[, , 1]); n = ncol(image[, , 1])
4 p = 0.5
5 lambda = 0.5
6 mat = array(0, dim = c(m, n, 3))
7 mask = matrix(rbinom(m * n, 1, p), ncol = n)
8 for (i in 1:3) mat[, , i] = mat.lasso(lambda, image[, , i], mask)
9 writeJPEG(mat, paste("compressed/lion_compressed", "_mat_soft.jpg", sep
= ""))

Appendix: Proof of Propositions

Proposition 18 (Eckart-Young's Theorem) Let m ≤ n. We are given the observation
Z = U D V^T ∈ R^{m×n}, where U ∈ R^{m×m}, V ∈ R^{n×m}, and the rank of Z is m. The
M ∈ R^{m×n} with a rank of at most r that minimizes ∥Z − M∥_F is given by U D_r V^T,
where D_r ∈ R^{m×m} is obtained by making the elements d_{r+1}, · · · , d_m in D ∈ R^{m×m}
zero.

Proof In the following, we assume that the singular decomposition of Z ∈ R^{m×n}
(m ≤ n) can be written as Z = U D V^T, where U ∈ R^{m×m}, D ∈ R^{m×m}, and V ∈
R^{n×m}.
If the rank of M ∈ R^{m×n} is r, there exist Q ∈ R^{m×r} and A ∈ R^{r×n} such that
M = Q A and Q^T Q = I. In fact, since the rank is r, we can take as a basis each
column Q_i (i = 1, . . . , r) of Q, and the j-th column (j = 1, . . . , n) of M can be
written as Σ_{i=1}^r Q_i a_{i,j}.
Then, the optimum A = (a_{i,j}) can be written as Q^T Z using Q. In fact, if we
differentiate

(1/2) ∥Z − Q A∥_F^2 = (1/2) Σ_{i=1}^m Σ_{j=1}^n (z_{i,j} − Σ_{k=1}^r q_{i,k} a_{k,j})^2

by the element a_{p,q} of A, then it becomes −Σ_{i=1}^m (z_{i,q} − Σ_{k=1}^r q_{i,k} a_{k,q}) q_{i,p}, which is the
(p, q)-th element of −Q^T (Z − Q A) = −Q^T Z + A.
Moreover, for Φ := Z Z^T and M := Q Q^T Z, we have

∥Z − M∥_F^2 = ∥Z^T (I − Q Q^T)∥_F^2 = trace Φ − trace(Q^T Φ Q) ,

which means that the problem reduces to finding the value of Q that maximizes
trace(Q^T Φ Q). In fact, we derive from (6.1)

∥Z^T (I − Q Q^T)∥_F^2 = trace((I − Q Q^T) Φ (I − Q Q^T)) ,

and the right-hand side can be transformed into

trace(Φ (I − Q Q^T)^2) = trace(Φ (I − Q Q^T)) = trace Φ − trace(Q^T Φ Q) ,

where (6.2) has been used for the first and last equalities. Thus, the problem further
reduces to maximizing trace(Q^T Φ Q). Then, using U^T U = U U^T = I, Q^T Q = I,
and R := U^T Q ∈ R^{m×r}, from

Q^T Φ Q = Q^T Z Z^T Q = Q^T U D^2 U^T Q = R^T D^2 R ,

the problem reduces to maximizing trace(R^T D^2 R) under R^T R = Q^T U U^T Q =
Q^T Q = I.
Furthermore, if we let H := R R^T ∈ R^{m×m} and write the (i, j)-th element as h_{i,j},
then, as we will see, h_{1,1} = · · · = h_{r,r} = 1 and h_{r+1,r+1} = · · · = h_{m,m} = 0 is the
optimum solution.
If we apply (6.2), then we have

trace H = trace(R R^T) = trace(R^T R) = r .

From h_{i,j} = Σ_{k=1}^r r_{i,k} r_{j,k}, we have h_{i,i} = Σ_{k=1}^r r_{i,k}^2 ≥ 0. Moreover, if we compare
it with the S ∈ R^{m×m} such that S^T S = S S^T = I_m, which is obtained by adding m − r
columns to R ∈ R^{m×r}, then h_{i,i} is the squared sum over the columns of R along
the i-th row and does not exceed 1. Because the (i, j)-th element of R^T D^2 R is
Σ_{k=1}^m r_{k,i} r_{k,j} d_k^2, from (6.2), the trace is

trace(R^T D^2 R) = trace(R R^T D^2) = Σ_{i=1}^m h_{i,i} d_i^2 .

From R^T R = I_r, we have Σ_{i=1}^m h_{i,i} = Σ_{i=1}^m Σ_{j=1}^r r_{i,j}^2 = r. Thus, h_{1,1} = · · · =
h_{r,r} = 1 and h_{r+1,r+1} = · · · = h_{m,m} = 0 is the solution.
Then, for i = r + 1, . . . , m, h_{i,i} = 0 means Σ_{j=1}^r r_{i,j}^2 = 0, i.e., r_{i,j} = 0 (j =
1, . . . , r). Therefore, this R can be written as $R = \begin{bmatrix}R_r\\ O\end{bmatrix} \in \mathbb{R}^{m\times r}$ using an arbitrary
orthogonal matrix R_r ∈ R^{r×r}, where O ∈ R^{(m−r)×r} is the zero matrix.
Note that the matrix U_r obtained by replacing with zeros all columns of U except
the first r is the Q that corresponds to the solution. In fact, for U = [U_r U_{m−r}],
U_r ∈ R^{m×r}, U_{m−r} ∈ R^{m×(m−r)}, and Q ∈ R^{m×r}, we have $\begin{bmatrix}R_r\\ O\end{bmatrix} = \begin{bmatrix}U_r^T\\ U_{m-r}^T\end{bmatrix} Q$ and

$$Q = \begin{bmatrix}U_r & U_{m-r}\end{bmatrix}\begin{bmatrix}R_r\\ O\end{bmatrix} = U_r R_r .$$

Finally, we have

$$M = Q Q^T Z = U_r U_r^T Z = U_r U_r^T \begin{bmatrix}U_r & U_{m-r}\end{bmatrix} D V^T
= U_r \begin{bmatrix}I & O\end{bmatrix} D V^T
= \begin{bmatrix}U_r & U_{m-r}\end{bmatrix}\begin{bmatrix}D_r & O\\ O & O\end{bmatrix} V^T = U D_r V^T ,$$

which completes the proof. □

Proposition 20 For the matrices Q, M ∈ R^{m×n}, if d_1(M), . . . , d_n(M) are the sin-
gular values of M, then we have

⟨Q, M⟩ ≤ ∥Q∥_2 Σ_{i=1}^n d_i(M) ,

where the equality holds if and only if

d_1(Q) = · · · = d_r(Q) ≥ d_{r+1}(Q) = · · · = d_n(Q) = 0

and either u_i(Q) = u_i(M), v_i(Q) = v_i(M) or u_i(Q) = −u_i(M), v_i(Q) = −v_i(M)
for each i = 1, . . . , r, where we singular-decomposed M, Q as
M = Σ_{i=1}^n d_i(M) u_i(M) v_i(M)^T and Q = Σ_{i=1}^n d_i(Q) u_i(Q) v_i(Q)^T.

Proof Using (6.2), we have

⟨Q, M⟩ = trace(Q^T U D V^T) = trace(V^T Q^T U D) = ⟨U^T Q V, D⟩
       = Σ_{i=1}^n d_i(M) · (U^T Q V)_{i,i} = Σ_{i=1}^n d_i(M) u_i(M)^T Q v_i(M)
       = Σ_{i=1}^n d_i(M) u_i(M)^T u_i(Q) d_i(Q) v_i(Q)^T v_i(M)
       ≤ Σ_i d_i(M) d_1(Q) = ∥M∥_* ∥Q∥_2

with the equality if and only if for i such that d_i ≠ 0,

u_i(M)^T u_i(Q) d_i(Q) v_i(Q)^T v_i(M) = d_1(Q) ,

which is d_i(Q) = d_1(Q) and

u_i(M)^T u_i(Q) · v_i(Q)^T v_i(M) = 1                                (6.14)

for i = 1, . . . , r, because the lengths of u_i(M), u_i(Q), v_i(M), v_i(Q) are one. Note
that (6.14) means either u_i(M) = u_i(Q), v_i(Q) = v_i(M) or u_i(M) = −u_i(Q),
v_i(Q) = −v_i(M), which completes the proof. □

Exercises 76–87

76. For the singular decomposition, answer the following inquiries.


(a) For the symmetric matrices, what relations exist between singular and eigen-
value decompositions? How about the nonnegative definite matrices?
(b) Suppose that m ≤ n. We define the singular decomposition of Z ∈ Rm×n
as the transpose of the singular decomposition Z T = Ū D V̄ T . What are the
sizes of U, D, V in Z = U DV T ?
77. Show the following:
(a) Z 2F = trace(Z T Z ), where Z  F is the square root of the squared sum of
the mn elements in Z ∈ Rm×n (Z Frobenius norm).
Hint For Z = (z i, j ), the i-th diagonal element of Z T Z is nj=1 z i,2 j .
(b) Let l, m, n ≥ 1. For matrices A ∈ Rl×m , B ∈ Rm×n , C ∈ Rn×l , we have
trace(ABC) = trace(BC A).
78. Raise a counterexample against the statement “The matrix rank is a convex
function”:
For arbitrary A, B ∈ Rm×n and 0 < α < 1, we have

rank(α A + (1 − α)B) ≤ α · rank A + (1 − α) rank B .

    Hint Find A, B such that m = n = 2 and rank A = rank B = 1.


79. When the singular decomposition of Z ∈ Rm×n can be written as Z = U DV T ,1
we show that given Z = U DV T , the matrix M that minimizes Z − M F and
has a rank of at most r is U Dr V T (Eckart-Young’s theorem), where Dr is the
diagonal matrix D such that dr +1 = · · · = dn = 0. We assume that the rank of
Z is m.
    (a) If the rank of M is r, using a matrix Q ∈ R^{m×r} (Q^T Q = I_r, m ≥ r), we
        can write it as M = Q A. In fact, if the rank is r, we can take as a basis the
        columns Q_j (j = 1, . . . , r) of Q, which means that the j-th column of M
        can be written as Σ_{i=1}^r Q_i a_{i,j}. Show that the optimum A = (a_{i,j}) can be
        written as Q^T Z using the matrix Q.
        Hint If we differentiate (1/2) ∥Z − Q A∥_F^2 = (1/2) Σ_i Σ_j (z_{i,j} − Σ_k q_{i,k} a_{k,j})^2 by
        the element a_{p,q} of A, we obtain −Σ_{i=1}^m (z_{i,q} − Σ_{k=1}^r q_{i,k} a_{k,q}) q_{i,p}, which is the
        (p, q)-th element of −Q^T (Z − Q A).
    (b) Let Φ := Z Z^T. Show that

        ∥Z − M∥_F^2 = ∥Z^T (I − Q Q^T)∥_F^2 = trace Φ − trace(Q^T Φ Q) ,

        which means that minimizing ∥Z − M∥_F^2 reduces to finding the Q that max-
        imizes trace(Q^T Φ Q).
        Hint ∥Z^T (I − Q Q^T)∥_F^2 = trace((I − Q Q^T) Φ (I − Q Q^T)) can be
        derived from Exercise 77 (a) and becomes trace(Φ (I − Q Q^T)^2) from
        Exercise 77 (b).
    (c) Show that (b) further reduces to finding the orthogonal matrix R ∈ R^{m×r}
        that maximizes trace(R^T D^2 R).
        Hint Show that Q^T Φ Q = Q^T U D^2 U^T Q and

        Q orthogonal ⇐⇒ R := U^T Q orthogonal.

    (d) Let H = R R^T and h_{i,i} (i = 1, . . . , m) be the diagonal elements. By proving
        the following statements, show that h_{1,1} = · · · = h_{r,r} = 1, h_{r+1,r+1} = · · · =
        h_{m,m} = 0 is the optimum solution.
        i.   Σ_{i=1}^m h_{i,i} = r
        ii.  trace H = r
        iii. 0 ≤ h_{i,i} ≤ 1
        iv.  trace(R^T D^2 R) = Σ_{i=1}^m h_{i,i} d_i^2
        Hint Because R becomes a square orthogonal matrix by adding
        columns, for the square matrix, the squared row and column sums are
        both one.
    (e) Show that the solution of (d) can be written as $R = \begin{bmatrix}R_r\\ O\end{bmatrix} \in \mathbb{R}^{m\times r}$ using
        an arbitrary orthogonal matrix R_r ∈ R^{r×r}, where O ∈ R^{(m−r)×r} is the zero
        matrix.
    (f) Let U_+ be the matrix U with zeros in all columns except the first r columns.
        Show that U_+ is the Q that corresponds to (e) and that M = Q Q^T Z =
        U_+ U_+^T Z = U D_r V^T.
        Hint For U = [U_+ U_−], U_+ ∈ R^{m×r}, U_− ∈ R^{m×(m−r)}, and Q ∈ R^{m×r}, we
        have $\begin{bmatrix}R_r\\ O\end{bmatrix} = \begin{bmatrix}U_+^T\\ U_-^T\end{bmatrix} Q$, from which we derive Q = U_+ R_r.

1 If m ≥ n, then U ∈ R^{m×n}, D ∈ R^{n×n}, and V ∈ R^{n×n}, and if m ≤ n, then U ∈ R^{m×m}, D ∈ R^{m×m},
and V ∈ R^{n×m}.
80. Based on Exercise 79, we construct a function svd.r to approximate the matrix
z by the rank r . Fill in the blanks.
Hint Transpose ss$v before multiplication.
1 svd.r = function(z, r) {
2   n = min(nrow(z), ncol(z)); ss = svd(z)
3   return(ss$u %*% diag(c(ss$d[1:r], rep(0, n - r))) %*% ## Blank (1) ##)
4 }
5 ## if r is assumed to be at least two, the following is OK.
6 svd.r = function(z, r) {
7   ss = svd(z); return(ss$u[, 1:r] %*% diag(ss$d[1:r]) %*% ## Blank (2) ##)
8 }

Moreover, execute the following to examine the behavior.


1 m = 100; n = 80; z = matrix(rnorm(m * n), nrow = m)
2 F.norm = NULL
3 for (r in 1:n) {m = svd.r(z, r); F.norm = c(F.norm, norm(z - m, "F"
))}
4 plot(1:n, F.norm, type = "l", xlab = "Rank", ylab = "Squared
Frobenius norm")

81. The following code is the process of obtaining another image file with the same
lower rank as that of lion.jpg. Generally, in the J P G format, each pixel has
a 256 grayscale of information. It has information about the three colors blue,
red, and green. For example, if we have 480 × 360 pixels vertically and hori-
zontally, each pixel of 480 × 360 × 3 will hold one of 256 values. In the code
below, the rank is r = 2, 5, 10, 20, 50, 100. This time, we approximate the lower
ranks for red, green, and blue (Fig. 6.2). It is necessary to create a folder called
/compressed under the current folder in advance.
1 library(jpeg)
2 image = readJPEG(’lion.jpg’)
3 rank.seq = c(2, 5, 10, 20, 50, 100)
4 mat = array(0, dim = c(nrow(image), ncol(image), 3))
5 for (j in rank.seq) {
6 for (i in 1:3) mat[, , i] = svd.r(image[, , i], j)
7 writeJPEG(mat,
8 paste("compressed/lion_compressed", "_svd_rank_", j, ".
jpg", sep = ""))
9 }

82. Suppose we observe only the position of  ⊆ {1, . . . , m} × {1, . . . , n} in the


matrix Z ∈ Rm×n . From z i, j ((i, j) ∈ ), we find the matrix M of rank r that
minimizes the Frobenius norm of the difference between the matrices M and
Z . In the following process, mask is a matrix with 1 and 0 at the observed and
unobserved positions, respectively (in 1 - mask, the values 1, 0 are reversed).
In the inner loop, the rank is approximated to r , and the unobserved value is
predicted ten times. In the outer loop, to prevent falling into a local solution, the
initial value is changed and executed five times to find the solution that minimizes
the Frobenius norm of the difference. Fill in the blanks, and execute the process.

1 mat.r = function(z, mask, r) {
2   z = as.matrix(z)
3   min = Inf
4   m = nrow(z); n = ncol(z)
5   for (j in 1:5) {
6     guess = matrix(rnorm(m * n), nrow = m)
7     for (i in 1:10) guess = svd.r(mask * z + (1 - mask) * guess, r)
8     value = norm(mask * (z - guess), "F")
9     if (value < min) {min.mat = ## Blank (1) ##; min = ## Blank (2) ##}
10   }
11   return(min.mat)
12 }
13
14 library(jpeg)
15 image = readJPEG('lion.jpg')
16 m = nrow(image); n = ncol(image)
17 mask = matrix(rbinom(m * n, 1, 0.5), nrow = m)
18 rank.seq = c(2, 5, 10, 20, 50, 100)
19 mat = array(0, dim = c(nrow(image), ncol(image), 3))
20 for (j in rank.seq) {
21   for (i in 1:3) mat[, , i] = mat.r(image[, , i], mask, j)
22   writeJPEG(mat,
23     paste("compressed/lion_compressed", "_mat_rank_", j, ".jpg", sep = ""))
24 }

In the following, when the singular values of Z are d_1, . . . , d_n, we consider the
sum Σ_{i=1}^n d_i(Z) to be the nuclear norm and write it as ∥Z∥_*. Moreover, we consider
the maximum max{d_1, . . . , d_n} to be the spectral norm and write it as ∥Z∥_2.
83. Prove the following on norms.
    (a) The matrix norms are convex.
    (b) The dual norms satisfy the definition of norms.
    (c) The nuclear norm is the dual of the spectral norm.
84. For matrices A, B of the same size, show ∥A∥_2 ∥B∥_* ≥ ⟨A, B⟩, and derive the
    condition under which the equality holds.
85. Derive the subderivative of the nuclear norm ∥·∥_* at a matrix M.
86. For Z ∈ R^{m×n}, Ω ⊆ {(i, j) | i = 1, . . . , m, j = 1, . . . , n}, and λ > 0, we wish to
    find the M ∈ R^{m×n} that minimizes

    L := (1/2) Σ_{(i,j)∈Ω} (z_{i,j} − m_{i,j})^2 + λ ∥M∥_*                  (cf. (6.11)) . (6.15)

    Suppose that we have observations Z ∈ R^{m×n} only for the positions Ω ⊆
    {1, . . . , m} × {1, . . . , n} and that we take the subderivative of (6.11) w.r.t.
    M = (m_{i,j}) to obtain the following equation:

    [Ω, M − Z, 0] + λ ( Σ_{i=1}^r u_i v_i^T + Σ_{k=r+1}^n d_k ũ_k ṽ_k^T ) = 0 .     (6.16)

    Mazumder et al. (2010) [19] showed that starting with an arbitrary initial value
    M_0 and updating M_t via

    M_{t+1} ← S_λ([Ω, Z, M_t])                                             (6.17)

    until convergence, both sides of (6.17) coincide. Show that this matrix
    M satisfies (6.16), where S_λ(W) denotes the matrix whose singular values d_i,
    i = 1, . . . , n, are replaced with d_i − λ if d_i ≥ λ and zeros otherwise.
    Hint Singular-decompose [Ω, Z, M], and subtract min{λ, d_k} from the singular
    values d_k such that the obtained singular values are nonnegative. If we set M_{t+1} =
    M_t = M, then we obtain (6.16).
87. We construct the function soft.svd that returns Z′ = U D′ V^T given Z =
    U D V^T, where D and D′ are diagonal matrices with the diagonal elements d_k
    and d′_k := max{d_k − λ, 0}, k = 1, . . . , n. Fill in the blank.
1 soft.svd = function(lambda, z) {
2 n = ncol(z); ss = svd(z); dd = pmax(ss$d - lambda, 0)
3 return(## Blank ##)
4 }

For the program below, obtain three images, changing the values of p and λ.
1 mat.lasso = function(lambda, z, mask) {
2 z = as.matrix(z); m = nrow(z); n = ncol(z)
3 guess = matrix(rnorm(m * n), nrow = m)
4 for (i in 1:20) guess = soft.svd(lambda, mask * z + (1 - mask) *
guess)
5 return(guess)
6 }
7
8 library(jpeg)
9 image = readJPEG(’lion.jpg’)
10 m = nrow(image[, , 1]); n = ncol(image[, , 1])
11 p = 0.5
12 lambda = 0.5
13 mat = array(0, dim = c(m, n, 3))
14 mask = matrix(rbinom(m * n, 1, p), ncol = n)
15 for (i in 1:3) mat[, , i] = mat.lasso(lambda, image[, , i], mask)
16 writeJPEG(mat, paste("compressed/lion_compressed", "_mat_soft.jpg",
sep = ""))
Chapter 7
Multivariate Analysis

In this chapter, we consider sparse estimation for the problems of multivariate analy-
sis, such as principal component analysis and clustering. There are two equivalence
definitions for principal component analysis: finding orthogonal vectors that maxi-
mize the variance and finding a vector that minimizes the reconstruction error when
the dimension is reduced. This chapter first introduces the sparse estimation methods
for principal component analysis, of which SCoTLASS and SPCA are popular. In
each case, the purpose is to find a principal component vector with few nonzero
components. On the other hand, introducing sparse estimation is crucial for cluster-
ing to select variables. In particular, we consider the K-means and convex clustering
problems. The latter has the advantage of not falling into a locally optimal solution
because it becomes a problem of convex optimization.

7.1 Principal Component Analysis (1): SCoTLASS

In the following, we assume that the matrix X = (x_{i,j}) ∈ R^{N×p} is centered: we
subtract the arithmetic mean from each column, and the average in each column is
zero, Σ_{i=1}^N x_{i,j} = 0 (j = 1, . . . , p).
Let v_1 be the v ∈ R^p with ∥v∥ = 1 that maximizes

∥X v∥^2 .                                                             (7.1)

Let v_2 be the v ∈ R^p with ∥v∥ = 1 that is orthogonal to v_1 and maximizes (7.1),
and so on. In this way, finding the orthonormal system V = [v_1, . . . , v_p] is called
principal component analysis (PCA). We consider the constraint that v_1, . . . , v_p
are orthogonal later and first find the v that maximizes ∥X v∥^2 under ∥v∥ = 1. For
each j = 1, . . . , p, because this v_j maximizes

∥X v_j∥^2 − μ(∥v_j∥^2 − 1) ,

if we differentiate by v_j, we have

X^T X v_j − μ_j v_j = 0 .

More precisely, using the sample covariance Σ := (1/N) X^T X based on X and λ_j :=
μ_j / N, we can express
Σ v_j = λ_j v_j .                                                     (7.2)

We note that each column of V is a basis of an eigenspace of Σ. In case some of the
values λ_1 ≥ · · · ≥ λ_p coincide, we choose the eigenvectors such that they are orthogonal.
Because Σ is nonnegative definite, the eigenvectors with different eigenvalues are
orthogonal. If all of the eigenvalues are different, i.e., λ_1 > · · · > λ_p, then v_1, . . . , v_p
are orthogonal.
In reality, we use only the first m (1 ≤ m ≤ p) rather than all of v_1, . . . , v_p. If we
project each row of X onto V_m := [v_1, . . . , v_m] ∈ R^{p×m}, we obtain Z := X V_m ∈ R^{N×m}.
We project the p-dimensional information to the space of the m principal components
v_1, . . . , v_m to approximate the p-dimensional X by the m-dimensional Z. The
PCA is the linear map for this compression.
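
A minimal numerical sketch, not part of the original text: the principal component vectors can be obtained from the eigenvectors of the sample covariance Σ = X^T X / N, and the scores are Z = X V_m. The dimensions N, p, m below are arbitrary, and the result agrees with prcomp up to the sign of each column.

## PCA via the eigendecomposition of the sample covariance (sketch)
set.seed(1)
N = 100; p = 5; m = 2
X = matrix(rnorm(N * p), nrow = N)
X = scale(X, center = TRUE, scale = FALSE)   # center each column
Sigma = t(X) %*% X / N
V = eigen(Sigma)$vectors[, 1:m]              # the first m principal component vectors
Z = X %*% V                                  # scores in the m-dimensional space
W = prcomp(X)$rotation[, 1:m]                # comparison with prcomp (signs may differ)
abs(V) - abs(W)                              # essentially zero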
Although there are several approaches to sparse estimation for PCA, we first
consider restricting the number of nonzero elements in PCA.
When we find the v ∈ R p for v2 = 1 that maximizes X v2 , if we restrict the
number v0 of nonzero elements in v, then the formulation maximizing

v T X T X v − λv0

under v0 ≤ t, v2 = 1, for some integer t is not convex. On the other hand, when
we wish to add the constraint v1 ≤ t (t > 0), the formulation maximizing

v T X T X v − λv1 (7.3)

under v2 = 1 is not convex.


The formulation maximizing

u T X v − λv1 (7.4)

under u2 = v2 = 1 for u ∈ R N is proposed (SCoTLASS) [15] (simplified com-


ponent technique-Lasso). The optimum v obtained in (7.4) is optimized in (7.3) as
well. In fact, differentiating

μ T δ
L := −u T X v + λv1 + (u u − 1) + (v T v − 1) (7.5)
2 2
by u, from X v − μu = 0 and ∥u∥_2 = 1, we have u = X v / ∥X v∥_2. If we substitute this into
(7.5), we obtain

−∥X v∥_2 + λ∥v∥_1 + (δ/2)(v^T v − 1) .
Although (7.5) takes positive values when we differentiate it by each of u, v twice,
it is not convex w.r.t. (u, v).

Example 60 When N = p = 1 and X^2 > μδ, L is not convex. In fact, because L is a
bivariate function of u, v, we have

$$\nabla^2 L = \begin{bmatrix}\dfrac{\partial^2 L}{\partial u^2} & \dfrac{\partial^2 L}{\partial u\partial v}\\[6pt] \dfrac{\partial^2 L}{\partial u\partial v} & \dfrac{\partial^2 L}{\partial v^2}\end{bmatrix} = \begin{bmatrix}\mu & -X\\ -X & \delta\end{bmatrix}.$$

If the determinant μδ − X^2 is negative, ∇^2 L contains a negative eigenvalue.


Thus, if α → f (α, β) and β → f (α, β) are convex w.r.t. α ∈ Rm and β ∈ Rn ,
respectively, then we say that f : Rm × Rn → R is biconvex. In general, convexity
does not mean convexity.
The reason why we transform (7.4) is that we can efficiently obtain the solution
of (7.3): Choose an arbitrary v ∈ R^p such that ∥v∥_2 = 1. Repeat to update u ∈ R^N,
v ∈ R^p:

1. u ← X v / ∥X v∥_2
2. v ← S_λ(X^T u) / ∥S_λ(X^T u)∥_2

until the u, v values converge, where S_λ(z) is the function S_λ(·) introduced in
Chap. 1 applied to each element of z = [z_1, · · · , z_p] ∈ R^p, in which
v = S_λ(X^T u)/∥S_λ(X^T u)∥_2 and u = X v/∥X v∥_2 are obtained from ∂L/∂u = ∂L/∂v = 0.
In fact, if we take the subderivative of ∥v∥_1, we have 1, −1, [−1, 1] for v_j >
0, v_j < 0, v_j = 0, respectively, for j = 1, . . . , p, which means that

(the j-th element of X^T u) − λ + δ v_j = 0 ,          (the j-th element of X^T u) > λ
(the j-th element of X^T u) + λ + δ v_j = 0 ,          (the j-th element of X^T u) < −λ
(the j-th element of X^T u) + λ[−1, 1] ∋ 0 ,           −λ ≤ (the j-th element of X^T u) ≤ λ ,

i.e., δ v = −S_λ(X^T u).
In general, when Φ : R^p × R^p → R satisfies

f(β) ≤ Φ(β, θ),  θ ∈ R^p,
f(β) = Φ(β, β)    (7.6)

for f : R^p → R, we say that Φ majorizes f at a point β ∈ R^p.


When Φ majorizes f, if we arbitrarily choose β^0 ∈ R^p and generate β^1, β^2, . . . via the recurrence formula

β^{t+1} = arg min_{β ∈ R^p} Φ(β, β^t) ,

we have

f(β^t) = Φ(β^t, β^t) ≥ Φ(β^{t+1}, β^t) ≥ f(β^{t+1}) .

In SCoTLASS, for

f(v) := −‖Xv‖_2 + λ‖v‖_1 ,
Φ(v, v′) := −(Xv)^T (Xv′) / ‖Xv′‖_2 + λ‖v‖_1 ,

Φ majorizes f at v. In fact, we can check (7.6) via the Cauchy–Schwarz inequality.
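As a numerical sanity check of this majorization (a sketch of ours; f.val and Phi.val are names we introduce), we can verify f(v) ≤ Φ(v, v′) for random v, v′ and equality at v′ = v.

set.seed(1)
n = 20; p = 5; lambda = 0.1
X = matrix(rnorm(n * p), n, p)
f.val = function(v) -sqrt(sum((X %*% v) ^ 2)) + lambda * sum(abs(v))
Phi.val = function(v, w) -sum((X %*% v) * (X %*% w)) / sqrt(sum((X %*% w) ^ 2)) +
  lambda * sum(abs(v))
v = rnorm(p); w = rnorm(p)
Phi.val(v, w) >= f.val(v)                    # TRUE (Cauchy-Schwarz)
abs(Phi.val(v, v) - f.val(v)) < 1e-10        # equality at w = v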


After arbitrarily choosing v^0, if we generate v^1, v^2, . . . via

v^{t+1} := arg min_{‖v‖_2 = 1} Φ(v, v^t) ,

then each v^{t+1} coincides with S_λ(X^T u)/‖S_λ(X^T u)‖_2 for u = Xv^t/‖Xv^t‖_2, and the value f(v^t) decreases monotonically.
For example, we can construct the function SCoTLASS as follows.
soft.th = function(lambda, z) return(sign(z) * pmax(abs(z) - lambda, 0))
## even if z is a vector, soft.th works
SCoTLASS = function(lambda, X) {
  n = nrow(X); p = ncol(X); v = rnorm(p); v = v / norm(v, "2")
  for (k in 1:200) {
    u = X %*% v; u = u / norm(u, "2"); v = t(X) %*% u
    v = soft.th(lambda, v); size = norm(v, "2")
    if (size > 0) v = v / size else break
  }
  if (norm(v, "2") == 0) print("all the elements of v are zero")
  return(v)
}

Example 61  Due to nonconvexity, even when λ increases, ‖Xv‖_2 does not necessarily decrease monotonically, although the number of nonzero elements in v tends to decrease with λ. We show the execution of the code below in Fig. 7.1.
Fig. 7.1  Execution of Example 61 (left: number of nonzero elements; right: variance sum; the five curves correspond to five random initial values). With λ, the number of nonzero elements and the variance sum decrease. If we change the initial value and execute the code again, these values differ each time; this tendency becomes more remarkable as the value of λ increases.

## Data Generation
n = 100; p = 50; X = matrix(rnorm(n * p), nrow = n); lambda.seq = 0:10 / 10

m = 5; SS = array(dim = c(m, 11)); TT = array(dim = c(m, 11))
for (j in 1:m) {
  S = NULL; T = NULL
  for (lambda in lambda.seq) {
    v = SCoTLASS(lambda, X); S = c(S, sum(sign(v ^ 2))); T = c(T, norm(X %*% v, "2"))
  }
  SS[j, ] = S; TT[j, ] = T
}
## Display
par(mfrow = c(1, 2))
SS.min = min(SS); SS.max = max(SS)
plot(lambda.seq, xlim = c(0, 1), ylim = c(SS.min, SS.max),
     xlab = "lambda", ylab = "# of nonzero vectors")
for (j in 1:m) lines(lambda.seq, SS[j, ], col = j + 1)
legend("bottomleft", paste0(1:5, "-th"), lwd = 1, col = 2:(m + 1))
TT.min = min(TT); TT.max = max(TT)
plot(lambda.seq, xlim = c(0, 1), ylim = c(TT.min, TT.max), xlab = "lambda", ylab = "Variance Sum")
for (j in 1:m) lines(lambda.seq, TT[j, ], col = j + 1)
legend("bottomleft", paste0(1:5, "-th"), lwd = 1, col = 2:(m + 1))
par(mfrow = c(1, 1))

To handle not only the first principal component but also multiple components, the following formulation is made in SCoTLASS. We formulate maximizing

u_k^T X v_k

under

‖v_k‖_2 ≤ 1 , ‖v_k‖_1 ≤ c , ‖u_k‖_2 ≤ 1 , u_k^T u_j = 0  (j = 1, . . . , k − 1)

for k = 1, . . . , m, given c > 0. In fact, once sparsity is included in the settings of the earlier eigenvectors, there is no guarantee that the later ones will be orthogonal to them, so the orthogonality constraints are imposed explicitly. The u_k when v_k is fixed is given by

u_k := P⊥_{k−1} X v_k / ‖P⊥_{k−1} X v_k‖_2    (7.7)

for P⊥_{k−1} := I − Σ_{i=1}^{k−1} u_i u_i^T. In fact, given u_1, . . . , u_{k−1}, we have

u_j^T P⊥_{k−1} X v_k = u_j^T (I − Σ_{i=1}^{k−1} u_i u_i^T) X v_k = 0  (j = 1, . . . , k − 1) .

The u_k that maximizes L = u_k^T X v_k − μ(u_k^T u_k − 1) is given by (7.7). Compared with the m = 1 case, from (7.7), the only difference is the multiplication by P⊥_{k−1}, where P⊥_{k−1} is the unit matrix for k = 1.
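A minimal sketch of this multi-component procedure (our own code, assuming the function soft.th defined earlier in this section; we use the penalized form of the v-update with parameter λ rather than the constraint ‖v_k‖_1 ≤ c) alternates the projected u_k update (7.7) with the soft-thresholded v_k update.

SCoTLASS.multi = function(lambda, X, m) {
  n = nrow(X); p = ncol(X)
  U = matrix(0, n, m); V = matrix(0, p, m)
  P = diag(n)                                 # P = I - sum_i u_i u_i^T
  for (k in 1:m) {
    v = rnorm(p); v = v / sqrt(sum(v ^ 2))
    for (iter in 1:200) {
      u = P %*% X %*% v                       # (7.7) before normalization
      if (sum(u ^ 2) > 0) u = u / sqrt(sum(u ^ 2))
      v = soft.th(lambda, t(X) %*% u)
      if (sum(v ^ 2) > 0) v = v / sqrt(sum(v ^ 2)) else break
    }
    U[, k] = u; V[, k] = v
    P = P - u %*% t(u)                        # update the projection
  }
  return(list(U = U, V = V))
}
## e.g., res = SCoTLASS.multi(0.1, X, 3); round(t(res$U) %*% res$U, 5)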

7.2 Principal Component Analysis (2): SPCA

If m = p, then V is nonsingular, which means that X can be recovered from Z = X V via Z V^{−1} = Z V^T = X. In general, however, X V_m V_m^T and X do not coincide. Because each sample x_i ∈ R^p (the i-th row vector of X) is transformed into x_i V_m, we obtain x_i V_m V_m^T when we recover it, and the difference is x_i(I − V_m V_m^T) ∈ R^p. We define the reconstruction error by

Σ_{i=1}^N ‖x_i(I − V_m V_m^T)‖² = Σ_{i=1}^N x_i(I − V_m V_m^T)² x_i^T = Σ_{i=1}^N x_i(I − V_m V_m^T) x_i^T .    (7.8)

Then, given X, (7.8) is minimized when

Σ_{i=1}^N x_i V_m V_m^T x_i^T = trace(X V_m V_m^T X^T) = trace(V_m^T X^T X V_m) = Σ_{j=1}^m v_j^T X^T X v_j = Σ_{j=1}^m ‖X v_j‖²    (7.9)

is maximized. Thus, PCA can be regarded as finding the v_1, . . . , v_m that minimize the reconstruction error. In fact, maximizing (7.9) under ‖v_1‖_2 = · · · = ‖v_m‖_2 = 1 amounts to maximizing

Σ_{j=1}^m ‖X v_j‖² − Σ_{j=1}^m λ_j(‖v_j‖² − 1) .

If we differentiate this by v_j, we obtain (7.2).
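As a small numerical check (a sketch of ours, not from the text), the reconstruction error (7.8) and the projected energy (7.9) sum to Σ_i ‖x_i‖², so minimizing (7.8) is indeed equivalent to maximizing (7.9).

set.seed(1)
N = 100; p = 5; m = 2
X = matrix(rnorm(N * p), N, p); X = scale(X, center = TRUE, scale = FALSE)
V.m = eigen(t(X) %*% X / N)$vectors[, 1:m]
recon.err = sum((X %*% (diag(p) - V.m %*% t(V.m))) ^ 2)   # (7.8)
proj.energy = sum((X %*% V.m) ^ 2)                        # (7.9)
all.equal(recon.err + proj.energy, sum(X ^ 2))            # TRUE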


SPCA (sparse principal component analysis) [32] is another formulation of PCA, different from the SCoTLASS introduced in the previous section. Note that the problem of minimizing

min_{u,v ∈ R^p, ‖u‖_2 = 1} { (1/N) Σ_{i=1}^N ‖x_i − x_i v u^T‖²_2 + λ_1‖v‖_1 + λ_2‖v‖²_2 }    (7.10)

is biconvex in u, v, where x_i is the i-th row vector of X. In fact, taking the constraint ‖u‖_2 = 1 into account, we can formulate (7.10) as

L := (1/N) Σ_{i=1}^N x_i x_i^T − (2/N) Σ_{i=1}^N x_i v u^T x_i^T + (1/N) Σ_{i=1}^N x_i v v^T x_i^T + λ_1‖v‖_1 + λ_2‖v‖²_2 + μ(u^T u − 1) .

We observe that if we differentiate it twice by u_j and by v_k, the results are nonnegative when excluding the term ‖v‖_1.
When we optimize w.r.t. u with v fixed, the solution is

u = X^T z / ‖X^T z‖_2

with z = (z_i) (i = 1, . . . , N), z_1 = x_1 v, . . . , z_N = x_N v. In fact, if we differentiate L by u_j, the vector of the differentiated values for j = 1, . . . , p satisfies

−(2/N) Σ_{i=1}^N x_i v x_i^T + 2μu = 0 .
When we optimize w.r.t. v with u fixed, an extension of the elastic net algorithm can be applied. However, the problem is not convex (only biconvex), and we need to execute it several times, changing the initial value, to obtain a solution close to the optimum.
If we take the subderivative of (7.10) w.r.t. v_k with u fixed, then, when the constraint ‖v‖_2 = 1 is dropped, we have

−(1/N) Σ_{i=1}^N Σ_{j=1}^p u_j x_{i,k}(r_{i,j,k} − u_j x_{i,k} v_k) + λ_1 = 0 ,  if (1/N) Σ_{i=1}^N Σ_{j=1}^p r_{i,j,k} x_{i,k} u_j > λ_1
−(1/N) Σ_{i=1}^N Σ_{j=1}^p u_j x_{i,k}(r_{i,j,k} − u_j x_{i,k} v_k) − λ_1 = 0 ,  if (1/N) Σ_{i=1}^N Σ_{j=1}^p r_{i,j,k} x_{i,k} u_j < −λ_1
v_k = 0 ,  if |(1/N) Σ_{i=1}^N Σ_{j=1}^p r_{i,j,k} x_{i,k} u_j| ≤ λ_1 ,

where r_{i,j,k} := x_{i,j} − u_j Σ_{h≠k} x_{i,h} v_h. Thus, v_k, after normalizing so that ‖v‖_2 = 1, is

v_k = S_{λ_1}( (1/N) Σ_{i=1}^N Σ_{j=1}^p x_{i,k} r_{i,j,k} u_j ) / sqrt( Σ_{h=1}^p [ S_{λ_1}( (1/N) Σ_{i=1}^N Σ_{j=1}^p x_{i,h} r_{i,j,h} u_j ) ]² )   (k = 1, . . . , p) .

Fig. 7.2  Execution of Example 62. We observe the changes of the elements of v for λ = 0.00001 (left) and λ = 0.001 (right)

Example 62  We construct the SPCA process in the R language. We change the initial value several times and execute the process with each. Due to nonconvexity, we observe that the set of nonzero variables does not change monotonically as we increase the λ value. By repeating the process, we observe how each element of the vector v changes with the number of iterations (Fig. 7.2).
## Data Generation
n = 100; p = 5; x = matrix(rnorm(n * p), ncol = p)
## Computation of u,v
lambda = 0.001; m = 100
g = array(dim = c(m, p))
for (j in 1:p) x[, j] = x[, j] - mean(x[, j])
for (j in 1:p) x[, j] = x[, j] / sqrt(sum(x[, j] ^ 2))
r = rep(0, n)
v = rnorm(p)
for (h in 1:m) {
  z = x %*% v
  u = as.vector(t(x) %*% z)
  if (sum(u ^ 2) > 0.00001) u = u / sqrt(sum(u ^ 2))
  for (k in 1:p) {
    for (i in 1:n) r[i] = sum(u * x[i, ]) - sum(u ^ 2) * sum(x[i, -k] * v[-k])
    S = sum(x[, k] * r) / n
    v[k] = soft.th(lambda, S)
  }
  if (sum(v ^ 2) > 0.00001) v = v / sqrt(sum(v ^ 2))
  g[h, ] = v
}
## Graph Display
g.max = max(g); g.min = min(g)
plot(1:m, ylim = c(g.min, g.max), type = "n",
     xlab = "# Repetition", ylab = "Each element of v", main = "lambda = 0.001")
for (j in 1:p) lines(1:m, g[, j], col = j + 1)

When we consider sparse PCA not just for the first component but for up to the m-th component, we minimize

L := (1/N) Σ_{i=1}^N ‖x_i − x_i V_m U_m^T‖²_2 + λ_1 Σ_{j=1}^m ‖v_j‖_1 + λ_2 Σ_{j=1}^m ‖v_j‖² + μ Σ_{j=1}^m (u_j^T u_j − 1) .

Compared with SCoTLASS, no orthogonality condition u_j^T u_k = 0 (j = 1, . . . , k − 1) is included. Moreover, the formulation incorporates sparsity into minimizing the reconstruction error (7.8).
However, the formulations of SCoTLASS and SPCA are not convex but merely biconvex. SCoTLASS has the merit that the objective function decreases with the number of iterations, while SPCA has the merit that the elastic net algorithm can be applied to its update formulas.

7.3 K-Means Clustering

We refer to the problem of finding disjoint subsets C_1, . . . , C_K of {1, . . . , N} that minimize

Σ_{k=1}^K Σ_{i∈C_k} ‖x_i − x̄_k‖²_2

from the data x_1, . . . , x_N ∈ R^p and a positive integer K as K-means clustering, where x̄_k is the arithmetic mean of the data in C_k. Witten–Tibshirani (2010) formulated maximizing the weighted sum Σ_{j=1}^p w_j a_j of p elements with
a_j(C_1, . . . , C_K) := (1/N) Σ_{i=1}^N Σ_{i′=1}^N (x_{i,j} − x_{i′,j})² − Σ_{k=1}^K (1/N_k) Σ_{i∈C_k} Σ_{i′∈C_k} (x_{i,j} − x_{i′,j})²    (7.11)

under the constraint ‖w‖_2 ≤ 1, ‖w‖_1 ≤ s, w ≥ 0 (all the elements are nonnegative) for s > 0. The aim is to remove unnecessary variables by the weighting over j = 1, . . . , p and to make the interpretation of the clustering easier. To solve the problem, they proposed repeating the following updates [31].

1. Fixing w_1, . . . , w_p, find the C_1, . . . , C_K that maximize Σ_{j=1}^p w_j a_j(C_1, . . . , C_K).
2. Fixing C_1, . . . , C_K, find the w_1, . . . , w_p that maximize Σ_{j=1}^p w_j a_j(C_1, . . . , C_K).

The first term in (7.11) is constant (it does not depend on the clustering), and the second term can be written as

(1/N_k) Σ_{i∈C_k} Σ_{i′∈C_k} (x_{i,j} − x_{i′,j})² = 2 Σ_{i∈C_k} (x_{i,j} − x̄_{k,j})² ;¹

thus, to obtain the optimum C_1, . . . , C_K when the weights w_1, . . . , w_p are fixed, we may execute the following procedure.
k.means = function(X, K, w) {   # w: weights for the p variables
  n = nrow(X); p = ncol(X)
  y = sample(1:K, n, replace = TRUE); center = array(dim = c(K, p))
  for (h in 1:10) {
    for (k in 1:K) {
      if (sum(y[] == k) == 0) center[k, ] = Inf else
        for (j in 1:p) center[k, j] = mean(X[y[] == k, j])
    }
    for (i in 1:n) {
      S.min = Inf
      for (k in 1:K) {
        if (center[k, 1] == Inf) break
        S = sum((X[i, ] - center[k, ]) ^ 2 * w)
        if (S < S.min) {S.min = S; y[i] = k}
      }
    }
  }
  return(y)
}

Example 63 Generate data artificially (N = 1000, p = 2), and execute the function
k.means for the weights 1 : 1 and 1 : 100 (Fig. 7.3).
## Data Generation
K = 10; p = 2; n = 1000; X = matrix(rnorm(p * n), nrow = n, ncol = p)
w = c(1, 1); y = k.means(X, K, w)
## Display Output
plot(-3:3, -3:3, xlab = "x", ylab = "y", type = "n")
points(X[, 1], X[, 2], col = y + 1)

¹ See Chap. 10 of Statistical Learning with Math and R, Springer (2020).


Fig. 7.3  Execution of the function k.means for the weights 1 : 1 (left) and 1 : 100 (right). For the case on the right, whose weight for the second element is large, the larger penalty makes the generated clusters horizontally elongated

Comparing the two panels, if we make the weight of the second element large, we tend to obtain horizontally elongated clusters.

Given C_1, . . . , C_K, to obtain the w_1, . . . , w_p that maximize (7.11), with a ∈ R^p having nonnegative elements, it is sufficient to find the w that maximizes the inner product Σ_{h=1}^p w_h a_h under ‖w‖_2 = 1, ‖w‖_1 ≤ s, w ≥ 0.

Proposition 23  Let s > 0 and let a ∈ R^p have nonnegative elements. Then, the w with ‖w‖_2 = 1, ‖w‖_1 ≤ s that maximizes w^T a can be written as

w = S_λ(a) / ‖S_λ(a)‖_2    (7.12)

for some λ ≥ 0.

For the proof, see the Appendix.


More precisely, we can construct the following procedure [31]. In the function w.a, we obtain such a λ via binary search. At the beginning, we set λ = max_i a_i / 2 and δ = λ/2. At each step, we halve δ; if ‖w‖_1 > s, we add δ to λ to make w smaller, while if ‖w‖_1 < s, we subtract δ from λ to make w larger.
sparse.k.means = function(X, K, s) {
  p = ncol(X); w = rep(1, p)
  for (h in 1:10) {
    y = k.means(X, K, w)
    a = comp.a(X, y, K)   # pass K explicitly rather than relying on a global
    w = w.a(a, s)
  }
  return(list(w = w, y = y))
}

w.a = function(a, s) {
  p = length(a); w = rep(1, p)   # define p before initializing w
  a = a / sqrt(sum(a ^ 2))
  if (sum(a) < s) return(a)
  lambda = max(a) / 2
  delta = lambda / 2
  for (h in 1:10) {
    for (j in 1:p) w[j] = soft.th(lambda, a[j])
    ww = sqrt(sum(w ^ 2))
    if (ww == 0) w = 0 else w = w / ww
    if (sum(w) > s) lambda = lambda + delta else lambda = lambda - delta
    delta = delta / 2
  }
  return(w)
}

comp.a = function(X, y, K) {
  n = nrow(X); p = ncol(X); a = array(dim = p)
  for (j in 1:p) {
    a[j] = 0
    for (i in 1:n) for (h in 1:n) a[j] = a[j] + (X[i, j] - X[h, j]) ^ 2 / n
    for (k in 1:K) {
      S = 0
      index = which(y == k)
      if (length(index) == 0) break
      for (i in index) for (h in index) S = S + (X[i, j] - X[h, j]) ^ 2
      a[j] = a[j] - S / length(index)
    }
  }
  return(a)
}

Example 64 Generating data and executing the function sparse.k.means, we


observe what variables play crucial roles in clustering.
p = 10; n = 100; X = matrix(rnorm(p * n), nrow = n, ncol = p)
sparse.k.means(X, 5, 1.5)

$w
 [1] 0.00659343 0.00000000 0.74827158 0.00000000
 [5] 0.00000000 0.00000000 0.00000000 0.65736124
 [9] 0.00000000 0.08900768
$y
  [1] 2 2 3 2 2 5 2 5 2 3 1 3 2 4 3 2 1 5 5 4 1 1 2 2
 [25] 2 3 3 4 2 4 2 2 3 1 1 4 2 3 2 5 2 2 1 1 4 1 4 2
 [49] 3 1 5 4 5 3 5 4 4 1 3 3 2 3 3 3 3 5 2 1 3 4 3 3
 [73] 5 4 4 3 5 2 2 5 3 4 3 3 4 1 2 1 3 3 5 5 1 2 4 4
 [97] 5 4 2 2

Because K-means clustering is not convex, we cannot hope to obtain the optimum solution by repeating the functions k.means, comp.a, and w.a. However, we can find the variables relevant to the clustering.

7.4 Convex Clustering

Given the data x_1, . . . , x_N ∈ R^p, we find the u_1, . . . , u_N ∈ R^p that minimize

(1/2) Σ_{i=1}^N ‖x_i − u_i‖² + γ Σ_{i<j} w_{i,j} ‖u_i − u_j‖_2

for γ > 0, where w_{i,j} is a constant determined from x_i, x_j, such as exp(−‖x_i − x_j‖²_2). We refer to this clustering method as convex clustering. If u_i = u_j, we regard the samples indexed by i, j as being in the same cluster [7].
To solve this via the ADMM, for U ∈ R^{N×p}, V ∈ R^{N×N×p}, and Λ ∈ R^{N×N×p}, we define the extended Lagrangian as

L_ν(U, V, Λ) := (1/2) Σ_{i∈V} ‖x_i − u_i‖²_2 + γ Σ_{(i,j)∈E} w_{i,j} ‖v_{i,j}‖ + Σ_{(i,j)∈E} ⟨λ_{i,j}, v_{i,j} − u_i + u_j⟩ + (ν/2) Σ_{(i,j)∈E} ‖v_{i,j} − u_i + u_j‖²_2 ,    (7.13)

for u_i ∈ R^p (i ∈ V) and v_{i,j}, λ_{i,j} ∈ R^p ((i, j) ∈ E), where we write (i, j) ∈ E when the vertices i, j are connected and i < j, V := {1, . . . , N} is the set of vertices, and E is a subset of {(i, j) | i, j ∈ V, i < j} (with the convention w_{j,i} = w_{i,j}, v_{j,i} = v_{i,j}, λ_{j,i} = λ_{i,j}).

Proposition 24  Suppose that the edge set E is {(i, j) | i < j, i, j ∈ V}, i.e., all the vertices are connected. If we fix V, Λ and denote

y_i := x_i + Σ_{j:(i,j)∈E} (λ_{i,j} + ν v_{i,j}) − Σ_{j:(j,i)∈E} (λ_{j,i} + ν v_{j,i}) ,    (7.14)

then (7.13) is minimized when

u_i = (y_i + ν Σ_{j∈V} x_j) / (1 + Nν) .    (7.15)

For the proof, see the Appendix.


When we fix U, Λ, the minimization of (7.13) w.r.t. V is that of

(1/2) ‖(1/ν)(λ_{i,j} + ν v_{i,j}) − (u_i − u_j)‖²_2 + (γ w_{i,j}/ν) ‖v_{i,j}‖_2

w.r.t. v_{i,j} ∈ R^p. Therefore, if we write the v ∈ R^p that minimizes

σ‖v‖_2 + (1/2)‖u − v‖²_2

as prox_{σ‖·‖_2}(u), then (7.13) is minimized w.r.t. V when

v_{i,j} = prox_{σ‖·‖_2}(u_i − u_j − (1/ν)λ_{i,j}) ,    (7.16)

where σ := γ w_{i,j}/ν.
Finally, the update of the Lagrange coefficients is as follows:

λi, j = λi, j + ν(vi, j − u i + u j ) (7.17)

Based on (7.15)–(7.17), we construct the ADMM procedure for the extended Lagrangian (7.13), where we set the weight of each pair of samples whose distance exceeds dd to zero.
## Computing weights
ww = function(x, mu = 1, dd = 0) {
  n = nrow(x)
  w = array(dim = c(n, n))
  for (i in 1:n) for (j in 1:n) w[i, j] = exp(-mu * sum((x[i, ] - x[j, ]) ^ 2))
  if (dd > 0) for (i in 1:n) {
    dis = NULL
    for (j in 1:n) dis = c(dis, sqrt(sum((x[i, ] - x[j, ]) ^ 2)))
    index = which(dis > dd)
    w[i, index] = 0
  }
  return(w)
}
## prox (group Lasso) for L2
prox = function(x, tau) {
  if (sum(x ^ 2) == 0) return(x) else return(max(0, 1 - tau / sqrt(sum(x ^ 2))) * x)
}
## Update u
update.u = function(v, lambda) {
  u = array(dim = c(n, d))
  z = 0; for (i in 1:n) z = z + x[i, ]
  y = x
  for (i in 1:n) {
    if (i < n) for (j in (i + 1):n) y[i, ] = y[i, ] + lambda[i, j, ] + nu * v[i, j, ]
    if (1 < i) for (j in 1:(i - 1)) y[i, ] = y[i, ] - lambda[j, i, ] - nu * v[j, i, ]
    u[i, ] = (y[i, ] + nu * z) / (n * nu + 1)
  }
  return(u)
}
## Update v
update.v = function(u, lambda) {
  v = array(dim = c(n, n, d))
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    v[i, j, ] = prox(u[i, ] - u[j, ] - lambda[i, j, ] / nu, gamma * w[i, j] / nu)
  }
  return(v)
}
## Update lambda
update.lambda = function(u, v, lambda) {
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    lambda[i, j, ] = lambda[i, j, ] + nu * (v[i, j, ] - u[i, ] + u[j, ])
  }
  return(lambda)
}
## Repeat the updates of u, v, lambda for max_iter times
convex.cluster = function() {
  v = array(rnorm(n * n * d), dim = c(n, n, d))
  lambda = array(rnorm(n * n * d), dim = c(n, n, d))
  for (iter in 1:max_iter) {
    u = update.u(v, lambda); v = update.v(u, lambda); lambda = update.lambda(u, v, lambda)
  }
  return(list(u = u, v = v))
}

Example 65  Generating data, we execute the procedure and cluster the samples. The output is shown in Fig. 7.4. We fix ν = 1 and set the values of (γ, dd) to (1, 0.5) and (10, 0.5). Compared to K-means clustering, this method decides whether to join each pair of vertices without explicitly minimizing the within-cluster scatter or maximizing the between-cluster scatter of the cluster samples.

Fig. 7.4  Execution of Example 65. We executed convex clustering with (γ, dd) being (1, 0.5) (left) and (10, 0.5) (right). Compared with K-means clustering, convex clustering often constructs clusters that contain distant samples (left)
## Data Generation
n = 50; d = 2; x = matrix(rnorm(n * d), n, d)
## Convex Clustering
w = ww(x, 1, dd = 0.5)
gamma = 1   # gamma = 10
nu = 1; max_iter = 1000; v = convex.cluster()$v
## Adjacency Matrix
a = array(0, dim = c(n, n))
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  if (sqrt(sum(v[i, j, ] ^ 2)) < 1 / 10 ^ 4) {a[i, j] = 1; a[j, i] = 1}
}
## Display Figure
k = 0
y = rep(0, n)
for (i in 1:n) {
  if (y[i] == 0) {
    k = k + 1
    y[i] = k
    if (i < n) for (j in (i + 1):n) if (a[i, j] == 1) y[j] = k
  }
}
plot(0, xlim = c(-3, 3), ylim = c(-3, 3), type = "n", main = paste("gamma =", gamma))
points(x[, 1], x[, 2], col = y + 1)

Even if the number of clusters is given, the clustering is not uniquely determined, and the cluster sizes tend to be biased; therefore, we use the parameter dd to limit the size of a particular cluster. Thus, this method guarantees the convexity and efficiency of the computation, but considering the tuning effort and the clustering criterion, it is not always the best method.
Additionally, as with the sparse K-means clustering considered in Sect. 7.3, a sparse variant of the convex clustering of this section, sparse convex clustering, which adds to (7.13) a penalty on unnecessary variables, has been proposed (B. Wang et al., 2018) [30]. The purpose is to remove unnecessary variables from the clustering, make its interpretation easier, and save computation.
Let x^{(j)}, u^{(j)} ∈ R^N (j = 1, . . . , p) be the column vectors of X, U ∈ R^{N×p}. We rewrite the first term of (7.13) using the column vectors and add the penalty term γ_2 Σ_{j=1}^p r_j ‖u^{(j)}‖_2 for unnecessary variables. For U ∈ R^{N×p}, V ∈ R^{N×N×p}, Λ ∈ R^{N×N×p}, we define the extended Lagrangian for the ADMM as

L_ν(U, V, Λ) := (1/2) Σ_{j=1}^p ‖x^{(j)} − u^{(j)}‖²_2 + γ_1 Σ_{(i,k)∈E} w_{i,k} ‖v_{i,k}‖ + γ_2 Σ_{j=1}^p r_j ‖u^{(j)}‖_2 + Σ_{(i,k)∈E} ⟨λ_{i,k}, v_{i,k} − u_i + u_k⟩ + (ν/2) Σ_{(i,k)∈E} ‖v_{i,k} − u_i + u_k‖²_2 ,    (7.18)

where r_j is the weight for the penalty and u_1, . . . , u_N and u^{(1)}, . . . , u^{(p)} are the row and column vectors of U ∈ R^{N×p}, respectively.
The optimization w.r.t. U when V, Λ are fixed is given as follows.

Proposition 25  Given V, Λ, the u^{(j)} (j = 1, . . . , p) that minimizes (7.18) is the u^{(j)} that minimizes

(1/2) ‖G^{-1} y^{(j)} − G u^{(j)}‖²_2 + γ_2 r_j ‖u^{(j)}‖_2 ,

where y^{(1)}, . . . , y^{(p)} ∈ R^N are the column vectors of the matrix whose rows are y_1, . . . , y_N ∈ R^p in (7.14) and

G := √(1 + Nν) I_N − ((√(1 + Nν) − 1)/N) E_N

with all the elements of E_N ∈ R^{N×N} being ones.

For the proof, see the Appendix.


Note that the inverse matrix of G is given by

G^{-1} = (1/√(1 + Nν)) ( I_N + ((√(1 + Nν) − 1)/N) E_N ) .
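As a quick numerical check of G and G^{-1} (a sketch of ours, not part of the text), we can verify that G² = (Nν + 1)I_N − νE_N and G G^{-1} = I_N for a small N.

N = 5; nu = 1
E = matrix(1, N, N); I = diag(N)
G = sqrt(1 + N * nu) * I - (sqrt(1 + N * nu) - 1) / N * E
G.inv = (1 + N * nu) ^ (-0.5) * (I + (sqrt(1 + N * nu) - 1) / N * E)
max(abs(G %*% G - ((N * nu + 1) * I - nu * E)))   # approximately 0
max(abs(G %*% G.inv - I))                          # approximately 0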

Proposition 25 suggests applying the group Lasso with design matrix G, response G^{-1} y^{(j)}, and penalty weight γ_2 r_j. The original paper centers u^{(j)} in each cycle of the ADMM.
The optimization w.r.t. V when fixing U, Λ and the optimization w.r.t. Λ when fixing U, V are the same as in the original convex clustering.
The processing of sparse convex clustering is configured as follows. Other than setting the values of γ_2, r_1, . . . , r_p and defining G, G^{-1}, only the lines that compute u[, j] in the function s.update.u differ from update.u. In particular, the function gr of the group Lasso defined in Chap. 3 is applied.
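Since the function gr of Chap. 3 is not reproduced here, the following stand-in (our own sketch, assuming the signature gr(X, y, lambda) implied by the call gr(G, G.inv %*% y[, j], gamma.2 * r[j])) returns the β minimizing (1/2)‖y − Xβ‖²_2 + λ‖β‖_2, i.e., a single-group group Lasso, solved by proximal gradient descent; it is not the book's implementation.

gr = function(X, y, lambda, iter = 200) {
  p = ncol(X); beta = rep(0, p)
  L = max(eigen(t(X) %*% X, symmetric = TRUE, only.values = TRUE)$values)
  for (h in 1:iter) {
    b = beta + as.vector(t(X) %*% (y - X %*% beta)) / L        # gradient step with step size 1/L
    nrm = sqrt(sum(b ^ 2))
    if (nrm > 0) beta = max(0, 1 - lambda / (L * nrm)) * b else beta = b   # group soft-threshold
  }
  return(beta)
}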
s.update.u = function(G, G.inv, v, lambda) {
  u = array(dim = c(n, d))
  y = x
  for (i in 1:n) {
    if (i < n) for (j in (i + 1):n) y[i, ] = y[i, ] + lambda[i, j, ] + nu * v[i, j, ]
    if (1 < i) for (j in 1:(i - 1)) y[i, ] = y[i, ] - lambda[j, i, ] - nu * v[j, i, ]
  }
  for (j in 1:d) u[, j] = gr(G, G.inv %*% y[, j], gamma.2 * r[j])
  for (j in 1:d) u[, j] = u[, j] - mean(u[, j])
  return(u)
}
s.convex.cluster = function() {
  ## Set gamma.2, r[1], ..., r[p] before calling
  G = sqrt(1 + n * nu) * diag(n) - (sqrt(1 + n * nu) - 1) / n * matrix(1, n, n)
  G.inv = (1 + n * nu) ^ (-0.5) * (diag(n) + (sqrt(1 + n * nu) - 1) / n * matrix(1, n, n))
  v = array(rnorm(n * n * d), dim = c(n, n, d))
  lambda = array(rnorm(n * n * d), dim = c(n, n, d))
  for (iter in 1:max_iter) {
    u = s.update.u(G, G.inv, v, lambda); v = update.v(u, lambda)
    lambda = update.lambda(u, v, lambda)
  }
  return(list(u = u, v = v))
}

Example 66  Using the functions from Example 65 and the functions s.update.u and s.convex.cluster constructed here, we examine variable selection by sparse convex clustering. Since we apply the group Lasso, the N values contained in each u^{(j)} become 0 at the same time. Figure 7.5 shows how the values for a particular sample shrink as γ_2 increases. For γ_1 = 1 and γ_1 = 10, each u^{(j)} changed little as γ_2 varied; γ_1, which governs the coupling between u_i and u_j, also appears to have an impact. The execution is carried out according to the code below.
## Data Generation
n = 50; d = 10; x = matrix(rnorm(n * d), n, d)
## Setting before execution
w = ww(x, 1/d, dd = sqrt(d))   ## d is large, so we adjust mu and dd accordingly
gamma = 10; nu = 1; max_iter = 1000
r = rep(1, d)
## Change gamma.2, execute, and display the coefficients
gamma.2.seq = seq(1, 10, 1)
m = length(gamma.2.seq)
z = array(dim = c(m, d))
h = 0
for (gamma.2 in gamma.2.seq) {
  h = h + 1
  u = s.convex.cluster()$u
  print(gamma.2)
  for (j in 1:d) z[h, j] = u[5, j]
}
plot(0, xlim = c(1, 10), ylim = c(-2, 2), type = "n",
     xlab = "gamma.2", ylab = "Coefficients", main = paste("gamma =", gamma))
for (j in 1:d) lines(gamma.2.seq, z[, j], col = j + 1)

Appendix: Proof of Proposition

Proposition 23  Let s > 0 and let a ∈ R^p have nonnegative elements. Then, the w with ‖w‖_2 = 1 and ‖w‖_1 ≤ s that maximizes w^T a can be written as

w = S_λ(a) / ‖S_λ(a)‖_2

for some λ ≥ 0.

Fig. 7.5  (from Example 66) Fixing γ_1 = 1 (left) and γ_1 = 100 (right) and changing the value of γ_2, we observe the changes in the value of u_{i,j} for a specific sample (i = 5). For the larger γ_1, the absolute values of the coefficients are reduced compared to those for γ_1 = 1

Proof: For w_1, . . . , w_p to minimize the Lagrangian

L := −Σ_{j=1}^p w_j a_j + (λ_1/2)(w^T w − 1) + λ_2 (Σ_{j=1}^p w_j − s) − Σ_{j=1}^p μ_j w_j ,

it is necessary and sufficient (KKT conditions) that there exist λ_1, λ_2, μ_1, . . . , μ_p ≥ 0 such that

−a_j + λ_1 w_j + λ_2 − μ_j = 0    (7.19)
μ_j w_j = 0    (7.20)
w_j ≥ 0    (7.21)

for j = 1, . . . , p and

w^T w ≤ 1    (7.22)
Σ_{j=1}^p w_j ≤ s    (7.23)
λ_1(w^T w − 1) = 0    (7.24)
λ_2(Σ_{j=1}^p w_j − s) = 0 .    (7.25)

Therefore, it is sufficient to show the existence of λ_1, λ_2, μ_1, . . . , μ_p that satisfy these conditions. First, since λ_1, w_j ≥ 0, from (7.19), we have λ_1 w_j = a_j + μ_j − λ_2 ≥ 0. When a_j − λ_2 ≥ 0, if μ_j > 0, then λ_1 w_j > 0 and μ_j w_j > 0, which contradicts (7.20); thus, we have μ_j = 0. When a_j − λ_2 < 0, from (7.20), we have μ_j = λ_2 − a_j and w_j = 0. Therefore, we have λ_1 w = S_{λ_2}(a). Since w^T w = 1, we may choose λ_1 = ‖S_{λ_2}(a)‖_2 > 0, and the proposition follows. □
and the proposition follows. 

Proposition 24  Suppose that the edge set E is {(i, j) | i < j, i, j ∈ V}, i.e., all the vertices are connected. If we fix V, Λ and denote

y_i := x_i + Σ_{j:(i,j)∈E} (λ_{i,j} + ν v_{i,j}) − Σ_{j:(j,i)∈E} (λ_{j,i} + ν v_{j,i}) ,    (7.14)

then (7.13) is minimized when

u_i = (y_i + ν Σ_{j∈V} x_j) / (1 + Nν) .    (7.26)

Proof  The optimization w.r.t. U when V, Λ are fixed is the minimization of

f(U) := (1/2) Σ_{i∈V} ‖x_i − u_i‖²_2 + (ν/2) Σ_{(i,j)∈E} ‖(1/ν)(λ_{i,j} + ν v_{i,j}) − u_i + u_j‖²_2 .

Note that the terms in f(U) that depend on u_i can be written as

(ν/2) { Σ_{j∈V:(i,j)∈E} ‖(1/ν)(λ_{i,j} + ν v_{i,j}) − u_i + u_j‖²_2 + Σ_{j∈V:(j,i)∈E} ‖(1/ν)(λ_{j,i} + ν v_{j,i}) − u_j + u_i‖²_2 } .

If we differentiate f(U) by u_i, we have

−(x_i − u_i) − Σ_{j∈V:(i,j)∈E} (λ_{i,j} + ν v_{i,j} − ν(u_i − u_j)) + Σ_{j∈V:(j,i)∈E} (λ_{j,i} + ν v_{j,i} − ν(u_j − u_i)) = 0 ,

which is

u_i − ν Σ_{j≠i} (u_j − u_i) = x_i + Σ_{j:(i,j)∈E} (λ_{i,j} + ν v_{i,j}) − Σ_{j:(j,i)∈E} (λ_{j,i} + ν v_{j,i}) .

Thus, we have

(Nν + 1) u_i − ν Σ_{j∈V} u_j = y_i .    (7.27)

If we sum both sides of (7.27) over i ∈ V, then

Σ_{i∈V} u_i = Σ_{i∈V} y_i ,    (7.28)

and from (7.27), (7.28), and Σ_{j∈V} y_j = Σ_{j∈V} x_j,

u_i = (y_i + ν Σ_{j∈V} x_j) / (1 + Nν)

is the optimum. □

Proposition 25  Given V, Λ, the u^{(j)} (j = 1, . . . , p) that minimizes (7.18) is the u^{(j)} that minimizes

(1/2) ‖G^{-1} y^{(j)} − G u^{(j)}‖²_2 + γ_2 r_j ‖u^{(j)}‖_2 ,

where y^{(1)}, . . . , y^{(p)} ∈ R^N are the column vectors of the matrix whose rows are y_1, . . . , y_N ∈ R^p in (7.14) and

G := √(1 + Nν) I_N − ((√(1 + Nν) − 1)/N) E_N

with all the elements of E_N ∈ R^{N×N} being ones.

Proof  If we differentiate all the terms except the third term in (7.18) by the j-th element of u_i (the i-th element of u^{(j)}), then from (7.27), we have

(Nν + 1) u_{i,j} − ν Σ_{k∈V} u_{k,j} − y_{i,j} ,    (7.29)

where

y_{i,j} := x_{i,j} + Σ_{k:(i,k)∈E} (λ_{i,k,j} + ν v_{i,k,j}) − Σ_{k:(k,i)∈E} (λ_{k,i,j} + ν v_{k,i,j}) .

If we write (7.29) in the form of a matrix, we have

{(Nν + 1) I_N − ν E_N}[u^{(1)}, . . . , u^{(p)}] − [y^{(1)}, . . . , y^{(p)}] ,

where [y^{(1)}, . . . , y^{(p)}] is the matrix whose rows are y_1, . . . , y_N ∈ R^p, I_N ∈ R^{N×N} is the unit matrix, and E_N ∈ R^{N×N} is the matrix in which all elements are one. Note from E_N² = N E_N that M := (Nν + 1) I_N − ν E_N ∈ R^{N×N} can be written as the square of the symmetric matrix G. In fact, the coefficient of E_N in G² is

N ((√(1 + Nν) − 1)/N)² − 2 ((√(1 + Nν) − 1)/N) √(1 + Nν) = 1/N − (1 + Nν)/N = −ν ,

and (7.29) can be written as

G G [u^{(1)}, . . . , u^{(p)}] − [y^{(1)}, . . . , y^{(p)}] .

Therefore, if we take the subderivative of γ_2 Σ_{j=1}^p r_j ‖u^{(j)}‖_2 into account, we find that the optimization problem reduces to finding the value of u^{(j)} that minimizes

(1/2) ‖G^{-1} y^{(j)} − G u^{(j)}‖²_2 + γ_2 r_j ‖u^{(j)}‖_2    (7.30)

for each j = 1, . . . , p. □

Exercises 88–100
In the following, it is assumed that X ∈ R^{N×p} is centered, i.e., X = (x_{i,j}) with Σ_{i=1}^N x_{i,j} = 0 (j = 1, . . . , p), and x_i represents a row vector.

88. The vector v_1 ∈ R^p that has length one and maximizes ‖Xv‖ is called the first principal component vector. Among the vectors of length one that are orthogonal to the first through (i − 1)-th principal component vectors, the v_i ∈ R^p that maximizes ‖Xv_i‖ is called the i-th principal component vector. We wish to find them for i = 1, . . . , m (m ≤ p).
(a) Show that the first principal component vector is an eigenvector of Σ := X^T X/N.
(b) Show that eigenvectors are orthogonal when the eigenvalues are distinct. Moreover, show that the eigenvectors can be orthogonalized. If some eigenvalues are equal, what should we do?
(c) To obtain the first m principal component vectors, we choose the eigenvectors with the largest m eigenvalues and normalize them. Why can we ignore the condition "orthogonal to the first through (i − 1)-th principal component vectors"?
89. For the principal component vectors V_m := [v_1, . . . , v_m] defined in Exercise 88, show that (I − V_m V_m^T)² = I − V_m V_m^T. In addition, show that the V_m defined in Exercise 88 and the V_m ∈ R^{p×m} that minimizes the reconstruction error Σ_{i=1}^N x_i(I − V_m V_m^T) x_i^T coincide.

90. When we find the v ∈ R^p with ‖v‖_2 = 1 that maximizes ‖Xv‖_2, we may restrict the number of nonzero elements in v: maximize v^T X^T X v − λ‖v‖_0 under ‖v‖_0 ≤ t and ‖v‖_2 = 1. However, the modification loses the convexity of the problem; thus, we may consider the alternative: the same maximization under the constraint ‖v‖_1 ≤ t (t > 0),

max_{‖v‖_2=1} {v^T X^T X v − λ‖v‖_1}    (cf. (7.3))    (7.31)

(SCoTLASS). Nonetheless, it is still not convex. Therefore, we consider the problem

max_{‖u‖_2=‖v‖_2=1} {u^T X v − λ‖v‖_1}    (cf. (7.4))    (7.32)

for u ∈ R^N.

(a) Show that the v obtained in (7.32) optimizes (7.31).
Hint: If we differentiate

L := −u^T X v + λ‖v‖_1 + (μ/2)(u^T u − 1) + (δ/2)(v^T v − 1)    (cf. (7.5))    (7.33)

by u, we have Xv − μu = 0, u = Xv/‖Xv‖_2. Substitute this into u^T Xv.
(b) The reason why we transform (7.31) into (7.32) is that we can efficiently obtain the solution of (7.32): choose an arbitrary v ∈ R^p such that ‖v‖_2 = 1, and repeat the updates of u ∈ R^N, v ∈ R^p:
i. u ← Xv / ‖Xv‖_2
ii. v ← S_λ(X^T u) / ‖S_λ(X^T u)‖_2
until the u, v values converge, where S_λ(·) is the function introduced in Chap. 1, applied to each element of z = [z_1, · · · , z_p] ∈ R^p. From ∂L/∂u = ∂L/∂v = 0, derive v = S_λ(X^T u)/‖S_λ(X^T u)‖_2 and u = Xv/‖Xv‖_2.
Hint: If we take the subderivative of ‖v‖_1, it is 1, −1, [−1, 1] for v_j > 0, v_j < 0, v_j = 0, respectively, which means, for each j = 1, . . . , p,

−(the j-th element of X^T u) + λ + δ v_j = 0,   if (the j-th element of X^T u) > λ
−(the j-th element of X^T u) − λ + δ v_j = 0,   if (the j-th element of X^T u) < −λ
−(the j-th element of X^T u) + λ[−1, 1] ∋ 0,    if −λ ≤ (the j-th element of X^T u) ≤ λ ,

i.e., δv = S_λ(X^T u).

91. If β ↦ f(α, β) and α ↦ f(α, β) are convex w.r.t. β ∈ R^n and α ∈ R^m, respectively, we say that the bivariate function f : R^m × R^n → R is biconvex.
(a) Show that the bivariate function (7.33) is biconvex w.r.t. u, v.
Hint: Because ‖v‖_1 is convex, differentiate L twice w.r.t. u_j and v_k, and check whether the results are nonnegative.
(b) In general, biconvexity does not imply convexity. When N = p = 1 and X > √(μδ), show that (7.33) is not convex.
Hint: Because L is a bivariate function of u, v, we have

∇²L = [[∂²L/∂u², ∂²L/∂u∂v], [∂²L/∂u∂v, ∂²L/∂v²]] = [[μ, −X], [−X, δ]] .

If the determinant μδ − X² is negative, ∇²L contains a negative eigenvalue.

92. In general, if Φ : R^p × R^p → R and f : R^p → R satisfy

f(β) ≤ Φ(β, θ),  θ ∈ R^p,
f(β) = Φ(β, β)    (cf. (7.6))

we say that Φ majorizes f at β ∈ R^p.
(a) Suppose that Φ majorizes f. Show that if we choose β^0 ∈ R^p arbitrarily and generate β^1, β^2, . . . via

β^{t+1} = arg min_{β∈R^p} Φ(β, β^t) ,

we have the following inequality:

f(β^t) = Φ(β^t, β^t) ≥ Φ(β^{t+1}, β^t) ≥ f(β^{t+1}) .

(b) If we define f(v) := −‖Xv‖_2 + λ‖v‖_1 and Φ(v, v′) := −(Xv)^T(Xv′)/‖Xv′‖_2 + λ‖v‖_1, then show that Φ majorizes f at v using the Cauchy–Schwarz inequality.
(c) Suppose that we choose v^0 arbitrarily and generate v^1, v^2, . . . via v^{t+1} := arg min_{‖v‖_2=1} Φ(v, v^t). Express the right-hand side in terms of X, v, and λ. Show also that it coincides with S_λ(X^T u)/‖S_λ(X^T u)‖_2.
93. For the procedure below (SCoTLASS), fill in the blank, and execute it on the sample.

soft.th = function(lambda, z) return(sign(z) * pmax(abs(z) - lambda, 0))
## Even if z is a vector, soft.th works
SCoTLASS = function(lambda, X) {
  n = nrow(X); p = ncol(X); v = rnorm(p); v = v / norm(v, "2")
  for (k in 1:200) {
    u = X %*% v; u = u / norm(u, "2"); v = ## Blank ##
    v = soft.th(lambda, v); size = norm(v, "2")
    if (size > 0) v = v / size else break
  }
  if (norm(v, "2") == 0) print("All the elements of v are zeros")
  return(v)
}
## Sample
n = 100; p = 50; X = matrix(rnorm(n * p), nrow = n); lambda.seq = 0:10 / 10
S = NULL; T = NULL
for (lambda in lambda.seq) {
  v = SCoTLASS(lambda, X)
  S = c(S, sum(sign(v ^ 2)))
  T = c(T, norm(X %*% v, "2"))
}
plot(lambda.seq, S, xlab = "lambda", ylab = "# Nonzero Vectors")
plot(lambda.seq, T, xlab = "lambda", ylab = "Variance Sum")

94. We formulate a sparse PCA problem different from SCoTLASS (SPCA, sparse principal component analysis):

min_{u,v∈R^p, ‖u‖_2=1} { (1/N) Σ_{i=1}^N ‖x_i − x_i v u^T‖²_2 + λ_1‖v‖_1 + λ_2‖v‖²_2 }    (cf. (7.10))    (7.34)

For this method, when we fix u and optimize w.r.t. v, we can apply the elastic net procedure. Show each of the following.
(a) The function (7.34) is biconvex w.r.t. u, v.
Hint: When we consider the constraint ‖u‖_2 = 1, (7.34) can be written as

L := (1/N) Σ_{i=1}^N x_i x_i^T − (2/N) Σ_{i=1}^N u^T x_i^T x_i v + (1/N) Σ_{i=1}^N v^T x_i^T x_i v + λ_1‖v‖_1 + λ_2‖v‖²_2 + μ(u^T u − 1) .

Show that differentiating this equation twice w.r.t. u_j gives a nonnegative value, and that differentiating it twice w.r.t. v_k, excluding the term ‖v‖_1, also gives a nonnegative value.
(b) When we fix v and optimize w.r.t. u, the optimum solution is given by

u = X^T z / ‖X^T z‖_2 ,

where z_1 = x_1 v, . . . , z_N = x_N v.
Hint: Differentiating L w.r.t. u_j, show that the vector of the differentiated values for j = 1, . . . , p satisfies −(2/N) Σ_{i=1}^N x_i^T x_i v + 2μu = 0.
95. We construct the SPCA process using the R language. The process is repeated to
observe how each element of the vector v changes with the number of iterations.
Fill in the blanks, execute the process, and display the output as a graph.

## Data Generation
n = 100; p = 5; x = matrix(rnorm(n * p), ncol = p)
## Compute u,v
lambda = 0.001; m = 100
g = array(dim = c(m, p))
for (j in 1:p) x[, j] = x[, j] - mean(x[, j])
for (j in 1:p) x[, j] = x[, j] / sqrt(sum(x[, j] ^ 2))
r = rep(0, n)
v = rnorm(p)
for (h in 1:m) {
  z = x %*% v
  u = as.vector(t(x) %*% z)
  if (sum(u ^ 2) > 0.00001) u = u / sqrt(sum(u ^ 2))
  for (k in 1:p) {
    for (i in 1:n) r[i] = sum(u * x[i, ]) - sum(u ^ 2) * sum(x[i, -k] * v[-k])
    S = sum(x[, k] * r) / n
    v[k] = ## Blank (1) ##
  }
  if (sum(v ^ 2) > 0.00001) v = ## Blank (2) ##
  g[h, ] = v
}
## Display Graph
g.max = max(g); g.min = min(g)
plot(1:m, ylim = c(g.min, g.max), type = "n",
     xlab = "# Repetitions", ylab = "Each element of v", main = "lambda = 0.001")
for (j in 1:p) lines(1:m, g[, j], col = j + 1)

96. It is not easy to perform sparse PCA on multiple components instead of only the first principal component, and the following is required for SCoTLASS: maximize u_k^T X v_k w.r.t. u_k, v_k under

‖v_k‖_2 ≤ 1 , ‖v_k‖_1 ≤ c , ‖u_k‖_2 ≤ 1 , u_k^T u_j = 0  (j = 1, . . . , k − 1) .

Show that the value of u_k when we fix v_k is given by

u_k := P⊥_{k−1} X v_k / ‖P⊥_{k−1} X v_k‖_2    (cf. (7.7))

where P⊥_{k−1} := I − Σ_{i=1}^{k−1} u_i u_i^T.
Hint: When u_1, . . . , u_{k−1} are given, show that u_j^T P⊥_{k−1} X v_k = u_j^T (I − Σ_{i=1}^{k−1} u_i u_i^T) X v_k = 0 (j = 1, . . . , k − 1) and that the u_k that maximizes L = u_k^T X v_k − μ(u_k^T u_k − 1) is u_k = X v_k / (u_k^T X v_k).
97. K-means clustering is the problem of finding the disjoint subsets C_1, . . . , C_K of {1, . . . , N} that minimize

Σ_{k=1}^K Σ_{i∈C_k} ‖x_i − x̄_k‖²_2

from x_1, . . . , x_N ∈ R^p and the positive integer K, where x̄_k is the arithmetic mean of the samples whose indices are in C_k. Witten–Tibshirani (2010) proposed the formulation that maximizes the weighted sum

Σ_{h=1}^p w_h { (1/N) Σ_{i=1}^N Σ_{j=1}^N d²_{i,j,h} − Σ_{k=1}^K (1/N_k) Σ_{i,j∈C_k} d²_{i,j,h} }    (7.35)

with w_h ≥ 0, h = 1, . . . , p, and d_{i,j,h} := |x_{i,h} − x_{j,h}|, under ‖w‖_2 ≤ 1, ‖w‖_1 ≤ s, and w ≥ 0 for s > 0, where N_k is the number of samples in C_k.

(a) The following is a general K-means clustering procedure. Fill in the blanks, and execute it.

k.means = function(X, K, weights = w) {
  n = nrow(X); p = ncol(X)
  y = sample(1:K, n, replace = TRUE)
  center = array(dim = c(K, p))
  for (h in 1:10) {
    for (k in 1:K) {
      if (sum(y[] == k) == 0) center[k, ] = Inf else
        for (j in 1:p) center[k, j] = ## Blank (1) ##
    }
    for (i in 1:n) {
      S.min = Inf
      for (k in 1:K) {
        if (center[k, 1] == Inf) break
        S = sum((X[i, ] - center[k, ]) ^ 2 * w)
        if (S < S.min) {S.min = S; ## Blank(2) ##}
      }
    }
  }
  return(y)
}
## Data Generation
K = 10; p = 2; n = 1000; X = matrix(rnorm(p * n), nrow = n, ncol = p)
w = c(1, 1); y = k.means(X, K, w)
## Display Output
plot(-3:3, -3:3, xlab = "x", ylab = "y", type = "n")
points(X[, 1], X[, 2], col = y + 1)
(b) To obtain the w_1, . . . , w_p that maximize Σ_{j=1}^p w_j a_j with

a_j(C_1, . . . , C_K) := (1/N) Σ_{i=1}^N Σ_{i′=1}^N (x_{i,j} − x_{i′,j})² − Σ_{k=1}^K (1/N_k) Σ_{i∈C_k} Σ_{i′∈C_k} (x_{i,j} − x_{i′,j})²    (cf. (7.11))

given C_1, . . . , C_K, we find the value of w that maximizes Σ_{h=1}^p w_h a_h under ‖w‖_2 = 1, ‖w‖_1 ≤ s, and w ≥ 0, given a ∈ R^p with nonnegative elements. We take λ = 0 when the resulting w already satisfies ‖w‖_1 < s, and otherwise choose λ > 0 such that ‖w‖_1 = s. Based on the fact that

w = S_λ(a) / ‖S_λ(a)‖_2    (cf. (7.12))

maximizes w^T a among the w ∈ R^p with nonnegative elements, we construct the following procedure. Fill in the blanks, and execute it.
the following procedure. Fill in the blanks, and execute it.
1 w.a = function(a, s) {
2 a = a/sqrt(sum(a^2))
3 if (sum(a) < s) return(a)
4 p = length(a)
5 lambda = max(a) / 2
6 delta = lambda / 2
7 for (h in 1:10) {
8 for (j in 1:p) w[j] = soft.th(lambda, ## Blank(1) ##)
9 ww = sqrt(sum(w ^ 2))
10 if (ww == 0) w = 0 else w = w / ww
11 if (sum(w) > s) lambda = lambda + delta else lambda = ##
Blank(2) ##
12 delta = delta / 2
13 }
14 return(w)
15 }

98. The following is a sparse K-means procedure based on Exercise 97, to which the functions below have been added. Fill in the blanks, and execute it.

sparse.k.means = function(X, K, s) {
  p = ncol(X); w = rep(1, p)
  for (h in 1:10) {
    y = k.means(## Blank (1) ##)
    a = comp.a(## Blank (2) ##)
    w = w.a(## Blank (3) ##)
  }
  return(list(w = w, y = y))
}
comp.a = function(X, y) {
  n = nrow(X); p = ncol(X); a = array(dim = p)
  for (j in 1:p) {
    a[j] = 0
    for (i in 1:n) for (h in 1:n) a[j] = a[j] + (X[i, j] - X[h, j]) ^ 2 / n
    for (k in 1:K) {
      S = 0
      index = which(y == k)
      if (length(index) == 0) break
      for (i in index) for (h in index) S = S + (X[i, j] - X[h, j]) ^ 2
      a[j] = a[j] - S / length(index)
    }
  }
  return(a)
}
## Execute the two lines below
p = 10; n = 100; X = matrix(rnorm(p * n), nrow = n, ncol = p)
sparse.k.means(X, 5, 1.5)

99. Given data x_1, . . . , x_N ∈ R^p and a parameter γ > 0, find the u_1, . . . , u_N ∈ R^p that minimize

(1/2) Σ_{i=1}^N ‖x_i − u_i‖² + γ Σ_{i<j} w_{i,j} ‖u_i − u_j‖_2 ,

where the w_{i,j} are determined from x_i, x_j, such as exp(−‖x_i − x_j‖²_2). If u_i = u_j, we consider them to be in the same cluster. We define the extended Lagrangian

L_ν(U, V, Λ) := (1/2) Σ_{i∈V} ‖x_i − u_i‖²_2 + γ Σ_{(i,j)∈E} w_{i,j} ‖v_{i,j}‖ + Σ_{(i,j)∈E} ⟨λ_{i,j}, v_{i,j} − u_i + u_j⟩ + (ν/2) Σ_{(i,j)∈E} ‖v_{i,j} − u_i + u_j‖²_2    (cf. (7.13))

for U ∈ R^{N×p}, V ∈ R^{N×N×p}, and Λ ∈ R^{N×N×p} to obtain the solution of the clustering via the ADMM. Fill in the blanks, and execute it. Explain what process is being executed for the optimization w.r.t. U with V, Λ fixed, that w.r.t. V with U, Λ fixed, and that w.r.t. Λ with U, V fixed.
ww = function(x, mu = 1, dd = 0) {
  n = nrow(x)
  w = array(dim = c(n, n))
  for (i in 1:n) for (j in 1:n) w[i, j] = exp(-mu * sum((x[i, ] - x[j, ]) ^ 2))
  if (dd > 0) for (i in 1:n) {
    dis = NULL
    for (j in 1:n) dis = c(dis, sqrt(sum((x[i, ] - x[j, ]) ^ 2)))
    index = which(dis > dd)
    w[i, index] = 0
  }
  return(w)
}
prox = function(x, tau) {
  if (sum(x ^ 2) == 0) return(x) else return(max(0, 1 - tau / sqrt(sum(x ^ 2))) * x)
}
update.u = function(v, lambda) {
  u = array(dim = c(n, d))
  z = 0; for (i in 1:n) z = z + x[i, ]
  y = x
  for (i in 1:n) {
    if (i < n) for (j in (i + 1):n) y[i, ] = y[i, ] + lambda[i, j, ] + nu * v[i, j, ]
    if (1 < i) for (j in 1:(i - 1)) y[i, ] = y[i, ] - lambda[j, i, ] - nu * v[j, i, ]
    u[i, ] = (y[i, ] + nu * z) / (n * nu + 1)
  }
  return(u)
}
update.v = function(u, lambda) {
  v = array(dim = c(n, n, d))
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    v[i, j, ] = prox(u[i, ] - u[j, ] - lambda[i, j, ] / nu, gamma * w[i, j] / nu)
  }
  return(v)
}
update.lambda = function(u, v, lambda) {
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    lambda[i, j, ] = lambda[i, j, ] + nu * (v[i, j, ] - u[i, ] + u[j, ])
  }
  return(lambda)
}
## Repeat the updates of u, v, lambda for max_iter times
convex.cluster = function() {
  v = array(rnorm(n * n * d), dim = c(n, n, d))
  lambda = array(rnorm(n * n * d), dim = c(n, n, d))
  for (iter in 1:max_iter) {
    u = ## Blank (1) ##
    v = ## Blank (2) ##
    lambda = ## Blank (3) ##
  }
  return(list(u = u, v = v))
}
## Data Generation
n = 50; d = 2; x = matrix(rnorm(n * d), n, d)
## Convex Clustering
w = ww(x, 1, dd = 1); gamma = 10; nu = 1; max_iter = 1000; v = convex.cluster()$v
## Adjacency Matrix
a = array(0, dim = c(n, n))
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  if (sqrt(sum(v[i, j, ] ^ 2)) < 1 / 10 ^ 4) {a[i, j] = 1; a[j, i] = 1}
}
## Display
k = 0
y = rep(0, n)
for (i in 1:n) {
  if (y[i] == 0) {
    k = k + 1
    y[i] = k
    if (i < n) for (j in (i + 1):n) if (a[i, j] == 1) y[j] = k
  }
}
plot(0, xlim = c(-3, 3), ylim = c(-3, 3), type = "n", main = "gamma = 10")
points(x[, 1], x[, 2], col = y + 1)

100. Let x^{(j)}, u^{(j)} ∈ R^N (j = 1, . . . , p) be the column vectors of X, U ∈ R^{N×p}. We define the extended Lagrangian as

L_ν(U, V, Λ) := (1/2) Σ_{j=1}^p ‖x^{(j)} − u^{(j)}‖²_2 + γ_1 Σ_{(i,k)∈E} w_{i,k} ‖v_{i,k}‖ + γ_2 Σ_{j=1}^p r_j ‖u^{(j)}‖_2 + Σ_{(i,k)∈E} ⟨λ_{i,k}, v_{i,k} − u_i + u_k⟩ + (ν/2) Σ_{(i,k)∈E} ‖v_{i,k} − u_i + u_k‖²_2    (cf. (7.18))

for convex optimization w.r.t. U ∈ R^{N×p}, V ∈ R^{N×N×p}, Λ ∈ R^{N×N×p} (sparse convex clustering), where r_j is the weight for the penalty of the j-th variable. The following is the code. Unlike ordinary convex clustering, the parameters γ_2, r_1, . . . , r_p have been added, and the functions s.update.u and s.convex.cluster have been modified. Explain why these differences are required between convex clustering and sparse convex clustering.
s.update.u = function(G, G.inv, v, lambda) {   # G, G.inv passed as arguments
  u = array(dim = c(n, d))
  y = x
  for (i in 1:n) {
    if (i < n) for (j in (i + 1):n) y[i, ] = y[i, ] + lambda[i, j, ] + nu * v[i, j, ]
    if (1 < i) for (j in 1:(i - 1)) y[i, ] = y[i, ] - lambda[j, i, ] - nu * v[j, i, ]
  }
  for (j in 1:d) u[, j] = gr(G, G.inv %*% y[, j], gamma.2 * r[j])
  for (j in 1:d) u[, j] = u[, j] - mean(u[, j])
  return(u)
}
s.convex.cluster = function() {
  ## gamma.2, r[1], ..., r[p]
  G = sqrt(1 + n * nu) * diag(n) - (sqrt(1 + n * nu) - 1) / n * matrix(1, n, n)
  G.inv = (1 + n * nu) ^ (-0.5) *
    (diag(n) + (sqrt(1 + n * nu) - 1) / n * matrix(1, n, n))
  v = array(rnorm(n * n * d), dim = c(n, n, d))
  lambda = array(rnorm(n * n * d), dim = c(n, n, d))
  for (iter in 1:max_iter) {
    u = s.update.u(G, G.inv, v, lambda); v = update.v(u, lambda)
    lambda = update.lambda(u, v, lambda)
  }
  return(list(u = u, v = v))
}
Bibliography

1. Alizadeh, A., Eisen, M., Davis, R.E., Ma, C., Lossos, I., Rosenwal, A., Boldrick, J., Sabet, H.,
Tran, T., Yu, X., Pwellm, J., Marti, G., Moore, T., Hudsom, J., Lu, L., Lewis, D., Tibshirani,
R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, K., Levy, R., Wilson,
W., Greve, M., Byrd, J., Botstein, D., Brown, P., Staudt, L.: Identification of molecularly and
clinically distinct subtypes of diffuse large B cell lymphoma by gene expression profiling.
Nature 403, 503–511 (2000)
2. Arnold, T., Tibshirani, R.: genlasso: path algorithm for generalized lasso problems. R package
version 1.5
3. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM J. Imaging Sci. 2, 183–202 (2009)
4. Bertsekas, D.: Convex Analysis and Optimization. Athena Scientific, Nashua (2003)
5. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical
learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1),
1–124 (2011)
6. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge
(2004)
7. Chi, E.C., Lange, K.: Splitting methods for convex clustering. J. Comput. Graph. Stat. (online
access) (2014)
8. Danaher, P., Witten, D.: The joint graphical lasso for inverse covariance estimation across
multiple classes. J. R. Stat. Soc. Ser. B 76(2), 373–397 (2014)
9. Efron, B., Hastie, T., Johnstone, I., Tibshirani, B.: Least angle regression. Ann. Stat. 32(2),
407–499 (2004)
10. Friedman, J., Hastie, T., Hoefling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann.
Appl. Stat. 1(2), 302–332 (2007)
11. Friedman, J., Hastie, T., Simon, N., Tibshirani, R.: glmnet: lasso and elastic-net regularized
generalized linear models. R package version 4.0 (2015)
12. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical
lasso. Biostatistics 9, 432–441 (2008)
13. Jacob, L., Obozinski, G., Vert, J.-P.: Group lasso with overlap and graph lasso. In: Proceeding
of the 26th International Conference on Machine Learning, Montreal, Canada (2009)
14. Johnson, N.: A dynamic programming algorithm for the fused lasso and L0-segmentation. J. Comput. Graph. Stat. 22(2), 246–260 (2013)


15. Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based
on the lasso. J. Comput. Graph. Stat. 12, 531–547 (2003)
16. Suzuki, J.: Statistical Learning with Math and R. Springer, Berlin (2020)
17. Kawano, S., Matsui,H., Hirose, K.: Statistical Modeling for Sparse Estimation. Kyoritsu-
Shuppan, Tokyo (2018) (in Japanese)
18. Lauritzen, S.L.: Graphical Models. Oxford University Press, Oxford (1996)
19. Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large
incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
20. Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462 (2006)
21. Mota, J., Xavier, J., Aguiar, P., Püschel, M.: A proof of convergence for the alternating direction
method of multipliers applied to polyhedral-constrained functions. Optimization and Control,
Mathematics arXiv (2011)
22. Nesterov, Y.: Gradient methods for minimizing composite objective function. Technical Report
76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain
(UCL) (2007)
23. Ravikumar, P., Liu, H., Lafferty, J., Wasserman, L.: Sparse additive models. J. R. Stat. Soc.
Ser. B 71(5), 1009–1030 (2009)
24. Ravikumar, P., Wainwright, M.J., Raskutti, G., Yu, B.: High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 5, 935–980 (2011)
25. Simon, N., Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for Cox’s proportional
hazards model via coordinate descent. J. Stat. Softw. 39(5), 1–13 (2011)
26. Simon, N., Friedman, J., Hastie, T., Tibshirani, R.: A sparse-group lasso. J. Comput. Graph.
Stat. 22(2), 231–245 (2013)
27. Simon, N., Friedman, J., Hastie, T.: A blockwise descent algorithm for group-penalized mul-
tiresponse and multinomial regression. Computation, Mathematics arXiv (2013)
28. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58,
267–288 (1996)
29. Tibshirani, R., Taylor, J.: The solution path of the generalized lasso. Ann. Stat. 39(3), 1335–
1371 (2011)
30. Wang, B., Zhang, Y., Sun, W., Fang, Y.: Sparse convex clustering. J. Comput. Graph. Stat.
27(2), 393–403 (2018)
31. Witten, D., Tibshirani, R.: A framework for feature selection in clustering. J. Am. Stat. Assoc.
105(490), 713–726 (2010)
32. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph.
Stat. 15(2), 265–286 (2006)
