Sparse Estimation with Math and Python: 100 Exercises for Building Logic
ISBN 9789811614378, 9811614377
Joe Suzuki
This work is subject to copyright. All rights are solely and exclusively
licensed by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other
physical way, and transmission or information storage and retrieval,
electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer
Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04
Gateway East, Singapore 189721, Singapore
Preface
I started considering sparse estimation problems around 2017,
when I moved from the mathematics department to the statistics
department at Osaka University, Japan. I have been studying
information theory and graphical models for over thirty years.
The first book I found was “Statistical Learning with Sparsity” by T.
Hastie, R. Tibshirani, and M. Wainwright. I thought it was a monograph
rather than a textbook and that it would be tough for a non-expert to
read through. However, I downloaded more than fifty papers that
were cited in the book and read them all. In fact, the book does not
instruct so much as suggest how to study sparsity. The contrast
between statistics and convex optimization gradually attracted me as I
came to understand the material.
On the other hand, it seems that the core research results on sparsity
came out around 2010–2015. However, I still think there are further
possibilities and extensions. This book contains all the
mathematical derivations and source programs, so graduate students
can construct any procedure from scratch with the help of this
book.
Recently, I published the books “Statistical Learning with Math and R”
(SLMR), “Statistical Learning with Math and Python” (SLMP), and
“Sparse Estimation with Math and R” (SEMR). A common idea lies
behind these books (XXMR/XXMP): they not only convey knowledge of
statistical learning and sparse estimation but also help you build logic in
your brain by following each step of the derivations and each line of the
source programs. I often meet data scientists engaged in machine
learning and statistical analyses for research collaborations and
introduce my students to them. I recently found that almost all of
them consider (mathematical) logic, rather than knowledge and
experience, to be the most crucial ability for grasping the essence of
their jobs. The knowledge we need changes every day and can be
obtained when needed. However, logic allows us to examine whether
each item on the Internet is correct and to follow any changes; without
it, we might even miss opportunities.
What makes SEMP unique?
I have summarized the features of this book as follows.
1. Developing logic
Most books on this topic explain how to use a package and provide examples
of executions for those who are not familiar with it. Still,
because only the inputs and outputs are visible, we can see the
procedure only as a black box. In this sense, the reader will obtain limited
satisfaction because they will not be able to grasp the essence of
the subject. SEMP intends to show the reader the heart of ML and is
more of a full-fledged academic book.
The reader can ask any question about the book via https://bayesnet.org/books.
Exercises 1–20
2 Generalized Linear Regression
2.1 Generalization of Lasso in Linear Regression
2.2 Logistic Regression for Binary Values
2.3 Logistic Regression for Multiple Values
2.4 Poisson Regression
2.5 Survival Analysis
Appendix Proof of Proposition
Exercises 21–33
3 Group Lasso
3.1 When One Group Exists
3.2 Proxy Gradient Method
3.3 Group Lasso
3.4 Sparse Group Lasso
3.5 Overlap Lasso
3.6 Group Lasso with Multiple Responses
3.7 Group Lasso Via Logistic Regression
3.8 Group Lasso for the Generalized Additive Models
Appendix Proof of Proposition
Exercises 34–46
4 Fused Lasso
4.1 Applications of Fused Lasso
4.2 Solving Fused Lasso Via Dynamic Programming
4.3 LARS
4.4 Dual Lasso Problem and Generalized Lasso
4.5 ADMM
Appendix Proof of Proposition
Exercises 47–61
5 Graphical Models
5.1 Graphical Models
5.2 Graphical Lasso
5.3 Estimation of the Graphical Model Based on the Quasi-Likelihood
5.4 Joint Graphical Lasso
Appendix Proof of Propositions
Exercises 62–75
6 Matrix Decomposition
6.1 Singular Decomposition
6.2 Eckart-Young’s Theorem
6.3 Norm
6.4 Sparse Estimation for Low-Rank Estimations
Appendix Proof of Propositions
Exercises 76–87
7 Multivariate Analysis
7.1 Principal Component Analysis (1): SCoTLASS
7.2 Principal Component Analysis (2): SPCA
7.3 $K$-Means Clustering
7.4 Convex Clustering
Appendix Proof of Proposition
Exercises 88–100
References
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
J. Suzuki, Sparse Estimation with Math and Python
https://doi.org/10.1007/978-981-16-1438-5_1
1. Linear Regression
Joe Suzuki1
(1) Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka, Japan
Joe Suzuki
Email: [email protected]
In general statistics, we often assume that the number of samples $N$ is greater than the
number of variables $p$. If this is not the case, it may not be possible to solve for the best-
fitting regression coefficients via the least squares method, or it may be too
computationally costly to compare a total of $2^p$ models using some information
criterion.
When $p$ is greater than $N$ (also known as the sparse situation), even for linear
regression, it is more common to minimize, instead of the usual squared error, the
modified objective function to which a term is added to prevent the coefficients from
being too large (the so-called regularization term). If the regularization term is a
constant $\lambda$ times the L1-norm (resp. L2-norm) of the coefficient vector, it is called Lasso
(resp. Ridge). In the case of Lasso, if the value of $\lambda$ increases, there will be more
coefficients that go to 0, and when $\lambda$ reaches a certain value, all the coefficients will
eventually become 0. In that sense, we can say that Lasso also plays a role in model
selection.
In this chapter, we examine the properties of Lasso in comparison to those of Ridge.
After that, we investigate the elastic net, a regularized regression method that combines
the advantages of both Ridge and Lasso. Finally, we consider how to select an
appropriate value of $\lambda$.
1.1 Linear Regression
In the following, for $X \in \mathbb{R}^{N \times p}$, $y \in \mathbb{R}^N$, and $\beta \in \mathbb{R}^p$, the squared error $\frac{1}{2N}\|y - X\beta\|_2^2$ is denoted by $L$.
First, for the sake of simplicity, we assume that the $j$th column of $X$ ($j = 1, \ldots, p$)
and $y$ have already been centered. That is, for each $j$, define
$\bar{x}_j := \frac{1}{N}\sum_{i=1}^N x_{i,j}$, and assume that $\bar{x}_j$ has already been subtracted from each $x_{i,j}$ so that $\bar{x}_j = 0$ is
satisfied. Similarly, defining $\bar{y} := \frac{1}{N}\sum_{i=1}^N y_i$, we assume that $\bar{y}$ was subtracted in advance
from each $y_i$ so that $\bar{y} = 0$ holds. Under this condition, the intercept satisfies
$\hat\beta_0 = \bar{y} - \sum_{j=1}^p \bar{x}_j \hat\beta_j = 0$. Thus, from now on, without loss of generality, we may assume that the intercept
$\beta_0$ is zero and use this fact in our further calculations.
We begin by first observing the following equality:
$$\frac{\partial L}{\partial \beta} = -\frac{1}{N}X^\top(y - X\beta). \tag{1.1}$$
Thus, when we set the right-hand side of (1.1) equal to 0, and if $X^\top X$ is invertible, then $\hat\beta$
becomes
$$\hat\beta = (X^\top X)^{-1}X^\top y. \tag{1.2}$$
For the case where $p = 1$, writing the single column of $X$ as $(x_1, \ldots, x_N)^\top$, we see that
$$\hat\beta = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2}. \tag{1.3}$$
If we had not performed the data centering, we would still obtain the same slope $\hat\beta$, while the intercept would be
$$\hat\beta_0 = \bar{y} - \hat\beta\,\bar{x}. \tag{1.4}$$
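As a quick numerical check of (1.2) and (1.3) (a minimal sketch; the data and variable names below are made up for illustration and are not the book's code), we can verify the closed-form solution on centered data:

import numpy as np

np.random.seed(0)
N, p = 100, 3
X = np.random.randn(N, p)
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * np.random.randn(N)

# Center each column of X and y so that the intercept can be ignored.
X_c = X - X.mean(axis=0)
y_c = y - y.mean()

# Equation (1.2): beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X_c.T @ X_c, X_c.T @ y_c)
print(beta_hat)  # should be close to beta_true

# Equation (1.3) for p = 1, using only the first column:
x1 = X_c[:, 0]
print((x1 @ y_c) / (x1 @ x1))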
In this book, we focus more on the sparse case, i.e., when $p$ is larger than $N$. In this
case, a problem arises: when $p > N$, the matrix $X^\top X$ does not have an inverse. In fact,
since $\mathrm{rank}(X^\top X) \le \mathrm{rank}(X) \le \min\{N, p\} = N < p$, the matrix $X^\top X$ is singular. An alternative is to
select a subset of the $p$ variables to include in the model
(that is, to decide whether we choose each of the variables or not). This means we have to
compare a total of $2^p$ models. Then, to extract the proper model by using
an information criterion or cross-validation, the computational resources required will
grow exponentially with the number of variables $p$.
To deal with this kind of problem, let us consider the following. Let $\lambda \ge 0$ be a
constant. We add to the squared error a term that penalizes $\beta$ for being too large in size.
Specifically, we define
$$L := \frac{1}{2N}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \tag{1.5}$$
or
$$L := \frac{1}{2N}\|y - X\beta\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2. \tag{1.6}$$
Our task now is to solve for the minimizer of one of the above quantities. Here,
for $\beta = (\beta_1, \ldots, \beta_p)^\top$, $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1-norm and $\|\beta\|_2 = \big(\sum_{j=1}^p \beta_j^2\big)^{1/2}$ is the L2-norm.
In this chapter, we center the data, construct either (1.5) or (1.6), and then minimize it with respect to the slope $\beta$ (this is called
Lasso for (1.5) and Ridge for (1.6)).
1.2 Subderivative
To address the minimization problem for Lasso, we need a method for optimizing
functions that are not differentiable. When we want to find the points $x$ of the maxima
or minima of a single-variable polynomial $f(x)$, we can differentiate $f$ and solve $f'(x) = 0$;
however, this does not work for functions containing an absolute value, which are not differentiable everywhere.
We say that a function $f: \mathbb{R} \to \mathbb{R}$ is convex if, for any $0 < \alpha < 1$ and $x, y \in \mathbb{R}$,
$$f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y). \tag{1.7}$$
The function in the right panel of Fig. 1.1 violates this inequality; therefore, it is not convex (Fig. 1.1, right). If the functions $f, g$ are convex, then for any
$\beta, \gamma \ge 0$, the function $\beta f + \gamma g$ has to be convex, since the following holds for $0 < \alpha < 1$:
$$\beta f(\alpha x + (1-\alpha)y) + \gamma g(\alpha x + (1-\alpha)y) \le \alpha\{\beta f(x) + \gamma g(x)\} + (1-\alpha)\{\beta f(y) + \gamma g(y)\}.$$
Fig. 1.1 Left: $f(x) = |x|$ is convex. However, at the origin, the derivatives from each side differ; thus, it
is not differentiable there. Right: We cannot simply judge from its shape, but this function is not convex
Next, for any convex function $f$, fix $x_0 \in \mathbb{R}$ arbitrarily. We
say that the set of all $z \in \mathbb{R}$ that satisfy
$$f(x) \ge f(x_0) + z(x - x_0) \quad \text{for all } x \in \mathbb{R} \tag{1.8}$$
is the subderivative of $f$ at $x_0$.
If $f$ is differentiable at $x_0$, then the subderivative will be a set that contains only one
element, namely $f'(x_0)$. We prove this as follows.
First, since the convex function $f$ is differentiable at $x_0$, it satisfies
$f(x) \ge f(x_0) + f'(x_0)(x - x_0)$, so $f'(x_0)$ belongs to the subderivative. To see this, since $f$ is convex,
$$f(x_0 + \alpha(x - x_0)) = f(\alpha x + (1-\alpha)x_0) \le \alpha f(x) + (1-\alpha)f(x_0), \quad 0 < \alpha < 1.$$
This can be rewritten as
$$\frac{f(x_0 + \alpha(x - x_0)) - f(x_0)}{\alpha} \le f(x) - f(x_0),$$
and letting $\alpha \to 0$ gives $f'(x_0)(x - x_0) \le f(x) - f(x_0)$. Conversely, dividing (1.8) by $x - x_0$ for $x < x_0$ and for $x > x_0$ shows that any $z$ in the subderivative must be greater than or equal to the derivative on the
left and, at the same time, be less than or equal to the derivative on the right at $x_0$. Since
$f$ is differentiable at $x_0$, those two derivatives are equal; this completes the proof.
The main interest of this book is specifically the case where $f(x) = |x|$ and $x_0 = 0$.
Hence, by (1.8), its subderivative is the set of $z$ such that $|x| \ge zx$ for any $x$. These
values of $z$ lie in the interval greater than or equal to $-1$ and less than or equal to 1, i.e.,
the subderivative of $|x|$ at $x = 0$ is the interval $[-1, 1]$. Let us confirm this. If $|x| \ge zx$ holds for any $x$, then for $x = 1$ and $x = -1$,
$z \le 1$ and $z \ge -1$, respectively, need to be true. Conversely, if $-1 \le z \le 1$, then $zx \le |z|\,|x| \le |x|$
is true for any arbitrary $x$.
Example 1 By dividing into the three cases $x < 0$, $x = 0$, and $x > 0$, find the values $x$ that
attain the minimum of each of the two functions considered here. Neither is differentiable at $x = 0$. Despite not being differentiable there, the point $x = 0$ can be a minimum.
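To see numerically that such a function can attain its minimum at the point where it is not differentiable, here is a small sketch with a hypothetical function of my own choosing (not necessarily the one in Example 1):

import numpy as np

# Hypothetical instance: f(x) = x^2 + 2|x|.
# For x > 0, f'(x) = 2x + 2 > 0; for x < 0, f'(x) = 2x - 2 < 0; at x = 0 the
# subderivative is {2x} + 2*[-1, 1] = [-2, 2], which contains 0, so x = 0 is the minimum.
f = lambda x: x**2 + 2 * np.abs(x)
xs = np.linspace(-2, 2, 4001)
print(xs[np.argmin(f(xs))])  # approximately 0.0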
1.3 Lasso
As stated in Sect. 1.1, we consider the method (Lasso) that minimizes
$$L := \frac{1}{2N}\sum_{i=1}^N\Big(y_i - \sum_{j=1}^p x_{i,j}\beta_j\Big)^2 + \lambda\sum_{j=1}^p|\beta_j| \tag{1.9}$$
for a given $\lambda \ge 0$. In the following, we assume that each column of $X$ has been standardized so that $\frac{1}{N}\sum_{i=1}^N x_{i,j}^2 = 1$
holds, and let $r_{i,j} := y_i - \sum_{k \ne j} x_{i,k}\beta_k$. With this assumption, the calculations are made much
simpler.
Solving for the subderivative of $L$ with respect to $\beta_j$ gives
$$0 \in -\frac{1}{N}\sum_{i=1}^N x_{i,j}\big(r_{i,j} - x_{i,j}\beta_j\big) + \lambda\,\partial|\beta_j|, \tag{1.10}$$
so that, writing $s_j := \frac{1}{N}\sum_{i=1}^N x_{i,j} r_{i,j}$, the update becomes
$$\hat\beta_j = \mathcal{S}_\lambda(s_j) := \begin{cases} s_j - \lambda, & s_j > \lambda \\ 0, & -\lambda \le s_j \le \lambda \\ s_j + \lambda, & s_j < -\lambda, \end{cases} \tag{1.11}$$
where $\mathcal{S}_\lambda$ is the soft-thresholding operator. Repeating this update over $j = 1, \ldots, p$ until convergence is called coordinate descent.
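A minimal Python sketch of the soft-thresholding operator $\mathcal{S}_\lambda$ in (1.11); the function name soft_th is my choice, and the book's own code may differ:

import numpy as np

def soft_th(lam, x):
    # S_lambda(x): shrink x toward 0 by lam, and set it to 0 inside [-lam, lam]
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_th(1.0, np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))  # [-2. -0.  0.  0.  2.]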
At a certain $\lambda$, all the coefficients will be 0. The $\lambda$ at which each coefficient becomes 0 varies from one coefficient to another.
As we can see in Fig. 1.4, as $\lambda$ increases, the absolute value of each coefficient
decreases. When $\lambda$ reaches a certain value, all coefficients go to 0. In other words, for
each value of $\lambda$, the set of nonzero coefficients differs. The larger $\lambda$ becomes, the
smaller the set of selected variables.
When performing coordinate descent, it is quite common to begin with a $\lambda$ large
enough that every coefficient is zero and then make $\lambda$ smaller gradually. This method is
called a warm start; it utilizes the fact that, when we want to calculate the
coefficients for a whole sequence of $\lambda$ values, we can improve the calculation performance by setting
the initial value of $\beta$ to the $\hat\beta$ estimated at the previous (larger) $\lambda$. For example, we can write
the program as follows:
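The book's own listing is not reproduced in this excerpt; the following is a minimal coordinate-descent sketch with a warm start over a decreasing sequence of $\lambda$ values (the function and variable names are mine):

import numpy as np

def soft_th(lam, x):
    # soft-thresholding operator, repeated here so that the sketch is self-contained
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def linear_lasso(X, y, lam, beta=None, n_iter=100):
    # Coordinate descent for the Lasso objective (1.9) on centered data;
    # beta is the warm-start value carried over from the previous lambda.
    N, p = X.shape
    if beta is None:
        beta = np.zeros(p)
    scale = (X ** 2).mean(axis=0)                  # equals 1 if the columns are standardized
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual r_{i,j}
            s = (X[:, j] @ r) / N                  # s_j in (1.11)
            beta[j] = soft_th(lam, s) / scale[j]
    return beta

def warm_start_path(X, y, lams):
    # Compute the Lasso path over a decreasing sequence of lambda values.
    beta = np.zeros(X.shape[1])
    path = []
    for lam in sorted(lams, reverse=True):         # start from the largest lambda
        beta = linear_lasso(X, y, lam, beta=beta.copy())
        path.append(beta.copy())
    return np.array(path)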
Example 3 We use the warm start method to reproduce the coefficient paths for each $\lambda$
in Example 2.
First, we make the value of $\lambda$ large enough so that all $\hat\beta_j$ are 0 and
then gradually decrease the size of $\lambda$ while performing coordinate descent. Here, for
simplicity, we assume that for each $j$, we have $\frac{1}{N}\sum_{i=1}^N x_{i,j}^2 = 1$; moreover, the
values of $\frac{1}{N}\sum_{i=1}^N x_{i,j} y_i$ are all different. In this case, the smallest $\lambda$ that makes every coefficient zero is
$$\lambda_{\max} := \max_{1 \le j \le p}\left|\frac{1}{N}\sum_{i=1}^N x_{i,j} y_i\right|.$$
If we make $\lambda$ larger than this formula, it will be satisfied that $\hat\beta_j = 0$ for all $j$. Then, we
have that $r_{i,j} = y_i$ and $s_j = \frac{1}{N}\sum_{i=1}^N x_{i,j} y_i$
hold, so $|s_j| \le \lambda$ and the soft threshold (1.11) keeps every coefficient at zero. Thus, when we decrease the size of $\lambda$, one of the values of $j$ will be the first to satisfy
$|s_j| > \lambda$, and its coefficient becomes nonzero. Again, if we continue to make $\lambda$ smaller, more coefficients become nonzero one after another.
Fig. 1.5 The execution result of Example 4. The numbers at the top represent the number of estimated
coefficients that are not zero
The glmnet package is often used [11].
Example 4 (Boston) Using the Boston dataset and setting the variable of the 14th
column as the target variable and the other 13 variables as predictors, we plot a graph
similar to that of the previous one (Fig. 1.5).
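As an illustration of how such a coefficient-path plot can be produced in Python, the sketch below uses scikit-learn's lasso_path rather than glmnet, and the California housing data as a stand-in, since recent scikit-learn versions no longer distribute the Boston dataset:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import scale

data = fetch_california_housing()
X, y = scale(data.data), data.target - data.target.mean()

# Coefficient profiles over a decreasing grid of lambda (called alpha in scikit-learn)
alphas, coefs, _ = lasso_path(X, y)
for j, name in enumerate(data.feature_names):
    plt.plot(np.log(alphas), coefs[j], label=name)
plt.xlabel("log(lambda)")
plt.ylabel("coefficient")
plt.legend()
plt.show()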
So far, we have seen how Lasso can be useful in the process of variable selection.
However, we have not explained why we seek to minimize (1.5), with its L1 penalty, in the first place.
Why do we not instead consider minimizing the following usual information criterion
$$\frac{1}{2N}\|y - X\beta\|_2^2 + \lambda\|\beta\|_0, \tag{1.12}$$
where $\|\beta\|_0$ denotes the number of nonzero components of $\beta$?
1.4 Ridge
In Sect. 1.1, we made the assumption that the matrix $X^\top X$ is
invertible and, based on this, showed that the $\beta$ that minimizes the squared error
is given by $\hat\beta = (X^\top X)^{-1}X^\top y$.
First, when $N \ge p$, the possibility that $X^\top X$ is singular is not that high, though we
may have another problem instead: the confidence intervals become wide when
the determinant of $X^\top X$ is close to zero. To cope with this, we let $\lambda \ge 0$ be a constant and add to the
squared error the squared L2-norm of $\beta$ times $\lambda$. That is, the method considering the minimization
of (1.6)
is commonly used. This method is called Ridge. Differentiating $L$ with respect to $\beta$ gives
$$\frac{\partial L}{\partial \beta} = -\frac{1}{N}X^\top(y - X\beta) + \lambda\beta,$$
and setting this equal to zero yields
$$\hat\beta = (X^\top X + N\lambda I)^{-1}X^\top y.$$
For $\lambda > 0$, all the eigenvalues of $X^\top X + N\lambda I$ are
positive, which is the same as saying that it is nonsingular. Note that this holds even when $p > N$: in that case, the rank of $X^\top X$ is
less than or equal to N, and hence, the matrix $X^\top X$ itself is singular. Therefore, for this case, the
following conditions are equivalent: $\lambda > 0$ and the matrix $X^\top X + N\lambda I$ being nonsingular.
Example 5 We use the same U.S. crime data as that of Example 2 and perform the
following analysis. To control the size of the coefficient of each predictor, we call the
function ridge and then execute.
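The definition of the function ridge is not included in this excerpt; a minimal sketch consistent with the closed form above (the N*lam scaling of the identity matrix is my assumption, and other normalizations are possible):

import numpy as np

def ridge(X, y, lam=0.0):
    # Ridge estimate on centered data: (X^T X + N*lam*I)^{-1} X^T y,
    # with the intercept recovered afterwards from the means.
    N, p = X.shape
    x_bar, y_bar = X.mean(axis=0), y.mean()
    X_c, y_c = X - x_bar, y - y_bar
    beta = np.linalg.solve(X_c.T @ X_c + N * lam * np.eye(p), X_c.T @ y_c)
    beta_0 = y_bar - x_bar @ beta
    return beta, beta_0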
In Fig. 1.6, we plot how each coefficient changes with the value of $\lambda$.
Fig. 1.6 The execution result of Example 5: the changes in the coefficients with respect to $\lambda$
Fig. 1.7 The ellipses are contours that give the same value of (1.13). The rhombus in the left figure is the L1 regularization constraint
$|\beta_1| + |\beta_2| \le C$, while the circle in the right figure is the L2 regularization constraint $\beta_1^2 + \beta_2^2 \le C$
In the least squares process, we solve for the values of $(\beta_1, \beta_2)$ that minimize
the residual sum of squares; for now, let us denote them by $\hat\beta_1, \hat\beta_2$, respectively. Here,
if we let $\hat\beta = (\hat\beta_1, \hat\beta_2)^\top$, we have
$$\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + (\beta - \hat\beta)^\top X^\top X (\beta - \hat\beta), \tag{1.13}$$
and, of course, if we let $\beta = \hat\beta$ here, we obtain the minimum (the RSS).
Thus, we can view the problems of Lasso and Ridge in the following way: the
minimization of the quantities (1.5) and (1.6) is equivalent to finding the values of $(\beta_1, \beta_2)$
that satisfy the constraints $|\beta_1| + |\beta_2| \le C$ and $\beta_1^2 + \beta_2^2 \le C$, respectively, and that also
minimize the quantity (1.13) (here, the case where $\lambda$ is large is equivalent to the case
where $C$ is small).
The case of Lasso is shown in the left panel of Fig. 1.7. The ellipses are centered
at $(\hat\beta_1, \hat\beta_2)$ and represent contours on which the values of (1.13) are the same. We
expand the size of the ellipse (the contour), and once it makes contact with the
rhombus, the corresponding values of $(\beta_1, \beta_2)$ are the solution to Lasso. If the rhombus
is small ($\lambda$ is large), the ellipse is more likely to touch it at one of the four rhombus vertices. In
this case, one of the values $\beta_1, \beta_2$ will become 0. However, in the Ridge case, as in the
right panel of Fig. 1.7, a circle replaces the Lasso rhombus; hence, it is less likely that
$\beta_1 = 0$ or $\beta_2 = 0$ will occur.
In the Lasso case, if the least squares solution $(\hat\beta_1, \hat\beta_2)$ lies in the green zone of Fig. 1.8,
the solution satisfies either $\beta_1 = 0$ or $\beta_2 = 0$. This zone grows as $\lambda$ becomes
larger, which is the reason why Lasso performs well in variable selection.
Fig. 1.8 The green zone represents the area in which the optimal solution would satisfy either $\beta_1 = 0$ or $\beta_2 = 0$
To quantify collinearity among the predictors, we use
$$\mathrm{VIF}_j := \frac{1}{1 - R^2_{X_j \mid X_{-j}}},$$
the VIF (variance inflation factor). The larger this value is, the better the $j$th column
variable is explained by the other variables. Here, $R^2_{X_j \mid X_{-j}}$ denotes the coefficient of
determination (R squared) obtained when the $j$th column is the target variable and the other variables are
predictors.
Example 6 We compute the VIF for the Boston dataset. It shows that the 9th and 10th
variables (RAD and TAX) have strong collinearity.
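A sketch of how the VIF can be computed for a generic design matrix X (the formula VIF_j = 1 / (1 - R_j^2) is standard, but the code and its names are mine, not the book's):

import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 regresses the j-th column on the others
    p = X.shape[1]
    values = np.zeros(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        values[j] = 1.0 / (1.0 - r2)
    return values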
In usual linear regression, if the VIF is large, the estimated coefficients will be
unstable. In particular, if two columns are precisely the same, the coefficients cannot be
determined. Moreover, for Lasso, if two columns are highly related, then generally one of
them will be estimated as 0 and the other as nonzero. However, in the Ridge case, for
$\lambda > 0$, even when the columns $j, k$ of $X$ are identical, the estimation is solvable, and both
of them will obtain the same value $\hat\beta_j = \hat\beta_k$.
In particular, we can see this by taking the partial derivatives of (1.6) with respect to
$\beta_j$ and $\beta_k$: when the two columns coincide, the two optimality conditions force $\hat\beta_j = \hat\beta_k$.
Example 7 We plot how the coefficients change relative to the value of $\lambda$ in Fig. 1.9. Naturally, one might expect
that similar coefficient values should be given to each of the related variables, though it
turns out that Lasso does not behave in this way.
Fig. 1.9 The execution result of Example 7. The Lasso case is different from that of Ridge: related
variables are not given similar estimated coefficients. When we use the glmnet package, its default
horizontal axis is the L1-norm [11]. This is the value $\|\hat\beta\|_1 = \sum_{j=1}^p |\hat\beta_j|$, which becomes smaller as $\lambda$ increases.
Therefore, the figure here is left-/right-reversed compared to one in which $\lambda$ or $\log\lambda$ is set as the
horizontal axis
To combine the advantages of Lasso and Ridge, for $0 \le \alpha \le 1$ we consider minimizing
$$L := \frac{1}{2N}\|y - X\beta\|_2^2 + \lambda\left\{\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right\}. \tag{1.14}$$
This is called the elastic net. For each $j$, if we find the subderivative of (1.14)
with respect to $\beta_j$, we obtain (for standardized columns)
$$\hat\beta_j = \frac{\mathcal{S}_{\lambda\alpha}\!\left(\frac{1}{N}\sum_{i=1}^N x_{i,j} r_{i,j}\right)}{1 + \lambda(1-\alpha)},$$
which uses the fact that the quadratic part of the penalty merely adds $\lambda(1-\alpha)$ to the denominator. The closer $\alpha$ is to 0, the
more the method is able to handle collinearity, which is in contrast to the case of $\alpha = 1$ (Lasso), where one of two
highly related variables is generally estimated as 0.
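A minimal sketch of the corresponding coordinate-descent update, assuming the glmnet-style parameterization of (1.14) written above (the names are mine):

import numpy as np

def soft_th(lam, x):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def elastic_net(X, y, lam, alpha=1.0, n_iter=100):
    # Coordinate descent for (1.14); alpha = 1 gives Lasso and alpha = 0 gives a Ridge-like update.
    N, p = X.shape
    beta = np.zeros(p)
    scale = (X ** 2).mean(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]       # partial residual
            s = (X[:, j] @ r) / N
            beta[j] = soft_th(lam * alpha, s) / (scale[j] + lam * (1 - alpha))
    return beta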
Example 9 We apply the function cv.glmnet to the U.S. crime dataset from
Examples 2 and 5, obtain the optimal $\lambda$, use it for the usual Lasso, and then obtain the
coefficients of $\beta$. For each $\lambda$, the function also provides the squared error on
the test data and a confidence interval (Fig. 1.11). Each number above the figure
represents the number of nonzero coefficients for that $\lambda$.
For the elastic net, we have to perform a double-loop cross-validation over both $\lambda$ and $\alpha$.
The function cv_glmnet provides us the output cvm, which contains the evaluation
values from the cross-validation.
Example 10 We generate random numbers as our data and try conducting the double-loop
cross-validation over $(\lambda, \alpha)$.
For this problem, we changed the values of $\alpha$ and compared the evaluation at the best
$\lambda$ for each, finding that the differences are within 1% across the values of $\alpha$.
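The functions cv.glmnet / cv_glmnet themselves are not shown in this excerpt. As an illustration of the double-loop idea, here is a sketch using scikit-learn's ElasticNetCV as a stand-in: it cross-validates $\lambda$ internally while we loop over $\alpha$ by hand (note that scikit-learn calls $\lambda$ "alpha" and $\alpha$ "l1_ratio"):

import numpy as np
from sklearn.linear_model import ElasticNetCV

np.random.seed(0)
N, p = 100, 10
X = np.random.randn(N, p)
y = X[:, 0] - 2 * X[:, 1] + 0.5 * np.random.randn(N)

# Outer loop over alpha (l1_ratio); inner 10-fold CV over lambda.
for l1_ratio in [0.1, 0.5, 0.9, 1.0]:
    model = ElasticNetCV(l1_ratio=l1_ratio, cv=10).fit(X, y)
    best_mse = np.min(np.mean(model.mse_path_, axis=1))   # best cross-validated MSE
    print(l1_ratio, model.alpha_, best_mse)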
Exercises 1–20
In the following exercises, we estimate the intercept $\beta_0 \in \mathbb{R}$ and the coefficients $\beta \in \mathbb{R}^p$
from $N$ sets of data consisting of $p$ explanatory variables and a target variable,
$(x_{i,1}, \ldots, x_{i,p}, y_i)$, $i = 1, \ldots, N$. We subtract from each $x_{i,j}$ the mean $\bar{x}_j$
and from each $y_i$ the mean $\bar{y}$ such that each of
them has mean 0 and then estimate $\beta$ (the estimated value is denoted by $\hat\beta$). Finally, we
let $\hat\beta_0 := \bar{y} - \sum_{j=1}^p \bar{x}_j \hat\beta_j$. Again, we let $X \in \mathbb{R}^{N \times p}$ and $y \in \mathbb{R}^N$ be the inputs
of a function called linear written in Python. This function accepts a matrix $X$ and a vector $y$ as inputs
and returns the estimated intercept $\hat\beta_0$ and the estimated slope $\hat\beta$ as outputs. Please complete blanks (1), (2) below.
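The listing for linear, including blanks (1) and (2), is not reproduced in this excerpt. A hedged sketch of what such a function might look like (centering X and y, computing the slope by least squares, and recovering the intercept):

import numpy as np

def linear(X, y):
    # Least squares after centering; returns the slope beta and the intercept beta_0.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = X.mean(axis=0), y.mean()
    X_c, y_c = X - x_bar, y - y_bar
    beta = np.linalg.solve(X_c.T @ X_c, X_c.T @ y_c)   # presumably blank (1)
    beta_0 = y_bar - x_bar @ beta                      # presumably blank (2)
    return beta, beta_0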
3.
(a)
(b)
(a) If the functions g(x), h(x) are convex, show that for any $\beta, \gamma \ge 0$, the function $\beta g(x) + \gamma h(x)$
is convex.
(c) Decide whether the following functions of $x$ are convex or not. In addition,
provide the proofs.
(i)
(ii)
(iii)