of homes to inputs such as crime rate, zoning, distance from a river, air
quality, schools, income level of community, size of houses, and so forth. In
this case one might be interested in the association between each individ-
ual input variable and housing price—for instance, how much extra will a
house be worth if it has a view of the river? This is an inference problem.
Alternatively, one may simply be interested in predicting the value of a
home given its characteristics: is this house under- or over-valued? This is
a prediction problem.
Depending on whether our ultimate goal is prediction, inference, or a
combination of the two, different methods for estimating f may be ap-
propriate. For example, linear models allow for relatively simple and in-
terpretable inference, but may not yield as accurate predictions as some
other approaches. In contrast, some of the highly non-linear approaches
that we discuss in the later chapters of this book can potentially provide
quite accurate predictions for Y , but this comes at the expense of a less
interpretable model for which inference is more challenging.
[Surface plot of Income as a function of Years of Education and Seniority.]
FIGURE 2.4. A linear model fit by least squares to the Income data from
Figure 2.3. The observations are shown in red, and the yellow plane indicates the
least squares fit to the data.
2. After a model has been selected, we need a procedure that uses the
training data to fit or train the model. In the case of the linear model
(2.4), we need to estimate the parameters β0 , β1 , . . . , βp . That is, we
want to find values of these parameters such that
Y ≈ β0 + β1 X1 + β2 X2 + · · · + βp Xp .
The most common approach to fitting the model (2.4) is referred to
as (ordinary) least squares, which we discuss in Chapter 3. However,
least squares is one of many possible ways to fit the linear model. In
Chapter 6, we discuss other approaches for estimating the parameters
in (2.4).
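As a concrete illustration of this training step, the following is a minimal sketch, not the book's own code: it simulates data of the form (2.4) and computes the least squares estimates of β0 , β1 , . . . , βp with numpy's least squares solver (all names and numbers are illustrative only).

```python
import numpy as np

# Simulated training data with p = 3 predictors (illustrative values only).
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.5, -1.0, 3.0])        # beta_0, beta_1, ..., beta_p
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=n)

# Least squares: find beta_0, ..., beta_p minimizing the sum of squared errors
# in Y ≈ beta_0 + beta_1 X1 + ... + beta_p Xp.
X1 = np.column_stack([np.ones(n), X])              # add a column of 1s for the intercept
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("estimated parameters:", np.round(beta_hat, 2))   # close to beta_true
```

Chapter 3 treats least squares in detail; the point of the sketch is only that "training the model" amounts to choosing values for the parameters.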
The model-based approach just described is referred to as parametric;
it reduces the problem of estimating f down to one of estimating a set of
parameters. Assuming a parametric form for f simplifies the problem of
estimating f because it is generally much easier to estimate a set of pa-
rameters, such as β0 , β1 , . . . , βp in the linear model (2.4), than it is to fit
an entirely arbitrary function f . The potential disadvantage of a paramet-
ric approach is that the model we choose will usually not match the true
unknown form of f . If the chosen model is too far from the true f , then
our estimate will be poor. We can try to address this problem by choos-
ing flexible models that can fit many different possible functional forms
for f . But in general, fitting a more flexible model requires estimating a
greater number of parameters. These more complex models can lead to a
phenomenon known as overfitting the data, which essentially means they
follow the errors, or noise, too closely. These issues are discussed through-
out this book.
Figure 2.4 shows an example of the parametric approach applied to the
Income data from Figure 2.3. We have fit a linear model of the form
income ≈ β0 + β1 × education + β2 × seniority.
[Surface plot of Income as a function of Years of Education and Seniority.]
FIGURE 2.5. A smooth thin-plate spline fit to the Income data from Figure 2.3
is shown in yellow; the observations are displayed in red. Splines are discussed in
Chapter 7.
Since we have assumed a linear relationship between the response and the
two predictors, the entire fitting problem reduces to estimating β0 , β1 , and
β2 , which we do using least squares linear regression. Comparing Figure 2.3
to Figure 2.4, we can see that the linear fit given in Figure 2.4 is not quite
right: the true f has some curvature that is not captured in the linear fit.
However, the linear fit still appears to do a reasonable job of capturing the
positive relationship between years of education and income, as well as the
slightly less positive relationship between seniority and income. It may be
that with such a small number of observations, this is the best we can do.
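To make the two-predictor fit concrete, here is a minimal sketch of a least squares fit of this form. It uses simulated stand-in values rather than the actual Income data, so the fitted numbers are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated stand-in for the Income data (not the real data set; values are illustrative).
rng = np.random.default_rng(1)
education = rng.uniform(10, 22, size=30)       # years of education
seniority = rng.uniform(20, 200, size=30)      # seniority
income = 20 + 4 * education + 0.1 * seniority + rng.normal(scale=8, size=30)

X = np.column_stack([education, seniority])
fit = LinearRegression().fit(X, income)

# The entire fitting problem reduces to three numbers: beta_0, beta_1, beta_2.
beta0, (beta1, beta2) = fit.intercept_, fit.coef_
print(f"income ≈ {beta0:.1f} + {beta1:.2f} × education + {beta2:.3f} × seniority")
```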
Non-Parametric Methods
Non-parametric methods do not make explicit assumptions about the func-
tional form of f . Instead they seek an estimate of f that gets as close to the
data points as possible without being too rough or wiggly. Such approaches
can have a major advantage over parametric approaches: by avoiding the
assumption of a particular functional form for f , they have the potential
to accurately fit a wider range of possible shapes for f . Any parametric
approach brings with it the possibility that the functional form used to
estimate f is very different from the true f , in which case the resulting
model will not fit the data well. In contrast, non-parametric approaches
completely avoid this danger, since essentially no assumption about the
form of f is made. But non-parametric approaches do suffer from a major
disadvantage: since they do not reduce the problem of estimating f to a
small number of parameters, a very large number of observations (far more
than is typically needed for a parametric approach) is required in order to
obtain an accurate estimate for f .
An example of a non-parametric approach to fitting the Income data is
shown in Figure 2.5. A thin-plate spline is used to estimate f . This ap-
proach does not impose any pre-specified model on f . It instead attempts
to produce an estimate for f that is as close as possible to the observed
data, subject to the fit being smooth.
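As a rough sketch of how such a fit could be produced in practice (not the code behind the book's figures), scipy's radial basis function interpolator with a thin-plate-spline kernel fits a surface to simulated stand-in data; its smoothing parameter controls how rough or wiggly the estimate is allowed to be, and setting it to zero reproduces the training data exactly, as in the rough fit of Figure 2.6.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Simulated stand-in for the Income data (illustrative values only).
rng = np.random.default_rng(2)
education = rng.uniform(10, 22, size=30)
seniority = rng.uniform(20, 200, size=30)
income = 20 + 4 * education + 0.1 * seniority + rng.normal(scale=8, size=30)

points = np.column_stack([education, seniority])   # observed (education, seniority) pairs

# Thin-plate spline fits: no pre-specified functional form for f.
# A larger `smoothing` value gives a smoother surface; smoothing=0 passes
# through every training observation (zero training error, as in Figure 2.6).
smooth_fit = RBFInterpolator(points, income, kernel="thin_plate_spline", smoothing=50.0)
rough_fit = RBFInterpolator(points, income, kernel="thin_plate_spline", smoothing=0.0)

query = np.array([[16.0, 100.0]])                  # one new (education, seniority) pair
print(smooth_fit(query), rough_fit(query))
```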
[Surface plot of Income as a function of Years of Education and Seniority.]
FIGURE 2.6. A rough thin-plate spline fit to the Income data from Figure 2.3.
This fit makes zero errors on the training data.
[FIGURE 2.7: trade-off between flexibility (low to high, horizontal axis) and interpretability (low to high, vertical axis). Methods from least to most flexible: Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Deep Learning.]
One might reasonably ask the following question: why would we ever
choose to use a more restrictive method instead of a very flexible approach?
There are several reasons that we might prefer a more restrictive model.
If we are mainly interested in inference, then restrictive models are much
more interpretable. For instance, when inference is the goal, the linear
model may be a good choice since it will be quite easy to understand
the relationship between Y and X1 , X2 , . . . , Xp . In contrast, very flexible
approaches, such as the splines discussed in Chapter 7 and displayed in
Figures 2.5 and 2.6, and the boosting methods discussed in Chapter 8, can
lead to such complicated estimates of f that it is difficult to understand
how any individual predictor is associated with the response.
Figure 2.7 provides an illustration of the trade-off between flexibility and
interpretability for some of the methods that we cover in this book. Least
squares linear regression, discussed in Chapter 3, is relatively inflexible but
is quite interpretable. The lasso, discussed in Chapter 6, relies upon the
linear model (2.4) but uses an alternative fitting procedure for estimating
the coefficients β0 , β1 , . . . , βp . The new procedure is more restrictive in es-
timating the coefficients, and sets a number of them to exactly zero. Hence
in this sense the lasso is a less flexible approach than linear regression.
It is also more interpretable than linear regression, because in the final
model the response variable will only be related to a small subset of the
predictors—namely, those with nonzero coefficient estimates. Generalized
additive models (GAMs), discussed in Chapter 7, instead extend the lin-
ear model (2.4) to allow for certain non-linear relationships. Consequently,
GAMs are more flexible than linear regression. They are also somewhat
less interpretable than linear regression, because the relationship between
each predictor and the response is now modeled using a curve. Finally,
fully non-linear methods such as bagging, boosting, support vector machines
with non-linear kernels, and neural networks (deep learning), discussed in
Chapters 8, 9, and 10, are highly flexible approaches that are harder to
interpret.
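As a small illustration of the lasso's restrictiveness, the sketch below (simulated data, illustrative settings only) compares least squares with scikit-learn's Lasso: with a sufficiently large penalty, several of the lasso's coefficient estimates come out exactly zero, so the final model involves only a subset of the predictors.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Simulated data: 10 predictors, but only the first two truly affect the response.
rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)     # alpha sets how restrictive the fit is

print("least squares:", np.round(ols.coef_, 2))    # typically all 10 nonzero
print("lasso:        ", np.round(lasso.coef_, 2))  # several exactly 0
```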