A Short Course On Nonparametric Curve Estimation R PDF
A Short Course On Nonparametric Curve Estimation R PDF
1 Introduction 5
1.1 Course objectives and logistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Background and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Nonparametric inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Main references and credits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Density estimation 15
2.1 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Kernel density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Bandwidth selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6 Practical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3 Regression estimation 55
3.1 Review on parametric regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 Kernel regression estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Bandwidth selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.5 Local likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
B Introduction to RStudio 87
C Introduction to R 89
3
4 CONTENTS
Chapter 1
Introduction
The animations of these notes will not be displayed the first time they are browsed1 . See for
example Figure 1.3. To see them, click on the caption’s link “Application also available here”.
You will get a warning from your browser saying that “Your connection is not private”. Click in
“Advanced” and allow an exception in your browser (I guarantee you I will not do anything evil!).
The next time the animation will show up correctly within the notes.
This course is intended to provide an introduction to nonparametric estimation of the density and
regression functions from, mostly, the perspective of kernel smoothing. The emphasis is placed in building
intuition behind the methods, gaining insights into their asymptotic properties, and showing their application
through the use of statistical software.
The codes in the notes may assume that the packages have been loaded, so it is better to do it now:
# Load packages
library(ks)
library(nor1mix)
library(KernSmooth)
library(locfit)
The Shiny interactive apps on the notes can be downloaded and run locally. This allows in particular to
examine their codes. Check this GitHub repository for the sources.
2 Among others: basic programming in R, ability to work with objects and data structures, ability to produce graphics,
knowledge of the main statistical functions, ability to run scripts in RStudio.
5
6 CHAPTER 1. INTRODUCTION
Each topic of this contains a mix of theoretical and practical exercises for grading. Groups of two students
must choose three exercises in total (at least one theoretical and other practical) from Sections 2.7 and 3.6
and turn them in order to be graded. The group grade is weighted according to the difficulty of the exercises,
which is given by the number of stars: easy (⋆), medium (⋆⋆), and hard (⋆ ⋆ ⋆). The final grade (0 − 5) is
1 ∑ Scorei
3
(2 + ⋆i ),
3 i=1 5
where Scorei is the score (0 − 5) for the i-th exercise and ⋆i represents its number of stars (1 − 3). Deadline
for submission is 2017-10-13.
A triple (Ω, A, P) is called a probability space. Ω represents the sample space, the set of all possible individual
outcomes of a random experiment. A is a σ-field, a class of subsets of Ω that is closed under complementation
and numerable unions, and such that Ω ∈ A. A represents the collection of possible events (combinations
of individual outcomes) that are assigned a probability by the probability measure P. A random variable is
a map X : Ω −→ R such that {ω ∈ Ω : X(ω) ≤ x} ∈ A (the set is measurable).
The cumulative distribution function (cdf) of a random variable X is F (x) := P[X ≤ x]. When an independent
and identically distributed (iid) sample X1 , . . . , Xn is given, the cdf can be estimated by the empirical
distribution function (ecdf)
1∑
n
Fn (x) = ⊮{Xi ≤x} , (1.1)
n i=1
{
1, A is true,
where 1A := is an indicator function. Continuous random variables are either characterized
0, A is false
by the cdf F or the probability density function (pdf) f = F ′ , which represents the infinitesimal relative
probability of X per unit of length. We write X ∼ F (or X ∼ f ) to denote that X has a cdf F (or a pdf f ).
d
If two random variables X and Y have the same distribution, we write X = Y .
The expectation operator is constructed using the Lebesgue-Stieljes “dF (x)” integral. Hence, for X ∼ F , the
expectation of g(X) is
∫
E[g(X)] := g(x)dF (x)
{∫
g(x)f (x)dx, X continuous,
= ∑
{i:P[X=xi ]>0} g(xi )P[X = xi ], X discrete.
Unless otherwise stated, the integration limits of any integral are R or Rp . The variance operator is defined
as Var[X] := E[(X − E[X])2 ].
1.2. BACKGROUND AND NOTATION 7
We employ boldfaces to denote vectors (assumed to be column matrices) and matrices. A p-random vector is
a map X : Ω −→ Rp , X(ω) := (X1 (ω), . . . , Xp (ω)), such that each Xi is a random variable. The (joint) cdf of
p
X is F (x) := P[X ≤ x] := P[X1 ≤ x1 , . . . , Xp ≤ xp ] and, if X is continuous, its (joint) pdf is f := ∂x1∂···∂xp F .
The marginals of F and f are the cdf and pdf of Xi , i = 1, . . . , p, respectively. They are defined as:
∫
FXi (xi ) := P[Xi ≤ x] = F (x)dx−i ,
Rp−1
∫
∂
fXi (xi ) := FX (xi ) = f (x)dx−i ,
∂xi i Rp−1
where x−i := (x1 , . . . , xi−1 , xi+1 , xp ). The definitions can be extended analogously to the marginals of the
cdf and pdf of different subsets of X.
The conditional cdf and pdf of X1 |(X2 , . . . , Xp ) are defined, respectively, as
We will make use of several parametric distributions. Some notation and facts are introduced as follows:
• N (µ, σ 2 ) stands for the normal distribution with mean µ and variance σ 2 . Its pdf is ϕσ (x − µ) :=
(x−µ)2 ( )
√ 1 e− 2σ2 , x ∈ R, and satisfies that ϕσ (x − µ) = 1 ϕ x−µ (if σ = 1 the dependence is omitted).
2πσ σ σ
Its cdf is denoted as Φσ (x − µ). zα denotes the upper α-quantile of a N (0, 1), i.e., zα = Φ−1 (1 − α).
Some uncentered moments of X ∼ N (µ, σ 2 ) are
E[X] = µ,
E[X 2 ] = µ2 + σ 2 ,
E[X 3 ] = µ3 + 3µσ 2 ,
E[X 4 ] = µ4 + 6µ2 σ 2 + 3σ 4 .
The multivariate normal is represented by Np (µ, Σ), where µ is a p-vector and Σ is a p × p symmetric
′ −1
and positive matrix. The pdf of a N (µ, Σ) is ϕΣ (x−µ) := (2π)p/21|Σ|1/2 e− 2 (x−µ) Σ (x−µ) , and satisfies
1
( )
that ϕΣ (x − µ) = |Σ|−1/2 ϕ Σ−1/2 (x − µ) (if Σ = I the depence is omitted).
d
• The lognormal distribution is denoted by LN (µ, σ 2 ) and is such that LN (µ, σ 2 ) = exp(N (µ, σ 2 )). Its
(log x−log µ)2 2
− µ+ σ2
1
pdf is f (x; µ, σ) := √2πσx e , x > 0. Note that E[LN (µ, σ 2 )] = e
2σ 2
• The exponential distribution is denoted as Exp(λ) and has pdf f (x; λ) = λe−λx , λ, x > 0.
ap
• The gamma distribution is denoted as Γ(a, p) and has pdf f (x; a, p) = Γ(p) xp−1 e−ax , a, p, x > 0, where
∫ ∞ p−1 −ax
Γ(p) = 0 x e dx. It is known that E[Γ(a, p)] = a and Var[Γ(a, p)] = ap2 .
p
d ap
• The inverse gamma distribution, IG(a, p) = Γ(a, p)−1 , has pdf f (x; a, p) = −p−1 − x
a
Γ(p) x e , a, p, x > 0.
a2
It is known that E[IG(a, p)] = and Var[IG(a, p)] =
a
p−1 (p−1)2 (p−2) .
• The binomial distribution is denoted as B(n, p). Recall that E[B(n, p)] = np and Var[B(n, p)] =
np(1 − p). A B(1, p) is a Bernouilli distribution, denoted as Ber(p).
1
• The beta distribution is denoted as β(a, b) and its pdf is f (x; a, b) = β(a,b) xa−1 (1 − x)1−b , 0 < x < 1,
Γ(a)Γ(b)
where β(a, b) = Γ(a+b) . When a = b = 1, the uniform distribution U(0, 1) arises.
xλ e−λ
• The Poisson distribution is denoted as Pois(λ) and has pdf P[X = x] = x! , x = 0, 1, 2, . . .. Recall
that E[Pois(λ)] = Var[Pois(λ)] = λ.
Let Xn be a sequence of random variables defined in a common probability space (Ω, A, P). The four most
common types of convergence of Xn to a random variable in (Ω, A, P) are the following.
d
Definition 1.1 (Convergence in distribution). Xn converges in distribution to X, written Xn −→ X, if
limn→∞ Fn (x) = F (x) for all x for which F is continuous, where Xn ∼ Fn and X ∼ F .
P
Definition 1.2 (Convergence in probability). Xn converges in probability to X, written Xn −→ X, if
limn→∞ P[|Xn − X| > ε] = 0, ∀ε > 0.
as
Definition 1.3 (Convergence almost surely). Xn converges almost surely (as) to X, written Xn −→ X, if
P[{ω ∈ Ω : limn→∞ Xn (ω) = X(ω)}] = 1.
r
Definition 1.4 (Convergence in r-mean). Xn converges in r-mean to X, written Xn −→ X, if
limn→∞ E[|Xn − X|r ] = 0.
Remark. The previous definitions can be extended to a sequence of p-random vectors Xn . For Definitions
1.2 and 1.4, replace | · | by the Euclidean norm || · ||. Alternatively, Definition 1.2 can be extended marginally:
P P
Xn −→ X : ⇐⇒ Xj,n −→ Xj , ∀j = 1, . . . , p. For Definition 1.1, replace Fn and F by the joint cdfs of Xn
and X, respectively. Definition 1.3 extends marginally as well.
1.2. BACKGROUND AND NOTATION 9
The 2-mean convergence plays a remarkable role in defining a consistent estimator θ̂n = T (X1 , . . . , Xn ) of a
parameter θ. We say that the estimator is consistent if its mean squared error (MSE),
2
goes to zero as n → ∞. Equivalently, if θ̂n −→ θ.
d,P,r,as
If Xn −→ X and X is a degenerate random variable such that P[X = c] = 1, c ∈ R, then we write
d,P,r,as
Xn −→ c (the list-notation is used to condensate four different convergence results in the same line).
The relations between the types of convergences are conveniently summarized in the next proposition.
Proposition 1.6. Let Xn be a sequence of random variables and X a random variable. Then the following
implication diagram is satisfied:
r P as
Xn −→ X =⇒ Xn −→ X ⇐= Xn −→ X
⇓
d
Xn −→ X
None of the converses hold in general. However, there are some notable exceptions:
d P
i. If Xn −→ c, then Xn −→ c.
∑ P as
ii. If ∀ε > 0, limn→∞ n P[|Xn − X| > ε] < ∞ (implies4 Xn −→ X), then Xn −→ X.
P r
iii. If Xn −→ X and P[|Xn | ≤ M ] = 1, ∀n ∈ N and M > 0, then Xn −→ X for r ≥ 1.
∑n P as
iv. If Sn = i=1 Xi with X1 , . . . , Xn iid, then Sn −→ S ⇐⇒ Sn −→ S.
s r
Also, if s ≥ r ≥ 1, then Xn −→ X =⇒ Xn −→ X.
The corner stone limit result in probability is the central limit theorem (CLT). In its simpler version, it entails
that:
Theorem 1.1 (CLT). Let Xn be a sequence of iid random variables with E[Xi ] = µ and Var[Xi ] = σ 2 < ∞,
i ∈ N. Then: √
n(X̄ − µ) d
−→ N (0, 1).
σ
We will use later the following CLT for random variables that are independent but not identically distributed
due to its easy-to-check moment conditions.
Theorem 1.2 (Lyapunov’s CLT). Let Xn be a sequence of independent random variables with E[Xi ] = µi
and Var[Xi ] = σi2 < ∞, i ∈ N, and such that for some δ > 0
1 ∑ [ ]
n
E |Xi − µi |2+δ −→ 0 as n → ∞,
s2+δ
n i=1
∑n
with s2n = i=1 σi2 . Then:
1 ∑
n
d
(Xi − µi ) −→ N (0, 1).
sn i=1
Theorem 1.3 (Cramér-Wold device). Let Xn be a sequence of p-dimensional random vectors. Then:
Xn −→ X ⇐⇒ c′ Xn −→ c′ X,
d d
∀c ∈ Rp .
d, P, as d, P, as
Theorem 1.4 (Continuous mapping theorem). If Xn −→ X, then g(Xn ) −→ g(X) for any continuous
function g.
Theorem 1.5 (Slutsky’s theorem). Let Xn and Yn be sequences of random variables.
d P d
i. If Xn −→ X and Yn −→ c, then Xn Yn −→ cX.
d P d X
ii. If Xn −→ X and Yn −→ c, c ̸= 0, then X
Yn −→ c .
n
d P d
iii. If Xn −→ X and Yn −→ c, then Xn + Yn −→ X + c.
Theorem 1.6 (Limit algebra for (P, r, as)-convergence). Let Xn and Yn be sequences of random variables
P, r, as
and an → a and bn → b two sequences. Denote Xn −→ X to “Xn converges in probability (respectively
almost surely, respectively in r-mean)”.
P, r, as P, r, as P, r, as
i. If Xn −→ X and Yn −→ X, then an Xn + bn Yn −→ aX + bY .
P, as P, as P, as
ii. If Xn −→ X and Yn −→ X, then Xn Yn −→ XY .
Remark. Recall the abscense of results for convergence in distribution. They do not hold. Particularly:
d d d
Xn −→ X and Yn −→ X do not imply Xn + Yn −→ X + Y .
√ d √ d ( )
Theorem 1.7 (Delta method). If n(Xn − µ) −→ N (0, σ 2 ), then n(g(Xn ) − g(µ)) −→ N 0, (g ′ (µ))2 σ 2
for any function such that g is differentiable at µ and g ′ (µ) ̸= 0.
Example 1.2. It is well known that, given a parametric density fθ , with parameter ∑n θ ∈ Θ, and iid
X1 , . . . , Xn ∼ fθ , then the maximum likelihood (ML) estimator θ̂ML := arg maxθ∈Θ i=1 log fθ (Xi ) (the
parameter that maximizes the probability of the data based on the model) converges to a normal under
certain regularity conditions:
√ ( )
n(θ̂ML − θ) −→ N 0, I(θ)−1 ,
d
[ 2 ]
where I(θ) := −Eθ ∂ log ∂θ
fθ (x)
2 . Then, it is satisfied that:
√ ( )
n(g(θ̂ML ) − g(θ)) −→ N 0, (g ′ (θ))2 I(θ)−1 .
d
If we apply the continuous mapping theorem for g, we would have obtained that
√ ( ( ))
g( n(θ̂ML − θ)) −→ g N 0, I(θ)−1 ,
d
Theorem 1.8 (Weak and strong laws of large numbers). Let Xn be a iid sequence with E[Xi ] = µ, i ≥ 1.
∑n P, as
Then: n1 i=1 Xi −→ µ.
In computer science the O notation used to measure the complexity of algorithms. For example, when an
algorithm is O(n2 ), it is said that it is quadratic in time and we know that is going to take on the order of
n2 operations to process an input of size n. We do not care about the specific amount of computations, but
focus on the big picture by looking for an upper bound for the sequence of computation times in terms of n.
This upper bound disregards constants. For example, the dot product between two vectors of size n is an
O(n) operation, albeit it takes n multiplications and n − 1 sums, hence 2n − 1 operations.
In mathematical analysis, O-related notation is mostly used with the aim of bounding sequences that shrink
to zero. The technicalities are however the same.
Definition 1.5 (Big O). Given two strictly positive sequences an and bn ,
an
an = O(bn ) : ⇐⇒ lim ≤ C, for a C > 0.
n→∞ bn
If an = O(bn ), then we say that an is big-O of bn . To indicate that an is bounded, we write an = O(1).
1.2. BACKGROUND AND NOTATION 11
Definition 1.6 (Little o). Given two strictly positive sequences an and bn ,
an
an = o(bn ) : ⇐⇒ lim = 0.
n→∞ bn
{
1, p ≤ 1,
|a + b| ≤ Cp (|a| + |b| ),
p p p
Cp =
2p−1 , p > 1.
The previous notation is purely deterministic. Let’s add some stochastic flavor by establishing the stochastic
analogous of little-o.
Definition 1.7 (Little oP ). Given a strictly positive sequence an and a sequence of random variables Xn ,
|Xn | P
Xn = oP (an ) : ⇐⇒ −→ 0
an
[ ]
|Xn |
⇐⇒ lim P > ε = 0, ∀ε > 0.
n→∞ an
P
If Xn = oP (an ), then we say that Xn is little-oP of an . To indicate that Xn −→ 0, we write Xn = oP (1).
Therefore, little-oP allows to quantify easily the speed at which a sequence of random variables converges to
zero in probability. ( ) ( )
Example 1.3. Let Yn = oP n−1/2 and Zn = oP n−1 . Then Zn converges faster to zero in probability
than Yn . To visualize this, recall that the little-oP and limit definitions entail that
Definition 1.8 (Big OP ). Given a strictly positive sequence an and a sequence of random variables Xn ,
(√ )
X − E[X] = OP Var[Xn ] .
d
Exercise 1.3. Prove that if Xn −→ X, then Xn = OP (1). Hint: use the double-limit definition and note
that X = OP (1).
d
Example 1.5. (Example 1.18 in DasGupta (2008)) Suppose that an (Xn − cn ) −→ X for deterministic
sequences an and cn such that cn → c. Then, if an → ∞, Xn − c = oP (1). The argument is simple:
Xn − c = Xn − cn + cn − c
1
= an (Xn − cn ) + o(1)
an
1
= OP (1) + o(1).
an
Exercise 1.4. Using the previous example, derive the weak law of large numbers as a consequence of the
CLT, both for id and non-id independent random variables.
Proposition 1.8. Consider two strictly positive sequences an , bn → 0. The following properties hold:
i. oP (an ) = OP (an ) (little-oP implies big-OP ).
ii. o(1) = oP (1), O(1) = OP (1) (deterministic implies probabilistic).
iii. kOP (an ) = OP (an ), koP (an ) = oP (an ), k ∈ R.
iv. oP (an ) + oP (bn ) = oP (an + bn ), OP (an ) + OP (bn ) = OP (an + bn ).
v. oP (an ) + OP (bn ) = OP (an + bn ), oP (an )oP (bn ) = oP (an bn ).
vi. OP (an )OP (bn ) = OP (an bn ), oP (an )OP (bn ) = oP (an bn ).
vii. oP (1)OP (an ) = oP (an ).
viii. (1 + oP (1))−1 = OP (1).
(x − δ, x + δ).
Theorem 1.12 (Dominated convergence theorem). Let fn : S ⊂ R −→ R be a sequence of Lebesgue
measurable
∫ functions such that limn→∞ fn (x) = f (x) and |fn (x)| ≤ g(x), ∀x ∈ S and ∀n ∈ N, where
S
|g(x)|dx < ∞. Then ∫ ∫
lim fn (x)dx = f (x)dx < ∞.
n→∞ S S
Remark. Note that if S is bounded and |fn (x)| ≤ M , ∀x ∈ S and ∀n ∈ N, then limit interchangeability with
integral is always possible.
Is not so simple to obtain the exact MSE for (1.2), even if it is easy to prove that λ̂ML ∼ IG(λ−1 , n).
Approximations are possible using Exercise 1.2. However, the MSE can be easily approximated by Monte
Carlo.
14 CHAPTER 1. INTRODUCTION
What does it happen when the data is generated from an Exp(λ)? Then (1.2) uniformly dominates (1.1)
in performance. But, even for small deviations from Exp(λ) given by Γ(λ, p), p = 1 + δ, (1.2) shows major
problems in terms of bias, while (1.1) remains with the same performance. The animation in Figure 1.3
illustrates precisely this behavior.
A simplified example of parametric and nonparametric estimation. The objective is to estimate the distri-
bution function F of a random variable. The data is generated from a Γ(λ, p). The parametric method
assumes that p = 1, that is, that the data comes from a Exp(λ). The nonparametric method does not
assume anything on the data generation process. The left plot shows the true distribution function and
ten estimates of each method from samples of size n. The right plot shows the MSE of each method on
estimationg F (x). Application also available here.
Density estimation
A random variable X is completely characterized by its cdf. Hence, an estimation of the cdf yields as a
side-product ∫estimates for different characteristics
∫ of X by plugging-in
∑n Fn in the F . For example, the mean
µ = E[X] = xdF (x) can be estimated by xdFn (x) = n1 i=1 Xi = X̄. Despite its usefulness, cdfs are
hard to visualize and interpret.
Densities, on the other hand, are easy to visualize and interpret, making them ideal tools for data
exploration. They provide immediate graphical information about the most likely areas, modes, and
spread of X. A continuous random variable is also completely characterized by its pdf f = F ′ . Density
estimation does not follow trivially from the ecdf Fn , since this is not differentiable (not even continuous),
hence the need of the specific procedures we will see in this chapter.
2.1 Histograms
2.1.1 Histogram
The simplest method to estimate a density f form an iid sample X1 , . . . , Xn is the histogram. From an
analytical point of view, the idea is to aggregate the data in intervals of the form [x0 , x0 + h) and then use
their relative frequency to approximate the density at x ∈ [x0 , x0 + h), f (x), by the estimate of
f (x0 ) = F ′ (x0 )
F (x0 + h) − F (x0 )
= lim+
h→0 h
P[x0 < X < x0 + h]
= lim+ .
h→0 h
More precisely, given an origin t0 and a bandwidth h > 0, the histogram builds a piecewise constant function
in the intervals {Bk := [tk , tk+1 ) : tk = t0 + hk, k ∈ Z} by counting the number of sample points inside each
of them. These constant-length intervals are also denoted bins. The fact that they are of constant length
h is important, since it allows to standardize by h in order to have relative frequencies in the bins. The
histogram at a point x is defined as
1 ∑
n
fˆH (x; t0 , h) := 1{Xi ∈Bk :x∈Bk } . (2.1)
nh i=1
15
16 CHAPTER 2. DENSITY ESTIMATION
Equivalently, if we denote the number of points in Bk as vk , then the histogram is fˆH (x; t0 , h) = vk
nh if x ∈ Bk
for a k ∈ Z.
The analysis of fˆH (x; t0 , h) as a random variable is∫ simple, once it is recognized that the bin counts vk are
distributed as B(n, pk ), with pk = P[X ∈ Bk ] = Bk f (t)dt1 . If f is continuous, then by the mean value
theorem, pk = hf (ξk,h ) for a ξk,h ∈ (tk , tk+1 ). Therefore:
npk
E[fˆH (x; t0 , h)] = = f (ξk,h ),
nh
npk (1 − pk ) f (ξk,h )(1 − hf (ξk,h ))
Var[fˆH (x; t0 , h)] = 2 2
= .
n h nh
1. If h → 0, then ξh,k → x, resulting in f (ξk,h ) → f (x), and thus (2.1) becoming an asymptotically
unbiased estimator of f (x).
2. But if h → 0, the variance increases. For decreasing the variance, nh → ∞ is required.
3. The variance is directly dependent on f (x)(1 − hf (x)) ≈ f (x), hence the more variability at regions
with higher density.
A more detailed analysis of the histogram can be seen in Section 3.2.2 of Scott (2015). We skip it here since
we the detailed asymptotic analysis for the more general kernel density estimator is given in Section 2.2.
# Duration of eruption
faithE <- faithful$eruptions
1 Note that it is key that the {Bk } are deterministic (and not sample-dependent) for this result to hold.
2.1. HISTOGRAMS 17
Histogram of faithE
60
Frequency
40
20
0
2 3 4 5
faithE
# List that contains several objects
str(histo)
## List of 6
## $ breaks : num [1:9] 1.5 2 2.5 3 3.5 4 4.5 5 5.5
## $ counts : int [1:8] 55 37 5 9 34 75 54 3
## $ density : num [1:8] 0.4044 0.2721 0.0368 0.0662 0.25 ...
## $ mids : num [1:8] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
## $ xname : chr "faithE"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"
Histogram of faithE
0.5
0.4
Density
0.3
0.2
0.1
0.0
2 3 4 5
faithE
# Choosing the breaks
# t0 = min(faithE), h = 0.25
Bk <- seq(min(faithE), max(faithE), by = 0.25)
hist(faithE, probability = TRUE, breaks = Bk)
rug(faithE) # Plotting the sample
Histogram of faithE
0.0 0.1 0.2 0.3 0.4 0.5 0.6
Density
faithE
2.1. HISTOGRAMS 19
We focus first on exploring the dependence on t0 , as it serves for motivating the next density estimator.
# Uniform sample
set.seed(1234567)
u <- runif(n = 100)
# t0 = 0, h = 0.2
Bk1 <- seq(0, 1, by = 0.2)
# t0 = -0.1, h = 0.2
Bk2 <- seq(-0.1, 1.1, by = 0.2)
# Comparison
par(mfrow = c(1, 2))
hist(u, probability = TRUE, breaks = Bk1, ylim = c(0, 1.5),
main = "t0 = 0, h = 0.2")
rug(u)
abline(h = 1, col = 2)
hist(u, probability = TRUE, breaks = Bk2, ylim = c(0, 1.5),
main = "t0 = -0.1, h = 0.2")
rug(u)
abline(h = 1, col = 2)
1.5
1.0
1.0
Density
Density
0.5
0.5
0.0
0.0
u u
High dependence on t0 also happens when estimating densities that are not compactly supported. The next
snippet of code points towards it.
20 CHAPTER 2. DENSITY ESTIMATION
# Comparison
Bk1 <- seq(-2.5, 5, by = 0.5)
Bk2 <- seq(-2.25, 5.25, by = 0.5)
par(mfrow = c(1, 2))
hist(samp, probability = TRUE, breaks = Bk1, ylim = c(0, 0.5),
main = "t0 = -2.5, h = 0.5")
rug(samp)
hist(samp, probability = TRUE, breaks = Bk2, ylim = c(0, 0.5),
main = "t0 = -2.25, h = 0.5")
rug(samp)
An alternative to avoid the dependence on t0 is the moving histogram or naive density estimator. The idea is
to aggregate the sample X1 , . . . , Xn in intervals of the form (x − h, x + h) and then use its relative frequency
in (x − h, x + h) to approximate the density at x:
f (x) = F ′ (x)
F (x + h) − F (x − h)
= lim+
h→0 2h
P[x − h < X < x + h]
= lim+ .
h→0 2h
Recall the differences with the histogram: the intervals depend on the evaluation point x and are centered
around it. That allows to directly estimate f (x) (without the proxy f (x0 )) by an estimate of the symmetric
derivative.
More precisely, given a bandwidth h > 0, the naive density estimator builds a piecewise constant function
by considering the relative frequency of X1 , . . . , Xn inside (x − h, x + h):
1 ∑
n
fˆN (x; h) := 1{x−h<Xi <x+h} . (2.2)
2nh i=1
Similarly
∑n to the histogram, the analysis of fˆN (x; h) as a random variable follows from realizing that
i=1 {x−h<Xi <x+h} ∼ B(n, px,h ), px,h = P[x − h < X < x + h] = F (x + h) − F (x − h). Then:
1
2.2. KERNEL DENSITY ESTIMATION 21
F (x + h) − F (x − h)
E[fˆN (x; h)] = ,
2h
F (x + h) − F (x − h)
Var[fˆN (x; h)] =
4nh2
(F (x + h) − F (x − h))2
− .
4nh2
1 ∑1 {
n
fˆN (x; h) = 1 x−X
}
nh i=1 2 −1< h i <1
( )
1 ∑
n
x − Xi
= K , (2.3)
nh i=1 h
with K(z) = 21 1{−1<z<1} . Interestingly, K is a uniform density in (−1, 1). This means that, when approxi-
mating
[ ]
x−X
P[x − h < X < x + h] = P −1 < <1
h
by (2.3), we give equal weight to all the points X1 , . . . , Xn . The generalization of (2.3) is now obvious:
replace K by an arbitrary density. Then K is known as a kernel, a density that is typically symmetric and
unimodal at 0. This generalization provides the definition of kernel density estimator (kde)2 :
2 Also known as the Parzen-Rosemblatt estimator to honor the proposals by Parzen (1962) and Rosenblatt (1956).
22 CHAPTER 2. DENSITY ESTIMATION
( )
1 ∑
n
x − Xi
fˆ(x; h) := K . (2.4)
nh i=1 h
(z) ∑n
A common notation is Kh (z) := h1 K h , so fˆ(x; h) = 1
n i=1 Kh (x − Xi ).
Several types of kernels are possible. The most popular is the normal kernel K(z) = ϕ(z), although the
Epanechnikov kernel, K(z) = 43 (1−z 2 )1{|z|<1} , is the most efficient. The rectangular kernel K(z) = 12 1{|z|<1}
yields the moving histogram as a particular case. The kde inherits the smoothness properties of the kernel.
That means, for example, (2.4) with a normal kernel is infinitely differentiable. But with an Epanechnikov
kernel, (2.4) is not differentiable, and with a rectangular kernel is not even continuous. However, if a certain
smoothness is guaranteed (continuity at least), the choice of the kernel has little importance in practice (at
least compared with the choice of h). Figure 2.2 illustrates the construction of the kde, and the bandwidth
and kernel effects.
It is useful to recall the kde with the normal kernel. If that is the case, then Kh (x − Xi ) = ϕh (x − Xi ) and
the kernel is the density of a N (Xi , h2 ). Thus the bandwidth h can be thought as the standard deviation of
the normal densities that have mean zero.
Construction of the kernel density estimator. The animation shows how the bandwidth and kernel affect the
density estimate, and how the kernels are rescaled densities with modes at the data points. Application also
available here.
Implementation of the kde in R is built-in through the density function. The function automatically chooses
the bandwidth h using a bandwidth selector which will be studied in detail in Section 2.4.
# Sample 100 points from a N(0, 1)
set.seed(1234567)
samp <- rnorm(n = 100, mean = 0, sd = 1)
density.default(x = samp)
0.4
0.3
Density
0.2
0.1
0.0
−3 −2 −1 0 1 2 3
0.2
0.1
0.0
−4 −2 0 2 4
0.2
0.1
0.0
−4 −2 0 2 4
kde$x
Exercise 2.1. Load the dataset faithful. Then:
• Estimate and plot the density of faithful$eruptions.
• Create a new plot and superimpose different density estimations with bandwidths equal to 0.1, 0.5,
and 1.
• Get the density estimate at exactly the point x = 3.1 using h = 0.15 and the Epanechnikov kernel.
Asymptotic results give us insights on the large-sample (n → ∞) properties of an estimator. One might think
why are they are useful, since in practice we only have finite sample sizes. Apart from purely theoretical
reasons, asymptotic results usually give highly valuable insights on the properties of the method, typically
simpler than finite-sample results (which might be analytically untractable).
2.3. ASYMPTOTIC PROPERTIES 25
∫
(f ∗ g)(x) := f (x − y)g(y)dy = (g ∗ f )(x). (2.5)
We are now ready to obtain the bias and variance of fˆ(x; h). Recall that is not possible to apply the
“binomial trick” we used previously since now the estimator is not piecewise constant. Instead of that, we
use the linearity of the kde and the convolution definition. For the bias, recall that
1∑
n
E[fˆ(x; h)] = E[Kh (x − Xi )]
n i=1
∫
= Kh (x − y)f (y)dy
1
Var[fˆ(x; h)] = ((Kh2 ∗ f )(x) − (Kh ∗ f )2 (x)). (2.7)
n
These two expressions are exact, but they are hard to interpret. Equation (2.6) indicates that the estimator
is biased (we will see an example of that bias in Section 2.5), but it does not differentiate explicitly the effects
of kernel, bandwidth and density on the bias. The same happens with (2.7), yet more emphasized. That is
why the following asymptotic expressions are preferred.
Theorem 2.1. Under A1–A3, the bias and variance of the kde at x are
1
Bias[fˆ(x; h)] = µ2 (K)f ′′ (x)h2 + o(h2 ), (2.8)
2
R(K)
Var[fˆ(x; h)] = f (x) + o((nh)−1 ). (2.9)
nh
∫
E[fˆ(x; h)] = Kh (x − y)f (y)dy
∫
= K(z)f (x − hz)dz. (2.10)
3h = hn always depends on n from now on, although the subscript is dropped for the ease of notation.
26 CHAPTER 2. DENSITY ESTIMATION
f ′′ (x) 2 2
f (x − hz) = f (x) + f ′ (x)hz + h z (2.11)
2
+ o(h2 z 2 ). (2.12)
Substituting (2.12) in (2.10), and bearing in mind that K is a symmetric density around 0, we have
∫
K(z)f (x − hz)dz
∫ { f ′′ (x) 2 2
= K(z) f (x) + f ′ (x)hz + h z
2
}
+ o(h2 z 2 ) dz
1
= f (x) + µ2 (K)f ′′ (x)h2 + o(h2 ),
2
1 ∑
n
Var[fˆ(x; h)] = 2 Var[Kh (x − Xi )]
n i=1
1{ }
= E[Kh2 (x − X)] − E[Kh (x − X)]2 . (2.13)
n
The second term of (2.13) is already computed, so we focus on the first. Using the previous change of
variables and a first-order Taylor expansion, we have:
∫
1
E[Kh2 (x − X)] = K 2 (z)f (x − hz)dz
h
∫
1
= K 2 (z) {f (x) + O(hz)} dz
h
R(K)
= f (x) + O(1). (2.14)
h
{ }
1 R(K)
Var[fˆ(x; h)] = f (x) + O(1) − O(1)
n h
R(K)
= f (x) + O(n−1 )
nh
R(K)
= f (x) + o((nh)−1 ),
nh
Remark. Integrating little-o’s is a tricky issue. ∫In general, integrating a ox (1) quantity, possibly dependent
on x, does not provide an o(1). In other words: ox (1)dx ̸= o(1). If the previous hold with equality, then the
limits and integral will be interchangeable. But this is not always true – only if certains conditions are met;
recall the dominated convergence theorem (Theorem 1.12). If one wants to be completely rigorous on the
two implicit commutations of integrals and limits that took place in the proof, it is necessary to have explicit
control of the remainder via Taylor’s theorem (Theorem 1.11) and then apply the dominated convergence
theorem. For simplicity in the exposition, we avoid this.
The bias and variance expressions (2.8) and (2.9) yield interesting insights (see Figure 2.1.2 for their visual-
ization):
• The bias decreases with h quadratically. In addition, the bias at x is directly proportional to f ′′ (x).
This has an interesting interpretation:
– The bias is negative in concave regions, i.e. {x ∈ R : f (x)′′ < 0}. These regions correspod to
peaks and modes of f , where the kde underestimates f (tends to be below f ).
– Conversely, the bias is positive in convex regions, i.e. {x ∈ R : f (x)′′ > 0}. These regions
correspod to valleys and tails of f , where the kde overestimates f (tends to be above f ).
– The wilder the curvature f ′′ , the harder to estimate f . Flat density regions are easier to
estimate than wiggling regions with high curvature (several modes).
• The variance depends directly on f (x). The higher the density, the more variable is the kde.
Interestingly, the variance decreases as a factor of (nh)−1 , a consequence of nh playing the role of the
effective sample size for estimating f (x). The density at x is not estimated using all the sample points,
but only a fraction proportional to nh in a neighborhood of x.
Figure 2.1: Illustration of the effective sample size for estimating f (x) at x = 2. In blue, the kernels with
contribution larger than 0.01 to the kde. In grey, rest of the kernels.
The MSE of the kde is trivial to obtain from the bias and variance:
Corollary 2.1. Under A1–A3, the MSE of the kde at x is
2
Therefore, the kde is pointwise consistent in MSE, i.e., fˆ(x; h) −→ f (x).
Proof. We apply that MSE[fˆ(x; h)] = (E[fˆ(x; h)] − f (x))2 + Var[fˆ(x; h)] and that (O(h2 ) + o(h2 ))2 =
O(h4 ) + o(h4 ). Since MSE[fˆ(x; h)] → 0 when n → ∞, consistency follows.
2 P d
Note that, due to the consistency, fˆ(x; h) −→ f (x) =⇒ fˆ(x; h) −→ f (x) =⇒ fˆ(x; h) −→ f (x) under
A1–A2. However, these results are not useful for quantifying the randomness of fˆ(x; h): the convergence
is towards the degenerate random variable f (x), for a given x ∈ R. For that reason, we focus now on the
asymptotic normality of the estimator.
∫
Theorem 2.2. Assume that K 2+δ (z)dz < ∞4 for some δ > 0. Then, under A1–A3,
√ d
nh(fˆ(x; h) − E[fˆ(x; h)]) −→ N (0, R(K)f (x)), (2.16)
( )
√ 1 ′′ d
ˆ
nh f (x; h) − f (x) − µ2 (K)f (x)h −→ N (0, R(K)f (x)).
2
(2.17)
2
Proof. First note that Kh (x − Xn ) is a sequence of independent but not identically distributed random
variables: h = hn depends on n. Therefore, we look forward to apply Theorem 1.2.
We prove first (2.16). For simplicity, denote Ki := Kh (x − Xi ), i = 1, . . . , n. From the proof of Theorem 2.1
we know that E[Ki ] = E[fˆ(x; h)] = f (x) + o(1) and
∑
n
s2n = Var[Ki ]
i=1
= n Var[fˆ(x; h)]
2
R(K)
=n f (x)(1 + o(1)).
h
[ ] ( [ ] )
E |Ki − E[Ki ]|2+δ ≤ C2+δ E |Ki |2+δ + |E[Ki ]|2+δ
[ ]
≤ 2C2+δ E |Ki |2+δ
( [ ])
= O E |Ki |2+δ .
∫
In addition, due to a Taylor expansion after z = x−y
h and using that K 2+δ (z)dz < ∞:
∫ ( )
[ ] 1 x−y
E |Ki |2+δ = K 2+δ
f (y)dy
h2+δ h
∫
1
= 1+δ K 2+δ (z)f (x − hz)dy
h
∫
1
= 1+δ K 2+δ (z)(f (x) + o(1))dy
h
( )
= O h−(1+δ) .
Then:
4 This is satisfied, for example, if the kernel decreases exponentially, i.e., if ∃ α, M > 0 : K(z) ≤ e−α|z| , ∀|z| > M .
2.4. BANDWIDTH SELECTION 29
1 ∑ [ ]
n
E |Ki − E[Ki ]|2+δ
s2+δ
n i=1
( )1+ δ2 ( )
h
= (1 + o(1))O nh−(1+δ)
nR(K)f (x)
( )
= O (nh)− 2
δ
and the Lyapunov’s condition is satisfied. As a consequence, by Lyapunov’s CLT and Slutsky’s theorem:
√
nh
(fˆ(x; h) − E[fˆ(x; h)])
R(K)f (x)
1 ∑
n
= (1 + o(1)) (Ki − E[Ki ])
sn i=1
d
−→ N (0, 1)
( )
√ 1
nh fˆ(x; h) − f (x) − µ2 (K)f ′′ (x)h2
2
√
= nh(f (x; h) − E[fˆ(x; h)] + o(h2 )).
ˆ
Due to Example 1.5 and Proposition 1.8, we can prove that fˆ(x; h) − E[fˆ(x; h)] = oP (1). Then, fˆ(x; h) −
E[fˆ(x; h)] + o(h2 ) = (fˆ(x; h) − E[fˆ(x; h)])(1 + oP (h2 )). Therefore, Slutsky’s theorem and (2.17) give:
( )
√ 1 ′′
ˆ
nh f (x; h) − f (x) − µ2 (K)f (x)h 2
2
√
= nh(fˆ(x; h) − E[fˆ(x; h)])(1 + oP (h2 ))
d
−→ N (0, R(K)f (x)).
√
Remark.
√ Note the rate nh in the asymptotic normality results. This is different from the standard CLT rate
n (see Theorem 1.1). It is slower: the variance of the limiting normal distribution decreases as O((nh)−1 )
and not as O(n−1 ). The phenomenon is related with the effective sample size used in the smoothing.
Exercise 2.2. Using (2.17) and Exercise 1.5, show that fˆ(x; h) = f (x) + oP (1).
The first step is to define a global, rather than local, error criterion. The Integrated Squared Error (ISE),
∫
ISE[f (·, h)] := (fˆ(x; h) − f (x))2 dx,
ˆ
is the squared distance between the kde and the target density. The ISE is a random quantity, since it
depends directly on the sample X1 , . . . , Xn . As a consequence, looking for an optimal-ISE bandwidth is a
hard task, since it the optimality is dependent on the sample itself and not only on the population and n.
To avoid this problematic, it is usual to compute the Mean Integrated Squared Error (MISE):
[ ]
MISE[fˆ(·; h)] := E ISE[fˆ(·, h)]
[∫ ]
=E ˆ
(f (x; h) − f (x)) dx
2
∫ [ ]
= E (fˆ(x; h) − f (x))2 dx
∫
= MSE[fˆ(x; h)]dx.
The MISE is convenient due to its mathematical tractability and its natural relation with the MSE. There
are, however, other error criteria that present attractive properties, such as the Mean Integrated Absolute
Error (MIAE):
[∫ ]
ˆ
MIAE[f (·; h)] := E ˆ
|f (x; h) − f (x)|dx
∫ [ ]
= E |fˆ(x; h) − f (x)| dx.
The MIAE, unless the MISE, is invariant with respect to monotone transformations of the data. For example,
if g(x) = f (t−1 (x))(t−1 )′ (x) is the density of Y = t(X) and X ∼ f , then if the change of variables y = t(x)
is made,
∫ ∫
|fˆ(x; h) − f (x)|dx = |fˆ(t−1 (y); h) − f (t−1 (y))|)(t−1 )′ (y)dy
∫
= |ĝ(y; h) − g(y)|dy.
Despite this appealing property, the analysis of MIAE is substantially more complicated. We refer to Devroye
and Györfi (1985) for a comprehensive treatment of absolute value metrics for kde.
Once the MISE is set as the error criterion to be minimized, our aim is to find
hMISE := arg min MISE[fˆ(·; h)].
h>0
For that purpose, we need an explicit expression of the MISE that we can attempt to minimize. An asymp-
totic expansion for the MISE can be easily derived from the MSE expression.
Corollary 2.2. Under A1–A3,
1 R(K)
MISE[fˆ(·; h)] = µ22 (K)R(f ′′ )h4 +
4 nh
+ o(h4 + (nh)−1 ). (2.18)
Proof. Trivial.
The dominating part of the MISE is denoted as AMISE, which stands for Asymptotic MISE: AMISE[fˆ(·; h)] =
1 2 ′′ 4 R(K)
4 µ2 (K)R(f )h + nh . Due to its closed expression, it is possible to obtain a bandwidth that minimizes
the AMISE.
Corollary 2.3. The bandwidth that minimizes the AMISE is
[ ]1/5
R(K)
hAMISE = .
µ22 (K)R(f ′′ )n
d
Proof. Solving dh AMISE[fˆ(·; h)] = 0, i.e. µ22 (K)R(f ′′ )h3 − R(K)n−1 h−2 = 0, yields hAMISE and
ˆ
AMISE[f (·; hAMISE )] gives the AMISE-optimal error.
The AMISE-optimal order deserves some further inspection. It can be seen in Section 3.2 of Scott (2015) that
( )2/3
the AMISE-optimal order for the histogram of Section 2.1 is 34 R(f ′ )1/3 n−2/3 . Two facts are of interest.
First, the MISE of the histogram is asymptotically larger than the MISE of the kde (n−4/5 = o(n−2/3 )).
This is a quantification of the quite apparent visual improvement of the kde over the histogram. Second,
R(f ′ ) appears instead of R(f ′′ ), evidencing that the histogram is affected not only by the curvature of the
target density f but also by how fast it varies.
∫
Unfortunately, the AMISE bandwidth depends on R(f ′′ ) = (f ′′ (x))2 dx, which measures the curvature of
the density. As a consequence, it can not be readily applied in practice. In the next subsection we will see
how to plug-in estimates for R(f ′′ ).
A simple solution to estimate R(f ′′ ) is to assume that f is the density of a N (µ, σ 2 ), and then plug-in the
form of the curvature for such density:
3
R(ϕ′′σ (· − µ)) = .
8π 1/2 σ 5
While doing so, we approximate the curvature of an arbitrary density by means of the curvature of a Normal.
We have that
[ 1/2 ]1/5
8π R(K)
hAMISE = σ.
3µ22 (K)n
Interestingly, the bandwidth is directly proportional to the standard deviation of the target density. Replac-
ing σ by an estimate yields the normal scale bandwidth selector, which we denote by ĥNS to emphasize
its randomness:
[ 1/2 ]1/5
8π R(K)
ĥNS = σ̂.
3µ22 (K)n
The estimate σ̂ can be chosen as the standard deviation s, or, in order to avoid the effects of potential
outliers, as the standardized interquantile range:
X([0.75n]) − X([0.25n])
σ̂IQR := .
Φ−1 (0.75) − Φ−1 (0.25)
When combined with a normal kernel, for which µ2 (K) = 1 and R(K) = ϕ√2 (0) = 1
√
2 π
(check (2.26)), this
particularization of ĥNS gives the famous rule-of-thumb for bandwidth selection:
( )1/5
4
ĥRT = n−1/5 σ̂ ≈ 1.06n−1/5 σ̂.
3
ĥRT is implemented in R through the function bw.nrd (not to confuse with bw.nrd0).
# Data
x <- rnorm(100)
# Rule-of-thumb
bw.nrd(x = x)
## [1] 0.3698529
# Same as
iqr <- diff(quantile(x, c(0.75, 0.25))) / diff(qnorm(c(0.75, 0.25)))
1.06 * length(x)^(-1/5) * min(sd(x), iqr)
## [1] 0.367391
The previous selector is an example of a zero-stage plug-in selector, a terminology which lays on the fact
that R(f ′′ ) was estimated by plugging-in a parametric assumption at the “very beginning”. Because we
could have opted to estimate R(f ′′ ) nonparametrically and then plug-in the estimate into into hAMISE . Let’s
explore this possibility in more detail. But first, note the useful equality
∫ ∫
f (s) (x)2 dx = (−1)s f (2s) (x)f (x)dx.
This holds by a iterative application of integration by parts. For example, for s = 2, take u = f ′′ (x) and
dv = f ′′ (x)dx. This gives
∫ ∫
′′ ′′ ′
2
f (x) dx = [f (x)f − f ′ (x)f ′′′ (x)dx
(x)]+∞
−∞
∫
= − f ′ (x)f ′′′ (x)dx
under the assumption that the derivatives vanish at infinity. Applying again integration by parts with
u = f ′′′ (x) and dv = f ′ (x)dx gives the result. This simple derivation has an important consequence: for
estimating the functionals R(f (s) ) it is only required to estimate for r = 2s the functionals
∫
ψr := f (r) (x)f (x)dx = E[f (r) (X)].
Particularly, R(f ′′ ) = ψ4 .
Thanks to the previous expression, a possible way to estimate ψr nonparametrically is
1 ∑ ˆ(r)
n
ψ̂r (g) = f (Xi ; g)
n i=1
1 ∑ ∑ (r)
n n
= L (Xi − Xj ), (2.20)
n2 i=1 j=1 g
2.4. BANDWIDTH SELECTION 33
{ }
L(r) (0) µ2 (L)ψr+2 g 2 2R(L(r) )ψ0
AMSE[ψ̂r (g)] = r+1
+ +
ng 4 n2 g 2r+1
{∫ }
4
+ f (r) (x)2 f (x)dx − ψr2 .
n
Note: k is the highest integer such that µk (L) > 0. In these notes we have restricted to the case k = 2
for the kernels K, but there are theoretical gains if one allows high-order kernels L with vanishing even
moments larger than 2 for estimating ψr .
2. Obtain the AMSE-optimal bandwidth:
[ ]1/(r+k+1)
k!L(r) (0)
gAMSE = −
µk (L)ψr+k n
which shows that a parametric-like rate of convergence can be achieved with high-order kernels. If we
consider L = K and k = 2, then
[ ]1/(r+3)
2K (r) (0)
gAMSE = − .
µ2 (L)ψr+2 n
The above result has a major problem: it depends on ψr+2 ! Thus if we want to estimate R(f ′′ ) = ψ4 by
ψ̂4 (gAMSE ) we will need to estimate ψ6 . But ψ̂6 (gAMSE ) will depend on ψ8 and so on. The solution to this
convoluted problem is to stop estimating the functional ψr after a given number ℓ of stages, therefore the
terminology ℓ-stage plug-in selector. At the ℓ stage, the functional ψℓ+4 in the AMSE-optimal bandwidth
for estimating ψℓ+2 is computed assuming that the density is a N (µ, σ 2 ), for which
(−1)r/2 r!
ψr = √ , for r even.
(2σ)r+1 (r/2)! π
Typically, two stages are considered a good trade-off between bias (mitigated when ℓ increases) and variance
(augments with ℓ) of the plug-in selector. This is the method proposed by Sheather and Jones (1991), where
they consider L = K and k = 2, yielding what we call the direct plug-in (DPI). The algorithm is:
105
1. Estimate ψ8 using ψ̂8NS := √
32 π σ̂ 9
, where σ̂ is given in (2.19)
2. Estimate ψ6 using ψ̂6 (g1 ) from (2.20), where
[ ]1/9
2K (6) (0)
g1 := − .
µ2 (K)ψ̂8NS n
34 CHAPTER 2. DENSITY ESTIMATION
ĥDPI is implemented in R through the function bw.SJ (use method = "dpi"). The package ks provides
hpi, which is a faster implementation that allows for more flexibility and has a somehow more complete
documentation.
# Data
x <- rnorm(100)
# Rule-of-thumb
bw.SJ(x = x, method = "dpi")
## [1] 0.3538952
# Similar to
library(ks)
hpi(x) # Default is two-stages
## [1] 0.3531024
2.4.2 Cross-validation
We turn now our attention to a different philosophy of bandwidth estimation. Instead of trying to minimize
the AMISE by plugging-in estimates for the unknown curvature term, we directly attempt to minimize the
MISE by using the sample twice: one for computing the kde and other for evaluating its performance on
estimating f . To avoid the clear dependence on the sample, we do this evaluation in a cross-validatory way:
the data used for computing the kde is not used for its evaluation.
We begin by expanding the square in the MISE expression:
[∫ ]
MISE[fˆ(·; h)] = E ˆ
(f (x; h) − f (x)) dx
2
[∫ ] [∫ ]
=E ˆ
f (x; h) dx − 2E
2 ˆ
f (x; h)f (x)dx
∫
+ f (x)2 dx.
Since the last term does not depend on h, minimizing MISE[fˆ(·; h)] is equivalent to minimizing
[∫ ] [∫ ]
E fˆ(x; h)2 dx − 2E fˆ(x; h)f (x)dx .
This quantity is unknown, but it can be estimated unbiasedly (see Exercise 2.7) by
2.4. BANDWIDTH SELECTION 35
∫ ∑
n
LSCV(h) := fˆ(x; h)2 dx − 2n−1 fˆ−i (Xi ; h), (2.21)
i=1
where fˆ−i (·; h) is the leave-one-out kde and is based on the sample with the Xi removed:
1 ∑
n
fˆ−i (x; h) = Kh (x − Xj ).
n − 1 j=1
j̸=i
The motivation∫ for (2.21) is the following. The first term is unbiased by design. The second arises from
approximating fˆ(x; h)f (x)dx by Monte Carlo, or in other words, by replacing f (x)dx = dF (x) with dFn (x).
This gives
∫
1∑ˆ
n
fˆ(x; h)f (x)dx ≈ f (Xi ; h)
n i=1
and, in order to mitigate the dependence of the sample, we replace fˆ(Xi ; h) by fˆ−i (Xi ; h) above. In that
way, we use the sample for estimating the integral involving fˆ(·; h), but for each Xi we compute the kde
on the remaining points. The least squares cross-validation (LSCV) selector, also denoted unbiased
cross-validation (UCV) selector, is defined as
Numerical optimization is required for obtaining ĥLSCV , contrary to the previous plug-in selectors, and there
is little control on the shape of the objective function. This will be also the case for the forthcoming bandwidth
selectors. The following remark warns about the dangers of numerical optimization in this context.
Remark. Numerical optimization of the LSCV function can be challenging. In practice, several local minima
are possible, and the roughness of the objective function can vary notably depending on n and f . As a
consequence, optimization routines may get trapped in spurious solutions. To be on the safe side, it is
always advisable to check the solution by plotting LSCV(h) for a range of h, or to perform a search in a
bandwidth grid: ĥLSCV ≈ arg minh1 ,...,hG LSCV(h).
ĥLSCV is implemented in R through the function bw.ucv. bw.ucv uses R’s optimize, which is quite sensible
to the selection of the search interval (long intervals containing the solution may lead to unsatisfactory
termination of the search; and short intervals might not contain the minimum). Therefore, some care is
needed and that is why the bw.ucv.mod function is presented.
# Data
set.seed(123456)
x <- rnorm(100)
−0.10
−0.15
obj
−0.20
−0.25
0 1 2 3 4 5
h.grid
The next cross-validation selector is based on biased cross-validation (BCV). The BCV selector presents
a hybrid strategy that combines plug-in and cross-validation ideas. It starts by considering the AMISE
expression in (2.18)
1 R(K)
AMISE[fˆ(·; h)] = µ22 (K)R(f ′′ )h4 +
4 nh
and then plugs-in an estimate for R(f ′′ ) based on a modification of R(fˆ′′ (·; h)). The modification is
a [leave-out-diagonals
] estimate of R(f ′′ ). It is designed to reduce the bias of R(fˆ′′ (·; h)), since
′′
′′ ′′ R(K )
E R(fˆ (·; h)) = R(f ) + nh5 + O(h ). (Scott and Terrell, 1987). Plugging-in (2.22) into the
2
AMISE expression yields the BCV objective function and BCV bandwidth selector:
1 2 ^ ′′ )h4 +
R(K)
BCV(h) := µ (K)R(f ,
4 2 nh
ĥBCV := arg min BCV(h).
h>0
The appealing property of ĥBCV is that it has a considerably smaller variance compared to ĥLSCV . This
reduction in variance comes at the price of an increased bias, which tends to make ĥBCV larger than hMISE .
ĥBCV is implemented in R through the function bw.bcv. Again, bw.bcv uses R’s optimize so the bw.bcv.mod
function is presented to have better guarantees on finding the adequate minimum.
38 CHAPTER 2. DENSITY ESTIMATION
# Data
set.seed(123456)
x <- rnorm(100)
0.02
0.01
0 1 2 3 4 5
h.grid
We state next some insights from the convergence results of the DPI, LSCV, and BCV selectors. All of them
are based in results of the kind
d
nν (ĥ/hMISE − 1) −→ N (0, σ 2 ),
where σ 2 depends only on K and f , and measures how variable is the selector. The rate nν serves to quantify
how fast the relative ĥ/hMISE − 1 decreases (the larger the ν, the faster the convergence). Under certain
regularity conditions, we have:
d d
• n1/10 (ĥLSCV /hMISE − 1) −→ N (0, σLSCV
2
) and n1/10 (ĥBCV /hMISE − 1) −→ N (0, σBCV
2
). Both selectors
1/2
have a slow rate of convergence (compare it with the n of the CLT). Inspection of the variances of
2
both selectors reveals that, for the normal kernel σLSCV 2
/σBCV ≈ 15.7. Therefore, LSCV is considerably
more variable than BCV.
d
• n5/14 (ĥDPI /hMISE − 1) −→ N (0, σDPI
2
). Thus, the DPI selector has a convergence rate much faster
than the cross-validation selectors. There is an appealing explanation for this phenomenon. Recall
that ĥBCV minimizes the slightly modified version of BCV(h) given by
1 2 R(K)
µ (K)ψ̃4 (h)h4 +
4 2 nh
and
40 CHAPTER 2. DENSITY ESTIMATION
1 ∑∑n n
ψ̃4 (h) := (K ′′ ∗ Kh′′ )(Xi − Xj )
n(n − 1) i=1 j=1 h
j̸=i
n ^
= R(f ′′ ).
n−1
ψ̃4 is a leave-out-diagonals estimate of ψ4 . Despite being different from ψ̂4 , it serves for building a
DPI analogous to BCV points towards the precise fact that drags down the performance of BCV. The
modified version of the DPI minimizes
1 2 R(K)
µ2 (K)ψ̃4 (g)h4 + ,
4 nh
where g is independent of h. The two methods differ on the the way g is chosen: BCV sets g = h and the
modified DPI looks for the best g in terms of the AMSE[ψ̃4 (g)]. It can be seen that gAMSE = O(n−2/13 ),
whereas the h used in BCV is asymptotically O(n−1/5 ). This suboptimality on the choice of g is the
reason of the asymptotic deficiency of BCV.
We focus now on exploring the empirical performance of bandwidth selectors. The workhorse for doing that
is simulation. A popular collection of simulation scenarios was given by Marron and Wand (1992) and are
conveniently available through the package nor1mix. They forma collection of normal r-mixtures of the form
∑
r
f (x; µ, σ, w) : = wj ϕσj (x − µj ),
j=1
∑r
where wj ≥ 0, j = 1, . . . , r and j=1 wj = 1. Densities of this form are specially attractive since they
allow for arbitrarily flexibility and, if the normal kernel is employed, they allow for explicit and exact MISE
expressions:
√
MISEr [fˆ(·; h)] = (2 πnh)−1 + w′ {(1 − n−1 )Ω2 − 2Ω1 + Ω0 }w,
(Ωa )ij = ϕ(ah2 +σi2 +σj2 )1/2 (µi − µj ), i, j = 1, . . . , r. (2.23)
# Load package
library(nor1mix)
# Available models
?MarronWand
# Simulating
samp <- rnorMix(n = 500, obj = MW.nm9) # MW object in the second argument
hist(samp, freq = FALSE)
# Density evaluation
x <- seq(-4, 4, length.out = 400)
lines(x, dnorMix(x = x, obj = MW.nm9), col = 2)
2.4. BANDWIDTH SELECTION 41
Histogram of samp
0.30
0.20
Density
0.10
0.00
−3 −2 −1 0 1 2 3
samp
# Plot a MW object directly
# A normal with the same mean and variance is plotted in dashed lines
par(mfrow = c(2, 2))
plot(MW.nm5)
plot(MW.nm7)
plot(MW.nm10)
plot(MW.nm12)
lines(MW.nm10) # Also possible
42 CHAPTER 2. DENSITY ESTIMATION
#5 Outlier #7 Separated
0.4
f(x)
f(x)
0.2
2
0.0
0
x x
0.3
f(x)
f(x)
0.3
0.0
−2 −1 0 1 2 0.0 −3 −2 −1 0 1 2 3
x x
Figure 2.4.3 presents a visualization of the performance of the kde with different bandwidth selectors, carried
out in the family of mixtures of Marron and Wand (1992).
Performance comparison of bandwidth selectors. The RT, DPI, LSCV, and BCV are computed for each
sample for a normal mixture density. For each sample, computes the ISEs of the selectors and sorts them
from best to worst. Changing the scenarios gives insight on the adequacy of each selector to hard- and
simple-to-estimate densities. Application also available here.
Obtaining a confidence interval (CI) for f (x) is a hard task. Due to the bias results seen in Section 2.3, we
know that the kde is biased for finite sample sizes, E[fˆ(x; h)] = (Kh ∗ f )(x), and it is only asymptotically
unbiased when h → 0. This bias is called the smoothing bias and in essence complicates the obtention of
CIs for f (x), but not of (Kh ∗ f )(x). Let’s see with an illustrative example the differences between these two
objects.
Two well-known facts for normal densities (see Appendix C in Wand and Jones (1995)) are:
Thus, the exact expectation of the kde for estimating the density of a N (µ, σ 2 ) is the density of a N (µ, σ 2 +h2 ).
Clearly, when h → 0, the bias disappears (at expenses of increasing the variance, of course). Removing this
finite-sample size bias is not simple: if the bias is expanded, f ′′ appears. Thus for attempting to unbias fˆ(·; h)
we have to estimate f ′′ , which is much more complicated than estimating f ! Taking second derivatives on
the kde does not simply work out-of-the-box, since the bandwidths for estimating f and f ′′ scale differently.
We refer the interested reader to Section 2.12 in Wand and Jones (1995) for a quick review of derivative kde.
The previous deadlock can be solved if we limit our ambitions. Rather than constructing a confidence interval
for f (x), we will do it for E[fˆ(x; h)] = (Kh ∗ f )(x). There is nothing wrong with this as long as we are careful
when we report the results to make it clear that the CI is for (Kh ∗ f )(x) and not f (x).
The building block for the CI for E[fˆ(x; h)] = (Kh ∗ f )(x) is Theorem 2.2, which stated that:
√ d
nh(fˆ(x; h) − E[fˆ(x; h)]) −→ N (0, R(K)f (x)).
Plugging-in fˆ(x; h) = f (x) + OP (h2 + (nh)−1 ) = f (x)(1 + oP (1)) (see Exercise 2.8) as an estimate for f (x)
in the variance, we have by the Slutsky’s theorem that
√
nh
(fˆ(x; h) − E[fˆ(x; h)])
R(K)fˆ(x; h)
√
nh
= (fˆ(x; h) − E[fˆ(x; h)])(1 + oP (1))
R(K)f (x)
d
−→ N (0, 1).
Therefore, an asymptotic 100(1 − α)% confidence interval for E[fˆ(x; h)] that can be straightforwardly
computed is:
√
R(K)fˆ(x; h)
I = fˆ(x; h) ± zα . (2.29)
nh
Recall that if we wanted to do obtain a CI with the second result in Theorem 2.2 we will need to estimate
f ′′ (x).
Remark. Several points regarding the CI require proper awareness:
1. The CI is for E[fˆ(x; h)] = (Kh ∗ f )(x), not[f (x). ]
2. The CI is pointwise. That means that P E[fˆ(x; h)] ∈ I ≈ 1 − α for each x ∈ R, but not simulta-
[ ]
neously. This is, P E[fˆ(x; h)] ∈ I, ∀x ∈ R ̸= 1 − α.
44 CHAPTER 2. DENSITY ESTIMATION
3. We are approximating f (x) in the variance by fˆ(x; h)√= f (x) + OP (h2 + (nh)−1 ). Additionally, the
convergence to a normal distribution happens at rate nh. Hence both h and nh need to small and
large, respectively, for a good coverage.
4. The CI is for a deterministic bandwidth h (i.e., not selected from the sample), which is not usually
the case in practise. If a bandwidth selector is employed, the coverage will be affected, specially for
small n.
We illustrate the construction of the (2.29) in for the situation where the reference density is a N (µ, σ 2 ) and
the normal kernel is employed. This allows to use (2.27) and (2.28) in combination (2.6) and (2.7) with to
obtain:
The following piece of code evaluates the proportion of times that E[fˆ(x; h)] belongs to I for each x ∈ R,
both estimating and knowing the variance in the asymptotic distribution.
# R(K) for a normal
Rk <- 1 / (2 * sqrt(pi))
# Selected bandwidth
h <- kde$bw
Estimate
CI estimated var
0.8
CI known var
Smoothed density
0.6
Density
0.4
0.2
0.0
−4 −2 0 2 4
# Simulation
46 CHAPTER 2. DENSITY ESTIMATION
set.seed(12345)
for (i in 1:M) {
# Plot results
plot(kde$x, colMeans(insideCi1), ylim = c(0.25, 1), type = "l",
main = "Coverage CIs", xlab = "x", ylab = "Coverage")
lines(kde$x, colMeans(insideCi2), col = 4)
abline(h = 1 - alpha, col = 2)
abline(h = 1 - alpha + c(-1, 1) * qnorm(0.975) *
sqrt(alpha * (1 - alpha) / M), col = 2, lty = 2)
legend(x = "bottom", legend = c("CI estimated var", "CI known var",
"Nominal level",
"95% CI for the nominal level"),
col = c(1, 4, 2, 2), lwd = 2, lty = c(1, 1, 1, 2))
2.6. PRACTICAL ISSUES 47
Coverage CIs
1.0
0.8
Coverage
0.6
CI estimated var
0.4
CI known var
Nominal level
95% CI for the nominal level
−4 −2 0 2 4
x
Exercise 2.3. Explore the coverage of the asymptotic CI for varying values of h. To that end, adapt the
previous code to work in a manipulate environment like the example given below.
# Load manipulate
# install.packages("manipulate")
library(manipulate)
# Sample
x <- rnorm(100)
In Section 2.3 we assumed certain regularity conditions for f . Assumption A1 stated that f should be twice
continuously differentiable (on R). It is simple to think a counterexample for that: take any density with
48 CHAPTER 2. DENSITY ESTIMATION
bounded support, for example a LN (0, 1) in (0, ∞), as seen below. The the kde will run into troubles.
# Sample from a LN(0, 1)
set.seed(123456)
samp <- rlnorm(n = 500)
density.default(x = samp)
1.0
0.8
0.6
Density
0.4
0.2
0.0
0 5 10 15
1∑
n
fˆT (x; h, t) := Kh (t(x) − t(Xi ))t′ (x). (2.30)
n i=1
Note that h is in the scale of t(Xi ), not Xi . Hence another plus of this approach is that bandwidth selection
can be done transparently in terms of the previously seen bandwidth selectors applied to t(X1 ), . . . , t(Xn ).
A table with some common transformations is:
2.6. PRACTICAL ISSUES 49
Construction of the transformation kde for the log and probit transformations. The left panel shows the kde
(2.4) applied to the transformed data. The right plot shows the transformed kde (2.30). Application also
available here.
The code below illustrates how to compute a transformation kde in practice.
# kde with log-transformed data
kde <- density(log(samp))
plot(kde, main = "kde of transformed data")
rug(log(samp))
0.2
0.1
0.0
−4 −2 0 2
# Transformed kde: back-transform the grid and multiply by the Jacobian t'(x) = 1 / x
kdeTransf <- kde
kdeTransf$x <- exp(kde$x)
kdeTransf$y <- kde$y / kdeTransf$x
plot(kdeTransf, main = "Transformed kde")
curve(dlnorm(x), col = 2, add = TRUE)
rug(samp)
2.6.2 Sampling
Sampling from a kde is straightforward once we regard the kde

\hat f(x; h) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i)
as a mixture density made of n mixture components in which each of them is sampled independently. The
only part that might require special treatment is sampling from the density K, although for most of the
implemented kernels R contains specific sampling functions.
The algorithm for sampling N points goes as follows:
1. Choose i ∈ {1, . . . , n} at random.
2. Sample from K_h(· − X_i) = \frac{1}{h} K\left(\frac{· − X_i}{h}\right).
3. Repeat the previous steps N times.
Let’s see a quick application.
# Sample the Claw density (the MW.nm10 mixture in nor1mix)
n <- 100
set.seed(23456)
samp <- rnorMix(n = n, obj = MW.nm10)

# Kde
h <- 0.1
plot(density(samp, bw = h))
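The resampling itself (steps 1–3, here with a normal kernel) is not shown in the extract; a minimal sketch, with N denoting the number of desired samples, is:

# Steps 1-3: choose mixture components at random, then sample from them
N <- 500
comp <- sample(x = length(samp), size = N, replace = TRUE)
sampKde <- rnorm(n = N, mean = samp[comp], sd = h)
# Compare the kde of the original sample with the kde of the resampled points
plot(density(samp, bw = h))
lines(density(sampKde, bw = h), col = 2)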
2.7 Exercises
This is the list of evaluable exercises for Chapter 2. The number of stars represents an estimate of their
difficulty: easy (⋆), medium (⋆⋆), and hard (⋆ ⋆ ⋆).
Exercise 2.4. (theoretical, ⋆) Prove that the histogram (2.1) is a proper density estimate (a nonnegative density that integrates to one). Obtain its associated distribution function. How does it differ from the ecdf (1.1)?
Exercise 2.5. (theoretical, ⋆, adapted from Exercise 2.1 in Wand and Jones (1995)) Derive the result (2.7).
Then obtain the exact MSE and MISE using (2.6) and (2.7).
Exercise 2.6. (theoretical, ⋆⋆) Conditionally on the sample X1 , . . . , Xn , compute the expectation and
variance of the kde (2.4) and compare them with the sample mean and variance. What is the effect of h in
them?
Exercise 2.7. (theoretical, ⋆⋆, Exercise 3.3 in Wand and Jones (1995)) Show that, with g denoting the density of t(X),

Bias[\hat f_T(x; h, t)] = \frac{1}{2}\mu_2(K)\, g''(t(x))\, t'(x)\, h^2 + o(h^2),

Var[\hat f_T(x; h, t)] = \frac{R(K)}{nh}\, g(t(x))\, t'(x)^2 + o((nh)^{-1}),

AMISE[\hat f_T(\cdot; h, t)] = \frac{\mu_2^2(K)}{4}\int t'(t^{-1}(x))\, g''(x)^2\, \mathrm{d}x\, h^4 + \frac{R(K)}{nh}\,\mathbb{E}[t'(X)].
Exercise 2.10. (practical, ⋆) The kde can be used to smoothly resample a dataset. To that end, first compute the kde of the dataset and then employ the algorithm of Section 2.6.2. Implement this resampling as a function that takes as arguments the dataset, the bandwidth h, and the number of sampled points M wanted from the dataset. Use the normal kernel for simplicity. Test the implementation with the faithful dataset and different bandwidths.
Exercise 2.11. (practical, ⋆⋆, Exercise 6.5 in Wasserman (2006)) Data on the salaries of the chief executive
officer of 60 companies is available at http://lib.stat.cmu.edu/DASL/Datafiles/ceodat.html (alternative link).
Investigate the distribution of salaries using a kde. Use ĥLSCV to choose the amount of smoothing. Also
consider ĥRT . There appear to be a few bumps in the density. Are they real? Use confidence bands to
address this question. Finally, comment on the resulting estimates.
Exercise 2.12. (practical, ⋆⋆) Implement your own version of the transformation kde (2.30) for the three
transformations given in Section 2.6. You can tweak the output of the density function in R and add an
extra argument for selecting the kind of transformation. Or you can implement it directly from scratch.
Exercise 2.13. (practical, ⋆ ⋆ ⋆) A bandwidth selector is a random variable. Visualizing its density can
help to understand its behavior, especially if it is compared with the asymptotic optimal bandwidth hAMISE .
Create a script that does the following steps:
1. For j = 1, . . . , M = 10000:
\hat f(x; \mathbf{h}) = \frac{1}{n}\sum_{i=1}^{n} K_{h_1}(x_1 - X_{i,1}) \times \cdots \times K_{h_p}(x_p - X_{i,p}),
Regression estimation
The relation of two random variables X and Y can be completely characterized by their joint cdf F , or
equivalently, by the joint pdf f if (X, Y ) is continuous, the case we will address. In the regression setting,
we are interested in predicting/explaining the response Y by means of the predictor X from a sample
(X1 , Y1 ), . . . , (Xn , Yn ). The role of the variables is not symmetric: X is used to predict/explain Y .
The complete knowledge of Y when X = x is given by the conditional pdf f_{Y|X=x}(y) = \frac{f(x, y)}{f_X(x)}. While this
pdf provides full knowledge about Y |X = x, it is also a challenging task to estimate it: for each x we have
to estimate a curve! A simpler approach, yet still challenging, is to estimate the conditional mean (a scalar)
for each x. This is the so-called regression function1
m(x) := \mathbb{E}[Y | X = x] = \int y\, \mathrm{d}F_{Y|X=x}(y) = \int y f_{Y|X=x}(y)\, \mathrm{d}y.

The regression setting is often summarized by the location-scale model

Y = m(X) + σ(X)ε,

where σ²(x) := Var[Y |X = x] and ε is independent of X and such that E[ε] = 0 and Var[ε] = 1.
The multiple linear regression employs multiple predictors X1 , . . . , Xp 2 for explaining a single response Y by
assuming that a linear relation of the form
Y = β0 + β1 X1 + . . . + βp Xp + ε (3.1)
1 Recall that we assume that (X, Y ) is continuous.
2 Not to confuse with a sample!
holds between the predictors X1 , . . . , Xp and the response Y . In (3.1), β0 is the intercept and β1 , . . . , βp are
the slopes, respectively. ε is a random variable with mean zero and independent from X1 , . . . , Xp . Another
way of looking at (3.1) is

\mathbb{E}[Y | X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p,

since E[ε|X1 = x1, . . . , Xp = xp] = 0. Therefore, the mean of Y changes in a linear fashion with respect to the values of X1, . . . , Xp. Hence the interpretation of the coefficients:
Figure 3.1 illustrates the geometrical interpretation of a multiple linear model: a plane in the (p + 1)-
dimensional space. If p = 1, the plane is the regression line for simple linear regression. If p = 2, then the
plane can be visualized in a three-dimensional plot. TODO: add another figure.
Figure 3.1: The regression plane (blue) of Y on X1 and X2 , and its relation with the regression lines (green
lines) of Y on X1 (left) and Y on X2 (right). The red points represent the sample for (X1 , X2 , Y ) and
the black points the subsamples for (X1 , X2 ) (bottom), (X1 , Y ) (left) and (X2 , Y ) (right). Note that the
regression plane is not a direct extension of the marginal regression lines.
The estimation of β0 , β1 , . . . , βp is done by minimizing the so-called Residual Sum of Squares (RSS). First
we need to introduce some helpful matrix notation:
• A sample of (X1, . . . , Xp, Y) is denoted by (X11, . . . , X1p, Y1), . . . , (Xn1, . . . , Xnp, Yn), where Xij denotes the i-th observation of the j-th predictor Xj. We denote by Xi = (Xi1, . . . , Xip) the i-th observation of (X1, . . . , Xp), so the sample simplifies to (X1, Y1), . . . , (Xn, Yn).
• The design matrix contains all the information of the predictors and a column of ones:

X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}_{n \times (p+1)}.
• The vector of responses Y, the vector of coefficients β, and the vector of errors are, respectively,

Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n \times 1}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}_{(p+1) \times 1}, \quad \text{and} \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}_{n \times 1}.
Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely

Y_i = \beta_0 + \beta_1 X_{i1} + \ldots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, \ldots, n,

into the compact form

Y = Xβ + ε.

The RSS is defined as

\mathrm{RSS}(\beta) := \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \ldots - \beta_p X_{ip})^2 = (Y - X\beta)'(Y - X\beta). \quad (3.3)
The RSS aggregates the squared vertical distances from the data to a regression plane given by β. Note that
the vertical distances are considered because we want to minimize the error in the prediction of Y . Thus,
the treatment of the variables is not symmetrical 3 ; see Figure 3.1.1. The least squares estimators are the
minimizers of the RSS:
\hat\beta := \arg\min_{\beta \in \mathbb{R}^{p+1}} \mathrm{RSS}(\beta).

Luckily, thanks to the matrix form of (3.3), it is simple to compute a closed-form expression for the least squares estimate:

\hat\beta = (X'X)^{-1}X'Y. \quad (3.4)
The least squares regression plane y = β̂0 + β̂1 x1 + β̂2 x2 and its dependence on the kind of squared distance
considered. Application also available here.
Let’s check that indeed the coefficients given by R’s lm are the ones given by (3.4) in a toy linear model.
# Create the data employed in Figure 3.1 (x1, x2, x3, and eps are numeric vectors
# of length 50 generated in an earlier chunk that is not shown in this extract)

# Responses
yLin <- -0.5 + 0.5 * x1 + 0.5 * x2 + eps
yQua <- -0.5 + x1^2 + 0.5 * x2 + eps
yExp <- -0.5 + 0.5 * exp(x2) + x3 + eps
# Data
dataAnimation <- data.frame(x1 = x1, x2 = x2, yLin = yLin,
yQua = yQua, yExp = yExp)
# Call lm
# lm employs formula = response ~ predictor1 + predictor2 + ...
# (names according to the data frame names) for denoting the regression
# to be done
mod <- lm(yLin ~ x1 + x2, data = dataAnimation)
summary(mod)
##
## Call:
## lm(formula = yLin ~ x1 + x2, data = dataAnimation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.37003 -0.54305 0.06741 0.75612 1.63829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5703 0.1302 -4.380 6.59e-05 ***
## x1 0.4833 0.1264 3.824 0.000386 ***
## x2 0.3215 0.1426 2.255 0.028831 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9132 on 47 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.2452
## F-statistic: 8.958 on 2 and 47 DF, p-value: 0.0005057
# Coefficients
mod$coefficients
## (Intercept) x1 x2
## -0.5702694 0.4832624 0.3214894
# Matrix X
X <- cbind(1, x1, x2)
# Vector Y
Y <- yLin
# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
## [,1]
## -0.5702694
## x1 0.4832624
## x2 0.3214894
Exercise 3.2. Compute β for the regressions yLin ~ x1 + x2, yQua ~ x1 + x2, and yExp ~ x2 + x3
using equation (3.4) and the function lm. Check that the fitted plane and the coefficient estimates are
coherent.
Once we have the least squares estimate β̂, we can define the next two concepts:
• The fitted values Ŷ1, . . . , Ŷn, where Ŷi := β̂0 + β̂1 Xi1 + . . . + β̂p Xip. They are the vertical projections of Y1, . . . , Yn onto the fitted plane (see Figure 3.1.1). In matrix form, using (3.4),

Ŷ = Xβ̂ = X(X′X)⁻¹X′Y = HY,

where H := X(X′X)⁻¹X′ is called the hat matrix because it “puts the hat on Y”. What it does is to project Y onto the regression plane (see Figure 3.1.1; a numerical check is given right after this list).
• The residuals (or estimated errors) ε̂1 , . . . , ε̂n , where
ε̂i := Yi − Ŷi , i = 1, . . . , n.
They are the vertical distances between actual data and fitted data.
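For illustration, the projection property of H can be checked numerically with the toy data from the previous code (using the X, Y, and mod objects defined above):

# Numerical check that the hat matrix maps Y into the fitted values
H <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% Y - mod$fitted.values))  # Should be numerically zero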
Model assumptions
Up to now, we have not made any probabilistic assumption on the data generation process. β̂ was derived
from purely geometrical arguments, not probabilistic ones. However, some probabilistic assumptions are
required for inferring the unknown population coefficients β from the sample (X1 , Y1 ), . . . , (Xn , Yn ).
The assumptions of the multiple linear model are:
i. Linearity: E[Y |X1 = x1 , . . . , Xp = xp ] = β0 + β1 x1 + . . . + βp xp .
ii. Homoscedasticity: Var[εi ] = σ 2 , with σ 2 constant for i = 1, . . . , n.
iii. Normality: εi ∼ N (0, σ 2 ) for i = 1, . . . , n.
iv. Independence of the errors: ε1 , . . . , εn are independent (or uncorrelated, E[εi εj ] = 0, i ̸= j, since
they are assumed to be normal).
Figure 3.2: The key concepts of the simple linear model. The blue densities denote the conditional density
of Y for each cut in the X axis. The yellow band denotes where the 95% of the data is, according to the
model. The red points represent data following the model.
A good one-line summary of the linear model is the following (independence is assumed):

Y | (X_1 = x_1, \ldots, X_p = x_p) \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p, \sigma^2). \quad (3.5)
Inference on the parameters β and σ can be done, conditionally⁴ on X1, . . . , Xn, from (3.5). We do not explore this further, referring the interested reader to these notes. Instead, we remark on the connection between least squares estimation and the maximum likelihood estimator derived from (3.5).
First, note that (3.5) is the population version of the linear model (it is expressed in terms of the random variables, not in terms of their samples). The sample version that summarizes assumptions i–iv is Y | (X_1, \ldots, X_n) \sim \mathcal{N}_n(X\beta, \sigma^2 I), whose conditional log-likelihood is

\ell(\beta) = \log \phi_{\sigma^2 I}(Y - X\beta) = \sum_{i=1}^{n} \log \phi_\sigma(Y_i - (X\beta)_i). \quad (3.6)
Finally, the next result justifies the consideration of the least squares estimate: it equals the maximum
likelihood estimator derived under assumptions i–iv.
Theorem 3.1. Under assumptions i–iv, the maximum likelihood estimate of β is the least squares estimate (3.4):

\hat\beta_{\mathrm{ML}} = \arg\max_{\beta \in \mathbb{R}^{p+1}} \ell(\beta) = (X'X)^{-1}X'Y.
Model formulation
When the response Y can take only two values, codified for convenience as 1 (success) and 0 (failure), it
is called a binary variable. A binary variable, known also as a Bernoulli variable, is a B(1, p). Recall that
E[B(1, p)] = P[B(1, p) = 1] = p.
If Y is a binary variable and X1, . . . , Xp are predictors associated to Y, the purpose of logistic regression is to estimate

p(x_1, \ldots, x_p) := \mathbb{P}[Y = 1 | X_1 = x_1, \ldots, X_p = x_p] = \mathbb{E}[Y | X_1 = x_1, \ldots, X_p = x_p], \quad (3.7)

that is, how the probability of Y = 1 changes according to particular values, denoted by x1, . . . , xp, of the predictors X1, . . . , Xp. A tempting possibility is to consider a linear model for (3.7), p(x1, . . . , xp) = β0 + β1x1 + . . . + βpxp. However, such a model will inevitably run into serious problems: negative probabilities and probabilities larger than one will arise.
A solution is to consider a function that takes the value of z = β0 + β1x1 + . . . + βpxp, in R, and maps it back to [0, 1]. There are several alternatives to do so, based on distribution functions F : R −→ [0, 1] that deliver y = F(z) ∈ [0, 1]. Different choices of F give rise to different models, the most common being the logistic distribution function

\mathrm{logistic}(z) := \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}},

whose inverse is the logit function

\mathrm{logit}(p) := \mathrm{logistic}^{-1}(p) = \log\frac{p}{1 - p}.
This is a link function, that is, a function that maps a given space (in this case [0, 1]) into R. The term link function is employed in generalized linear models, which follow exactly the same philosophy as logistic regression – mapping the domain of Y to R in order to apply there a linear model. As said, different link functions are possible, but we will concentrate here exclusively on the logit link.
The logistic model is defined as the following parametric form for (3.7):

p(x_1, \ldots, x_p) = \mathrm{logistic}(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p)}}. \quad (3.8)

A useful concept for interpreting (3.8) is the odds:

\mathrm{odds}(Y) = \frac{p}{1 - p} = \frac{\mathbb{P}[Y = 1]}{\mathbb{P}[Y = 0]}. \quad (3.9)
The odds is the ratio between the probability of success and the probability of failure. It is extensively used in betting due to its better interpretability. For example, if a horse Y has a probability p = 2/3 of winning a race (Y = 1), then the odds of the horse is

\mathrm{odds} = \frac{p}{1 - p} = \frac{2/3}{1/3} = 2.

This means that the probability of winning is twice as large as the probability of losing. This is sometimes written as 2 : 1 or 2 × 1 (spelled “two-to-one”). Conversely, if the odds of Y is given, we can easily obtain the probability of success p by inverting (3.9):

p = \mathbb{P}[Y = 1] = \frac{\mathrm{odds}(Y)}{1 + \mathrm{odds}(Y)}.

For example, if the odds of the horse were 5, that would correspond to a probability of winning p = 5/6.
Remark. Recall that the odds is a number in [0, +∞]. The 0 and +∞ values are attained for p = 0 and
p = 1, respectively. The log-odds (or logit) is a number in [−∞, +∞].
We can rewrite (3.8) in terms of the odds (3.9). If we do so, we have:

\mathrm{odds}(Y | X_1 = x_1, \ldots, X_p = x_p) = \frac{p(x_1, \ldots, x_p)}{1 - p(x_1, \ldots, x_p)} = e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p} = e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_p x_p}.
Some probabilistic assumptions are required for performing inference on the model parameters β from the
sample (X1 , Y1 ), . . . , (Xn , Yn ). These assumptions are somehow simpler than the ones for linear regression.
Figure 3.3: The key concepts of the logistic model. The blue bars represent the conditional distribution of
probability of Y for each cut in the X axis. The red points represent data following the model.
The estimation of β is done by maximum likelihood. The log-likelihood of β is

\ell(\beta) = \sum_{i=1}^{n} \log\left(p(X_i)^{Y_i}(1 - p(X_i))^{1 - Y_i}\right) = \sum_{i=1}^{n} \left\{Y_i \log(p(X_i)) + (1 - Y_i)\log(1 - p(X_i))\right\}. \quad (3.10)
Unfortunately, due to the non-linearity of the optimization problem there are no explicit expressions for β̂.
These have to be obtained numerically by means of an iterative procedure, which may run into problems in
low sample situations with perfect classification. Unlike in the linear model, inference is not exact from the
assumptions, but approximate in terms of maximum likelihood theory. We do not explore this further and
refer the interested reader to these notes.
Figure 3.1.2 shows how the log-likelihood changes with respect to the values for (β0 , β1 ) in three data patterns.
The logistic regression fit and its dependence on β0 (horizontal displacement) and β1 (steepness of the curve).
Recall the effect of the sign of β1 in the curve: if positive, the logistic curve has an ‘s’ form; if negative, the
form is a reflected ‘s’. Application also available here.
The data of the illustration has been generated with the following code:
Let’s check that indeed the coefficients given by R’s glm are the ones that maximize the likelihood of the
animation of Figure 3.1.2. We do so for y1 ~ x.
# Create the data employed in Figure 3.4
# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))
# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)
# Call glm
# glm employs formula = response ~ predictor1 + predictor2 + ...
# (names according to the data frame names) for denoting the regression
# to be done. We need to specify family = "binomial" to make a
# logistic regression
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
summary(mod)
##
## Call:
## glm(formula = y1 ~ x, family = "binomial", data = dataMle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.47853 -0.40139 0.02097 0.38880 2.12362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1692 0.4725 -0.358 0.720274
## x 2.4282 0.6599 3.679 0.000234 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69.315 on 49 degrees of freedom
## Residual deviance: 29.588 on 48 degrees of freedom
## AIC: 33.588
##
## Number of Fisher Scoring iterations: 6
# Coefficients
mod$coefficients
## (Intercept) x
## -0.1691947 2.4281626
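By the odds form of (3.8), a unit increase in the predictor multiplies the odds of Y = 1 by e^{β1}. For illustration, the estimated multiplicative effects can be inspected by exponentiating the coefficients:

# Estimated multiplicative effects on the odds
exp(mod$coefficients)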
(Figure: the data (x, y1) together with the logistic curve fitted by mod.)
Exercise 3.3. For the regressions y ~ x2 and y ~ x3, do the following:
• Check that β is indeed maximizing the likelihood as compared with Figure 3.1.2.
• Plot the fitted logistic curve and compare it with the one in Figure 3.1.2.
3.2 Kernel regression estimation

Our objective is to estimate the regression function m nonparametrically. Due to its definition, we can rewrite it as follows:

m(x) = \mathbb{E}[Y | X = x] = \int y f_{Y|X=x}(y)\, \mathrm{d}y = \frac{\int y f(x, y)\, \mathrm{d}y}{f_X(x)}.
This expression shows an interesting point: the regression function can be computed from the joint density f
and the marginal fX . Therefore, given a sample (X1 , Y1 ), . . . , (Xn , Yn ), an estimate of m follows by replacing
the previous densities by their kdes. To that aim, recall that in Exercise 2.15 we defined a multivariate
extension of the kde based on product kernels. For the two dimensional case, the kde with equal bandwidths
h = (h, h) is
\hat f(x, y; \mathbf{h}) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i) K_h(y - Y_i). \quad (3.11)
Using (3.11),
m(x) \approx \frac{\int y \hat f(x, y; h)\, \mathrm{d}y}{\hat f_X(x; h)}
= \frac{\frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i) \int y K_h(y - Y_i)\, \mathrm{d}y}{\frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i)}
= \frac{\frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i) Y_i}{\frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i)},

since \int y K_h(y - Y_i)\, \mathrm{d}y = Y_i. This yields the so-called Nadaraya-Watson estimator:

\hat m(x; 0, h) := \sum_{i=1}^{n} \frac{K_h(x - X_i)}{\sum_{j=1}^{n} K_h(x - X_j)} Y_i = \sum_{i=1}^{n} W_i^0(x) Y_i. \quad (3.12)
# Nadaraya-Watson estimator
mNW <- function(x, X, Y, h, K = dnorm) {
  # Arguments
  # x: evaluation points
  # X: vector (size n) with the predictors
  # Y: vector (size n) with the response variable
  # h: bandwidth
  # K: kernel
  # Matrix of size length(x) x n with the kernel evaluations
  Kx <- sapply(X, function(Xi) K((x - Xi) / h) / h)
  # Weights
  W <- Kx / rowSums(Kx) # Column recycling!
  # Estimator: weighted average of the responses
  drop(W %*% Y)
}
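The objects X, Y, m, and xGrid used below come from a data-generating chunk that is not shown in this extract; a hypothetical stand-in is given next (the regression function, noise level, and seed are assumptions, so the numerical outputs reported later correspond to the original data rather than to this one):

# Hypothetical stand-in for the unshown data-generating chunk
set.seed(12345)
n <- 100
m <- function(x) x^2 * cos(x)   # Assumed true regression function
X <- rnorm(n, sd = 2)           # Assumed predictor distribution
Y <- m(X) + rnorm(n, sd = 2)    # Assumed noise level
xGrid <- seq(-10, 10, l = 500)  # Evaluation grid (range matches the locpoly calls below)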
# Bandwidth
h <- 0.5
# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
Exercise 3.4. Implement your own version of the Nadaraya-Watson estimator in R and compare it with mNW. You may focus only on the normal kernel and reduce the accuracy of the final computation up to 1e-7 to achieve better efficiency. Are you able to improve the speed of mNW? Use system.time or the microbenchmark package to measure the running times for a sample size of n = 10000.
Winner-takes-all bonus: the significantly fastest version of the Nadaraya-Watson estimator (under the above conditions) will get a bonus of 0.5 points.
The code below illustrates the effect of varying h for the Nadaraya-Watson estimator using manipulate.
# Simple plot of N-W for varying h's
manipulate({
  # Plot data
  plot(X, Y)
  rug(X, side = 1); rug(Y, side = 2)
  lines(xGrid, m(xGrid), col = 1)
  lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
}, h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))  # Slider range assumed
Nadaraya-Watson can be seen as a particular case of a local polynomial fit, specifically, the one corresponding to a local constant fit. The motivation for the local polynomial fit comes from attempting to minimize the RSS

\sum_{i=1}^{n} (Y_i - m(X_i))^2. \quad (3.13)
This is not achievable directly, since no knowledge of m is available. However, by a p-th order Taylor expansion it is possible to obtain that, for x close to Xi,

m(X_i) \approx m(x) + m'(x)(X_i - x) + \frac{m''(x)}{2}(X_i - x)^2 + \cdots + \frac{m^{(p)}(x)}{p!}(X_i - x)^p. \quad (3.14)

Replacing (3.14) in (3.13) gives

\sum_{i=1}^{n} \left(Y_i - \sum_{j=0}^{p} \frac{m^{(j)}(x)}{j!}(X_i - x)^j\right)^2. \quad (3.15)
This expression is still not workable: it depends on m^{(j)}(x), j = 0, . . . , p, which of course are unknown! The great idea is to set β_j := m^{(j)}(x)/j! and turn (3.15) into a linear regression problem where the unknown parameters are β = (β_0, β_1, . . . , β_p)′:

\sum_{i=1}^{n} \left(Y_i - \sum_{j=0}^{p} \beta_j (X_i - x)^j\right)^2. \quad (3.16)
While doing so, an estimate of β will automatically yield estimates for m^{(j)}(x), j = 0, . . . , p, and we know how to obtain β̂ by minimizing (3.16). The final touch is to make the contributions of Xi dependent on the distance to x by weighting with kernels:

\hat\beta := \arg\min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^{n} \left(Y_i - \sum_{j=0}^{p} \beta_j (X_i - x)^j\right)^2 K_h(x - X_i). \quad (3.17)
Denoting

X := \begin{pmatrix} 1 & X_1 - x & \cdots & (X_1 - x)^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & \cdots & (X_n - x)^p \end{pmatrix}_{n \times (p+1)}

and

W := \mathrm{diag}(K_h(X_1 - x), \ldots, K_h(X_n - x)), \quad Y := \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n \times 1},

we can re-express (3.17) as a weighted least squares problem whose exact solution is

\hat\beta = \arg\min_{\beta \in \mathbb{R}^{p+1}} (Y - X\beta)'W(Y - X\beta) \quad (3.18)
\phantom{\hat\beta} = (X'WX)^{-1}X'WY. \quad (3.19)
Exercise 3.5. Using the equalities given in Exercise 3.1, prove (3.19).
The estimate for m(x) is then computed as

\hat m(x; p, h) := \hat\beta_0 = e_1'(X'WX)^{-1}X'WY = \sum_{i=1}^{n} W_i^p(x) Y_i,
where Wip (x) := e′1 (X′ WX)−1 X′ Wei and ei is the i-th canonical vector. Just as the Nadaraya-Watson, the
local polynomial estimator is a linear combination of the responses. Two cases deserve special attention:
• p = 0 is the local constant estimator or the Nadaraya-Watson estimator (Exercise 3.6). In this situation,
the estimator has explicit weights, as we saw before:
W_i^0(x) = \frac{K_h(x - X_i)}{\sum_{j=1}^{n} K_h(x - X_j)}.
• p = 1 is the local linear estimator, which has weights equal to (Exercise 3.9):
# Bandwidth
h <- 0.25
lp0 <- locpoly(x = X, y = Y, bandwidth = h, degree = 0, range.x = c(-10, 10),
gridsize = 500)
lp1 <- locpoly(x = X, y = Y, bandwidth = h, degree = 1, range.x = c(-10, 10),
gridsize = 500)
# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lp0$x, lp0$y, col = 2)
lines(lp1$x, lp1$y, col = 3)
legend("bottom", legend = c("True regression", "Local constant",
"Local linear"),
lwd = 2, col = 1:3)
# Simple plot of local polynomials for varying h's
manipulate({
  # Plot data
  lpp <- locpoly(x = X, y = Y, bandwidth = h, degree = p, range.x = c(-10, 10),
                 gridsize = 500)
  plot(X, Y)
  rug(X, side = 1); rug(Y, side = 2)
  lines(xGrid, m(xGrid), col = 1)
  lines(lpp$x, lpp$y, col = p + 2)
  legend("bottom", legend = c("True regression", "Local polynomial fit"),
         lwd = 2, col = c(1, p + 2))
}, p = slider(min = 0, max = 4, initial = 0, step = 1),  # Slider ranges assumed
   h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))
where

B_p(x) = \frac{\mu_2(K)}{2}\left\{m''(x) + 2\,\frac{m'(x)f'(x)}{f(x)}\, 1_{\{p = 0\}}\right\}.
The bias and variance expressions (3.21) and (3.22) yield interesting insights:
• The bias decreases with h quadratically for p = 0, 1. The bias at x is directly proportional to m''(x) if p = 1 and affected by m''(x) if p = 0. This has the same interpretation as in the density setting:
– The bias is negative in concave regions, i.e. {x ∈ R : m''(x) < 0}. These regions correspond to peaks and modes of m.
– Conversely, the bias is positive in convex regions, i.e. {x ∈ R : m''(x) > 0}. These regions correspond to valleys of m.
– The wilder the curvature m'', the harder it is to estimate m.
• The bias for p = 0 at x is affected by m'(x), f'(x), and f(x). Precisely, the lower the density f(x), the larger the bias. And the faster m and f change at x, the larger the bias. Thus the bias of the local constant estimator is much more sensitive to m(x) and f(x) than that of the local linear estimator (which is only sensitive to m''(x)). In particular, the fact that it depends on f'(x) and f(x) is referred to as the design bias, since it depends merely on the predictor's distribution.
• The variance depends directly on σ²(x)/f(x) for p = 0, 1. As a consequence, the lower the density and the larger the conditional variance, the more variable m̂(·; p, h) is. The variance decreases at a factor of (nh)⁻¹ due to the effective sample size.
An extended version of Theorem 3.2, given in Theorem 3.1 of Fan and Gijbels (1996), shows that odd order
polynomial fits are preferable to even order polynomial fits. The reason is that odd orders introduce
an extra coefficient for the polynomial fit that allows to reduce the bias, while at the same time they keep
the variance unchanged. In summary, the conclusions of the above analysis of p = 0 vs. p = 1, namely that
p = 1 has smaller bias than p = 0 (but of the same order) and the same variance as p = 0, extend to the
case p = 2ν vs. p = 2ν + 1, ν ∈ N. This allows to claim that local polynomial fitting is an odd world (Fan
and Gijbels (1996)).
Finally, we have the asymptotic pointwise normality of the estimator.
Theorem 3.3. Assume that E[(Y − m(x))^{2+δ} | X = x] < ∞ for some δ > 0. Then, under A1–A5,

\sqrt{nh}\,(\hat m(x; p, h) - \mathbb{E}[\hat m(x; p, h)]) \overset{d}{\longrightarrow} \mathcal{N}\left(0, \frac{R(K)\sigma^2(x)}{f(x)}\right), \quad (3.23)

\sqrt{nh}\,(\hat m(x; p, h) - m(x) - B_p(x)h^2) \overset{d}{\longrightarrow} \mathcal{N}\left(0, \frac{R(K)\sigma^2(x)}{f(x)}\right). \quad (3.24)
3.4 Bandwidth selection

Bandwidth selection is, as in kernel density estimation, of key practical importance. Several bandwidth
selectors, in the spirit of the plug-in and cross-validation ideas discussed in Section 2.4 have been proposed.
There are, for example, a rule-of-thumb analogue (see Section 4.2 in Fan and Gijbels (1996)) and a direct
plug-in analogue (see Section 5.8 in Wand and Jones (1995)). For simplicity, we will focus only on the
cross-validation bandwidth selector.
Following an analogy with the fit of the linear model, we could look for the bandwidth h that minimizes an RSS of the form

\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat m(X_i; p, h))^2. \quad (3.25)
However, this is a bad idea. Attempting to minimize (3.25) always leads to h ≈ 0 that results in a useless
interpolation of the data. Let’s see an example.
# Grid for representing (3.25)
hGrid <- seq(0.1, 1, l = 200)^2
error <- sapply(hGrid, function(h)
mean((Y - mNW(x = X, X = X, Y = Y, h = h))^2))
# Error curve
plot(hGrid, error, type = "l")
rug(hGrid)
abline(v = hGrid[which.min(error)], col = 2)
The root of the problem is the comparison of Yi with m̂(Xi; p, h), since nothing forbids h → 0 and, as a consequence, m̂(Xi; p, h) → Yi. We can change this behavior if we compare Yi with m̂−i(Xi; p, h), the leave-one-out estimate computed without the i-th datum (Xi, Yi):
\mathrm{CV}(h) := \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat m_{-i}(X_i; p, h))^2, \qquad \hat h_{\mathrm{CV}} := \arg\min_{h > 0} \mathrm{CV}(h).
The optimization of the above criterion might seem computationally expensive, since apparently it requires computing n regressions for a single evaluation of the objective function. Fortunately, the next result simplifies the computation.

Proposition 3.1. The weights of the leave-one-out estimator \hat m_{-i}(x; p, h) = \sum_{j \neq i} W_{-i,j}^p(x) Y_j can be obtained from \hat m(x; p, h) = \sum_{i=1}^{n} W_i^p(x) Y_i:

\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{Y_i - \hat m(X_i; p, h)}{1 - W_i^p(X_i)}\right)^2.
# Objective function
cvNW <- function(X, Y, h, K = dnorm) {
sum(((Y - mNW(x = X, X = X, Y = Y, h = h, K = K)) /
(1 - K(0) / colSums(K(outer(X, X, "-") / h))))^2)
}
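The helper bw.cv.grid used below is not shown in this extract; a minimal grid-search wrapper around cvNW along these lines could be used (the bandwidth grid is an assumption):

# Grid search of the CV bandwidth (sketch)
bw.cv.grid <- function(X, Y, h.grid = diff(range(X)) * (seq(0.05, 0.5, l = 200))^2,
                       K = dnorm, plot.cv = FALSE) {
  obj <- sapply(h.grid, function(h) cvNW(X = X, Y = Y, h = h, K = K))
  h <- h.grid[which.min(obj)]
  if (plot.cv) {
    plot(h.grid, obj, type = "o")
    rug(h.grid)
    abline(v = h, col = 2, lwd = 2)
  }
  h
}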
# Bandwidth
h <- bw.cv.grid(X = X, Y = Y, plot.cv = TRUE)
h
## [1] 0.3117806
# Plot result
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
3.5 Local likelihood

We saw in Theorem 3.1 that, under assumptions i–iv, the maximum likelihood estimate of β in the linear model was equivalent to the least squares estimate, β̂_ML = (X′X)⁻¹X′Y. The reason was the form of the conditional (on X1, . . . , Xp) log-likelihood:

\ell(\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \ldots - \beta_p X_{ip})^2.
If there is a single predictor X, polynomial fitting of order p of the conditional mean can be achieved by the well-known trick of identifying the j-th predictor Xj in (3.26) with X^j. This results in

Y | X \sim \mathcal{N}(\beta_0 + \beta_1 X + \ldots + \beta_p X^p, \sigma^2). \quad (3.27)
Therefore, maximizing with respect to β the weighted log-likelihood of the linear model (3.27) around x,

\ell_{x,h}(\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1(X_i - x) - \ldots - \beta_p(X_i - x)^p)^2 K_h(x - X_i),

provides β̂0 = m̂(x; p, h), the local polynomial estimator, as it was obtained in (3.17).
The same idea can be applied for other parametric models. The family of generalized linear models8 , which
presents an extension of the linear model to different kinds of response variables, provides a particularly
convenient parametric framework. Generalized linear models are constructed by mapping the support of
Y to R by a link function g, and then modeling the transformed expectation by a linear model. Thus, a
generalized linear model for the predictors X1 , . . . , Xp , assumes
g (E[Y |X1 = x1 , . . . , Xp = xp ]) = β0 + β1 x1 + . . . + βp xp
or, equivalently, E[Y |X1 = x1, . . . , Xp = xp] = g^{-1}(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p),
together with a distribution assumption for Y |(X1 , . . . , Xp ). The following table lists some useful transfor-
mations and distributions that are adequate to model responses in different supports. Recall that the linear
and logistic models of Sections 3.1.1 and 3.1.2 are obtained from the first and second rows, respectively.
All the distributions in the table above are members of the exponential family of distributions, the family of distributions with pdf expressible as

f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\},

where a(·), b(·), and c(·, ·) are specific functions, φ is the scale parameter, and θ is the canonical parameter. If the canonical link function g is employed (the ones in the table above all are), then θ = g(µ) and, as a consequence,

\theta(x_1, \ldots, x_p) := g(\mathbb{E}[Y | X_1 = x_1, \ldots, X_p = x_p]).
Recall that, again, if there is only one predictor, identifying the j-th predictor Xj by X j in the above
expressions allows to consider p-th order polynomial fits for g (E[Y |X1 = x1 , . . . , Xp = xp ]).
8 The logistic model is a generalized linear model, as seen in Section 3.1.2
Construction of the local likelihood estimator. The animation shows how local likelihood fits in a neighbor-
hood of x are combined to provide an estimate of the regression function for binary response, which depends
on the polynomial degree, bandwidth, and kernel (gray density at the bottom). The data points are shaded
according to their weights for the local fit at x. Application also available here.
We illustrate the local likelihood principle for the logistic regression, the simplest non-linear model. In this case, the sample is (X1, Y1), . . . , (Xn, Yn) with Yi | Xi = xi ∼ Ber(logistic(θ(xi))), and the log-likelihood of β is

\ell(\beta) = \sum_{i=1}^{n} \left\{Y_i \log(\mathrm{logistic}(\theta(X_i))) + (1 - Y_i)\log(1 - \mathrm{logistic}(\theta(X_i)))\right\} = \sum_{i=1}^{n} \ell(Y_i, \theta(X_i)),

where we write \ell(y, \theta) = y\theta - \log(1 + e^{\theta}) to make explicit the dependence on θ(x), for clarity in the next developments, and implicit the dependence on β. The local log-likelihood of β around x is then
\ell_{x,h}(\beta) = \sum_{i=1}^{n} \ell(Y_i, \theta(X_i - x)) K_h(x - X_i). \quad (3.28)
Maximizing this local log-likelihood,

\hat\beta = \arg\max_{\beta \in \mathbb{R}^{p+1}} \ell_{x,h}(\beta),

gives the local likelihood estimate

\hat m_{\ell}(x; h, p) := g^{-1}(\hat\theta(x)) = \mathrm{logistic}(\hat\beta_0). \quad (3.29)
The code below shows three different ways of implementing the local logistic regression (of first degree) in R.
# Simulate some data
n <- 200
logistic <- function(x) 1 / (1 + exp(-x))
p <- function(x) logistic(1 - 3 * sin(x))
set.seed(123456)
9 If p = 1, then we have the usual logistic model.
10 No analytical solution for the optimization problem, numerical optimization is needed.
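The previous chunk is truncated: the generation of the sample (X, Y), the evaluation grid x, and the first two fits, fitGlm (kernel-weighted glm fits) and fitNlm (direct maximization of the local log-likelihood (3.28) with nlm), are not shown. A sketch of what they could look like follows; the predictor distribution, the bandwidth h, and the grid are assumptions:

# Assumed sample, grid, and bandwidth (the original generation is not shown)
X <- rnorm(n, sd = 1.5)
Y <- rbinom(n, size = 1, prob = p(X))
x <- seq(-3, 3, l = 200)
h <- 0.5

# fitGlm: for each x, fit a kernel-weighted logistic regression and keep theta(x) = beta0
fitGlm <- sapply(x, function(xx) {
  K <- dnorm(x = xx, mean = X, sd = h)
  suppressWarnings(
    glm.fit(x = cbind(1, X - xx), y = Y, weights = K,
            family = binomial())$coefficients[1]
  )
})

# fitNlm: maximize the local log-likelihood (3.28) directly with nlm
fitNlm <- sapply(x, function(xx) {
  K <- dnorm(x = xx, mean = X, sd = h)
  nlm(f = function(beta) {
    -sum(K * (Y * (beta[1] + beta[2] * (X - xx)) -
                log(1 + exp(beta[1] + beta[2] * (X - xx)))))
  }, p = c(0, 0))$estimate[1]
})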
# Using locfit
# Bandwidth cannot be controlled explicitly - only through nn in ?lp
library(locfit)
fitLocfit <- locfit(Y ~ lp(X, deg = 1, nn = 0.25), family = "binomial",
kern = "gauss")
# Compare fits
plot(x, p(x), ylim = c(0, 1.5), type = "l", lwd = 2)
lines(x, logistic(fitGlm), col = 2)
lines(x,logistic(fitNlm), col = 2, lty = 2)
plot(fitLocfit, add = TRUE, col = 4)
legend("topright", legend = c("p(x)", "glm", "nlm", "locfit"), lwd = 2,
col = c(1, 2, 2, 4), lty = c(1, 2, 1, 1))
Bandwidth selection can be done by means of likelihood cross-validation. The objective is to maximize the
local likelihood fit at (Xi , Yi ) but removing the influence by the datum itself. That is, maximizing
\mathrm{LCV}(h) = \sum_{i=1}^{n} \ell(Y_i, \hat\theta_{-i}(X_i)), \quad (3.30)

where θ̂−i(Xi) represents the local fit at Xi without the i-th datum (Xi, Yi). Unfortunately, the nonlinearity of (3.29) forbids a simplifying result like Proposition 3.1. Thus, in principle, it is required to fit n local likelihoods of sample size n − 1 to obtain a single evaluation of (3.30).
There is, however, an approximation (see Sections 4.3.3 and 4.4.3 of Loader (1999)) to (3.30) that only
requires a local likelihood fit for a single sample. We sketch its basis as follows, without aiming to go in full
detail. The approximation considers the first and second derivatives of ℓ(y, θ) with respect to θ, ℓ̇(y, θ) and
ℓ̈(y, θ). In the case of the logistic model, these are:
ℓ̇(y, θ) = y − logistic(θ),
ℓ̈(y, θ) = −logistic(θ)(1 − logistic(θ)).
\hat\theta_{-i}(X_i) \approx \hat\theta(X_i) - \mathrm{infl}(X_i)\, \dot\ell(Y_i, \hat\theta(X_i)), \quad (3.31)
where θ̂(Xi ) represents the local fit at Xi and the influence function is defined as (page 75 of Loader (1999))
X_x := \begin{pmatrix} 1 & X_1 - x & \cdots & (X_1 - x)^p/p! \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & \cdots & (X_n - x)^p/p! \end{pmatrix}_{n \times (p+1)}
and

\ell(Y_i, \hat\theta_{-i}(X_i)) = \ell(Y_i, \hat\theta(X_i)) - \mathrm{infl}(X_i)\left(\dot\ell(Y_i, \theta(X_i))\right)^2.
Therefore:

\mathrm{LCV}(h) = \sum_{i=1}^{n} \ell(Y_i, \hat\theta_{-i}(X_i)) \approx \sum_{i=1}^{n} \left\{\ell(Y_i, \hat\theta(X_i)) - \mathrm{infl}(X_i)\left(\dot\ell(Y_i, \hat\theta(X_i))\right)^2\right\}.
Recall that θ(Xi ) are unknown and hence they must be estimated.
We conclude by illustrating how to compute the LCV function and optimize it (keep in mind that much
more efficient implementations are possible!).
# Exact LCV - recall that we *maximize* the LCV!
h <- seq(0.1, 2, by = 0.1)
suppressWarnings(
LCV <- sapply(h, function(h) {
sum(sapply(1:n, function(i) {
K <- dnorm(x = X[i], mean = X[-i], sd = h)
nlm(f = function(beta) {
-sum(K * (Y[-i] * (beta[1] + beta[2] * (X[-i] - X[i])) -
log(1 + exp(beta[1] + beta[2] * (X[-i] - X[i])))))
}, p = c(0, 0))$minimum
}))
})
)
plot(h, LCV, type = "o")
abline(v = h[which.max(LCV)], col = 2)
3.6 Exercises
This is the list of evaluable exercises for Chapter 3. The number of stars represents an estimate of their
difficulty: easy (⋆), medium (⋆⋆), and hard (⋆ ⋆ ⋆).
Exercise 3.6. (theoretical, ⋆) Show that the local polynomial estimator yields the Nadaraya-Watson when
p = 0. Use (3.19) to obtain (3.12).
Exercise 3.7. (theoretical, ⋆⋆) Obtain the optimization problem for the local Poisson regression (for the
first degree) and the local binomial regression (of first degree also).
Exercise 3.8. (theoretical, ⋆⋆) Show that the Nadaraya-Watson is unbiased (in conditional expectation
with respect to X1 , . . . , Xn ) when the regression function is constant: m(x) = c, c ∈ R. Show the same for
the local linear estimator for a linear regression function m(x) = a + bx, a, b ∈ R. Hint: use (3.20).
Exercise 3.9. (theoretical, ⋆ ⋆ ⋆) Obtain the weight expressions (3.20) of the local linear estimator. Hint:
use the matrix inversion formula for 2 × 2 matrices.
Exercise 3.10. (theoretical, ⋆ ⋆ ⋆) Prove the two implications of Proposition 3.1 for the Nadaraya-Watson
estimator (p = 0).
Exercise 3.11. (practical, ⋆⋆, Example 4.6 in Wasserman (2006)) The dataset at http://www.stat.cmu.
edu/~larry/all-of-nonpar/=data/bpd.dat (alternative link) contains information about the presence of bron-
chopulmonary dysplasia (binary response) and the birth weight in grams (predictor) of 223 babies. Use
the function locfit of the locfit library with the argument family = "binomial" and plot its output.
Explore and comment on the resulting estimates, providing insights about the data.
Exercise 3.12. (practical, ⋆⋆) The ChickWeight dataset in R contains 578 observations of the weight and Time of chicks. Fit a local binomial or local Poisson regression of weight on Time. Use the function locfit of the locfit library with the argument family = "binomial" or family = "poisson" and explore the bandwidth effect. Explore and comment on the resulting estimates. What is the estimated expected time of a chick that weighs 200 grams?
Exercise 3.13. (practical, ⋆ ⋆ ⋆) Implement your own version of the local linear estimator. The function
must take a sample X, a sample Y, the points x at which the estimate should be obtained, the bandwidth h
and the kernel K. Test its correct behavior by estimating an artificial dataset that follows a linear model.
Exercise 3.14. (practical, ⋆ ⋆ ⋆) Implement your own version of the local likelihood estimator of first degree
for exponential response. The function must take a sample X, a sample Y, the points x at which the estimate
should be obtained, the bandwidth h and the kernel K. Test its correct behavior by estimating an artificial
dataset that follows a generalized linear model with exponential response, that is,
using a cross-validated bandwidth. Hint: use optim or nlm for optimizing a function in R.
Appendix A

Installation of R and RStudio
This is what you have to do in order to install R and RStudio in your own computer:
1. In Mac OS X, download and install first XQuartz and log out and back on your Mac OS X account
(this is an important step that is required for 3D graphics to work). Be sure that your Mac OS X
system is up-to-date.
2. Download the latest version of R at https://cran.r-project.org/. For Windows, you can download it
directly here. For Mac OS X you can download the latest version (at the time of writing this, 3.4.1)
here.
3. Install R. In Windows, be sure to select the 'Startup options' and then choose 'SDI' in the 'Display
Mode' options. Leave the rest of installation options as default.
4. Download the latest version of RStudio for your system at https://www.rstudio.com/products/rstudio/
download/#download and install it.
If there is any Linux user, kindly follow the corresponding instructions here for installing R, download RStudio
(only certain Ubuntu and Fedora versions are supported), and install it using your package manager.
Appendix B
Introduction to RStudio
RStudio is the most employed Integrated Development Environment (IDE) for R nowadays. When you start
RStudio you will see a window similar to Figure B.1. There are a lot of items in the GUI, most of them
described in the RStudio IDE Cheat Sheet. The most important things to keep in mind are:
1. The code is written in scripts in the source panel (upper-right panel in Figure B.1);
2. for running a line or code selection from the script in the console (first tab in the lower-right panel in
Figure B.1), you do it with the keyboard shortcut 'Ctrl+Enter' (Windows and Linux) or 'Cmd+Enter'
(Mac OS X).
Figure B.1: Main window of RStudio. The red box shows the code panel and the yellow box shows the console output. Extracted from https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf.
Appendix C
Introduction to R
This section provides a collection of self-explainable snippets of the programming language R (R Core Team,
2017) that show the very basics of the language. It is not meant to be an exhaustive introduction to R, but
rather a reminder/panoramic of a collection of basic functions and methods.
In the following, # denotes comments to the code and ## outputs of the code.
Simple computations
## [1] 3.321928
sin(pi); cos(0); asin(0)
## [1] 1.224647e-16
## [1] 1
## [1] 0
tan(pi/3)
## [1] 1.732051
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
# Remember to close the parenthesis
1 +
(1 + 3
## Error: <text>:4:0: unexpected end of input
## 2: 1 +
## 3: (1 + 3
## ^
Exercise C.1. Compute:
• (e² + sin(2)) / (cos⁻¹(1/2) + 2). Answer: 2.723274.
• √(3^2.5 + log(10)). Answer: 4.22978.
• (2^0.93 − log₂(3 + √(2 + sin(1)))) · 10^{tan(1/3)} · √(3^2.5 + log(10)). Answer: -3.032108.
# Any operation that you perform in R can be stored in a variable (or object)
# with the assignment operator "<-"
x <- 1
# Different
X <- 3
x; X
## [1] 2
## [1] 3
# Remove variables
rm(X)
X
## Error in eval(expr, envir, enclos): object 'X' not found
Exercise C.2. Do the following:
• Store −123 in the variable y.
• Store the log of the square of y in z.
• Store (y − z)/(y + z²) in y and remove z.
Vectors
# Create a simple vector (its creation was lost in this extract)
myData <- c(1, 2)
# Entrywise operations
myData + 1
## [1] 2 3
myData^2
## [1] 1 4
myData[1] <- 0
myData
## [1] 0 2
# Another vector (its creation was lost in this extract)
myData2 <- c(-4.12, 0, 1.1, 1, 3, 4)
# If you want to access all the elements except a position, use [-position]
myData2[-1]
## [1] 0.0 1.1 1.0 3.0 4.0
myData2[-2]
## [1] -4.12 1.10 1.00 3.00 4.00
# And also
myData2[-c(1, 2)]
## [1] 1.1 1.0 3.0 4.0
Some functions
# Functions take arguments between parenthesis and transform them into an output
sum(myData)
## [1] 2
prod(myData)
## [1] 0
# Summary of an object
summary(myData)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##  0.0     0.5    1.0  1.0     1.5  2.0
# Usually the functions have several arguments, which are set by "argument = value"
# In this case, the second argument is a logical flag to indicate the kind of sorting
sort(myData) # If nothing is specified, decreasing = FALSE is assumed
## [1] 0 2
sort(myData, decreasing = TRUE)
## [1] 2 0
# Do not know what are the arguments of a function? Use args and help!
args(mean)
## function (x, ...)
## NULL
?mean
Exercise C.4. Do the following:
• Compute the mean, median and variance of y. Answers: 49.5, 49.5, 843.6869.
• Do the same for y + 1. What are the differences?
• What is the maximum of y? Where is it placed?
• Sort y increasingly and obtain the 5th and 76th positions. Answer: c(4,75).
• Compute the covariance between y and y. Compute the variance of y. Why do you get the same
result?
## [2,] 2 4
# Another matrix
B <- matrix(1, nrow = 2, ncol = 2, byrow = TRUE)
B
## [,1] [,2]
## [1,] 1 1
## [2,] 1 1
# Entrywise operations
A + 1
## [,1] [,2]
## [1,] 2 4
## [2,] 3 5
A * B
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
# Accessing elements
A[2, 1] # Element (2, 1)
## [1] 2
A[1, ] # First row - this is a vector
## [1] 1 3
A[, 2] # Second column - this is a vector
## [1] 3 4
# Matrix transpose
t(A)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
# Matrix multiplication
A %*% B
## [,1] [,2]
## [1,] 4 4
## [2,] 6 6
A %*% B[, 1]
## [,1]
## [1,] 4
## [2,] 6
A %*% B[1, ]
## [,1]
## [1,] 4
## [2,] 6
# Care is needed
A %*% B[1, , drop = FALSE] # Incompatible product
## Error in A %*% B[1, , drop = FALSE]: non-conformable arguments
# The nice thing is that you can access variables by its name with
# the "$" operator
myDf$newname1
## [1] 1 2
## 1 1 3 0
## 2 2 4 1
• Create a matrix called M with rows given by y[3:5], y[3:5]^2 and log(y[3:5]).
• Create a data frame called myDataFrame with column names “y”, “y2” and “logy” containing the
vectors y[3:5], y[3:5]^2 and log(y[3:5]), respectively.
• Create a list, called l, with entries for x and M. Access the elements by their names.
## y y2 logy
## 1 9.180087 18.33997 3.242862
## 2 9.159678 18.29895 3.238784
## 3 9.139059 18.25750 3.234656
Vector-related functions
# Repeat number
rep(0, 5)
## [1] 0 0 0 0 0
# Reverse a vector
myVec <- c(1:5, -1:3)
rev(myVec)
## [1] 3 2 1 0 -1 5 4 3 2 1
# Another way
myVec[length(myVec):1]
## [1] 3 2 1 0 -1 5 4 3 2 1
# Smaller than
0 < 1
## [1] TRUE
# Greater than
1 > 1
## [1] FALSE
# Greater or equal to
1 >= 1 # Remember: ">=" and not "=>" !
## [1] TRUE
# Smaller or equal to
2 <= 1 # Remember: "<=" and not "=<" !
## [1] FALSE
# Equal
1 == 1 # Tests equality. Remember: "==" and not "=" !
## [1] TRUE
# Unequal
1 != 0 # Tests inequality
## [1] TRUE
# In a vector-like fashion
x <- 1:5
y <- c(0, 3, 1, 5, 2)
x < y
## [1] FALSE TRUE FALSE TRUE FALSE
x == y
## [1] FALSE FALSE FALSE FALSE FALSE
x != y
## [1] TRUE TRUE TRUE TRUE TRUE
# Subsetting of vectors
x
## [1] 1 2 3 4 5
x[x >= 2]
## [1] 2 3 4 5
x[x < 3]
## [1] 1 2
## 2 1 2
## 3 3 3
## 4 3 4
## 5 0 5
# In an example
iris$Sepal.Width[iris$Sepal.Width > 3]
## NULL
# In an example
summary(iris)
## species sepal.len sepal.wid petal.len
## versicolor:50 Min. :4.900 Min. :2.000 Min. :3.000
## virginica :50 1st Qu.:5.800 1st Qu.:2.700 1st Qu.:4.375
## Median :6.300 Median :2.900 Median :4.900
## Mean :6.262 Mean :2.872 Mean :4.906
## 3rd Qu.:6.700 3rd Qu.:3.025 3rd Qu.:5.525
## Max. :7.900 Max. :3.800 Max. :6.900
## petal.wid
## Min. :1.000
## 1st Qu.:1.300
## Median :1.600
## Mean :1.676
## 3rd Qu.:2.000
## Max. :2.500
summary(iris[iris$Sepal.Width > 3, ])
## species sepal.len sepal.wid petal.len petal.wid
## versicolor:0 Min. : NA Min. : NA Min. : NA Min. : NA
## virginica :0 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
# Subset argument in lm
lm(Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 3)
## Error in eval(predvars, data, env): object 'Sepal.Width' not found
lm(Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 3)
## Error in eval(predvars, data, env): object 'Sepal.Width' not found
# Both iris$Sepal.Width and Sepal.Width in subset are fine: data = iris
# tells R to look for Sepal.Width in the iris dataset
# OR operator "|"
TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE
# In an example
summary(iris[iris$Sepal.Width > 3 & iris$Sepal.Width < 3.5, ])
## species sepal.len sepal.wid petal.len petal.wid
## versicolor:0 Min. : NA Min. : NA Min. : NA Min. : NA
## virginica :0 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
Plotting functions
# Alternatively
plot(iris[, 1:2], main = "Sepal.Length vs Sepal.Width")
# "plot" is a first level plotting function. That means that whenever is called,
# it creates a new plot. If we want to add information to an existing plot, we
# have to use a second level plotting function such as "points", "lines" or "abline"
Distributions
# If you want to have always the same result, set the seed of the random number
# generator
set.seed(45678)
rnorm(n = 10, mean = 0, sd = 1)
## [1] 1.4404800 -0.7195761 0.6709784 -0.4219485 0.3782196 -1.6665864
## [7] -0.5082030 0.4433822 -1.7993868 -0.6179521
# Plotting the distribution function of a N(0, 1)
x <- seq(-4, 4, l = 100)
y <- pnorm(q = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a N(0, 1)
qnorm(p = 0.95, mean = 0, sd = 1)
## [1] 1.644854
# Computing the 95% quantile for a U(0, 1)
qunif(p = 0.95, min = 0, max = 1)
## [1] 0.95
# Plotting the distribution function of a Bi(10, 0.5)
x <- 0:10
y <- pbinom(q = x, size = 10, prob = 0.5)
plot(x, y, type = "h")
Exercise C.10. Do the following:
• Compute the 90%, 95% and 99% quantiles of a F distribution with df1 = 1 and df2 = 5. Answer:
c(4.060420, 6.607891, 16.258177).
• Plot the distribution function of a U (0, 1). Does it make sense with its density function?
• Sample 100 points from a Poisson with lambda = 5.
Functions
# This is a silly function that takes x and y and returns its sum
# Note the use of "return" to indicate what should be returned
add <- function(x, y) {
z <- x + y
return(z)
}
# Calling add - you need to run the definition of the function first!
add(x = 1, y = 2)
## [1] 3
add(1, 1) # Arguments names can be omitted
## [1] 2
# A more complex function: computes a linear model and its posterior summary.
# Saves us a few keystrokes when computing a lm and a summary
lmSummary <- function(formula, data) {
model <- lm(formula = formula, data = data)
summary(model)
}
# If no return(), the function returns the value of the last expression
# Usage
lmSummary(Sepal.Length ~ Petal.Width, iris)
## Error in eval(predvars, data, env): object 'Sepal.Length' not found
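The helper addLine used below is not defined in this extract; a minimal definition consistent with its usage (an assumption) is:

# Hypothetical definition of addLine: draws the line beta0 + beta1 * x
addLine <- function(x, beta0, beta1) {
  lines(x, beta0 + beta1 * x, col = 2)
}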
# Usage
plot(x, y)
addLine(x, beta0 = 0.1, beta1 = 0)
# The function "sapply" allows to sequentially apply a function
sapply(1:10, sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
sqrt(1:10) # The same
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
# The advantage of "sapply" is that you can use with any function
myFun <- function(x) c(x, x^2)
sapply(1:10, myFun) # Returns a 2 x 10 matrix
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 1 4 9 16 25 36 49 64 81 100
sapply(1:10, sumSeries)
# "apply" applies iteratively a function to rows (1) or columns (2)
# of a matrix or data frame
A <- matrix(1:10, nrow = 5, ncol = 2)
A
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
apply(A, 1, sum) # Applies the function by rows
## [1] 7 9 11 13 15
apply(A, 2, sum) # By columns
## [1] 15 40
Control structures
# The "for" statement allows to create loops that run along a given vector
# Prints 5 times a message (i varies in 1:5)
for (i in 1:5) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
# Another example
a <- 0
for (i in 1:3) {
a <- i + a
}
a
## [1] 6
for (i in 1:1000) {
x <- (x + 0.01) * x
print(x)
if (x > 100) {
break
}
}
## [1] 1.01
## [1] 1.0302
## [1] 1.071614
## [1] 1.159073
## [1] 1.35504
## [1] 1.849685
## [1] 3.439832
## [1] 11.86684
## [1] 140.9406
Exercise C.12. Do the following:
• Compute C_{n×k} in C_{n×k} = A_{n×m} B_{m×k} from A and B. Use that c_{i,j} = \sum_{l=1}^{m} a_{i,l} b_{l,j}. Test the implementation with simple examples.
• Create a function that samples a N(0, 1) and returns the first sampled point that is larger than 4.
• Create a function that simulates N samples from the distribution of max(X1, . . . , Xn), where X1, . . . , Xn are iid U(0, 1).
Bibliography
Allaire, J. J., Cheng, J., Xie, Y., McPherson, J., Chang, W., Allen, J., Wickham, H., Atkins, A., Hyndman,
R., and Arslan, R. (2017). rmarkdown: Dynamic Documents for R. R package version 1.6.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer Texts in Statistics. Springer,
New York.
Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation. Wiley Series in Probability and
Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons, Inc., New York. The
L1 view.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and its Applications, volume 66 of Monographs
on Statistics and Applied Probability. Chapman & Hall, London.
Loader, C. (1999). Local regression and likelihood. Statistics and Computing. Springer-Verlag, New York.
Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. Ann. Statist., 20(2):712–736.
Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist., 33(3):1065–
1076.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist.,
27(3):832–837.
Ruppert, D. and Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist.,
22(3):1346–1370.
Scott, D. W. (2015). Multivariate Density Estimation. Wiley Series in Probability and Statistics. John Wiley
& Sons, Inc., Hoboken, second edition. Theory, practice, and visualization.
Scott, D. W. and Terrell, G. R. (1987). Biased and unbiased cross-validation in density estimation. J. Am.
Stat. Assoc., 82(400):1131–1146.
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density
estimation. J. Roy. Statist. Soc. Ser. B, 53(3):683–690.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Monographs on Statistics
and Applied Probability. Chapman & Hall, London.
van der Vaart, A. W. (1998). Asymptotic Statistics, volume 3 of Cambridge Series in Statistical and Proba-
bilistic Mathematics. Cambridge University Press, Cambridge.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing, volume 60 of Monographs on Statistics and Applied
Probability. Chapman & Hall, Ltd., London.
Wasserman, L. (2004). All of Statistics. Springer Texts in Statistics. Springer-Verlag, New York. A concise
course in statistical inference.
Wasserman, L. (2006). All of Nonparametric Statistics. Springer Texts in Statistics. Springer, New York.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā Ser. A, 26:359–372.
Xie, Y. (2015). Dynamic Documents with R and knitr. The R Series. Chapman & Hall/CRC, Boca Raton.
Xie, Y. (2016). bookdown: Authoring Books and Technical Documents with R Markdown. The R Series.
Chapman & Hall/CRC, Boca Raton.