Statistical Methods For Bioinformatics Lecture 5
On the Wage data set, a natural cubic spline with 15 degrees of freedom
is compared to a degree-15 polynomial. Polynomials can show wild
behavior, especially near the tails.
Alternatives: splitting up
We can break up the range of X into bins, giving an ordered
categorical variable with an estimated mean per bin.
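The binning idea can be sketched in numpy; the age/wage-style data below are hypothetical stand-ins for the real data set:

```python
import numpy as np

# Hypothetical wage-vs-age style data.
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=200)
wage = 50.0 + 0.5 * age + rng.normal(0.0, 5.0, size=200)

# Break the range of age into 4 equal-width bins; the fit is a step function
# whose level in each bin is the mean response of the points falling in it.
edges = np.linspace(age.min(), age.max(), 5)
bin_idx = np.clip(np.digitize(age, edges) - 1, 0, 3)
bin_means = np.array([wage[bin_idx == b].mean() for b in range(4)])
fitted = bin_means[bin_idx]
```

Prediction for a new point is simply the mean of whichever bin it falls into.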
The Wage data. Left: the solid curve is the fitted value from a least squares
regression of wage (in thousands of dollars) using step functions of age; dotted
curves indicate the 95% confidence interval. Right: model of the binary event
wage>250k using logistic regression with step functions of age, showing the
posterior probability.
In R you can fit a cubic regression spline with the ns function
from the splines package
d_k(X) = \frac{(X - \xi_k)_+^3 - (X - \xi_K)_+^3}{\xi_K - \xi_k}
(from Elements of Statistical Learning)
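A minimal numpy sketch of this basis term; the knot values used below are arbitrary illustrations:

```python
import numpy as np

def d_k(x, knot_k, knot_K):
    """One term of the natural cubic spline basis (ESL eq. 5.5):
    d_k(X) = ((X - xi_k)^3_+ - (X - xi_K)^3_+) / (xi_K - xi_k),
    where (u)^3_+ is the truncated cube max(u, 0)^3."""
    pos_cube = lambda u: np.maximum(u, 0.0) ** 3
    return (pos_cube(x - knot_k) - pos_cube(x - knot_K)) / (knot_K - knot_k)

# ESL builds the basis as N_1 = 1, N_2 = X, N_{k+2}(X) = d_k(X) - d_{K-1}(X);
# the quadratic parts of the two d terms cancel, so each N_{k+2} is exactly
# linear beyond the last knot xi_K -- the natural-spline boundary constraint.
```

Checking linearity beyond the last knot: the second difference of N(x) = d_k(x) − d_{K−1}(x) at points past ξ_K is zero.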
Purpose:
Provide a good fit to the data to explore and present the
relationship between the explanatory variable and the response
variable
To obtain a curve estimate that does not display too much
rapid fluctuation
How do we compromise between these two rather different
aims in curve estimation?
Smoothing splines penalize roughness, quantified by the integrated
squared second derivative; the fitted g minimizes

\sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt
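On an evenly spaced grid the penalty can be approximated with second differences, which gives a closed-form minimizer; this is a discrete sketch of the idea, not the exact smoothing-spline solver, and the data are simulated:

```python
import numpy as np

# Discrete analogue of the smoothing criterion:
#   minimize ||y - g||^2 + lam * ||D2 g||^2,  where D2 g approximates g''.
rng = np.random.default_rng(1)
n = 100
t = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * t) + rng.normal(0.0, 0.3, n)

D2 = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference matrix
lam = 50.0                             # roughness penalty weight lambda

# Setting the gradient to zero gives (I + lam * D2'D2) g = y.
g = np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)
```

Increasing lam trades fidelity to the data for a smoother curve; the penalized solution is always at least as smooth (in the D2 sense) as the raw data.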
Local regression, where the blue curve represents the generating function and
the orange curve corresponds to the local regression estimate f̂(x). The yellow
bell shape superimposed on the plot indicates the weights assigned to each point,
decreasing to zero with distance from the target point.
Local Regression; LOESS
Choices:
The weighting function
a continuous, bounded, and symmetric real function
a running mean is known as the box kernel
a (truncated) Gaussian is a natural candidate
The weighting function comes with a range parameter,
e.g. the span: the fraction of the dataset considered by the
kernel
Type of regression function
Advantages: very flexible fit
Disadvantages:
Requires dense data to work well
No closed functional definition
a memory-based procedure
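The procedure can be sketched in numpy for a single target point; the span value and data are illustrative:

```python
import numpy as np

def local_regression(x, y, x0, span=0.3):
    """LOESS-style sketch: weighted linear fit around target x0 using a
    tricube kernel on the span-nearest fraction of the data."""
    n = len(x)
    k = max(2, int(np.ceil(span * n)))        # points inside the window
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                   # k nearest neighbours of x0
    h = d[idx].max()                          # window half-width
    w = (1.0 - (d[idx] / h) ** 3) ** 3        # tricube weights, 0 at the edge
    X = np.column_stack([np.ones(k), x[idx] - x0])   # local linear design
    sw = np.sqrt(w)                           # weighted least squares via sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y[idx], rcond=None)
    return beta[0]                            # fitted value at x0
```

Note that every call re-sorts and re-fits using the raw observations: nothing is precomputed, which is exactly what makes this a memory-based procedure.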
Question (from page 281): why do we "need all the training data
each time we wish to compute a prediction"? Because local regression
stores no global parametric fit: each prediction re-fits a weighted
regression using the observations nearest the new target point, so
the raw training data must be kept available.
When one may not assume that most of the genes are
unchanged between the two conditions, applying this method
may normalize out true biological differences.
Another issue of normalization involves the spread of the M
values across the array, which may depend on the array itself
and not on the biology.
In real experiments there are normally many biases and
random effects.
g(y_i) = \beta_0 + \sum_{j=1}^{p} \beta_j f_j(x_{ij}) + \varepsilon_i

becomes

g(y_i) = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i
\hat{\beta} = (X^T X)^{-1} X^T y

For a GAM, OLS is not defined in general
Backfitting
1 Initialize: \beta_0 = \bar{y}, \; f_j = f_j^0, \; j = 1, \dots, p
2 Cycle: j = 1, \dots, p, 1, \dots, p, \dots

f_j = S_j\!\left( y - \beta_0 - \sum_{k \neq j} f_k \,\middle|\, x_j \right)
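The two backfitting steps can be sketched in numpy; the running-mean smoother and the simulated two-predictor data below are hypothetical stand-ins for the S_j and the real data:

```python
import numpy as np

def running_mean_smoother(x, r, k=7):
    """Crude smoother S_j: running mean of residuals r ordered by x
    (zero-padded at the boundaries)."""
    order = np.argsort(x)
    sm = np.convolve(r[order], np.ones(k) / k, mode="same")
    out = np.empty_like(sm)
    out[order] = sm
    return out

# Simulated additive data: y = 1 + sin(3*x1) + x2^2 + noise.
rng = np.random.default_rng(2)
n = 300
X = rng.uniform(-1.0, 1.0, size=(n, 2))
y = 1.0 + np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.1, n)

beta0 = y.mean()                 # step 1: initialize beta_0 = y-bar
f = np.zeros((n, 2))             # and f_j = 0
for _ in range(20):              # step 2: cycle over j
    for j in range(2):
        partial = y - beta0 - f[:, 1 - j]   # subtract all other f_k
        f[:, j] = running_mean_smoother(X[:, j], partial)
        f[:, j] -= f[:, j].mean()           # center f_j for identifiability
fitted = beta0 + f.sum(axis=1)
```

Each pass smooths the partial residuals against one predictor at a time; centering each f_j keeps the intercept identifiable.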
For the Wage data, plots of the relationship between each feature and the response,
wage, in the fitted model wage = β0 + f1 (year ) + f2 (age) + f3 (education) + ε. Each
plot displays the fitted function and point-wise standard errors. The first two functions
are natural splines in year and age, with four and five degrees of freedom, respectively.
The third function is a step function, fit to the qualitative variable education.
\log\!\left( \frac{p(y_i)}{1 - p(y_i)} \right) = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij})
y = β0 + β2 X2
Calculate the RSS = \sum_i (y_i - \hat{y}_i)^2 for the complete (c) and
reduced (r) models, note the number of degrees of freedom used
(df) and the remaining degrees of freedom for the
complete model (n - df_c). Calculate the F-statistic:

F = \frac{(RSS_r - RSS_c)/(df_c - df_r)}{RSS_c/(n - df_c)}
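A numpy sketch of this model comparison on simulated data (the variable names and effect sizes are illustrative):

```python
import numpy as np

# Hypothetical data: X1 has no true effect, so the F-test compares the
# complete model (intercept, X1, X2) against the reduced model (intercept, X2).
rng = np.random.default_rng(3)
n = 50
X1, X2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * X2 + rng.normal(size=n)

def rss_and_df(design):
    """RSS of the least-squares fit and the degrees of freedom it uses."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2), design.shape[1]

ones = np.ones(n)
rss_c, df_c = rss_and_df(np.column_stack([ones, X1, X2]))  # complete model
rss_r, df_r = rss_and_df(np.column_stack([ones, X2]))      # reduced model

# F = [(RSS_r - RSS_c)/(df_c - df_r)] / [RSS_c/(n - df_c)]
F = ((rss_r - rss_c) / (df_c - df_r)) / (rss_c / (n - df_c))
```

Because the reduced model is nested in the complete model, RSS_r is never smaller than RSS_c, so F is always nonnegative; a large F favors keeping the extra terms.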