Statistical Methods For Bioinformatics Lecture 5

Statistical Methods for Bioinformatics

II-4: Beyond Linearity

Today

Non-linearity in the (Generalized) Linear Model

limitations of polynomial global fits
Linear Model of Basis Functions
Splines
Cubic Regression Spline and the truncated power-basis function
Natural Cubic Regression Spline
Smoothing Spline
Non-parametric regression
LOESS
Example application of non-linear models
Generalized additive models

Beyond Linearity

When a predictor has a non-linear relationship with the response variable, the default approach is to transform the predictor so that the basic linear form is maintained:
$$g(Y) = \beta_0 + \beta_1 x_1 + \ldots + \beta_m x_m + \varepsilon$$
A simple transformation may suffice, e.g. a log or root transformation.
The traditional approach is to use a polynomial expansion:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \ldots + \beta_d x_i^d + \varepsilon_i$$

The problem with polynomials
A polynomial series generates a global fit; i.e. it describes the
whole range of the predictor.
Tweaking the coefficients for one region can cause the
function to flap about madly in more remote, data-sparse,
regions.

On the Wage data set, a natural cubic spline with 15 degrees of freedom
is compared to a degree-15 polynomial. Polynomials can show wild
behavior, especially near the tails.
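The tail behaviour can be explored in R. The following is a minimal sketch on simulated data (the variables x and y are made up and are not the Wage data); it prints the pointwise standard errors of a degree-15 polynomial and of a natural cubic spline with 15 degrees of freedom, at the left edge and in the middle of the range.

```r
library(splines)  # ns(); part of the standard R distribution

set.seed(1)
n <- 300
x <- sort(runif(n, 18, 80))                     # hypothetical predictor
y <- 60 + 40 * sin(x / 15) + rnorm(n, sd = 10)  # hypothetical response

fit_poly <- lm(y ~ poly(x, 15))     # global degree-15 polynomial
fit_ns   <- lm(y ~ ns(x, df = 15))  # natural cubic spline, 15 df

grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
se_poly <- predict(fit_poly, grid, se.fit = TRUE)$se.fit
se_ns   <- predict(fit_ns,   grid, se.fit = TRUE)$se.fit

# Standard error at the left edge versus the median over the whole range
round(c(poly_edge = se_poly[1], poly_mid = median(se_poly),
        ns_edge   = se_ns[1],   ns_mid   = median(se_ns)), 2)
```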
Alternatives: splitting up
We can break up the range of X into bins, which turns X into an ordered categorical variable with an estimated mean per bin.

The Wage data. Left: the solid curve shows the fitted values from a least squares regression of wage (in thousands) using step functions of age. The dotted curves indicate the 95% confidence interval. Right: model of the binary event wage > 250k using logistic regression with step functions of age, showing the fitted posterior probability.

Step functions

With step functions you define a separate fit per interval. For a constant response prediction per interval:
$$y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \beta_3 C_3(x_i) + \ldots + \beta_n C_n(x_i) + \varepsilon_i$$
where the $C_j(x)$ are indicator variables that become 1 or 0 depending on the value of x and the interval boundaries:
$$C(x) = \begin{cases} 1 & \text{bound}_{lower} \le x < \text{bound}_{higher} \\ 0 & x < \text{bound}_{lower} \lor x \ge \text{bound}_{higher} \end{cases}$$
This can give stable fits, with flexibility determined by the location and number of breaks, but usually with a large bias, because a constant cannot follow any trend within an interval.
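A step-function fit of this kind can be sketched in R with cut() and lm(). The data and the break points below are invented for illustration; the check at the end confirms that the fitted value in each bin is simply the bin mean.

```r
set.seed(1)
x <- runif(200, 18, 80)                 # hypothetical predictor
y <- 50 + 0.5 * x + rnorm(200, sd = 5)  # hypothetical response

# cut() assigns each x to a bin [lower, higher); breaks chosen arbitrarily
bins <- cut(x, breaks = c(18, 35, 50, 65, 80),
            right = FALSE, include.lowest = TRUE)

# The factor is expanded into indicator variables C_j(x) by lm()
fit <- lm(y ~ bins)
coef(fit)  # intercept = mean of the first bin, other terms = offsets to it

# Fitted values are piecewise constant and equal to the per-bin means
bin_means <- tapply(y, bins, mean)
all.equal(as.numeric(fitted(fit)), as.numeric(bin_means[as.integer(bins)]))
```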

Fitting higher order functions per interval

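Piecewise polynomial regression: fitting low-degree polynomials over separate intervals of X. For a cubic with a single break point:
$$f(x_1) = \begin{cases} \beta_{01} + \beta_{11} x_1 + \beta_{21} x_1^2 + \beta_{31} x_1^3 & x_1 \le \text{bound} \\ \beta_{02} + \beta_{12} x_1 + \beta_{22} x_1^2 + \beta_{32} x_1^3 & x_1 > \text{bound} \end{cases}$$
Adding more break points (knots), and thus more intervals, makes the function more flexible.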

Constraints to obtain smooth functions

If we do not insist on continuity we get awkward results: the fitted curve jumps at the interval borders.

Constraining only the fitted value to agree at the interval borders still gives unrealistic fits.

Constraints
Ensuring continuity to the second derivative gives smoother
transitions and reduces the degrees of freedom needed for the
fit
A spline of degree D is a function formed by connecting
polynomial segments of degree D so that:
the function is continuous,
the function has D − 1 continuous derivatives (the Dth
derivative is constant between knots)
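The saving in degrees of freedom can be counted explicitly for the cubic case (D = 3) with K knots. This is the standard counting argument, written out here as a worked example:

```latex
\begin{align*}
\text{coefficients of } K+1 \text{ unconstrained cubic pieces:} &\quad 4(K+1) \\
\text{constraints (continuity of } f, f', f'' \text{ at each knot):} &\quad 3K \\
\text{degrees of freedom left:} &\quad 4(K+1) - 3K = K + 4
\end{align*}
```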
What is a spline?
Historically: a flexible ruler used to draw curves. Thin wooden strips were used to interpolate between the key points of a design and obtain smooth curves. The strips are held in place at defined points using weights called "ducks". Between the fixed points the strip assumes the shape that minimizes the strain energy.
In statistics etc.: a "spline" is a smooth, piecewise polynomial approximation of a continuous function.

Form of a cubic spline: Basis functions

Polynomial and piecewise-constant regression functions are special cases of the general model:
$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \beta_3 b_3(x_i) + \ldots + \beta_n b_n(x_i) + \varepsilon_i$$
with $b_j(\cdot)$ some defined basis function, e.g. $b_j(x) = x^j$ in the case of polynomials.
This approach allows us to fit flexible functions while holding on to the linear model with its many advantages, such as parameter estimation approaches and error/significance inference.

Form of a cubic spline

A cubic spline with K knots can be modelled as:
$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \beta_3 b_3(x_i) + \ldots + \beta_{K+3} b_{K+3}(x_i) + \varepsilon_i$$
One representation starts with a normal cubic polynomial, $x$, $x^2$, $x^3$, and then adds one truncated power basis function per knot $\xi$:
$$h(x, \xi) = (x - \xi)^3_+ = \begin{cases} (x - \xi)^3 & x > \xi \\ 0 & \text{otherwise} \end{cases}$$
Limited increase in use of degrees of freedom: a cubic spline with K knots uses K+4 degrees of freedom.
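The truncated power representation can be built by hand in R. This is a sketch on simulated data; the knot positions and the helper name tp are arbitrary choices for illustration.

```r
set.seed(1)
x <- sort(runif(200, 0, 10))        # made-up predictor
y <- sin(x) + rnorm(200, sd = 0.3)  # made-up response
knots <- c(2.5, 5, 7.5)             # K = 3 knots, chosen arbitrarily

# Truncated power function (x - xi)^3_+ : zero left of the knot, cubic right of it
tp <- function(x, xi) pmax(x - xi, 0)^3

# Basis columns: x, x^2, x^3, then one truncated power column per knot
B <- cbind(x, x^2, x^3, sapply(knots, function(xi) tp(x, xi)))
colnames(B) <- c("x", "x2", "x3", paste0("tp", seq_along(knots)))

fit <- lm(y ~ B)
length(coef(fit))  # K + 4 = 7 coefficients, intercept included
```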

The truncated power basis function in action

image by Trevor Hastie, Robert Tibshirani

Practicalities around regression splines

In principle you could go to higher-degree splines, e.g. with 4th-degree polynomials. In practice, this is hardly ever warranted.
The truncated power basis is not very useful in practice due to numerical instability issues.
Powers of large numbers can cause problems with overflow/rounding.
The B-spline basis is more suitable (stable), especially with many knots (but it has a more complicated form).
B-splines span the same space of functions as the truncated power basis shown here, so the fitted curve is the same.
In R you can generate the basis of a cubic regression spline with the bs function (from the splines package, which the gam package loads).
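The equivalence of the two bases, and the difference in numerical conditioning, can be checked with a short sketch. The data and knots are simulated assumptions; kappa() is used here as one convenient measure of how ill-conditioned each design matrix is.

```r
library(splines)  # bs()

set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)
knots <- c(2.5, 5, 7.5)

# Design matrices (with intercept column) in the two bases
X_tp <- cbind(1, x, x^2, x^3, sapply(knots, function(xi) pmax(x - xi, 0)^3))
X_bs <- cbind(1, bs(x, knots = knots))

fit_tp <- lm.fit(X_tp, y)
fit_bs <- lm.fit(X_bs, y)

# Same function space, so the fitted values agree up to numerical error
all.equal(fit_tp$fitted.values, fit_bs$fitted.values)

# Condition numbers of the two design matrices
c(truncated_power = kappa(X_tp, exact = TRUE),
  b_spline        = kappa(X_bs, exact = TRUE))
```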

Question: why this comment?

’Unfortunately, splines can have high variance at the outer range of

the predictors—that is, when X takes on either a very small or very
large value’

Natural Splines: additional constraints

We know that the behavior of polynomials fit to data tends to be erratic near the boundaries.
Locally fit polynomials behave even more wildly there, and inference beyond the range of the data is unreliable.
A "natural" cubic spline adds constraints, so that the function is linear beyond the boundary knots.
The following holds for a natural spline g fit on observations in ascending order $x_0, \ldots, x_n$:
$$g''(x_0) = g''(x_n) = 0$$
Boundary knots are required, but 4 degrees of freedom are saved compared to a cubic spline with the same number of knots (two constraints at each boundary).
Natural Splines: additional constraints

As can be seen below, the variability in the fit is reduced in the boundary regions (confidence intervals are shown).

In R you can generate the basis of a natural cubic spline with the ns function (from the splines package, which the gam package loads).
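The degrees-of-freedom saving and the reduced boundary variability can be illustrated as follows. This is a sketch on simulated data; the five knot positions are arbitrary, and in ns() the two outer knots are passed as Boundary.knots.

```r
library(splines)  # bs(), ns()

set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)
inner <- c(2.5, 5, 7.5)  # interior knots (arbitrary)
outer <- c(0.5, 9.5)     # boundary knots of the natural spline (arbitrary)

# Cubic spline and natural cubic spline on the same five knots
fit_cs <- lm(y ~ bs(x, knots = c(outer[1], inner, outer[2])))
fit_ns <- lm(y ~ ns(x, knots = inner, Boundary.knots = outer))
c(cubic = length(coef(fit_cs)), natural = length(coef(fit_ns)))  # 9 and 5

# Pointwise standard errors at the left edge, the middle and the right edge
grid  <- data.frame(x = c(min(x), 5, max(x)))
se_cs <- predict(fit_cs, grid, se.fit = TRUE)$se.fit
se_ns <- predict(fit_ns, grid, se.fit = TRUE)$se.fit
rbind(cubic = se_cs, natural = se_ns)
```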

Natural Splines: expressed in base functions

A natural cubic spline model with K knots $\xi_1 < \ldots < \xi_K$ is represented by K basis functions:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 b_3(x_i) + \beta_3 b_4(x_i) + \ldots + \beta_{K-1} b_K(x_i) + \varepsilon_i$$
with $b_{k+2}(X) = d_k(X) - d_{K-1}(X)$ for $k = 1, \ldots, K-2$, where
$$d_k(X) = \frac{(X - \xi_k)^3_+ - (X - \xi_K)^3_+}{\xi_K - \xi_k}$$
and $(X - \xi_k)^3_+$ is the truncated power basis function as before.
(from Elements of Statistical Learning)
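These basis functions can be coded directly. The sketch below uses simulated data and arbitrary knots; natural_basis is a hypothetical helper name. It compares the resulting fit with ns(), which is expected to describe the same function space when given the same interior and boundary knots.

```r
library(splines)  # ns(), used only for the comparison

# Columns X, b_3(X), ..., b_K(X) for knots xi[1] < ... < xi[K]
natural_basis <- function(x, xi) {
  K  <- length(xi)
  tp <- function(x, knot) pmax(x - knot, 0)^3
  d  <- function(k) (tp(x, xi[k]) - tp(x, xi[K])) / (xi[K] - xi[k])
  cbind(x, sapply(seq_len(K - 2), function(k) d(k) - d(K - 1)))
}

set.seed(1)
x  <- sort(runif(200, 0, 10))
y  <- sin(x) + rnorm(200, sd = 0.3)
xi <- c(0.5, 2.5, 5, 7.5, 9.5)  # K = 5 knots, chosen arbitrarily

B <- natural_basis(x, xi)
fit_manual <- lm(y ~ B)
fit_ns     <- lm(y ~ ns(x, knots = xi[2:4], Boundary.knots = xi[c(1, 5)]))

length(coef(fit_manual))  # K coefficients, intercept included
all.equal(unname(fitted(fit_manual)), unname(fitted(fit_ns)))
```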

Decisions with Regression Splines

1 select the order of the spline
2 the number of knots
3 placement of knots
One approach is to parameterize a family of splines by degrees of freedom, and have the observations determine the positions of the knots.
In practice it is common to place knots in a uniform fashion.
Decide the form by cross-validation.

Smoothing splines: roughness penalty

Purpose:
Provide a good fit to the data to explore and present the
relationship between the explanatory variable and the response
variable
To obtain a curve estimate that does not display too much
rapid fluctuation
How to make a compromise between the two rather different
aims in curve estimation?
Smoothing splines penalize for roughness quantified by:
$$\int g''(t)^2 \, dt$$

Smoothing splines

We try to find a function g that fits the data as well as possible, but it should avoid overfitting. A reasonable demand is for the function to be "smooth". We minimize the following objective function:
$$\sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt$$
If λ = 0 you'll get a perfect match to the training data; if λ → ∞ you'll get a function without any curvature: a straight line.
Remarkably, it can be shown that this objective has an explicit, finite-dimensional, unique minimizer, which is a natural cubic spline with knots at the unique values of the $x_i$, $i = 1, \ldots, n$.
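The effect of the penalty can be seen with smooth.spline() from base R. This is a sketch on simulated data; the two df values are arbitrary and are used as a convenient way to request a small and a large λ.

```r
set.seed(1)
x <- seq(0, 10, length.out = 100)   # made-up predictor
y <- sin(x) + rnorm(100, sd = 0.3)  # made-up response

fit_rough  <- smooth.spline(x, y, df = 30)  # small lambda: wiggly fit
fit_smooth <- smooth.spline(x, y, df = 4)   # large lambda: nearly linear fit

# Residual sum of squares of a fitted smoothing spline
rss <- function(fit) sum((y - predict(fit, x)$y)^2)

rbind(lambda = c(rough = fit_rough$lambda, smooth = fit_smooth$lambda),
      rss    = c(rough = rss(fit_rough),   smooth = rss(fit_smooth)))
```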

Smoothing splines: the λ parameter
The smoothing parameter controls the variance/bias balance
(image from The Elements of Statistical Learning)

Question: What does this comment refer to

“In other words, the function g(x) that minimizes (7.11) is a

natural cubic spline with knots at x1, . . . , xn! However, it is not
the same natural cubic spline that one would get if one applied the
basis function approach described in Section 7.4.3 with knots at
x1, . . . , xn—rather, it is a shrunken version of such a natural
cubic spline, where the value of the tuning parameter λ in (7.11)
controls the level of shrinkage.”

Smoothing splines: the λ parameter

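The smoothing parameter constrains the degrees of freedom of the fit: $df(\lambda)$ decreases from n for λ = 0 to 2 as λ → ∞.
Write the estimated fit as $\hat{g}_\lambda = S_\lambda y$; then the effective degrees of freedom are given by
$$df_\lambda = \sum_{i=1}^{n} \{S_\lambda\}_{ii}$$
Cross-validation is a good way to estimate an adequate λ.
There is a very computationally efficient leave-one-out cross-validation (LOOCV) solution, which needs only the single fit on all the data:
$$RSS_{LOOCV}(\lambda) = \sum_{i=1}^{n} \left( \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}} \right)^2$$
Similar efficient LOOCV solutions exist for least-squares regression fits, with the hat matrix in place of $S_\lambda$.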
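The LOOCV shortcut can be sketched with smooth.spline(), which returns the diagonal of the smoother matrix as the lev component. The data, the grid of df values and the helper name loocv are assumptions for illustration; all.knots = TRUE is set so that knots sit at every unique x, as in the definition above.

```r
set.seed(1)
x <- seq(0, 10, length.out = 100)   # unique, sorted x keeps fit$y aligned with y
y <- sin(x) + rnorm(100, sd = 0.3)

# LOOCV residual sum of squares from one fit on all the data
loocv <- function(df) {
  fit <- smooth.spline(x, y, df = df, all.knots = TRUE)
  sum(((y - fit$y) / (1 - fit$lev))^2)  # fit$lev = diagonal of S_lambda
}

dfs <- 4:20
cv  <- sapply(dfs, loocv)
dfs[which.min(cv)]  # df with the lowest LOOCV error on this grid

# The effective degrees of freedom are the trace of the smoother matrix
fit <- smooth.spline(x, y, df = 8, all.knots = TRUE)
c(trace = sum(fit$lev), df = fit$df)
```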

Non-parametric methods

Normal linear regression assumes e.g. a normal distribution of the errors.
Non-parametric covers techniques that do not rely on the data belonging to any particular distribution. E.g. the Mann–Whitney U test tests the hypothesis that two samples are from the same population and is based on ranking your values. The test can be more powerful than a t-test on non-normal distributions.
Polynomial expansions used to fit a complex function still assume that a single functional form can describe the predictor-response relationship over the whole range.
Non-parametric methods make no (or fewer) assumptions on the functional form.

The simplest non-parametric regression
A prediction for a value in a range is a local weighted average of the nearby points.
The function that defines the weights for the weighted average is dubbed a "kernel", e.g. a Gaussian kernel.
The result is a smooth function.
Available in R in the np package.

image from Wikipedia (http://en.wikipedia.org/wiki/Kernel_regression)
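The local weighted average can be written in a few lines of base R. This is a hand-rolled sketch, not the np package; the function name nw, the bandwidth h and the data are assumptions for illustration.

```r
# Kernel-weighted average at x0 with a Gaussian kernel of bandwidth h
nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)  # weights, largest for points near x0
  sum(w * y) / sum(w)
}

set.seed(1)
x <- runif(200, 0, 10)              # made-up predictor
y <- sin(x) + rnorm(200, sd = 0.3)  # made-up response

grid <- seq(0, 10, length.out = 50)
fit  <- vapply(grid, function(x0) nw(x0, x, y, h = 0.5), numeric(1))

# A larger bandwidth averages over more points and gives a flatter curve
fit_wide <- vapply(grid, function(x0) nw(x0, x, y, h = 3), numeric(1))
c(range_h0.5 = diff(range(fit)), range_h3 = diff(range(fit_wide)))
```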

Local Regression; LOESS
The relationship between predictor and response is modeled with local linear fits: for a given x we fit f(x) = β0 + β1 x.
A weighted least squares fit is made for these simple linear models. The observations are taken from the neighbourhood of x and are weighted through a specified kernel function. The observations closest to the value to be predicted are given most weight.
LO(W)ESS stands for Locally Weighted Scatterplot Smoothing.

Local regression: the blue curve represents the generating function, and the orange curve corresponds to the local regression estimate f(x). The yellow bell shape superimposed on the plot indicates the weights assigned to each point, decreasing to zero with distance from the target point.

Local Regression

Choices:
  The weighting function
    a continuous, bounded, and symmetric real function
    a running mean is known as the box kernel
    a (truncated) Gaussian is a natural candidate
  The weighting function comes with a range parameter
    e.g. the span, the fraction of the dataset considered by the kernel
  Type of regression function
Advantages: very flexible fit
Disadvantages:
  Requires dense data to work well
  No closed functional definition
  a memory-based procedure
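The choices listed here (weight function, span, local linear model) can be made concrete in a small sketch. The tricube weight function and the helper name local_linear are assumptions for illustration; base R's loess() is shown for comparison and need not give exactly the same value, since it has its own fitting details.

```r
# Local linear prediction at x0 from the `span` fraction of nearest points
local_linear <- function(x0, x, y, span = 0.3) {
  k <- ceiling(span * length(x))
  d <- abs(x - x0)
  h <- sort(d)[k]                           # distance to the k-th nearest point
  w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)  # tricube weights, zero outside window
  fit <- lm(y ~ x, weights = w)             # weighted least squares, b0 + b1 * x
  unname(predict(fit, data.frame(x = x0)))
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

local_linear(5, x, y)  # uses all the training data at prediction time
predict(loess(y ~ x, span = 0.3, degree = 1), data.frame(x = 5))
```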

Local Regression vs Splines

Which works better?

Specifics

Standard errors can be estimated for every point; however, bootstrap estimates are often preferred.
The degrees of freedom used by the smoother can be estimated very similarly to how we did it for the smoothing spline:
The vector of fitted values can be expressed as $\hat{f} = S y$, where S is an n × n matrix defined by our smoother and y are our observations.
The degrees of freedom used are given by the trace of S, the sum of the diagonal values of the matrix:
$$df = \mathrm{trace}(S) = \sum_{i=1}^{n} \{S\}_{ii}$$
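For a kernel smoother the matrix S can be written down explicitly. The sketch below uses a Gaussian kernel with an arbitrary bandwidth on simulated data; it is meant only to show that the fitted values are S y and that the degrees of freedom are the trace of S.

```r
set.seed(1)
x <- sort(runif(50, 0, 10))
y <- sin(x) + rnorm(50, sd = 0.3)
h <- 0.5  # hypothetical bandwidth

# Row i of S holds the normalized kernel weights used to predict at x[i]
W <- outer(x, x, function(a, b) dnorm((a - b) / h))
S <- W / rowSums(W)

f_hat <- drop(S %*% y)  # fitted values: f_hat = S y
df    <- sum(diag(S))   # effective degrees of freedom = trace(S)
df
```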

Question

From page 281: I don't really understand 'we need all the training data each time we wish to compute a prediction.' Why do we need all the training data?

An Application of Non-Linear Models

One important use is to remove systematic experimental bias from data, i.e. calibration.
An example: spotted microarrays consist of DNA samples spotted in a regular pattern on a solid surface. The relative abundances of mRNA are read out by hybridization of cDNA tagged with a fluorescent dye. To compare two conditions, two dyes are used: e.g. Cy3 (green) and Cy5 (red).
We are typically interested in the ratio between the signals as a measure of differential expression between conditions.
However, the green dye often has a tendency to be stronger than the red dye. The magnitude of this effect varies from array to array. If we can measure this bias we can correct for it.
A standard method of displaying microarray data that visualizes the spread between the two channels is the M versus A plot. With G(g) the Cy3 intensity for a gene g and R(g) the Cy5 intensity for g, we plot
$$M = \log_2 \frac{G(g)}{R(g)}$$
on the vertical axis against
$$A = \frac{\log_2 G(g) + \log_2 R(g)}{2}$$
on the horizontal axis.
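The correction can be sketched on simulated intensities. The size and shape of the dye bias below are invented; the point is only the sequence of steps: compute M and A, fit a LOESS curve of M on A, and subtract the fitted curve.

```r
set.seed(1)
n <- 2000
abundance <- runif(n, 6, 14)                # hypothetical log2 abundance
bias      <- 0.6 * exp(-(abundance - 6) / 3) # invented intensity-dependent dye bias

# Simulated green (Cy3) and red (Cy5) intensities with no true differences
G <- 2^(abundance + bias / 2 + rnorm(n, sd = 0.15))
R <- 2^(abundance - bias / 2 + rnorm(n, sd = 0.15))

M <- log2(G / R)
A <- (log2(G) + log2(R)) / 2

fit    <- loess(M ~ A, span = 0.4, degree = 1)  # smooth trend of M against A
M_norm <- M - predict(fit)                      # subtract the fitted bias curve

c(mean_M_before = mean(M), mean_M_after = mean(M_norm))
```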

M versus A plot

M is the log fold change (vertical axis), A is the abundance (horizontal axis)

M versus A LOESS fit

M versus A LOESS fit subtracted

Details

When one may not assume that most of the genes are
unchanged between the two conditions, applying this method
may normalize out true biological differences.
Another issue of normalization involves the spread of the M
values across the array, which may depend on the array itself
and not on the biology.
In real experiments there are normally many biases and
random effects.

High dimensionality

Can we fit non-linearly when p is large (and n<p)?

Generalized Additive Models
Generalized Additive Models (GAMs) extend the Generalized Linear Model so that non-linear responses can be included, while maintaining the additive form between components.
$$g(y_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$$
becomes
$$g(y_i) = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i$$
For natural/regression splines the non-linear function can be represented as a normal set of basis functions, so we can use a normal least squares approach and a general linear model!
Other choices for the $f_j$ (e.g. smoothing splines or local regression) require alternative fitting procedures, such as the back-fitting procedure (exercise 11).
GAM fitting

A normal lm OLS fit is defined as:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
For a GAM, OLS is not defined in general.
Backfitting
1 Initialize: $\beta_0 = \bar{y}$, $f_j = f_j^0$, $j = 1, \ldots, p$
2 Cycle: $j = 1, \ldots, p, 1, \ldots, p, \ldots$
$$f_j = S_j\Big(y - \beta_0 - \sum_{k \ne j} f_k \;\Big|\; x_j\Big)$$
Repeat until the changes in the $f_j$ are minimal.
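The backfitting cycle can be sketched for two predictors, with a smoothing spline as the smoother S_j. The data, the df of each smoother, the zero initialization and the stopping rule are all assumptions chosen for illustration.

```r
set.seed(1)
n  <- 300
x1 <- runif(n); x2 <- runif(n)  # made-up predictors
y  <- sin(2 * pi * x1) + 4 * (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

X     <- cbind(x1, x2)
beta0 <- mean(y)                 # 1. initialize the intercept
f     <- matrix(0, n, ncol(X))   #    and start every f_j at zero

for (iter in 1:20) {             # 2. cycle over the predictors
  f_old <- f
  for (j in seq_len(ncol(X))) {
    # Partial residual: remove the intercept and all other fitted functions
    partial <- y - beta0 - rowSums(f[, -j, drop = FALSE])
    sm      <- smooth.spline(X[, j], partial, df = 5)
    f[, j]  <- predict(sm, X[, j])$y
    f[, j]  <- f[, j] - mean(f[, j])  # keep each f_j centered
  }
  if (max(abs(f - f_old)) < 1e-6) break  # stop when the f_j no longer change
}

res <- y - beta0 - rowSums(f)
c(iterations = iter, var_y = var(y), var_residual = var(res))
```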

Generalized Additive Models

Why the additive format?

For the Wage data, plots of the relationship between each feature and the response,
wage, in the fitted model wage = β0 + f1 (year ) + f2 (age) + f3 (education) + ε. Each
plot displays the fitted function and point-wise standard errors. The first two functions
are natural splines in year and age, with four and five degrees of freedom, respectively.
The third function is a step function, fit to the qualitative variable education.

GAM for Classification

A more general notation for part of the GAM formulation is
$$g(E(y)) = \beta_0 + \sum_{j=1}^{p} f_j(x_j)$$
where a link function g connects the predictions to a specified exponential-family error distribution (e.g. Poisson, Gaussian, Binomial). Hence GAMs can also be used for classification problems:
$$\log\left(\frac{p(y_i)}{1 - p(y_i)}\right) = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij})$$
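When the f_j are natural splines, such a logistic GAM can be fit with plain glm(). The sketch below uses simulated binary data with an invented true probability; it also fits the linear logistic model for comparison.

```r
library(splines)  # ns()

set.seed(1)
n  <- 500
x1 <- runif(n, -2, 2); x2 <- runif(n, -2, 2)  # made-up predictors
p  <- plogis(x1^2 - 1 + sin(2 * x2))          # invented true probability
y  <- rbinom(n, 1, p)

# Additive logistic model: logit(p) = beta0 + f1(x1) + f2(x2)
fit_gam <- glm(y ~ ns(x1, df = 4) + ns(x2, df = 4), family = binomial)
fit_glm <- glm(y ~ x1 + x2, family = binomial)  # linear terms only

p_hat <- predict(fit_gam, type = "response")  # fitted probabilities
AIC(fit_glm, fit_gam)
```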

Generalized Additive Models

The GAM allows flexible fits, with relaxed assumptions, to

better represent relationships in the data. (lower bias)
This comes at some loss of interpretability.
Ease of understanding, summarization, communication
Parameterized methods give easily interpreted, simple
predictions
Overfitting can be a serious problem; though solutions exist!
Control degrees of freedom
Cross-validation
Compare GAM fits to GLM fits: is the decrease in bias higher
than the increase in variance? Are your non-linear models
significantly better?
It is usually preferable to rely on a simple well understood
model for predicting future cases, than on a complex model
that is difficult to interpret and summarize.
How about interactions between variables?
Classical comparisons of (G)LMs for model selection

In the lab they refer to doing ANOVAs to compare linear models.
Classical model selection approach: the general linear F-test. F stands for Fisher.

F-test for linear models

You compare two linear models: a complete model, also called the unrestricted model, and a reduced (restricted) model. In the reduced model one or more of the coefficients of the complete model are 0. For example, the complete model
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
and a reduced (or nested) model with some coefficients 0:
$$y = \beta_0 + \beta_2 X_2$$
You want to test the hypothesis that the removed coefficients are 0: $H_0: \beta_1 = 0$
The basis for the comparison is the residual sum of squares of the fits, and an assumption of normality of the residuals.

F-test definition

Calculate $RSS = \sum_i (y_i - \hat{y}_i)^2$ for the complete (c) and reduced (r) models, and note the number of degrees of freedom used by each model (df) and the remaining degrees of freedom for the complete model ($n - df_c$). Calculate the F-statistic:
$$F = \frac{(RSS_r - RSS_c) / (df_c - df_r)}{RSS_c / (n - df_c)}$$
Under $H_0$ this statistic has an F distribution with parameters ($df_c - df_r$, $n - df_c$).
Note $RSS_c \le RSS_r$.
For linear regression, this is equivalent to the ANOVA F-test.
It can be used to reduce a full model step by step, a kind of stepwise backward selection with hypothesis testing.
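The F-statistic can be computed by hand and compared with anova(). The sketch uses simulated data for the two-predictor example above; the coefficient values are invented.

```r
set.seed(1)
n  <- 100
X1 <- rnorm(n); X2 <- rnorm(n)
y  <- 1 + 0.5 * X1 + 2 * X2 + rnorm(n)  # made-up data

m_c <- lm(y ~ X1 + X2)  # complete model
m_r <- lm(y ~ X2)       # reduced model (beta1 = 0)

rss_c <- sum(resid(m_c)^2); rss_r <- sum(resid(m_r)^2)
df_c  <- length(coef(m_c)); df_r  <- length(coef(m_r))

F_stat <- ((rss_r - rss_c) / (df_c - df_r)) / (rss_c / (n - df_c))
p_val  <- pf(F_stat, df_c - df_r, n - df_c, lower.tail = FALSE)

c(F = F_stat, p = p_val)
anova(m_r, m_c)  # reports the same F statistic and p-value
```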

GAM evaluation

How to compare models of different complexities:

ANOVA (if nested)
Can compare linear vs non-linear components
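# assumes the Wage data frame (ISLR package) and the splines and gam packages are loaded
m1 = lm(wage ~ ns(year, df = 5) + ns(age, df = 5), data = Wage)  # non-linear in year
m2 = lm(wage ~ year + ns(age, df = 5), data = Wage)              # linear in year
anova(m1, m2)
GLM vs GAM
m3 = gam(wage ~ s(year, df = 5) + ns(age, df = 5), data = Wage)  # smoothing spline in year
anova(m3, m2)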
AIC
Cross-Validation

To do:

Preparation for next week

Read chapter 8 + videos

Send in any question day before class
Exercises
Lab chapter 7
Chapter 7, exercise 1, 2, 5, 10 & 11
