Nonlinear Curve Fitting
Earl F. Glynn
Scientific Programmer, Bioinformatics
11 Oct 2006
Analysis of Results
Goodness of Fit: R², Residuals
Summary
Mathematical Models
- We want a mathematical model that describes the observations in terms of the independent variable(s) under experimental control.
- A good understanding of the underlying biology, physics, and chemistry of the problem is needed to choose the right model.
- Curve fitting connects the observed data to the mathematical model.
[Figure: probability density (0.00 to 0.05) vs. length (0 to 80 cm); data are fitted by five normal distributions with constant coefficient of variation]

Normal probability density:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}

Coefficient of variation:

cv = \frac{\sigma}{\mu}
http://papakilo.icmb.utexas.edu/cshl-2005/lectures/ CSHL_Lecture05_khodursky.ppt#23
DiMarco and Bombi, Mathematical functions for the representation of chromatographic peaks, Journal of Chromatography A, 931(2001), 1-30.
Bacia, Kim & Schwille, Fluorescence cross-correlation spectroscopy in living cells, Nature Methods, Vol 3, No 2, p. 86, Feb. 2006.
From discussion by Juntao Gao at Rong Li's Lab Meeting, 25 Sept 2006
We want the best straight line through these points, where \hat{y}_i is the fitted value at point x_i:

\hat{y}_i = a + b x_i
Linear Fit

\hat{y}_i = a + b x_i

[Figure: data points, x from 0 to 4]
The least-squares error to minimize:

\epsilon(a, b) = \sum_{i=1}^{N} [y_i - (a + b x_i)]^2

[Figure: contour plot of \epsilon(a, b) over the (a, b) plane]
Linear Fit

\hat{y} = 1.2 + 0.9 x

[Figure: data points with the fitted line, x from 0 to 4]
Setting the partial derivatives to zero:

\frac{\partial \epsilon}{\partial a} = -2 \sum_{i=1}^{N} (y_i - a - b x_i) = 0

\frac{\partial \epsilon}{\partial b} = -2 \sum_{i=1}^{N} x_i (y_i - a - b x_i) = 0
\hat{y}_i = a + b x_i
  i    x    y    x²    xy
  1    0    1     0     0
  2    1    3     1     3
  3    2    2     4     4
  4    3    4     9    12
  5    4    5    16    20
Sum   10   15    30    39
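The least-squares coefficients follow directly from these column sums. A minimal sketch in Python (the talk itself uses R) of solving the two normal equations for this table:

```python
# Solve the linear least-squares normal equations for the 5-point example.
# The sums match the table: sum(x)=10, sum(y)=15, sum(x^2)=30, sum(xy)=39.
x = [0, 1, 2, 3, 4]
y = [1, 3, 2, 4, 5]
N = len(x)

Sx  = sum(x)
Sy  = sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# Normal equations:  a*N  + b*Sx  = Sy
#                    a*Sx + b*Sxx = Sxy
b = (N * Sxy - Sx * Sy) / (N * Sxx - Sx ** 2)
a = (Sy - b * Sx) / N

print(a, b)  # a = 1.2, b = 0.9, i.e. y = 1.2 + 0.9 x
```

This reproduces the fitted line shown on the earlier slide.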
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.2000     0.6164   1.947   0.1468
x             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-Squared: 0.81, Adjusted R-squared: 0.7467
F-statistic: 12.79 on 1 and 3 DF, p-value: 0.03739
Assume heteroscedasticity

\chi^2(a, b) = \sum_{i=1}^{N} \left[ \frac{y_i - (a + b x_i)}{\sigma_i} \right]^2

Often the weights \sigma_i are assumed to be 1. Experimental measurement errors can be used if known.
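One way to sketch this weighted fit in Python (the talk uses R; the \sigma_i values below are made up purely for illustration) is to scale each row of the design matrix by 1/\sigma_i and then solve with ordinary least squares:

```python
import numpy as np

# Hypothetical data with per-point measurement errors sigma (illustrative values).
x     = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y     = np.array([1.0, 3.0, 2.0, 4.0, 5.0])
sigma = np.array([0.5, 1.0, 2.0, 1.0, 0.5])   # assumed known measurement errors

# Minimizing chi^2 = sum(((y - (a + b*x)) / sigma)^2) is ordinary least
# squares after dividing each row of the design matrix [1, x] and y by sigma_i.
X = np.column_stack([np.ones_like(x), x]) / sigma[:, None]
(a, b), *_ = np.linalg.lstsq(X, y / sigma, rcond=None)

print(a, b)
```

With all \sigma_i equal, this reduces to the unweighted fit from the previous slides.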
Fitting a single Gaussian:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}

\epsilon(\mu, \sigma) = \sum_{i=1}^{N} [y_i - f(x_i)]^2

[Figure: Gaussian probability density f(x)]
Assume homoscedasticity
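A sketch of fitting this single-Gaussian model with scipy.optimize.curve_fit (an assumption on my part: the talk itself uses R's nls, not scipy). curve_fit performs nonlinear least squares and needs an initial guess for \mu and \sigma:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, mu, sigma):
    """Normal probability density f(x) from the slide."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Synthetic, noise-free data from a known Gaussian (mu=50, sigma=10).
x = np.linspace(20, 80, 61)
y = gaussian(x, 50.0, 10.0)

# The initial guess matters: start near, but not at, the true values.
popt, pcov = curve_fit(gaussian, x, y, p0=[45.0, 8.0])
mu_hat, sigma_hat = popt
print(mu_hat, sigma_hat)  # close to 50 and 10
```

The iterative search converges here because the starting point is reasonable; a poor p0 can send it to a wrong local minimum or fail to converge at all.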
[Figure: Gaussian f(x) with its 1st and 2nd derivatives, x from -4 to 4]
U:/efg/lab/R/MixturesOfDistributions/SingleGaussian.R
[Figure: length data fit with its 1st and 2nd derivatives, x from 20 to 80]
Given data points (x_i, y_i). Given a desired model to fit (not always known): y = y(x; \mathbf{a}), where there are M unknown parameters a_k, k = 1, 2, ..., M. The error function (merit function) is

\chi^2(\mathbf{a}) = \sum_{i=1}^{N} \left[ \frac{y_i - y(x_i; \mathbf{a})}{\sigma_i} \right]^2
\frac{\partial^2 \chi^2}{\partial a_k \partial a_l} = 2 \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[ \frac{\partial y(x_i; \mathbf{a})}{\partial a_k} \frac{\partial y(x_i; \mathbf{a})}{\partial a_l} - [y_i - y(x_i; \mathbf{a})] \frac{\partial^2 y(x_i; \mathbf{a})}{\partial a_l \partial a_k} \right]

\alpha_{kl} = Hessian or curvature matrix (after dropping the factor of 2)

From Press, et al., Numerical Recipes in C (2nd Ed.), 1992, p. 682
The parameter increments \delta a_l solve the linear system

\sum_{l=1}^{M} \alpha_{kl} \, \delta a_l = \beta_k

Full Newton-type methods keep the dropped second-derivative term; they are considered more robust, but more complicated. Simplex is an alternative algorithm.
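The iteration built on these normal equations is the Levenberg-Marquardt algorithm, which scipy.optimize.least_squares exposes as method='lm' (a sketch under the assumption that scipy is acceptable here; the talk itself uses other tools):

```python
import numpy as np
from scipy.optimize import least_squares

# Model: y = A * exp(-k * x); fit A and k by Levenberg-Marquardt.
def residuals(params, x, y):
    A, k = params
    return y - A * np.exp(-k * x)

x = np.linspace(0, 5, 50)
y = 2.5 * np.exp(-1.3 * x)            # noise-free synthetic data

# method='lm' uses MINPACK's Levenberg-Marquardt implementation.
fit = least_squares(residuals, x0=[1.0, 1.0], args=(x, y), method='lm')
A_hat, k_hat = fit.x
print(A_hat, k_hat)  # close to 2.5 and 1.3
```

As the next slide warns, convergence depends on a sensible starting point x0.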
- The fitting procedure is iterative.
- A good initial guess is usually needed, based on understanding of the selected model.
- There is no guarantee of convergence, and no guarantee of an optimal answer.
- The solution requires derivatives: numeric or analytic derivatives can be used by some packages.
Mathematica (1 limited license)
MatLab (1 limited license)
OriginPro (10 limited licenses)
PeakFit (1 limited license)
R: nls function
- many statistics
- symbolic derivatives (if desired)
- flawed implementation: exact toy problems fail unless noise is added
http://www.itl.nist.gov/div898/strd/general/dataarchive.html
Analysis of Results
Goodness of Fit: R², Residuals
Goodness of Fit: R²

Coefficient of Determination (percentage of variance explained):

R^2 = 1 - \frac{\text{Residual Sum of Squares (RSS)}}{\text{Total Sum of Squares (SS), corrected for the mean}}

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

0 \le R^2 \le 1
Adjusted R² compensates for R² increasing as terms are added. A good value of R² depends on the application. In the biological and social sciences, with weakly correlated variables and considerable noise, R² ~ 0.6 might be considered good. In the physical sciences, in controlled experiments, R² ~ 0.6 might be considered low.
Faraway, Linear Models with R, 2005, p.16-18
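These definitions can be checked against the earlier five-point linear fit (\hat{y} = 1.2 + 0.9x), reproducing the R² = 0.81 and adjusted R² = 0.7467 reported in the R summary earlier in the talk. A short Python sketch:

```python
# R^2 and adjusted R^2 for the earlier 5-point linear fit y = 1.2 + 0.9x.
x = [0, 1, 2, 3, 4]
y = [1, 3, 2, 4, 5]
yhat = [1.2 + 0.9 * xi for xi in x]    # fitted values

ybar = sum(y) / len(y)
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # residual sum of squares
tss = sum((yi - ybar) ** 2 for yi in y)               # total SS, corrected for mean

r2 = 1 - rss / tss
n, p = len(y), 2                       # n points, p = 2 fitted parameters (a, b)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)

print(r2, adj_r2)  # 0.81 and ~0.7467
```

Note also that the residuals themselves sum to zero here, illustrating the point made on the Residuals slide.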
Residuals
Residuals are estimates of the true and unobservable errors. Residuals are not independent (they sum to 0).
Curve fitting made easy, Marko Ledvij, The Industrial Physicist, April/May 2003. http://www.aip.org/tip/INPHFA/vol-9/iss-2/p24.html
Analysis of Residuals
- Are the residuals random?
- Is the mathematical model appropriate?
- Is the mathematical model sufficient to characterize the experimental data?
- Subtle behavior in the residuals may suggest a significant overlooked property.
Good Reference: Analysis of Residuals: Criteria for Determining Goodness-of-Fit, Straume and Johnson, Methods in Enzymology, Vol. 210, 87-105, 1992.
Analysis of Residuals
Synthetic FRAP Data: Fit with 1 term when 2 terms are better
Analysis of Residuals
The Lomb-Scargle periodogram can indicate periodicity in the residuals.

A flat periodogram with all non-significant p-values would indicate random residuals.
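A sketch of this check with scipy.signal.lombscargle (an assumption: the talk does not say which implementation was used; the residual series below is synthetic, with a deliberately hidden oscillation):

```python
import numpy as np
from scipy.signal import lombscargle

# Hypothetical residuals with a hidden oscillation at angular frequency w0.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 50, 200))          # irregular sampling times
w0 = 1.5                                      # true angular frequency (rad/unit)
resid = 0.5 * np.sin(w0 * t) + 0.1 * rng.standard_normal(200)

# Scan a grid of candidate angular frequencies; a pronounced peak means
# periodic structure is left in the residuals (i.e. a missing model term).
freqs = np.linspace(0.1, 5.0, 500)
power = lombscargle(t, resid - resid.mean(), freqs)

peak = freqs[np.argmax(power)]
print(peak)  # near w0 = 1.5
```

Lomb-Scargle works on irregularly sampled data, which is why it suits residual series better than an ordinary FFT periodogram.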
Analysis of Residuals
Synthetic FRAP Data: Fit with 2 terms
Analysis of Residuals
FCS Data and Heteroscedasticity
\chi^2(\mathbf{a}) = \sum_{i=1}^{N} \left[ \frac{y_i - y(x_i; \mathbf{a})}{\sigma_i} \right]^2

Here \sigma_i acts as a scaling factor.

[Figure: heteroscedasticity in the raw residuals; scaled residuals]

Use the F Test to test for unequal variances.

FCS residual plots courtesy of Joseph Huff, Advanced Instrumentation & Physics.
Analysis of Residuals
Heteroscedasticity and Studentized Residuals
A studentized residual is a residual divided by an estimate of its standard deviation. The leverage h_{ii} is the i-th diagonal entry of the hat matrix.

\text{Studentized residual} = \frac{\hat{\epsilon}_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}

Externally studentized residuals follow Student's t-distribution, and can be used to statistically reject outliers.

See http://en.wikipedia.org/wiki/Studentized_residual
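For the earlier five-point fit, the leverages and internally studentized residuals can be sketched as follows (Python; the hat-matrix formula is the standard one, not anything specific to this talk):

```python
import numpy as np

# Five-point linear fit from earlier in the talk.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]

# Hat matrix H = X (X'X)^(-1) X'; its diagonal gives the leverages h_ii.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

beta = np.linalg.lstsq(X, y, rcond=None)[0]    # (a, b) = (1.2, 0.9)
resid = y - X @ beta
s = np.sqrt(resid @ resid / (len(y) - 2))      # residual standard error (~0.7958)

# Internally studentized residuals.
t = resid / (s * np.sqrt(1 - h))
print(h.sum())  # trace of H = number of fitted parameters = 2
```

High-leverage points (here the endpoints x = 0 and x = 4) get the largest h_{ii}, which is why dividing by sqrt(1 - h_{ii}) puts the residuals on a common scale.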
Summary
- A mathematical model may or may not be appropriate for any given dataset.
- Linear curve fitting is deterministic.
- Nonlinear curve fitting is non-deterministic: it involves searching a huge parameter space and may not converge.
- Nonlinear curve fitting is powerful (when the technique works).
- The R² and adjusted R² statistics provide easy-to-understand dimensionless values to assess goodness of fit.
- Always study the residuals to see if there may be unexplained patterns and missing terms in the model.
- Beware of heteroscedasticity in your data: make sure the analysis doesn't assume homoscedasticity if your data are not homoscedastic.
- Use the F Test to compare the fits of two equations.
Acknowledgements
Advanced Instrumentation & Physics Joseph Huff Winfried Wiegraebe