Predictive Model - Linear and Logistic Models PDF
Pearson correlation
Spearman rank correlation
Kendall's tau
Association between categorical variables
The Pearson correlation coefficient:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}, \qquad -1 \le r \le 1 $$
Since |Z|>1.645, we reject the null hypothesis and conclude that the correlation
between Price and Luas_Lot is significantly different from zero
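To make the formula concrete, here is a minimal Python sketch that computes r directly from the sums of deviations; the x and y values below are invented for illustration and are not the slide's Price/Luas_Lot data.

```python
import math

# Hypothetical sample (NOT the slide's Price / Luas_Lot data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
r = Sxy / math.sqrt(Sxx * Syy)   # Pearson correlation coefficient
print(round(r, 4))
```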
Day 2 - Data Science 17
Spearman Rank Correlation
It is a nonparametric method.
An alternative to the Pearson correlation when
the assumption of normality or linearity is
violated.
E.g., data measured on an ordinal scale (such as a Likert scale) usually do not follow a normal distribution.
$$ r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \qquad d_i = R(X_i) - R(Y_i) $$
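A minimal sketch of the rank-based computation, assuming no tied observations; the scores below are hypothetical.

```python
# Spearman rank correlation via the d_i formula (assumes no ties)
def ranks(v):
    # rank 1 = smallest value
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

x = [10, 20, 30, 40, 50]   # hypothetical ordinal scores
y = [3, 1, 4, 2, 5]

rx, ry = ranks(x), ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(r_s)
```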
$$ \chi^2 = \sum_{i=1}^{\#\text{cells}} \frac{(O_i - E_i)^2}{E_i} = 8.044 $$

The observed counts are arranged in a contingency table:

             Variable 2
Variable 1   1      2      ...    c      Total
1            O11    O12    ...    O1c    n1.
2            O21    O22    ...    O2c    n2.
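The cell-by-cell computation can be sketched as follows; the 2x2 observed counts are hypothetical, and the expected counts use the usual (row total x column total) / grand total rule.

```python
# Chi-square statistic for a hypothetical 2x2 contingency table
observed = [[30, 20],
            [20, 30]]

row_tot = [sum(r) for r in observed]          # n1., n2.
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, O in enumerate(row):
        E = row_tot[i] * col_tot[j] / grand   # expected count
        chi2 += (O - E) ** 2 / E
print(round(chi2, 3))
```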
Excel format vs. standard format:

Y = f(X):  Y = a + bX  (Excel)   vs.   Y = a + bX + errors  (standard)
           Y = a + bX + cZ  (Excel)   vs.   Y = a + b1 X1 + b2 X2 + errors  (standard)
[Scatter plots: Y vs X; Miles vs Ringgit; and a monthly time series from Aug-04 to Apr-07]

The simple linear regression model:

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$
[Plots of curvilinear relationships]

Polynomial models:

$$ y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon_i $$

Models with interaction terms:

$$ y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \varepsilon_i $$
[Plot: Federal Funds Rate]

Nonlinear models, e.g.

$$ y_i = \frac{\beta_0 x_i}{\beta_1 + x_i} + \varepsilon_i, \qquad y_i = \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}} + \varepsilon_i $$
[Scatter plot: Monthly Personal Income (RM) vs Age (years), with a predicted value of RM4565.87 queried beyond the observed ages]
[Plot: Monthly Personal Income vs Age with the line Y = α + βX, showing the intercept α, the slope β, and errors ε1, ε3 between points and the line]
The gaps between the points and the line are the errors of the model.

$$ y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n $$

where
y_i = dependent/response variable (monthly income)
x_i = independent variable/predictor/regressor (age); a known, fixed constant
α = intercept of the line at the y-axis, an unknown constant
β = slope of the line, an unknown constant
ε_i = the random error
n = the number of observations/subjects/samples
Assumptions:
1) The error term ε_i is normally and independently distributed with mean E(ε_i) = 0 and constant variance Var(ε_i) = σ²:

$$ \varepsilon_i \sim NID(0, \sigma^2) $$

It follows that

$$ E(y \mid x) = \alpha + \beta x_i, \qquad Var(y \mid x) = \sigma^2 $$

[Plot: the line E(y|x) = α + βx with the distribution of y at x = 23 (mean α + β(23), standard deviation σ) and at x = 45 (mean α + β(45), standard deviation σ)]
Least square estimation
UECM2263 Applied Statistical Models
Find the line that 'fits' the data best
= estimate the values of α and β that minimize the errors.
[Scatter plot: Monthly Personal Income vs Age (years)]
$$ y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n $$

Re-write the model as

$$ \varepsilon_i = y_i - \alpha - \beta x_i, \quad i = 1, 2, \ldots, n $$

[Plot: positive (+ε) and negative (−ε) errors between the data points and the line]
Squaring both sides gives

$$ \varepsilon_i^2 = (y_i - \alpha - \beta x_i)^2, \quad i = 1, 2, \ldots, n $$

Considering the squared errors for the n pairs of sample data, we have the error sum of squares:

$$ \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 $$
Label the error sum of squares as S(α, β) and call it the LS criterion:

$$ S(\alpha, \beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 $$

Setting the partial derivatives to zero gives the normal equations:

$$ \left.\frac{\partial S}{\partial \alpha}\right|_{\hat{\alpha},\hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0 \quad (1) $$

$$ \left.\frac{\partial S}{\partial \beta}\right|_{\hat{\alpha},\hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)\, x_i = 0 \quad (2) $$

Solving (1) and (2) yields the least-squares estimators:

$$ \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} $$
The fitted regression line:

$$ \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i $$

[Scatter plot: Monthly Personal Income vs Age (years) with the fitted line]
Worked example (n = 5): x̄ = 32.6, ȳ = 2834, Σ x_i y_i = 487940, Σ x_i² = 5575.

$$ \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{487940 - 5(32.6)(2834)}{5575 - 5(32.6)^2} = 99.53 $$

$$ \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 2834 - 99.53(32.6) = -410.77 $$

The fitted line is

$$ \hat{y}_i = -410.77 + 99.53\, x_i $$
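The estimates can be reproduced from the summary statistics above (note the negative intercept):

```python
# Reproduce the least-squares estimates from the slide's summary statistics
n = 5
xbar, ybar = 32.6, 2834
sum_xy, sum_x2 = 487940, 5575

beta = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar ** 2)
alpha = ybar - beta * xbar
print(round(beta, 4), round(alpha, 4))   # slope ~99.53, intercept ~-410.77
print(round(alpha + beta * 30, 2))       # predicted income at age 30
```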
[Plot: Monthly Personal Income vs Age (years) with the fitted line drawn through the data]
[Plot: the prediction RM4565.87 read off the fitted line beyond the range of the observed ages]

Predicting outside the range of the observed data: extrapolation!
At age 30 (x = 30), the predicted monthly income is

$$ \hat{y} = -410.7725 + 99.5329(30) = 2575.21 $$

[Plot: RM2575.21 read off the fitted line at age 30]

Predicting within the range of the observed data: interpolation!
Simple linear regression: interval estimation

An unbiased estimator of σ² is

$$ \hat{\sigma}^2 = MS_{Res} = \frac{SS_{Res}}{n - 2} $$

This unbiased estimator of σ² is called the residual mean square, and its square root is called the standard error of regression.
$$ SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 66063.02 $$

$$ \hat{\sigma}^2 = MS_{Res} = \frac{SS_{Res}}{n - 2} = \frac{66063.02}{5 - 2} = 22021.01 $$
Interval Estimation in Simple Linear Regression:
If the errors are NID, then the 100(1 − α)% confidence intervals for β1, β0 and σ² are

$$ \hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) $$

$$ \hat{\beta}_0 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0) $$

$$ \frac{(n-2)\, MS_{Res}}{\chi^2_{\alpha/2,\,n-2}} \le \sigma^2 \le \frac{(n-2)\, MS_{Res}}{\chi^2_{1-\alpha/2,\,n-2}} $$
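As a sketch of the σ² interval using the slide's numbers (MS_Res = 22021.01, n = 5): the χ² critical values for 3 degrees of freedom are hard-coded from standard tables, so the endpoints are approximate.

```python
# 95% CI for sigma^2 from MS_Res = 22021.01 with n = 5 (n - 2 = 3 df)
# chi-square critical values for 3 df taken from standard tables
n = 5
ms_res = 22021.01
chi2_upper = 9.3484   # chi^2_{0.025, 3}
chi2_lower = 0.2158   # chi^2_{0.975, 3}

lower = (n - 2) * ms_res / chi2_upper
upper = (n - 2) * ms_res / chi2_lower
print(round(lower, 1), round(upper, 1))
```

Note how wide the interval is: with only 3 degrees of freedom, σ² is very poorly pinned down.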
Remarks:
1) The term linear is used because the model is a linear function of the
unknown parameters β0, β1, …, βk.
2) It is often used as an empirical model or approximating function
(the true functional relationship is unknown).

1) Polynomial models:

$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k $$
The multiple linear regression model in matrix notation:

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} $$

where

$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & & & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} $$

y and ε are vectors of size n×1, β is a vector of size p×1, and X is an n×p matrix, where p = k + 1.
Estimation of the Model Parameters

$$ \frac{\partial S}{\partial \boldsymbol{\beta}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0} $$

Thus, the least-squares estimator of β is

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} $$
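For a single regressor, the estimator (X'X)^{-1}X'y reduces to a 2×2 system that can be solved explicitly; the data below are made up so that the fit is exact.

```python
# Least-squares via the normal equations for a one-regressor model,
# using an explicit 2x2 inverse of X'X (hypothetical data)
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]          # exactly y = 1 + 2x
n = len(x)

# entries of X'X for X = [1, x], and of X'y
sx, sx2 = sum(x), sum(v * v for v in x)
sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))

det = n * sx2 - sx * sx               # determinant of X'X
b0 = (sx2 * sy - sx * sxy) / det      # intercept row of (X'X)^{-1} X'y
b1 = (n * sxy - sx * sy) / det        # slope row
print(b0, b1)
```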
Hypothesis testing

Since

$$ \hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \sim N\!\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right) $$

To test the hypothesis that the slope equals a constant, we have

$$ H_0: \beta_1 = \beta_{10} \qquad H_1: \beta_1 \ne \beta_{10} $$

and similarly for the intercept:

$$ H_0: \beta_0 = \beta_{00} \qquad H_1: \beta_0 \ne \beta_{00} $$
Checking the appropriateness of the model

Corrected total: df = n − 1, sum of squares SS_T.

$$ SS_E = \mathbf{e}'\mathbf{e} = \sum_i (y_i - \hat{y}_i)^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} $$
Test statistic:

$$ F = \frac{MS_R}{MS_E} \sim F_{k,\,n-k-1}; \qquad \text{reject } H_0 \text{ when } F_0 > F_{\alpha;\,k,\,n-k-1} $$

$$ R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T} $$

where SS_Res is the residual or error sum of squares, SS_R is
the regression or model sum of squares, and SS_T is a measure
of the variability in y without considering the effect of the
regressor variables x.
$$ SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SS_T = SS_R + SS_{Res} $$

$$ R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}, \qquad R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)} $$
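A quick sketch of both statistics from a hypothetical fit (p = 2 for an intercept plus one regressor); the y and fitted values are invented.

```python
# R^2 and adjusted R^2 from the sums of squares (hypothetical fit)
y     = [2.0, 4.0, 6.0, 8.0, 10.0]
y_hat = [2.2, 3.8, 6.1, 7.9, 10.0]
p = 2                                   # intercept + one regressor
n = len(y)
ybar = sum(y) / n

ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ss_t   = sum((yi - ybar) ** 2 for yi in y)
r2     = 1 - ss_res / ss_t
r2_adj = 1 - (ss_res / (n - p)) / (ss_t / (n - 1))
print(round(r2, 4), round(r2_adj, 4))
```

Adjusted R² penalizes extra parameters through the n − p divisor, so it is always at most R².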
Unit normal scaling of the regressors and the response:

$$ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \ldots, n,\; j = 1, 2, \ldots, k, \qquad s_j^2 = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n - 1} $$

$$ y_i^{*} = \frac{y_i - \bar{y}}{s_y}, \quad i = 1, 2, \ldots, n, \qquad s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} $$

Unit length scaling:

$$ w_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{S_{jj}}}, \qquad S_{jj} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 $$

$$ y_i^{0} = \frac{y_i - \bar{y}}{\sqrt{SS_T}}, \quad i = 1, 2, \ldots, n $$
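Unit normal scaling of a single column can be sketched as below; the values are hypothetical. After scaling, the column has sample mean 0 and sample variance 1.

```python
import math

# Unit normal scaling of one regressor column (hypothetical values)
xj = [2.0, 4.0, 6.0, 8.0]
n = len(xj)
xbar = sum(xj) / n
s2 = sum((v - xbar) ** 2 for v in xj) / (n - 1)   # sample variance
z = [(v - xbar) / math.sqrt(s2) for v in xj]

print([round(v, 4) for v in z])
```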
The regressors are the columns of the X matrix. So, exact linear
dependence would result in a singular X’X.
• Explore response variable: distribution analysis, outliers treatment, multicollinearity
• Performance analysis: analyse results, compare models, performance assessment (ANOVA, R squares, etc.)
[Residual plot with random scatter: the fitted model is appropriate.]

[Residual plot and normal probability plot for the house-price model: the error variance is nonlinear (curved) and not constant (an outward-opening funnel); House Price has a heavy-tailed distribution.]
> library(readxl)
> Property = read_excel("Property.xlsx", sheet = 2)
> LRmodel = lm(formula = Property$Price ~ Property$Area + Property$Bath +
+     Property$Floor + Property$Bedroom)
> summary(LRmodel)
Call: lm(formula = Property$Price ~ Property$Area + Property$Bath + Property$Floor +
Property$Bedroom)
Residuals:
Min 1Q Median 3Q Max
-12.700 -1.616 0.984 2.510 11.759
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.7633 9.2074 2.038 0.06889 .
Property$Area 6.2698 0.7252 8.645 5.93e-06 ***
Property$Bath 30.2705 6.8487 4.420 0.00129 **
Property$Floor -16.2033 6.2121 -2.608 0.02611 *
Property$Bedroom -2.6730 4.4939 -0.595 0.56519
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.849 on 10 degrees of freedom
Multiple R-squared: 0.9714, Adjusted R-squared: 0.9599
F-statistic: 84.8 on 4 and 10 DF, p-value: 1.128e-07
Interaction terms:
> LRmodel2 = lm(formula = Property$Price ~ Property$Area + Property$Bath +
+     Property$Floor + Property$Bedroom + I(Property$Area*Property$Floor))
> summary(LRmodel2)
Classification of binary outputs

[Plot: Buy? (1 = Yes, 0 = No) against Income]

If hθ(x) ≥ 0.5, predict "y = 1"
If hθ(x) < 0.5, predict "y = 0"

Logistic Regression Model
We want 0 ≤ hθ(x) ≤ 1, so take

$$ h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}} $$

g(z) is the sigmoid (logistic) function; g(0) = 0.5.

Interpretation of Hypothesis Output
hθ(x) = estimated probability that y = 1 for input x.
Suppose we predict "y = 1" if hθ(x) ≥ 0.5, i.e. θᵀx ≥ 0,
and predict "y = 0" if hθ(x) < 0.5, i.e. θᵀx < 0.

Decision Boundary
[Plot: x2 vs x1 with a straight-line boundary crossing both axes at 3]
Predict "y = 1" if −3 + x1 + x2 ≥ 0.

Non-linear decision boundaries
[Plot: x2 vs x1 with a circular boundary of radius 1]
Predict "y = 1" if −1 + x1² + x2² ≥ 0.
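A minimal sketch of the sigmoid hypothesis and the 0.5-threshold rule; the θ values here are chosen arbitrarily to give a simple linear boundary and are not fitted to any data.

```python
import math

# Sigmoid hypothesis and 0.5-threshold classification rule
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # theta[0] is the intercept; predict y = 1 when h >= 0.5,
    # which is equivalent to theta'x >= 0
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1 if sigmoid(z) >= 0.5 else 0

theta = [-3.0, 1.0, 1.0]           # hypothetical: boundary x1 + x2 = 3
print(predict(theta, [1.0, 1.0]))  # below the boundary
print(predict(theta, [3.0, 3.0]))  # above the boundary
```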
Method 1:
The linear predictor for each row is computed in Excel, e.g. =$A$15+A2*$B$15+B2*$C$15
The product of probabilities is H2 × H3 × … × H11 = PRODUCT(H2:H11)
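The same product-of-probabilities step can be sketched in Python; the per-row probabilities below are hypothetical stand-ins for column H. In practice the log-likelihood (a sum of logs) is preferred, because the raw product underflows for large n.

```python
import math

# Likelihood of a logistic model: product of per-row probabilities
# (mirrors PRODUCT(H2:H11) in the Excel sheet; values are hypothetical)
probs = [0.9, 0.8, 0.7, 0.95, 0.6]   # P(observed outcome | model), one per row

likelihood = math.prod(probs)
log_lik = sum(math.log(p) for p in probs)   # numerically safer form
print(round(likelihood, 6))
```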
Method 2
An alternative way to look at this is that the difference between LL1
(the log-likelihood of the current model, with Age and Income) and LL0
(the log-likelihood of the initial constant model, where only the
intercept is used) is not significant.
Confusion matrix (rows: actual Churn, 0/1; columns: predicted FALSE/TRUE):

        FALSE   TRUE
0         425     78
1         219    277
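Grounded in the counts above (425, 78, 219, 277), a short sketch of the usual derived metrics; the interpretation of rows as actual classes and columns as predictions is assumed.

```python
# Metrics from the confusion matrix (rows = actual 0/1, cols = FALSE/TRUE)
tn, fp = 425, 78
fn, tp = 219, 277

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(total, round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```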
ROC Curve
> library(ROCR)
> ROCRpred=prediction(forecast,CustomerChurn$Churn[1:999])
> ROCRperf=performance(ROCRpred,'tpr','fpr')
> plot(ROCRperf,colorize=TRUE,text.adj=c(-0.2,1.7))