
Big Data Analytics:

Data Science (Day 2)


14 – 16 September 2016

Instructor: Dr. Chang Yun Fah


Agenda
Day 1: Introduction to Analytics; Data Exploration; Data Acquisition; Probability Concepts and Statistical Inference
Day 2: Investigating Relationship between Variables; Predictive Modelling using Regression
Day 3: Classification; Performance Evaluation and Validation; Cluster Analysis


Relationship and Predictive Modeling
Investigating Relationship between Variables

 Pearson correlation
 Spearman rank correlation
 Kendall's tau
 Association between categorical variables


 Correlation is a measure of the 'numerical' relationship between two variables.
 A strong positive (negative) relationship means that an increase in the values of one variable will be followed by an increase (decrease) in the values of the other variable.
 IMPORTANT: correlation does not imply a causal relationship (cause and effect).
Covariance and Correlation Coefficient

 Covariance provides a measure of the strength of the correlation between two or more sets of random variates. The sample covariance for two random variates X and Y, each with sample size N, is:

cov(X, Y) = \sum_{i=1}^{N} \frac{(x_i - \bar{x})(y_i - \bar{y})}{N - 1}
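As a quick check in R, the manual sum matches the built-in cov(); a minimal sketch using the five-worker age/income data that appears later in these slides:

x <- c(34, 45, 29, 32, 23)               # age
y <- c(2950, 4000, 2430, 3000, 1790)     # monthly income (RM)
N <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (N - 1)   # manual sample covariance
cov(x, y)                                      # built-in equivalent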
Pearson Correlation (Product Moment)
 Two variables measured on interval, numeric, ratio, percent or ordinal scales.
 Assumptions: (i) the data are randomly sampled and follow a normal distribution; (ii) the two variables are linearly related.
 It determines the direction of the relationship (+ve, -ve) and the strength of the relationship (strong towards 1 or -1, weak towards 0).
The Pearson correlation between two variables, x and y, is a measure of the degree of linear association between the two variables.

f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{y-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{y-\mu_1}{\sigma_1}\right)\!\left(\frac{x-\mu_2}{\sigma_2}\right) + \left(\frac{x-\mu_2}{\sigma_2}\right)^2 \right] \right\}

is the bivariate normal distribution, where μ1 and σ1 are the mean and standard deviation of y, and μ2 and σ2 are the mean and standard deviation of x.

\rho = \frac{E[(y-\mu_1)(x-\mu_2)]}{\sigma_1 \sigma_2} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}

is the population correlation coefficient.
The estimator of ρ is the sample correlation coefficient

r = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \qquad -1 \le r \le 1
Examples of Approximate r Values

[Figure: scatterplots illustrating approximate r values, from strong negative through zero to strong positive]
Types of Various r Values
 The value of r is such that -1 ≤ r ≤ +1. The + and - signs are used for positive and negative linear correlations, respectively.
 Positive correlation: If x and y have a strong positive linear correlation, r is close to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values indicate a relationship between the x and y variables such that as values for x increase, values for y also increase.
 Negative correlation: If x and y have a strong negative linear correlation, r is close to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values indicate a relationship between x and y such that as values for x increase, values for y decrease.
Types of Various r Values
 No correlation: If there is no linear correlation or a weak linear correlation, r is close to 0. A value near zero means that there is a random, nonlinear relationship between the two variables.
 Note that r is a dimensionless quantity.
 A perfect correlation of ±1 occurs only when the data points all lie exactly on a straight line. If r = +1, the slope of this line is positive. If r = -1, the slope of this line is negative.
 A correlation greater than 0.8 is generally described as strong, whereas a correlation less than 0.5 is generally described as weak.
Evans (1996) proposed descriptive labels for ranges of |r| (from "very weak" up to "very strong"). [Table not reproduced here]
Testing the significance of the correlation coefficient

It is useful to test whether the two random variables x and y are correlated:

H_0: \rho = 0
H_1: \rho \neq 0

The appropriate test statistic is

t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{\alpha/2,\,n-2}

General case (large samples):

H_0: \rho = \rho_0
H_1: \rho \neq \rho_0

Z_0 = \frac{\sqrt{n-3}}{2} \left[ \ln\!\left(\frac{1+r}{1-r}\right) - \ln\!\left(\frac{1+\rho_0}{1-\rho_0}\right) \right] \sim Z_{\alpha/2}
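The large-sample test translates directly into a few lines of R. A minimal sketch, assuming a sample correlation r and sample size n; the fisher_z_test name and the illustrative values are ours, not from the slides:

# Fisher z test of H0: rho = rho0 against a two-sided alternative
fisher_z_test <- function(r, n, rho0 = 0) {
  z <- sqrt(n - 3) / 2 * (log((1 + r) / (1 - r)) - log((1 + rho0) / (1 - rho0)))
  c(Z = z, p.value = 2 * pnorm(-abs(z)))   # two-sided p-value
}
fisher_z_test(r = 0.95, n = 15)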
Example: Predicting House Selling Price
 Open the file "House_Price.xlsx"
 Perform:
   Data cleansing
   Correlation study
After cleansing

[Screenshot: cleaned data in Excel, with numbered callouts]

Make sure the data are 'numeric', and the factors are arranged side-by-side.
H_0: \rho = 0 \quad vs. \quad H_1: \rho \neq 0, \qquad Z_0 = \frac{\sqrt{n-3}}{2} \left[ \ln\!\left(\frac{1+r}{1-r}\right) - \ln\!\left(\frac{1+\rho_0}{1-\rho_0}\right) \right] \sim Z_{\alpha/2}

Since |Z| > 1.645, we reject the null hypothesis and conclude that the correlation between Price and Luas_Lot is significantly different from zero.
Spearman Rank Correlation
 It is a nonparametric method.
 An alternative to the Pearson correlation when the assumption of normality or linearity is violated.
 E.g. data presented on an ordinal scale (e.g. a Likert scale) usually do not follow a normal distribution.

r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \qquad d_i = R(X_i) - R(Y_i)

where R(X_i) is the rank of X_i.
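A sketch comparing the rank formula with R's built-in option, using the same five pairs as before (the formula is valid when there are no ties):

x <- c(34, 45, 29, 32, 23)
y <- c(2950, 4000, 2430, 3000, 1790)
d <- rank(x) - rank(y)
n <- length(x)
1 - 6 * sum(d^2) / (n * (n^2 - 1))   # manual r_S (0.9 here)
cor(x, y, method = "spearman")       # built-in equivalent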
Pearson's Correlation:
> library(readxl)
> Property=read_excel("Property.xlsx",sheet=2)
> cor(Property$Price,Property$Area,method="pearson")
* Methods are "pearson" (default), "kendall" and "spearman"

Test for Pearson's Correlation:

> cor.test(Property$Price,Property$Area, method="pearson")

Pearson's product-moment correlation

data: Property$Price and Property$Area
t = 11.142, df = 13, p-value = 5.061e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8567004 0.9840714
sample estimates:
      cor
0.9514249


To compute correlation matrix on all variables:
> cor(Property)
Price Area Bath Floor Bedroom
Price 1.0000000 0.9514249 0.8335615 0.6048192 0.7459561
Area 0.9514249 1.0000000 0.7199549 0.6297639 0.7109191
Bath 0.8335615 0.7199549 1.0000000 0.7596752 0.6751223
Floor 0.6048192 0.6297639 0.7596752 1.0000000 0.3750000
Bedroom 0.7459561 0.7109191 0.6751223 0.3750000 1.0000000

To compute a matrix plot:
> pairs(Property)


Contingency/Cross Table
• Investigating the association between two or more nominal/categorical variables

\chi^2 = \sum_{i=1}^{\#\,cells} \frac{(O_i - E_i)^2}{E_i}

              Variable 2
Variable 1    1      2      ...  c     | Total
1             O11    O12    ...  O1c   | n1.
2             O21    O22    ...  O2c   | n2.
...           ...    ...    ...  ...   | ...
r             Or1    Or2    ...  Orc   | nr.
Total         n.1    n.2    ...  n.c   | N

Null hypothesis (H0): there is no association between the two variables (independent)
Alternative hypothesis (H1): there is an association between the two variables (dependent)
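As a sketch, the same test runs in R with chisq.test(); the counts below are made-up for illustration (a 2x4 Tenure x Rooms table), not taken from the slides:

# Chi-square test of independence on an observed-count table
obs <- matrix(c(30, 45, 25, 20,
                35, 40, 28, 22), nrow = 2, byrow = TRUE,
              dimnames = list(Tenure = c("leasehold", "freehold"),
                              Rooms  = c("2", "3", "4", "5")))
chisq.test(obs)   # reports the chi-square statistic, df and p-value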
Example:
Null hypothesis (H0): there is no association between Tenure and Number of Rooms (independent)
Alternative hypothesis (H1): there is an association between the two variables (dependent)

Take note (define the variable names as TENURE, Bilik):
Tenure has 2 levels: 0 = leasehold, 1 = freehold
Number of Rooms has 4 levels: 2, 3, 4 and 5 rooms

[Screenshots: the data laid out in Excel format and in standard contingency-table format]
Predictive Modeling using Regression
 Simple Linear Regression
 Multivariate Linear Regression
 Dummy variable
 Goodness of fit
 Model checking
 Model selection
 Nonlinear regression and logistic regression
 Multicollinearity and heteroskedasticity issues
Functional relation:
 Perfect fit: all values fall exactly on the straight line.

Y = f(X) = a + bX
Y = a + bX + cZ

Regression model:
• Not a perfect fit: values do not fall exactly on the line.
• Errors exist:

Y = a + bX + errors
Y = a + b_1 X_1 + b_2 X_2 + errors

[Figures: a perfect-fit straight line (Y vs X) and a scatter of points around a line (Miles vs Ringgit)]
Type of relationship:
1. Linear relationship (linear refers to the coefficients/slopes, not to the regressors)

• Simple: one regressor

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

• Multiple: more than one regressor

y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon_i

• Polynomial: one or more regressors with higher degrees, e.g. quadratic, cubic etc.

y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + \varepsilon_i

• Interaction terms:

y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \dots + \varepsilon_i

[Figures: example scatterplots for each case - Miles vs Ringgit; No. Listed Companies vs Date; Size (feet squared) vs Price (RM1000) and Age of home (years); Base Lending Rate vs Federal Funds Rate; 2nd Board Index vs Composite Index]
2. Non-linear relationship in one or more variables: exponential, logarithm, logistic etc. E.g.:

y_i = e^{\beta x_i} + \varepsilon_i

y_i = \frac{\beta x_i}{x_i + \gamma} + \varepsilon_i

y_i = \frac{1}{1 + e^{-\beta x_i}} + \varepsilon_i
Some applications:
1. An economist wants to investigate the relationship between the petrol price and the inflation rate.
2. A sales manager wants to predict next year's total sales based on the number of staff and the square footage of the store.
3. A policy maker wants to identify the main factors (e.g. speed limit, road condition, weather) that contribute to the number of road accidents.
4. A scientist wants to know at which level sound pollution affects human health.
5. A computer scientist wants to compress an image for minimum storage.

Corresponding uses of regression: system control, data description, forecast, data reduction, parameter estimation.
Simple linear regression model
Example:
You want to know if there is a relationship between the monthly personal income and the age of a worker, and then to forecast your monthly income when you are 50 years old. There are five workers in your study.

The two variables are:
1. Age of worker (years): Independent variable (X)
2. Monthly personal income (RM): Dependent variable (Y)

Sample size (n): 5 workers
Data collected from 5 working adults:

Respondent  Age  Personal Income (Monthly)
A           34   2950
B           45   4000
C           29   2430
D           32   3000
E           23   1790

[Figure: scatterplot of Monthly Personal Income vs Age (years)]
Age increases => income increases

[Figure: scatterplot with the fitted line extended to age 50, where the predicted income is RM4565.87]

What is your income when you reach 50 years old?
Mathematical equation for the straight line:

Y = \alpha + \beta X

[Figure: scatterplot with the line Y = α + βX, marking the intercept α, the slope β, and the errors ε1, ε3 as vertical gaps between points and the line]

The gaps between the points and the line are the errors of the model.
The Simple Linear Regression model represents the straight line with errors:

y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \dots, n

where
y_i = dependent/response variable (monthly income)
x_i = independent variable/predictor/regressor (age), a known constant (fixed)
α = intercept of the line at the y-axis, an unknown constant
β = slope of the line, an unknown constant
ε_i = the random error
n = the number of observations/subjects/samples
Assumptions:
1) The error term ε_i is normally distributed with mean E(ε_i) = 0 and constant variance Var(ε_i) = σ²;
2) The errors are uncorrelated: Cov(ε_i, ε_j) = 0 for i ≠ j.

In short, \varepsilon_i \sim NID(0, \sigma^2).

This implies that the dependent variable Y follows a normal distribution with

E(y \mid x) = \alpha + \beta x_i, \qquad Var(y \mid x) = \sigma^2
[Figure: the fitted line E(y|x) = α + βx with the conditional distribution of y drawn at two ages - at x = 23, y is normal with mean α + β(23) and standard deviation σ; at x = 45, y is normal with mean α + β(45) and standard deviation σ]
Simple linear regression: Least squares estimation
Find a line that 'fits' the data best = estimate the values of α and β that minimize the errors.

[Figure: scatterplot of Monthly Personal Income vs Age (years) with candidate fitted lines]
Parameter Estimation of α and β: Least Squares Method

y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \dots, n

Re-write the model as

\varepsilon_i = y_i - \alpha - \beta x_i, \quad i = 1, 2, \dots, n

[Figure: scatterplot showing a negative error -ε as the vertical gap from a point below the line]
To eliminate the negative signs of the error terms, square them:

\varepsilon_i^2 = (y_i - \alpha - \beta x_i)^2, \quad i = 1, 2, \dots, n

Considering the squared errors for the n pairs of sample data, we obtain the error sum of squares:

\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2
Label the error sum of squares as S(α, β), called the LS criterion:

S(\alpha, \beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2

To minimize the LS criterion, we differentiate S(α, β) with respect to α and β and set the derivatives to zero:

\frac{\partial S}{\partial \alpha}\Big|_{\hat{\alpha}, \hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0 \quad (1)

\frac{\partial S}{\partial \beta}\Big|_{\hat{\alpha}, \hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) x_i = 0 \quad (2)
Result: The LS estimators of α and β are

\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}

Thus, the fitted simple linear regression model is

\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i

[Figure: scatterplot of Monthly Personal Income vs Age with the fitted line]
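A minimal R sketch reproducing these estimators by hand with the five-worker data, then checking against lm(); the numbers match the worked example on the next slides:

age    <- c(34, 45, 29, 32, 23)
income <- c(2950, 4000, 2430, 3000, 1790)
n <- length(age)
beta_hat  <- (sum(age * income) - n * mean(age) * mean(income)) /
             (sum(age^2) - n * mean(age)^2)          # 99.5329
alpha_hat <- mean(income) - beta_hat * mean(age)     # -410.7725
coef(lm(income ~ age))                               # built-in equivalent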
Re-visit the example:

Respondent  Age (x)  Income (y)  xy      x²
A           34       2950        100300  1156
B           45       4000        180000  2025
C           29       2430        70470   841
D           32       3000        96000   1024
E           23       1790        41170   529
Sum         163      14170       487940  5575

\sum x_i = 163, \quad \sum y_i = 14170, \quad \sum x_i y_i = 487940, \quad \sum x_i^2 = 5575, \quad \bar{x} = 32.6, \quad \bar{y} = 2834

\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{487940 - 5(32.6)(2834)}{5575 - 5(32.6)^2} = 99.53
\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 2834 - 99.53(32.6) = -410.77

Thus, the fitted SLR model is

\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i = -410.77 + 99.53 x_i

[Figure: scatterplot with the fitted line]
 Suppose you constructed the SLR model using n pairs of sample data (x1,y1), (x2,y2), …, (xn,yn) and the range of x is [a,b].

 Two types of forecasting:
   Extrapolation: predict the value of y using an x outside the range [a,b].
   Interpolation: predict the value of y using an x inside the range [a,b].
Re-visit the example: the observed ages run from 23 to 45, so the age range used is [23, 45].
At age 50 (x = 50), your predicted monthly income is

\hat{y}_i = -410.7725 + 99.5329(50) = 4565.87

[Figure: fitted line extended to x = 50, marking the predicted income RM4565.87]

Extrapolation!
At age 30 (x = 30), your predicted monthly income is

\hat{y}_i = -410.7725 + 99.5329(30) = 2575.21

[Figure: fitted line at x = 30, marking the predicted income RM2575.21]

Interpolation!
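Continuing the R sketch from the least-squares slide, predict() reproduces both forecasts; note that only x = 30 lies inside [23, 45] and is interpolation:

fit <- lm(income ~ age)   # age and income from the earlier sketch
predict(fit, newdata = data.frame(age = c(30, 50)))   # ~2575.21 and ~4565.87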
Simple linear regression: Interval estimation
Estimation of the Variance σ²
Method 1: based on several observations (replication) on y for at least one value of x.
Method 2: when prior information concerning σ² is available.
Method 3: estimate based on the residual or error sum of squares:

SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

\hat{\sigma}^2 = \frac{SS_{Res}}{n-2} = MS_{Res}

This unbiased estimator of σ² is called the Residual Mean Square, and its square root is called the standard error of regression.
\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i = -410.77 + 99.53 x_i

Respondent  Age (x)  Income (y)  Fitted ŷ   Residual e   e²
A           34       2950        2973.346   -23.3461     545.0404
B           45       4000        4068.208   -68.208      4652.331
C           29       2430        2475.682   -45.6816     2086.809
D           32       3000        2774.28    225.7197     50949.38
E           23       1790        1878.484   -88.4842     7829.454
Sum         163      14170                               66063.02

SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 66063.02

\hat{\sigma}^2 = MS_{Res} = \frac{SS_{Res}}{n-2} = \frac{66063.02}{3} = 22021.01
Interval Estimation in Simple Linear Regression:
If the errors are NID, then the 100(1 - α)% confidence intervals for β1, β0 and σ² are

\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1)

\hat{\beta}_0 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0)

\frac{(n-2)\, MS_{Res}}{\chi^2_{\alpha/2,\,n-2}} \le \sigma^2 \le \frac{(n-2)\, MS_{Res}}{\chi^2_{1-\alpha/2,\,n-2}}
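In R the coefficient intervals come from confint(); a sketch continuing the five-worker example:

fit <- lm(income ~ age)        # data from the earlier sketch
confint(fit, level = 0.95)     # 95% CIs for the intercept and slope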
Multivariate linear regression
The sample multiple linear regression model with k regressor or predictor variables is

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \dots, n

where the βj are called the (partial) regression coefficients; βj represents the expected change in the response y per unit change in xj when all the remaining regressor variables are held constant.

Remarks:
1) The term linear is used because the model is a linear function of the unknown parameters β0, β1, …, βk.
2) It is often used as an empirical model or approximating function (the true functional relationship is unknown).
Examples of multiple linear regression models:

1) Polynomial models: y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + \varepsilon

2) Models with interaction effects: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \varepsilon

3) Second-order model with interaction: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon

How do we convert these models to the general multiple linear regression form?
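Because each example is linear in the parameters, all three can be fitted with lm() by defining new regressors. A sketch with made-up data (the variable names and values are illustrative, not from the slides):

set.seed(42)
x1 <- runif(20); x2 <- runif(20)
y  <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(20, sd = 0.1)
lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))   # second-order model with interaction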
Matrix form of the multiple linear regression model

It is more convenient to deal with the multiple linear regression model in matrix form:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \dots, n

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}

where

\mathbf{y} = (y_1, y_2, \dots, y_n)', \quad \boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_k)', \quad \boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)'

\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix}

y and ε are vectors of size n×1, β is a vector of size p×1 where p = k + 1, and X is an n×p matrix.
Estimation of the Model Parameters

A) Least Squares Estimation (LSE) of the Regression Coefficients

S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{y}'\mathbf{y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}

Differentiating the least squares criterion with respect to β and setting it to zero yields

\frac{\partial S}{\partial \boldsymbol{\beta}}\Big|_{\hat{\boldsymbol{\beta}}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}

Thus, the least-squares estimator of β is

\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
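A sketch computing the normal-equations solution directly and checking it against lm(), with simulated data (illustrative, not from the slides):

set.seed(1)
X <- cbind(1, x1 = runif(20), x2 = runif(20))        # n x p design matrix
y <- drop(X %*% c(2, 3, -1)) + rnorm(20, sd = 0.2)
solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
coef(lm(y ~ X[, -1]))           # same estimates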
Selection of Variables: Hypothesis testing
Hypothesis Testing on the Parameters

Assumption: the errors ε_i are normally distributed, \varepsilon_i \sim NID(0, \sigma^2).

This implies that y_i \sim NID(\beta_0 + \beta_1 x_i, \sigma^2).

Since \hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \sim NID\!\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right),

to test the hypothesis that the slope equals a constant, we have

H_0: \beta_1 = \beta_{10}
H_1: \beta_1 \neq \beta_{10}
and the test statistic is

Z_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1)

Typically σ² is unknown and the unbiased estimator MS_{Res} is used. Then, the test statistic becomes

t_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{MS_{Res} / S_{xx}}} \sim t_{\alpha/2,\,n-2}

We reject the null hypothesis if |t_0| > t_{\alpha/2,\,n-2}.
se(\hat{\beta}_1) = \sqrt{\frac{MS_{Res}}{S_{xx}}}

is called the (estimated) standard error.

To test the hypothesis that the intercept equals a constant, we have

H_0: \beta_0 = \beta_{00}
H_1: \beta_0 \neq \beta_{00}
and the test statistic is

t_0 = \frac{\hat{\beta}_0 - \beta_{00}}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MS_{Res}\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right)}} \sim t_{\alpha/2,\,n-2}

We reject the null hypothesis if |t_0| > t_{\alpha/2,\,n-2}.
Is the model significant?

Checking the appropriateness of the model
Special Case: Testing Significance of Regression

Testing: (i) t-statistic, or (ii) analysis of variance (ANOVA).

1) Accepting the null hypothesis means there is no linear relationship between x and y: either (i) x is of little value in explaining the variation of y, or (ii) the true relationship between x and y is not linear.
2) Rejecting the null hypothesis means that x is of value in explaining the variability of y: either the straight-line model is adequate, or better results could be obtained by adding higher-order polynomial terms in x.
ANOVA : Analysis of Variance
 The reason for doing an ANOVA is to see if there is
any difference between groups on some variable. For
example, you might have data on student performance
in non-assessed tutorial exercises as well as their final
grading. You are interested in seeing if tutorial
performance is related to final grade. ANOVA allows
you to break up the group according to the grade and
then see if performance is different across these
grades. ANOVA is available for both parametric (score
data) and non-parametric (ranking/ordering) data.

ANOVA : Analysis of Variance
 A typical ANOVA table looks like:

Source            DF     Sum of Squares  Mean Square  F*
Regression        k      SS_R            MS_R         F = MS_R / MS_E
Error             n-k-1  SS_E            MS_E
Corrected Total   n-1    SS_T

SS_E = \mathbf{e}'\mathbf{e} = \sum (y_i - \hat{y}_i)^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}
H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0
H_1: \beta_j \neq 0 \text{ for at least one } j

Test statistic:

F = \frac{MS_R}{MS_E} \sim F_{k,\,n-k-1}; \quad \text{reject } H_0 \text{ when } F_0 > F_{\alpha;\,k,\,n-k-1}

Rejecting the null hypothesis implies that at least one of the explanatory variables contributes significantly to the model.
Coefficient of Determination:

R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}

where SS_{Res} is the residual or error sum of squares, SS_R is the regression or model sum of squares, and SS_T is a measure of the variability in y without considering the effect of the regressor variables x.

SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad SS_T = SS_R + SS_{Res}

R² measures the proportion of variation explained by the regressor x.
B) Assessing the overall adequacy using R² and adjusted R²

R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}, \qquad R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)}

In general, R² always increases when a new regressor is added to the model, regardless of its contribution to the model.

Since SS_{Res}/(n-p) is the residual mean square and SS_T/(n-1) is constant regardless of how many variables are in the model, the adjusted R² will only increase on adding a variable to the model if the addition of the variable reduces the residual mean square.
STANDARDIZED REGRESSION COEFFICIENTS

Sometimes it is difficult to compare regression coefficients directly because the magnitudes of the βs differ. We use standardized regression coefficients to scale the variables and produce dimensionless coefficients.

A) Unit Normal Scaling

z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \dots, n; \; j = 1, 2, \dots, k, \qquad s_j^2 = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n-1}

y_i^* = \frac{y_i - \bar{y}}{s_y}, \quad i = 1, 2, \dots, n, \qquad s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}
Using these new variables, the regression model becomes

y_i^* = b_1 z_{i1} + b_2 z_{i2} + \dots + b_k z_{ik} + \varepsilon_i, \quad i = 1, 2, \dots, n

The LSE of b_0 is 0, and \hat{\mathbf{b}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}^*.

B) Unit Length Scaling

w_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{S_{jj}}}, \quad i = 1, 2, \dots, n; \; j = 1, 2, \dots, k, \qquad S_{jj} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2

y_i^0 = \frac{y_i - \bar{y}}{\sqrt{SS_T}}, \quad i = 1, 2, \dots, n
MULTICOLLINEARITY (Near-Linear Dependence)

Multicollinearity may dramatically impact the usefulness of a regression model.

The regressors are the columns of the X matrix, so exact linear dependence would result in a singular X'X.

The variance inflation factor (VIF) is an important multicollinearity diagnostic. It takes the main diagonal elements of (X'X)^{-1} in correlation form, or of (W'W)^{-1}.

The VIF for the jth regression coefficient can be written as

VIF_j = \frac{1}{1 - R_j^2}

where R_j² is the coefficient of multiple determination obtained from regressing x_j on the other regressor variables.
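A sketch computing VIF_j from this definition by regressing each x_j on the others; x2 is built to be nearly collinear with x1, so both should show inflated VIFs (simulated data, and the vif_j helper is ours):

set.seed(2)
x1 <- rnorm(50); x2 <- x1 + rnorm(50, sd = 0.3); x3 <- rnorm(50)
X  <- cbind(x1, x2, x3)
vif_j <- function(j) 1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared)
sapply(1:3, vif_j)   # VIF for each regressor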
Some Considerations in the Use of Regression:
1) Regression is intended as an interpolation model over the range of the regressor variables. We must be careful if we extrapolate outside of this range.
2) The disposition of the x values plays an important role in the least squares fit.
3) Outliers can seriously disturb the least squares fit.

4) When a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in any causal sense (cause and effect).

5) In some applications, the value of the regressor variable x required to predict y is unknown. Thus, to predict y, we must first predict x. The accuracy of the prediction of y then depends on the accuracy of the prediction of x.
Modelling Process

Variable Identification:
• Identify response variable
• Identify explanatory variables
• Variable categorization (numeric, categorical, discrete, continuous etc.)
• Create data dictionary

Explore response variable:
• Distribution analysis
• Outliers treatment
• Multicollinearity

Explore independent variables:
• Identify prospective explanatory variables that can explain the response variable
• Bivariate analysis between response and explanatory variables
• Variable treatment (transformation, grouping etc.)
• Heteroskedasticity
• Data cleaning
Modelling Process

Fitting model:
• Select appropriate model (linear, nonlinear, interaction etc.)
• Fit model
• Model (variable) selection
• Checking assumptions (normality, constant variance etc.)

Performance analysis:
• Analyse results
• Compare models
• Performance assessment (ANOVA, R squared, etc.)

Validation and Execution:
• Validate the model using new observations (accuracy, specification etc.)
• Repeat the process if needed
• Execute the model
Example: Predicting House Selling Price
 Open the file "House_Price.xlsx"
 Conduct analysis:
   Fit the models
   Construct ANOVA
   Construct CI
   Model selection
Using Excel for Linear Regression

[Screenshot: Excel regression setup, with numbered steps 1-5]
Output

[Screenshot: Excel regression output]

The fitted model explained 74% of the total variability in house price. The model is reasonably good and no important factor is excluded.

The fitted model is appropriate.

Which factors are significant (p-value < alpha)? Luas_Lot, TENURE, Luas, CBD distance, Shopping Mall distance.
Checking Constant Error Variance; Checking Normality

[Figures: residual plot of standard residuals vs predicted price, and normal probability plot of PRICE vs sample percentile]

The error variance is non-linear (curved pattern) and not constant (an outward-opening funnel). House Price has a heavy-tailed distribution.
> library(readxl)
> Property=read_excel("Property.xlsx",sheet=2)
> LRmodel=lm(formula=Property$Price~Property$Area+Property$Bath+Property$Floor+Property$Bedroom)
> summary(LRmodel)

Call: lm(formula = Property$Price ~ Property$Area + Property$Bath + Property$Floor + Property$Bedroom)

Residuals:
     Min       1Q   Median       3Q      Max
 -12.700   -1.616    0.984    2.510   11.759

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        18.7633     9.2074   2.038  0.06889 .
Property$Area       6.2698     0.7252   8.645 5.93e-06 ***
Property$Bath      30.2705     6.8487   4.420  0.00129 **
Property$Floor    -16.2033     6.2121  -2.608  0.02611 *
Property$Bedroom   -2.6730     4.4939  -0.595  0.56519
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.849 on 10 degrees of freedom
Multiple R-squared: 0.9714, Adjusted R-squared: 0.9599
F-statistic: 84.8 on 4 and 10 DF, p-value: 1.128e-07


> anv=anova(LRmodel)

Analysis of Variance Table


Response: Property$Price
Df Sum Sq Mean Sq F value Pr(>F)
Property$Area 1 14829.3 14829.3 316.1025 6.76e-09 ***
Property$Bath 1 750.8 750.8 16.0046 0.002516 **
Property$Floor 1 316.3 316.3 6.7428 0.026642 *
Property$Bedroom 1 16.6 16.6 0.3538 0.565189
Residuals 10 469.1 46.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction terms:
> LRmodel2=lm(formula=Property$Price~Property$Area+Property$Bath+Property$Floor+Property$Bedroom+I(Property$Area*Property$Floor))
> summary(LRmodel2)


Logistic Regression: Classification of Binary Outputs
Classification

Email: Spam / Not Spam?


Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign ?
Customer: Buy / Don’t buy ?

0: “Negative Class” (e.g., don’t buy)


1: “Positive Class” (e.g., buy)
Why Logistic Regression, not Linear Regression?

[Figure: Buy? (1 = Yes, 0 = No) plotted against Income, with a straight-line fit; a linear fit can produce outputs below 0 or above 1]

Threshold the classifier output h(x) at 0.5:
If h(x) ≥ 0.5, predict "y = 1"
If h(x) < 0.5, predict "y = 0"
Logistic Regression Model

We want 0 ≤ h(x) ≤ 1, so we pass the linear combination through the sigmoid (logistic) function:

h(x) = \frac{1}{1 + e^{-\beta x}}

[Figure: the sigmoid function, rising from 0 towards 1 and passing through 0.5 at βx = 0]
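A two-line R sketch of the sigmoid used as the hypothesis:

sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(0)                          # 0.5 at z = 0
curve(sigmoid, from = -6, to = 6)   # S-shaped curve between 0 and 1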
Interpretation of Hypothesis Output

h(x) = estimated probability that y = 1 on input x.

Example: If h(x) = 0.7, tell the manager that there is a 70% chance the customer will buy.

Equivalently, h(x) = P(y = 1 | x; β), the "probability that y = 1, given x, parameterized by β".
Logistic regression:
Suppose we predict "y = 1" if h(x) ≥ 0.5, which happens when βx ≥ 0,
and predict "y = 0" if h(x) < 0.5, which happens when βx < 0.
Decision Boundary

[Figure: two classes in the (x1, x2) plane separated by a straight line; we predict "y = 1" on one side of this linear boundary]
Non-linear decision boundaries

[Figure: two classes in the (x1, x2) plane separated by a circle of radius 1; we predict "y = 1" outside this non-linear boundary]
Method 1:

Given the age and income of a customer, we want to predict the buying decision of this customer.

Simply assign starting values to the coefficients: intercept = 1, beta1 = 0 = beta2.

In Excel: =$A$15+A2*$B$15+B2*$C$15

The product of probabilities is H2 x H3 x ... x H11 = PRODUCT(H2:H11).

Our objective is to find the optimum parameters (intercept, beta1, beta2) that maximize the log likelihood (towards zero). This objective can be achieved using the Solver function in Excel!

[Screenshots: Solver setup, steps 1-7. Beware, you will get an error message in this example!]
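The same maximum-likelihood idea in R, with optim() playing the role of Solver; the data are made-up and the negloglik helper is ours, with glm() as the reference answer:

set.seed(3)
age    <- runif(30, 20, 60)
income <- runif(30, 2, 10)
buy    <- rbinom(30, 1, 0.5)
negloglik <- function(b) {          # negative log-likelihood of the logistic model
  p <- 1 / (1 + exp(-(b[1] + b[2] * age + b[3] * income)))
  -sum(buy * log(p) + (1 - buy) * log(1 - p))
}
optim(c(1, 0, 0), negloglik)$par                   # (intercept, beta1, beta2)
coef(glm(buy ~ age + income, family = binomial))   # reference fit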
Method 2:

[Screenshots: logistic regression fitted with statistical software, showing coefficient estimates and log-likelihoods]
An alternative way to look at this is that the difference between LL1 (the log-likelihood of the current model with Age and Income) and LL0 (the log-likelihood of the initial constant model, where only the intercept is used) is not significant.

The Chi-square goodness-of-fit test also indicates that the model with Age and Income is not significant.
ROC and AUC
The receiver operating characteristic (ROC) curve is constructed by plotting the probability of detecting the signal of interest (sensitivity) against the probability of getting a false signal (1 - specificity) for a range of possible cutoff values. The area under the ROC curve (AUC) lies between zero and one and measures the ability of the model to discriminate between observations that will lead to the response of interest and those that will not.

The model with Age and Income has an acceptable discrimination of buying or not buying, with an overall AUC of 0.72.

You may try different cutoff values to compare the AUC values.


Logistic Regression in R
# Change working directory
> setwd("C:/Users/user/Desktop/DataScieneUUM/Rcode/")

# Call library and read the Excel file "CustomerChurn.xlsx"
> library(readxl)
> CustomerChurn=read_excel("CustomerChurn.xlsx")

# Split data into a training set and a testing set
> sample_size=floor(0.7*nrow(CustomerChurn))
> set.seed(123)
> bahagi=sample(seq_len(nrow(CustomerChurn)),size=sample_size)
> train=CustomerChurn[bahagi, ]
> test=CustomerChurn[-bahagi, ]


Construct Logistic Regression
# Use bare column names with the data argument; the CustomerChurn$... form
# would ignore data = train and fit the model on the full data set.
> model = glm(Churn ~ Age + Sex + Payment, data = train, family = binomial)
> summary(model)


Confusion Matrix
# type = "response" returns predicted probabilities, so the 0.5 cutoff
# below applies on the probability scale
> forecast=predict.glm(model,type="response")
> table(CustomerChurn$Churn[1:999],forecast>0.5)

    FALSE TRUE
  0   425   78
  1   219  277

ROC Curve
> library(ROCR)
> ROCRpred=prediction(forecast,CustomerChurn$Churn[1:999])
> ROCRperf=performance(ROCRpred,'tpr','fpr')
> plot(ROCRperf,colorize=TRUE,text.adj=c(-0.2,1.7))
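The AUC can be read from the same ROCR objects; a sketch assuming ROCRpred from the code above:

auc <- performance(ROCRpred, "auc")
auc@y.values[[1]]   # area under the ROC curve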
