
Big Data Analytics:

Data Science (Day 2)


14 – 16 September 2016

Instructor: Dr. Chang Yun Fah


Agenda
Day 1: Introduction to Analytics; Data Exploration; Data Acquisition; Probability Concepts and Statistical Inference
Day 2: Investigating Relationship between Variables; Predictive Modelling using Regression
Day 3: Classification; Performance Evaluation and Validation; Cluster Analysis


Relationship and Predictive Modeling
Investigating Relationship between Variables

 Pearson correlation
 Spearman rank correlation
 Kendall's tau
 Association between categorical variables


 Correlation is a measure of the 'numerical' relationship between two variables.
 A strong positive (negative) relationship means that an increase in the values of one variable will be followed by an increase (decrease) in the values of the other variable.
 IMPORTANT: correlation does not imply a causal relationship (cause and effect).
Covariance and Correlation Coefficient

 Covariance provides a measure of the strength of the correlation between two or more sets of random variates. The sample covariance for two random variates X and Y, each with sample size N, is:

cov(X, Y) = \sum_{i=1}^{N} \frac{(x_i - \bar{x})(y_i - \bar{y})}{N - 1}
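As a quick check in R, the manual sum matches the built-in cov(); a minimal sketch using the five-worker age/income data that appears later in these slides:

x <- c(34, 45, 29, 32, 23)               # age
y <- c(2950, 4000, 2430, 3000, 1790)     # monthly income (RM)
N <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (N - 1)   # manual sample covariance
cov(x, y)                                      # built-in equivalent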
Pearson Correlation (Product Moment)
 Two variables measured on interval, numeric, ratio, percent or ordinal scales.
 Assumptions: (i) the data are randomly sampled and follow a normal distribution; (ii) the two variables are linearly related.
 It determines the direction of the relationship (+ve, -ve) and the strength of the relationship (strong towards 1 or -1, weak towards 0).
The Pearson correlation between two variables, x and y, is a measure of the degree of linear association between the two variables.

f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{y-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{y-\mu_1}{\sigma_1}\right)\!\left(\frac{x-\mu_2}{\sigma_2}\right) + \left(\frac{x-\mu_2}{\sigma_2}\right)^2 \right] \right\}

is the bivariate normal distribution, where μ1 and σ1 are the mean and standard deviation of y, and μ2 and σ2 are the mean and standard deviation of x.

\rho = \frac{E[(y-\mu_1)(x-\mu_2)]}{\sigma_1 \sigma_2} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}

is the population correlation coefficient.
The estimator of ρ is the sample correlation coefficient

r = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \qquad -1 \le r \le 1
Examples of Approximate r Values

[Figure: scatterplots illustrating approximate r values, from strong negative through zero to strong positive]
Types of Various r Values
 The value of r is such that -1 ≤ r ≤ +1. The + and - signs are used for positive and negative linear correlations, respectively.
 Positive correlation: If x and y have a strong positive linear correlation, r is close to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values indicate a relationship between the x and y variables such that as values for x increase, values for y also increase.
 Negative correlation: If x and y have a strong negative linear correlation, r is close to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values indicate a relationship between x and y such that as values for x increase, values for y decrease.
Types of Various r Values
 No correlation: If there is no linear correlation or a weak linear correlation, r is close to 0. A value near zero means that there is a random, nonlinear relationship between the two variables.
 Note that r is a dimensionless quantity.
 A perfect correlation of ±1 occurs only when the data points all lie exactly on a straight line. If r = +1, the slope of this line is positive. If r = -1, the slope of this line is negative.
 A correlation greater than 0.8 is generally described as strong, whereas a correlation less than 0.5 is generally described as weak.
Evans (1996) proposed descriptive labels for ranges of |r| (from "very weak" up to "very strong"). [Table not reproduced here]
Testing the significance of the correlation coefficient

It is useful to test whether the two random variables x and y are correlated:

H_0: \rho = 0
H_1: \rho \neq 0

The appropriate test statistic is

t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{\alpha/2,\,n-2}

General case (large samples):

H_0: \rho = \rho_0
H_1: \rho \neq \rho_0

Z_0 = \frac{\sqrt{n-3}}{2} \left[ \ln\!\left(\frac{1+r}{1-r}\right) - \ln\!\left(\frac{1+\rho_0}{1-\rho_0}\right) \right] \sim Z_{\alpha/2}
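The large-sample test translates directly into a few lines of R. A minimal sketch, assuming a sample correlation r and sample size n; the fisher_z_test name and the illustrative values are ours, not from the slides:

# Fisher z test of H0: rho = rho0 against a two-sided alternative
fisher_z_test <- function(r, n, rho0 = 0) {
  z <- sqrt(n - 3) / 2 * (log((1 + r) / (1 - r)) - log((1 + rho0) / (1 - rho0)))
  c(Z = z, p.value = 2 * pnorm(-abs(z)))   # two-sided p-value
}
fisher_z_test(r = 0.95, n = 15)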
Example: Predicting House Selling Price
 Open the file "House_Price.xlsx"
 Perform:
   Data cleansing
   Correlation study
After cleansing

[Screenshot: cleaned data in Excel, with numbered callouts]

Make sure the data are 'numeric', and the factors are arranged side-by-side.
H_0: \rho = 0 \quad vs. \quad H_1: \rho \neq 0, \qquad Z_0 = \frac{\sqrt{n-3}}{2} \left[ \ln\!\left(\frac{1+r}{1-r}\right) - \ln\!\left(\frac{1+\rho_0}{1-\rho_0}\right) \right] \sim Z_{\alpha/2}

Since |Z| > 1.645, we reject the null hypothesis and conclude that the correlation between Price and Luas_Lot is significantly different from zero.
Spearman Rank Correlation
 It is a nonparametric method.
 An alternative to the Pearson correlation when the assumption of normality or linearity is violated.
 E.g. data presented on an ordinal scale (e.g. a Likert scale) usually do not follow a normal distribution.

r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \qquad d_i = R(X_i) - R(Y_i)

where R(X_i) is the rank of X_i.
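A sketch comparing the rank formula with R's built-in option, using the same five pairs as before (the formula is valid when there are no ties):

x <- c(34, 45, 29, 32, 23)
y <- c(2950, 4000, 2430, 3000, 1790)
d <- rank(x) - rank(y)
n <- length(x)
1 - 6 * sum(d^2) / (n * (n^2 - 1))   # manual r_S (0.9 here)
cor(x, y, method = "spearman")       # built-in equivalent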
Pearson's Correlation:
> library(readxl)
> Property=read_excel("Property.xlsx",sheet=2)
> cor(Property$Price,Property$Area,method="pearson")
* Methods are "pearson" (default), "kendall" and "spearman"

Test for Pearson's Correlation:

> cor.test(Property$Price,Property$Area, method="pearson")

Pearson's product-moment correlation

data: Property$Price and Property$Area
t = 11.142, df = 13, p-value = 5.061e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8567004 0.9840714
sample estimates:
      cor
0.9514249


To compute correlation matrix on all variables:
> cor(Property)
Price Area Bath Floor Bedroom
Price 1.0000000 0.9514249 0.8335615 0.6048192 0.7459561
Area 0.9514249 1.0000000 0.7199549 0.6297639 0.7109191
Bath 0.8335615 0.7199549 1.0000000 0.7596752 0.6751223
Floor 0.6048192 0.6297639 0.7596752 1.0000000 0.3750000
Bedroom 0.7459561 0.7109191 0.6751223 0.3750000 1.0000000

To compute a matrix plot:
> pairs(Property)


Contingency/Cross Table
• Investigating the association between two or more nominal/categorical variables

\chi^2 = \sum_{i=1}^{\#\,cells} \frac{(O_i - E_i)^2}{E_i}

              Variable 2
Variable 1    1      2      ...  c     | Total
1             O11    O12    ...  O1c   | n1.
2             O21    O22    ...  O2c   | n2.
...           ...    ...    ...  ...   | ...
r             Or1    Or2    ...  Orc   | nr.
Total         n.1    n.2    ...  n.c   | N

Null hypothesis (H0): there is no association between the two variables (independent)
Alternative hypothesis (H1): there is an association between the two variables (dependent)
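As a sketch, the same test runs in R with chisq.test(); the counts below are made-up for illustration (a 2x4 Tenure x Rooms table), not taken from the slides:

# Chi-square test of independence on an observed-count table
obs <- matrix(c(30, 45, 25, 20,
                35, 40, 28, 22), nrow = 2, byrow = TRUE,
              dimnames = list(Tenure = c("leasehold", "freehold"),
                              Rooms  = c("2", "3", "4", "5")))
chisq.test(obs)   # reports the chi-square statistic, df and p-value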
Example:
Null hypothesis (H0): there is no association between Tenure and Number of Rooms (independent)
Alternative hypothesis (H1): there is an association between the two variables (dependent)

Take note (define the variable names as TENURE, Bilik):
Tenure has 2 levels: 0 = leasehold, 1 = freehold
Number of Rooms has 4 levels: 2, 3, 4 and 5 rooms

[Screenshots: the data laid out in Excel format and in standard contingency-table format]
Predictive Modeling using Regression
 Simple Linear Regression
 Multivariate Linear Regression
 Dummy variable
 Goodness of fit
 Model checking
 Model selection
 Nonlinear regression and logistic regression
 Multicollinearity and heteroskedasticity issues
Functional relation:
 Perfect fit: all values fall exactly on the straight line.

Y = f(X) = a + bX
Y = a + bX + cZ

Regression model:
• Not a perfect fit: values do not fall exactly on the line.
• Errors exist:

Y = a + bX + errors
Y = a + b_1 X_1 + b_2 X_2 + errors

[Figures: a perfect-fit straight line (Y vs X) and a scatter of points around a line (Miles vs Ringgit)]
Type of relationship:
1. Linear relationship (linear refers to the coefficients/slopes, not to the regressors)

• Simple: one regressor

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

• Multiple: more than one regressor

y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon_i

• Polynomial: one or more regressors with higher degrees, e.g. quadratic, cubic etc.

y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + \varepsilon_i

• Interaction terms:

y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \dots + \varepsilon_i

[Figures: example scatterplots for each case - Miles vs Ringgit; No. Listed Companies vs Date; Size (feet squared) vs Price (RM1000) and Age of home (years); Base Lending Rate vs Federal Funds Rate; 2nd Board Index vs Composite Index]
2. Non-linear relationship in one or more variables: exponential, logarithm, logistic etc. E.g.:

y_i = e^{\beta x_i} + \varepsilon_i

y_i = \frac{\beta x_i}{x_i + \gamma} + \varepsilon_i

y_i = \frac{1}{1 + e^{-\beta x_i}} + \varepsilon_i
Some applications:
1. An economist wants to investigate the relationship between the petrol price and the inflation rate.
2. A sales manager wants to predict next year's total sales based on the number of staff and the square footage of the store.
3. A policy maker wants to identify the main factors (e.g. speed limit, road condition, weather) that contribute to the number of road accidents.
4. A scientist wants to know at which level sound pollution affects human health.
5. A computer scientist wants to compress an image for minimum storage.

Corresponding uses of regression: system control, data description, forecast, data reduction, parameter estimation.
Simple linear regression model
Example:
You want to know if there is a relationship between the monthly personal income and the age of a worker, and then to forecast your monthly income when you are 50 years old. There are five workers in your study.

The two variables are:
1. Age of worker (years): Independent variable (X)
2. Monthly personal income (RM): Dependent variable (Y)

Sample size (n): 5 workers
Data collected from 5 working adults:

Respondent  Age  Personal Income (Monthly)
A           34   2950
B           45   4000
C           29   2430
D           32   3000
E           23   1790

[Figure: scatterplot of Monthly Personal Income vs Age (years)]
Age increases => income increases

[Figure: scatterplot with the fitted line extended to age 50, where the predicted income is RM4565.87]

What is your income when you reach 50 years old?
Mathematical equation for the straight line:

Y = \alpha + \beta X

[Figure: scatterplot with the line Y = α + βX, marking the intercept α, the slope β, and the errors ε1, ε3 as vertical gaps between points and the line]

The gaps between the points and the line are the errors of the model.
The Simple Linear Regression model represents the straight line with errors:

y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \dots, n

where
y_i = dependent/response variable (monthly income)
x_i = independent variable/predictor/regressor (age), a known constant (fixed)
α = intercept of the line at the y-axis, an unknown constant
β = slope of the line, an unknown constant
ε_i = the random error
n = the number of observations/subjects/samples
Assumptions:
1) The error term ε_i is normally distributed with mean E(ε_i) = 0 and constant variance Var(ε_i) = σ²;
2) The errors are uncorrelated: Cov(ε_i, ε_j) = 0 for i ≠ j.

In short, \varepsilon_i \sim NID(0, \sigma^2).

This implies that the dependent variable Y follows a normal distribution with

E(y \mid x) = \alpha + \beta x_i, \qquad Var(y \mid x) = \sigma^2
[Figure: the fitted line E(y|x) = α + βx with the conditional distribution of y drawn at two ages - at x = 23, y is normal with mean α + β(23) and standard deviation σ; at x = 45, y is normal with mean α + β(45) and standard deviation σ]
Simple linear regression: Least squares estimation
Find a line that 'fits' the data best = estimate the values of α and β that minimize the errors.

[Figure: scatterplot of Monthly Personal Income vs Age (years) with candidate fitted lines]
Parameter Estimation of α and β: Least Squares Method

y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \dots, n

Re-write the model as

\varepsilon_i = y_i - \alpha - \beta x_i, \quad i = 1, 2, \dots, n

[Figure: scatterplot showing a negative error -ε as the vertical gap from a point below the line]
To eliminate the negative signs of the error terms, square them:

\varepsilon_i^2 = (y_i - \alpha - \beta x_i)^2, \quad i = 1, 2, \dots, n

Considering the squared errors for the n pairs of sample data, we obtain the error sum of squares:

\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2
Label the error sum of squares as S(α, β), called the LS criterion:

S(\alpha, \beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2

To minimize the LS criterion, we differentiate S(α, β) with respect to α and β and set the derivatives to zero:

\frac{\partial S}{\partial \alpha}\Big|_{\hat{\alpha}, \hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0 \quad (1)

\frac{\partial S}{\partial \beta}\Big|_{\hat{\alpha}, \hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) x_i = 0 \quad (2)
Result: The LS estimators of α and β are

\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}

Thus, the fitted simple linear regression model is

\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i

[Figure: scatterplot of Monthly Personal Income vs Age with the fitted line]
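A minimal R sketch reproducing these estimators by hand with the five-worker data, then checking against lm(); the numbers match the worked example on the next slides:

age    <- c(34, 45, 29, 32, 23)
income <- c(2950, 4000, 2430, 3000, 1790)
n <- length(age)
beta_hat  <- (sum(age * income) - n * mean(age) * mean(income)) /
             (sum(age^2) - n * mean(age)^2)          # 99.5329
alpha_hat <- mean(income) - beta_hat * mean(age)     # -410.7725
coef(lm(income ~ age))                               # built-in equivalent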
Re-visit the example:

Respondent  Age (x)  Income (y)  xy      x²
A           34       2950        100300  1156
B           45       4000        180000  2025
C           29       2430        70470   841
D           32       3000        96000   1024
E           23       1790        41170   529
Sum         163      14170       487940  5575

\sum x_i = 163, \quad \sum y_i = 14170, \quad \sum x_i y_i = 487940, \quad \sum x_i^2 = 5575, \quad \bar{x} = 32.6, \quad \bar{y} = 2834

\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{487940 - 5(32.6)(2834)}{5575 - 5(32.6)^2} = 99.53
\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 2834 - 99.53(32.6) = -410.77

Thus, the fitted SLR model is

\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i = -410.77 + 99.53 x_i

[Figure: scatterplot with the fitted line]
 Suppose you constructed the SLR model using n pairs of sample data (x1,y1), (x2,y2), …, (xn,yn) and the range of x is [a,b].

 Two types of forecasting:
   Extrapolation: predict the value of y using an x outside the range [a,b].
   Interpolation: predict the value of y using an x inside the range [a,b].
Re-visit the example: the observed ages run from 23 to 45, so the age range used is [23, 45].
At age 50 (x = 50), your predicted monthly income is

\hat{y}_i = -410.7725 + 99.5329(50) = 4565.87

[Figure: fitted line extended to x = 50, marking the predicted income RM4565.87]

Extrapolation!
At age 30 (x = 30), your predicted monthly income is

\hat{y}_i = -410.7725 + 99.5329(30) = 2575.21

[Figure: fitted line at x = 30, marking the predicted income RM2575.21]

Interpolation!
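Continuing the R sketch from the least-squares slide, predict() reproduces both forecasts; note that only x = 30 lies inside [23, 45] and is interpolation:

fit <- lm(income ~ age)   # age and income from the earlier sketch
predict(fit, newdata = data.frame(age = c(30, 50)))   # ~2575.21 and ~4565.87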
Simple linear regression: Interval estimation
Estimation of the Variance σ²
Method 1: based on several observations (replication) on y for at least one value of x.
Method 2: when prior information concerning σ² is available.
Method 3: estimate based on the residual or error sum of squares:

SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

\hat{\sigma}^2 = \frac{SS_{Res}}{n-2} = MS_{Res}

This unbiased estimator of σ² is called the Residual Mean Square, and its square root is called the standard error of regression.
\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i = -410.77 + 99.53 x_i

Respondent  Age (x)  Income (y)  Fitted ŷ   Residual e   e²
A           34       2950        2973.346   -23.3461     545.0404
B           45       4000        4068.208   -68.208      4652.331
C           29       2430        2475.682   -45.6816     2086.809
D           32       3000        2774.28    225.7197     50949.38
E           23       1790        1878.484   -88.4842     7829.454
Sum         163      14170                               66063.02

SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 66063.02

\hat{\sigma}^2 = MS_{Res} = \frac{SS_{Res}}{n-2} = \frac{66063.02}{3} = 22021.01
Interval Estimation in Simple Linear Regression:
If the errors are NID, then the 100(1 - α)% confidence intervals for β1, β0 and σ² are

\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1)

\hat{\beta}_0 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0)

\frac{(n-2)\, MS_{Res}}{\chi^2_{\alpha/2,\,n-2}} \le \sigma^2 \le \frac{(n-2)\, MS_{Res}}{\chi^2_{1-\alpha/2,\,n-2}}
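In R the coefficient intervals come from confint(); a sketch continuing the five-worker example:

fit <- lm(income ~ age)        # data from the earlier sketch
confint(fit, level = 0.95)     # 95% CIs for the intercept and slope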
Multivariate linear regression
The sample multiple linear regression model with k regressor or predictor variables is

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \dots, n

where the βj are called the (partial) regression coefficients; βj represents the expected change in the response y per unit change in xj when all the remaining regressor variables are held constant.

Remarks:
1) The term linear is used because the model is a linear function of the unknown parameters β0, β1, …, βk.
2) It is often used as an empirical model or approximating function (the true functional relationship is unknown).
Examples of multiple linear regression models:

1) Polynomial models: y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + \varepsilon

2) Models with interaction effects: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \varepsilon

3) Second-order model with interaction: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon

How do we convert these models to the general multiple linear regression form?
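Because each example is linear in the parameters, all three can be fitted with lm() by defining new regressors. A sketch with made-up data (the variable names and values are illustrative, not from the slides):

set.seed(42)
x1 <- runif(20); x2 <- runif(20)
y  <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(20, sd = 0.1)
lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))   # second-order model with interaction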
Matrix form of the multiple linear regression model

It is more convenient to deal with the multiple linear regression model in matrix form:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \dots, n

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}

where

\mathbf{y} = (y_1, y_2, \dots, y_n)', \quad \boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_k)', \quad \boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)'

\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix}

y and ε are vectors of size n×1, β is a vector of size p×1 where p = k + 1, and X is an n×p matrix.
Estimation of the Model Parameters

A) Least Squares Estimation (LSE) of the Regression Coefficients

S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{y}'\mathbf{y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}

Differentiating the least squares criterion with respect to β and setting it to zero yields

\frac{\partial S}{\partial \boldsymbol{\beta}}\Big|_{\hat{\boldsymbol{\beta}}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}

Thus, the least-squares estimator of β is

\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
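A sketch computing the normal-equations solution directly and checking it against lm(), with simulated data (illustrative, not from the slides):

set.seed(1)
X <- cbind(1, x1 = runif(20), x2 = runif(20))        # n x p design matrix
y <- drop(X %*% c(2, 3, -1)) + rnorm(20, sd = 0.2)
solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
coef(lm(y ~ X[, -1]))           # same estimates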
Selection of Variables: Hypothesis testing
Hypothesis Testing on the Parameters

Assumption: the errors ε_i are normally distributed, \varepsilon_i \sim NID(0, \sigma^2).

This implies that y_i \sim NID(\beta_0 + \beta_1 x_i, \sigma^2).

Since \hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \sim NID\!\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right),

to test the hypothesis that the slope equals a constant, we have

H_0: \beta_1 = \beta_{10}
H_1: \beta_1 \neq \beta_{10}
and the test statistic is

Z_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1)

Typically σ² is unknown and the unbiased estimator MS_{Res} is used. Then, the test statistic becomes

t_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{MS_{Res} / S_{xx}}} \sim t_{\alpha/2,\,n-2}

We reject the null hypothesis if |t_0| > t_{\alpha/2,\,n-2}.
se(\hat{\beta}_1) = \sqrt{\frac{MS_{Res}}{S_{xx}}}

is called the (estimated) standard error.

To test the hypothesis that the intercept equals a constant, we have

H_0: \beta_0 = \beta_{00}
H_1: \beta_0 \neq \beta_{00}
and the test statistic is

t_0 = \frac{\hat{\beta}_0 - \beta_{00}}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MS_{Res}\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right)}} \sim t_{\alpha/2,\,n-2}

We reject the null hypothesis if |t_0| > t_{\alpha/2,\,n-2}.
Is the model significant?

Checking the appropriateness of the model
Special Case: Testing Significance of Regression

Testing: (i) t-statistic, or (ii) analysis of variance (ANOVA).

1) Accepting the null hypothesis means there is no linear relationship between x and y: either (i) x is of little value in explaining the variation of y, or (ii) the true relationship between x and y is not linear.
2) Rejecting the null hypothesis means that x is of value in explaining the variability of y: either the straight-line model is adequate, or better results could be obtained by adding higher-order polynomial terms in x.
ANOVA : Analysis of Variance
 The reason for doing an ANOVA is to see if there is
any difference between groups on some variable. For
example, you might have data on student performance
in non-assessed tutorial exercises as well as their final
grading. You are interested in seeing if tutorial
performance is related to final grade. ANOVA allows
you to break up the group according to the grade and
then see if performance is different across these
grades. ANOVA is available for both parametric (score
data) and non-parametric (ranking/ordering) data.

ANOVA : Analysis of Variance
 A typical ANOVA table looks like:

Source            DF     Sum of Squares  Mean Square  F*
Regression        k      SS_R            MS_R         F = MS_R / MS_E
Error             n-k-1  SS_E            MS_E
Corrected Total   n-1    SS_T

SS_E = \mathbf{e}'\mathbf{e} = \sum (y_i - \hat{y}_i)^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}
H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0
H_1: \beta_j \neq 0 \text{ for at least one } j

Test statistic:

F = \frac{MS_R}{MS_E} \sim F_{k,\,n-k-1}; \quad \text{reject } H_0 \text{ when } F_0 > F_{\alpha;\,k,\,n-k-1}

Rejecting the null hypothesis implies that at least one of the explanatory variables contributes significantly to the model.
Coefficient of Determination:

R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}

where SS_{Res} is the residual or error sum of squares, SS_R is the regression or model sum of squares, and SS_T is a measure of the variability in y without considering the effect of the regressor variables x.

SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad SS_T = SS_R + SS_{Res}

R² measures the proportion of variation explained by the regressor x.
B) Assessing the overall adequacy using R² and adjusted R²

R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}, \qquad R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)}

In general, R² always increases when a new regressor is added to the model, regardless of its contribution to the model.

Since SS_{Res}/(n-p) is the residual mean square and SS_T/(n-1) is constant regardless of how many variables are in the model, the adjusted R² will only increase on adding a variable to the model if the addition of the variable reduces the residual mean square.
STANDARDIZED REGRESSION COEFFICIENTS

Sometimes it is difficult to compare regression coefficients directly because the magnitudes of the βs differ. We use standardized regression coefficients to scale the variables and produce dimensionless coefficients.

A) Unit Normal Scaling

z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \dots, n; \; j = 1, 2, \dots, k, \qquad s_j^2 = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n-1}

y_i^* = \frac{y_i - \bar{y}}{s_y}, \quad i = 1, 2, \dots, n, \qquad s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}
Using these new variables, the regression model becomes

y_i^* = b_1 z_{i1} + b_2 z_{i2} + \dots + b_k z_{ik} + \varepsilon_i, \quad i = 1, 2, \dots, n

The LSE of b_0 is 0, and \hat{\mathbf{b}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}^*.

B) Unit Length Scaling

w_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{S_{jj}}}, \quad i = 1, 2, \dots, n; \; j = 1, 2, \dots, k, \qquad S_{jj} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2

y_i^0 = \frac{y_i - \bar{y}}{\sqrt{SS_T}}, \quad i = 1, 2, \dots, n
MULTICOLLINEARITY (Near-Linear Dependence)

Multicollinearity may dramatically impact the usefulness of a regression model.

The regressors are the columns of the X matrix, so exact linear dependence would result in a singular X'X.

The variance inflation factor (VIF) is an important multicollinearity diagnostic. It takes the main diagonal elements of (X'X)^{-1} in correlation form, or of (W'W)^{-1}.

The VIF for the jth regression coefficient can be written as

VIF_j = \frac{1}{1 - R_j^2}

where R_j² is the coefficient of multiple determination obtained from regressing x_j on the other regressor variables.
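A sketch computing VIF_j from this definition by regressing each x_j on the others; x2 is built to be nearly collinear with x1, so both should show inflated VIFs (simulated data, and the vif_j helper is ours):

set.seed(2)
x1 <- rnorm(50); x2 <- x1 + rnorm(50, sd = 0.3); x3 <- rnorm(50)
X  <- cbind(x1, x2, x3)
vif_j <- function(j) 1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared)
sapply(1:3, vif_j)   # VIF for each regressor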
Some Considerations in the Use of Regression:
1) Regression is intended as an interpolation model over the range of the regressor variables. We must be careful if we extrapolate outside of this range.
2) The disposition of the x values plays an important role in the least squares fit.
3) Outliers can seriously disturb the least squares fit.

4) When a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in any causal sense (cause and effect).

5) In some applications, the value of the regressor variable x required to predict y is unknown. Thus, to predict y, we must first predict x. The accuracy of the prediction of y then depends on the accuracy of the prediction of x.
Modelling Process

Variable Identification:
• Identify response variable
• Identify explanatory variables
• Variable categorization (numeric, categorical, discrete, continuous etc.)
• Create data dictionary

Explore response variable:
• Distribution analysis
• Outliers treatment
• Multicollinearity

Explore independent variables:
• Identify prospective explanatory variables that can explain the response variable
• Bivariate analysis between response and explanatory variables
• Variable treatment (transformation, grouping etc.)
• Heteroskedasticity
• Data cleaning
Modelling Process

Fitting model:
• Select appropriate model (linear, nonlinear, interaction etc.)
• Fit model
• Model (variable) selection
• Checking assumptions (normality, constant variance etc.)

Performance analysis:
• Analyse results
• Compare models
• Performance assessment (ANOVA, R squared, etc.)

Validation and Execution:
• Validate the model using new observations (accuracy, specification etc.)
• Repeat the process if needed
• Execute the model
Example: Predicting House Selling Price
 Open the file "House_Price.xlsx"
 Conduct analysis:
   Fit the models
   Construct ANOVA
   Construct CI
   Model selection
Using Excel for Linear Regression

[Screenshot: Excel regression setup, with numbered steps 1-5]
Output

[Screenshot: Excel regression output]

The fitted model explained 74% of the total variability in house price. The model is reasonably good and no important factor is excluded.

The fitted model is appropriate.

Which factors are significant (p-value < alpha)? Luas_Lot, TENURE, Luas, CBD distance, Shopping Mall distance.
Checking Constant Error Variance; Checking Normality

[Figures: residual plot of standard residuals vs predicted price, and normal probability plot of PRICE vs sample percentile]

The error variance is non-linear (curved pattern) and not constant (an outward-opening funnel). House Price has a heavy-tailed distribution.
> library(readxl)
> Property=read_excel("Property.xlsx",sheet=2)
> LRmodel=lm(formula=Property$Price~Property$Area+Property$Bath+Property$Floor+Property$Bedroom)
> summary(LRmodel)

Call: lm(formula = Property$Price ~ Property$Area + Property$Bath + Property$Floor + Property$Bedroom)

Residuals:
     Min       1Q   Median       3Q      Max
 -12.700   -1.616    0.984    2.510   11.759

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        18.7633     9.2074   2.038  0.06889 .
Property$Area       6.2698     0.7252   8.645 5.93e-06 ***
Property$Bath      30.2705     6.8487   4.420  0.00129 **
Property$Floor    -16.2033     6.2121  -2.608  0.02611 *
Property$Bedroom   -2.6730     4.4939  -0.595  0.56519
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.849 on 10 degrees of freedom
Multiple R-squared: 0.9714, Adjusted R-squared: 0.9599
F-statistic: 84.8 on 4 and 10 DF, p-value: 1.128e-07


> anv=anova(LRmodel)

Analysis of Variance Table


Response: Property$Price
Df Sum Sq Mean Sq F value Pr(>F)
Property$Area 1 14829.3 14829.3 316.1025 6.76e-09 ***
Property$Bath 1 750.8 750.8 16.0046 0.002516 **
Property$Floor 1 316.3 316.3 6.7428 0.026642 *
Property$Bedroom 1 16.6 16.6 0.3538 0.565189
Residuals 10 469.1 46.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction terms:
> LRmodel2=lm(formula=Property$Price~Property$Area+Property$Bath+Property$Floor+Property$Bedroom+I(Property$Area*Property$Floor))
> summary(LRmodel2)


Logistic Regression: Classification of Binary Outputs
Classification

Email: Spam / Not Spam?


Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign ?
Customer: Buy / Don’t buy ?

0: “Negative Class” (e.g., don’t buy)


1: “Positive Class” (e.g., buy)
Why Logistic Regression, not Linear Regression?

[Figure: Buy? (1 = Yes, 0 = No) plotted against Income, with a straight-line fit; a linear fit can produce outputs below 0 or above 1]

Threshold the classifier output h(x) at 0.5:
If h(x) ≥ 0.5, predict "y = 1"
If h(x) < 0.5, predict "y = 0"
Logistic Regression Model

We want 0 ≤ h(x) ≤ 1, so we pass the linear combination through the sigmoid (logistic) function:

h(x) = \frac{1}{1 + e^{-\beta x}}

[Figure: the sigmoid function, rising from 0 towards 1 and passing through 0.5 at βx = 0]
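A two-line R sketch of the sigmoid used as the hypothesis:

sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(0)                          # 0.5 at z = 0
curve(sigmoid, from = -6, to = 6)   # S-shaped curve between 0 and 1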
Interpretation of Hypothesis Output

h(x) = estimated probability that y = 1 on input x.

Example: If h(x) = 0.7, tell the manager that there is a 70% chance the customer will buy.

Equivalently, h(x) = P(y = 1 | x; β), the "probability that y = 1, given x, parameterized by β".
Logistic regression:
Suppose we predict "y = 1" if h(x) ≥ 0.5, which happens when βx ≥ 0,
and predict "y = 0" if h(x) < 0.5, which happens when βx < 0.
Decision Boundary

[Figure: two classes in the (x1, x2) plane separated by a straight line; we predict "y = 1" on one side of this linear boundary]
Non-linear decision boundaries

[Figure: two classes in the (x1, x2) plane separated by a circle of radius 1; we predict "y = 1" outside this non-linear boundary]
Method 1:

Given the age and income of a customer, we want to predict the buying decision of this customer.

Simply assign starting values to the coefficients: intercept = 1, beta1 = 0 = beta2.

In Excel: =$A$15+A2*$B$15+B2*$C$15

The product of probabilities is H2 x H3 x ... x H11 = PRODUCT(H2:H11).

Our objective is to find the optimum parameters (intercept, beta1, beta2) that maximize the log likelihood (towards zero). This objective can be achieved using the Solver function in Excel!

[Screenshots: Solver setup, steps 1-7. Beware, you will get an error message in this example!]
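The same maximum-likelihood idea in R, with optim() playing the role of Solver; the data are made-up and the negloglik helper is ours, with glm() as the reference answer:

set.seed(3)
age    <- runif(30, 20, 60)
income <- runif(30, 2, 10)
buy    <- rbinom(30, 1, 0.5)
negloglik <- function(b) {          # negative log-likelihood of the logistic model
  p <- 1 / (1 + exp(-(b[1] + b[2] * age + b[3] * income)))
  -sum(buy * log(p) + (1 - buy) * log(1 - p))
}
optim(c(1, 0, 0), negloglik)$par                   # (intercept, beta1, beta2)
coef(glm(buy ~ age + income, family = binomial))   # reference fit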
Method 2:

[Screenshots: logistic regression fitted with statistical software, showing coefficient estimates and log-likelihoods]
An alternative way to look at this is that the difference between LL1 (the log-likelihood of the current model with Age and Income) and LL0 (the log-likelihood of the initial constant model, where only the intercept is used) is not significant.

The Chi-square goodness-of-fit test also indicates that the model with Age and Income is not significant.
ROC and AUC
The receiver operating characteristic (ROC) curve is constructed by plotting the probability of detecting the signal of interest (sensitivity) against the probability of getting a false signal (1 - specificity) for a range of possible cutoff values. The area under the ROC curve (AUC) lies between zero and one and measures the ability of the model to discriminate between observations that will lead to the response of interest and those that will not.

The model with Age and Income has an acceptable discrimination of buying or not buying, with an overall AUC of 0.72.

You may try different cutoff values to compare the AUC values.


Logistic Regression in R
# Change working directory
> setwd("C:/Users/user/Desktop/DataScieneUUM/Rcode/")

# Call library and read the Excel file "CustomerChurn.xlsx"
> library(readxl)
> CustomerChurn=read_excel("CustomerChurn.xlsx")

# Split data into a training set and a testing set
> sample_size=floor(0.7*nrow(CustomerChurn))
> set.seed(123)
> bahagi=sample(seq_len(nrow(CustomerChurn)),size=sample_size)
> train=CustomerChurn[bahagi, ]
> test=CustomerChurn[-bahagi, ]


Construct Logistic Regression
# Use bare column names with the data argument; the CustomerChurn$... form
# would ignore data = train and fit the model on the full data set.
> model = glm(Churn ~ Age + Sex + Payment, data = train, family = binomial)
> summary(model)


Confusion Matrix
# type = "response" returns predicted probabilities, so the 0.5 cutoff
# below applies on the probability scale
> forecast=predict.glm(model,type="response")
> table(CustomerChurn$Churn[1:999],forecast>0.5)

    FALSE TRUE
  0   425   78
  1   219  277

ROC Curve
> library(ROCR)
> ROCRpred=prediction(forecast,CustomerChurn$Churn[1:999])
> ROCRperf=performance(ROCRpred,'tpr','fpr')
> plot(ROCRperf,colorize=TRUE,text.adj=c(-0.2,1.7))
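The AUC can be read from the same ROCR objects; a sketch assuming ROCRpred from the code above:

auc <- performance(ROCRpred, "auc")
auc@y.values[[1]]   # area under the ROC curve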
