Data Science Unit-3
UNIT – III
13.12.2022
Predictive Modeling
Linear Regression
Simple Linear Regression Model Building
Multiple Linear Regression
Logistic Regression
Correlation
Preliminaries - measures of correlation
o n observations for x and y variables (xi, yi)
o Sample means x̄, ȳ
  x̄ = Σ xi / n
  ȳ = Σ yi / n
o Sample variances Sxx, Syy
  Sxx = (1/n) Σ (xi – x̄)²
  Syy = (1/n) Σ (yi – ȳ)²
o Sample covariance Sxy
  Sxy = (1/n) Σ (xi – x̄)(yi – ȳ)
Correlation – the strength of association between two variables
Correlation does not imply causation
Visual representation of correlation: scatter plots (scattergrams)
Pearson’s Correlation
n observations for x and y variables (xi, yi)
Pearson’s product-moment correlation coefficient: rxy = Sxy / √(Sxx · Syy)
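As a sketch, the coefficient can be computed in R directly from the sample moments defined above (the data values here are hypothetical, purely for illustration):

    # Pearson's r from the sample moments; data are illustrative
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
    n   <- length(x)
    Sxx <- sum((x - mean(x))^2) / n
    Syy <- sum((y - mean(y))^2) / n
    Sxy <- sum((x - mean(x)) * (y - mean(y))) / n
    rxy <- Sxy / sqrt(Sxx * Syy)
    rxy            # identical to cor(x, y): the 1/n factors cancel in the ratio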
Anscombe’s Data
Example – the famous Anscombe’s quartet: four data sets of 11 points each that share nearly identical summary statistics but look very different when plotted
https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Example – Nonlinear
o x = 125 equally spaced values in [0, 2π]
o y = cos(x)
o rxy = –0.0536
Example – Nonlinear
o x = 0 : 0.5 : 20
o y = x²
o rxy = 0.967
Example – Nonlinear
o x = –10 : 0.5 : 10
o y = x²
o rxy = 0.0 (the range is symmetric about 0, so the quadratic relationship is invisible to Pearson’s r)
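The three examples above can be reproduced in R; a minimal sketch, assuming the spacings stated in the notes:

    x1 <- seq(0, 2*pi, length.out = 125)   # 125 equally spaced values in [0, 2*pi]
    cor(x1, cos(x1))                       # close to 0 (-0.0536 in the notes)
    x2 <- seq(0, 20, by = 0.5)
    cor(x2, x2^2)                          # about 0.967
    x3 <- seq(-10, 10, by = 0.5)
    cor(x3, x3^2)                          # essentially 0: symmetric range hides the quadratic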
If there exists a linear relationship between y and x, then Pearson’s correlation coefficient will be close to either 1 or –1.
On the other hand, if it is close to 0 you cannot dismiss a relationship between y and x – the relationship may simply be nonlinear (as in the cosine example).
Similarly, if the value is high you cannot conclude, from the value alone, that there definitely exists a linear relationship between y and x; you can only say that there exists some relationship (as in the y = x² example on [0, 20]).
https://www.socscistatistics.com/tests/pearson/default2.aspx
Values of y = x² for x = 0 : 0.5 : 20 (split across two column pairs):

x     y = x²     |  x     y = x²
0     0          |
0.5   0.25       |  10.5  110.25
1     1          |  11    121
1.5   2.25       |  11.5  132.25
2     4          |  12    144
2.5   6.25       |  12.5  156.25
3     9          |  13    169
3.5   12.25      |  13.5  182.25
4     16         |  14    196
4.5   20.25      |  14.5  210.25
5     25         |  15    225
5.5   30.25      |  15.5  240.25
6     36         |  16    256
6.5   42.25      |  16.5  272.25
7     49         |  17    289
7.5   56.25      |  17.5  306.25
8     64         |  18    324
8.5   72.25      |  18.5  342.25
9     81         |  19    361
9.5   90.25      |  19.5  380.25
10    100        |  20    400

Using the calculator above, the value of r is 0.9668.
Spearman’s Rank Correlation
Measures the degree of association between two variables
Captures linear or nonlinear association
As x increases, y increases or decreases monotonically
(In the accompanying figures, the right-hand figure indicates a nonlinear relationship, while the left-hand figure indicates a linear one.)
This can be applied even to ordinal variables; it looks for the degree of association between two variables, where the relationship may be either linear or nonlinear
If, as x increases, y increases or decreases monotonically, then Spearman’s rank correlation will tend to be very high.
In Spearman’s rank correlation, convert the data to ranks, even if it is real-valued data.
Example – there are 10 data points; each individual value of x is assigned a rank.
For example, the lowest x value here is 2 and it is given rank 1; the next lowest x value is 3, which is given rank 2; and so on.
The sixth and the first values are tied: they would occupy ranks 6 and 7, so each is given the average of the two, a rank of 6.5.
Similarly, if more than two values are tied, take all the ranks they would occupy and average them over the number of data points with equal values.
Also rank the corresponding y values (for example, here the 10th value has rank 1, the eighth value has rank 2, and so on).
Compute the difference in ranks di for each pair, square it to get the di² values, sum over all observations, and compute the coefficient:
rs = 1 – (6 Σ di²) / (n(n² – 1))
This coefficient, rs takes a value between -1 (indicating a negative association)
and 1 (indicating a positive association between the variables)
In this case the Spearman rank correlation turns out to be 0.88.
rs = 0 means no association
rs = 1: monotonically increasing
rs = –1: monotonically decreasing
Can be used when association is nonlinear
Can be applied for ordinal variables
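A minimal sketch in R with hypothetical data; rank() assigns tied values their average rank, matching the tie-handling described above, and method = "spearman" is equivalent to Pearson’s r computed on the ranks:

    x <- c(2, 3, 5, 7, 8, 8, 9, 11, 12, 15)   # hypothetical; the two 8s are tied
    y <- c(1, 4, 6, 9, 11, 10, 13, 14, 16, 20)
    rank(x)                                   # tied values share the average rank
    cor(x, y, method = "spearman")            # Spearman's rs
    cor(rank(x), rank(y))                     # the same: Pearson's r on the ranks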
The difference between Pearson’s and Spearman’s is not only that the latter can be applied to ordinal variables: even if there is a nonlinear (but monotonic) relationship between y and x, the Spearman rank correlation can be high – it is not likely to be 0.
So it can be used to help distinguish the kind of relationship between y and x.
Applying it to the Anscombe data set indicates a really strong association between x and y
https://rogeriofvieira.com/wp-content/uploads/2016/04/Correlation-Interpreta%C3%A7%C3%A3o-bom.pdf
The Spearman’s rank correlation test (also called the Spearman’s rank correlation coefficient) is generally used to look at the (monotonic) relationship between two ordinal variables, e.g. satisfaction ratings for staff and level of staff training.
Never run this test without viewing a scatterplot and visually examining the basic
shape of the relationship
The test could indicate a low linear correlation and yet the data could have a very
strong and clear non-linear pattern e.g. a U shape.
The other thing to look for is outliers. The correlation coefficient could also be
very high when the relationship is monotonic but not linear.
Kendall’s Rank Correlation (τ)
Given rankings of the same items by two experts, classify every pair of items as either concordant or discordant; in this example there are 6 discordant pairs and 15 concordant pairs (21 pairs in all, i.e. 7 items).
Compute Kendall’s τ coefficient, τ = (C – D) / (n(n – 1)/2); if this is high, there is broad agreement between the two experts. Here τ = (15 – 6)/21 ≈ 0.43.
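A sketch in R: the rankings below are hypothetical, chosen so that they produce exactly 15 concordant and 6 discordant pairs among n = 7 items, reproducing τ = (15 – 6)/21 ≈ 0.43:

    expert1 <- c(1, 2, 3, 4, 5, 6, 7)          # hypothetical rankings by expert 1
    expert2 <- c(4, 3, 2, 1, 5, 6, 7)          # expert 2 reverses the first four items
    cor(expert1, expert2, method = "kendall")  # (15 - 6) / 21 = 0.4286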
This indicates that y and x are associated with each other, and that the association is strong.
If expert 2 completely disagrees with expert 1, the coefficient can even take negative values.
This can be used for ordinal variables.
It can be used to get a preliminary idea before building the model.
By computing these correlation coefficients, a preliminary assessment of the type of association that is likely to be present can be obtained, and then the model to be built is selected accordingly.
17.12.2022
Linear Regression
Regression types
o Classification of regression analysis
o Univariate vs Multivariate
Univariate – one dependent and one independent variable
Multivariate – multiple independent and multiple dependent
variables
o Linear vs Nonlinear
Linear – relationship is linear between dependent and independent
variables.
Nonlinear – relationship is Nonlinear between dependent and
independent variables.
o Simple vs Multiple
Simple – one dependent and one independent variable (SISO)
Multiple – one dependent and many independent variables (MISO)
Regression analysis
o Is there a relationship between these variables?
o Is the relationship linear and how strong is the relationship?
o How accurately can we estimate the relationship?
o How good is the model for prediction purposes?
Regression Methods
o Linear regression methods
1. Simple linear regression
2. Multiple linear regression
3. Ridge regression
4. Principal component regression
5. Lasso
6. Partial least squares
o Nonlinear regression methods
1. Polynomial regression
2. Spline regression
3. Neural networks
The choice of method depends on the kinds of assumptions made and the kind of problem
Regression process (iterative in nature)
Whether the assumptions made in developing the model are acceptable or not is checked using residual analysis or residual plots
Bad data points might affect the quality of the model
Sensitivity analysis – how much a small error in the data affects the response variable
Data used in building the model is called the training data / training data set
Testing phase of the model - evaluating the fitted model using data which is
called test data
Test data is different from the training data
70 or 80 percent of the experimental data is used for training, i.e. for fitting the parameters
The remaining 20 or 30 percent of the experimental data is used to test the model
If a good model is not obtained, the set of variables chosen, the type of experiments conducted, or the experimental data may need to be changed
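A minimal sketch of such a split in base R, assuming the experimental data sits in a data frame df (the name and the split fraction are illustrative):

    set.seed(42)                                  # reproducible split
    idx   <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
    train <- df[idx, ]                            # ~80% used to fit the parameters
    test  <- df[-idx, ]                           # ~20% held out for testing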
Example – data set of 14 observations on a servicing problem involving service agents
o A service agent visits several houses, taking a certain amount of time to service or repair each unit
o The agent reports the total amount of time taken in minutes (time spent servicing different customers) and the number of units serviced in a day
o There are 14 such data points, from the same person or from multiple persons
o By looking at how much time is spent and how many units are repaired every day, we want to judge the performance of the service agent
o Either to reward the agent or to improve productivity, or to detect inefficiency
Ordinary least squares
o 14 observations obtained on time taken in minutes for service calls and
number of units repaired
o Objective is to find relationship between these variables (useful for
judging service agent performance)
The model is yi = β0 + β1xi + εi
β0 – the intercept term
β1 – the slope
Observations always contain some error; whatever the model is unable to explain is denoted by εi, called the modeling error
Ordinary least squares regression – there is assumed to be no error in the reported xi, while the dependent variable yi could contain error
The error is not a systematic error; it is a random error or modeling error
The independent variable must be accurate and error-free
The units and the minutes:
– The number of units repaired by a service agent will be reported exactly,
o because the agent will have a receipt from each customer stating that the unit was serviced;
o the total number of receipts the service agent has gathered precisely represents the number of units serviced.
– The amount of time taken could vary for several reasons:
o One, the agent reports the total time (from when they actually started out for the day until they returned to the office at the end of the day).
o And this could involve not just the service time, but also travel time and
depending on the location, the travel time could vary
o It could vary from time of day, depending on the traffic;
o It could also vary because of congestion or a particular event that has happened.
Since the minutes are only an approximation, choose units as the independent variable x (precise, with no error) and minutes as the dependent variable y.
When applying ordinary least squares, ensure that the independent variable is an extremely accurate measurement, whereas y could contain other factors or errors
β0 represents the value of y when x is 0
β1 represents the slope of this regression line
The slope and the intercept will be different depending on the values proposed for β0 and β1
Find out how much deviation there is between the observed value and the line
The observed value yi corresponds to this xi (which is 8 in the example); the vertical distance between yi and the line is called the estimated error, ei
Compute ei for every data point yi using the proposed parameters β0 and β1 and the values of the independent variable, for all the observations
Minimize the sum of squared errors, Σ ei²; minimizing with respect to β0 and β1 gives the closed-form estimates β̂1 = Sxy / Sxx and β̂0 = ȳ – β̂1x̄
Plug the value of xi into the estimated model, which uses the estimated parameters β̂0 and β̂1, and call the prediction ŷi; it is also an estimated quantity – for any given xi, the model estimates the corresponding yi.
Coefficient of determination, R²: R² = 1 – Σ(yi – ŷi)² / Σ(yi – ȳ)², i.e. 1 minus the squared differences between the observed and predicted values summed over all data points, divided by the total variation of y about its mean.
Values close to 0 indicate a poor fit; values close to 1 indicate a good fit
Look at other measures before concluding, e.g. adjusted R²
https://www.statisticshowto.com/probability-and-statistics/t-test/
https://youtu.be/fKZA5waOJ0U (for t-test)
The R command for fitting a linear model is just called lm; minutes is indicated as the dependent variable and units as the independent variable – both are part of the data set
Loading of the data set
lm is the function used to build the model
First you get the range of residuals, the estimated values of εi for all 14 data points
(The max value, min value, first quartile, third quartile and the median are given)
From this it is concluded that a linear model may explain the relationship between x and y quite precisely
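A hedged sketch of this fit in R; the variable names and data values below are illustrative assumptions, not necessarily the exact data set used in the lecture:

    service <- data.frame(
      Units   = c(1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10),   # illustrative
      Minutes = c(23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166)
    )
    servicemod <- lm(Minutes ~ Units, data = service)   # Minutes = dependent variable
    summary(servicemod)   # residual quartiles, coefficients, R-squared, F-statistic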
19.12.2022
Simple Linear Regression Modeling
Loading data
Loading data – dataset ‘bonds’ is given in “.txt” format
To load data from the file the function used is read.delim()
read.delim() – reads a file in table format and creates a data frame from it
Syntax: read.delim(file, row.names=1)
o File – The name of the file which the data are to be read from.
Each row of the table appears as one line of the file.
o row.names – A vector of row names
This can be a vector giving the actual row names, or a single
number giving the column of the table which contains the row
names, or character string giving the name of the table column
containing the row names
Loading data – assuming that bonds.txt is in the current working directory
o bonds <- read.delim("bonds.txt", row.names = 1)
o The data is saved into a data frame ‘bonds’
Viewing data
View(bonds) will display the data frame in a tabular format
Description of dataset
The data has 2 variables – ‘Coupon Rate’ & ‘Bid Price’
‘Coupon Rate’ refers to the fixed interest rate that the issuer pays to the lender
‘Bid Price’ is the price someone is willing to pay for the bond
plot(bonds$CouponRate, bonds$BidPrice,
     main = "Bid Price vs Coupon Rate",
     xlab = "Coupon Rate",
     ylab = "Bid Price")
abline(bondsmod)
The function abline() takes as input ‘bondsmod’, which is the fitted linear model
“ab” here refers to the intercept and slope: if the equation is of the form y = a + bx, then ‘a’ is the intercept and ‘b’ is the slope.
a is β̂0 and b is β̂1
The regression line fits badly; it is pulled by the outliers rather than identifying them (regression lines are affected by outliers)
Model summary
bondsmod <- lm(BidPrice~CouponRate, data = bonds)
summary(bondsmod)
4 sections of output
o Call
o Residual
o Coefficients
o Heuristics (goodness-of-fit statistics: residual standard error, R-squared, F-statistic)
BidPrice - dependent variable
CouponRate - independent variable
Residuals are nothing but the differences between the observed and predicted values; εi corresponds to the residuals
Standard Error is the estimated standard deviation associated with the slope and the intercept
t value – the ratio of the estimate to its standard error; it is an important criterion for hypothesis testing
Null hypothesis: the estimate is equal to 0
The F-statistic is again used to test the null hypothesis, which is nothing but slope = 0
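Beyond summary(), a few standard R accessors can be used to inspect the fitted model (a sketch using the bondsmod object above):

    coef(bondsmod)      # the estimates: intercept (beta0-hat) and slope (beta1-hat)
    confint(bondsmod)   # 95% confidence intervals; an interval containing 0 suggests
                        # the corresponding coefficient is not significant
    resid(bondsmod)     # the estimated residuals
    fitted(bondsmod)    # the predicted (fitted) values y-hat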
27.12.2022
Multiple Linear Regression
Multiple linear regression consists of one dependent variable but several independent variables.
Dependent variable which is denoted by y and several independent variables
which are denoted by the symbols xj, where j = 1 to p.
There are p independent variables which we believe affect the dependent variable.
Dependent variable (y) depends on p independent variables xj, j = 1, 2,…., p
General linear model: y = β0 + β1x1 + β2x2 + …… + βpxp + ε
For the ith observation: yi = β0 + β1x1,i + β2x2,i + …… + βpxp,i + εi
Objective – using ‘n’ observations, estimate the regression coefficients
β0 – the intercept; β1, β2, …, βp – the slope parameters, i.e. the effects of the individual independent variables on the dependent variable.
The error is due to error in the measurement of the dependent variable
In ordinary least squares (OLS), the assumption is that the independent variables are measured perfectly, without any error,
whereas the dependent variable may contain some error, and that error is indicated as ε.
Assume that a small number n of samples is obtained.
The aim is to find the best estimates of β0, β1, β2, …, βp using these n sample measurements of the x’s and the corresponding y.
This is called multiple linear regression because a linear model is being fitted and there are many independent variables.
Approach similar to simple regression
o Minimize the sum of squares of the errors
Vector and matrix notations
o S(β) = εᵀε = (y – Xβ)ᵀ(y – Xβ)
In order to find the best estimates of the parameters β0 to βp, set up the
minimization of the sum squared of errors using vectors and matrices
Define the vector y, which consists of all the n measurements of the dependent
variable y1 to yn
Subtract the mean value of all these measurements from each of the
observations
The first element represents the first sample value of the dependent variable, y1, minus the mean value of y over all the measurements, ȳ.
So, the first element is the mean-shifted value of the first observation
The 2nd value in this vector is the second sample value minus the mean value of the dependent variable, and so on for all n observations
So, these are the mean-shifted values of all n samples of the dependent variable
Similarly construct a matrix X where the first column corresponds to independent variable 1
Take the sample value of the first independent variable and subtract the mean
value of the first independent variable
Repeat this for all n measurements of the first independent variable and for all p
independent variables
Matrix x will be of size n x p (n – no. of rows, p – no. of columns)
Each row represents a sample, each column represents a variable
The 1st column represents the 1st independent variable
The last column represents the pth independent variable
Represent all the coefficients β except β0 in vector form, β1 to βp, as a column vector:
β1 is the first coefficient, βp is the coefficient corresponding to the pth variable
β is a p × 1 vector
ε, the noise vector: ε1 to εn, corresponding to all n observations
Write the linear model in the form y = Xβ + ε
β0 is not included; the linear model only involves the slope parameters β1 to βp, because β0 has effectively been removed from the model by the mean-subtraction idea
So the linear model can be written compactly as y = Xβ + ε
Assumption about the error: it has zero mean, E[ε] = 0, i.e. ε is a random vector with mean 0
The variance-covariance matrix of ε is assumed to be σ²I
σ²I means that all the errors ε1 to εn have the same variance σ² – the homoscedastic assumption (an assumption of equal or similar variances in different groups being compared)
εi and εj are uncorrelated for i ≠ j, in which case the covariance matrix of ε is written as σ²I
Under this assumption find the estimates of β so as to minimize the sum square
of the errors.
SSE:
o S(β) = εᵀε = (y – Xβ)ᵀ(y – Xβ)
o Minimizing S(β) with respect to β gives the normal equations XᵀXβ̂ = Xᵀy, so β̂ = (XᵀX)⁻¹Xᵀy
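The normal-equation solution can be checked numerically in R; a minimal sketch with simulated, mean-centred data (all values hypothetical):

    set.seed(1)
    n <- 50; p <- 3
    X <- matrix(rnorm(n * p), n, p)
    X <- scale(X, center = TRUE, scale = FALSE)   # mean-centre each column
    beta_true <- c(2, -1, 0.5)
    y <- X %*% beta_true + rnorm(n, sd = 0.1)
    y <- y - mean(y)                              # mean-centre y, as in the notes
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # solves X'X beta = X'y
    beta_hat                                      # close to beta_true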
Sum of Squares Total, Sum of Squares Regression and Sum of Squares Error
https://365datascience.com/tutorials/statistics-tutorials/sum-squares/
SST – total sum of squares, Σ(yi – ȳ)²: the total variability of the dependent variable.
SSR – regression sum of squares, Σ(ŷi – ȳ)²: the sum of squared differences between the predicted values and the mean of the dependent variable.
SSE – error sum of squares, Σ(yi – ŷi)². We usually want to minimize this error: the smaller the error, the better the estimation power of the regression. It is also known as RSS or residual sum of squares (“residual” as in remaining or unexplained).
The rationale is the following: the total variability of the data set is equal to the variability explained by the regression line plus the unexplained variability, known as error: SST = SSR + SSE.
Given a constant total variability, a lower error yields a better regression. Conversely, a higher error yields a less powerful regression.
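The decomposition can be verified for any fitted lm object; a sketch, assuming fit is a linear model with an intercept (e.g. bondsmod above):

    y   <- fitted(fit) + resid(fit)           # reconstructs the observed responses
    SST <- sum((y - mean(y))^2)               # total sum of squares
    SSR <- sum((fitted(fit) - mean(y))^2)     # explained by the regression
    SSE <- sum(resid(fit)^2)                  # unexplained (residual) sum of squares
    all.equal(SST, SSR + SSE)                 # TRUE for models with an intercept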
28.12.2022
Is the fitted model adequate, or can it be reduced further?
o Test significance of individual coefficient ß hat
o A general unified test on the full model (FM) vs the reduced model (RM)
Hypothesis testing
o H0: Reduced Model is Adequate
o H1: Full Model is Adequate
Check R²; if the value is close to 1, a linear model may be good, but that is not a confirmatory test – also do the residual plots
In the univariate case there is only one independent variable, but here there are
several independent variables.
Maybe not all independent variables have an effect on y. Some of the independent
variables may be irrelevant.
One way of finding out whether a particular independent variable has an effect is to test the corresponding coefficient: if the confidence interval contains 0, the corresponding independent variable does not have a significant effect on the dependent variable and can be dropped
F test - whether the full model is better than the reduced model
The reduced model contains no independent variables whereas, the full model can
contain all or some of the independent variables.
Testing two models: RM with k parameters vs FM with p + 1 parameters
F-statistic:
F = [ (SSE(RM) – SSE(FM)) / (p + 1 – k) ] / [ SSE(FM) / (n – p – 1) ]
A large F (small p-value) means the full model explains significantly more variation, so H0 (the reduced model is adequate) is rejected.
Example – a restaurant data set: price as the dependent variable, with food, décor, service, etc. as the independent variables.
Food vs décor – no correlation
Food vs service – strong correlation
Apply the R function lm to this data set and examine the output.
The output from R shows that the intercept term is –24.02 and the slope parameters are: the coefficient multiplying food is 1.5, the coefficient multiplying décor is 1.9, and so on.
Residual analysis – checking the model assumptions:
o Normality of the errors
o Homoscedastic vs heteroscedastic errors
The standardized residuals are plotted against the predicted (fitted) price values
Since ŷi is a single variable, only one plot is generated; the red lines mark the confidence interval for the standardized residuals, and anything outside this interval indicates outliers
Observations 56, 48, 30 & 109 may be possible outliers, and there is no pattern in the standardized residuals
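A sketch of such a plot in R, assuming a fitted lm object called fit; rstandard() returns the standardized residuals, and the ±2 band is a common rough cut-off, not necessarily the exact interval used in the lecture:

    sres <- rstandard(fit)                        # standardized residuals
    plot(fitted(fit), sres,
         xlab = "Fitted values", ylab = "Standardized residuals",
         main = "Standardized residuals vs fitted values")
    abline(h = c(-2, 2), col = "red", lty = 2)    # points outside the band are
                                                  # candidate outliers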
29.12.2022
Logistic regression
Introduction
Logistic regression is a classification technique
Decision boundary (generally linear) derived based on probability interpretation
o Results in a nonlinear optimization problem for parameter estimation
Goal: given a new data point, predict the class from which the data point is likely to have originated
Input features
Input features can be both qualitative and quantitative
If the inputs are qualitative, then there has to be a systematic way of converting
them to quantities
For example – a binary input like “Yes” or “No” can be encoded as “1” or “0”
Some data analytics approaches can handle qualitative variables directly
Data is expressed in many attributes X1 to Xn (also called as input features)
And these input features could be quantitative, or qualitative
Quantitative features can be used as they are
While using a quantitative technique with input features that are qualitative, there has to be some way of converting the qualitative features into quantitative values.
Example – data point [yes, 0.1, 0.3], and another data point [no, 0.05,-2] etc.
There are some data analytics approaches that can directly handle these qualitative
features without a need to convert them into numbers and so on.
Decision function is linear
Binary classification can be performed depending on the side of the half-plane that
the data falls in
Guessing “yes” or “no” is pretty crude
Can we do better using probabilities?
Assume that all the circular data belong to one category and all the starred data
belong to another category.
Hyperplane – one side of the hyperplane is a half-space, and the other side is another half-space
The decision function takes a positive value on one side of the hyperplane and a negative value on the other
Data is inherently noisy – the closer a point is to the line, the greater the uncertainty about whether it belongs to class 0 or class 1
A data point whose true value lies on one side could, because of noise, slip to the other side.
The probability, or the confidence, with which a particular class is decided should intuitively come down near the boundary
Logistic regression addresses this
Logistic regression addresses this
Sigmoid function
Make p(x) a sigmoid function of x: p(x) = 1 / (1 + e^–(β0 + β1x1 + … + βpxp))
Estimation of parameters
We find the parameters in such a way that plugging them into the model equation gives the best possible classification for the inputs from both classes
This can be formalized by maximizing the following likelihood function: L(β) = Πi p(xi)^yi (1 – p(xi))^(1–yi), where yi ∈ {0, 1} is the class label of the ith point
Log-likelihood function: ℓ(β) = Σi [ yi log p(xi) + (1 – yi) log(1 – p(xi)) ]
Simplifying this expression and using the definition of p(x) results in an expression in the parameters of the linear decision boundary
The parameters can then be estimated by maximizing this expression using any nonlinear optimization solver.
Solving this yields the hyperplane
A two-dimensional problem will have 3 parameters
For an n-dimensional problem, n + 1 decision variables are identified through this optimization
For any new data point, plugging it into the sigmoid function p(x) gives the probability that it belongs to class 0 or class 1.
This is the basic idea of logistic regression.
(Table: data points for Class 0, Class 1, and the test data.)
We have data for class 0 and data for class 1; clearly this is a 2-dimensional problem, so the hyperplane is going to be a line.
A line will separate the classes; these kinds of classification problems are called supervised classification problems, because all of the data is labeled
Given new data, called the test data, the question is: which class does this test data belong to, class 0 or class 1?
Fraud detection – there are lots of records of fraudulent credit card use, and all of those instances can be described by certain attributes:
o the time of day the transaction was made, the place where the person lives, the type of credit card transfer or use, and many other attributes
There are lots of records of normal credit card use and some records of fraudulent use
A classifier could estimate, for a new transaction being initiated, the likelihood of the transaction being fraudulent
Task – fill the last column with 0 or 1 (the data belongs to class 0 or class 1)
The same data shown in the table is plotted
Logistic regression is used to solve the problem
Results
Input features: X1, X2
Classes: 0, 1
Parameters:
  β0  = –42.5487
  β11 = 2.95009
  β12 = 10.4012
Test results:

X1    X2    Probability    Class
1     3     0.0002         0
2     3     0.004          0
4     4     0.999          1
5     4     0.999          1
3     3     0.076          0
6     2     0.0172         0
9     2     0.991          1
8     1     0.0002         0
7     2     0.251          0
10    1     0.0667         0
Regularization
General Objective
Minimize the negative log-likelihood plus a penalty on the size of the parameters.
The objective becomes: minimize –ℓ(β) + λ‖β‖², where
‖β‖² = (β0 β11 β12)(β0 β11 β12)ᵀ = β0² + β11² + β12²
The larger the value of λ, the greater the regularization strength
Regularization helps the model work better on test data because it reduces overfitting to the training data
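A sketch of unregularized logistic regression in R using glm() with the binomial family (the data frame below is hypothetical; for the L2-regularized objective above one would typically use a package such as glmnet rather than plain glm):

    df <- data.frame(
      x1 = c(1, 2, 3, 2, 4, 6, 7, 8, 9, 7),   # hypothetical 2-D training data
      x2 = c(3, 3, 2, 2, 4, 4, 4, 1, 2, 3),
      y  = c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)    # class labels
    )
    fit <- glm(y ~ x1 + x2, data = df, family = binomial)
    coef(fit)                                 # beta0, beta11, beta12
    newpoint <- data.frame(x1 = 5, x2 = 4)
    predict(fit, newdata = newpoint, type = "response")   # probability of class 1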
*****