
PE 515 CS – DATA SCIENCE

UNIT – III
13.12.2022
 Predictive Modeling
 Linear Regression
 Simple Linear Regression Model Building
 Multiple Linear Regression
 Logistic Regression

Model building - building linear models using regression techniques


 Correlation
 Pearson’s correlation
 Kendall Rank Correlation
 Spearman Rank Correlation
 Regression
o Types of regression
o Fitting a function – Criterion for best fit
o Least squares
 Simple Regression
 Multiple Regression
 Model assessment and Validation
 Different types of correlation coefficients that have been defined
 Linear regression – basic notions of regression
 The case of 2 variables - multiple linear regression - where there are several input
variables and one dependent output variable
 After building the model, assess how well the model performs and how to validate
some of the assumptions made - model assessment and validation

Correlation
 Preliminaries - measures of correlation
o n observations for x and y variables (xi, yi)
o Sample means x̄, ȳ
 x̄ = (1/n) Σ xi
 ȳ = (1/n) Σ yi
o Sample variances Sxx, Syy
 Sxx = (1/n) Σ (xi – x̄)²
 Syy = (1/n) Σ (yi – ȳ)²
 Sample covariance Sxy
o Sxy = (1/n) Σ (xi – x̄)(yi – ȳ)
 Correlation – the strength of association between two variables
 Correlation does not imply causation
 Visual representation of correlation: Scatter grams

Pearson’s Correlation
 n observations for x and y variables (xi, yi)
 Pearson’s product-moment correlation coefficient: rxy = Sxy / √(Sxx Syy)
 rxy takes a value between -1 (negative correlation) & 1 (positive correlation)
 rxy = 0 means no correlation
 Normalization – dividing Sxy by √(Sxx Syy) keeps rxy within [-1, 1]

 A measure for the degree of linear dependence between x and y


 Cannot be applied to ordinal variables (i.e. ranked variables)
 Sample size: Moderate (20 – 30) for good estimate
 Robustness: Outliers can lead to misleading values
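
As a small illustration (not from the original notes; the numbers are made up), Pearson’s coefficient can be computed in R with the built-in cor() function, and a single outlier can visibly distort it:

set.seed(10)
x <- 1:10
y <- 2 * x + rnorm(10, sd = 0.5)       # roughly linear relationship
cor(x, y, method = "pearson")          # close to 1

y_out <- y
y_out[10] <- -20                       # introduce a single outlier
cor(x, y_out, method = "pearson")      # misleadingly low value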

What is the Anscombe dataset?


Anscombe's quartet comprises four data sets that have nearly identical simple
descriptive statistics, yet have very different distributions and appear very different
when graphed. Each dataset consists of eleven (x,y) points.

Anscombe’s Data

 Example - A very famous data set called Anscombe’s data set: there are four
datasets, each containing 11 data points

https://en.wikipedia.org/wiki/Anscombe%27s_quartet

 Example – Nonlinear

o x = 125 equally spaced values in [0, 2π]
o y = cos(x)
o rxy = -0.0536
 Example – Nonlinear
o x = 0 : 0.5 : 20
o y = x²
o rxy = 0.967
 Example – Nonlinear
o x = -10 : 0.5 : 10
o y = x²
o rxy = 0.0
 If there exists a linear relationship between y and x, then Pearson’s correlation
coefficient will be close to either 1 or -1.
 On the other hand, if it is close to 0, you cannot rule out a relationship between y
and x (the relationship may be nonlinear, as in the cosine example).
 Similarly, if the value is high, looking at just the value we cannot conclude that there
definitely exists a linear relationship between y and x; you can only say there
exists a relationship between y and x.
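
The three nonlinear examples above can be reproduced with a short R sketch (the cosine value matches only approximately, since it depends on the exact grid):

x1 <- seq(0, 2 * pi, length.out = 125)   # 125 equally spaced values in [0, 2π]
cor(x1, cos(x1))                         # near 0 despite a perfect relationship

x2 <- seq(0, 20, by = 0.5)               # x = 0 : 0.5 : 20
cor(x2, x2^2)                            # approx 0.967 – monotone, so r is high

x3 <- seq(-10, 10, by = 0.5)             # x = -10 : 0.5 : 10
cor(x3, x3^2)                            # 0 – symmetric, so r vanishes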

https://www.socscistatistics.com/tests/pearson/default2.aspx

x     y = x²        x     y = x²
0     0
0.5   0.25          10.5  110.25
1     1             11    121
1.5   2.25          11.5  132.25
2     4             12    144
2.5   6.25          12.5  156.25
3     9             13    169
3.5   12.25         13.5  182.25
4     16            14    196
4.5   20.25         14.5  210.25
5     25            15    225
5.5   30.25         15.5  240.25
6     36            16    256
6.5   42.25         16.5  272.25
7     49            17    289
7.5   56.25         17.5  306.25
8     64            18    324
8.5   72.25         18.5  342.25
9     81            19    361
9.5   90.25         19.5  380.25
10    100           20    400

 The value of R is 0.9668.
 This is a strong positive correlation, which means that high X variable scores go
with high Y variable scores (and vice versa).
 The value of R², the coefficient of determination, is 0.9347.

Spearman rank correlation

 Degree of association between two variables
 Linear or nonlinear association
 As x increases, y increases or decreases monotonically

 In this case, the right-hand figure shows a nonlinear relationship, while the
left-hand figure indicates a linear relationship.
 This can be applied even to ordinal variables; look for degree of association
between 2 variables, the relationship may be either linear or non-linear
 If x increases y increases or decreases monotonically then the Spearman’s Rank
Correlation will tend to be very high.

 In the Spearman’s rank correlation, convert the data to ranks, even if it is
real-valued data
 Example – there are 10 data points; each individual value of x is assigned a rank
 For example, the lowest x value in this case is 2 and it is given rank 1; the
next highest x value is 3, which is given rank 2, and so forth.
 The sixth and the first values are tied, so instead of ranks 6 and 7 each
receives the midway value.
 Each is given a rank of 6.5 because of the tie.
 Similarly, if more than 2 values are tied, take all the ranks they would occupy
and average them over the number of data points which have equal values
 Also rank the corresponding y values
 For example, in this case the 10th value has rank 1, the eighth value has rank 2,
and so on.
 Compute the difference in ranks for each pair, square it to get the d² values, sum
over all observations, and compute the coefficient: rs = 1 – 6 Σ d² / (n(n² – 1))
 This coefficient rs takes a value between -1 (indicating a negative association)
and 1 (indicating a positive association between the variables)
 In this case the Spearman rank correlation turns out to be 0.88.
 rs = 0 means, no association
 Monotonically increasing rs = 1
 Monotonically decreasing rs = -1
 Can be used when association is nonlinear
 Can be applied for ordinal variables
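
A minimal R sketch (made-up data) illustrating both points: a monotone nonlinear relationship still gets a Spearman coefficient of 1, and rank() averages tied ranks exactly as described above:

x <- 1:10
y <- exp(x)                            # monotone but strongly nonlinear
cor(x, y, method = "spearman")         # 1: the ranks agree perfectly
cor(x, y, method = "pearson")          # noticeably less than 1

x_tied <- c(2, 3, 5, 5, 7, 8, 8, 8, 9, 10)
rank(x_tied)                           # tied values receive the average rank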

 The difference between Pearson’s and Spearman’s: not only can Spearman’s be
applied to ordinal variables, but even if there is a nonlinear relationship between
y and x the Spearman rank correlation can be high; it is not likely to be 0.
 So, it can be used to distinguish the kind of relationship between y and x.
 Apply it to the Anscombe data set
 It indicates that there is a really strong association between x & y
 https://rogeriofvieira.com/wp-content/uploads/2016/04/Correlation-
Interpreta%C3%A7%C3%A3o-bom.pdf

Correlation – Spearman's R Test

 The Spearman's R Correlation Test (also called the Spearman's rank correlation
coefficient) is generally used to look at the (roughly) linear relationship between
two ordinal variables e.g. satisfaction ratings for staff and level of staff training.
 Never run this test without viewing a scatterplot and visually examining the basic
shape of the relationship
 The test could indicate a low linear correlation and yet the data could have a very
strong and clear non-linear pattern e.g. a U shape.
 The other thing to look for is outliers. The correlation coefficient could also be
very high when the relationship is monotonic but not linear.

 Looking at the patterns,


o Very strong linear relationship (A)
o Less strong linear relationship (B), and
o Weaker linear relationship (C)
o (D) is barely there
o (E) is a very strong non-linear relationship, and
o (F) is an otherwise weak relationship with an important outlier

Kendall rank correlation coefficient


 Correlation coefficient to measure association between two ordinal variables
 Concordant pair: A Pair of observations (x1,y1) and (x2,y2) that follows the property
x1>x2 and y1>y2 or x1<x2, and y1<y2
 Discordant pair: a pair of observations (x1,y1) and (x2,y2) that follows property
x1>x2 and y1<y2, or x1<x2 and y1>y2
 The third type of correlation coefficient used for ordinal variables is called
Kendall’s rank correlation; this coefficient also measures the
association between ordinal variables.
 Classifying pairs as concordant or discordant is required
 When x increases and y increases, or x decreases and y decreases, we say the 2
data pairs are concordant
 Kendall rank correlation coefficient: τ = (nc – nd) / (n(n – 1)/2), where nc is the
number of concordant pairs and nd the number of discordant pairs
 Pairs for which x1 = x2 or y1 = y2 are not classified as concordant or
discordant and are ignored.
 So, once we have the number of concordant pairs and the number of discordant
pairs, we take the difference between them, divide by n(n – 1)/2, and that is
called Kendall’s τ.

 Classifying all of these pairs as either concordant or discordant, there are 6
discordant pairs and 15 concordant pairs
 Compute the Kendall’s τ coefficient; if it is high, there is broad agreement
between the two experts
 y and x are associated with each other, and the association is strong.
 If expert 2 completely disagrees with expert 1, we might even get negative values.
 This can be used for ordinal variables
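
As a sketch (the expert ranks below are made up so that the pair counts match the 15 concordant / 6 discordant example in the notes), R computes Kendall’s τ via cor():

# 7 items ranked by two experts: 15 concordant and 6 discordant pairs out of 21
expert1 <- c(1, 2, 3, 4, 5, 6, 7)
expert2 <- c(4, 2, 1, 3, 7, 5, 6)
cor(expert1, expert2, method = "kendall")   # (15 - 6) / 21 = 0.4286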

 Applying Kendall’s rank correlation to the Anscombe data set
 To get a preliminary idea before building the model, this can be used.
 By computing this correlation coefficient, a preliminary assessment of the type of
association that is likely to exist can be obtained, and then the model to be
built is selected.

17.12.2022
Linear Regression

 Popular technique for analyzing data and building models


 Purpose is to build a functional relationship (model) between dependent
variables and independent variables
 Example
 – Business: what is the effect of price on sales? (can be used to fix the selling
price of an item)
 – Engineering: can we infer difficult-to-measure properties of a product from
other easily measured variables? (Mechanical strength of a polymer from
temperature, viscosity or other process variables) – Also known as a soft sensor
 Basics – One of the widely used statistical techniques
 Dependent variable also known as response variable, regressand, predicted
variable, output variable – denoted as variable y.
 Independent variable also known as predictor variable, regressor, explanatory
variable, input variable – denoted as variable x.

 Regression types
o Classification of regression analysis
o Univariate vs Multivariate
 Univariate – one dependent and one independent variable
 Multivariate – multiple independent and multiple dependent
variables
o Linear vs Nonlinear
 Linear – relationship is linear between dependent and independent
variables.
 Nonlinear – relationship is Nonlinear between dependent and
independent variables.
o Simple vs Multiple
 Simple – one dependent and one independent variable (SISO)
 Multiple – one dependent and many independent variables (MISO)

 Regression analysis
o Is there a relationship between these variables?
o Is the relationship linear and how strong is the relationship?
o How accurately can we estimate the relationship?
o How good is the model for prediction purposes?

 Regression Methods
o Linear regression methods
1. Simple linear regression
2. Multiple linear regression
3. Ridge regression
4. Principal component regression
5. Lasso
6. Partial least squares
o Nonlinear regression methods
1. Polynomial regression
2. Spline regression
3. Neural networks
 Depends on the kind of assumptions and the kind of problems
 Regression process (iterative in nature)

 Whether the assumptions made in developing the model are acceptable or
not - residual analysis or residual plots
 Bad data points might affect the quality of the model
 Sensitivity analysis – how much a small error in the data affects the response
variable
 Data used in building the model is called the training data / training data set
 Testing phase of the model - evaluating the fitted model using data which is
called test data
 Test data is different from the training data
 70 or 80 percent of the experimental data is used for training or fitting the
parameters
 20 percent of the experimental data is used to test the model
 If a good model is not obtained, the set of variables chosen, the type of
experiments conducted, or the experimental data may have to be changed
 Example - data of 14 observations - a servicing problem with service agents
o A service agent visits several houses and takes a certain amount of time to
service or repair each unit
o They report the total amount of time taken in minutes – time spent on
servicing different customers – and the number of units serviced in a day
o 14 such data points, from the same person or multiple persons
o How much time they are spending and how many units are repaired every day
- used to judge the performance of the service agent
o To reward or to improve productivity
o To detect inefficiency
 Ordinary least squares
o 14 observations obtained on time taken in minutes for service calls and
number of units repaired
o Objective is to find relationship between these variables (useful for
judging service agent performance)

 Model: yi = β0 + β1xi + εi
 β0 - the intercept term
 β1 - represents the slope
 Observations always contain some error, denoted by εi
 What the model is unable to explain is denoted as εi, called the modeling error
 Ordinary least squares regression - there is no error in reporting of xi; only the
dependent variable yi could contain error
 The error is not a systematic error; it is a random error or modeling error
 Independent variable – must be accurate and error free
 The two variables here: units (x) and minutes (y)
 – Number of units repaired by a service agent will be reported exactly
o Because the agent will have a receipt from each customer saying that the
unit was serviced
o The total number of receipts that the service agent has gathered precisely
represents the number of unit serviced.
 – The amount of time taken could vary for several reasons;
o One, because the agent reported the total time (from when they actually started
out on the day to when they returned to the office at the end of the day).
o And this could involve not just the service time, but also travel time and
depending on the location, the travel time could vary
o It could vary from time of day, depending on the traffic;
o It could also vary because of congestion or a particular event that has
happened.
 Minutes is only an approximation - choose units as the independent variable x
(this is precise with no error) and minutes as the dependent variable y.
 When applying ordinary least squares, ensure that the independent variable is
measured extremely accurately, whereas y could contain other factors or
errors
 β0 represents the value of y when x is 0
 β1 represents the slope of this regression line
 The slope and the intercept will be different depending on what values were
proposed for β0 and β1
 Find out how much deviation there is between the observed value and the line
 The observed value is yi corresponding to this xi (which is 8 in the figure); the
vertical distance is called the estimated error
 Compute ei for every data point yi using the proposed parameters β0 and β1 and
the value of the independent variable, for all the observations
 Minimize the sum squared error: min Σ ei² = Σ (yi – β0 – β1xi)²

Ordinary least squares (OLS): Testing Goodness of Fit

 Plug in the value of xi in the estimated model, which uses the estimated
parameters β̂0 and β̂1, and call this prediction ŷi; it is also an estimated
quantity – for any given xi, estimate the corresponding yi using the model.
 Coefficient of determination R²: it is 1 minus the squared differences between the
observed and predicted values summed over all data points, divided by the total
variability of y:
R² = 1 – Σ (yi – ŷi)² / Σ (yi – y̅)²
 Values close to 0 indicate a poor fit; values close to 1 indicate a good fit
 Look at other measures (such as adjusted R²) before concluding
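
To make the definition concrete, a short R sketch (made-up data) computes R² by hand and checks it against the value summary() reports:

set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 2)
fit <- lm(y ~ x)

y_hat <- fitted(fit)                              # predicted values ŷi
1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)     # R² computed by hand
summary(fit)$r.squared                            # same value from summary()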
 https://www.statisticshowto.com/probability-and-statistics/t-test/
 https://youtu.be/fKZA5waOJ0U (for t-test)


 OLS – example using R

 The R command for fitting a linear model is called lm; minutes is indicated as
the dependent variable, and units as the independent variable - both part of the
data set
 Loading of the data set
 lm is the function used to build the model
 First we get the range of residuals, the estimated values of εi for all 14 data
points
 (The max value, min value, first quartile, third quartile and the median are given)
 From this it is concluded that maybe a linear model explains the
relationship between x and y very precisely

19.12.2022
Simple Linear Regression Modeling

 Simple linear regression


o Loading the data from .txt file
o Plot the data
o Build linear model
o Look at summary of the model

Loading data
 Loading data – dataset ‘bonds’ is given in “.txt” format
 To load data from the file the function used is read.delim()
 read.delim() – reads a file in table format and creates a data frame from it
 Syntax: read.delim(file, row.names=1)
o File – The name of the file which the data are to be read from.
 Each row of the table appears as one line of the file.
o row.names – A vector of row names
 This can be a vector giving the actual row names, or a single
number giving the column of the table which contains the row
names, or character string giving the name of the table column
containing the row names
 Loading data – assuming that bonds.txt is in the current working directory
o bonds <- read.delim("bonds.txt", row.names = 1)
o The data is saved into a data frame ‘bonds’

Viewing data
 Viewing data – View(bonds) will display the data frame in a tabular format

 head(bonds) and tail(bonds) will display the first and last six rows from the
data frame

Description of dataset
 The data has 2 variables – ‘Coupon Rate’ & ‘Bid Price’
 ‘Coupon Rate’ refers to the fixed interest rate that the issuer pays to the lender
 ‘Bid Price’ is the price someone is willing to pay for the bond

Structure of the data


 Each variable and its data type
 str() – input is data frame
 See whether each variable’s data type is as you expect it to be
 If not, coerce it

Summary of the data


 Gives mean and five number summary

Plotting the data

plot(bonds$CouponRate, bonds$BidPrice,
     main = "Bid Price vs Coupon Rate",
     xlab = "Coupon Rate",
     ylab = "Bid Price")

 Inputs to the plot function are basically x and y
 x refers to coupon rate and y refers to bid price
 In order to access the variables, give the name of the data frame followed by a
dollar symbol
 The parameter main specifies the title of the plot; xlab is the x label, set to
“Coupon Rate”, and ylab is the y label, set to “Bid Price”.
 Now, we see a linear trend
 There are some points which are completely outside the range of coupon rate

Building linear regression model


 Building linear model using the function lm()
 Syntax: lm(formula, data)
lm(dependent var ~ independent var)
bondsmod <- lm(bonds$BidPrice~bonds$CouponRate)
or
bondsmod <- lm(BidPrice~CouponRate, data=bonds)

 The linear model is built and saved as the object ‘bondsmod’
 Fitting the regression line over the plot

plot(bonds$CouponRate, bonds$BidPrice,
     main = "Bid Price vs Coupon Rate",
     xlab = "Coupon Rate",
     ylab = "Bid Price")
abline(bondsmod)

 A function called abline() is used, and the input to the function is ‘bondsmod’,
the linear model
 ab here refers to the intercept and slope
 If the equation is of the form y = a + bx, then ‘a’ is the intercept and ‘b’ is the slope.
 a is β̂0 and b is β̂1
 The regression line fits badly and does not identify the outliers (regression lines
are affected by outliers)

Model summary
bondsmod <- lm(BidPrice~CouponRate, data = bonds)
summary(bondsmod)

 4 sections of output
o Call
o Residual
o Coefficients
o Heuristics

 Formula BidPrice versus CouponRate

 BidPrice - dependent variable
 CouponRate - independent variable
 Residuals are nothing but the difference between the observed and predicted values
 εi corresponds to the residuals
 Standard Error is the estimated standard deviation associated with the slope and
intercept

 t value - the ratio of the estimate to its standard error; it is also an important
criterion for hypothesis testing.
 Null hypothesis: the estimated coefficient = 0.
 The F statistic is again used to test the null hypothesis, which is nothing but
slope = 0.

27.12.2022
Multiple Linear Regression

 Multiple linear regression consists of one dependent variable, but several
independent variables.
 The dependent variable is denoted by y and the several independent variables
are denoted by the symbols xj, where j = 1 to p.
 There are p independent variables which we believe affect the dependent variable.
 Dependent variable (y) depends on p independent variables xj, j = 1, 2,…., p
 General linear model: y = β0 + β1x1 + β2x2 + …… + βpxp + ε
 For the ith observation, yi = β0 + β1x1,i + β2x2,i + …… + βpxp,i + εi
 Objective – using ‘n’ observations, estimate the regression coefficients
 β0 – intercept; β1, β2, …, βp represent the slope parameters, or the effects of the
individual independent variables on the dependent variable.
 The error is due to error in the dependent variable measurement
 In ordinary least squares (OLS), the assumption is that the independent variables
are perfectly measured and don’t have any error
 Whereas the dependent variable may contain some error, and that error is
denoted as ε.
 Assume that a small number n of samples is obtained.
 Aim is to find the best estimates of β0, β1, β2 up to βp using these n sample
measurements of the x’s and corresponding y.
 This is called multiple linear regression because we are fitting a linear
model with many independent variables.
 Approach similar to simple regression
o Minimize the sum of squares of the errors
 Vector and matrix notations

 The linear model in matrix form
o y = Xβ + ε, E(ε) = 0, Var(ε) = σ²I
 SSE
o S(β) = εᵀε = (y – Xβ)ᵀ(y – Xβ)
 In order to find the best estimates of the parameters β0 to βp, set up the
minimization of the sum squared of errors using vectors and matrices
 Define the vector y, which consists of all the n measurements of the dependent
variable y1 to yn
 Subtract the mean value of all these measurements from each of the
observations
 The first one represents the first sample value of the dependent variable y1 – the
mean value of y over all the measurements, y̅ .
 So, the first sample is mean shifted value of the first observation
 The 2nd coefficient or 2nd value in this vector is the second sample value – the
mean value of the dependent variable and so on for all the n observations
 So, these are the mean shifted values of all the n samples for the dependent
variable
 Similarly construct a matrix X, where the first column corresponds to
independent variable 1
 Take the sample value of the first independent variable and subtract the mean
value of the first independent variable
 Repeat this for all n measurements of the first independent variable and for all p
independent variables
 Matrix X will be of size n x p (n – no. of rows, p – no. of columns)
 Each row represents a sample, each column represents a variable
 The 1st column represents the 1st independent variable
 The last column represents the pth independent variable
 Represent all the coefficients β except β0 in vector form, β1 to βp
 β1 is the first coefficient, βp is the coefficient corresponding to the pth variable
 The β vector is a p x 1 vector
 ε, the noise vector, as ε1 to εn corresponding to all the n observations
 Write our linear model in the form y = x β + ε.
 β0 is not included, linear model only involves the slope parameters β1 to βp, does
not involve the β0 parameter because that has been effectively removed from the
linear model using this mean subtraction idea
 Write the linear model compactly as y = x β + ε
 Assumption about the error: it has a 0 mean vector
 So, E(ε) = 0 implies ε is a random vector with 0 mean, and the
variance-covariance matrix of ε is assumed to be σ²I.
 σ²I in this form means all the epsilons, ε1 to εn, have the same
variance σ², the homoscedastic assumption (an assumption of equal or similar
variances in different groups being compared)
 εi and εj are uncorrelated if i is not equal to j, in which case we write the
covariance matrix of ε as σ²I.
 Under this assumption find the estimates of β so as to minimize the sum square
of the errors.
 SSE:
o S(β) = εᵀε = (y – Xβ)ᵀ(y – Xβ)
 Assume that XᵀX, which is a square matrix, is invertible (X is a full rank
matrix); then the solution for β̂ is easily found
 Solving this linear set of equations gives β̂ = (XᵀX)⁻¹Xᵀy
 So, the coefficient vector β̂ can be found
 And this is the solution that minimizes the sum squared errors
 After estimating σ̂², the variance of the error, from the data, construct
confidence intervals for each slope parameter
 The true slope parameter lies in this confidence interval
 For any confidence interval choose 1 - α, where α represents the level of
significance.
 If α = 0.05, 1 - α = 0.95, a 95 percent confidence interval
 The square root of the diagonal element, which represents the standard deviation
of the estimated value of β, is used in order to construct this confidence
interval.
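
As a numerical check (a sketch with made-up data, not from the notes), the closed-form estimate can be computed in R with matrix operations on mean-centered data and compared against lm():

set.seed(2)
n <- 50
X <- cbind(rnorm(n), rnorm(n))                  # two independent variables
y <- 1 + 2 * X[, 1] - 3 * X[, 2] + rnorm(n, sd = 0.5)

Xc <- scale(X, center = TRUE, scale = FALSE)    # mean-center X, as in the notes
yc <- y - mean(y)                               # mean-center y (removes β0)

beta_hat <- solve(t(Xc) %*% Xc, t(Xc) %*% yc)   # β̂ = (XᵀX)⁻¹Xᵀy
beta_hat                                        # close to (2, -3)
coef(lm(y ~ X))                                 # same slopes, plus the intercept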

Sum of Squares Total, Sum of Squares Regression and Sum of Squares Error
https://365datascience.com/tutorials/statistics-tutorials/sum-squares/

There are three terms –
the sum of squares total, the sum of squares regression, and
the sum of squares error.

What is the SST?


The sum of squares total, denoted SST, is the sum of squared differences between
the observed dependent variable and its mean: SST = Σ (yi – y̅)².

It is a measure of the total variability of the dataset.


There is another notation for the SST. It is TSS or total sum of squares.

What is the SSR?


The second term is the sum of squares due to regression, or SSR. It is the sum of the
squared differences between the predicted value and the mean of the dependent
variable: SSR = Σ (ŷi – y̅)².

If this value of SSR is equal to the sum of squares total, it means


our regression model captures all the observed variability and is perfect. Once again, we
have to mention that another common notation is ESS or explained sum of squares.

What is the SSE?


The last term is the sum of squares error, or SSE. The error is the difference between
the observed value and the predicted value: SSE = Σ (yi – ŷi)².

We usually want to minimize the error. The smaller the error, the better the estimation
power of the regression. It is also known as RSS or residual sum of squares. Residual
as in: remaining or unexplained.

How Are They Related?


Mathematically, SST = SSR + SSE.

The rationale is the following: the total variability of the data set is equal to the
variability explained by the regression line plus the unexplained variability, known as
error.

Given a constant total variability, a lower error will cause a better regression. Conversely,
a higher error will cause a less powerful regression.
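
A quick numerical check of this identity in R (made-up data):

set.seed(3)
x <- runif(30, 0, 10)
y <- 5 + 1.5 * x + rnorm(30)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)                # total variability
SSR <- sum((fitted(fit) - mean(y))^2)      # variability explained by the line
SSE <- sum(residuals(fit)^2)               # unexplained variability
c(SST = SST, SSR_plus_SSE = SSR + SSE)     # the two totals agree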

28.12.2022
 Is the fitted model adequate, or can it be reduced further?
o Test the significance of individual coefficients β̂
o A general unified test of the full model (FM) vs the reduced model (RM)
 Hypothesis testing
o H0: Reduced Model is Adequate
o H1: Full Model is Adequate
 Check R²; if the value is close to 1, maybe the linear model is good, but that is
not a confirmatory test – do the residual plot
 In the univariate case there is only one independent variable, but here there are
several independent variables.
 Maybe not all independent variables have an effect on y. Some of the independent
variables may be irrelevant.
 One way of finding whether a particular independent variable has an effect is
to test the corresponding coefficient.
 If the confidence interval contains 0, we can say the corresponding
independent variable does not have a significant effect on the dependent variable,
and drop it
 F test - whether the full model is better than the reduced model
 The reduced model contains no independent variables whereas, the full model can
contain all or some of the independent variables.
 Testing two models: RM with k parameters, FM with p + 1 parameters
 F-statistic: F = [(SSE(RM) – SSE(FM)) / (p + 1 – k)] / [SSE(FM) / (n – p – 1)]
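
In R, this full-versus-reduced comparison can be carried out with anova() on two nested lm fits (a sketch with made-up data; here x2 has no real effect, so the test should not reject the reduced model):

set.seed(4)
n <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 3 * x1 + rnorm(n)          # x2 does not enter the true model

reduced <- lm(y ~ x1)               # RM
full <- lm(y ~ x1 + x2)             # FM
anova(reduced, full)                # F-test of H0: the reduced model is adequate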

Example:

Menu Pricing in Restaurants of NYC


y: price of dinner
x1: Customer rating of the food (Food)
x2: Customer rating of the decor (Decor)
x3: Customer rating of the service (Service)
x4: If the restaurant is east or west (East)

Objective: Build a model

 Before building the model, visualize with scatter plots.


 Price is y.
 y vs x1 – shows scatter plot for price versus food
 y vs x2 – shows scatter plot for price versus décor
 y vs x3 – shows scatter plot for price versus service
 y vs x4 – shows scatter plot for price versus location
 also scatter plot can be developed between food and décor, food and service and so
on
 Effect of collinearity - even though we consider all these variables (food, decor,
service, location) as independent, it is possible that they are not truly
independent; there might be interdependencies between them
 A scatter plot may reveal some of these interdependencies between the
independent variables.
 Food vs décor – no correlation
 Food vs service – strong correlation

 Apply the R function lm to this data set and examine the output.
 The output from R shows that the intercept term is -24.02; among the slope
parameters, the coefficient multiplying food is 1.5, the coefficient multiplying
decor is 1.9, and so on.

 Residual plots – standardized residuals for assessing
o Linear vs nonlinear model
o Normality of the errors
o Homoscedastic vs heteroscedastic errors

 The standardized residuals are plotted against the predicted price value, or the
fitted value
 Since ŷi is only one variable, only one plot is generated; the red lines show the
confidence interval for the standardized residuals, and anything outside this
interval indicates outliers.
 Observations 56, 48, 30 & 109 may be possible outliers, and there is no pattern
in the standardized residuals
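
A sketch of how such a plot can be produced in R; the data frame name ‘nyc’ and its columns are assumptions for illustration, and rstandard() returns the standardized residuals:

fit <- lm(Price ~ Food + Decor + Service + East, data = nyc)   # 'nyc' is hypothetical
plot(fitted(fit), rstandard(fit),
     xlab = "Fitted values", ylab = "Standardized residuals",
     main = "Standardized residuals vs fitted prices")
abline(h = c(-2, 2), col = "red", lty = 2)   # rough band; points outside are candidate outliers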

29.12.2022
Logistic regression
Introduction
 Logistic regression is a classification technique
 Decision boundary (generally linear) derived based on probability interpretation
o Results in a nonlinear optimization problem for parameter estimation
 Goal: Given a new data point, predict the class from which the data point is likely
to have originated

Binary classification problem


 Classification is the task of identifying a category that a new observation belongs to
based on the data with known categories
 When the number of categories is 2, it becomes a binary classification problem
 Binary classification is a simple “Yes” or “No” Problem

Input features

 Input features can be both qualitative and quantitative
 If the inputs are qualitative, then there has to be a systematic way of converting
them to quantities
 For example – a binary input like a “Yes” or “No” can be encoded as “1” or “0”
 Some data analytics approach can handle qualitative variables directly
 Data is expressed in many attributes X1 to Xn (also called as input features)
 And these input features could be quantitative, or qualitative
 Quantitative features can be used as they are
 While using a quantitative technique with input features that are qualitative,
we should have some way of converting these qualitative features into
quantitative values.
 Example – data point [yes, 0.1, 0.3], and another data point [no, 0.05,-2] etc.
 There are some data analytics approaches that can directly handle these qualitative
features without a need to convert them into numbers and so on.
 Decision function is linear
 Binary classification can be performed depending on the side of the half-plane that
the data falls in
 Guessing “yes” or “no” is pretty crude
 Can we do better using probabilities?

 Assume that all the circular data belong to one category and all the starred data
belong to another category.
 Hyper plane – one side of the hyper plane is one half space, the other side is the
other half space
 A positive value is assigned to one side of the hyper plane and a negative value
to the other
 Data is inherently noisy - the closer a point gets to the line, the more uncertainty
there is about whether it belongs to class 0 or class 1
 A data point whose true value is on one side could, because of noise, slip to the
other side.
 The probability, or the confidence, with which the particular class is decided
intuitively comes down near the boundary
 Logistic regression addresses this

Why model probabilities?


 The probability of “Yes” or “No” gives a better understanding of the sample’s
membership to a particular category
 Estimating the binary outputs from the probabilities is straight forward through
simple thresholding
Linear and log models
 Make p(x) a linear function of x
o p(x) = β0 + β1x
 The solution is written in vector form
 Expanding it in terms of x1 and x2,
o p(x) = β0 + β11x1 + β12x2 (the equation of a line in this two dimensional space)
 This makes p(x) unbounded below 0 and above 1 (probability has to be bounded
between 0 and 1)
 Might give nonsensical results, making it difficult to interpret them as probabilities
 Find some function which is bounded between 0 and 1
 Make log(p(x)) a linear function of x
o log(p(x)) = β0 + β1x
 Bounded only on one side
 Instead of just looking at the decision boundary and then saying yes/no or +/-,
use this equation itself to come up with a probabilistic interpretation.
 log(p(x)) = β0 + β1x ensures that p(x) never becomes negative;
 On the positive side p(x) can go to ∞, which again is a problem because we need
to bound p(x) between 0 and 1.

Sigmoid function
 Make p(x) a sigmoid function of x:
o p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))
 p(x) is bounded above by 1 and below by 0


 Good modeling choice for real life scenarios
 The LHS of the second equation can be interpreted as the log odds-ratio:
log(p(x) / (1 – p(x))) = β0 + β1x
 The sigmoidal function has relevance in many areas; it is used in neural networks
and other very interesting applications
 Convert that hyper plane into a probability interpretation.
 The argument β0 + β1x, depending on the value of x, can go all the way from -∞
to ∞.
 From the equation for the hyper plane, we have been able to come up with the
definition of a probability which makes sense, bounded between 0 and 1.
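
A minimal R sketch of the sigmoid, showing how it squashes the unbounded score into (0, 1):

sigmoid <- function(z) 1 / (1 + exp(-z))    # same as exp(z) / (1 + exp(z))

z <- seq(-10, 10, by = 0.1)                 # z stands for beta0 + beta1 * x
plot(z, sigmoid(z), type = "l", xlab = "beta0 + beta1*x", ylab = "p(x)")
sigmoid(c(-100, 0, 100))                    # approx 0, exactly 0.5, approx 1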

Estimation of parameters
 We find the parameters in such a way that plugging them into the model equation
gives the best possible classification for the inputs from both classes
 This can be formalized by maximizing the following likelihood function:
L(β) = Π p(xi)^yi (1 – p(xi))^(1 – yi)

When xi belongs to class 0, yi = 0; when xi belongs to class 1, yi = 1

 Hyper plane equation: β0 + β11x1 + β12x2
 Identify values for the parameters β0, β11 and β12
 All machine learning techniques can be interpreted, in some sense, as an
optimization problem.

 For any data point on the side belonging to class 0, we want to minimize p(x)
when x is substituted into the probability function; and for any point on the
class 1 side, when we substitute the data point into the probability function, we
want to maximize the probability.

Log-likelihood Function

 Taking the log of the likelihood gives the log-likelihood:
ℓ(β) = Σ [yi log p(xi) + (1 – yi) log(1 – p(xi))]
 Simplifying this expression and using the definition of p(x) results in an
expression in the parameters of the linear decision boundary
 Now the parameters can be estimated by maximizing the above expression using
any nonlinear optimization solver.
 Solving yields the hyper plane
 For a two dimensional problem, there will be 3 parameters
 For an n dimensional problem, n + 1 decision variables will be identified through
this optimization solution.
 For any new data point, once we put that data point into the p(x) function, the
sigmoidal function, we get the probability that it belongs to class 0 or class 1.
 This is the basic idea of logistic regression.
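
In R, this is what glm() with family = binomial does, maximizing the log-likelihood above; a minimal sketch on made-up two-feature data:

set.seed(5)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
p <- 1 / (1 + exp(-(-1 + 2 * x1 + 3 * x2)))      # true class-1 probabilities
y <- rbinom(n, 1, p)                             # 0/1 class labels

fit <- glm(y ~ x1 + x2, family = binomial)
coef(fit)                                        # estimates of β0, β11, β12
predict(fit, newdata = data.frame(x1 = 0.5, x2 = -0.2),
        type = "response")                       # p(x) for a new data point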

 Modeled the probability as a sigmoidal function.
 The argument of the sigmoidal function is the hyper plane equation.
 It is a scalar in ‘n’ dimensions (β0 + β11x1 + β12x2 + …… + β1nxn)
 If this quantity is a very large negative number, then the probability is 0, and if
this quantity is a very large positive number, the probability is 1.
 Use a threshold of 0.5, because probabilities go from 0 to 1.
 p(x) = 0.5 exactly when β0 + β1x = 0.
 This is because p(x) = e⁰ / (1 + e⁰), which equals ½; in the n dimensional case
such a point has an equal probability of belonging to either class 0 or class 1
Example – 1

Class 0          Class 1          Test Data
X1    X2         X1    X2         X1    X2    Class?
1     1          6     3          1     3     ?
2     1          7     3          2     3     ?
3     1          8     3          4     4     ?
4     1          9     3          5     4     ?
5     1          10    3          3     3     ?
1     2          6     4          6     2     ?
2     2          7     4          9     2     ?
3     2          8     4          8     1     ?
4     2          9     4          7     2     ?
5     2          10    4          10    1     ?
 We have data for class 0 and data for class 1, and clearly this is a 2 dimensional
problem, so the hyper plane is going to be a line.
 A line will separate the classes; these kinds of classification problems are called
supervised classification problems, because all of the data is labeled
 Given new data, called the test data, the question is: what class does this test
data belong to – class 0 or class 1?
 Fraud detection – lots of records of fraudulent credit card use, and all of those
instances can be described by certain attributes
 The time of day at which the transaction was made, the place where the person
lives, the type of transaction, and many other attributes
 There are lots of records for normal use of the credit card and some records for
fraudulent use
 A classifier could identify, for a new transaction that is being initiated, the
likelihood of the transaction being fraudulent
 Task – fill the last column with 0 & 1 (data belongs to class 0 or class 1)

 Plotted the same data that was shown in the last table
 Logistic regression is used to solve the problem

Results
Input Features : X1, X2
Classes        : 0, 1
Parameters     : β0 = -42.5487
                 β11 = 2.95009
                 β12 = 10.4012

Test results
X1    X2    Probability    Class
1     3     0.0002         0
2     3     0.004          0
4     4     0.999          1
5     4     0.999          1
3     3     0.076          0
6     2     0.0172         0
9     2     0.991          1
8     1     0.0002         0
7     2     0.251          0
10    1     0.0667         0

 Parameter values – found through the optimization problem of maximizing the
log likelihood with β0, β11 and β12 as decision variables
 3 decision variables for a 2 dimensional problem
 1 coefficient for each dimension and then 1 constant.
 Plug the test data into this sigmoidal function and get the probability
 Threshold = 0.5, anything less than 0.5 is going to belong to class 0, and anything
greater than 0.5 is going to belong to class 1
 The process of identifying these parameters is called Training in machine
learning algorithms
 The data used while these parameters are being identified are called the Training
Data, and the held-out data is called the test data (X1 & X2 in the above table)
 Data with class labels – split this into training data and test data
 Reason – if the classifier is built on some data and tested on data whose labels
are unknown, there is no way of verifying the result obtained
 Use some portion of the data to build the classifier
 Retain some portion of the data for testing; the reason for retaining this is that
the labels are already known for it.
 It can be used for comparing the obtained result with the existing data
 Used for verifying the classifier
 What portion of the data should be used for training and testing?
 Different ways of doing validation exist – K-fold validation is one among them
 In 2 dimensions it is easy to visualize the data points
 It is difficult in multiple dimensions
 Logistic regression performs linear classification
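
A common R pattern for the split described above (a sketch; the data frame and the 70/30 proportion are illustrative assumptions):

set.seed(6)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$class <- as.integer(dat$x1 + dat$x2 > 0)         # made-up 0/1 labels

idx <- sample(nrow(dat), size = 0.7 * nrow(dat))     # 70% of rows for training
train <- dat[idx, ]
test <- dat[-idx, ]

fit <- glm(class ~ x1 + x2, family = binomial, data = train)
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
mean(pred == test$class)                             # accuracy on held-out test data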

Regularization

General Objective

 L = log likelihood objective function
 Θ = the vector of all the decision variables (the coefficients and the constant of
the hyper plane)
 If there are n variables / features / attributes in the problem, then the number of
decision variables identified is n + 1.
 When ‘n’, the number of independent variables, is very large, it can lead to
overfitting
 When coefficients are not really contributing to the solution or the efficacy of
the solution, penalize them to prevent overfitting; this is known as Regularization
 Regularization helps in building non-complex models, or in other words,
regularization avoids building complex models
 Overfitting effects can be reduced

 Minimize the negative log likelihood
 Now the objective is: minimize –L(Θ) + λ h(Θ)
 Where λ is the regularization parameter and h(Θ) is the regularization function


 Depending on h(Θ), the regularization can be classified as L1 or L2 type.
 When the Θ values are large, the penalty will be more
 It should lead to improvement
 h(Θ) = ΘᵀΘ for L2 type regularization:
(β0 β11 β12)(β0 β11 β12)ᵀ = β0² + β11² + β12²
 The larger the value of λ, the greater the regularization strength
 Regularization helps the model work better on test data because overfitting on
the training data is minimized
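
One way to fit an L2-regularized logistic regression in R is the third-party glmnet package (a sketch assuming glmnet is installed; alpha = 0 selects the ridge/L2 penalty and lambda is the regularization parameter λ):

library(glmnet)

set.seed(7)
X <- matrix(rnorm(200 * 10), nrow = 200)         # 10 features, most irrelevant
p <- 1 / (1 + exp(-(X[, 1] - X[, 2])))           # only the first two features matter
y <- rbinom(200, 1, p)

fit <- glmnet(X, y, family = "binomial", alpha = 0, lambda = 0.1)
coef(fit)                                        # shrunken coefficients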

*****

