Predictive Modelling Monograph
Contents
1. Introduction
1.1 Introduction to Predictive Modelling
1.2 Supervised and Unsupervised Learning
2. The Problem of Prediction: Bias Variance Trade-off
2.1 Bias of a model
2.2 Variance of a model
2.3 Bias Variance Trade-off
2.4 Training and Test data sets
2.5 Cross-Validation
3. Model Selection
3.1 Transformation of the response
3.2 Information criteria: AIC and BIC
3.3 Forward Selection Algorithm
3.4 Backward Elimination
3.5 Stepwise Regression
3.6 All Possible Regression or Regression Subset Selection and Mallows' Cp Criterion
3.7 Choosing the Final Model
List of Figures
Fig. 4: Residual plot for 1/alcohol vs. other predictor variables
Fig. 5: Residual plot for 1/alcohol² vs. other predictor variables
Fig. 6: Flowchart for Regression Model Selection

List of Tables
Table 1: Description of the data (Data Dictionary)
1. Introduction

1.1 Introduction to Predictive Modelling
Technology has given us the power to capture data from almost everything: road accidents, grocery store bar codes, customer loyalty programs, as well as political opinions expressed on Twitter and Facebook. However, amassing data does not help us in any way unless we can understand the information contained in it. Predictive modeling and data mining help to detect hidden patterns in the data and to forecast outcomes for situations not yet observed.

In this monograph and a few subsequent ones, various topics in predictive modeling and data mining will be taken up. But before we go into the details of predictive modeling, several major concepts need to be addressed.
What is the objective of predictive modeling?
Let us examine two different cases. Identification of spam and detection of fraudulent credit card transactions are two situations where the application of predictive modeling is important. In the former case, the objective is to detect spam email through a filter and classify it as such. In the latter case, banks want to identify frauds immediately and red-flag the transactions. In both cases, prediction accuracy needs to be very high, though how the spam was filtered or the fraud was detected may not be so important. Here the model may be complex and its interpretability low. Such models are often known as 'black box' models. Automation works well in these settings.
Consider another case, where a physician needs to predict whether a post-menopausal woman above 65 years of age is at risk of knee replacement surgery, given various other health indicators. Here a 'black box' model may not be acceptable despite having high accuracy. The reason is that it is not enough to know who is at risk; it is mandatory to mitigate her risk of knee replacement. Unless the physician can understand which health indicators are most important, she will not be able to provide an effective treatment regime for her patient. Interpretability is a must in this situation, and a black box model may be rejected in favor of an interpretable one, even though the former may be more accurate.
That is not to say predictive accuracy needs to be sacrificed totally to improve interpretability.
Somewhere a balance needs to be struck. The suitability of the predictive model is an important
issue that may need to be addressed case by case.
1.2 Supervised and Unsupervised Learning
All problems of data mining, pattern recognition, or predictive modeling come under the
umbrella known as Statistical Learning.
Statistical learning problems can be partitioned into Supervised or Unsupervised Learning.
The problems discussed in the previous section belong to the first category. The objective in
each case is to predict a response, typically denoted by Y. Corresponding to each unit of
observation there is a set of independent variables or predictors (X), based on which Y is
estimated (see the monograph on Regression). Whenever the response is available in the data set, the problem falls under supervised learning.
Supervised learning can be further divided into two classes, depending on the nature of the
response Y. If Y is a continuous variable, the problem falls under the category Regression. On
the other hand, if the response is qualitative, binary, or multi-class, the problem falls under the
category Classification. Spam identification (Yes/No) and detection of fraud (Yes/No) are both
classification problems.
The problem of assigning a risk value to a patient may be considered a classification problem
if the risk is defined as low, medium, or high. If, however, a continuous risk probability is to be
estimated for each patient, the problem is considered a regression problem.
Unsupervised learning problems are those where there is no response. One example of an unsupervised learning problem is to categorize loyalty customers into Gold, Silver, and Bronze tiers, depending on their propensity to spend in a store. Detection of possible clusters in a multivariate data set is an unsupervised learning problem.

Another example of unsupervised learning is the extraction of factors from a complex data set (see the monograph on Dimension Reduction).
An illustrative figure below shows a few techniques under each type of learning.
[Figure: Types of Statistical Learning. Supervised learning covers Regression (e.g. linear regression, regression tree) and Classification; unsupervised learning covers cluster analysis (e.g. hierarchical clustering, K-means clustering) and dimension reduction (PCA and Factor Analysis).]
2. The Problem of Prediction: Bias Variance Trade-off
Identification of a suitable prediction model involves a trade-off between its bias and variance. A proper understanding of bias and variance helps in avoiding the mistakes of under-fitting and overfitting, with which the two quantities are intimately associated. A good model is one for which both the bias and the variance are as small as possible and the predictive ability is good.
Let us consider a model Y = f(X) + ε, where f is an arbitrary function of the independent variables X and ε is the random error component. (Most models in predictive problems are of this type.) The function f may be completely unknown, or one may have only partial knowledge about its form. For example, suppose f is linear, i.e. f(x) = a + bx. Unless the numerical values of a and b are known, only the type of the function f is known, not the complete function. Suppose an estimate f̂ of f is obtained. For any given value of X, the predicted value is Ŷ = f̂(X). Usually f̂ is computed from the observed data, and hence f̂(x) is a random variable.

All the definitions and explanations in the next three sections will be based on this model.
2.1 Bias of a model

Bias is the average difference between the values predicted by the model and the true values of the underlying function:

Bias = E(f̂(x) − f(x))

where E denotes expectation or mean. A model is called unbiased if Bias = 0. Bias can be both negative and positive, so the desirable condition is that the absolute value of the bias be close to 0.
Recall that a multiple linear regression model is unbiased, since the least-squares residuals average to 0.
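To make the definition concrete, here is a minimal Monte Carlo sketch (not from the monograph; the quadratic "true" function, sample size, and evaluation point are illustrative assumptions). A straight line is repeatedly fitted to data generated from a quadratic truth, and the average prediction at a fixed point is compared with the true value:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                    # assumed true function (illustrative)
x_test = 1.5                          # point at which bias is evaluated
preds = []
for _ in range(2000):
    x = rng.uniform(0, 2, 30)         # a fresh training sample each replication
    y = f(x) + rng.normal(0, 0.3, 30)
    b, a = np.polyfit(x, y, 1)        # underfit: a straight line a + b*x
    preds.append(a + b * x_test)
# Bias = E(f̂(x)) − f(x); non-zero here because a line cannot capture the curvature
print("Estimated bias at x = 1.5:", np.mean(preds) - f(x_test))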
2.2 Variance of a model
Variance is defined as the variability of the predicted values; it quantifies the instability of the model. Formally,

Variance = E[(f̂(x) − E(f̂(x)))²]

Together with the squared bias, this variance contributes to the mean squared error (MSE) of prediction. Unlike bias, the variance can never be negative.
A model with high variance pays a lot of attention to the available data. Such models perform very well on the data used to fit the model but have limited power to predict data that has not been observed; hence they show high error rates on new data. This phenomenon is called overfitting. Overfitting happens when a model follows the observed data too closely and captures the noise along with the underlying pattern.

Suppose n paired data points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) are observed. The model y = f(x) + ε is fitted to the data and the estimate f̂ is obtained. If a new observation x is one of x₁, x₂, …, xₙ, then f̂ can be chosen so that E(f̂(x)) = f(x) exactly, and hence the bias is zero when evaluated at the observed x values. But if x is different from x₁, x₂, …, xₙ, then the estimate f̂ may behave poorly. The reason is that f̂ becomes very short-sighted: it tries to fit the observed points (xᵢ, yᵢ) as perfectly as possible and does not take into consideration any new data point that may arrive in the future.
2.3 Bias Variance Trade-off

[Figure: three models fitted to the same data: an overfitted curve passing through every observation (left), an underfitted straight line (middle), and a balanced fit (right).]
The leftmost panel shows an almost perfect fit to the observed dataset, and thus the bias is very small. But for this model the variability is very high: as soon as one new observation is added to the data, the fitted function f̂(x) may change considerably.

The middle panel shows a model that completely ignores the curvature in the dataset and draws a straight line through it. Clearly it does not fit the data at all. This model varies only slightly across different samples, and the impact of adding a number of points may be negligible. Though in theory the variance of a model cannot be zero, for all practical purposes it may be considered such here. But the bias of this model is very high.

The rightmost panel shows a much better fit. It has neither zero bias nor negligible variance, but it strikes a balance between the two and also fits the data well.
The goal of predictive modelling is to reduce Bias² + Variance. As described above, because of the bias-variance trade-off no model can reduce both bias and variance simultaneously; a good model will therefore try to minimize the sum Bias² + Variance.
Such a model has the best predictive power, i.e. it is able to provide the best possible estimated
value for a new observation.
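This decomposition can be checked empirically. The following minimal sketch (not from the monograph; the sinusoidal truth, noise level, and candidate polynomial degrees are illustrative assumptions) estimates Bias² + Variance at a fixed point for polynomial fits of increasing degree, showing the trade-off numerically:

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)              # assumed true function (illustrative)
x_test = 1.0                             # point at which the error is decomposed
for degree in [1, 3, 7]:                 # underfit, balanced, overfit
    preds = []
    for _ in range(1000):
        x = rng.uniform(0, 2, 40)        # fresh training sample each replication
        y = f(x) + rng.normal(0, 0.3, 40)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x_test)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.5f}, variance = {variance:.5f}, "
          f"sum = {bias2 + variance:.5f}")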
2.4 Training and Test data sets

The discussion above establishes an important fact: unless the predictive ability of a model is tested on an independent data set, different from the one used to build the model, a vital aspect of the model is ignored. This necessitates splitting the existing data set into two or more parts.
Training data: A training dataset is used to fit one or more models and estimate their parameters.
Test data: A test dataset is used to assess the performance of the developed model. The test set should be as close as possible to the training dataset (more formally, it should have the same distribution) but have no overlap with it.
Typically, training and test data are formed by randomly partitioning the observed data in an 80:20 split. Other possible splits are 75:25 or 70:30, or some other ratio, all taken randomly. If the data set is very large, even a 50:50 split into training and test is possible. The training data is larger so that the model parameters are estimated with considerable accuracy. The purpose of the test data is to check how close the predicted values are to the observed values.
The choice of test data may also be modified for special applications. In certain predictive methods that are intended to be applied over a period of time (e.g. credit scoring), the test data is taken to be the most recent period. For example, Q1, Q2, and Q3 data are used in the training set, but Q4 data is used as the test set for a model intended to project credit risk for the next quarter. In time series, a specialized class of predictive models, only the most recent periods are used as test data. (This is discussed in detail in the Time Series monograph.)
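A minimal sketch of such a time-based split (not from the monograph; the file name credit_history.csv and the column quarter are hypothetical):

import pandas as pd

credit_df = pd.read_csv('credit_history.csv')    # hypothetical quarterly data
train = credit_df[credit_df['quarter'].isin(['Q1', 'Q2', 'Q3'])]
test = credit_df[credit_df['quarter'] == 'Q4']   # most recent period held out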
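2.5 Cross-Validation

Instead of relying on a single train/test split, K-fold cross-validation splits the data into K folds; each fold serves once as a validation set while the remaining K−1 folds are used to fit the model, and the K error estimates are averaged. Below is a minimal scikit-learn sketch (an illustrative default, not the monograph's code; X_train and y_train stand for any training predictors and response):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated MSE; scikit-learn negates the score so larger is better
lr = LinearRegression()
scores = cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
print("Cross-validated MSE:", -scores.mean())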
3. Model Selection

3.1 Transformation of the response
In the monograph on Linear Regression, after fitting the linear regression models, a residual analysis was performed to check whether the regression assumptions (the LINE assumptions; see Sec 4.2.1 there) remained valid. Recall that one of the assumptions was normality. Another important assumption was that the variance of the response is constant.

If either or both of these assumptions are violated, which can be detected from the residual plot, a transformation of the response may be necessary. Many transformations of the response are possible, but the most popular choices are given by the Box-Cox family of transformations.
Let λ be a constant, λ ≠ 0. The response Y is transformed to a new variable Z (say), where Z = (Y^λ − 1)/λ. If λ = 0, the response is transformed to log(Y). The transformed response Z is expected to satisfy the regression assumptions. Moreover, it is clear from the functional form of the transformation that Z can be simplified to Y^λ. This simplification works because Y^λ = λZ + 1: the additive constant is absorbed into the intercept term, and the regression slope parameters are scaled by the constant λ. If a predictor is significant in predicting (Y^λ − 1)/λ, it is expected to be significant in predicting Y^λ as well. Therefore, the inference remains unchanged whether Y^λ or (Y^λ − 1)/λ is used.
The value of λ is chosen such that the log-likelihood curve reaches its maximum. This will be explained in further detail through the case study.
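As a quick sketch (not the monograph's code), scipy can transform a positive response and estimate the maximizing λ in one call; here y stands for any positive response vector, generated purely for illustration:

import numpy as np
from scipy import stats

y = np.random.default_rng(2).gamma(2.0, 1.0, 200)  # illustrative positive response
z, lmbda_hat = stats.boxcox(y)                     # transformed values and the MLE of lambda
print("lambda maximizing the log-likelihood:", lmbda_hat)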
Case Study
[Refer to the monograph on Multiple Linear Regression]
A top wine manufacturer wants to invest in new technologies to improve its wine quality. Wine quality depends directly on the amount of alcohol in the wine and on its smoothness, which, in turn, is controlled by various chemicals either added directly during the manufacturing process or generated through chemical reactions. Wine certification and quality assessment are key elements for wine gradation and its price tag, and certification is determined by various physicochemical components of the wine. Therefore, the company wants to estimate the percentage (%) of alcohol in a bottle of wine as a function of its various chemical components.
We considered the problem of building a multiple linear regression model in the monograph on Linear Regression. In this monograph, the problem of selecting the best possible predictive model for this dataset is considered.

The dataset is split into a training and a test data set according to a random 80:20 allocation. The training data contains 1279 observations, since 80% of 1599 is approximately 1279. A simple random sample (without replacement) of size 1279 is taken from the first 1599 positive integers, and the training dataset is formed by selecting the rows of WineData corresponding to these random numbers. The remaining (1599 − 1279 =) 320 rows constitute the test dataset.
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols
import statsmodels.regression.linear_model as sm
import matplotlib.pyplot as plt

# Read the data and separate the response (alcohol) from the predictors
df = pd.read_csv('WineData.csv')
X = df.drop(['alcohol', 'ID'], axis=1)
y = df['alcohol']

# Random 80:20 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train_WineData = pd.concat([X_train, y_train], axis=1)
test_WineData = pd.concat([X_test, y_test], axis=1)

# One-hot encode the Brand column, dropping the first level as the baseline
train_WineData = pd.concat([train_WineData, pd.get_dummies(train_WineData['Brand'], drop_first=True)], axis=1)
test_WineData = pd.concat([test_WineData, pd.get_dummies(test_WineData['Brand'], drop_first=True)], axis=1)

# Rename 'Sula Vineyards' to remove the space, so the name can be used in model formulas
train_WineData['SulaVineyards'] = train_WineData['Sula Vineyards']
test_WineData['SulaVineyards'] = test_WineData['Sula Vineyards']
train_WineData.drop(['Brand', 'Sula Vineyards'], axis=1, inplace=True)
test_WineData.drop(['Brand', 'Sula Vineyards'], axis=1, inplace=True)
Let us first determine the value of λ, to see whether any transformation is necessary. Note that henceforth all model-building activities are carried out on the training data set.
# Fit the full model on the untransformed response
MLR_wine = ols(formula="alcohol~FA+VA+CA+RS+chloride+FSD+TSD+density+sulphate+pH+Seagram+SulaVineyards", data=train_WineData).fit()

# Evaluate the Box-Cox log-likelihood over a grid of lambda values
lmbdas = np.linspace(-3, 2)
llf = np.zeros(lmbdas.shape, dtype=float)
for ii, lmbda in enumerate(lmbdas):
    llf[ii] = stats.boxcox_llf(lmbda, MLR_wine.fittedvalues)

# The optimal lambda is the grid value maximizing the log-likelihood
lmbda_optimal = lmbdas[np.argmax(llf)]

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(lmbdas, llf, 'b.-')
ax.axhline(stats.boxcox_llf(lmbda_optimal, MLR_wine.fittedvalues), color='r')
ax.set_xlabel('lmbda parameter')
ax.set_ylabel('log-likelihood')
The value(s) of λ chosen is (are) the one(s) for which the curve plotted above reaches its maximum. For this data the maximum is attained somewhere between −2 and −1. For clarity of interpretation, fractional values of λ are ignored, so the integer values λ = −1 and λ = −2 are examined.
# lambda = -1: model the transformed response 1/alcohol
# (the definition of new_resp below is assumed; the source uses it without showing it)
train_WineData['new_resp'] = 1 / train_WineData['alcohol']
MLR_wine1 = ols(formula="new_resp~FA+VA+CA+RS+chloride+FSD+TSD+density+sulphate+pH+Seagram+SulaVineyards", data=train_WineData).fit()
regression_plots(MLR_wine1, train_WineData)  # residual-diagnostics helper, defined elsewhere
[Fig. 4: Residual plot for 1/alcohol vs. the other predictor variables]
The residual plots indicate that the regression assumptions are not violated. Next, the λ = −2 transformation is considered.
# lambda = -2: model the transformed response 1/alcohol^2
# (the definition of new_resp below is assumed; the source uses it without showing it)
train_WineData['new_resp'] = 1 / train_WineData['alcohol']**2
MLR_wine2 = ols(formula="new_resp~FA+VA+CA+RS+chloride+FSD+TSD+density+sulphate+pH+Seagram+SulaVineyards", data=train_WineData).fit()
regression_plots(MLR_wine2, train_WineData)  # residual-diagnostics helper, defined elsewhere
[Fig. 5: Residual plot for 1/alcohol² vs. the other predictor variables]
Note that the predictors FSD and I(Brand = Sula Vineyards) are not significant in predicting 1/alcohol. The transformed response 1/alcohol (λ = −1) is used in what follows.
3.2 Information criteria: AIC and BIC
Once several candidate models are found, it is necessary to be able to compare them. There are several options for comparison. Two of these, namely R² and adjusted R², have already been introduced (see the Regression monograph), and their merits and demerits have been discussed.

In this section two more criteria are introduced, both of which are based on the information lost in fitting a model. A model is a simplification of the process from which the observed data is generated. The closer the model is to the real process, the more information it contains. Nevertheless, no model will ever be able to emulate the real process exactly, and hence some amount of information will always be lost. The errors or residuals of the model fit provide a quantification of the information lost.
Both the information criteria, AIC and BIC, are popular for comparison of models. Both criteria
are based on the error sum of squares and both penalize models on the number of predictors
included. In a way, they try to strike a balance between bias and variance.
Consider the multiple linear regression model with p parameters (including the intercept term) fitted on n data points, and let the residual sum of squares be denoted by SSE.
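For the Gaussian linear model, up to additive constants that do not affect model comparison, the two criteria take the standard form

AIC = n log(SSE/n) + 2p
BIC = n log(SSE/n) + p log(n)

Both reward a smaller SSE and penalize a larger number of parameters p; BIC penalizes model size more heavily than AIC whenever log(n) > 2. In either case, smaller values indicate a preferable model.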
3.3 Forward Selection Algorithm
The forward selection algorithm for regression model building is an automated algorithm that selects one predictor at a time, conditional on which predictors are already included in the model. The resulting model is not expected to include redundant predictors: forward selection suggests a minimal set of predictors following certain optimality criteria. It is one of the most popular selection algorithms because of its easy interpretability.
Assume there are p available predictors. The steps of the forward selection algorithm are as follows:
1. Start with a null (intercept-only) model.
2. Perform p simple linear regressions with one predictor at a time. For each model, compute the AIC value. Include the predictor whose model has the smallest AIC value, provided it is smaller than that of the null model. If no such predictor is found, the null model is considered final.
3. Assume that a certain predictor, say X1, did enter the model. At the end of this step the model contains an intercept term and X1.
4. Perform p − 1 multiple linear regressions with X1 and each of the remaining predictors X2, …, Xp. Check which predictor among X2, …, Xp gives the smallest AIC in the presence of X1, smaller than the AIC obtained with X1 alone; that predictor is included in the model. If no predictor among X2, …, Xp satisfies the entry criterion, the process stops and the final model includes X1 as the only predictor of Y.
5. This process continues until either all predictors are included, or inclusion of no remaining predictor gives a smaller AIC value than the model determined by the current subset.
Forward selection is performed on the training dataset with response new_resp.
# Forward selection on the training data with response new_resp = 1/alcohol
new_resp = 1 / train_WineData['alcohol']
train_WineData['new_resp'] = new_resp
train_X_WineData = train_WineData.drop(['alcohol', 'new_resp'], axis=1)

def forwardSelection_aic(X, y):
    # Place an intercept column first so the null model can be fitted
    X["intercept"] = 1
    cols = X.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    X = X[cols]
    selected_cols = ["intercept"]
    remaining_cols = cols.copy()
    remaining_cols.remove("intercept")
    print("Remaining columns :", remaining_cols)
    model = sm.OLS(y, X[selected_cols]).fit()
    criteria = model.aic
    print("Null model AIC criteria :", criteria)
    # The loop below completes the function, which is truncated in the source,
    # following the algorithm described above
    while remaining_cols:
        aic_dict = {}
        for col in remaining_cols:
            model = sm.OLS(y, X[selected_cols + [col]]).fit()
            if model.aic < criteria:
                aic_dict[col] = model.aic
        if len(aic_dict) == 0:
            break                            # no entry improves the AIC: stop
        entered_col = min(aic_dict, key=aic_dict.get)
        criteria = aic_dict[entered_col]
        selected_cols.append(entered_col)
        remaining_cols.remove(entered_col)
        print()
        print("Entered column :", entered_col)
        print("Selected columns :", selected_cols)
        print("Remaining columns :", remaining_cols)
        print("New AIC criteria :", criteria)
    print("Final selected columns :", selected_cols)
    print("Final removed columns :", remaining_cols)
    return selected_cols
frwd_slctn_cols = forwardSelection_aic(train_X_WineData, train_WineData['new_resp'])
Remaining columns : ['FA', 'VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD', 'density', 'pH', 'sulphate', 'Seagram', 'SulaVineyards']
Null model AIC criteria : -8369.496080838044

Entered column : density
Selected columns : ['intercept', 'density']
Remaining columns : ['FA', 'VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD', 'pH', 'sulphate', 'Seagram', 'SulaVineyards']
New AIC criteria : -8686.871678269768

Entered column : FA
Selected columns : ['intercept', 'density', 'FA']
Remaining columns : ['VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD', 'pH', 'sulphate', 'Seagram', 'SulaVineyards']
New AIC criteria : -8896.704471654608

Entered column : pH
Selected columns : ['intercept', 'density', 'FA', 'pH']
Remaining columns : ['VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD', 'sulphate', 'Seagram', 'SulaVineyards']
New AIC criteria : -9202.05736648084

Entered column : RS
Selected columns : ['intercept', 'density', 'FA', 'pH', 'RS']
Remaining columns : ['VA', 'CA', 'chloride', 'FSD', 'TSD', 'sulphate', 'Seagram', 'SulaVineyards']
New AIC criteria : -9509.221077244645

...

Entered column : CA
Selected columns : ['intercept', 'density', 'FA', 'pH', 'RS', 'sulphate', 'Seagram', 'FSD', 'CA']
Remaining columns : ['VA', 'chloride', 'TSD', 'SulaVineyards']
New AIC criteria : -9696.160147228982

...

Entered column : VA
Selected columns : ['intercept', 'density', 'FA', 'pH', 'RS', 'sulphate', 'Seagram', 'FSD', 'CA', 'chloride', 'VA']
Remaining columns : ['TSD', 'SulaVineyards']
New AIC criteria : -9711.47318715933

...

Final selected columns : ['intercept', 'density', 'FA', 'pH', 'RS', 'sulphate', 'Seagram', 'FSD', 'CA', 'chloride', 'VA', 'TSD']
Final removed columns : ['SulaVineyards']
Start with the null model (null_mod) that has only the intercept term. If at any stage the inclusion of a new predictor does not improve the AIC value, the process stops.

The null model has AIC −8369.496. Starting from the null model, the model containing density has the smallest AIC, −8686.871, which improves on the previous AIC value of −8369.496. Therefore, density enters the model.

Similarly, once density is present in the model, the best improvement in AIC is obtained by including FA, so at the next step the model includes both density and FA. The process continues in this way until no further improvement is possible.

For this data set, all predictors except SulaVineyards enter the model through the forward selection procedure. The regression equation is therefore updated with the feature SulaVineyards removed:
MLR_wine1 = ols(formula="new_resp~FA+VA+CA+RS+chloride+FSD+TSD+density+sulphate+pH+Seagram", data=train_WineData).fit()
3.4 Backward Elimination
Forward selection includes predictors one by one into the model. Backward elimination is an algorithm that goes the opposite way:
1. Start with the full model containing all p predictors (and the intercept) and compute its AIC.
2. Remove each predictor in turn and compute the AIC of the resulting model. If no removal yields an AIC smaller than that of the current model, stop.
3. Otherwise eliminate the predictor whose removal gives the smallest AIC, and repeat step 2 with the reduced model.
It was seen that the variable satisfying this condition was SulaVineyards. The AIC of the model obtained by removing SulaVineyards is −9718.548, which is smaller than the AIC of the full model (−9716.88); therefore SulaVineyards was eliminated at the first stage. Similarly, in the second stage, the AIC of the model obtained by removing FSD is −9719.924, which is smaller than the AIC of the previous model (−9718.548). After removing FSD, no other predictor satisfied the exit criterion, and the process stopped, keeping all the predictors except SulaVineyards and FSD in the model.
def backwardElimination_aic(X, y):
    # The function header and setup below are reconstructed (they are missing
    # in the source), mirroring the forward-selection function
    X["intercept"] = 1
    cols = X.columns.tolist()
    selected_cols = cols
    remaining_cols = cols.copy()
    remaining_cols.remove("intercept")
    model = sm.OLS(y, X[selected_cols]).fit()
    criteria = model.aic
    print("Initial AIC criteria :", criteria)
    while True:
        rmvd_col_dict = {}
        # Try removing each predictor in turn; record removals that lower the AIC
        for i in range(len(remaining_cols)):
            temp_rmvd_col = remaining_cols[i]
            selected_cols.remove(temp_rmvd_col)
            model = sm.OLS(y, X[selected_cols]).fit()
            new_criteria = model.aic
            selected_cols.append(temp_rmvd_col)
            if new_criteria < criteria:
                rmvd_col_dict[temp_rmvd_col] = new_criteria
        if len(rmvd_col_dict) == 0:
            break                          # no removal improves the AIC: stop
        print()
        print("Columns eligible for elimination:", rmvd_col_dict)
        # Eliminate the predictor whose removal gives the smallest AIC
        eliminated_col = sorted(rmvd_col_dict.items(), key=lambda x: x[1])[0][0]
        print("Removed col with least AIC:", eliminated_col)
        selected_cols.remove(eliminated_col)
        remaining_cols.remove(eliminated_col)
        model = sm.OLS(y, X[selected_cols]).fit()
        criteria = model.aic
        print("Updated AIC criteria : ", criteria)
        print(selected_cols)
    print("\n\n")
    print("Final selected columns :", selected_cols)
    return selected_cols

bckwrd_elmtn_cols = backwardElimination_aic(train_X_WineData, train_WineData['new_resp'])
Final selected columns : ['intercept', 'FA', 'VA', 'CA', 'RS', 'chloride', 'TSD', 'density', 'pH', 'sulphate', 'Seagram']

Backward elimination suggests that FA, VA, CA, RS, chloride, TSD, density, sulphate, pH, and Seagram are enough to be used as predictors; in their presence the variables FSD and SulaVineyards are redundant.
# Fit the model chosen by backward elimination (used later as the final model)
MLR_wine_be = ols(formula="new_resp~FA+VA+CA+RS+chloride+TSD+density+sulphate+pH+Seagram", data=train_WineData).fit()
print(MLR_wine_be.summary())
3.5 Stepwise Regression

Stepwise regression is a combination of the Forward Selection and Backward Elimination procedures.
1. Start with the null model.
2. At the next step, use Forward Selection to select the predictor showing the smallest AIC value, smaller than that of the null model. If this criterion is not satisfied, the null model is selected as the final model.
3. Suppose a predictor X1 entered the model. For each of the remaining predictors, include it and compute the AIC; also compute the AIC of the model with X1 removed. Rank these candidate moves in increasing order of AIC. If some entry or exit improves the AIC of the existing model, perform it.
4. Continue until all predictors are included or no entry or exit further improves the AIC.
It was seen that the outcome of Stepwise Regression was identical to that of Backward Elimination, i.e. only SulaVineyards and FSD were removed from the set of predictors to build the final model.
def combine_fs_be_stepaic(X, y):
    # Parts of this function are truncated in the source; the missing pieces
    # are reconstructed below to follow the stepwise algorithm described above
    X["intercept"] = 1
    cols = X.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    X = X[cols]
    selected_cols = ["intercept"]
    remaining_cols = cols.copy()
    remaining_cols.remove("intercept")
    print("Remaining columns :", remaining_cols)

    # Null (intercept-only) model gives the initial AIC benchmark
    model = sm.OLS(y, X[selected_cols]).fit()
    criteria = model.aic
    print("Null model AIC criteria :", criteria)

    # First forward step: enter the predictor with the smallest AIC, provided
    # it improves on the null model
    temp_dict = {}
    for i in range(len(remaining_cols)):
        new_col = remaining_cols[i]
        selected_cols.append(new_col)
        model = sm.OLS(y, X[selected_cols]).fit()
        new_criteria = model.aic
        if new_criteria < criteria:
            temp_dict[new_col] = new_criteria
        selected_cols.remove(new_col)
    if len(temp_dict) == 0:
        print("Final selected columns :", selected_cols)
        return selected_cols                 # null model is final
    entered_col = sorted(temp_dict.items(), key=lambda x: x[1])[0][0]
    entered_aic = sorted(temp_dict.items(), key=lambda x: x[1])[0][1]
    print()
    print("Entered column :", entered_col)
    selected_cols.append(entered_col)
    remaining_cols.remove(entered_col)
    criteria = entered_aic
    print("Selected columns :", selected_cols)

    while True:
        aic_dict = {}
        # Backward pass: try removing each currently selected predictor
        flag = False
        be_remaining_cols = selected_cols.copy()
        be_selected_cols = selected_cols.copy()
        be_remaining_cols.remove('intercept')
        for i in range(len(be_remaining_cols)):
            temp_rmvd_col = be_remaining_cols[i]
            be_selected_cols.remove(temp_rmvd_col)
            model = sm.OLS(y, X[be_selected_cols]).fit()
            new_criteria = model.aic
            be_selected_cols.append(temp_rmvd_col)
            if new_criteria < criteria:
                aic_dict[temp_rmvd_col] = new_criteria
                flag = True
        print()
        col_elig_fr_elim = None
        if flag:
            col_elig_fr_elim = sorted(aic_dict.items(), key=lambda x: x[1])[0][0]
            print("Column that is eligible for elimination:", col_elig_fr_elim)
        else:
            print('No columns eligible for BE in this stage')
        # Forward pass: try entering each remaining predictor
        for i in range(len(remaining_cols)):
            new_col = remaining_cols[i]
            selected_cols.append(new_col)
            model = sm.OLS(y, X[selected_cols]).fit()
            new_criteria = model.aic
            selected_cols.remove(new_col)
            if new_criteria < criteria:
                aic_dict[new_col] = new_criteria
        if len(aic_dict) == 0:
            break                            # no entry or exit improves the AIC
        # Make the single best move (entry or exit) and update the benchmark
        best_col = sorted(aic_dict.items(), key=lambda x: x[1])[0][0]
        criteria = aic_dict[best_col]
        if best_col == col_elig_fr_elim:
            print("Removed column :", best_col)
            selected_cols.remove(best_col)
            remaining_cols.append(best_col)
        else:
            print("Entered column :", best_col)
            selected_cols.append(best_col)
            remaining_cols.remove(best_col)
        print("Selected columns :", selected_cols)
    print("Final selected columns :", selected_cols)
    return selected_cols
combine_fs_be_stepaic(train_X_WineData,train_WineData['new_resp'])
Remaining columns : ['FA', 'VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD', 'density', 'pH', 'sulphate', 'Seagram', 'SulaVineyards']
Null model AIC criteria : -8369.496080838044
Final selected columns : ['intercept', 'density', 'FA', 'pH', 'RS', 'sulphate', 'Seagram', 'CA', 'chloride', 'VA', 'TSD']
No further discussion of Stepwise Regression is taken up, since its output is identical to that of Backward Elimination. All computations performed with the model obtained from Backward Elimination are therefore identical to those for the model obtained from Stepwise Regression.

3.6 All Possible Regression or Regression Subset Selection and Mallows' Cp Criterion

The statistic known as Mallows' Cp criterion is useful for measuring bias in a multiple linear regression setting.
Suppose the full model contains k predictors. If a set of p − 1 predictors is chosen out of these k, so that there are p parameters in total (including the intercept), then Mallows' Cp is computed as

Cp = SSEp / MSEall − (n − 2p)

where n is the number of observations in the data set and SSEp is the residual sum of squares obtained when the response is modelled with the p − 1 chosen predictors.

If the full model is fitted, using all k predictors, there are k + 1 parameters. MSEall, the mean squared error of the full model, is then computed by dividing its residual sum of squares (SSE) by n − (k + 1).

If the model works well, the numerical value of Cp is expected to be close to p; for a potentially good model, Cp ≤ p. The full model always has Mallows' Cp equal to k + 1. The aim is to select the smallest p, and the corresponding subset of predictors, for which Cp is smaller than but closest to p.
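A minimal sketch of this computation using statsmodels (not the monograph's code; it assumes X holds the intercept plus all predictor columns and y is the response, as in the selection functions above):

# SSE and residual df are available from a statsmodels OLS fit as .ssr and .df_resid
full_model = sm.OLS(y, X).fit()                 # model with all k predictors
mse_all = full_model.ssr / full_model.df_resid  # MSE_all = SSE / (n - (k+1))

def mallows_cp(cols):
    """Mallows' Cp for the subset model using the columns in cols (incl. intercept)."""
    n = len(y)
    p = len(cols)                               # number of parameters
    sse_p = sm.OLS(y, X[cols]).fit().ssr
    return sse_p / mse_all - (n - 2 * p)

print(mallows_cp(['intercept', 'density', 'FA', 'pH']))  # an example subset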
This procedure is also known as All Possible Regressions or Regression Subset Selection. However, for even a moderately large k it is not feasible to consider all 2^k subsets of predictors; typically, only a few models at various values of k are compared.

One advantage of this procedure is that the candidate models are not conditional on which predictors are already included. In the algorithms discussed previously, i.e. FS, BE, or stepwise regression, the next predictor to be included or eliminated depends on which other predictors are already in the model. In the All Possible Regressions method no such dependency exists.
import os
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

X = train_X_WineData
y = train_WineData['new_resp']

# Perform an exhaustive search. EFS uses the 'neg_mean_squared_error' scorer
# (scikit-learn's convention: the MSE with a negative sign, so larger is better;
# the plain 'mean_squared_error' scorer has been deprecated).
lr = LinearRegression()
efs1 = EFS(lr,
           min_features=1,
           max_features=13,
           scoring='neg_mean_squared_error',
           print_progress=True,
           cv=5)

# Fit the exhaustive feature selector
efs1 = efs1.fit(X.values, y.values)
print('Best negative mean squared error: %.2f' % efs1.best_score_)
# Print the indices of the best feature subset
print('Best subset:', efs1.best_idx_)
Features: 8191/8191
MLR_wine1 = ols(formula="new_resp~FA+VA+CA+RS+chloride+TSD+density+sulphate+pH+Seagram", data=train_WineData).fit()
print(MLR_wine1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               new_resp   R-squared:                       0.658
Model:                            OLS   Adj. R-squared:                  0.655
Method:                 Least Squares   F-statistic:                     243.4
Date:                Sun, 10 Jan 2021   Prob (F-statistic):          1.21e-286
Time:                        15:53:56   Log-Likelihood:                 4871.0
No. Observations:                1279   AIC:                            -9720.
Df Residuals:                    1268   BIC:                            -9663.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -4.6515      0.133    -34.877      0.000      -4.913      -4.390
FA            -0.0042      0.000    -20.494      0.000      -0.005      -0.004
VA            -0.0041      0.001     -3.661      0.000      -0.006      -0.002
CA            -0.0078      0.001     -5.806      0.000      -0.010      -0.005
RS            -0.0022      0.000    -18.700      0.000      -0.002      -0.002
chloride       0.0165      0.004      4.418      0.000       0.009       0.024
TSD         2.439e-05    5.1e-06      4.779      0.000    1.44e-05    3.44e-05
density        4.9141      0.137     35.913      0.000       4.646       5.182
sulphate      -0.0097      0.001     -9.532      0.000      -0.012      -0.008
pH            -0.0309      0.002    -20.365      0.000      -0.034      -0.028
Seagram        0.0022      0.000      6.225      0.000       0.001       0.003
==============================================================================
Omnibus:                       30.227   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               50.976
Skew:                          -0.184   Prob(JB):                     8.52e-12
Kurtosis:                       3.906   Cond. No.                     7.30e+04
==============================================================================
Note again that the all-possible-regressions mechanism gives a third route to a model for predicting alcohol (%).

Since Backward Elimination, the combined forward-backward (stepwise) procedure, and the Mallows' Cp subset selection all arrive at the same set of features, the RMSE calculated for all three is the same, as shown above. The two RMSE values are again comparable, but this time the model obtained from Backward Elimination (equivalently, stepwise regression and subset selection) beats the forward-selection model.

Hence, we choose the model from Backward Elimination as our final model.
3.7 Choosing the Final Model
Note that the RMSE for this model is 0.00531 on the test data, while the RMSE for the same model on the training data is approximately 0.00537.
from sklearn.metrics import mean_squared_error

# RMSE of the final (backward-elimination) model on the training data
y_train_pred = MLR_wine_be.predict(X)
print("Train RMSE :", np.sqrt(mean_squared_error(y, y_train_pred)))

Train RMSE : 0.005367713537929523
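The matching test-set computation is a short sketch (not shown in the source; it assumes test_WineData was prepared exactly like the training data, with the response transformed the same way):

# Transform the test response the same way as the training response
test_WineData['new_resp'] = 1 / test_WineData['alcohol']
y_test_pred = MLR_wine_be.predict(test_WineData)
print("Test RMSE :", np.sqrt(mean_squared_error(test_WineData['new_resp'], y_test_pred)))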
The closeness of these two values suggests that the final model does not overfit the training data and is hence a good fit.
[Fig. 6: Flowchart for Regression Model Selection: check whether the LINE assumptions are satisfied; if not, transform Y and re-fit; if yes, proceed with model selection.]