This document summarizes key topics from a data science textbook, including data visualization, dimension reduction, classification performance evaluation, linear regression, and logistic regression. It defines common terminology (descriptive, predictive, and prescriptive analytics; supervised vs. unsupervised learning; the data mining process), discusses dimension reduction techniques such as pivot tables and correlation analysis, covers evaluation measures including accuracy, confusion matrices, and predictive error, and outlines the steps in linear and logistic regression modeling.


DCSN 216 Summary


Summary of slides, book (chapters 3, 4, and 10), class notes, and Moodle guidelines.

Table of Contents
Chapter 1 – Introduction
    Types of Analytics
    Data
    Four Types of Data Based on Measurement Scale
    Terminology and Notation
Chapter 2 – Overview of the Data Mining Process
    Supervised Learning Vs. Unsupervised Learning
    Steps in Data Mining
Chapter 3 – Data Visualization
    Distribution Plots: Boxplots and Histograms
Chapter 4 – Dimension Reduction
    Pivot Tables
    Correlation Analysis
    Reducing Number of Categories in Categorical Variables
Chapter 5 – Evaluating Classification and Predictive Performance
    Accuracy Measures for Classification
    Confusion (Classification) Matrix
    Cutoff for Classification
    Rare Cases
    Measuring Predictive Error
Chapter 6 – Multiple Linear Regression
    Explanatory Vs. Predictive Modeling with Regression
    Steps in Prediction
Chapter 10 – Logistic Regression
    Computing the Logistic Response Function and Logit
    Estimating the Logistic Model from Data
    Logistic Regression in SPSS
    The ROC Curve
Appendix
    Questions and Answers

Chapter 1 – Introduction

Types of Analytics
 Descriptive analytics
- Uses data to understand past and present
- Descriptive model: Simply tells “what is” and describes relationships
 Predictive analytics
- Analyzes past performance
- Predictive model: incorporates uncertainty to help managers analyze risk
 Prescriptive analytics
- Uses optimization techniques
- Prescriptive model: helps decision makers identify the best solution

Data:
 Metrics are used to quantify performance.
 Measures are numerical values of metrics.
 Two types of metrics:
- Discrete metrics involve counting
- Continuous metrics involve measuring

Four Types of Data Based on Measurement Scale

 Categorical (nominal) data: Data placed in categories according to a specified characteristic

 Ordinal data: Data that is ranked or ordered according to some relationship with one another

 Interval data: Ordinal data but with constant differences between observations + no true zero point

 Ratio data: Continuous values and have a natural zero point

Terminology and Notation

Algorithm refers to a specific procedure used to implement a particular data mining technique (classification tree, discriminant analysis, etc.).

Attribute - see Predictor.

Case - see Observation.

Confidence has a specific meaning in association rules of the type “If A and B are purchased, C is also purchased.” Confidence is the conditional probability that C will be purchased, IF A and B are purchased.

Confidence also has a broader meaning in statistics (“confidence interval”), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.

Dependent variable - see Response.

Estimation - see Prediction.

Feature - see Predictor.


Holdout sample is a sample of data not used in fitting a model, used to assess the
performance of that model; this book uses the terms validation set or, if one is used in the
problem, test set instead of holdout sample.

Input variable - see Predictor.

Model refers to an algorithm as applied to a dataset, complete with its settings (many of
the algorithms have parameters which the user can adjust).

Observation is the unit of analysis on which the measurements are taken (a customer, a transaction, etc.); also called case, record, pattern, or row. (Each row typically represents a record, each column a variable.)

Outcome variable - see Response.

Output variable - see Response.

P(A|B) is the conditional probability of event A occurring given that event B has
occurred. Read as “the probability that A will occur, given that B has occurred.”

Pattern – A set of measurements on an observation (e.g., the height, weight, and age of a person); also called a profile.

Prediction – The prediction of the numerical value of a continuous output variable; also called estimation.

Predictor, usually denoted by X, is also called a feature, input variable, independent variable, or, from a database perspective, a field.

Record - see Observation.

Response, usually denoted by Y, is the variable being predicted in supervised learning; also called dependent variable, output variable, target variable, or outcome variable.

Score refers to a predicted value or class. “Scoring new data” means to use a model
developed with training data to predict output values in new data.

Success class is the class of interest in a binary outcome (e.g., “purchasers” in the
outcome “purchase/no-purchase”)

Supervised learning refers to the process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known; the algorithm “learns” how to predict this value for new records where the output is unknown.

Target – See Response

Test data (or test set) refers to that portion of the data used only at the end of the model
building and selection process to assess how well the final model might perform on
additional data.

Training data (or training set) refers to that portion of data used to fit a model.

Unsupervised learning refers to analysis in which one attempts to learn something about
the data other than predicting an output value of interest (whether it falls into clusters, for
example).

Validation data (or validation set) refers to that portion of the data used to assess how
well the model fits, to adjust some models, and to select the best model from among those
that have been tried.

Variable is any measurement on the records, including both the input (X) variables and
the output (Y) variable.

Chapter 2 – Overview of the Data Mining Process


Data mining methods are usually applied to a sample from a large database, and then the best
model is used to score the entire database.

Supervised Learning Vs. Unsupervised Learning

Supervised: Predict a certain “outcome” or “target” variable


Methods: Classification and Prediction

 Classification:
 Predict categorical target (outcome) variable
 Examples: fraud/no fraud, purchase/no purchase…
 Usually binary (1=yes, 0=no)
 Prediction:
 Predict numerical target (outcome) variable
 Examples: sales, revenue, performance

Taken together, classification and prediction constitute “predictive analytics”.

Unsupervised: Segment data into meaningful segments; detect patterns (no specific
target to predict or classify)
Methods: Association rules, data reduction & exploration, visualization

 Association:
 Produce rules that define “what goes with what”
 Example: “If X was purchased, Y was also purchased”
 Also called “Affinity” analysis
 Data Reduction:
 Reducing the number of columns and/or rows
 Visualization:
 Building graphs and plots of data which is useful to
examine relationships between pairs of variables
 Examples: Histograms, boxplots, bar charts, scatterplots

Data exploration uses data reduction and visualization techniques to help understand and better analyze large, complex datasets.

Steps in Data Mining

1. Define/understand purpose

2. Obtain data (may involve random sampling)


 Rare Event Oversampling: Sampling may yield too few “interesting”
cases to effectively train a model in case of a rare event.
 A popular solution: oversample the rare cases to obtain a
more balanced training set (chap 5)

3. Explore, clean, pre-process data


 Separate between categorical (ordered/unordered) and numeric
(continuous/integer)
- Numeric: May occasionally need to “bin” into categories
- Categorical: In most other algorithms, must create binary
dummies (if more than 2 categories, number of dummies = number of
categories – 1)

 Detect outliers: Once detected, domain knowledge is required to determine


if it is an error, or truly extreme. In some contexts, finding outliers is the
purpose of the DM exercise (anomaly detection).

 Handling Missing Data: Default is to drop those records.


- Solution 1: Omission
 You can drop a record with a missing value and you
can drop the entire variable (column) if many
records have it.
- Solution 2: Imputation
 You can replace missing values with reasonable
substitutes

 Normalizing (Standardizing) Data: Used in some techniques to put all


variables on same scale
- Normalizing function: Subtract mean and divide by
standard deviation
- Alternative function: scale to 0-1 by subtracting minimum
and dividing by the range (Useful when the data contain dummies
and numeric)
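As a sketch, both rescaling functions can be computed with Python's standard library (the sample values here are made up):

```python
from statistics import mean, stdev

data = [12, 15, 20, 22, 31]  # hypothetical numeric variable

# Z-score normalization: subtract the mean, divide by the standard deviation
m, s = mean(data), stdev(data)
z_scores = [(x - m) / s for x in data]

# 0-1 scaling: subtract the minimum, divide by the range
lo, hi = min(data), max(data)
scaled = [(x - lo) / (hi - lo) for x in data]
```

After 0-1 scaling, every value lands in [0, 1], which keeps numeric variables comparable with dummy variables.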

 Partitioning Data: Separating the Data into 2-3 parts:


Training Data – Validation Data – Testing Data
(Optional).
- This is to avoid the problem of overfitting, which is when a model fits the training data too perfectly and fails to perform accurately on new data; highly complex models do not do well when applied to new data.
- Assessing multiple models on the same validation data can overfit the validation data, which is why some people choose to test the model on testing data before deploying it.

- RMS Error:
 Error = actual - predicted
 RMS = Root-mean-squared error = Square root of
average squared error

4. Reduce the data; if supervised DM, partition it

5. Specify task (classification, clustering, etc.)

6. Choose the techniques (regression, CART, neural networks, etc.)

7. Iterative implementation and “tuning”

8. Assess results – compare models

9. Deploy best model

(Steps 6–9 are covered in other chapters.)



Chapter 3 – Data Visualization


The three most effective basic plots are:
- Bar charts (used to understand how the outcome relates to categorical predictors, for both classification and prediction)
- Line graphs (primarily used for time series)
- Scatter plots (useful for comparing pairs of variables; both variables should be numerical)

Distribution Plots: Boxplots and Histograms

Boxplots:
(SPSS: Graphs → Chart Builder → Boxplot)
- We can also generate side-by-side boxplots in Excel or SPSS (easier).
- The box encloses 50% of the data.
- The horizontal line inside the box represents the median (50th percentile).
- The top and bottom of the box represent the 75th and 25th percentiles of the data, respectively.
- Lines extending above and below the box represent the data range.
- Outliers are represented by dots or circles above or below the data range.
- Interquartile range = IQR = Q3 – Q1
  Lower Limit = Q1 – 1.5 × IQR
  Upper Limit = Q3 + 1.5 × IQR
- Comparing the average and the median helps in assessing how skewed the data is.
- Sometimes, you should consider rescaling the chart for a better demonstration.
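The IQR and outlier limits can be sketched in Python; note that `statistics.quantiles` uses one of several quartile conventions, so Excel or SPSS may report slightly different quartiles (the data here is made up):

```python
from statistics import quantiles, mean, median

data = [12, 15, 20, 22, 31, 200]  # hypothetical variable with one extreme value

# quantiles(..., n=4) returns [Q1, Q2 (median), Q3]
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are the dots/circles on the boxplot
outliers = [x for x in data if x < lower_limit or x > upper_limit]

# A large gap between mean and median suggests skew (positive here)
skew_hint = mean(data) - median(data)
```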

Histograms:
- Represents the frequencies of all x values with a series of
vertical connected bars.

Both are useful for prediction tasks because they’re built on numerical values. Side-by-side boxplots can also support classification tasks by displaying the relationship between a numerical variable (y-axis) and a categorical variable (x-axis).

Chapter 4 – Dimension Reduction


Dimension reduction is the process of reducing the number of variables for the model to operate
efficiently and to avoid the problem of overfitting.

In Excel, there are several functions that assist in summarizing data among which are: average,
stdev, min, max, median, and count.

Min and max functions can be used to identify extreme values that might be errors.
The average and the median give a sense of the central values of that variable, and a large deviation between the two also indicates skew.
The standard deviation gives information about how dispersed the data is (relative to the mean).
You can also use Excel’s Descriptive Statistics facility in the Data > Data Analysis menu.
You can also obtain a complete matrix of correlations between each pair of variables in the data using Excel’s Correlation facility in Data > Data Analysis.

Pivot Tables

For categorical variables (binary), we obtain a breakdown of the records by the combination of
categories. Excel’s bin notation (ranges) means that the lower value of the range is included in
the statistic but the upper value is not.

In classification tasks where the goal is to find significant predictors, a good step is to prepare
pivot tables for all classes.

If you’d like to compare the effect of 2 or more variables on the target variable, you need to
change them to ratios if they don’t have the same unit. You can do that by dividing the difference
of the averages of 1s and 0s by their total (sum). The higher ratio indicates the variable with the
higher significance with respect to our target variable.

3 Cases for Pivot Tables:

Case 1: Target variable: Numerical; Predictor: Categorical


 You put the target variable in “VALUES” and the predictor in “ROWS”
Then you change the sum to average and you compare the averages of the target
variable across the predictor categories to see if there is a significant discrepancy.

Case 2: Target variable: Categorical (Binary); Predictor: Numerical


 You put the Predictor in “VALUES” and the Target in “ROWS”
Then you change the sum to average and you compare the averages of the
Predictor across the target variable categories to see if there is a significant
discrepancy.

Case 3: Target variable: Categorical (Binary); Predictor: Categorical


 You put the target variable in “VALUES” and the predictor in “ROWS”
Then you change the sum to average (The average here will act as a proportion
since the target variable is binary) and you compare the proportions across the
predictor categories.

You can’t create a pivot table for 2 numerical variables → use correlation instead.
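A minimal Python sketch of the pivot-table logic for Cases 1 and 3, using made-up records; the group-and-average step is exactly what the pivot table computes:

```python
# Hypothetical records: a categorical predictor plus two targets
records = [
    {"region": "North", "sales": 120, "purchase": 1},
    {"region": "North", "sales": 100, "purchase": 1},
    {"region": "South", "sales": 60,  "purchase": 0},
    {"region": "South", "sales": 80,  "purchase": 1},
]

def pivot_average(rows, by, value):
    """Average of `value` per category of `by` (what the pivot table shows)."""
    groups = {}
    for r in rows:
        groups.setdefault(r[by], []).append(r[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# Case 1: numerical target, categorical predictor -> compare averages
avg_sales = pivot_average(records, by="region", value="sales")

# Case 3: binary target, categorical predictor -> the average acts as a proportion
prop_purchase = pivot_average(records, by="region", value="purchase")
```

A large discrepancy between the group averages (or proportions) suggests the predictor is significant for the target.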

Correlation Analysis

Removing variables that are strongly correlated to others is useful for avoiding multicollinearity
problems. Multicollinearity is the presence of two or more predictors sharing the same linear
relationship with the outcome variable.

Checking for correlation in SPSS: Analyze → Correlate → Bivariate

Note that you can’t generate correlation analysis for a categorical variable with a numerical one
unless the categorical variable is coded or converted to a dummy variable.

To identify the best numerical prediction based on the correlation analysis, look for the value
with the highest correlation coefficient with respect to the target variable.

The ** notation indicates that the 2 variables are highly correlated at a significance level of 1%.

To compute the VIFs for multicollinearity diagnostics when you run the linear regression, go to “Statistics” and check the box “Collinearity diagnostics”. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction. Prefer predictors with VIF less than 3 and eliminate predictors with the highest VIFs. As you eliminate predictors with VIFs > 3, eliminate them one by one and always check for changes in adjusted R2.
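As a sketch of the VIF idea: with only two predictors, regressing one on the other gives R² = r² (the squared correlation), so VIF = 1/(1 − r²) for both. The data below is made up:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8]  # nearly 2 * x1: strongly collinear

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)  # far above 10 -> serious multicollinearity
```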

Reducing Number of Categories in Categorical Variables

When a categorical variable has many categories, and this variable is destined to be a predictor, it’s preferable to convert it into dummy variables (1 = yes, 0 = no). In particular, a variable with m categories should be converted into m − 1 dummy variables. You can also combine similar categories together by keeping only the relevant categories and grouping the rest as “others”.

In classification tasks (with a categorical output), a pivot table broken down by the output classes
can help identify categories that do not separate the classes. Those categories are also candidates
for inclusion in the “others” category.

Sometimes the categories in a categorical variable represent intervals. In such cases, we can
replace the categorical value with the mid-interval value.
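A minimal sketch of both ideas in Python (hypothetical category names; the m − 1 rule keeps the last category as the baseline):

```python
def to_dummies(values, categories):
    """m categories -> m - 1 dummy variables (last category is the baseline)."""
    return [{c: int(v == c) for c in categories[:-1]} for v in values]

colors = ["red", "blue", "green", "blue"]
dummies = to_dummies(colors, ["red", "blue", "green"])  # green = baseline

def lump_rare(values, keep):
    """Combine all categories outside `keep` into an 'others' category."""
    return [v if v in keep else "others" for v in values]

lumped = lump_rare(["red", "blue", "teal", "mauve"], keep={"red", "blue"})
```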

Chapter 5 – Evaluating Classification and Predictive Performance


Measures such as R2 don’t tell us much about the ability of the model to predict new data.

Accuracy Measures for Classification:


- Error = classifying a record as belonging to one class
when it belongs to another class.
- Error rate = percent of misclassified records out of the
total records in the validation data

Confusion (classification) Matrix

Used to compare the predicted data to the actual data (using validation data) which is a way to
check the accuracy of our model.

Overall error rate (misclassification rate) = (sum of misclassified records)/(total records)

Accuracy = 1 − error rate = (sum of correctly classified records)/(total records)

Sensitivity = (number of ones that are correctly classified)/(total number of ones) = n(1,1)/n(1 actual)

Specificity = (number of zeroes that are correctly classified)/(total number of zeroes) = n(0,0)/n(0 actual)

Note: if a question asks you to discuss the accuracy of a model:
1- Compare actual to predicted
2- Compute accuracy, sensitivity, and specificity

Example Classification Confusion Matrix:

                    Predicted Class
Actual Class          1       0
      1             201      85
      0              25    2689

False positive rate = (number of actual zeroes among the predicted ones)/(total number of predicted ones)

False negative rate = (number of actual ones among the predicted zeroes)/(total number of predicted zeroes)
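Using the counts from the confusion matrix above, the measures can be computed directly (a Python sketch):

```python
# Counts from the example confusion matrix (validation data)
n11, n10 = 201, 85    # actual 1: predicted 1, predicted 0
n01, n00 = 25, 2689   # actual 0: predicted 1, predicted 0

total = n11 + n10 + n01 + n00
error_rate = (n10 + n01) / total
accuracy = 1 - error_rate                # = (n11 + n00) / total
sensitivity = n11 / (n11 + n10)          # correctly classified ones / actual ones
specificity = n00 / (n00 + n01)          # correctly classified zeroes / actual zeroes
false_positive_rate = n01 / (n11 + n01)  # actual zeroes among predicted ones
false_negative_rate = n10 / (n10 + n00)  # actual ones among predicted zeroes
```

Here accuracy is about 96.3%, but sensitivity is only about 70%: with imbalanced classes, accuracy alone can be misleading.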

Sometimes a model is more accurate, sensitive, and specific just because more
variables are used. This causes overfitting and we prefer not working with this
model.

Cutoff for Classification

If a probability that a case belongs to a given class is greater than the Cutoff rate, then the case
belongs to this class.

Default cutoff value is 0.50 (lowest error rate)


If >= 0.50, classify as “1”
If < 0.50, classify as “0”
We can use a cutoff rate bigger or smaller than 0.5 but either way, the misclassification
rate will increase.

Rare Cases

We often oversample rare cases (very low % of 1s) to give model more information to work
with.

Steps for Oversampling:

1. Separate the responders (rare) from non-responders


2. Randomly assign half the responders to the training sample, plus equal
number of non-responders
3. Remaining responders go to validation sample
4. Add non-responders to validation data, to maintain original ratio of responders
to non-responders
5. Randomly take test set (if needed) from validation
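The oversampling steps can be sketched in Python, assuming a made-up dataset with 10 responders and 190 non-responders (a 1:19 ratio):

```python
import random

random.seed(1)

# Made-up dataset: 10 responders (rare 1s) and 190 non-responders
responders = [{"id": i, "y": 1} for i in range(10)]
non_responders = [{"id": i, "y": 0} for i in range(10, 200)]

random.shuffle(responders)
random.shuffle(non_responders)

# Steps 1-2: half the responders, plus an equal number of non-responders -> training
half = len(responders) // 2
training = responders[:half] + non_responders[:half]

# Steps 3-4: remaining responders go to validation, topped up with non-responders
# to restore the original 1:19 ratio of responders to non-responders
valid_resp = responders[half:]
valid_non = non_responders[half : half + 19 * len(valid_resp)]
validation = valid_resp + valid_non
```

The training set is balanced (50% responders), while the validation set keeps the original rare-event ratio so that performance estimates remain realistic.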

Measuring Predictive Error

We want to know how well the model predicts new data, not how well it fits the data it was trained with (the latter is known as goodness of fit). A key component of most measures is the difference between the actual y and the predicted ŷ.

Some Measures of Error:

- MAE or MAD = (1/n) Σ|e_i|
Mean absolute error (deviation)
Gives an idea of the magnitude of errors

- Average error = (1/n) Σ e_i
Gives an idea of systematic over- or under-prediction

- MAPE = (1/n) Σ|e_i / y_i| × 100%
Mean absolute percentage error, where y_i is the actual value
(check appendix question 8)

- RMSE (root-mean-squared error) = √((1/n) Σ e_i²)
Square the errors, find their average, take the square root
Used to compare between models → the one with the lowest RMSE is better.
(check appendix question 6)

- Total SSE = Σ e_i²
Total sum of squared errors
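A Python sketch computing all five measures on made-up actual and predicted values:

```python
from math import sqrt

actual = [10, 12, 15, 20]      # made-up actual y values
predicted = [11, 11, 16, 18]   # made-up predictions (y hat)

errors = [a - p for a, p in zip(actual, predicted)]  # e_i = actual - predicted
n = len(errors)

mae = sum(abs(e) for e in errors) / n                             # MAE / MAD
average_error = sum(errors) / n                                   # systematic bias
mape = 100 * sum(abs(e / y) for e, y in zip(errors, actual)) / n  # percentage error
rmse = sqrt(sum(e ** 2 for e in errors) / n)                      # root-mean-squared
total_sse = sum(e ** 2 for e in errors)                           # sum of squared errors
```

A positive average error here means the model tends to under-predict; MAE and RMSE quantify the typical error magnitude.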

Chapter 6 – Multiple Linear Regression


It’s a predictive modeling tool.

Explanatory Vs. Predictive Modeling with Regression


If categorical and
classification task, use
error rate and
confusion matrix.

If numerical and
predictive task, use the
following measures of
error.

Explanatory Modeling Predictive Modeling


Goal: predict target values in other data
14

“goodness-of-fit”: R2, residual analysis, p- Train model on training data and assess
values performance on validation data
Use the entire dataset for estimating the Data is split into a training set and a
model validation set

Focus is on coefficients (ß) Focus is on predictions (y hat)

Steps in Prediction

1- Identifying numerical variable vs. categorical variables


2- Converting categorical variables into dummy variables
3- Splitting and Partition the Data
Splitting in SPSS:
- Transform → Compute Variable → Target Variable: call it “Split” → Function Group: All → RV.Bernoulli(0.7)
  (70% of records get 1; 30% get 0)

Partitioning in SPSS:
- Data → Select Cases → If split = 1 → copy selected cases to a new dataset and name it “training”
- Data → Select Cases → If split = 0 → copy selected cases to a new dataset and name it “validation”

4- Run regression analysis: Analyze → Regression → Linear
(Multiple linear regression only accepts categorical variables if binary)
5- Compute error reports for both training and validation: total sum of squared errors, RMS error, and average error
6- Check residuals and compute a boxplot to identify outliers

7- Find parsimonious model (the simplest model that performs sufficiently well) by:

a. Exhaustive search (explanatory model):


 2 methods:
1. Judge by the highest adjusted R2
2. Judge by Cp: good models have Cp near p + 1 and a small p

b. Eliminating variables (predictive model):

 Forward Selection:
- Start with no predictors
- Add them one by one (add the one with largest
contribution)

- Stop when the addition is not statistically significant

 Backward Elimination:
- Start with all predictors
- Successively eliminate least useful predictors one by one
- Stop when all remaining predictors have statistically significant contribution
 Stepwise:
- Like Forward Selection
- Except at each step, also consider dropping non-significant predictors

If the T-test p-value > alpha → the predictor is not significant.

8- Good model: high adjusted R2, or Cp near p + 1


9- Indicate candidate models that might be “good models”

______________________________________________________________________

 Goodness-of-fit: The goodness of fit of a statistical model describes how well the model
fits the data. In general, the measure used to evaluate the goodness of fit is R2(coefficient
of determination). R2 ranges from 0 to 1.

The higher the R2 the better the goodness-of fit. However, the main disadvantage of this
measure is that R2 increases every time you add a predictor to the model even if this
predictor is not important. Therefore, the Adjusted R2, a modified version of R2, is used to
account for the number of added predictors.

 Overall Significance of the Regression: It is analyzed by the ANOVA F-test for the hypotheses below:

H0: None of the predictors is significant (all coefficients equal zero)

H1: At least one predictor is significant (at least one coefficient is not zero)

Typically, if the p-value of the F-test is below 5%, then the regression is significant.

 Individual predictor’s significance: It is analyzed by the T-test for the hypotheses below:

H0: the coefficient of the predictor is equal to zero

H1: the coefficient of the predictor is not equal to zero

Typically, if the p-value for this T-test is below 5% then the predictor is significant

 Detecting Multicollinearity: Collinearity is caused by having too many variables trying to do the same job. A popular measure for detecting multicollinearity is the Variance Inflation Factor (VIF). The VIF for the kth predictor is 1/(1 − Rk²), where Rk² is the R² value obtained by regressing the kth predictor on the remaining predictors.

Chapter 10 – Logistic Regression

Computing the Logistic Response Function and Logit

Here, the outcome variable is categorical, and since it should be binary to indicate a specific class, linear regression can’t help us. Why?

Because standard linear regression would model the probability directly:

p = β0 + β1x1 + β2x2 + ⋯ + βqxq

where p is the probability of belonging to class 1. A linear function of the predictors is not guaranteed to satisfy 0 ≤ p ≤ 1.

Thus, instead of Y as outcome variable (like in linear regression), we use a function of Y called the logit.

So how do we get to the logit function?

Step 1: Logistic Response Function

To guarantee that p will be between 0 and 1, we use the following function:

p = 1 / (1 + e^−(β0 + β1x1 + β2x2 + ⋯ + βqxq))     (Logistic Response Function)

Note that you should only talk about probabilities when you’re evaluating specific predictors (single-predictor cases).

Step 2: The Odds

Next, we look at another measure of belonging to a certain class, known as odds. The odds is defined as the ratio of the probability of belonging to class 1 to the probability of belonging to class 0:

Odds = p / (1 − p)     (1)

which can be inverted to give:

p = Odds / (1 + Odds)     (2)

If, for example, odds = 4, this means that success is 4 times as likely as failure.

Odds Ratio: If we divide the odds of one group by the odds of another, we get how many times the first is more/less likely to lead to success than the second. Example: Odds(males will buy)/Odds(females will buy) = 3 means males are 3 times more likely to buy than females.

If we substitute function (2) into the Logistic Response Function above, we get:

Odds = e^(β0 + β1x1 + β2x2 + ⋯ + βqxq)

Now, take the log of both sides and you get the logit function:

log(Odds) = β0 + β1x1 + β2x2 + ⋯ + βqxq

log(Odds) takes values from −infinity to +infinity, and a log(odds) of 0 indicates probability = 0.5.

Note that:
p is written as: P(Y = 1 or other success class | “predictor” = x)
Odds is written as: Odds(Y = 1 or other success class)
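A quick numeric check of these relationships, with made-up coefficients:

```python
from math import exp, log

# Made-up coefficients for a single-predictor model
b0, b1 = -3.0, 0.5
x = 4.0

# Logistic response function: p always lands strictly between 0 and 1
p = 1 / (1 + exp(-(b0 + b1 * x)))

# Equations (1) and (2): odds from p, and p back from odds
odds = p / (1 - p)
p_back = odds / (1 + odds)

# The logit recovers the linear predictor b0 + b1*x
logit = log(odds)
```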

Estimating the Logistic Model from Data

In logistic regression, the relation between Y and the parameters is nonlinear. For this reason, the
parameters are not estimated using the method of the least squares. Instead, a method called
maximum likelihood is used. In brief, this method finds the estimates that maximize the chance
of obtaining the data that we have.

Once you generate your regression model, different conclusions can be made:

- Notice how positive coefficients in the logit model translate into coefficients larger than 1 in the odds model, and vice versa.

- If asked how increasing a certain predictor by 1 unit would affect our target variable, use odds and not probability:

Odds(x1 + 1, x2, …, xq) / Odds(x1, x2, …, xq)
= e^(β0 + β1(x1+1) + β2x2 + ⋯ + βqxq) / e^(β0 + β1x1 + β2x2 + ⋯ + βqxq)
= e^β1

Thus, e^β1 is the multiplicative factor by which the odds of belonging to the success class change when the value of x1 is increased by 1 unit. If β1 is negative, an increase in x1 is associated with a decrease in the odds of belonging to class 1, and vice versa.

For example, if x1 is salary in thousands of dollars, people with a $1000 higher salary have e^β1 times the odds of buying.
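A numeric check that e^β1 is the odds multiplier, with made-up coefficients:

```python
from math import exp

# Made-up two-predictor logit model
b0, b1, b2 = -2.0, 0.7, -0.3

def odds(x1, x2):
    return exp(b0 + b1 * x1 + b2 * x2)

# Increasing x1 by one unit multiplies the odds by e**b1,
# regardless of the starting values of x1 and x2
ratio = odds(3 + 1, 5) / odds(3, 5)
```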

Logistic Regression in SPSS

To generate frequency tables:


Analyze → Descriptive Statistics → Frequencies

To run logistic regression:


Analyze → Regression → Binary Logistic
Don’t take non-binary categorical variables as independent variables
Options: Class cutoff: if not provided, choose 0.5
Note that in cases of rare events, lower cutoffs are taken.
As you decrease cutoffs, sensitivity increases.
To check for multicollinearity, we can trick SPSS and run linear regression on the model so we
can get VIFs.

The ROC Curve

A ROC curve shows how sensitivity and specificity vary as we change the cutoff. It is used to evaluate the model’s overall performance.

The ROC curve should be higher than the baseline, and the goal is to maximize the area under the curve (AUC).

Always use validation data for the ROC Curve.

This is how it goes:

First, run the logistic regression, then predict different ŷ values using different cutoffs (a and b
here):

Y    Ŷ (cutoff = a)    Ŷ (cutoff = b)
1    0                 1
1    1                 1
0    1                 1

Then the ROC curve models the variation of the sensitivity and 1 − specificity across the different
models (i.e., the different ŷ columns).

[ROC curve: sensitivity on the y-axis vs. 1 − specificity on the x-axis; baseline = the 50%
diagonal; AUC: the bigger the better.]
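The AUC itself can be computed without drawing the curve, via the rank (Mann-Whitney) formula: it equals the probability that a randomly chosen positive case scores above a randomly chosen negative case. A sketch with made-up probabilities:

```python
def auc(probs, actual):
    """AUC via the rank formula: P(random positive scores above a
    random negative); ties count as half."""
    pos = [p for p, a in zip(probs, actual) if a == 1]
    neg = [p for p, a in zip(probs, actual) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

probs  = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # made-up predicted P(Y=1)
actual = [1,   1,   0,   1,   0,   0]
# positives score {0.9, 0.8, 0.4}; negatives score {0.6, 0.3, 0.1}
# 8 of the 9 positive/negative pairs are ranked correctly -> AUC = 8/9
area = auc(probs, actual)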

Building a ROC Curve:

1. Build the logistic regression on the training data so we get the different probabilities
2. Data > Select Cases > If > split = 0 (or whatever marks the validation data)
3. Analyze > ROC Curve
Test variable: probability
State variable: target variable, e.g. Personal Loan
Value of state variable = 1
Select "with diagonal reference line"

Optimal Cutoff? The farthest point on the curve from the baseline
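The "farthest point from the baseline" rule corresponds to Youden's J statistic: pick the cutoff that maximizes sensitivity + specificity − 1 (the vertical distance above the diagonal). A sketch with made-up probabilities:

```python
def roc_point(probs, actual, cutoff):
    """Return (sensitivity, specificity) at a given cutoff."""
    pred = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(p == 1 and a == 1 for p, a in zip(pred, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(pred, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(pred, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(pred, actual))
    return tp / (tp + fn), tn / (tn + fp)

probs  = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]   # made-up predicted P(Y=1)
actual = [1,   1,   1,   0,   0,   0]

# Youden's J = sensitivity + specificity - 1; largest J = "optimal" cutoff
best = max(probs, key=lambda c: sum(roc_point(probs, actual, c)) - 1)
# here 0.6 separates the classes perfectly (J = 1)
```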

Appendix

Questions and Answers

1. Most discriminant (significant) numerical predictor?

→ Correlation matrix:
SPSS: Go to Analyze>Correlate>Bivariate
Choose all numerical variables (and binary variables)
The highest coefficient (in absolute value) indicates the most discriminant variable
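The same check can be done in pandas with a hypothetical dataset (column names and values are made up):

```python
import pandas as pd

# Hypothetical data: which numerical predictor best separates "bought"?
df = pd.DataFrame({
    "income": [40, 85, 60, 120, 30, 95],
    "age":    [25, 45, 38, 35, 30, 41],
    "bought": [0, 1, 0, 1, 0, 1],          # binary target
})

# Correlations of each predictor with the target
corr = df.corr()["bought"].drop("bought")

# Largest absolute correlation = most discriminant predictor
most_discriminant = corr.abs().idxmax()     # "income" in this toy data
```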

2. Most discriminant (significant) categorical predictor?

→ Pivot Tables
Excel: Go to Insert>PivotTable
Check Pivot Tables in Chapter 4 to know how to generate the different pivot
tables.

Then you'll have to compare the pivot tables:

If the response is numerical, the values share the same unit, so just compare the
difference between the generated values.
If the response is categorical, then you'll have to compare ratios, which you can
compute by dividing the difference over the sum.
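A pandas equivalent of such a pivot table, with a made-up dataset (for a 0/1 target, the mean per category is simply the success rate in that category):

```python
import pandas as pd

# Hypothetical data: is "education" discriminant for the binary target "loan"?
df = pd.DataFrame({
    "education": ["grad", "grad", "undergrad", "undergrad", "grad", "undergrad"],
    "loan":      [1, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 target per category = acceptance rate in that category
rates = df.pivot_table(values="loan", index="education", aggfunc="mean")
# grad -> 3/3 = 1.0, undergrad -> 1/3: a large gap suggests a
# discriminant predictor
```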

3. Split the data?

→ Check step 3 of prediction in Chapter 6

4. Code categorical variables as binary?

→ SPSS: Go to Transform>Recode into Different Variables
Name: "variable"_bin
Old Value: Yes  New Value: 1  Add  (pay attention to capitalization)
Select All other values  Value = 0  Add  Continue
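The same recode in pandas, with a hypothetical Yes/No column (note how the capitalization pitfall from the SPSS note shows up here too):

```python
import pandas as pd

# Hypothetical Yes/No column to recode: "Yes" -> 1, everything else -> 0
df = pd.DataFrame({"online": ["Yes", "No", "yes", "Yes", "No"]})
df["online_bin"] = (df["online"] == "Yes").astype(int)

# "yes" (lowercase) is NOT matched, so it falls into "all other values"
# and becomes 0 - exactly the capitalization pitfall warned about above
```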

5. Develop a model with all predictors?

→ First code all categorical predictors as binary (check question 4 above)
Then in SPSS: Analyze>Regression>Linear
Dependent: target variable
Independent: choose all variables (numerical + newly coded categoricals)
Selection Variable: "Split" equal to 1
Save: select Unstandardized under Predicted Values
select Unstandardized under Residuals
Statistics: keep the defaults (except if you're interested in collinearity, then select
Collinearity diagnostics)

Note that the higher the standardized coefficients (in absolute value), the more significant the
predictor.

Also, sometimes the predictor with the highest coefficient is different from the most significant
predictor according to pivot tables or correlation matrices. This is because of collinearity.
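A sketch of why standardized coefficients are comparable, using made-up data (the coefficients 3 and 0.5 are invented for illustration):

```python
import numpy as np

# Made-up data where x1 is, by construction, the stronger predictor
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 3 * x1 + 0.5 * x2 + rng.normal(size=100)

# z-score predictors and response so coefficient magnitudes are comparable
X = np.column_stack([x1, x2])
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()

# Least-squares fit -> standardized coefficients
A = np.column_stack([np.ones(len(yz)), Xz])
beta, *_ = np.linalg.lstsq(A, yz, rcond=None)
# |beta[1]| (for x1) comes out much larger than |beta[2]| (for x2)
```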

6. Prediction Performance? (Chapter 5)

→ If the response is categorical and the task is classification, use the error rate and the
confusion matrix.
→ If the response is numerical and the task is prediction, use the following measures of error.

To find the RMSE from the regression results:

→ Go to the Residuals Statistics table
RMSE = intersection of Std. Residual for Split 1.00 (unselected) with the Mean

Note that split = 1 → training data and split ≠ 1 → validation data
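The definition behind RMSE, sketched in Python with made-up residuals (RMSE is the square root of the mean of the squared residuals e_i = y_i − ŷ_i):

```python
import math

# Made-up residuals e_i = y_i - yhat_i from some validation set
residuals = [2.0, -1.0, 0.5, -3.5, 1.0]

# Root Mean Squared Error: sqrt of the average squared residual
rmse = math.sqrt(sum(e ** 2 for e in residuals) / len(residuals))
```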

7. Compare models?

→ Compare RMSE and take complexity (number of predictors) into consideration.

8. Compute MAPE? (Chapter 5)

→ You should compute MAPE on the validation data

In Excel:
1) Copy the following columns from SPSS to Excel: PRE_1, RES_1, Split, and y
(the target variable)
2) Sort by Split: smallest to largest
3) Copy all rows with split = 0 to a new sheet
4) Apply the formula in a new column: =ABS(e/y), where e_i = y_i − ŷ_i

Then compute in a different cell: AVERAGE of the abs(e/y) values, and multiply by 100
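The same computation in Python, with made-up actuals and predictions (MAPE = average of |e/y|, times 100):

```python
# Made-up validation-set actuals and model predictions
y    = [100.0, 200.0, 50.0, 400.0]
yhat = [110.0, 190.0, 60.0, 380.0]

# Absolute percentage error per row: |e_i / y_i| with e_i = y_i - yhat_i
errors = [abs((yi - yh) / yi) for yi, yh in zip(y, yhat)]

# Mean Absolute Percentage Error
mape = 100 * sum(errors) / len(errors)   # 10.0% for these numbers
```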

9. Score the Data?

→ 3 steps:
a. You save the model outputs in the "xml" file.
b. You open the data that you want to score, go to "Utilities", and choose the
"Scoring Wizard".
c. You browse to the "xml" file that you have saved and score the active Data Set.
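A rough Python analogue of the same save/load/score workflow. SPSS exports the model as an XML file; here, purely for illustration, the "model" is a dict of hypothetical logistic coefficients serialized with pickle:

```python
import math
import pickle

# Hypothetical fitted model: intercept and one coefficient (made up)
model = {"intercept": -2.0, "income": 0.05}

blob = pickle.dumps(model)         # step a: save (serialize) the model
loaded = pickle.loads(blob)        # step b: load it back later

def score(row, m):
    """Step c: score a new record -> P(Y = 1) from the logistic model."""
    logit = m["intercept"] + m["income"] * row["income"]
    return 1.0 / (1.0 + math.exp(-logit))

p = score({"income": 60}, loaded)  # logit = -2 + 0.05 * 60 = 1
```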
