DCSN 216 Summary
Table of Contents
Chapter 1 – Introduction
Types of Analytics
Data
Four Types of Data Based on Measurement Scale
Terminology and Notation
Chapter 2 – Overview of the Data Mining Process
Supervised Learning Vs. Unsupervised Learning
Steps in Data Mining
Chapter 3 – Data Visualization
Distribution Plots: Boxplots and Histograms
Chapter 4 – Dimension Reduction
Pivot Tables
Correlation Analysis
Reducing Number of Categories in Categorical Variables
Chapter 5 – Evaluating Classification and Predictive Performance
Accuracy Measures for Classification
Confusion (Classification) Matrix
Cutoff for Classification
Rare Cases
Measuring Predictive Error
Chapter 6 – Multiple Linear Regression
Explanatory Vs. Predictive Modeling with Regression
Steps in Prediction
Chapter 10 – Logistic Regression
Computing the Logistic Response Function and Logit
Estimating the Logistic Model from Data
Logistic Regression in SPSS
The ROC Curve
Appendix
Questions and Answers
Chapter 1 – Introduction
Types of Analytics
Descriptive analytics
- Uses data to understand past and present
- Descriptive model: Simply tells “what is” and describes relationships
Predictive analytics
- Analyzes past performance to predict the future
- Predictive model: incorporates uncertainty to help managers analyze risk
Prescriptive analytics
- Uses optimization techniques
- Prescriptive model: helps decision makers identify the best solution
Data:
Metrics are used to quantify performance.
Measures are numerical values of metrics.
Four Types of Data Based on Measurement Scale
Nominal (categorical) data: Data sorted into categories with no natural order (e.g., gender, region)
Ordinal data: Data that is ranked or ordered according to some relationship with one another
Interval data: Ordinal data but with constant differences between observations + no true zero point
Ratio data: Interval data that also has a natural zero point (e.g., sales, cost)
Terminology and Notation
Confidence has a specific meaning in association rules of the type “If A and B are
purchased, C is also purchased.” Confidence is the conditional probability that C will be
purchased, given that A and B are purchased.
Holdout sample is a sample of data not used in fitting a model, used to assess the
performance of that model; this book uses the terms validation set or, if one is used in the
problem, test set instead of holdout sample.
Model refers to an algorithm as applied to a dataset, complete with its settings (many of
the algorithms have parameters which the user can adjust).
Observation is the unit of analysis on which the measurements are taken (a customer, a
transaction, etc.); also called case, record, pattern or row. (Each row typically
represents a record, each column a variable.)
P(A|B) is the conditional probability of event A occurring given that event B has
occurred. Read as “the probability that A will occur, given that B has occurred.”
Prediction – The prediction of the numerical value of a continuous output variable; also
called estimation.
Pattern is a set of measurements on an observation (e.g., the height, weight, and age of a
person).
Score refers to a predicted value or class. “Scoring new data” means to use a model
developed with training data to predict output values in new data.
Success class is the class of interest in a binary outcome (e.g., “purchasers” in the
outcome “purchase/no-purchase”)
Test data (or test set) refers to that portion of the data used only at the end of the model
building and selection process to assess how well the final model might perform on
additional data.
Training data (or training set) refers to that portion of data used to fit a model.
Unsupervised learning refers to analysis in which one attempts to learn something about
the data other than predicting an output value of interest (whether it falls into clusters, for
example).
Validation data (or validation set) refers to that portion of the data used to assess how
well the model fits, to adjust some models, and to select the best model from among those
that have been tried.
Variable is any measurement on the records, including both the input (X) variables and
the output (Y) variable.
Chapter 2 – Overview of the Data Mining Process
Supervised Learning Vs. Unsupervised Learning
Classification:
Predict categorical target (outcome) variable
Examples: fraud/no fraud, purchase/no purchase…
Usually binary (1=yes, 0=no)
Prediction:
Predict numerical target (outcome) variable
Examples: sales, revenue, performance
Unsupervised: Segment data into meaningful segments; detect patterns (no specific
target to predict or classify)
Methods: Association rules, data reduction & exploration, visualization
Association:
Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Also called “Affinity” analysis
Data Reduction:
Reducing the number of columns and/or rows
Visualization:
Building graphs and plots of data which is useful to
examine relationships between pairs of variables
Examples: Histograms, boxplots, bar charts, scatterplots
Data exploration is using the techniques of data reduction and visualization to help understand and
better analyze complex, large datasets.
Steps in Data Mining
1. Define/understand purpose
- RMS Error:
Error = actual − predicted
RMSE (root-mean-squared error) = square root of the average squared error
Chapter 3 – Data Visualization
Distribution Plots: Boxplots and Histograms
Boxplots:
- We can also generate side-by-side boxplots in Excel or SPSS (easier). In SPSS: Graphs > Chart Builder > Boxplot.
- The box encloses 50% of the data.
- The horizontal line inside the box represents the median (50th percentile).
- The top and bottom of the box represent the 75th and 25th
percentiles of the data, respectively.
- Lines extending above and below the box represent the data
range.
- Outliers are represented by dots or circles above or below
the data range.
- Interquartile range = IQR = Q3 – Q1
Lower Limit = Q1 – 1.5IQR
Upper Limit = Q3 + 1.5IQR
- Comparing the average and the median helps in assessing
how skewed the data is.
- Sometimes, you should consider rescaling the chart for a
better demonstration.
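These boxplot quantities can also be checked outside Excel/SPSS. Below is a minimal Python sketch, assuming a hypothetical file data.csv with a numerical column amount, that computes the quartiles, IQR, and the outlier limits defined above.

```python
# Minimal sketch: quartiles, IQR, and outlier limits for one numerical column.
# The file name ("data.csv") and column name ("amount") are hypothetical.
import pandas as pd

df = pd.read_csv("data.csv")
x = df["amount"]

q1, median, q3 = x.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1                      # interquartile range = Q3 - Q1
lower_limit = q1 - 1.5 * iqr       # values below this are flagged as outliers
upper_limit = q3 + 1.5 * iqr       # values above this are flagged as outliers

print("median:", median, "IQR:", iqr)
print("limits:", lower_limit, upper_limit)
print("mean vs. median (skew check):", x.mean(), median)
```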
Histograms:
- Represents the frequencies of all x values with a series of
vertical connected bars.
Both are useful for prediction tasks because they are built on numerical values. Boxplots can
also support unsupervised learning by displaying the relationship between a numerical variable
(y-axis) and a categorical variable (x-axis).
9
In Excel, several functions assist in summarizing data, among them AVERAGE, STDEV, MIN, MAX,
MEDIAN, and COUNT.
The MIN and MAX functions can be used to identify extreme values that might be errors.
The average and the median give a sense of the central values of the variable, and a large
deviation between the two also indicates skew.
The standard deviation gives information about how dispersed the data is (relative to the mean).
You can also use Excel’s Descriptive Statistics facility in the Data > Data Analysis menu.
You can also obtain a complete matrix of correlations between each pair of variables in the
data using Excel’s Correlation facility in Data > Data Analysis.
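The same summaries and correlation matrix can be produced outside Excel. A minimal pandas sketch, assuming a hypothetical file data.csv:

```python
# Minimal sketch: summary statistics and a correlation matrix, mirroring
# Excel's Descriptive Statistics and Correlation facilities.
import pandas as pd

df = pd.read_csv("data.csv")                 # hypothetical dataset
num = df.select_dtypes(include="number")     # keep only numerical columns

print(num.describe())   # count, mean, std, min, quartiles, max per column
print(num.median())     # compare with the means to spot skew
print(num.corr())       # correlation matrix for every pair of variables
```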
Pivot Tables
For categorical variables (binary), we obtain a breakdown of the records by the combination of
categories. Excel’s bin notation (ranges) means that the lower value of the range is included in
the statistic but the upper value is not.
In classification tasks where the goal is to find significant predictors, a good step is to prepare
pivot tables for all classes.
If you’d like to compare the effect of two or more variables on the target variable, you need to
convert them to ratios if they don’t share the same unit. You can do that by dividing the
difference between the averages for the 1s and the 0s by their sum. The higher ratio indicates
the variable with the higher significance with respect to the target variable.
You can’t create a pivot table for two numerical variables; use correlation instead.
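A minimal pandas sketch of the class breakdown and the ratio comparison described above, assuming a hypothetical file customers.csv with a binary target purchase and columns gender and income:

```python
# Minimal sketch: pivot-table-style breakdowns by a binary target class.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

# Count of records by the combination of two categorical variables.
print(pd.pivot_table(df, index="gender", columns="purchase",
                     values="income", aggfunc="count"))

# Averages of a numerical predictor for the 0s and the 1s of the target,
# and the ratio: (mean for 1s - mean for 0s) / (their sum).
means = df.groupby("purchase")["income"].mean()
ratio = (means[1] - means[0]) / (means[1] + means[0])
print(means)
print("ratio:", ratio)
```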
Correlation Analysis
Removing variables that are strongly correlated to others is useful for avoiding multicollinearity
problems. Multicollinearity is the presence of two or more predictors sharing the same linear
relationship with the outcome variable.
Note that you can’t generate correlation analysis for a categorical variable with a numerical one
unless the categorical variable is coded or converted to a dummy variable.
To identify the best numerical prediction based on the correlation analysis, look for the value
with the highest correlation coefficient with respect to the target variable.
The ** notation in the SPSS output indicates that the two variables are significantly correlated at the 1% level.
To compute the VIFs for multicollinearity diagnostics when you run the linear regression, go
to “Statistics” and check the box “Collinearity diagnostics”. The general rule of thumb is that
VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious
multicollinearity requiring correction. Prefer predictors with VIF less than 3 and eliminate the
predictors with the highest VIFs. When eliminating predictors with VIF > 3, remove them one by
one and check for changes in adjusted R2 after each removal.
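If you prefer to compute VIFs outside SPSS, here is a minimal sketch with statsmodels; the file and predictor names are hypothetical.

```python
# Minimal sketch: VIF per predictor, an alternative to SPSS's
# "Collinearity diagnostics". File and column names are hypothetical.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("predictors.csv")
X = add_constant(df[["income", "age", "spending"]])   # add intercept column

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # rule of thumb: VIF > 4 warrants a look, VIF > 10 is serious
```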
Reducing Number of Categories in Categorical Variables
When a categorical variable has many categories and is destined to be a predictor, it is
preferable to convert it into dummy variables (1 = yes, 0 = no). In particular, a variable with
m categories should be converted into m − 1 dummy variables. You can also combine similar
categories, keeping only the relevant categories and grouping the rest into an “others” category.
In classification tasks (with a categorical output), a pivot table broken down by the output classes
can help identify categories that do not separate the classes. Those categories are also candidates
for inclusion in the “others” category.
Sometimes the categories in a categorical variable represent intervals. In such cases, we can
replace the categorical value with the mid-interval value.
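A minimal sketch of the m − 1 dummy-variable conversion in pandas; the column name education, the 1% rarity threshold, and the file name are hypothetical choices.

```python
# Minimal sketch: group rare categories into "others", then convert a
# categorical predictor with m categories into m - 1 dummy variables.
import pandas as pd

df = pd.read_csv("customers.csv")

freq = df["education"].value_counts(normalize=True)
rare = list(freq[freq < 0.01].index)                  # rarely occurring categories
df["education"] = df["education"].replace(rare, "others")

dummies = pd.get_dummies(df["education"], prefix="education", drop_first=True)
df = pd.concat([df.drop(columns="education"), dummies], axis=1)
print(df.columns)
```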
Chapter 5 – Evaluating Classification and Predictive Performance
Confusion (Classification) Matrix
The confusion (classification) matrix compares the predicted classes to the actual classes
(using validation data), which is a way to check the accuracy of our model.
Sometimes a model is more accurate, sensitive, and specific just because more
variables are used. This can cause overfitting, and we prefer not to work with such a
model.
Cutoff for Classification
If the probability that a case belongs to a given class is greater than the cutoff, then the case
is assigned to that class.
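A minimal sketch of a confusion matrix for a chosen cutoff, using scikit-learn; the actual classes and predicted probabilities below are made-up examples.

```python
# Minimal sketch: confusion matrix on validation data for a given cutoff.
# The actual classes and predicted probabilities are made-up examples.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

actual = np.array([1, 0, 0, 1, 0, 1, 0, 0])
prob_of_1 = np.array([0.9, 0.4, 0.2, 0.6, 0.55, 0.3, 0.1, 0.7])

cutoff = 0.5
predicted = (prob_of_1 > cutoff).astype(int)   # class 1 if probability > cutoff

print(confusion_matrix(actual, predicted))     # rows = actual, columns = predicted
print("accuracy:", accuracy_score(actual, predicted))
```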
Rare Cases
We often oversample rare cases (a very low % of 1s) to give the model more information to work
with.
Measuring Predictive Error
We want to know how well the model predicts new data, not how well it fits the data it was
trained with (the latter is known as goodness of fit). A key component of most measures is the
difference between the actual value y and the predicted value ŷ.
For numerical prediction tasks, use the following measures of error:
- RMSE (root-mean-squared error): square the errors, find their average, and take the square root.
- Used to compare models: the one with the lowest RMSE is better. (Check Appendix, Question 6.)
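A minimal sketch of the RMSE calculation; the actual and predicted values are made-up examples.

```python
# Minimal sketch: RMSE = square the errors, average them, take the square root.
import numpy as np

actual = np.array([200.0, 150.0, 340.0, 90.0])
predicted = np.array([210.0, 140.0, 300.0, 100.0])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))
print("RMSE:", rmse)   # when comparing models, lower RMSE is better
```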
Chapter 6 – Multiple Linear Regression
Explanatory Vs. Predictive Modeling with Regression
Explanatory modeling:
- Assessed by “goodness-of-fit”: R2, residual analysis, p-values
- Uses the entire dataset for estimating the model
Predictive modeling:
- Train the model on training data and assess performance on validation data
- Data is split into a training set and a validation set
Steps in Prediction
Partitioning in SPSS:
- Data > Select Cases > If split = 1; copy selected cases to a new dataset and name it training.
- Data > Select Cases > If split = 0; copy selected cases to a new dataset and name it validation.
4- Run the regression analysis: Analyze > Regression > Linear.
Multiple linear regression only accepts categorical variables if they are binary.
5- Compute error reports for both training and validation: total sum of squared errors,
RMS error, and average error (see the Python sketch after this list).
6- Check residuals and compute a boxplot to identify outliers
7- Find parsimonious model (the simplest model that performs sufficiently well) by:
Forward Selection:
- Start with no predictors
- Add them one by one (add the one with largest
contribution)
Backward Elimination:
- Start with all predictors
- Successively eliminate the least useful predictors one by one
- Stop when all remaining predictors have a statistically significant contribution
Stepwise:
- Like Forward Selection
- Except at each step, also consider dropping non-significant predictors
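The partitioning, fitting, and error-report steps above can also be sketched in Python instead of SPSS/Excel. This is a minimal sketch, not the course procedure; the file name, column names, and the 60/40 split are hypothetical.

```python
# Minimal sketch: partition into training/validation, fit a multiple linear
# regression, and report errors on both partitions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("houses.csv")                # hypothetical dataset
X = df[["sqft", "bedrooms", "age"]]           # numerical / binary predictors only
y = df["price"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=1)

model = LinearRegression().fit(X_train, y_train)

def error_report(y_true, y_pred):
    e = y_true - y_pred
    return {"SSE": np.sum(e ** 2),
            "RMSE": np.sqrt(np.mean(e ** 2)),
            "Average error": np.mean(e)}

print("training:  ", error_report(y_train, model.predict(X_train)))
print("validation:", error_report(y_valid, model.predict(X_valid)))
```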
______________________________________________________________________
Goodness-of-fit: The goodness of fit of a statistical model describes how well the model
fits the data. In general, the measure used to evaluate goodness of fit is R2 (the coefficient
of determination). R2 ranges from 0 to 1.
The higher the R2, the better the goodness-of-fit. However, the main disadvantage of this
measure is that R2 increases every time you add a predictor to the model, even if the
predictor is not important. Therefore, the adjusted R2, a modified version of R2, is used to
account for the number of added predictors.
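For reference, a standard form of the adjusted R2, where n is the number of observations and k the number of predictors:

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```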
Overall Significance of the Regression: It is analyzed by the ANOVA F-test of the hypotheses
H0: β1 = β2 = … = βk = 0 versus H1: at least one βj ≠ 0.
Typically, if the p-value of the F-test is below 5%, then the regression is significant.
Significance of an Individual Predictor: It is analyzed by a t-test on the predictor’s coefficient.
Typically, if the p-value for this t-test is below 5%, then the predictor is significant.
Chapter 10 – Logistic Regression
Here, the outcome variable is categorical, and since it should be binary to indicate a specific
class, linear regression can’t help us. Why? Because a linear model can produce predicted values
below 0 or above 1, which cannot be interpreted as class probabilities.
Thus, instead of Y as the outcome variable (as in linear regression), we use a function of Y called
the logit.
Note that you should only talk about probabilities when you’re evaluating specific predictors (single-predictor cases).
Next, we look at another measure of belonging to a certain class, known as Odds. The odds is
defined as the ratio of the probability of belonging to class 1 to the probability of belonging to
class 0.
Odds = p / (1 − p)   (1)        p = Odds / (1 + Odds)   (2)
If, for example, odds = 4, this means that success is 4 times as likely as failure.
Now, take the log of both sides and you get the logit function:
logit = log(Odds) = log(p / (1 − p)) = β0 + β1x1 + … + βqxq
Log(Odds) takes values from −infinity to +infinity, and a log(odds) of 0 indicates probability =
0.5.
Note that:
P is represented as: P(Y=1 or other success prediction | “predictor” = x)
Odds is represented as: Odds(Y=1 or other success prediction)
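A minimal sketch of moving between probability, odds, and logit using equations (1) and (2); the probability value is a made-up example.

```python
# Minimal sketch: convert between probability, odds, and log(odds).
import math

p = 0.8
odds = p / (1 - p)            # equation (1): odds = 4.0
logit = math.log(odds)        # log(odds); a value of 0 corresponds to p = 0.5

p_back = odds / (1 + odds)    # equation (2): recover the probability
print(odds, logit, p_back)
```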
In logistic regression, the relation between Y and the parameters is nonlinear. For this reason, the
parameters are not estimated using the method of least squares. Instead, a method called
maximum likelihood is used. In brief, this method finds the estimates that maximize the chance
of obtaining the data that we have.
Once you generate your regression model, different conclusions can be made:
- Notice how positive coefficients in the logit model translate into coefficients larger
than 1 in the odds model, and vice versa.
- If asked how increasing a certain predictor by one unit would affect the target
variable, use the odds and not the probability.
Thus, e^β1 is the multiplicative factor by which the odds of belonging to the
success class increase when the value of x1 is increased by 1 unit. If β1 is
negative, an increase in x1 is associated with a decrease in the odds of belonging
to class 1, and vice versa.
For example, people with a $1,000 higher salary are e^β1 times as likely to buy.
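A minimal sketch of estimating a logistic model by maximum likelihood and reading off the odds multipliers e^β, using statsmodels; the file and column names are hypothetical.

```python
# Minimal sketch: fit a logistic regression by maximum likelihood and
# interpret exp(coefficient) as the odds multiplier per 1-unit increase.
# File and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("bank.csv")
X = sm.add_constant(df[["income", "age"]])   # predictors plus intercept
y = df["purchase"]                           # binary outcome (1 = success class)

model = sm.Logit(y, X).fit()
print(model.params)           # logit coefficients (betas)
print(np.exp(model.params))   # odds multipliers: > 1 if beta > 0, < 1 if beta < 0
```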
The ROC Curve
An ROC curve shows how sensitivity and specificity vary as we change the cutoff. Moreover, it
is used to evaluate overall performance.
The ROC curve should lie above the baseline (the diagonal), and the goal is to maximize the area
under the curve (AUC), which in turn corresponds to higher sensitivity across cutoffs.
The ROC curve plots sensitivity on the y-axis against 1 − specificity on the x-axis.
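A minimal sketch of the ROC curve points and AUC with scikit-learn; the actual classes and predicted probabilities are made-up examples.

```python
# Minimal sketch: ROC curve points (sensitivity vs. 1 - specificity) and AUC.
from sklearn.metrics import roc_auc_score, roc_curve

actual = [1, 0, 0, 1, 0, 1, 0, 1]
prob_of_1 = [0.9, 0.4, 0.2, 0.6, 0.55, 0.8, 0.1, 0.7]

fpr, tpr, cutoffs = roc_curve(actual, prob_of_1)  # tpr = sensitivity, fpr = 1 - specificity
for c, s, f in zip(cutoffs, tpr, fpr):
    print(f"cutoff={c:.2f}  sensitivity={s:.2f}  1-specificity={f:.2f}")
print("AUC:", roc_auc_score(actual, prob_of_1))   # closer to 1 is better; 0.5 = baseline
```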
Appendix
Note that the higher the standardized coefficient (in absolute value), the more significant the
predictor.
Also, sometimes the predictor with the highest coefficient is different from the most significant
predictor according to pivot tables or correlation matrices. This is because of collinearity.
7. Compare models?
Compare RMSE and take complexity (number of predictors) into consideration.
In Excel:
1) Copy the following columns from SPSS to Excel: PRE_1, RES_1, Split, and y (the target variable).
2) Sort by Split: smallest to largest.
3) Copy all rows with Split = 0 to a new sheet.
4) Apply the formula in a new column: =ABS(e/y), where e_i = y_i − ŷ_i.
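The same validation error report can be produced directly in pandas instead of the Excel steps above; the exported file name is hypothetical, while PRE_1, Split, and y follow the SPSS export described in the steps.

```python
# Minimal sketch: error report on the validation partition (Split = 0).
# PRE_1 holds the SPSS predicted values and y is the target variable.
import numpy as np
import pandas as pd

df = pd.read_csv("spss_export.csv")          # hypothetical export: PRE_1, RES_1, Split, y
valid = df[df["Split"] == 0].copy()          # keep only the validation rows

valid["error"] = valid["y"] - valid["PRE_1"]                  # e_i = y_i - y-hat_i
valid["abs_rel_error"] = (valid["error"] / valid["y"]).abs()  # same as =ABS(e/y)

print("RMSE:", np.sqrt((valid["error"] ** 2).mean()))
print("Mean absolute relative error:", valid["abs_rel_error"].mean())
```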