Visual Guide To Machine Learning

MACHINE LEARNING LANDSCAPE

The machine learning landscape spans several families of techniques:
• Classification
• Regression
• Clustering/Segmentation (i.e. K-Means)
• Reinforcement Learning (Q-learning, deep RL, multi-armed bandit, etc.)
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Quality Assurance (QA) is about preparing & cleaning data prior to analysis. We’ll cover common QA topics including variable types, empty/missing values, range & count calculations, censored data, etc.
PRELIMINARY DATA QA

Data QA (otherwise known as Quality Assurance or Quality Control) is the first step in the analytics and machine learning process; QA allows you to identify and correct underlying data issues (blanks, errors, incorrect formats, etc.) prior to analysis. EVERY. SINGLE. TIME. (no exceptions!)

WHY IS QA IMPORTANT?

As an analyst, Preliminary Data QA will help you answer questions like:
• Are there any missing or empty values in the data shared by the HR team?
• Was the data our client captured from the online survey encoded properly?
• Is there any risk that the data capture process was biased in some way?
• Are there any outliers that might skew the results of our analysis?
VARIABLE TYPES

Variable types give us information about our variables:
• January 1, 2000 stored as a number simply displays a date
• January 1, 2000 stored as a date implies a set of information we can use (first day of the year, winter, Saturday, weekend, etc.), all of which can be used to build strong machine learning models

Common variable types include:
• Numeric (i.e. customer count)
• Discrete (i.e. a count, or values re-coded from raw values into buckets, as in surveys)
• Date (i.e. 1/1/2020)

Watch out for less obvious cases, like zip codes: since they will never be analyzed as values, they should be stored as strings rather than numbers.
EMPTY VALUES

Investigating empty values, and how they are recorded in your data, is a prerequisite for every single analysis.

Empty values can be recorded in many ways (NA, N/A, #N/A, NaN, Null, “-”, “Invalid”, blank, etc.), but the most common mistake is turning empty numerical values into zeros (0).
For a missing Retail Price, you would likely be able to impute the value since you know the product name/ID.

RANGE CALCULATIONS

min(height) = -10
max(height) = 10

Is your variable normalized around a central value (i.e. 0)?

COUNT CALCULATIONS

PRO TIP: For numerical variables with many unique values (i.e. long decimals), use a histogram to plot frequency based on custom ranges or “bins” (more on that soon!)
LEFT/RIGHT CENSORED

When data is left or right censored, it means that due to some circumstance the min or max value observed is not the natural minimum or maximum of that metric.
• This can be difficult to spot unless you are aware of how the data is being recorded (which means it’s a particularly dangerous issue to watch out for!)

Left censored example – Mall Shopper Survey Results: the survey only tracks shoppers over the age of 18 due to legal reasons, so anyone under 18 is excluded (even though there are plenty of mall shoppers under 18).

Right censored example – Ecommerce Repeat Purchase Rate: the sharp drop as you approach the current date has nothing to do with customer behavior; recent customers simply haven’t had the opportunity or need to repurchase yet.
TABLE STRUCTURE (PIVOT / UNPIVOT)

Long tables typically contain a single, distinct column for each field (Date, Product, Category, Quantity, Profit, etc.)
• Easy to see all available fields and variable types
• Great for exploratory data analysis and aggregation (i.e. PivotTables)

Wide tables typically split the same metric into multiple columns or categories (i.e. 2018 Sales, 2019 Sales, 2020 Sales, etc.)
• Typically not ideal for human readability, since wide tables may contain thousands of columns (vs. only a handful if pivoted to a long format)
• Often (but not always) the best format for machine learning model input
• Great format for visualizing categorical data (i.e. sales by product category)

There’s no right or wrong table structure; each type has strengths & weaknesses!
CASE STUDY: PRELIMINARY QA

THE SITUATION: You’ve just been hired as a Data Analyst for Maven Market, a local grocery store looking for help with basic data management and analysis.

Remember that NA and 0 do not mean the same thing! Think carefully about how to handle missing data and the impact it may have on your analysis.
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Univariate profiling is about exploring individual variables to build an understanding of your data. We’ll cover common topics like normal distributions, frequency tables, histograms, etc.
UNIVARIATE PROFILING
Univariate profiling is the next step after preliminary QA; think of univariate profiling
as conducting a descriptive analysis of each variable by itself
VARIABLE TYPES: DISCRETIZATION

Discretization re-codes a numerical variable into categorical buckets. Example rules:
• If Price < 100 then Price Level = Low
• If Price >= 100 & Price < 500 then Price Level = Med
• If Price >= 500 then Price Level = High
NOMINAL VS. ORDINAL VARIABLES

There are two types of categorical variables: nominal and ordinal
• Nominal variables contain categories with no inherent logical rank, which can be re-ordered with no consequence (i.e. Product Type = Camping, Biking or Hiking)
• Ordinal variables contain categories with a logical order (i.e. Size = Small, Medium, Large), but the interval between those categories has no logical interpretation
CATEGORICAL DISTRIBUTIONS

Categorical distributions are visual and/or numeric representations of the unique values a variable contains, and how often each occurs.

Common categorical distributions include:
• Frequency tables: Show the count (or frequency) of each distinct value
• Proportions tables: Show the count of each value as a % of the total
• Heat maps: Formatted to visualize patterns (typically used for multiple variables)

Understanding categorical distributions will help us gather knowledge for building accurate machine learning models (more on this later!)

Example – Section distribution:
• Frequency table: Camping = 14, Biking = 6
• Proportions table: Camping = 70%, Biking = 30%
NUMERICAL VARIABLES

HISTOGRAMS

Histograms are used to plot a single, discretized numerical variable.

Imagine taking a numerical variable (like age), defining ranges or “bins” (1-5, 6-10, etc.), and counting the number of observations which fall into each bin; this is exactly what histograms are designed to do!

[Figure: a list of raw age values plotted as a histogram, with Age Range bins (0-10, 11-20, 21-30, 31-40, 41-50) on the x-axis and Frequency on the y-axis]

KERNEL DENSITIES

Kernel densities are “smooth” versions of histograms, which can help to prevent users from over-interpreting breaks between bins
• Technical definition: “Non-parametric density estimation via smoothing”
• Intuitive definition: “Wet noodle laying on a histogram”
HISTOGRAMS & KERNEL DENSITIES

When to use histograms and kernel density charts:
• Visualizing how a given variable is distributed
• Providing a visual glimpse of profiling metrics like mean, mode, and skewness

Things to watch out for:
• Bin sensitivity: Bin size can significantly change the shape and “smoothness” of a histogram, so select a bin width that accurately shows the data distribution
• Outliers: Histograms can be used to identify outliers in your data set, but you may need to remove them to avoid skewing the distribution
• Sample size: Histograms are best suited for variables with many observations, to reflect the true population distribution

PRO TIP: If your data is relatively symmetrical (not skewed), you can use Sturges’ Rule as a quick “rule of thumb” to determine an appropriate number of bins: K = 1 + 3.322 log10(N) (where K = number of bins, N = number of observations)
CASE STUDY: HISTOGRAMS

THE SITUATION: You’ve just been promoted to the new Pit Boss at The Lucky Roll Casino. Your mission? Use data to help expose cheats on the casino floor.

THE ASSIGNMENT: Profits at the craps tables have been unusually low, and you’ve been asked to investigate the possible use of loaded dice (weighted towards specific numbers). Your plan is to track the outcome of each roll, then compare your results against the expected probability distribution to see how closely they match.
NORMAL DISTRIBUTION

The normal distribution (“bell curve”) is described by the probability density function:

f(x) = 1 / (σ√(2π)) · e^(−(x−μ)² / (2σ²))

Intuition for the shape: start with a parabola, turn the parabola upside down, then exponentiate to make the tails flare out; centering on μ and scaling by σ√(2π) turn the curve into a proper probability density.

CASE STUDY: NORMAL DISTRIBUTION

THE SITUATION: It’s August 2016, and you’ve been invited to Rio de Janeiro as a Data Analyst for the Global Olympic Committee.

THE ASSIGNMENT: Your job is to collect demographic data for all female athletes competing in the Summer Games and determine how the distribution of Olympic athlete heights compares against the general public.

THE OBJECTIVES:
1. Gather heights for all female athletes competing in the 2016 Games
2. Plot height frequencies using a Histogram, and test various bin widths
3. Determine if athlete heights follow a normal distribution, or “bell curve”
4. Compare the distributions for athletes vs. the general public
DATA PROFILING

MODE

The mode is the most frequently occurring value of a variable. Examples:
• Mode of City = “Houston”
• Mode of Sessions = 24
• Mode of Gender = F, M (this is a bimodal field!)

Common uses:
• Understanding the most common values within a dataset
• Diagnosing if one variable is influenced by another

While modes typically aren’t very useful on their own, they can provide helpful hints for deeper data exploration
• For example, a multi-modal age distribution (i.e. modes at 0-10 and 41-50, rather than a single mode at 21-30) indicates that there may be another variable impacting the age distribution
MEAN

The mean is the calculated “central” value in a discrete set of numbers
• Mean is what most people think of when they hear the word “average”, and is calculated by dividing the sum of all values by the count of all observations
• Means can only be applied to numerical variables (not categorical)

mean = (sum of all values) / (count of observations) = 5,220 / 5 = 1,044

Common uses:
• Making a “best-guess” estimate of a value
• Calculating a central value when outliers are not present
MEDIAN

The median is the middle value when observations are sorted; with an even number of observations, it is the average of the two middle values (i.e. Median = 19.5, the average of 15 and 24).

Common uses:
• Identifying the “center” of a distribution
• Calculating a central value when outliers may be present

PERCENTILE

Example: Bob is the 3rd tallest in a group of 12. Since he’s taller than 9 others (9/12 = 75%), Bob is in the 75th percentile for height!

Common uses:
• Providing context and intuitive benchmarks for how values “rank” within a sample (test scores, height/weight, blood pressure, etc.)
VARIANCE

Variance measures how widely values are spread around the mean; a wider distribution means a larger variance (i.e. Variance = 5 vs. 15 vs. 30 for increasingly spread-out distributions).

Common uses:
• Comparing the numerical distributions of two different groups (i.e. prices of products ordered online vs. in store)

Variance formula (average squared difference from the mean):

s² = Σᵢ₌₁ⁿ (xᵢ − μ)² / (n − 1)

Calculation Steps:
1) Calculate the average of the variable
2) Subtract that average from the first row
3) Square that distance from the mean
4) Do steps 2-3 for every row and sum it up
5) Divide by the number of observations (- 1)
STANDARD DEVIATION

The standard deviation is the square root of the variance, expressed in the same units as the variable itself (1 s.d., 2 s.d., and 3 s.d. bands around the mean cover progressively more of a normal distribution).

Common uses:
• Comparing segments for a given metric (i.e. time on site for mobile users vs. desktop)
• Understanding how likely certain values are to occur

SKEWNESS

Skewness measures how asymmetrical a distribution is around its mean.

Common uses:
• Identifying non-normal distributions, and describing them mathematically
BEST PRACTICES: UNIVARIATE PROFILING

• Make sure you are using the appropriate tools for profiling categorical variables vs. numerical variables
• QA still comes first! Profiling metrics are important, but can lead to misleading results without proper QA (i.e. handling outliers or missing values)
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Multivariate profiling is about understanding relationships between multiple variables. We’ll cover common tools for exploring categorical & numerical data, including kernel densities, violin & box plots, scatterplots, etc.
MULTIVARIATE PROFILING

Multivariate profiling is the next step after univariate profiling, since single-metric distributions are rarely enough to draw meaningful insights or conclusions.

THE OBJECTIVES:
1. Create a table to plot accident frequency by time of day and day of week
2. Apply conditional formatting to the table to create a heatmap showing the days and times with the fewest (green) and most (red) accidents in the sample
CATEGORICAL-NUMERICAL DISTRIBUTIONS

Categorical-numerical distributions are used for comparing numerical distributions across classes in a category (i.e. age distribution by gender).

These are typically visualized using variations of familiar univariate numerical distributions, including:
• Histograms & Kernel Densities: Show the count (or frequency) of values
• Violin Plots: Kernel density “glued” to its mirror image, and tilted on its side
• Box Plots: Like a kernel density, but formatted to visualize key statistical values (min/max, median, quartiles) and outliers

Common uses:
• Comparing key business metrics (i.e. customer lifetime value, average order size, purchase frequency, etc.) by customer class (gender, loyalty status, location, etc.)
• Comparing sales performance by day of week or hour of day
MULTIVARIATE KERNEL DENSITIES

Remember: kernel densities are just smooth versions of histograms
• To visualize a categorical-numerical distribution, kernel densities can be repeated to represent each class within a particular category

[Figure: overlapping kernel densities by class – the purple class has a mean of ~25, overlaps with yellow, and has relatively high variance]

VIOLIN PLOTS

A violin plot is essentially a kernel density flipped vertically and combined with its mirror image.

BOX PLOTS

Box plots are like violin plots, but designed to show key statistical attributes rather than smooth distributions, including:
• Median
• Min & Max (excluding outliers)
• 25th & 75th Percentiles
NUMERICAL-NUMERICAL DISTRIBUTIONS

Numerical-numerical distributions are typically visualized using scatter plots, which plot points along the X and Y axis to show the relationship between two variables
• Scatter plots allow for simple, visual intuition: when one variable increases or decreases, how does the other variable change?
• There are many possibilities: no relationship, positive, negative, linear, non-linear, cubic, exponential, etc.

Common uses:
• Quickly visualizing how two numerical variables relate
• Predicting how a change in one variable will impact another (i.e. square footage and house price, marketing spend and sales, etc.)
CORRELATION

[Figure: three scatter plots showing no correlation, positive correlation, and strong positive correlation]

Correlation is an extension of variance
• Think of correlation as a way to measure the variance of both variables at one time (called “co-variance”), while controlling for the scales of each variable

Variance formula:   s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

Correlation formula:   r = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) · sₓ · s_y)

• Here we multiply variable X’s difference from its mean with variable Y’s difference from its mean, instead of squaring a single variable (like we do with variance)
• sₓ and s_y are the standard deviations of X and Y, which puts them on the same scale
CORRELATION VS. CAUSATION

[Figure: scatter plot of Ice Cream Cones Sold (x-axis) vs. Drowning Deaths (y-axis)]

Consider the scatter plot above, showing daily ice cream sales and drowning deaths in a popular New England vacation town
• These two variables are clearly correlated, but do ice cream cones CAUSE people to drown? Do drowning deaths CAUSE a surge in ice cream sales?

Of course not, because correlation does NOT imply causation! So what do you think is really going on here?

PRO TIP: VISUALIZING A THIRD DIMENSION

Scatter plots show two dimensions by default (X and Y), but using symbols or color allows you to visualize additional variables and expose otherwise hidden patterns or trends.

THE SITUATION: You’ve just landed your dream job as a Marketing Analyst at Loud & Clear, the hottest ad agency in San Diego.

THE ASSIGNMENT: Your client would like to understand the impact of their digital media spend, and how it relates to website traffic, offline spend, site load time, and sales. Your role is to collect and visualize these metrics at the weekly level in order to begin exploring the relationships between them.

Remember that correlation does not imply causation, and that variables can be related without one causing a change in the other.
PART 2: CLASSIFICATION MODELS

ABOUT THIS SERIES

This is Part 2 of a 4-Part series designed to help you build a deep, foundational understanding of machine learning, including data QA & profiling, classification, forecasting and unsupervised learning.

MACHINE LEARNING LANDSCAPE
• Classification: support vector machines, gradient boosting, neural nets/deep learning, etc.
• Regression: LASSO/RIDGE, state-space, advanced generalized linear methods, VAR, DFA, etc.
• Clustering/Segmentation: K-Means
• Reinforcement Learning: Q-learning, deep RL, multi-armed bandit, etc.
• Deep Learning: Feed Forward, Convolutional, RNN/LSTM, Attention, Deep RL, Autoencoder, GAN, etc.
MACHINE LEARNING

Machine learning models learn from data organized into rows and columns
• Each row represents an individual record/observation
• Each column represents an individual variable
• Variables can be categorical or numerical

Session ID | Browser | Time (Sec) | Pageviews | Purchase
1 | Chrome | 354 | 11 | 1
2 | Safari | 94 | 4 | 1
3 | Safari | 36 | 2 | 0
4 | IE | 17 | 1 | 0

• Without quality data, you can’t build quality models (“garbage in, garbage out”)
• Profiling is the first step towards conditioning (or filtering) on key variables to understand their impact
The goal of any classification model is to predict a dependent variable using independent variables.

𝒚 Dependent variable (DV)
• This is the variable you’re trying to predict
• The dependent variable is commonly referred to as the “Y”, “predicted”, “output”, or “target” variable
• Classification is about understanding how the DV is impacted by, or dependent on, other variables in the model

𝐱 Independent variables (IVs)
• These are the variables which help you predict the dependent variable
• Independent variables are commonly referred to as “X’s”, “predictors”, “features”, “dimensions”, “explanatory variables”, or “covariates”
• Classification is about understanding how the IVs impact, or predict, the DV
CLASSIFICATION 101

EXAMPLE: Using data from a CRM database (sample below) to predict if a customer will churn next month

Cust. ID | Gender | Status | HH Income | Age | Sign-Up Date | Newsletter | Churn
1 | M | Bronze | $30,000 | 29 | Jan 17, 2019 | 1 | 0
2 | F | Gold | $60,000 | 37 | Mar 18, 2017 | 1 | 1
3 | F | Bronze | $15,000 | 24 | Oct 1, 2020 | 0 | 0
4 | F | Silver | $75,000 | 41 | Apr 12, 2019 | 1 | 0
5 | M | Bronze | $40,000 | 36 | Jul 23, 2020 | 1 | 1
6 | M | Gold | $35,000 | 31 | Oct 22, 2017 | 0 | 1

• Churn is our dependent variable, since it’s what we want to predict
• We use records with observed values for both independent and dependent variables to “train” our classification model...
Project Scoping → Preliminary QA → Data Profiling: Remember, these steps ALWAYS come first. Before building a model, you should have a deep understanding of both the project scope (stakeholders, framework, desired outcome, etc.) and the data at hand (variable types, table structure, data quality, profiling metrics, etc.)

Then:
• Feature Engineering: Adding new, calculated variables (or “features”) to a data set based on existing fields
• Splitting: Splitting records into “Training” and “Test” data sets, to validate accuracy and avoid overfitting
• Modeling: Building classification models from Training data and applying to Test data to maximize prediction accuracy
• Model Selection & Tuning: Choosing the best performing model for a given prediction, and tuning it to prevent drift over time

FEATURE ENGINEERING

Original fields: Cust. ID, Status, HH Income, Age, Sign-Up Date, Newsletter
Engineered features: Gold, Silver, Bronze, Scaled Income, Log Income, Age Group, Sign-Up Year, Priority
SPLITTING

Splitting is the process of partitioning data into separate sets of records for the purpose of training and testing machine learning models
• As a rule of thumb, ~70-80% of your data will be used for Training (which is what your model learns from), and ~20-30% will be reserved for Testing (to validate the model’s accuracy)
• Using Training data for optimization and Test data for validation ensures that your model can accurately predict both known and unknown values, which helps to prevent overfitting
K-NEAREST NEIGHBORS (KNN)

• In its simplest form, KNN creates a scatter plot with training data, plots a new unobserved value, and assigns a class (DV) based on the classes of nearby points
• K represents the number of nearby points (or “neighbors”) the model will consider when making a prediction

[Figure: scatter plot of Age (x-axis, 20-75) vs. HH Income (y-axis, $10,000-$100,000), with points labeled Purchase / No Purchase]

Examples:
• Unobserved value X = 28, Y = $85,000 with K = 10: the neighbors include 9 Purchase and 1 No Purchase, so the prediction is PURCHASE (90% Confidence)
• Unobserved value X = 44, Y = $40,000 with K = 20: the neighbors include 6 Purchase and 14 No Purchase, so the prediction is NO PURCHASE (70% Confidence)
• Unobserved value X = 50, Y = $80,000 with K = 6: the neighbors include 3 Purchase and 3 No Purchase, so the prediction is a tie (???); choosing an odd value of K avoids ties
Example use cases:
• Predicting purchase probability for prospects in a marketing database
• Credit risk scoring in banking or financial industries
NAÏVE BAYES

Each record represents a customer. “Subscribed to Newsletter”, “Followed FB Page”, and “Visited Website” are the independent variables (IVs) which will help us make a prediction, and “Purchase?” is the dependent variable (DV) we are predicting.

Cust. ID | Subscribed to Newsletter | Followed FB Page | Visited Website | Purchase?
1 | 0 | 1 | 0 | 1
2 | 1 | 0 | 1 | 0
3 | 1 | 1 | 0 | 1
4 | 0 | 0 | 0 | 0
5 | 1 | 0 | 0 | 0
6 | 1 | 1 | 1 | 1
7 | 1 | 0 | 1 | 0
8 | 0 | 1 | 1 | 1
9 | 1 | 0 | 1 | 1
10 | 1 | 1 | 0 | 0
11 | 0 | 1 | 1 | ???

Customers 1-10 are our observed values, which we use to train the model; customer 11 is an unobserved value, which our model will predict.
To make a prediction, Naïve Bayes first builds frequency tables between each IV and the DV, counting how often each IV value (1/0) occurs for purchasers vs. non-purchasers:

NEWS:  News = 1 → 3 Purchase, 4 No Purchase;  News = 0 → 2 Purchase, 1 No Purchase
FB:    FB = 1 → 4 Purchase, 1 No Purchase;    FB = 0 → 1 Purchase, 4 No Purchase
SITE:  Site = 1 → 3 Purchase, 2 No Purchase;  Site = 0 → 2 Purchase, 3 No Purchase
Overall: 50% Purchase, 50% No Purchase

Frequency tables give us the conditional probability of each outcome
• For example, P (News | Purchase) tells us the probability that a customer subscribed to the newsletter, given (or conditioned on) the fact that they purchased

Probability of each independent variable, given that a purchase was made:
P (News | Purchase) = 60%       P (No News | Purchase) = 40%
P (FB | Purchase) = 80%         P (No FB | Purchase) = 20%
P (Site | Purchase) = 60%       P (No Site | Purchase) = 40%

Probability of each independent variable, given that no purchase was made:
P (News | No Purchase) = 80%    P (No News | No Purchase) = 20%
P (FB | No Purchase) = 20%      P (No FB | No Purchase) = 80%
P (Site | No Purchase) = 40%    P (No Site | No Purchase) = 60%

Overall probability of purchase:
P (Purchase) = 50%              P (No Purchase) = 50%

We then score the unobserved value (NEWS = 0, FB = 1, SITE = 1) against each outcome:

Prob. given Purchase = P (No News | Purchase) x P (FB | Purchase) x P (Site | Purchase) x P (Purchase)
= 40% x 80% x 60% x 50% = 9.6%

Prob. given No Purchase = P (No News | No Purchase) x P (FB | No Purchase) x P (Site | No Purchase) x P (No Purchase)
= 20% x 20% x 40% x 50% = 0.8%

Overall Purchase Probability = 9.6% / (9.6% + 0.8%) = 92.3%  →  PREDICTION: PURCHASE
Common questions:

Who has time to work through all this math?
• This is where machine learning comes in; the model handles these calculations automatically

Doesn’t multiplying many probabilities lead to very small values?
• Yes, which is why the relative probability is so important; even so, Naïve Bayes can struggle with a very large number of IVs
CASE STUDY: NAÏVE BAYES

THE SITUATION: You’ve just been promoted to Marketing Manager at Cat Slacks, a global retail powerhouse specializing in high-quality pants for cats.

DECISION TREES

Q: Will this subscriber churn next month?
Dependent variable: CHURN (starting point: 10 YES, 10 NO)
Independent Variables: Age, Gender, HH Income, Sign-Up Date, Last Log-In

• Splitting on Gender (M vs. F) yields 4 YES / 4 NO and 6 YES / 6 NO. Splitting on Gender isn’t effective, since our classes are still evenly distributed after the split (in other words, Gender isn’t a good predictor for churn)
• Splitting on Last Log-In (<7 days vs. >7 days) yields 4 YES / 8 NO and 7 YES / 1 NO, which separates the classes much more effectively
• This is where Machine Learning comes in; the model tries all IVs to determine which splits reduce entropy the most
DECISION TREES: ENTROPY

ENTROPY = −P₁ · log₂(P₁) − P₂ · log₂(P₂)

where P₁ = Probability of Class 1 (Count of Class 1 / N) and P₂ = Probability of Class 2 (Count of Class 2 / N)

Entropy is a curve between 0 and 1:
• Entropy = 1 (max) when there is a 50/50 split between classes (P₁ = 0.50)
• Entropy = 0 (min) when all observations belong to a single class (P₁ = 0 or P₂ = 0)
The reduction in entropy after a good split (like Last Log-In) tells us that we’re gaining information, and teasing out differences between those who churn and those who do not.

Each split creates a decision node; further splits (i.e. Sign-Up Date, <60 days vs. >60 days) grow the tree, and an unobserved value is classified by following the splits down to a leaf node.
Does the best first split always lead to the most accurate model?
• Not necessarily! That’s why we often use a collection of multiple decision trees, known as a random forest, to maximize accuracy
RANDOM FORESTS

A Random Forest is a collection of individual decision trees, each built using a random subset of observations
• Each decision tree randomly selects variables to evaluate at each split (i.e. Age, Sign-Up, Log-In, LTV)
• Each tree produces a prediction, and the mode, or most frequent prediction, wins

[Figure: Tree 1, Tree 2, Tree 3 ... Tree N each producing a prediction (i.e. CHURN), with the most frequent prediction winning]
CASE STUDY: DECISION TREES

THE SITUATION: You are the founder and CEO of Trip Genie, an online subscription service designed to connect global travelers with local guides.

THE ASSIGNMENT: You’d like to better understand your customers and identify which types of behaviors can be used to help predict paid subscriptions. Your goal is to build a decision tree to help you predict subscriptions based on multiple factors (newsletter, Facebook follow, time on site, sessions, etc.).
LOGISTIC REGRESSION

[Figure: scatter plot with X1 on the x-axis and a binary outcome on the y-axis – TRUE (1) at the top, FALSE (0) at the bottom, with 0.5 marked in between]

• Each dot represents an observed value, where X is a numerical independent variable and Y is the binary outcome (true/false) that we want to predict
• Logistic regression plots the best-fitting curve between 0 and 1, which tells us the probability of Y being TRUE for any given value of X1

SPAM EXAMPLE

[Figure: probability curve from NOT SPAM (0) to SPAM (1) plotted against # Recipients (5, 10, 15, 20, 25)]

• Here we’re using logistic regression to predict if an email will be marked as spam, based on the number of email recipients (X1)
• Using this model, we can test unobserved values of X1 to predict the probability that Y is true or false (in this case the probability that an email is marked as spam); for example, one email scores P = 95% (Prediction: SPAM) while another scores P = 1% (Prediction: NOT SPAM)
CHOOSING A DECISION POINT

In the spam example, the risk of categorizing a real email as spam is high, so our decision point may be >50%. By increasing the threshold to 90%, we:
1. Correctly predict more of the legit (not spam) emails, which is really important
2. Incorrectly mark a few spam emails as “not spam”, which is not a big deal

Is 50% always the right decision point for logistic models?
• No. It depends on the relative risk of a false positive (incorrectly predicting a TRUE outcome) or false negative (incorrectly predicting a FALSE outcome)

Now consider a case where we’re predicting if an employee will QUIT (vs. STAY) based on % negative feedback from HR
• It’s easier to train an employee than hire a new one, so the risk of a false positive is low but the risk of a false negative (incorrectly predicting someone will stay) is high
• In this case the decision point may be <50%. By decreasing the threshold to 10%, we:
  1. Correctly flag more of the employees who are actually at risk of quitting, which is really important
  2. Incorrectly flag some employees who plan to stay, which is not a big deal
THE LOGISTIC FUNCTION

P(Y = 1) = 1 / (1 + e^−(β₀ + β₁x₁))

• The 1 / (1 + e^−(...)) form makes the output fall between 0 and 1
• β₀ + β₁x₁ is a linear equation, where β₀ is the intercept, x₁ is the IV value, and β₁ is the weight (or slope)
• Changing β₁ changes the steepness of the curve (i.e. β₁ = 0.5, 1.0, 5.0 produce progressively steeper S-curves)
• Changing β₀ shifts the curve left or right along X1 (i.e. β₀ = 0, 2, -2)
LIKELIHOOD

Comparing the Actual Observation (Y = 1 or Y = 0) against the Model Output (~1 or ~0):
• When our model output is close to the actual Y, we want likelihood to be HIGH (near 1)
• When our model output is far from the actual Y, we want likelihood to be LOW (near 0)

LIKELIHOOD FUNCTION:   likelihood = outputʸ · (1 − output)¹⁻ʸ

[Figure: spam probability curve with observations near the curve labeled “high likelihood” and observations far from the curve labeled “low likelihood”]

• Observations closest to the curve have the highest likelihood values (and vice versa), so maximizing total likelihood allows us to find the curve that fits our data best
• # of Recipients can help us detect spam, but so can other variables like the number of typos, count of words like “free” or “bonus”, sender reputation score, etc.
MULTIPLE LOGISTIC REGRESSION

P(Y = 1) = 1 / (1 + e^−(β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ))

• The weighted independent variables (x₁, x₂ ... xₙ) all contribute to the prediction, and the output still falls between 0 and 1
• Logistic regression is about finding the best combination of weights (β₁, β₂ ... βₙ) for a given set of independent variables (x₁, x₂ ... xₙ) to maximize the likelihood function
CASE STUDY: LOGISTIC REGRESSION

THE SITUATION: You’ve just been promoted to Marketing Manager for Lux Dining, a wildly popular international food blog.

SENTIMENT ANALYSIS

• Sentiment analysis often falls under Natural Language Processing (NLP), but is typically applied as a classification technique
• Sentiment models typically use a “bag of words” approach, which involves calculating the frequency or presence (1/0) of key words to convert text (which models struggle with) into numerical inputs

Example use cases:
• Understanding the tone of product reviews posted by customers
• Analyzing open-ended survey responses

A word cloud is *technically* a very simple version of sentiment analysis. Limitations:
• Based entirely on word count
• Straight-forward but not flexible
SENTIMENT ANALYSIS: TEXT CLEANING

The first step in any sentiment analysis is to clean and QA the text to remove noise and isolate the most meaningful information:
• Remove punctuation, capitalization and special characters
• Correct spelling and grammatical errors
• Use proper encoding (i.e. UTF-8)
• Lemmatize or stem (remove grammar tense, convert to “root” term)
• Remove stop words (“a”, “the”, “or”, “of”, “are”, etc.)

Example:
“The computer is running hot because I’m mining bitcoin!”
→ “the computer is running hot because i’m mining bitcoin”
→ “computer run hot mine bitcoin”

NOTE: This process can vary based on the context; for example, you may want to preserve capitalization or punctuation if you care about measuring intensity (i.e. “GREAT!!” vs. “great”), or choose to allow specific stop words or special characters
SENTIMENT ANALYSIS: BAG OF WORDS

Once the text has been cleaned, we can transform our text into numeric data using a “bag of words” approach:
• Split cleaned text into individual words (this is known as tokenization)
• Create a new column with a binary flag (1/0) for each word
• Manually assign sentiment for observations in your Training data
• Apply any classification technique to predict sentiment for unobserved text

Each word becomes an independent variable, and sentiment is our dependent variable. Examples:
• “If you like watching paint dry, you’ll love this movie!” → cleaned tokens: like, watch, paint, dry, love, movie
• “The new version is awfully good, not as bad as expected!” → cleaned tokens: new, version, awful, good, bad, expect
CASE STUDY: SENTIMENT ANALYSIS

THE SITUATION: You’re an accomplished author and creator of the hit series Bark Twain the Data Dog, featuring a feisty chihuahua who uses machine learning to solve crimes.

THE ASSIGNMENT: Reviews for the latest book in the Bark Twain series are coming in, and they aren’t looking great... To apply a bit more rigor to your analysis and automate scoring for future reviews, you’ve decided to use this feedback to build a basic sentiment model.

MODEL SELECTION & TUNING

With several candidate classifiers available (K-Nearest Neighbors, Naïve Bayes, Decision Trees, Random Forests, Logistic Regression), the remaining topics cover how to select and tune the best model: imbalanced classes, the confusion matrix, model selection, and model drift.
IMBALANCED CLASSES

When one class is much rarer than the other, consider rebalancing the training data:

Up-sampling: Minority class observations are duplicated to balance the data

Down-sampling: Majority class observations are randomly removed to balance the data

Weighting: For models that randomly sample observations (i.e. random forests), increase the probability of selecting the minority class
CONFUSION MATRIX

                      ACTUAL CLASS
                      1                     0
PREDICTED CLASS  1    True Positive (TP)    False Positive (FP)
                 0    False Negative (FN)   True Negative (TN)

Accuracy = (TP+TN) / (TP+TN+FP+FN)   Of all predictions, what % were correct?
Precision = TP / (TP+FP)             Of all predicted positives, what % were correct?
Recall = TP / (TP+FN)                Of all actual positives, what % were predicted correctly?

Example (TP = 100, FP = 5, FN = 15, TN = 50):
Accuracy = (100+50) / (100+50+5+15) = .88
Precision = 100 / (100+5) = .95
Recall = 100 / (100+15) = .87
CONFUSION MATRIX: MULTIPLE CLASSES

A confusion matrix can also be built for multi-class predictions (i.e. Products A, B, C, D), with predicted classes as rows and actual classes as columns. To score it, calculate metrics for each predicted class, then take a weighted average to evaluate the model as a whole. Per-class examples from the product matrix:

• One class: Accuracy = .9846, Precision = TP / (TP + FP) = 214 / (214 + 1 + 8 + 2) = .9511, Recall = TP / (TP + FN) = 214 / (214 + 15 + 3) = .9224 (here, True Negatives include cases where Product A was neither predicted nor observed)
• Another class: Accuracy = .9899, Precision = 452 / (452 + 15 + 1) = .9658, Recall = 452 / (452 + 1 + 2) = .9934
• Another class: Accuracy = (TP + TN) / (TP + TN + FP + FN) = (1123 + 214 + 1 + 2 + 15 + 452 + 3 + 2 + 34) / 1886 = .9788, Precision = 1123 / (1123 + 19) = .9834, Recall = 1123 / (1123 + 8 + 1 + 12) = .9816
• Another class: Accuracy = .9799, Precision = 34 / (34 + 3 + 2 + 12) = .6667, Recall = 34 / (34 + 2 + 19) = .6182
Per-class summary (a weighted average by # of observations evaluates the model as a whole):

# Obs. | Accuracy | Precision | Recall
1,144 | .9788 | .9834 | .9816
207 | .5498 | .2239 | .4348

Choosing the right metric for model selection:
• PRECISION may be the best metric if false negatives aren’t a big deal, but false positives are a major risk (i.e. spam filter or document search)
• ACCURACY may be the best metric if you care about predicting positive and negative outcomes equally, or if the risk of each outcome is comparable
MODEL DRIFT

Drift is when a trained model gradually becomes less accurate over time, even when all variables and parameters remain the same
• As a best practice, all ML models used for ongoing prediction should be updated or retrained on a regular basis to combat drift
• If you notice drift compared to your benchmark, retrain the model using updated Training data and consider discarding old records (if you have enough volume)
• Conduct additional feature engineering as necessary
PART 3:

ABOUT THIS SERIES

This is Part 3 of a 4-Part series designed to help you build a deep, foundational understanding of machine learning, including data QA & profiling, classification, forecasting and unsupervised learning.

MACHINE LEARNING LANDSCAPE
• Classification: support vector machines, gradient boosting, neural nets/deep learning, etc.
• Regression: LASSO/RIDGE, state-space, advanced generalized linear methods, VAR, DFA, etc.
• Clustering/Segmentation: K-Means
• Reinforcement Learning: Q-learning, deep RL, multi-armed bandit, etc.
• Deep Learning: Feed Forward, Convolutional, RNN/LSTM, Attention, Deep RL, Autoencoder, GAN, etc.
𝒚 Dependent variable (DV)
• This is the variable you’re trying to predict – for regression, this must be numerical (not categorical)!
• The dependent variable is commonly referred to as the “Y”, “predicted”, “output”, or “target” variable
• Regression is about understanding how the numerical DV is impacted by, or dependent on, other variables in the model

𝐱 Independent variables (IVs)
• These are the variables which help you predict the dependent variable
• Independent variables are commonly referred to as “X’s”, “predictors”, “features”, “explanatory variables”, or “covariates”
• Regression is about understanding how the IVs impact, or predict, the DV
REGRESSION 101

EXAMPLE: Using marketing and sales data (sample below) to predict revenue for a given month

Social Posts, Competitive Activity, Marketing Spend and Promotion Count are all independent variables, since they can help us explain, or predict, monthly Revenue.

Measurement Planning → Preliminary QA → Data Profiling: Remember, these steps ALWAYS come first. Before building a model, you should have a clear measurement plan (KPIs, project scope, desired outcome, etc.) and an understanding of the data at hand (variable types, table structure, data quality, profiling metrics, etc.)

Then:
• Feature Engineering: Adding new, calculated variables (or “features”) to a data set based on existing fields
• Splitting: Splitting records into “Training” and “Test” data sets, to validate accuracy and avoid overfitting
• Modeling: Building regression models from Training data and applying to Test data to maximize prediction accuracy
• Model Selection & Tuning: Choosing the best performing model for a given prediction, and tuning it to prevent drift over time

FEATURE ENGINEERING

Original features: Month ID, Social Posts, Competitive Activity, Marketing Spend, Promotion Count, Revenue
Engineered features: Competitive High / Medium / Low flags, “Promotion > 10 & Social > 25” flag, Log Spend

Example row: Month 1 – 30 Social Posts, High Competitive Activity, $130,000 Marketing Spend, 12 Promotions, $1,300,050 Revenue → Competitive High = 1, Competitive Medium = 0, Competitive Low = 0, Promotion > 10 & Social > 25 = 1, Log Spend = 11.7

SPLITTING

Splitting is the process of partitioning data into separate sets of records for the purpose of training and testing machine learning models
• As a rule of thumb, ~70-80% of your data will be used for Training (which is what your model learns from), and ~20-30% will be reserved for Testing (to validate the model’s accuracy)
• Using Training data for optimization and Test data for validation ensures that your model can accurately predict both known and unknown values, which helps to prevent overfitting
There are two common use cases for linear regression: prediction and root-cause analysis

PREDICTION
• Used to predict or forecast a numerical dependent variable
• Goal is to make accurate predictions, even if causality cannot necessarily be proven

ROOT-CAUSE ANALYSIS
• Used to determine the causal impact of individual model inputs
• Goal is to prove causality by comparing the sensitivity of each IV on the DV
REGRESSION MODELING

In this section we’ll introduce the basics of regression modeling, including linear relationships, least squared error, simple and multiple regression, and non-linear models.

LINEAR RELATIONSHIPS

A linear relationship takes the form y = 𝘢 + βx, where y is the Y value, 𝘢 is the intercept, and β is the slope (rise/run)
• NOTE: Not all relationships are linear (more on that later!)

Common examples:
• Taxi Mileage (X) and Total Fare (Y)
• Units Sold (X) and Total Revenue (Y)

[Figure: example scatter plots showing a Positive Linear relationship, a Negative Linear relationship, a Non-Linear (logarithmic) relationship, and No Relationship]
Consider a line that fits every single point in the plot. This is known as a perfectly linear relationship. In this case you can simply calculate the exact value of Y for any given value of X (no Machine Learning needed, just simple math!)

In the real world, things aren’t quite so simple. When you add variance, it means that many different lines could potentially fit through the plot. To find the equation of the line with the best possible fit, we can use a technique known as least squared error or “least squares”.

LEAST SQUARED ERROR

Least squared error is used to mathematically determine the line that best fits through a series of data points
• Imagine drawing a line through a scatterplot, and measuring the distance between your line and each point (these distances are called errors, or residuals)
• Now square each of those residuals, add them all up, and adjust your line until you’ve minimized that sum; this is how least squares works!

Why “squared” error?
• Squaring the residuals converts them into positive values, and prevents positive and negative distances from cancelling each other out (this helps the model optimize more efficiently, too)
LEAST SQUARED
ERROR
  X:          10   20   30   35   40   50   60   65   70   80
  Y (actual): 10   25   20   30   40   15   40   30   50   40

STEP 1: Plot each data point on a scatterplot, and record the X and Y values
LEAST SQUARED
ERROR
y = 10 + 0.5x

  X    Y (actual)   Y (line)
  10   10           15
  20   25           20
  30   20           25
  35   30           27.5
  40   40           30
  50   15           35
  60   40           40
  65   30           42.5
  70   50           45
  80   40           50

STEP 2: Draw a straight line through the points in the scatterplot, and calculate the Y values derived by your linear equation
LEAST SQUARED
ERROR
y = 10 + 0.5x

  X    Y (actual)   Y (line)   Error
  10   10           15           5
  20   25           20          -5
  30   20           25           5
  35   30           27.5        -2.5
  40   40           30         -10
  50   15           35          20
  60   40           40           0
  65   30           42.5        12.5
  70   50           45          -5
  80   40           50          10

STEP 3: For each value of X, calculate the error (or residual) by comparing the actual Y value against the Y value produced by your linear equation
LEAST SQUARED
ERROR
y = 10 + 0.5x

  X    Y (actual)   Y (line)   Error   Sq. Error
  10   10           15           5       25
  20   25           20          -5       25
  30   20           25           5       25
  35   30           27.5        -2.5      6.25
  40   40           30         -10      100
  50   15           35          20      400
  60   40           40           0        0
  65   30           42.5        12.5    156.25
  70   50           45          -5       25
  80   40           50          10      100

STEP 4: Square each error (residual), then sum them to calculate the sum of squared error for your line (862.5 in this example)

STEP 5: Plot a new line, repeat Steps 1-4, and continue the process until you've found the line that minimizes the sum of squared error (here, the improved line y = 12 + 0.4x brings the SUM OF SQUARED ERROR down to 722)
• This is where Machine Learning comes in; human trial-and-error is completely impractical, but machines can find an optimal linear equation in seconds
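As a rough illustration (not from the original materials), this sketch reproduces the SSE calculation for the two candidate lines above and then lets NumPy find the least-squares line directly.

```python
# Minimal sketch: compute SSE for candidate lines, then let the machine find the best fit
import numpy as np

x = np.array([10, 20, 30, 35, 40, 50, 60, 65, 70, 80])
y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40])

def sse(intercept, slope):
    residuals = (intercept + slope * x) - y
    return np.sum(residuals ** 2)

print(sse(10, 0.5))   # first guess:  y = 10 + 0.5x  -> 862.5
print(sse(12, 0.4))   # better line:  y = 12 + 0.4x  -> 722.0

# np.polyfit minimizes squared error directly (no trial and error needed)
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope, sse(intercept, slope))
```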
UNIVARIATE LINEAR
REGRESSION
Univariate ("simple") linear regression is used for predicting a numerical output (DV) based on a single independent variable
• Univariate linear regression is simply an extension of least squares; you use the linear equation that minimizes SSE to predict an output (Y) for any given input (X)

y = 𝘢 + βx + 𝜀
• y = dependent variable (DV)
• 𝘢 = Y-intercept
• β = coefficient/parameter (sensitivity of Y to X)
• x = independent variable (IV)
• 𝜀 = error/residual
This is just the equation of a line, plus an error term

Simple linear regression is rarely used on its own; think of it as a primer for understanding more complex topics like non-linear and multiple regression
CASE STUDY: UNIVARIATE LINEAR
REGRESSION
THE You are the proud owner of The Cone Zone, a mobile ice cream
SITUATION cart operating on the Atlantic City boardwalk.
You’ve noticed that you tend to sell more ice cream on hot days, and want to
THE understand how temperature and sales relate. Your goal is to build a simple linear
ASSIGNMENT regression model that you can use to predict sales based on the weather
forecast.
Multiple regression can scale well beyond 2 variables, but this is where visual analysis breaks down (and why we need machine learning!)
MULTIPLE LINEAR
REGRESSION
EXAMPLE: You are preparing to list a new property on AirBnB, and want to estimate (or predict) an appropriate price using the listing data below

MODEL 1: Predict price (Y) based on accommodation (X1):
Y = 55.71 + (16.6 * X1)

MODEL 2: Predict price (Y) based on accommodation (X1) and number of bedrooms (X2)
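To make the two models concrete, here's a hedged sketch using statsmodels; the listing data is invented for illustration, so the fitted coefficients will not match the figures above.

```python
# Minimal sketch: simple vs. multiple linear regression with statsmodels (hypothetical data)
import pandas as pd
import statsmodels.formula.api as smf

listings = pd.DataFrame({
    "price":        [95, 120, 180, 250, 310, 150, 200, 275],
    "accommodates": [2,   3,   4,   6,   8,   3,   5,   6],
    "bedrooms":     [1,   1,   2,   3,   4,   2,   2,   3],
})

# MODEL 1: price ~ accommodates
model1 = smf.ols("price ~ accommodates", data=listings).fit()

# MODEL 2: price ~ accommodates + bedrooms
model2 = smf.ols("price ~ accommodates + bedrooms", data=listings).fit()

print(model1.params)   # intercept + coefficient for accommodates
print(model2.params)   # intercept + two coefficients
print(model2.predict(pd.DataFrame({"accommodates": [4], "bedrooms": [2]})))
```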
y = 𝘢 + β*ln(x) + 𝜀
• 𝘢 = Y-intercept
• ln(x) = log-transformed independent variable (IV)
• 𝜀 = error/residual

All we're really doing is transforming the data to create linear relationships between each IV and the DV, then applying a standard linear regression model using those transformed values
NON-LINEAR
REGRESSION
EXAMPLE #1: You are predicting Sales (Y) using Marketing Spend (X). As you spend more on marketing, the impact on sales eventually begins to diminish.

The relationship between Sales and Marketing Spend is non-linear (logarithmic)...
y = 𝘢 + βx + 𝜀

...but the relationship between Sales and the log of Marketing Spend is linear!
y = 𝘢 + β*ln(x) + 𝜀
NON-LINEAR
REGRESSION
EXAMPLE #2: You are predicting population growth (Y) over time (X) and notice an increasing rate of growth as the population size increases.

The relationship between Time and Population is non-linear (exponential)...
y = 𝘢 + βx + 𝜀

...but the relationship between Time and the log of Population is linear!
ln(y) = 𝘢 + βx + 𝜀

NOTE: There are multiple ways to transform variables based on the type of relationship (log, exponential, cubic, etc.), and multiple techniques to model them (more on that later!)
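The transformation idea can be sketched in a few lines; the spend/sales and population figures below are synthetic, chosen only to show the log transforms at work.

```python
# Minimal sketch: fitting non-linear relationships by transforming variables (illustrative data)
import numpy as np

# Diminishing returns: Sales vs. Marketing Spend -> regress Sales on ln(Spend)
spend = np.array([1000, 2000, 5000, 10000, 20000, 50000], dtype=float)
sales = np.array([12, 18, 26, 31, 37, 44], dtype=float)
beta, alpha = np.polyfit(np.log(spend), sales, deg=1)   # y = a + b*ln(x)
print(alpha, beta)

# Exponential growth: Population vs. Time -> regress ln(Population) on Time
years = np.arange(1, 11, dtype=float)
population = 100 * np.exp(0.3 * years)                      # synthetic exponential series
beta, alpha = np.polyfit(years, np.log(population), deg=1)  # ln(y) = a + b*x
print(alpha, beta)   # the slope recovers the ~0.3 growth rate
```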
CASE STUDY: NON-LINEAR
REGRESSION
THE You work as a Marketing Analyst for Maven Marketing, an
SITUATION international advertising agency based in London.
Your client has asked you to help set media budgets and estimate sales for
THE an upcoming campaign. Using historical ad spend and revenue, your goal is
ASSIGNMENT to build a regression model to help predict campaign performance.
OBJECTIVES 2. Create a linear regression, and gauge the fit. Does it look accurate?
R-Squared
Mean Error Metrics
Homoskedasticity
F-Significance
P-Values & T-Statistics
Multicollinearity
Variance Inflation
R-SQUARED
R-Squared measures how well your model explains the variance in the dependent variable you are predicting
• The higher the R-Squared, the "better" your model predicts variance in the DV, and the more confident you can be in the accuracy of your predictions
• Adjusted R-Squared is often used as it "penalizes" the R-squared value based on the number of variables included in the model
R-SQUARED
EXAMPLE
y = 12 + 0.4x

  x    y    prediction   (y − prediction)²   y̅    (y − y̅)²
  10   10   16            36                 30   400
  20   25   20            25                 30    25
  30   20   24            16                 30   100
  35   30   26            16                 30     0
  40   40   28           144                 30   100
  50   15   32           289                 30   225
  60   40   36            16                 30   100
  65   30   38            64                 30     0
  70   50   40           100                 30   400
  80   40   44            16                 30   100

  SSE = Σᵢ (yᵢ − predictionᵢ)² = 722
  TSS = Σᵢ (yᵢ − y̅)² = 1,450

  R² = 1 − SSE / TSS = 1 − 722 / 1,450 ≈ 0.50
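If you want to verify the arithmetic, this minimal sketch recomputes SSE, TSS and R-squared for the table above.

```python
# Minimal sketch: computing R-squared by hand for the y = 12 + 0.4x example
import numpy as np

x = np.array([10, 20, 30, 35, 40, 50, 60, 65, 70, 80])
y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40])
prediction = 12 + 0.4 * x

sse = np.sum((y - prediction) ** 2)   # 722.0
tss = np.sum((y - y.mean()) ** 2)     # 1450.0
r_squared = 1 - sse / tss             # ~0.50
print(sse, tss, round(r_squared, 2))
```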
MEAN ERROR
METRICS
Mean error metrics measure how well your regression model predicts, as opposed to how well it explains variance (like R-Squared)
• There are many variations, but the most common ones are Mean Squared Error (MSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE)
• These metrics provide "standards" which can be used to compare predictive accuracy across multiple regression models

  MSE  = Σᵢ (yᵢ − predictionᵢ)² / n          Average of the squared distance between actual & predicted values
  MAE  = Σᵢ |yᵢ − predictionᵢ| / n           Average of the absolute distance between actual & predicted values
  MAPE = Σᵢ (|yᵢ − predictionᵢ| / yᵢ) / n    Mean Absolute Error, converted to a percentage

Mean error metrics can be used to evaluate regression models just like performance metrics like accuracy, precision and recall can be used to evaluate classification models
MSE
EXAMPLE
y = 12 + 0.4x

  X    Y (actual)   Y (line)   Error   Sq. Error
  10   10           16           6       36
  20   25           20          -5       25
  30   20           24           4       16
  35   30           26          -4       16
  40   40           28         -12      144
  50   15           32          17      289
  60   40           36          -4       16
  65   30           38           8       64
  70   50           40         -10      100
  80   40           44           4       16

  SUM OF SQUARED ERROR: 722

  MSE = Σᵢ (yᵢ − predictionᵢ)² / n = 722 / 10 = 72.2
MAE
EXAMPLE
y = 12 + 0.4x

  X    Y (actual)   Y (line)   Error   Abs. Error
  10   10           16           6        6
  20   25           20          -5        5
  30   20           24           4        4
  35   30           26          -4        4
  40   40           28         -12       12
  50   15           32          17       17
  60   40           36          -4        4
  65   30           38           8        8
  70   50           40         -10       10
  80   40           44           4        4

  SUM OF ABSOLUTE ERROR: 74

  MAE = Σᵢ |yᵢ − predictionᵢ| / n = 74 / 10 = 7.4
MAPE
EXAMPLE
y = 12 + 0.4x

  X    Y (actual)   Y (line)   Error   Abs. Error   Abs. % Error
  10   10           16           6        6          0.6
  20   25           20          -5        5          0.2
  30   20           24           4        4          0.2
  35   30           26          -4        4          0.133
  40   40           28         -12       12          0.3
  50   15           32          17       17          1.133
  60   40           36          -4        4          0.1
  65   30           38           8        8          0.267
  70   50           40         -10       10          0.2
  80   40           44           4        4          0.1

  SUM OF ABSOLUTE % ERROR: 3.233

  MAPE = Σᵢ (|yᵢ − predictionᵢ| / yᵢ) / n = 3.233 / 10 = 32.33%
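The same ten points can be used to confirm all three error metrics; a minimal sketch:

```python
# Minimal sketch: computing MSE, MAE, and MAPE for the y = 12 + 0.4x example
import numpy as np

x = np.array([10, 20, 30, 35, 40, 50, 60, 65, 70, 80])
y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40])
prediction = 12 + 0.4 * x
error = prediction - y

mse = np.mean(error ** 2)            # 72.2
mae = np.mean(np.abs(error))         # 7.4
mape = np.mean(np.abs(error) / y)    # ~0.3233 -> 32.33%
print(mse, mae, round(mape * 100, 2))
```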
MEAN ERROR
METRICS
Sample Model
Output
When should I use each type of error metric?
• Mean Squared Error (MSE) is particularly useful when outliers or extreme values are important to predict
• Mean Absolute Error (MAE) is useful if you want to minimize the impact of outliers on model selection

PRO TIP: In general we recommend considering all of them, since they can be calculated instantly and each provides helpful context into model performance
HOMOSKEDASTICITY
Residuals that are consistent over the entire IV range indicate homoskedasticity; residuals that increase at higher IV values indicate that there is some variance that the IVs are unable to explain (heteroskedasticity)

Breusch-Pagan tests can report a formal calculation of heteroskedasticity, but usually a simple visual check is enough
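As an illustration of the formal check, here's a sketch using the Breusch-Pagan test from statsmodels on deliberately heteroskedastic synthetic data.

```python
# Minimal sketch: visual + formal heteroskedasticity check (synthetic fitted model)
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(1, 100, 200)
y = 5 + 2 * x + rng.normal(0, x * 0.5)   # noise grows with x (heteroskedastic)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Breusch-Pagan: a small p-value suggests heteroskedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(round(lm_pvalue, 4))

# The visual check: plot model.resid against x and look for a "fanning out" pattern
```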
NULL
HYPOTHESIS
Our goal is to reject the null hypothesis and prove (with a high level of confidence) that our
model can produce accurate, statistically significant predictions and not just random
outputs
*Oxford
F-STATISTIC & F-
SIGNIFICANCE
Sample Model The F-Statistic and associated P-Value (aka F-Significance) help
Output us understand the predictive power of the model as a whole
• The smaller the F-Significance, the more useful your regression is for prediction
• NOTE: It's common practice to use a P-Value of .05 (aka 95%) as a threshold to determine if a model is "statistically significant", or valid for prediction

PRO TIP: F-Significance should be the first thing you check when you evaluate a regression model; if it's above your threshold, you may need more training, and if it's below your threshold, move on to coefficient-level significance (up next!)
T-STATISTICS & P-
VALUES
T-Statistics and their associated P-Values help us understand the predictive power of the individual model coefficients

Common significance thresholds:
  P-Value   Confidence
  0.001     99.9% ***
  0.01      99%   **
  0.05      95%   *
MULTICOLLINEARITY
Multicollinearity occurs when two or more independent variables are highly correlated, leading to untrustworthy model coefficients
• Correlation means that one IV can be used to predict another (i.e. height and weight), leading to many combinations of coefficients that predict equally well
• This leads to unreliable coefficient estimates, and means that your model will fail to generalize when applied to non-training data

How do I measure multicollinearity, and what can I do about it?
• Variance Inflation Factor (VIF) can help you quantify the degree of multicollinearity, and determine which IVs to exclude from the model
VARIANCE INFLATION FACTOR
To calculate Variance Inflation Factor (VIF) you treat each individual IV as the dependent variable, and use the R² value to measure how well you can predict it using the other IVs in the model

Y = 𝘢 + β₁x₁ + β₂x₂ + β₃x₃ + … + βₙxₙ + 𝜀

For each IV, VIF = 1 / (1 − R²); higher values indicate stronger multicollinearity
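One possible way to compute VIF in practice (a sketch with invented columns, not the course's AirBnB data):

```python
# Minimal sketch: calculating VIF for each IV with statsmodels (hypothetical columns)
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "accommodates":   [2, 3, 4, 6, 8, 3, 5, 6, 2, 4],
    "bedrooms":       [1, 1, 2, 3, 4, 2, 2, 3, 1, 2],
    "minimum_nights": [1, 2, 2, 3, 1, 2, 5, 2, 1, 3],
})

X = sm.add_constant(df)   # compute VIF with the intercept included
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))  # rule of thumb: values above ~5-10 signal multicollinearity
```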
VARIANCE INFLATION FACTOR
Entire.place and Private.room produce high VIF values, since they essentially measure the same thing; if a listing isn't a private room, there's a high probability that it's an entire place (and vice versa)

PRO TIP: Use a frequency table to confirm correlation!

                      Private.room
                      NO       YES
  Entire.place  NO       816   15,815
                YES   13,166        0

Drop one of the two variables and... adios multicollinearity!
RECAP: SAMPLE MODEL
OUTPUT
The sample model output contains:
• Formula for the regression (variables & data set)
• Profile of Residuals/Errors
• Y-Intercept & IV coefficients
• Coefficient Standard Errors & T-Values
• Coefficient P-Values
• R² and Adjusted R²
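For reference, a fitted statsmodels model prints a summary containing each of these elements; the data below is synthetic.

```python
# Minimal sketch: a regression summary showing the elements called out above (toy data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 50), "x2": rng.uniform(0, 5, 50)})
df["y"] = 3 + 2 * df["x1"] - 1.5 * df["x2"] + rng.normal(0, 1, 50)

model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.summary())   # formula, residual profile, coefficients, std errors,
                         # t-values, p-values, R-squared and Adjusted R-squared
```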
TIME-SERIES
FORECASTING
In this section we’ll explore common time-series forecasting techniques, which
use regression models to predict future values based on seasonality and trends
Common examples:
• Forecasting revenue for the next fiscal year
• Predicting website traffic growth over time
• Estimating sales for a new product launch
SEASONALITY
• We can identify seasonal patterns using an Auto Correlation Function (ACF), then apply that seasonality to forecasts using techniques like one-hot encoding or moving averages (more on that soon!)

Common examples:
• Website traffic by hour of the day
• Seasonal product sales
• Airline ticket prices around major holidays
AUTO CORRELATION FUNCTION

An Auto Correlation Function measures how strongly a time series correlates with lagged copies of itself; spikes at regular lags reveal seasonal cycles

[Chart: autocorrelation (%) plotted against lags 0-28]

Strong correlation every 7th lag, which indicates a weekly cycle
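A minimal sketch of the same idea with statsmodels, using a synthetic daily series that has a built-in 7-day cycle:

```python
# Minimal sketch: using an ACF to spot a weekly cycle in daily data (synthetic series)
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
days = np.arange(365)
sessions = 100 + 30 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 5, 365)  # 7-day cycle

correlations = acf(sessions, nlags=28)
top_lags = sorted(range(1, 29), key=lambda lag: -correlations[lag])[:4]
print(top_lags)   # expect multiples of 7 (7, 14, 21, 28) -> a weekly cycle
```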
CASE STUDY: AUTO
CORRELATION
THE You are a Business Intelligence Analyst for Accucorp Accounting,
SITUATION a national firm specializing in tax preparation services.
• NOTE: When you use one-hot encoding, you must exclude one of the options rather than encoding all of them (it doesn't matter which one you exclude)
• Consider the equation A+B=5; there are infinite combinations of A & B values that can solve it. One-hot encoding all options creates a similar problem for regression

  Quarter ID   Revenue        Q1   Q2   Q3   Q4
  1            $1,300,050     1    0    0    0
  2            $11,233,310    0    1    0    0
  4            $1,582,077     0    0    0    1

PRO TIP: If your data contains multiple seasonal patterns (i.e. hour of day + day of week), include both dimensions in the model as one-hot encoded independent variables
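One common way to produce these dummy columns is pandas' get_dummies with drop_first=True (a sketch with illustrative revenue values):

```python
# Minimal sketch: one-hot encoding a quarter column, dropping one level (pandas)
import pandas as pd

df = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
    "revenue": [1300050, 11233310, 1450000, 1582077, 1375000, 10800000],  # illustrative
})

# drop_first=True excludes one option (here Q1) to avoid the A + B = 5 problem
encoded = pd.get_dummies(df, columns=["quarter"], drop_first=True)
print(encoded.head())
```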
CASE STUDY: ONE-HOT
ENCODING
THE You're a Senior Analyst for Weather Trends, a Brazilian
SITUATION weather station boasting the longest and most accurate forecasts
in the biz.
You’ve been asked to help prepare temperature forecasts for the upcoming
THE year. To do this, you’ll need to analyze ~5 years of historical data from Rio
ASSIGNMENT de Janeiro, and use regression to predict monthly average temperatures.
Forecasting 101
What if the data includes both seasonality and a linear trend?

Trend describes an overarching direction or movement in a time series, not counting seasonality
• Trends are often linear (up/down), but can be non-linear as well (more on that later!)
• To account for linear trending in a regression, you can include a time step IV; this is simply an index value that starts at 1 and increments with each time period
• If the time step coefficient isn't statistically significant, it means you don't have a meaningful linear trend

PRO TIP: It's common for time-series models to include trending AND seasonality; in this case, use a combination of one-hot encoding and time step variables to account for both!
LINEAR
TRENDING
What if the data includes both seasonality and a linear trend?

  Month   Revenue      T-Step   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov
  Jan     $1,300,050   1        1    0    0    0    0    0    0    0    0    0    0
  Feb     $1,233,310   2        0    1    0    0    0    0    0    0    0    0    0
  Mar     $1,112,050   3        0    0    1    0    0    0    0    0    0    0    0
  Apr     $1,582,077   4        0    0    0    1    0    0    0    0    0    0    0
  May     $1,776,392   5        0    0    0    0    1    0    0    0    0    0    0
  Jun     $2,110,201   6        0    0    0    0    0    1    0    0    0    0    0
  Jul     $1,928,290   7        0    0    0    0    0    0    1    0    0    0    0
  Aug     $2,250,293   8        0    0    0    0    0    0    0    1    0    0    0
  Sep     $2,120,050   9        0    0    0    0    0    0    0    0    1    0    0
  Dec     $2,739,022   12       0    0    0    0    0    0    0    0    0    0    0
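A hedged sketch of a model with both a time step and month dummies (synthetic revenue, so the coefficients are illustrative only):

```python
# Minimal sketch: regression with a time step plus one-hot month dummies (statsmodels)
import pandas as pd
import statsmodels.formula.api as smf

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"] * 3
df = pd.DataFrame({"month": months})
df["t_step"] = range(1, len(df) + 1)                 # index that increments each period
df["revenue"] = 1_000_000 + 25_000 * df["t_step"] + df["month"].map(
    {"Dec": 400_000, "Jun": 150_000}).fillna(0)      # illustrative trend + seasonality

# C(month) one-hot encodes the month and automatically drops one level
model = smf.ols("revenue ~ t_step + C(month)", data=df).fit()
print(model.params["t_step"])   # recovers the ~25,000 per-period trend
```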
CASE STUDY: SEASONALITY +
TREND
THE You are a Business Intelligence Analyst for Maven Muscles, a large
SITUATION national chain of fitness centers.
PRO TIP: Smoothing is a great way to expose patterns and trends that otherwise might be tough to see; make this part of your data profiling process!
SMOOTHING
Smoothing
Non-Linear Trends
Intervention
Analysis
CASE STUDY:
SMOOTHING
THE You’ve just been hired as an analytics consultant for Maven Motel, a
SITUATION struggling national motel chain.
Management has been taking steps to improve guest satisfaction, and has
THE asked you to analyze daily data to determine how ratings are trending. Your
ASSIGNMENT task is to use a moving average calculation to discern if an underlying
trend is present.
1. Collect daily average guest ratings for the motel. Do you see any
THE clear patterns or trends?
OBJECTIVES 2. Calculate a moving average, and compare various windows from 1-
12 weeks
3. Determine if an underlying trend is present. How would you describe
it?
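A moving average of this kind is a one-liner in pandas; the ratings series below is synthetic and the window sizes are just examples.

```python
# Minimal sketch: smoothing daily ratings with a rolling average (pandas, synthetic data)
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
dates = pd.date_range("2024-01-01", periods=180, freq="D")
ratings = 3.5 + 0.003 * np.arange(180) + rng.normal(0, 0.4, 180)  # slow upward trend + noise

daily = pd.Series(ratings, index=dates)
smoothed_4wk = daily.rolling(window=28).mean()    # 4-week moving average
smoothed_12wk = daily.rolling(window=84).mean()   # 12-week moving average
print(smoothed_4wk.dropna().head(3))
```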
NON-LINEAR TRENDS
Time-series data won't always follow a seasonal pattern or linear trend; it may follow a non-linear trend, or have no predictable trend at all
• There are many formulas designed to forecast common non-linear trends:

PRO TIP: ADBUDG and Gompertz are more flexible versions of a logistic curve, and are commonly seen in BI use cases (product launches, diminishing returns, etc.)
CASE STUDY: NON-LINEAR
TREND
THE The team at Cat Slacks just launched a new product poised to revolutionize
the world of feline fashion: a lightweight, breathable jogging short
SITUATION designed for active cats who refuse to compromise on quality.
THE You’ve been asked to provide a weekly sales forecast to help the
manufacturing and warehouse teams with capacity planning. You only
ASSIGNMENT have 8 weeks of data to work with, but expect the launch to follow a
logistic curve.
1. Collect sales data for the first 8 weeks since the launch
THE
2. Apply a Gompertz curve to fit a logistic trend
OBJECTIVES 3. Adjust parameters to compare various capacity limits and growth
rates
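One way to fit a Gompertz curve with only a few weeks of data is scipy's curve_fit; the weekly unit figures below are invented, and the parameterization shown is one common form of the curve, not necessarily the one used in the course files.

```python
# Minimal sketch: fitting a Gompertz curve to early launch data with scipy (synthetic weeks)
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, capacity, displacement, growth):
    # capacity = upper asymptote, growth = rate of approach to that ceiling
    return capacity * np.exp(-displacement * np.exp(-growth * t))

weeks = np.arange(1, 9)                                    # only 8 weeks of data
units = np.array([40, 90, 180, 320, 480, 640, 760, 850])   # illustrative weekly sales

params, _ = curve_fit(gompertz, weeks, units, p0=[1000, 5, 0.5], maxfev=10_000)
print(params)                                # fitted capacity, displacement, growth
print(gompertz(np.arange(9, 13), *params))   # forecast weeks 9-12
```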
INTERVENTION ANALYSIS
• By fitting a model to the "pre-intervention" data (up to the date of the change), you can compare predicted vs. actual values after that date to estimate the impact of the intervention

Common examples:
• Measuring the impact of a new website or check-out page on conversion rates
• Quantifying the impact of a new HR program to reduce employee churn
INTERVENTION ANALYSIS
STEP 1: Fit a regression model to the data, using only observations from the pre-intervention period
INTERVENTION ANALYSIS
STEP 2: Compare the predicted and observed values in the post-intervention period, and sum the daily residuals to estimate the impact of the change
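Putting both steps together, here's a minimal sketch on synthetic daily conversion data; the column names and the size of the simulated lift are assumptions for illustration.

```python
# Minimal sketch: intervention analysis with a pre-period model (synthetic daily CVR data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
df = pd.DataFrame({"date": dates})
df["t_step"] = np.arange(1, len(df) + 1)
df["dow"] = df["date"].dt.day_name()
lift = np.where(df["t_step"] > 90, 0.004, 0.0)   # intervention lands on day 91
df["cvr"] = 0.02 + 0.00002 * df["t_step"] + lift + rng.normal(0, 0.001, len(df))

pre = df[df["t_step"] <= 90]
post = df[df["t_step"] > 90]

# STEP 1: fit on pre-intervention data only (trend + day-of-week seasonality)
model = smf.ols("cvr ~ t_step + C(dow)", data=pre).fit()

# STEP 2: forecast the post-period baseline and sum the residuals
baseline = model.predict(post)
impact = (post["cvr"] - baseline).sum()
print(round(impact, 4))   # ~0.004 lift x 30 days = ~0.12 cumulative CVR impact
```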
CASE STUDY: INTERVENTION
ANALYSIS
THE You are a Web Analyst for Alpine Supplies, an online retailer
SITUATION specializing in high-end camping and hiking gear.
The company recently rolled out a new product landing page, and the
THE
CMO has asked you to help quantify the impact on conversion rate
ASSIGNMENT (CVR) and sales. You’ll need to conduct an intervention analysis, and make
sure to capture day of week seasonality and trending in your model.
1. Collect data to track sessions and CVR before and after the change
THE
2. Fit a regression model to predict CVR using data from before the
OBJECTIVES redesign
3. Use the model to forecast ”baseline” CVR after the change, and calculate
both incremental daily sales and the total cumulative impact
PART
4:
ABOUT THIS
SERIES
This is Part 4 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised
learning
3. Association Mining: Explore common techniques for association mining and when to use each of them, including Apriori and Markov models

MACHINE LEARNING LANDSCAPE

[Diagram: the machine learning landscape — Classification (K-Nearest Neighbors, Naïve Bayes, Decision Trees, Logistic Regression, Sentiment Analysis, support vector machines, gradient boosting, neural nets/deep learning, etc.), Regression (Least Squares, Linear Regression, Forecasting, Non-Linear Regression, Intervention Analysis, LASSO/RIDGE, state-space, advanced generalized linear methods, VAR, DFA, etc.), Reinforcement Learning (Q-learning, deep RL, multi-armed-bandit, etc.), Clustering/Segmentation (K-Means, Hierarchical Clustering), Association Mining (Apriori, Markov Chains), Outlier Detection (cross-sectional, time-series), Dimensionality Reduction (matrix factorization, principal components, factor analysis, UMAP, T-SNE, topological data analysis, advanced clustering, etc.), Natural Language Processing (Latent Semantic Analysis, Latent Dirichlet Analysis, relationship extraction, semantic parsing, contextual word embeddings, translation, etc.), Computer Vision (convolutional neural networks, style translation, etc.), and Deep Learning (Feed Forward, Convolutional, RNN/LSTM, Attention, Deep RL, Autoencoder, GAN, etc.)]

The unsupervised techniques covered in this part are used to DESCRIBE or ORGANIZE the data in some non-obvious way
Business Objective → Preliminary QA → Data Profiling
Remember, these steps ALWAYS come first: before building a model, you should have a clear understanding of the business objective (identifying clusters, detecting anomalies, etc.) and the data at hand (variable types, table structure, data quality, profiling metrics, etc.)

Feature Engineering → Model Application → Hyperparameter Tuning → Model Selection
• Feature Engineering: Add new, calculated variables (or "features") to the data set based on existing fields
• Model Application: Apply relevant unsupervised ML techniques, based on the objective (you will typically test multiple models)
• Hyperparameter Tuning: Adjust and tune model parameters (this is typically an iterative process)
• Model Selection: Select the model that yields the most useful or insightful results, based on the objective at hand

There often are no strict rules to determine which model is best; it's about which one helps you best answer the question at hand
RECAP: FEATURE ENGINEERING
  Original features:    Month ID = 1, Social Posts = 30, Competitive Activity = High, Marketing Spend = $130,000, Promotion Count = 12, Revenue = $1,300,050
  Engineered features:  Competitive High = 1, Competitive Medium = 0, Competitive Low = 0, Promotion >10 & Social >25 = 1, Log Spend = 11.7
Remember, there’s no “right” answer or single optimization metric when it comes to clustering and segmentation; the
best outputs are the ones which help you answer the question at hand and make practical, data-driven business
decisions
CLUSTERING BASICS

[Charts: K-Means clustering walkthrough]

Square these distances and sum them to calculate WSS for two clusters (K=2)
K-MEANS

[Chart: WSS plotted against the number of clusters (K), from 2 to 8]

Look for an "elbow" or inflection point, where adding another cluster has a relatively small impact on WSS (in this case where K=5)

PRO TIP: Think of this as a guideline, not a strict rule
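A quick way to produce the WSS-vs-K values behind an elbow chart (scikit-learn calls WSS "inertia"); the blob data is synthetic:

```python
# Minimal sketch: computing WSS (inertia) for several K values with scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=42)   # synthetic data, 5 clusters

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(model.inertia_, 1))   # inertia_ is the within-cluster sum of squares

# Plot these values against K and look for the elbow (it should appear near K=5 here)
```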
K-M EANS
Does the shape of the clusters matter?
• Yes, K-Means works best when the clusters are mostly circular in shape; but other tools like Hierarchical Clustering (up next!) can address this

*This is known as agglomerative or "bottom-up" clustering (vs. divisive or "top-down" clustering, which is much less common)
HIERARCHICAL CLUSTERING

[Chart: dendrogram showing points p1-p6 being merged, with distance on the vertical axis]

STEP 1: Find the two closest points, and group them into a cluster (5 clusters remain)
STEP 2: Find the next two closest points/clusters, and group them together (4 clusters remain)
STEP 3: Repeat the process until all points are part of the same cluster (3 clusters, then 2, then 1)
HIERARCHICAL CLUSTERING
[Chart: dendrogram — taller = longer distance]

How exactly do you define the distance between clusters?
• There are a number of valid ways to measure the distance between clusters, which are often referred to as "linkage methods"
• Common methods include measuring the closest min/max distance between clusters, the lowest average distance, or the distance between cluster centroids

K-MEANS vs. HIERARCHICAL
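A short sketch of agglomerative clustering with SciPy, using Ward linkage as one example of a linkage method; the points are synthetic.

```python
# Minimal sketch: agglomerative clustering and a dendrogram with SciPy (synthetic points)
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, centers=3, random_state=0)

# "ward" is one linkage method; "single", "complete", "average" and "centroid" also exist
merges = linkage(X, method="ward")

labels = fcluster(merges, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)

# dendrogram(merges) draws the tree described above (requires matplotlib to display)
```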
KEY TAKEAWAYS
• Explore common association mining techniques like Apriori and Markov Chains

Association mining is NOT about trying to prove or establish causation; it's just about identifying frequently occurring patterns and correlations in large datasets
APRIORI

The Apriori algorithm is commonly used to analyze how frequently items are purchased together in a single transaction (i.e. beer and diapers, peanut butter and jelly, etc.)
• In its simplest form (2-item sets), Apriori models compare the frequency of transactions containing item A, item B, and both items A and B
• This allows you to understand how often these items tend to be purchased together, and calculate the strength of the association between them

Apriori models typically include three key metrics:
• Support: frequency of transactions containing a given item (or set of items), divided by total transactions
• Confidence: support of the item set, divided by the support of the first item
• Lift: support of the item set, divided by the product of each item's individual support

EXAMPLE #1: Calculating the association between bacon & eggs (20 transactions)

1. Support(bacon) = 10/20 = 0.5
   Support(eggs) = 7/20 = 0.35

2. Confidence(bacon → eggs) = Support(bacon, eggs) / Support(bacon) = 0.3 / 0.5 = 60%

3. Lift(bacon, eggs) = Support(bacon, eggs) / (Support(bacon) × Support(eggs)) = 0.3 / (0.5 × 0.35) = 1.7

Since Lift > 1, we can interpret the association between bacon and eggs as real and informative (eggs are likely to be purchased with bacon)

*Inspired by KDNuggets
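The three metrics are easy to compute directly from a list of transactions; this sketch uses a small invented basket, so the numbers differ from the example above.

```python
# Minimal sketch: computing support, confidence, and lift from raw transactions
transactions = [
    {"bacon", "eggs"}, {"bacon", "eggs", "coffee"}, {"bacon", "bread"},
    {"eggs", "bread"}, {"bacon", "eggs"}, {"milk", "bread"},
    {"bacon", "milk"}, {"eggs", "coffee"}, {"bacon", "eggs", "bread"}, {"milk"},
]
n = len(transactions)

def support(*items):
    # share of transactions containing every requested item
    return sum(1 for t in transactions if set(items) <= t) / n

support_a  = support("bacon")
support_b  = support("eggs")
support_ab = support("bacon", "eggs")

confidence = support_ab / support_a
lift = support_ab / (support_a * support_b)
print(support_a, support_b, support_ab, confidence, round(lift, 2))
```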
APRIORI

EXAMPLE #2: Calculating the association between bacon & basil (20 transactions)

1. Support(bacon) = 10/20 = 0.5
   Support(basil) = 5/20 = 0.25

2. Confidence(bacon → basil) = Support(bacon, basil) / Support(bacon) = 0.05 / 0.5 = 10%

3. Lift(bacon, basil) = Support(bacon, basil) / (Support(bacon) × Support(basil)) = 0.05 / (0.5 × 0.25) = 0.4

Since Lift < 1, we can conclude that there is no positive association between bacon and basil (basil is unlikely to be purchased with bacon)

*Inspired by KDNuggets
APRIORI

EXAMPLE #3: Calculating the association between bacon & water (20 transactions)

1. Support(bacon) = 10/20 = 0.5
   Support(water) = 1/20 = 0.05

2. Confidence(bacon → water) = Support(bacon, water) / Support(bacon) = 0.05 / 0.5 = 10%

3. Lift(bacon, water) = Support(bacon, water) / (Support(bacon) × Support(water)) = 0.05 / (0.5 × 0.05) = 2

Since Lift = 2 we might assume a strong association between bacon and water, but this is skewed since water only appears in one transaction

*Inspired by KDNuggets
APRIORI

How do we account for infrequently purchased items?

To filter low-volume purchases, you can plot support for each item and determine a threshold or cutoff value:

[Chart: support plotted for each item]

In this case we might filter out any transactions containing items with support <0.15 (transactions 4, 6, 8, 10, 15, 17)

*Inspired by KDNuggets
APRIORI

Wouldn't you want to analyze all possible item combinations, instead of calculating individual associations?
• Yes! In reality that's how Apriori works. But since the number of configurations increases exponentially with each item, this is impractical for humans to do
• To reduce or "prune" the number of itemsets to analyze, we can use the apriori principle, which basically states that if an item is infrequent, any combinations containing that item must also be infrequent

STEP 1: Calculate support for each 1-item set, and filter out all transactions containing items below the threshold (in this case cheese)
STEP 2: Based on the remaining itemsets, calculate support for each 2-item set, and filter out transactions containing any pairs below the threshold (in this case chocolate & bread)
STEP 3: Repeat until all infrequent itemsets have been eliminated, and filter transactions accordingly
STEP 4: Based on the filtered transactions, you can use an apriori model to calculate confidence and lift and identify the strongest associations

*Inspired by KDNuggets
APRIORI

Can you calculate associations between multiple items, like coffee being purchased with bacon and eggs?
• Yes, you can calculate support, confidence and lift using the same exact logic as you would with individual items

  Confidence(bacon & eggs → coffee) = Support(bacon, eggs, coffee) / Support(bacon, eggs) = 0.2 / 0.3 = 67%

  Lift(bacon & eggs, coffee) = Support(bacon, eggs, coffee) / (Support(bacon, eggs) × Support(coffee)) = 0.2 / 0.12 = 1.7

Calculating all possible associations would be impossible for a human; this is where Machine Learning comes in!

*Inspired by KDNuggets
CASE STUDY:
APRIORI
THE You are the proud owner of Coastal Roasters, a local coffee shop
SITUATION
THE 1. Collect transaction-level data, including date, time, and items purchased
MARKOV CHAINS

  FROM \ TO    Gold   Silver   Churn
  Gold         0.80   0.15     0.05
  Silver       0.20   0.40     0.40
  Churn        0.01   0.30     0.69

[Diagram: state transition diagram between Gold, Silver, and Churn]

Example insights & recommendations:
• Most customers who churn stay churned, but 31% do come back; of those who return, nearly all of them re-subscribe to a Silver plan (vs. Gold)
  RECOMMENDATION: Launch targeted marketing to recently churned customers, offering a discount to resubscribe to a Silver membership plan
• Once customers upgrade to a Gold membership, the majority (80%) renew each month
  RECOMMENDATION: Offer a one-time discount for Silver customers to upgrade to Gold; while you may sacrifice some short-term revenue, it will likely be profitable in the long term

To account for prior transitions (vs. just the previous) you can use more complex "higher-order" Markov Chains
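Using the transition matrix above, a few lines of NumPy can project how a cohort's membership mix evolves month by month (a sketch; the matrix is taken from the table, the cohort is hypothetical).

```python
# Minimal sketch: projecting memberships forward with the transition matrix above (numpy)
import numpy as np

states = ["Gold", "Silver", "Churn"]
P = np.array([
    [0.80, 0.15, 0.05],   # from Gold
    [0.20, 0.40, 0.40],   # from Silver
    [0.01, 0.30, 0.69],   # from Churn
])

# Start with a cohort that is 100% Silver and step the chain forward month by month
distribution = np.array([0.0, 1.0, 0.0])
for month in range(1, 4):
    distribution = distribution @ P
    print(month, np.round(distribution, 3))

# Long-run (steady-state) mix: keep multiplying until the distribution stops changing
for _ in range(200):
    distribution = distribution @ P
print(np.round(distribution, 3))
```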
CASE STUDY: M ARKOV
CHAINS
THE You’ve just been promoted to Senior Web Analyst at Alpine Supplies, an
SITUATION
The VP of Sales just shared a sample of ~15,000 customer purchase paths, and
THE would like you to analyze the data to help inform a new cross-sell sales
strategy.
ASSIGNMENT
Your goal is to explore the data and build a simple Markov model to predict
which product an existing customer is most likely to purchase next.
In this section we’ll introduce the concept of statistical outliers, and review
common methods for detecting both cross-sectional and time-series outliers and
anomalies
The terms outlier detection, anomaly detection, and rare-event detection are often used interchangeably; they all focus
on finding observations which are materially different than the others (outliers are also sometimes called “pathological”
data)
OUTLIER DETECTION
BASICS
Outlier detection is used to identify observations in a dataset which are either unexpected or statistically different from the others

There are two general types of outliers you may encounter: cross-sectional and time-series
• Detecting outliers in 3+ dimensions is trickier, and often requires more sophisticated techniques; this is where machine learning comes in!
TIME-SERIES
OUTLIERS
EXAMPLE: Hourly website sessions (n=8,760)

[Chart: hourly website sessions over one year, with a handful of anomalous observations]

PRO TIP: Not all outliers are bad! If you find an anomaly, understand what happened and how you can learn from it
KEY TAKEAWAYS
For time-series analysis, you can fit a regression model to the data and
plot the residuals to clearly identify outliers
• This allows you to quickly find outliers, while controlling for seasonality and
trending
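A minimal sketch of that approach: fit a regression with trend and seasonality terms, then flag observations whose residuals are unusually large (synthetic data, with a 3-standard-deviation threshold as an example cutoff).

```python
# Minimal sketch: flagging time-series outliers from regression residuals (synthetic data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"t_step": np.arange(1, 201)})
df["dow"] = (df["t_step"] % 7).astype(str)            # simple weekly pattern
df["sessions"] = 500 + 2 * df["t_step"] + 50 * (df["dow"] == "0") + rng.normal(0, 10, 200)
df.loc[120, "sessions"] += 300                        # inject one anomaly

model = smf.ols("sessions ~ t_step + C(dow)", data=df).fit()
residuals = model.resid

# Flag anything more than 3 standard deviations from the typical residual
threshold = 3 * residuals.std()
print(df.loc[residuals.abs() > threshold, ["t_step", "sessions"]])
```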
DIMENSIONALITY REDUCTION
• In its simplest form, PCA finds lines that best fit through the observations in a data set, and uses those lines to create new dimensions to analyze

[Charts: pairwise scatterplots of test scores — Spelling (X) vs. Vocabulary (Y), Vocabulary (X) vs. Multiplication (Y), Spelling (X) vs. Multiplication (Y), Multiplication (X) vs. Geometry (Y)]
PRINCIPAL COMPONENT ANALYSIS

[Charts: component weightings across Spelling, Vocabulary, Multiplication, and Geometry]

Y = (0.72)*X1 + (0.69)*X2 + (-0.02)*X3 + (0.03)*X4
PRINCIPAL COMPONENT ANALYSIS

In this example, defining components for language and math helped us simplify and better understand the data set
• Using the new components we derived, we can conduct further analysis like predictive modeling, classification, clustering, etc.
• For example, clustering might help us understand student testing patterns (most skew towards either language or math, while a few excel in both subjects)
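For completeness, here's a hedged sketch of PCA on four test-score columns with scikit-learn; the scores are simulated so that two underlying factors (language and math) drive them.

```python
# Minimal sketch: PCA on four test-score columns with scikit-learn (simulated data)
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
language = rng.normal(70, 10, 100)
math = rng.normal(70, 10, 100)
scores = pd.DataFrame({
    "spelling":       language + rng.normal(0, 3, 100),
    "vocabulary":     language + rng.normal(0, 3, 100),
    "multiplication": math + rng.normal(0, 3, 100),
    "geometry":       math + rng.normal(0, 3, 100),
})

pca = PCA(n_components=2)
components = pca.fit_transform(StandardScaler().fit_transform(scores))

print(pca.explained_variance_ratio_)   # share of variance captured by each component
print(np.round(pca.components_, 2))    # weightings on the original four scores
```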
ADVANCED TECHNIQUES
We hope you’ve enjoyed the entire Machine Learning for BI series, and that you find
an opportunity to put your skills to good use!