Chapter 02 Overview - 4

Overview

Machine Learning for Business Analytics Using RapidMiner

Shmueli, Bruce, Deokar & Patel
© Galit Shmueli, Peter Bruce and Amit Deokar 2023
Core Ideas in Machine Learning

●Classification
●Prediction
●Association Rules & Recommenders
●Data & Dimension Reduction
●Data Exploration
●Visualization
Paradigms for Machine Learning
(variations)
●SEMMA (from SAS)
• Sample
• Explore
• Modify
• Model
• Assess
●CRISP-DM (SPSS/IBM)
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Supervised Learning
●Goal: Predict a single “target” or “outcome” variable
●Train on data where the target value is known
●Score data where the target value is not known
●Methods: Classification and Prediction
Unsupervised Learning
●Goal: Segment data into meaningful segments; detect patterns
●There is no target (outcome) variable to predict or classify
●Methods: Association rules, collaborative filters, data reduction & exploration, visualization
Supervised: Classification
●Goal: Predict categorical target (outcome) variable
●Examples: Purchase/no purchase, fraud/no
fraud, creditworthy/not creditworthy…
●Each row is a case (customer, tax return,
applicant)
●Each column is a variable
●Target variable is often binary (yes/no)
Supervised: Prediction
(Estimation)
●Goal: Predict numerical target (outcome)
variable
●Examples: sales, revenue, performance
●As in classification:
●Each row is a case (customer, tax return,
applicant)
●Each column is a variable
●Taken together, classification and
prediction constitute “predictive analytics”
Unsupervised: Association
Rules
●Goal: Produce rules that define “what goes
with what” in transactions
●Example: “If X was purchased, Y was also
purchased”
●Rows are transactions
●Used in recommender systems – “Our
records show you bought X, you may also
like Y”
●Also called “affinity analysis”
Unsupervised: Data
Reduction
●Distillation of complex/large data into
simpler/smaller data
●Reducing the number of variables/columns
(e.g., principal components)
●Reducing the number of records/rows (e.g.,
clustering)
The Process of Machine
Learning
Steps in Machine Learning
1. Define/understand purpose
2. Obtain data (may involve random
sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised learning,
partition it
5. Specify task (classification, clustering,
etc.)
6. Choose the techniques (regression, CART,
neural networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
Obtaining Data: Sampling
●Machine learning typically deals with huge databases
●For piloting/prototyping, algorithms and models are typically applied to a sample from a larger dataset (easier to handle)
●Once you develop and select a final model,
you use it to “score” (predict values or
classes for) the records in the larger
database
●Also called “inference”
Rare Event Oversampling
●Often the event of interest is rare
●Examples: response to mailing, fraud in
taxes, …
●Sampling may yield too few “interesting”
cases to effectively train a model
●A popular solution: oversample the rare
cases (equivalent to undersampling the
dominant cases) to obtain a more balanced
training set
●Later, need to adjust results for the
oversampling
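The oversampling idea above can be sketched in plain Python. This is a minimal illustration, not the book's RapidMiner workflow; the record layout and the `fraud` label are invented for the example:

```python
import random

def oversample(records, label_key="fraud", rare_value=True, seed=0):
    """Balance a training set by replicating the rare class
    (equivalent to undersampling the dominant class)."""
    rng = random.Random(seed)
    rare = [r for r in records if r[label_key] == rare_value]
    common = [r for r in records if r[label_key] != rare_value]
    # Draw with replacement from the rare cases until the classes match
    boosted = rare + [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + boosted

# Toy data: 10% of cases are fraudulent
records = [{"amount": i, "fraud": i % 10 == 0} for i in range(100)]
balanced = oversample(records)
rare_share = sum(r["fraud"] for r in balanced) / len(balanced)  # now 0.5
```

As the slide notes, any results from a model trained on the balanced set must later be adjusted for the oversampling before being reported on the original class proportions.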
Types of Variables (Features)
●Determine the types of pre-processing
needed, and algorithms used
●Main distinction: Categorical vs. numeric
●Numeric
●Continuous
●Integer
●Categorical
●Ordered (low, medium, high)
●Unordered (male, female)
Variable handling
●Numeric
●Most algorithms can handle numeric data
●May occasionally need to “bin” into
categories

●Categorical
●Naïve Bayes can use as-is
●In most other algorithms, must create n or n-
1 binary dummies
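The n−1 dummy coding described above can be sketched as follows (a minimal sketch; the function name and category levels are illustrative, and one level is dropped as the baseline):

```python
def dummy_code(values, drop_first=True):
    """Expand a categorical variable into n (or n-1) binary dummies."""
    levels = sorted(set(values))
    # Drop one level as the baseline to avoid redundancy (n-1 dummies)
    kept = levels[1:] if drop_first else levels
    return [{f"is_{lvl}": int(v == lvl) for lvl in kept} for v in values]

rows = dummy_code(["low", "high", "medium", "low"])
# "high" (alphabetically first) becomes the baseline: all dummies 0
```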
Data Pre-processing in RM - West Roxbury
data
Detecting Outliers
⚫An outlier is an observation that is “extreme”, being distant from the rest of the data (the definition of “distant” is deliberately vague)
⚫Outliers can have disproportionate
influence on models (a problem if it is
spurious)
⚫An important step in data pre-processing is
detecting outliers
⚫Once detected, domain knowledge is
required to determine if it is an error, or
truly extreme.
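One common rule for "distant" (not specified on the slides) is Tukey's IQR fence; the sketch below uses a simple index-based quartile approximation rather than a full quantile interpolation:

```python
def iqr_outliers(xs, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fence).
    Quartiles are approximated by simple indexing into the sorted data."""
    s = sorted(xs)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

values = [10, 11, 9, 10, 12, 11, 10, 9, 200]  # is 200 an error or real?
flagged = iqr_outliers(values)
```

The IQR rule is robust: unlike a z-score cutoff, a single huge value cannot inflate the spread estimate enough to mask itself. As the slide says, whether a flagged point is an error or truly extreme still requires domain knowledge.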
Detecting Outliers
⚫In some contexts, finding outliers is the
purpose of the ML exercise (e.g. airport
security screening). This is called “anomaly
detection”.
Handling Missing Data
⚫Most algorithms will not process records
with missing values. Default is to drop
those records.
⚫Solution 1: Omission
⚫ If a small number of records have missing values,
can omit them
⚫ If many records are missing values on a small set of
variables, can drop those variables (or use proxies)
⚫ If many records/variables have missing values,
omission is not practical
⚫Solution 2: Imputation
⚫ Replace missing values with reasonable substitutes
⚫ Lets you keep the record and use the rest of its (non-
missing) information
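Solution 2 can be sketched with the simplest substitute, the mean of the observed values (a minimal sketch; the `rooms` field is invented, and real imputation often uses medians or model-based estimates):

```python
from statistics import mean

def impute_mean(records, key):
    """Replace missing (None) values of `key` with the mean
    of the observed values, keeping the rest of each record."""
    observed = [r[key] for r in records if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: r[key] if r[key] is not None else fill}
            for r in records]

rows = [{"rooms": 6}, {"rooms": None}, {"rooms": 8}, {"rooms": 10}]
filled = impute_mean(rows, "rooms")  # the None becomes 8, the mean
```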
Normalizing (Standardizing)
Data
⚫Used in some techniques when variables with
the largest scales would dominate and skew
results
⚫Puts all variables on same scale
⚫Normalizing function: Subtract mean and
divide by standard deviation
⚫Alternative function: scale to 0-1 by
subtracting minimum and dividing by the
range
⚫ Useful when the data contain dummies and
numeric
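Both normalizing functions from the slide are one-liners in Python (the `sizes` data are invented for illustration):

```python
from statistics import mean, stdev

def zscore(xs):
    """Normalize: subtract the mean and divide by the standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def minmax(xs):
    """Rescale to 0-1: subtract the minimum and divide by the range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

sizes = [1200, 1800, 2400, 3000]  # e.g., square footage
scaled = minmax(sizes)            # [0.0, ..., 1.0]
standardized = zscore(sizes)      # mean 0, stdev 1
```

The 0-1 rescaling is the variant the slide recommends when dummies (already 0/1) and numeric variables are mixed.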
The Problem of Overfitting
⚫Statistical models can produce highly complex explanations of relationships between variables
⚫The “fit” may be excellent
⚫When used with new data, models of great
complexity do not do so well.
100% fit – not useful for new
data
Overfitting (cont.)
Causes:
⚫ Too many predictors
⚫ A model with too many parameters
⚫ Trying many different models

Consequence: Deployed model will not work as well as expected with completely new data.
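The consequence can be made concrete with a toy "model" that simply memorizes its training data (an invented sketch, the extreme case of too many parameters): its training fit is perfect, yet it fails on completely new data.

```python
def memorizer(train_pairs):
    """A 'model' that memorizes (x, y) training pairs exactly.
    Perfect fit on training data; falls back to the training mean
    for any unseen x."""
    table = dict(train_pairs)
    fallback = sum(y for _, y in train_pairs) / len(train_pairs)
    return lambda x: table.get(x, fallback)

train = [(1, 2), (2, 4), (3, 6)]      # underlying rule: y = 2x
model = memorizer(train)

train_errors = [abs(model(x) - y) for x, y in train]  # all zero: "100% fit"
new_error = abs(model(10) - 20)       # large error on a new case
```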
Partitioning the Data
Problem: How well will our model perform with new data?

Solution: Separate data into two parts
⚫Training partition to develop the model
⚫Validation partition (sometimes called test) to implement the model and evaluate its performance on “new” data

Addresses the issue of overfitting
Holdout Partition
⚫ When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
⚫ Assessing multiple models on same
validation data can overfit validation
data
⚫ Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
⚫ Solution: final selected model is applied
to a holdout partition (sometimes
called test) to give unbiased estimate of
its performance on new data
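A three-way random split into training, validation, and holdout partitions (as described above) can be sketched as follows; the 50/30/20 fractions are illustrative, not from the book:

```python
import random

def partition(records, fracs=(0.5, 0.3, 0.2), seed=1):
    """Randomly split records into training / validation / holdout
    partitions. The last fraction is implied by the remainder."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)          # randomize before slicing
    n = len(shuffled)
    n_train = int(fracs[0] * n)
    n_valid = int(fracs[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train, valid, hold = partition(list(range(100)))
```

Every record lands in exactly one partition, so the holdout set stays untouched until the final, unbiased assessment.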
Cross Validation
● Repeated partitioning = cross-validation (“cv”)
● k-fold cross validation, e.g. k=5
○ For each fold, set aside ⅕ of data as
validation
○ Use full remainder as training
○ The validation folds are non-overlapping
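The k-fold procedure above can be sketched as a generator of (train, validation) pairs (a minimal sketch; real use would shuffle the records first, as in random partitioning):

```python
def kfold(records, k=5):
    """Yield k (train, validation) pairs; the validation folds
    are non-overlapping and together cover all records."""
    folds = [records[i::k] for i in range(k)]   # round-robin assignment
    for i in range(k):
        valid = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, valid

data = list(range(10))
splits = list(kfold(data, k=5))   # 5 pairs, each validating on 1/5 of data
```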
Partitioning the Data
(60/40 split)
Example – Linear Regression
West Roxbury Housing Data
Predictive Modeling Process in RM
Training data: a few predictions, and
summary metrics
Validation data: a few predictions, and
summary metrics
Error metrics
Error = actual – predicted
ME = Mean error
RMSE = Root-mean-squared error = Square
root of average squared error
MAE = Mean absolute error
MPE = Mean percentage error
MAPE = Mean absolute percentage error
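The five metrics follow directly from their definitions (a minimal sketch; the percentage metrics assume no actual value is zero, and the example numbers are invented):

```python
from math import sqrt

def error_metrics(actual, predicted):
    """Compute the common regression error metrics on a validation set.
    Error = actual - predicted."""
    errs = [a - p for a, p in zip(actual, predicted)]
    n = len(errs)
    return {
        "ME":   sum(errs) / n,
        "RMSE": sqrt(sum(e * e for e in errs) / n),
        "MAE":  sum(abs(e) for e in errs) / n,
        "MPE":  100 * sum(e / a for e, a in zip(errs, actual)) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errs, actual)) / n,
    }

m = error_metrics([100, 200, 400], [110, 190, 400])
```

Note how ME can be zero while RMSE and MAE are not: positive and negative errors cancel in ME, so it measures bias rather than accuracy.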
AI Engineering, ML-Ops
⚫ In this book, focus is on developing/testing
the models that predict, classify, cluster,
recommend, forecast.
⚫ Developing/testing (prototyping) is the
initial stage; after selecting a model we
usually need to deploy it in a pipeline that
feeds data to it and generates actions.
⚫ Deployment phase = AI Engineering, or
more specifically ML Ops
AI Engineering, ML-Operations (ML-
Ops)
Dozens of Tools

The “infrastructure” layer provides base computing capability, memory, and networking
AI Engineering, ML-Ops

Security = admin, permissions, access rules
Monitoring = ingests logs, issues alerts
Automation = bring up, configure, tear down tools & infrastructure
Resource Management = oversight, checking for resource exhaustion
AI Engineering, ML-Ops

Tools for testing and debugging
AI Engineering, ML-Ops

Data store (data warehouse or data lake); also Analytic Base Table
(ABT) with derivatives more suited to analysis
AI Engineering, ML-Ops

Tools to create the ABT derivatives that are in the Data Collection layer.
AI Engineering, ML-Ops

Models: main focus of this book
Model Monitoring: ties into “Monitoring” component below
AI Engineering, ML-Ops

Delivery is how user views the system (text file, spreadsheet, interface
with Tableau or Power BI, …)
Summary
⚫ Machine Learning consists of supervised methods
(Classification & Prediction) and unsupervised methods
(Association Rules, Data Reduction, Data Exploration &
Visualization)
⚫ Before algorithms can be applied, data must be explored
and pre-processed
⚫ To evaluate performance and to avoid overfitting, data
partitioning is used
⚫ Models are fit to the training partition and assessed on
the validation and holdout partitions
⚫ Machine Learning methods are usually applied to a
sample from a large database, and then the best model
is used to score the entire database
⚫ Once a model is developed, AI Engineering (ML-Ops)
skills and tools are required to deploy it
