Chapter 02 Overview - 4

Overview

Machine Learning for Business Analytics Using RapidMiner

Shmueli, Bruce, Deokar & Patel
© Galit Shmueli, Peter Bruce and Amit Deokar 2023
Core Ideas in Machine Learning

●Classification
●Prediction
●Association Rules & Recommenders
●Data & Dimension Reduction
●Data Exploration
●Visualization
Paradigms for Machine Learning
(variations)
●SEMMA (from SAS)
• Sample
• Explore
• Modify
• Model
• Assess
●CRISP-DM (SPSS/IBM)
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Supervised Learning
●Goal: Predict a single “target” or “outcome” variable
●Train on data where the target value is known
●Score data where the target value is not known
●Methods: Classification and Prediction
Unsupervised Learning
●Goal: Segment data into meaningful segments; detect patterns
●There is no target (outcome) variable to predict or classify
●Methods: Association rules, collaborative filters, data reduction & exploration, visualization
Supervised: Classification
●Goal: Predict categorical target (outcome) variable
●Examples: Purchase/no purchase, fraud/no
fraud, creditworthy/not creditworthy…
●Each row is a case (customer, tax return,
applicant)
●Each column is a variable
●Target variable is often binary (yes/no)
Supervised: Prediction
(Estimation)
●Goal: Predict numerical target (outcome)
variable
●Examples: sales, revenue, performance
●As in classification:
●Each row is a case (customer, tax return,
applicant)
●Each column is a variable
●Taken together, classification and
prediction constitute “predictive analytics”
Unsupervised: Association
Rules
●Goal: Produce rules that define “what goes
with what” in transactions
●Example: “If X was purchased, Y was also
purchased”
●Rows are transactions
●Used in recommender systems – “Our
records show you bought X, you may also
like Y”
●Also called “affinity analysis”
Unsupervised: Data
Reduction
●Distillation of complex/large data into
simpler/smaller data
●Reducing the number of variables/columns
(e.g., principal components)
●Reducing the number of records/rows (e.g.,
clustering)
The Process of Machine
Learning
Steps in Machine Learning
1. Define/understand purpose
2. Obtain data (may involve random
sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised learning,
partition it
5. Specify task (classification, clustering,
etc.)
6. Choose the techniques (regression, CART,
neural networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
Obtaining Data: Sampling
●Machine learning typically deals with huge databases
●For piloting/prototyping, algorithms and models are typically applied to a sample from a larger dataset (easier to handle)
●Once you develop and select a final model,
you use it to “score” (predict values or
classes for) the records in the larger
database
●Also called “inference”
Rare Event Oversampling
●Often the event of interest is rare
●Examples: response to mailing, fraud in
taxes, …
●Sampling may yield too few “interesting”
cases to effectively train a model
●A popular solution: oversample the rare
cases (equivalent to undersampling the
dominant cases) to obtain a more balanced
training set
●Later, need to adjust results for the
oversampling
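The oversampling idea above can be sketched in plain Python. This is a minimal illustration, not the book's RapidMiner workflow; the record layout and the `fraud` label are invented for the example:

```python
import random

def oversample(records, label_key="fraud", rare_value=True, seed=0):
    """Balance a training set by replicating the rare class
    (equivalent to undersampling the dominant class)."""
    rng = random.Random(seed)
    rare = [r for r in records if r[label_key] == rare_value]
    common = [r for r in records if r[label_key] != rare_value]
    # Draw with replacement from the rare cases until the classes match
    boosted = rare + [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + boosted

# Toy data: 10% of cases are fraudulent
records = [{"amount": i, "fraud": i % 10 == 0} for i in range(100)]
balanced = oversample(records)
rare_share = sum(r["fraud"] for r in balanced) / len(balanced)  # now 0.5
```

As the slide notes, any results from a model trained on the balanced set must later be adjusted for the oversampling before being reported on the original class proportions.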
Types of Variables (Features)
●Determine the types of pre-processing
needed, and algorithms used
●Main distinction: Categorical vs. numeric
●Numeric
●Continuous
●Integer
●Categorical
●Ordered (low, medium, high)
●Unordered (male, female)
Variable handling
●Numeric
●Most algorithms can handle numeric data
●May occasionally need to “bin” into
categories

●Categorical
●Naïve Bayes can use as-is
●In most other algorithms, must create n or n-
1 binary dummies
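The n−1 dummy coding described above can be sketched as follows (a minimal sketch; the function name and category levels are illustrative, and one level is dropped as the baseline):

```python
def dummy_code(values, drop_first=True):
    """Expand a categorical variable into n (or n-1) binary dummies."""
    levels = sorted(set(values))
    # Drop one level as the baseline to avoid redundancy (n-1 dummies)
    kept = levels[1:] if drop_first else levels
    return [{f"is_{lvl}": int(v == lvl) for lvl in kept} for v in values]

rows = dummy_code(["low", "high", "medium", "low"])
# "high" (alphabetically first) becomes the baseline: all dummies 0
```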
Data Pre-processing in RM - West Roxbury
data
Detecting Outliers
⚫An outlier is an observation that is “extreme”, being distant from the rest of the data (the definition of “distant” is deliberately vague)
⚫Outliers can have disproportionate
influence on models (a problem if it is
spurious)
⚫An important step in data pre-processing is
detecting outliers
⚫Once detected, domain knowledge is
required to determine if it is an error, or
truly extreme.
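One common rule for "distant" (not specified on the slides) is Tukey's IQR fence; the sketch below uses a simple index-based quartile approximation rather than a full quantile interpolation:

```python
def iqr_outliers(xs, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fence).
    Quartiles are approximated by simple indexing into the sorted data."""
    s = sorted(xs)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

values = [10, 11, 9, 10, 12, 11, 10, 9, 200]  # is 200 an error or real?
flagged = iqr_outliers(values)
```

The IQR rule is robust: unlike a z-score cutoff, a single huge value cannot inflate the spread estimate enough to mask itself. As the slide says, whether a flagged point is an error or truly extreme still requires domain knowledge.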
Detecting Outliers
⚫In some contexts, finding outliers is the
purpose of the ML exercise (e.g. airport
security screening). This is called “anomaly
detection”.
Handling Missing Data
⚫Most algorithms will not process records
with missing values. Default is to drop
those records.
⚫Solution 1: Omission
⚫ If a small number of records have missing values,
can omit them
⚫ If many records are missing values on a small set of
variables, can drop those variables (or use proxies)
⚫ If many records/variables have missing values,
omission is not practical
⚫Solution 2: Imputation
⚫ Replace missing values with reasonable substitutes
⚫ Lets you keep the record and use the rest of its (non-
missing) information
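Solution 2 can be sketched with the simplest substitute, the mean of the observed values (a minimal sketch; the `rooms` field is invented, and real imputation often uses medians or model-based estimates):

```python
from statistics import mean

def impute_mean(records, key):
    """Replace missing (None) values of `key` with the mean
    of the observed values, keeping the rest of each record."""
    observed = [r[key] for r in records if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: r[key] if r[key] is not None else fill}
            for r in records]

rows = [{"rooms": 6}, {"rooms": None}, {"rooms": 8}, {"rooms": 10}]
filled = impute_mean(rows, "rooms")  # the None becomes 8, the mean
```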
Normalizing (Standardizing)
Data
⚫Used in some techniques when variables with
the largest scales would dominate and skew
results
⚫Puts all variables on same scale
⚫Normalizing function: Subtract mean and
divide by standard deviation
⚫Alternative function: scale to 0-1 by
subtracting minimum and dividing by the
range
⚫ Useful when the data contain dummies and
numeric
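Both normalizing functions from the slide are one-liners in Python (the `sizes` data are invented for illustration):

```python
from statistics import mean, stdev

def zscore(xs):
    """Normalize: subtract the mean and divide by the standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def minmax(xs):
    """Rescale to 0-1: subtract the minimum and divide by the range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

sizes = [1200, 1800, 2400, 3000]  # e.g., square footage
scaled = minmax(sizes)            # [0.0, ..., 1.0]
standardized = zscore(sizes)      # mean 0, stdev 1
```

The 0-1 rescaling is the variant the slide recommends when dummies (already 0/1) and numeric variables are mixed.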
The Problem of Overfitting
⚫Statistical models can produce highly complex explanations of relationships between variables
⚫The “fit” may be excellent
⚫When used with new data, models of great
complexity do not do so well.
100% fit – not useful for new
data
Overfitting (cont.)
Causes:
⚫ Too many predictors
⚫ A model with too many parameters
⚫ Trying many different models

Consequence: Deployed model will not work as well as expected with completely new data.
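The consequence can be made concrete with a toy "model" that simply memorizes its training data (an invented sketch, the extreme case of too many parameters): its training fit is perfect, yet it fails on completely new data.

```python
def memorizer(train_pairs):
    """A 'model' that memorizes (x, y) training pairs exactly.
    Perfect fit on training data; falls back to the training mean
    for any unseen x."""
    table = dict(train_pairs)
    fallback = sum(y for _, y in train_pairs) / len(train_pairs)
    return lambda x: table.get(x, fallback)

train = [(1, 2), (2, 4), (3, 6)]      # underlying rule: y = 2x
model = memorizer(train)

train_errors = [abs(model(x) - y) for x, y in train]  # all zero: "100% fit"
new_error = abs(model(10) - 20)       # large error on a new case
```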
Partitioning the Data
Problem: How well will our model perform with new data?

Solution: Separate data into two parts
⚫Training partition to develop the model
⚫Validation partition (sometimes called test) to implement the model and evaluate its performance on “new” data

Addresses the issue of overfitting
Holdout Partition
⚫ When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
⚫ Assessing multiple models on same
validation data can overfit validation
data
⚫ Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
⚫ Solution: final selected model is applied
to a holdout partition (sometimes
called test) to give unbiased estimate of
its performance on new data
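A three-way random split into training, validation, and holdout partitions (as described above) can be sketched as follows; the 50/30/20 fractions are illustrative, not from the book:

```python
import random

def partition(records, fracs=(0.5, 0.3, 0.2), seed=1):
    """Randomly split records into training / validation / holdout
    partitions. The last fraction is implied by the remainder."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)          # randomize before slicing
    n = len(shuffled)
    n_train = int(fracs[0] * n)
    n_valid = int(fracs[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train, valid, hold = partition(list(range(100)))
```

Every record lands in exactly one partition, so the holdout set stays untouched until the final, unbiased assessment.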
Cross Validation
● Repeated partitioning = cross-validation (“cv”)
● k-fold cross validation, e.g. k=5
○ For each fold, set aside ⅕ of data as
validation
○ Use full remainder as training
○ The validation folds are non-overlapping
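The k-fold procedure above can be sketched as a generator of (train, validation) pairs (a minimal sketch; real use would shuffle the records first, as in random partitioning):

```python
def kfold(records, k=5):
    """Yield k (train, validation) pairs; the validation folds
    are non-overlapping and together cover all records."""
    folds = [records[i::k] for i in range(k)]   # round-robin assignment
    for i in range(k):
        valid = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, valid

data = list(range(10))
splits = list(kfold(data, k=5))   # 5 pairs, each validating on 1/5 of data
```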
Partitioning the Data
(60/40 split)
Example – Linear Regression
West Roxbury Housing Data
Predictive Modeling Process in RM
Training data: a few predictions, and
summary metrics
Validation data: a few predictions, and
summary metrics
Error metrics
Error = actual – predicted
ME = Mean error
RMSE = Root-mean-squared error = Square
root of average squared error
MAE = Mean absolute error
MPE = Mean percentage error
MAPE = Mean absolute percentage error
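The five metrics follow directly from their definitions (a minimal sketch; the percentage metrics assume no actual value is zero, and the example numbers are invented):

```python
from math import sqrt

def error_metrics(actual, predicted):
    """Compute the common regression error metrics on a validation set.
    Error = actual - predicted."""
    errs = [a - p for a, p in zip(actual, predicted)]
    n = len(errs)
    return {
        "ME":   sum(errs) / n,
        "RMSE": sqrt(sum(e * e for e in errs) / n),
        "MAE":  sum(abs(e) for e in errs) / n,
        "MPE":  100 * sum(e / a for e, a in zip(errs, actual)) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errs, actual)) / n,
    }

m = error_metrics([100, 200, 400], [110, 190, 400])
```

Note how ME can be zero while RMSE and MAE are not: positive and negative errors cancel in ME, so it measures bias rather than accuracy.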
AI Engineering, ML-Ops
⚫ In this book, focus is on developing/testing
the models that predict, classify, cluster,
recommend, forecast.
⚫ Developing/testing (prototyping) is the
initial stage; after selecting a model we
usually need to deploy it in a pipeline that
feeds data to it and generates actions.
⚫ Deployment phase = AI Engineering, or
more specifically ML Ops
AI Engineering, ML-Operations (ML-
Ops)
Dozens of Tools

The “infrastructure” layer provides base computing capability, memory, and networking
AI Engineering, ML-Ops

Security = admin, permissions, access rules
Monitoring = ingests logs, issues alerts
Automation = bring up, configure, tear down tools & infrastructure
Resource Management = oversight, checking for resource exhaustion
AI Engineering, ML-Ops

Tools for testing and debugging
AI Engineering, ML-Ops

Data store (data warehouse or data lake); also Analytic Base Table
(ABT) with derivatives more suited to analysis
AI Engineering, ML-Ops

Tools to create the ABT derivatives that are in the Data Collection layer.
AI Engineering, ML-Ops

Models: main focus of this book
Model Monitoring: ties into “Monitoring” component below
AI Engineering, ML-Ops

Delivery is how user views the system (text file, spreadsheet, interface
with Tableau or Power BI, …)
Summary
⚫ Machine Learning consists of supervised methods
(Classification & Prediction) and unsupervised methods
(Association Rules, Data Reduction, Data Exploration &
Visualization)
⚫ Before algorithms can be applied, data must be explored
and pre-processed
⚫ To evaluate performance and to avoid overfitting, data
partitioning is used
⚫ Models are fit to the training partition and assessed on
the validation and holdout partitions
⚫ Machine Learning methods are usually applied to a
sample from a large database, and then the best model
is used to score the entire database
⚫ Once a model is developed, AI Engineering (ML-Ops)
skills and tools are required to deploy it
