1 Tailieuthamkhao MachineLearning

Uploaded by

minhtr160616
Machine Learning Professional

RapidMiner Education
Introduction
Course Introduction

Introduction to the course


1. Audience
2. Objectives
3. Course Outline

Introduction to RapidMiner
1. Purpose
2. Platform Overview
3. Basics of using RapidMiner Studio
4. Continued learning
Introduction
Target Audience

• Business Analyst
• Domain Expert
• Data Scientist
Course Objectives

At the end of this course, you should understand, and be able to use,
the following machine learning tools in RapidMiner Studio’s
Process Designer and Auto Model:
• Classification and Regression
• Split Validation
• Scoring
• Correlations
• Feature Importance
• Clustering and Association Analysis
Course Outline

1. Introduction
2. Introduction to Machine Learning
3. Supervised Learning
4. Deployment & Scoring
5. Unsupervised Learning
6. Feature Engineering
7. Auto Model
Getting Started with RapidMiner
RapidMiner Platform
[Diagram: platform components]
• RapidMiner Market Place: Industry, Application & ML Extensions
• RapidMiner Web Applications
• RapidMiner Studio: Visual Workflow Designer
- Workflow Builder
- Process Execution Engine
• RapidMiner Server: Collaborate + Compute + Deploy + Maintain
- Data and Process Repository
- Web App Portal
- Web Services
- User/Group Access Rights management
- Process Scheduler
- Process Execution Engine
• RapidMiner Radoop: Compile + Execute in Hadoop
• RapidMiner AI Cloud: Managed Services (AWS, Azure)
• Integrate using Web Service and SQL operators
- Server Application
- Java SE/EE Application
- Databases / DWHs
- Application (BI, ERP, CRM…) / Portal
• Use any data
• R / Python / SQL Scripting
• Run in multiple Compute Engines
- In-Memory/H2O/Weka
- In-Hadoop & Spark
Creating Repositories & Folders
• Repositories – Collection of Projects
- Local Repository
- Server Repository

• Folders – Collection of Project Components
- Processes
- Example Set (Data)
- (Preprocessing) Models
- Performances
- Weights
- File Objects
- Documents
- Collections of all of these

• Best practice: Create a folder for each Project under the Repository
and, within this folder, a sub-folder each for Data, Processes & Results (and
more)
RapidMiner Studio
Visual Workflow Designer for Data Scientists

• Accelerate Data Prep: Speed & optimize data exploration, blending, and cleansing tasks. Reduce the time spent importing and wrangling your data.
• Develop Models Quickly: Powerful machine learning, text analytics, predictive modeling algorithms, automation, and process control features help you build better models faster.
• Open & Extensible: Integrate all of your existing applications, data, and programming languages like R & Python.
The RapidMiner “GoTo” Places

• RapidMiner Academy:
https://academy.rapidminer.com

• Online Documentation:
https://docs.rapidminer.com/studio

• Online Community:
https://community.rapidminer.com/
Introduction to Machine Learning

Course outline:
1. Introduction
2. Introduction to Machine Learning
3. Supervised Learning
4. Deployment & Scoring
5. Unsupervised Learning
6. Feature Engineering
7. Auto Model

In this section:
1. Introduction
2. k-NN
3. Model Validation
4. Normalize & Group Models
Introduction to Machine Learning
Unsupervised vs Supervised Learning
[Figure: scatter of unlabeled points marked "?"]
• Unsupervised: Can we add structure?
• Supervised: Is the "?" one class or the other?
Machine Learning

• Supervised Learning (aka Predictive Analytics)
- Classification: Is this A or B? Will this be A or B?
- Regression: How much or how many? How many will happen?
• Unsupervised Learning
- Clustering: How is this organized? What belongs together?
- Outlier Detection: Is this weird?
- Associations & Correlations: What happens together? What belongs together?
• Feature Engineering
- Feature Generation: Create useful attributes
- Feature Selection: Weight & Select Attributes
Underfit and Overfit

• Underfit: Simple model, High Bias
• Overfit: Complex model, High Variance


Introduction to k-NN
k-NN (Nearest Neighbor) Algorithm

k-NN is a simple, distance-based supervised learning algorithm for
Classification or Regression
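
A minimal sketch of k-NN classification, assuming Euclidean distance and a simple majority vote among the k nearest training examples. The function name `knn_predict` and the toy points are illustrative assumptions, not RapidMiner's implementation.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    # Sort the training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda ex: math.dist(ex[0], query))
    # Majority vote among the labels of the k nearest examples.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]
prediction = knn_predict(train, (1.1, 1.0), k=3)
```

For regression, the vote would be replaced by an average of the neighbors' numeric labels.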
[Figure: k-NN example with points P1, P2, P3, and P4]
Distance Types & Measures

• Numerical
• Nominal
• Mixed
• Bregman Divergence
Splitting Data for Training & Testing

• Training: 70%
• Testing: 30%
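
A random 70/30 split could look like the sketch below. The helper name `split_data`, the fixed seed, and the stand-in examples are assumptions made for illustration.

```python
import random

def split_data(examples, train_ratio=0.7, seed=42):
    # Shuffle a copy so the original order is left untouched.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # The first 70% becomes training data, the rest testing data.
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_data(range(100))
```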
Model Validation
Scientific Method
Purpose, Research, Hypothesis, Experiment, Analysis, Conclusion

• Training Data: The model is trained on the training data, so of course it fits. That doesn’t need to be tested.
• Testing Data: A real experiment is whether or not it has predictive value.
Training vs Testing Error

[Figure: Error Rate versus Model Complexity for Training Data and New Data, running from Underfit to Overfit with a Good region in between]
• Performance can only be measured by testing predictions with new data
• When validating a model, ignore performance on Training Data
• Don’t over-optimize. Everything you optimize needs to be validated!
http://gerardnico.com/wiki/data_mining/overfitting
Anatomy of Machine Learning
Modeling

Evaluation

Data Preparation

Deployment
Performance Measurement - Accuracy

Confusion Matrix Table View:

TP FP
FN TN

Accuracy = (TP+TN)/(TP+FP+TN+FN)
Class Precision = TP/(TP+FP) or TN/(TN+FN)
Class Recall = TP/(TP+FN) or TN/(TN+FP)
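
The formulas above can be sketched as a small function. The counts in the example are hypothetical and only serve to exercise the arithmetic.

```python
def classification_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Precision and recall are computed per class.
        "precision_pos": tp / (tp + fp),
        "precision_neg": tn / (tn + fn),
        "recall_pos": tp / (tp + fn),
        "recall_neg": tn / (tn + fp),
    }

# Hypothetical confusion matrix counts.
m = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```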
Performance Measurement - Costs

• Try to estimate the value of each possibility in the confusion matrix
• Calculate the total value of the model across many predictions
• Compare the value to a baseline to estimate gain
• Auto Model produces the results:
- Profits from Model: 1,000
- Profits for Best Option (Loyal): -1,600
- Gain: 2,600
Split Validation vs. Cross Validation

Split Validation: Performs a simple validation by randomly splitting the
data into two data sets: Training & Testing.

Cross Validation: An iterative validation process where the data is
split into many training & validation subsets. Each iteration validates
(tests) one subset of data, using the remaining subsets as training
data.

Note: In Cross Validation, the # of subsets of data = the # of iterations


Cross Validation
[Figure: Subsets 1 to 5 across Iterations 1 to 5; in each iteration one subset is the Validation Subset and the others are Training Subsets]

Iteration            1    2    3    4    5
Validation Accuracy  80%  88%  85%  78%  84%

Final Accuracy = Average(Iteration 1, Iteration 2, …)
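
A rough sketch of how the subsets rotate and how the final accuracy is averaged. The generator name `k_fold_indices` and the interleaved assignment of examples to subsets are assumptions; the five accuracies are the ones shown above.

```python
def k_fold_indices(n_examples, k=5):
    # Assign example i to subset i % k (a simple, non-random assignment).
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        # Every other subset is used for training in this iteration.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(10, k=5))

# Validation accuracies of the five iterations, as listed above.
fold_accuracies = [0.80, 0.88, 0.85, 0.78, 0.84]
final_accuracy = sum(fold_accuracies) / len(fold_accuracies)  # about 0.83
```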
Proper Validation

Use these things:
• Proper validation schemes like split-validation or cross-validation
• Modular validation schemes
• Multi-level nesting

Check for these things:
• Data transformations or feature engineering that are dependent on observed data examples
• Low-level model parameter tuning, or model creation
• Pre-processing models
• Hyperparameter or high-level model optimization
• Model selection
Normalize & Group Models
Normalization (Before)
Normalization (After)
Normalization Methods

• Range Transformation
• Proportional (Sum)
• Interquartile Range
• Z-Transformation

http://www.statistics4u.info/fundstat_eng/ee_ztransform.html
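
The four methods can be sketched as follows. These are common textbook definitions and an assumption about what the operator computes: the sample standard deviation in the Z-transformation and (value - median) / IQR for the interquartile range method may differ in detail from RapidMiner's implementation.

```python
import statistics

def z_transform(values):
    # Subtract the mean and divide by the (sample) standard deviation.
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def range_transform(values, lo=0.0, hi=1.0):
    # Rescale linearly so the minimum maps to lo and the maximum to hi.
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def proportion_transform(values):
    # Each value becomes its share of the attribute's total sum.
    total = sum(values)
    return [v / total for v in values]

def iqr_transform(values):
    # Center on the median and scale by the interquartile range.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]

values = [2, 4, 6, 8, 10]
```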
Group Models

This operator groups the given models into a single combined model.
When this combined model is applied, it is equivalent to applying the
original models in their respective order.
Supervised Learning
Supervised Learning

Course outline:
1. Introduction
2. Introduction to Machine Learning
3. Supervised Learning
4. Deployment & Scoring
5. Unsupervised Learning
6. Feature Engineering
7. Auto Model

In this section:
• Regression: Linear, Logistic, GLM
• Naïve Bayes
• Decision Tree
• Neural Networks
Linear Regression
Linear Regression

Linear Regression is a statistical-model-based supervised learning
algorithm for Regression
Understanding Linear Regression

Goal→ Minimize error

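α = intercept
β = slope
$y_i = \alpha + \beta x_i + \varepsilon_i$

Linear Regression
Analytical solution: $\hat{\beta} = (X^\top X)^{-1} X^\top y$, which minimizes the residual sum of squares (RSS)

Ridge Regression
Analytical solution: $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$
Helps with multicollinearity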
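
For a single attribute, the error-minimizing intercept and slope have a simple closed form, sketched below. The function name `fit_line` and the data are illustrative assumptions; the ridge variant shown penalizes only the slope, which is one common convention.

```python
def fit_line(xs, ys, ridge_lambda=0.0):
    """Return (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    # ridge_lambda = 0 gives ordinary least squares; a positive value
    # shrinks the slope toward zero (ridge regression).
    beta = sxy / (sxx + ridge_lambda)
    alpha = y_mean - beta * x_mean
    return alpha, beta

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]           # exactly y = 1 + 2x
ols = fit_line(xs, ys)
ridge = fit_line(xs, ys, ridge_lambda=5.0)
```

With these points the least-squares fit recovers the intercept 1 and slope 2, while the ridge fit returns a smaller slope.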
Converting Nominal to Numerical Data

• Dummy Coding
• Effect Coding
• Integer Coding: Avoid if the data is not fundamentally numeric… And still avoid if it is fundamentally numeric!
Performance (Regression)
Performance Metrics for Regression
Popular performance metrics for linear regression include:
• Root Mean Squared Error (RMSE)
• Absolute Error
• Relative Error
• R² (Squared Correlation)
[Figure: example fit with R² = 0.97]
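
A sketch of these four metrics under their usual definitions (mean absolute error, mean relative error with respect to the actual value, and the squared Pearson correlation between actual and predicted values). The function name and the sample values are assumptions for illustration.

```python
import math

def regression_metrics(actual, predicted):
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    absolute_error = sum(abs(e) for e in errors) / n
    relative_error = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    # Squared correlation between actual and predicted values.
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return {
        "rmse": rmse,
        "absolute_error": absolute_error,
        "relative_error": relative_error,
        "squared_correlation": cov ** 2 / (var_a * var_p),
    }

# Hypothetical labels and predictions.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
```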
Logistic Regression
Logistic Regression

Logistic Regression is a statistical-model-based supervised learning
algorithm for Classification
Logistic Regression Transformation

• Logit: $\ln\left[\frac{p}{1-p}\right] = a + BX$
• Logistic: $p = \frac{e^{a + BX}}{1 + e^{a + BX}}$
• where:
- ln is the natural logarithm, log_e
- p is the probability that the event Y occurs, p(Y=1)
- p/(1-p) = the odds
- ln[p/(1-p)] = log odds, or "logit"
• Otherwise like a linear model
• The resulting B coefficient is the effect on the log odds; e^B is the "odds ratio" for a one-unit increase in X
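
The two transformations are inverses of each other, which the following sketch checks numerically. The coefficient values are arbitrary assumptions chosen only for the demonstration.

```python
import math

def logit(p):
    # Log odds of a probability p.
    return math.log(p / (1 - p))

def logistic(a, b, x):
    # Probability implied by the linear predictor a + b*x.
    z = a + b * x
    return math.exp(z) / (1 + math.exp(z))

a, b = -1.0, 0.5                     # hypothetical coefficients
p0 = logistic(a, b, 3.0)
p1 = logistic(a, b, 4.0)             # x increased by one unit
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))   # equals e**b
```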
Converting Nominal to Binominal Data
GLM
Generalized Linear Model

GLM is a statistical-model-based supervised learning algorithm for
Classification and Regression

GLM

• GLM is a generalization of both linear regression and logistic regression
• The model can handle a label that is:
- Numerical with a Gaussian family
- Binominal with a Binomial family
- Polynominal with a Multinomial family
• RapidMiner can assign an appropriate family and solver
• It is robust because it has regularization and can handle missing values
• GLM brings together many of the advantages of regression in a flexible framework
• Expect to be able to interpret the model and results like you would with Linear Regression or Logistic Regression
• Don’t expect the results to be a perfect match with Linear Regression or Logistic Regression.
Naïve Bayes
Naïve Bayes

Naïve Bayes is a statistical-model-based supervised learning algorithm
for Classification

Bayes’ Rule

$P(C_i \mid x) = \frac{P(C_i)\, P(x \mid C_i)}{P(x)}$

- $P(C_i \mid x)$: posterior probability of class $C_i$ given instance $x$
- $P(C_i)$: prior probability
- $P(x \mid C_i)$: likelihood
- $P(x)$: evidence, constant across classes!

Prediction: compare $P(C_i)\, P(x \mid C_i)$ across classes.
Golf data set – When to play golf?

Outlook Temperature Humidity Wind Play


sunny High High false no
sunny Mild High true no
overcast High High false yes
rain Cool High false yes
rain Cool High false yes
rain Cool Normal true no
overcast Cool Normal true yes
sunny Mild High false no
sunny Cool Normal false yes
rain Mild High false yes
sunny Mild Normal true yes
overcast Mild High true yes
overcast High Normal false yes
rain Mild High true no
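Number of Cases and Probabilities

Number of cases:
Outlook   Yes No | Temperature Yes No | Humidity Yes No | Wind  Yes No | Play: Yes No
sunny     2   3  | Cool        4   1  | Normal   4   1  | false 6   2  |       9   5
overcast  4   0  | Mild        3   3  | High     5   4  | true  3   3  |
rain      3   2  | High        2   1  |                 |              |

Probabilities (rounded to one decimal place):
Outlook   Yes No  | Temperature Yes No  | Humidity Yes No  | Wind  Yes No  | Play: Yes No
sunny     0.2 0.6 | Cool        0.4 0.2 | Normal   0.4 0.2 | false 0.7 0.4 |       0.6 0.4
overcast  0.4 0.0 | Mild        0.3 0.6 | High     0.6 0.8 | true  0.3 0.6 |
rain      0.3 0.4 | High        0.2 0.2 |                  |               |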
Likelihood Calculation
For the instance Outlook = sunny, Temperature = Cool, Humidity = High, Wind = true, using the exact fractions behind the rounded probabilities:

YES 2/9 x 4/9 x 5/9 x 3/9 x 9/14 = 0.011758 (36.4% after normalizing)

NO 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.020571 (63.6% after normalizing)
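
The likelihood calculation can be reproduced from the golf data set with a short sketch. The instance (sunny, Cool, High, true) is inferred from the factors in the calculation, and the function name `naive_bayes_scores` is an assumption; no smoothing is applied.

```python
from collections import Counter
from fractions import Fraction

# Golf data set: Outlook, Temperature, Humidity, Wind, Play
ROWS = [
    ("sunny", "High", "High", "false", "no"),
    ("sunny", "Mild", "High", "true", "no"),
    ("overcast", "High", "High", "false", "yes"),
    ("rain", "Cool", "High", "false", "yes"),
    ("rain", "Cool", "High", "false", "yes"),
    ("rain", "Cool", "Normal", "true", "no"),
    ("overcast", "Cool", "Normal", "true", "yes"),
    ("sunny", "Mild", "High", "false", "no"),
    ("sunny", "Cool", "Normal", "false", "yes"),
    ("rain", "Mild", "High", "false", "yes"),
    ("sunny", "Mild", "Normal", "true", "yes"),
    ("overcast", "Mild", "High", "true", "yes"),
    ("overcast", "High", "Normal", "false", "yes"),
    ("rain", "Mild", "High", "true", "no"),
]

def naive_bayes_scores(rows, instance):
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, len(rows))          # prior probability
        for col, value in enumerate(instance):
            matches = sum(1 for r in rows if r[-1] == label and r[col] == value)
            score *= Fraction(matches, n_label)       # per-attribute likelihood
        scores[label] = score
    return scores

scores = naive_bayes_scores(ROWS, ("sunny", "Cool", "High", "true"))
total = sum(scores.values())                          # evidence, same for all classes
posteriors = {label: float(s / total) for label, s in scores.items()}
```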
Decision Tree
Decision Tree

Decision Tree is a statistical-model-based supervised learning algorithm
for Classification and Regression
Understand the Decision Tree
Popularity of Decision Trees

Decision Trees are very popular models for several reasons:


• Very simple to understand
• Can deal easily with interactions and non-linear effects
• Aren’t bothered by non-normalized data, missing values,
untransformed dates, multi-collinearity, and “messy” data
• Easily implemented in other platforms (try the “Tree to Rules”
operator)
• HOWEVER, they are not good at extrapolation!
• And also very prone to overfitting!!
Decision Tree Overfitting & Pruning

[Figure from lecture slides for the textbook Machine Learning, © Tom M. Mitchell, McGraw Hill, 1997]
Extrapolation

Number of Candy Bars (Num)   Cost
1                            $1
2                            $2
3                            $3

Resulting tree:
Num >= 2?
- No: $1
- Yes: Num >= 3?
  - No: $2
  - Yes: $3
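
Written out as code, the tree shows why extrapolation fails: every input beyond the training range falls into the last leaf. The function name `tree_predict_cost` is illustrative.

```python
def tree_predict_cost(num):
    # Mirrors the two splits learned from 1, 2, and 3 candy bars.
    if num >= 2:
        if num >= 3:
            return 3
        return 2
    return 1

in_range = [tree_predict_cost(n) for n in (1, 2, 3)]
beyond_range = tree_predict_cost(10)   # still 3, however many bars are bought
```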
Decision Tree Operator
Neural Networks
Neural Networks

A Neural Net is a structural supervised learning algorithm for
Classification and Regression

Selected Highlights From Neural Net History

• 1940’s, McCulloch-Pitts: Artificial Neuron based on a simple sum of binary inputs
• 1950’s, Rosenblatt: Learning rule to train perceptron with different weights and thresholds
• 1980’s, Hinton-Rumelhart-Williams: Major successes built on multi-layer networks with backpropagation on sigmoid function
Perceptron
f(example) = f(α*Age, β*Avg Transaction, . . . )

[Figure: input layer with Age, Avg Transaction, and Gender feeding a single output f(example)]
• All of the inputs are combined in a linear fashion and processed by some activation or link function, f
• It is not too unlike Logistic Regression or GLM
• Instead of predicting the label, it may only be predicting a predictor!
A Neural Net

[Figure: network with an input layer (Age, Avg Transaction, Gender, Last Transaction), hidden layer(s) of neurons, and an output layer (Churn, Loyal)]
• Output values are between 0 and 1
• The one with the highest value is chosen as predicted class
How does it work?

[Figure sequence: the same network built up step by step]
• Each connection has a weight (α1, α2, α3, …, β1, β2, β3, …), which can be positive or negative
• f1 = f(α1 . Age, β1 . Avg Transaction, . . . )
• f2 = f(α2 . Age, β2 . Avg Transaction, . . . )
• The hidden neurons f1, f2, and f3 feed the neurons f4 and f5, which produce the Churn and Loyal outputs
• Example: Age = 25, Avg Transaction = 12.5, Gender = 1, Last Transaction = 10-08-2015 is predicted as Loyal
Where do Weights Come From?

Similar to the following:

• Create structure and initialize weights with random values
• Loop (training cycles)
- Forward propagation over layers
- Compute loss
- Backward propagation over layers
- Update weights
• Save trained neural net
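
The loop above can be sketched for the simplest possible case, a single sigmoid neuron trained with a cross-entropy loss. This is a deliberately reduced illustration, not a multi-layer network and not RapidMiner's Neural Net operator; the names, the learning rate, and the OR-function data are assumptions.

```python
import math
import random

def train_neuron(examples, cycles=2000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Create structure and initialize weights with random values.
    weights = [rng.uniform(-0.5, 0.5) for _ in examples[0][0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(cycles):                       # loop (training cycles)
        for inputs, target in examples:
            # Forward propagation: weighted sum, then sigmoid activation.
            z = bias + sum(w * x for w, x in zip(weights, inputs))
            output = 1.0 / (1.0 + math.exp(-z))
            # Cross-entropy loss: its slope with respect to z is output - target.
            delta = output - target
            # Backward propagation and weight update.
            weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias

def predict(weights, bias, inputs):
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training data: the logical OR of two binary inputs.
OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(OR_EXAMPLES)
```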
How Does Optimization Work?

• Performance is measured by
some loss or cost function that
tells how bad our perceptron is
doing – we want to minimize
loss
• Loss and Activation functions
are chosen to make calculating
the slope trivial
• The slope indicates direction of
reduced loss
How Does Optimization Work?

• Gradient descent
• We can’t analytically solve for a minimum cost, but for a given set of weights, we can measure the cost and the direction of lower cost
[Figure: loss, cost, or error versus weight value, with numbered steps moving down the curve]
Learning Rate

• Learning Rate is a parameter that controls the size of the steps
• A high Learning Rate may skip right over an optimal value and never find it
• A low Learning Rate may take too many iterations to get close
[Figure: loss, cost, or error versus parameter value or weight, with numbered steps of different sizes]
Momentum

• Momentum will continue additional tests in the same direction past the best known value
• A low momentum can increase the chance of getting stuck in a local optimum
• A high momentum can result in needless computation
[Figure: loss, cost, or error versus parameter value or weight, with numbered steps continuing past a local optimum]
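
A small sketch of gradient descent with a learning rate and a momentum term, run on a one-dimensional loss whose minimum is known. The loss (w - 3)^2, the function name, and all parameter values are assumptions chosen to make the behavior easy to see.

```python
def gradient_descent(gradient, start, learning_rate=0.1, momentum=0.0, steps=100):
    weight, velocity = start, 0.0
    for _ in range(steps):
        # Momentum keeps part of the previous step; the learning rate
        # scales the step taken against the slope.
        velocity = momentum * velocity - learning_rate * gradient(weight)
        weight += velocity
    return weight

def grad(w):
    # Slope of the loss (w - 3)**2, which has its minimum at w = 3.
    return 2.0 * (w - 3.0)

w_plain = gradient_descent(grad, start=0.0, learning_rate=0.1)
w_momentum = gradient_descent(grad, start=0.0, learning_rate=0.1, momentum=0.5)
# A learning rate that is too high overshoots further on every step.
w_too_high = gradient_descent(grad, start=0.0, learning_rate=1.1, steps=20)
```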


Neural Net Operator
Review of Algorithms
http://mod.rapidminer.com

• K-NN
- Lazy Structure of distances
- Classification or Regression
• Linear Regression
- Interpretable Statistical Model
- Regression
• Logistic Regression
- Interpretable Statistical Model
- Classification
• GLM
- Interpretable Statistical Model
- Classification or Regression
• Naïve Bayes
- Interpretable Statistical Model
- Classification
- Fast, but limited
• Decision Tree
- Interpretable Structure of Rules
- Classification or Regression
- Prone to overfit
• Neural Net
- Uninterpretable Structure
- Classification or Regression
- Extremely powerful
Deployment & Scoring
Deployment & Scoring

Course outline:
1. Introduction
2. Introduction to Machine Learning
3. Supervised Learning
4. Deployment & Scoring
5. Unsupervised Learning
6. Feature Engineering
7. Auto Model

In this section:
1. Deployment
2. Scoring
Deployment
Deployment Overview

Terminology
• The model is trained,
validated, and ready for
production
• A deployment is a place for
models of one purpose
• A deployment location is a
place for deployments of
similar access methods
• Deploy is the action of putting
a model in its place
Deployment in RapidMiner

• Manually Deploy: Build your own custom processes to manage the deployment
• OR: Deploy with built-in tools!
Scoring
Scoring

1. Take new data… (Transaction)
2. Apply the model...
3. Get a prediction! (Score)
Scoring

Score in a local process… Or use the tools!


Unsupervised Learning
Unsupervised Learning

Course outline:
1. Introduction
2. Introduction to Machine Learning
3. Supervised Learning
4. Deployment & Scoring
5. Unsupervised Learning
6. Feature Engineering
7. Auto Model

In this section:
1. Attribute Correlations
2. Clustering
3. Association Mining
Attribute Correlations
Correlations

• Correlations between regular numeric attributes can show us if we
have redundant features
• This information is helpful for building understanding and improving
the model
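
As an illustration of spotting a redundant feature, the sketch below computes the Pearson correlation between attribute pairs. The attribute names and values are made up for the example.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attributes: the same height in two units, plus shoe size.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]
shoe_size = [38.0, 44.0, 39.0, 43.0]

redundant = pearson(height_cm, height_in)   # essentially 1: one attribute can be dropped
weaker = pearson(height_cm, shoe_size)      # noticeably below 1
```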
Correlations
Clustering
Clustering

• One of the most common types of clustering is k-Means Clustering
• The k-Means algorithm uses a set number of clusters, and iteratively
recalculates the center of each cluster
k-Means Clustering

[Figure sequence: unlabeled points "?" with three centroids A, B, and C that move as the steps below are repeated]

Initialization
• Randomly pick k new points and assign them to k unique clusters
• Each of these k points becomes the centroid of its own cluster

Segmentation
• For each observed point, find which centroid it is closest to, and assign it to that cluster

Update centroids (means)
• For each cluster, calculate an updated centroid

Segmentation and Update centroids then alternate until the assignments settle.
k-Means Clustering

• k sets the number of clusters
• Measure types sets how distances are measured
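
The initialization, segmentation, and update steps can be sketched as below, assuming Euclidean distance and a fixed number of iterations. As a simplification, the initial centroids are sampled from the observed points rather than being new points; the function name and data are illustrative.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Initialization: pick k starting centroids (here: sampled from the data).
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Segmentation: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update centroids: mean of each cluster (keep the old one if empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
```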
X-Means Clustering

• One drawback of k-Means is that the number of clusters must be defined before clustering
• There is no universal method for picking the best number of clusters
• X-Means can often provide a useful option

X-Means
• Loop over range of values for k
• For each k run k-Means
- Each k-Means runs as normal
Association Mining
Basket Analysis

• Suppose there are 5,000 different items.
• Customers lump items in a shopping cart or basket, and purchase several items at a time.
• A selected combination of items, or association, can be analyzed for relevance.
• Many possible combinations:
- >20 billion combinations of 3 items
- >26 trillion combinations of 4 items
Association Mining

Association Analysis can be thought of as a three-step process:

1. Data Prep varies widely; different item set methods can take a variety of data structures.
2. Identifying frequent item-sets is commonly done with the FP-Growth operator.
3. The Create Association Rules operator is used for rule generation. It’s important to understand each of the different criteria including support, confidence, and lift.
FP-Growth

Generating the FP-Tree

• Get Frequency of each term
• Sort items within each basket by overall item frequency
• Add each basket to the FP-Tree
- Start with the first item, is it already in the tree or not
  - Yes? Follow the existing branch
  - No? Split off and create a new branch
• Example FP-Tree after adding a basket {Item 1, Item 2}:
Root
  Item 1
    Item 2

FP-Growth

Now add baskets:
• {Item 1, Item 3}
• {Item 3, Item 4}
• {Item 1, Item 2, Item 4}
• {Item 1, Item 2} …
Root
  Item 1
    Item 2
      Item 4
    Item 3
  Item 3
    Item 4
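
The tree-building steps can be sketched with nested dictionaries. This covers only the insertion logic for the five example baskets; a full FP-Growth implementation also applies a minimum support and mines the tree, which is omitted here. The function name and the tie-breaking by item name are assumptions.

```python
from collections import Counter

def build_fp_tree(baskets):
    # Get the frequency of each item across all baskets.
    frequency = Counter(item for basket in baskets for item in basket)
    root = {"count": 0, "children": {}}
    for basket in baskets:
        # Sort items by overall frequency (ties broken by name).
        ordered = sorted(basket, key=lambda item: (-frequency[item], item))
        node = root
        for item in ordered:
            # Follow the existing branch, or split off a new one.
            child = node["children"].setdefault(item, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return root

baskets = [
    {"Item 1", "Item 2"},
    {"Item 1", "Item 3"},
    {"Item 3", "Item 4"},
    {"Item 1", "Item 2", "Item 4"},
    {"Item 1", "Item 2"},
]
tree = build_fp_tree(baskets)
```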
Create Association Rules

Inspect associations with simple statistics:

• Frequency(X) = Number of times X appears
• Support(X) = Frequency(X) / (Number of baskets)
• Confidence(X -> Y) = Support(X and Y) / Support(X)
• Lift(X -> Y) = Support(X and Y) / ( Support(Y) * Support(X) )
• Conviction(X -> Y) = ( 1 - Support(Y) ) / ( 1 - Confidence(X -> Y) )

These statistics can be used to create a threshold and provide more
insight into the strength of the rule.
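
These statistics translate directly into code. The sketch below evaluates the rule Item 1 -> Item 2 on a small set of hypothetical baskets; the function names are illustrative.

```python
def support(baskets, items):
    # Fraction of baskets that contain every item in `items`.
    items = set(items)
    return sum(1 for b in baskets if items <= set(b)) / len(baskets)

def rule_stats(baskets, x, y):
    s_x, s_y = support(baskets, x), support(baskets, y)
    s_xy = support(baskets, set(x) | set(y))
    confidence = s_xy / s_x
    lift = s_xy / (s_x * s_y)
    # Conviction is unbounded when the rule never fails.
    conviction = (1 - s_y) / (1 - confidence) if confidence < 1 else float("inf")
    return {"support": s_xy, "confidence": confidence,
            "lift": lift, "conviction": conviction}

baskets = [
    {"Item 1", "Item 2"},
    {"Item 1", "Item 3"},
    {"Item 3", "Item 4"},
    {"Item 1", "Item 2", "Item 4"},
    {"Item 1", "Item 2"},
]
stats = rule_stats(baskets, {"Item 1"}, {"Item 2"})
```

A lift above 1 indicates that the two items appear together more often than independence would suggest.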
Feature Engineering
Feature Engineering

Course outline:
1. Introduction
2. Introduction to Machine Learning
3. Supervised Learning
4. Deployment & Scoring
5. Unsupervised Learning
6. Feature Engineering
7. Auto Model

In this section:
1. Introduction to Feature Engineering
2. Feature Weighting
Introduction to Feature Engineering
Feature Engineering

• Feature Generation: create new attributes (create useful attributes)
• Feature Selection: reduce the number of attributes (weight & select attributes)
- Correlation
- Information Gain
- Relief
Feature Generation
Feature Generation or Feature Engineering is the process of transforming raw
data in order to make it more useful or more stable for predictive modeling
purposes
• Select approaches to feature engineering:
- Functional transformations
- Counts, sum, average, min/max/range, ratios
- Interaction effect variables
- Binning continuous variables
- Combining high cardinality nominal variables
- Date/time calculations

Automatic Feature Engineering tools can create many new features using these
techniques
Feature Selection

Whenever you have a large number of highly correlated possible input
variables, it can cause problems for various modeling or machine learning
algorithms, including high memory usage and matrix sparsity

Feature Selection is the general process of reducing the number of
attributes
• Manual selection
• Attribute correlation
• Model Performance
• Data-driven variance methods (e.g. PCA)
Feature Weighting
Feature Weighting
Feature Weighting is the general process of scoring attributes according
to some measure of their importance for modeling

• Common weighting methods include:
- Weight by Correlation or squared correlation: based on correlation of each attribute with the label
- Weight by Information Gain: like Information Gain in a Decision Tree
- Weight by Relief: based on predictive power between close examples with different labels
• Feature Weights are often used for Feature Selection
• Feature Weights can also be used in other ways like scaling data
• Understand, be able to use, and interpret the results of Feature
Importance or Weighting. It can be used by some, but not all model
types. There are different ways to weight the attributes and care is
needed to select a good technique for a given problem. Some of the
most common choices are the Weight by Relief, Weight by
Information Gain, and Weight by Correlation operators.

• http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
Auto Model
Auto Model

Course outline:
1. Introduction
2. Introduction to Machine Learning
3. Supervised Learning
4. Deployment & Scoring
5. Unsupervised Learning
6. Feature Engineering
7. Auto Model

In this section:
1. Clustering
2. Supervised Learning
3. Deployment
Auto Model
Auto Model for Clustering

Auto Model can be used for:
• Supervised Learning
• Clustering
• Correlations
• Outlier Detection
• Feature Engineering
• Help with correct validation
• Preparation for Deployment

Course Objectives

You should understand, and be able to use, the following
machine learning tools in RapidMiner Studio’s Process Designer and
Auto Model:
• Classification and Regression
• Split Validation
• Scoring
• Correlations
• Feature Importance
• Clustering and Association Analysis
Next Steps

Get Certified!
https://academy.rapidminer.com/pages/certification

Area: Data Understanding and Data Preparation. Next Course: Data Engineering Master
Area: Model Selection, Evaluation, and Validation. Next Course: Machine Learning Master

RapidMiner Academy: https://academy.rapidminer.com
Online Documentation: https://docs.rapidminer.com/studio
Online Community: https://community.rapidminer.com/
