1 Tailieuthamkhao MachineLearning

Uploaded by

minhtr160616


Machine Learning Professional

RapidMiner Education
Introduction
Course Introduction

Introduction to the course

1. Audience
2. Objectives
3. Course Outline

Introduction to RapidMiner
1. Purpose
2. Platform Overview
3. Basics of using RapidMiner Studio
4. Continued learning
Introduction
Target Audience

• Business Analyst
• Domain Expert
• Data Scientist
Course Objectives

At the end of this course, you should understand, and be able to use, the
following machine learning tools in RapidMiner Studio’s Process Designer and
Auto Model:
• Classification and Regression
• Split Validation
• Scoring
• Correlations
• Feature Importance
• Clustering and Association Analysis
Course Outline

1. Introduction
2. Introduction to Machine Learning
3. Supervised Learning
4. Deployment & Scoring
5. Unsupervised Learning
6. Feature Engineering
7. Auto Model
Getting Started with RapidMiner
RapidMiner Platform

• RapidMiner Marketplace: Industry, Application & ML Extensions
• RapidMiner Web Applications
• RapidMiner Studio: Visual Workflow Designer
- Workflow Builder
- Repository
- Process Execution Engine
• RapidMiner Server: Collaborate + Compute + Deploy + Maintain
- Data and Process Repository
- Web App Portal
- Web Services
- User/Group Access Rights management
- Process Scheduler
- Process Execution Engine
• RapidMiner Radoop: Compile + Execute in Hadoop
• RapidMiner AI Cloud: Managed Services on AWS and Azure
• Integrate using Web Service and SQL operators: Server Application, Java SE/EE Application
• Use any data: Databases / DWHs, Application (BI, ERP, CRM…) / Portal
• R / Python / SQL Scripting
• Run in multiple Compute Engines: In-Memory/H2O/Weka, In-Hadoop & Spark
Creating Repositories & Folders
• Repositories – Collection of Projects
- Local Repository
- Server Repository

• Folders – Collection of Project Components
- Processes
- Example Set (Data)
- (Preprocessing) Models
- Performances
- Weights
- File Objects
- Documents
- Collections of all of these

• Best practice: Create a folder for each Project under the Repository and, within this folder, sub-folders for Data, Processes, and Results (and more).
RapidMiner Studio
Visual Workflow Designer for Data Scientists

• Accelerate Data Prep: Speed & optimize data exploration, blending, and cleansing tasks; reduce the time spent importing and wrangling your data.
• Develop Models Quickly: Powerful machine learning, text analytics, predictive modeling algorithms, automation, and process control features help you build better models faster.
• Open & Extensible: Integrate all of your existing applications, data, and programming languages like R & Python.
The RapidMiner “GoTo” Places

• RapidMiner Academy:
https://academy.rapidminer.com

• Online Documentation:
https://docs.rapidminer.com/studio

• Online Community:
https://community.rapidminer.com/
Introduction to Machine Learning

1. Introduction
2. k-NN
3. Model Validation
4. Normalize & Group Models
Unsupervised vs Supervised Learning
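• Unsupervised: Can we add structure?
• Supervised: Is the unknown example (?) one class or the other?

[Figure: a scatter of unlabeled points next to a scatter of two labeled classes with one unknown point]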
Machine Learning

• Supervised Learning (aka Predictive Analytics)
- Classification: Is this A or B? Will this be A or B?
- Regression: How much or how many? How many will happen?
• Unsupervised Learning
- Clustering: How is this organized? What belongs together?
- Outlier Detection: Is this weird?
- Associations & Correlations: What happens together? What belongs together?
• Feature Engineering
- Feature Generation: Create useful attributes
- Feature Selection: Weight & Select Attributes
Underfit and Overfit

• Underfit: Simple model, High Bias
• Overfit: Complex model, High Variance
Introduction to k-NN
k-NN (Nearest Neighbor) Algorithm

k-NN is a simple distance-based supervised learning algorithm for Classification or Regression.

[Figure: example points P1, P2, and P3]
Distance Types & Measures

• Numerical
• Nominal
• Mixed
• Bregman Divergence
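
A minimal sketch of k-NN classification with a numerical (Euclidean) distance measure, to make the "distance-based" idea concrete. The function names and the toy data are hypothetical and not part of the course material; RapidMiner's k-NN operator offers more measure types than this.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (point, label) pairs.
    """
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
         ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B")]

print(knn_predict(train, (2.0, 2.0)))
print(knn_predict(train, (6.0, 5.0)))
```

For regression, the vote would be replaced by an average of the neighbors' label values.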
Splitting Data for Training & Testing

• 70% Training
• 30% Testing
The model is trained on the training data, so of course it fits. That doesn’t need to be tested. A real experiment is whether or not it has predictive value.

[Figure: the scientific method (Purpose, Research, Hypothesis, Experiment, Analysis, Conclusion) next to its model-validation counterpart (Hypothesis, Experiment on Testing Data, Conclusion)]
Training vs Testing Error

• Performance can only be measured by testing predictions with new data
• When validating a model, ignore performance on Training Data
• Don’t over-optimize. Everything you optimize needs to be validated!

[Figure: Error Rate versus Model Complexity, with one curve for Training Data and one for New Data; the model is Underfit at low complexity, Good in the middle, and Overfit at high complexity]
http://gerardnico.com/wiki/data_mining/overfitting
Anatomy of Machine Learning
Modeling

Evaluation

Data Preparation

Deployment
Performance Measurement - Accuracy

Confusion Matrix Table View:

TP  FP
FN  TN

Accuracy = (TP+TN)/(TP+FP+TN+FN)
Class Precision = TP/(TP+FP) or TN/(TN+FN)
Class Recall = TP/(TP+FN) or TN/(TN+FP)
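
The three formulas above can be checked with a small sketch like the following. The function name and the counts are hypothetical; only the formulas come from the slide.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, class precision, and class recall from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision_positive": tp / (tp + fp),
        "precision_negative": tn / (tn + fn),
        "recall_positive": tp / (tp + fn),
        "recall_negative": tn / (tn + fp),
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```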
Performance Measurement - Costs

• Try to estimate the value of each possibility in the confusion matrix
• Calculate the total value of the model across many predictions
• Compare the value to a baseline to estimate gain
• Auto Model produces the results:
- Profits from Model: 1,000
- Profits for Best Option (Loyal): -1,600
- Gain: 2,600
Split Validation vs. Cross Validation

Split Validation: Performs a simple validation by randomly splitting the data into two data sets: Training & Testing.

Cross Validation: An iterative validation process in which the data is split into several subsets. Each iteration validates (tests) on one subset, using the remaining subsets as training data.

Note: In Cross Validation, the # of subsets of data = the # of iterations

Cross Validation

In each iteration, one of the five subsets is the Validation Subset and the remaining four are Training Subsets.

Iteration            1     2     3     4     5
Validation Accuracy  80%   88%   85%   78%   84%

Final Accuracy = Average(Iteration 1, Iteration 2, …) = 83%
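
A minimal sketch of how the subsets could be generated and how the final accuracy is averaged. The helper `k_fold_indices` is hypothetical and splits the examples in order; in practice the data would usually be shuffled or stratified first. The accuracies are the five values from the slide.

```python
def k_fold_indices(n_examples, k):
    """Yield (training_indices, validation_indices); each example is validated exactly once."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

folds = list(k_fold_indices(n_examples=10, k=5))
for iteration, (training, validation) in enumerate(folds, start=1):
    print(f"Iteration {iteration}: validate on {validation}, train on {training}")

# Validation accuracies from the slide, one per iteration.
accuracies = [0.80, 0.88, 0.85, 0.78, 0.84]
final_accuracy = sum(accuracies) / len(accuracies)
print(f"Final accuracy: {final_accuracy:.2f}")
```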
Proper Validation

Use these things:
• Proper validation schemes like split-validation or cross-validation
• Modular validation schemes
• Multi-level nesting

Check for these things:
• Data transformations or feature engineering that are dependent on observed data examples
• Low-level model parameter tuning, or model creation
• Pre-processing models
• Hyperparameter or high-level model optimization
• Model selection

Normalization Methods

• Range Transformation
• Proportional (Sum)
• Interquartile Range
• Z-Transformation

[Figure: the same data after Z-transformation, Proportion (Sum), Interquartile Range, and Range normalization]

http://www.statistics4u.info/fundstat_eng/ee_ztransform.html
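
A sketch of the four normalization methods on a single numeric attribute. The function names are hypothetical. The exact formulas used by RapidMiner's Normalize operator are not given on the slide, so the choices below (sample standard deviation for the Z-transformation, and median-centering divided by the interquartile range) are assumptions.

```python
import statistics

def range_transform(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the smallest value maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def proportion_transform(values):
    """Express each value as its share of the attribute's sum."""
    total = sum(values)
    return [v / total for v in values]

def z_transform(values):
    """Subtract the mean and divide by the standard deviation (sample std assumed)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

def iqr_transform(values):
    """Subtract the median and divide by the interquartile range (assumed formula)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]

data = [2.0, 4.0, 6.0, 8.0, 10.0]
print(range_transform(data))
print(proportion_transform(data))
print(z_transform(data))
print(iqr_transform(data))
```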
Group Models

This operator groups the given models into a single combined model. When this combined model is applied, it is equivalent to applying the original models in their respective order.
Supervised Learning

• Regression: Linear, Logistic, GLM
• Naïve Bayes
• Decision Tree
• Neural Networks
Linear Regression

Linear Regression is a statistical-model-based supervised learning algorithm for Regression
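Understanding Linear Regression

Goal → Minimize error

$y_i = \alpha + \beta x_i + \varepsilon_i$, where $\alpha$ = intercept and $\beta$ = slope

Analytical solution: $\hat{\beta} = (X'X)^{-1} X'y$

[Figure: fitted regression line; annotation "dominates RSS"]

Ridge Regression

Minimize $\mathrm{RSS} + \lambda \sum_j \beta_j^2$

Helps with multicollinearity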
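
For a single input attribute, the least-squares solution reduces to a ratio of sums, which the following sketch computes directly. The function names and the toy data are hypothetical; this illustrates minimizing the residual sum of squares (RSS) and does not include the ridge penalty.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least-squares estimates of intercept (alpha) and slope (beta)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def residual_sum_of_squares(xs, ys, alpha, beta):
    """RSS: the error the fit minimizes."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data generated from y = 1 + 2x with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

alpha, beta = fit_simple_linear_regression(xs, ys)
print(alpha, beta, residual_sum_of_squares(xs, ys, alpha, beta))
```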
Converting Nominal to Numerical Data

• Dummy Coding
• Effect Coding
• Integer Coding
- Avoid if the data is not fundamentally numeric…
- And still avoid if it is fundamentally numeric!
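
A sketch of the difference between dummy coding and effect coding for one nominal attribute. The helper names and column naming are hypothetical, and the choice of reference (comparison) category is an assumption made for the illustration.

```python
def dummy_coding(values, reference):
    """One 0/1 column per non-reference category; the reference category is all zeros."""
    categories = [c for c in sorted(set(values)) if c != reference]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

def effect_coding(values, reference):
    """Like dummy coding, but the reference category is coded -1 in every column."""
    categories = [c for c in sorted(set(values)) if c != reference]
    return [{f"is_{c}": (-1 if v == reference else int(v == c)) for c in categories}
            for v in values]

outlook = ["sunny", "overcast", "rain"]

for value, row in zip(outlook, dummy_coding(outlook, reference="rain")):
    print("dummy ", value, row)
for value, row in zip(outlook, effect_coding(outlook, reference="rain")):
    print("effect", value, row)
```

Integer coding (sunny = 1, overcast = 2, rain = 3) would instead impose an order and spacing that the categories do not have, which is why the slide advises against it.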
Performance (Regression)
Performance Metrics for Regression

Popular performance metrics for linear regression include:
• Root Mean Squared Error (RMSE)
• Absolute Error
• Relative Error
• R2 (Squared Correlation)

[Figure: example fit with R2 = 0.97]
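
The four metrics can be sketched as follows. The function name and data are hypothetical, and the definitions used (mean absolute error, mean of absolute error divided by the actual value, and squared Pearson correlation between actual and predicted values) are assumptions about what the slide's metric names mean.

```python
import math

def regression_metrics(actual, predicted):
    """RMSE, mean absolute error, mean relative error, and squared correlation."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    absolute_error = sum(abs(e) for e in errors) / n
    relative_error = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "rmse": rmse,
        "absolute_error": absolute_error,
        "relative_error": relative_error,
        "squared_correlation": cov ** 2 / (var_a * var_p),
    }

# Hypothetical values for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```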
Logistic Regression

Logistic Regression is a statistical-model-based supervised learning algorithm for Classification
Logistic Regression Transformation

• Logit: $\ln[p/(1-p)] = a + BX$
• Logistic: $p = \dfrac{e^{a+BX}}{1 + e^{a+BX}}$
• where:
- ln is the natural logarithm, $\log_e$
- p is the probability that the event Y occurs, p(Y=1)
- p/(1-p) = "odds"
- ln[p/(1-p)] = log odds, or "logit"
• Otherwise like a linear model
• The resulting B coefficient is the change in the log odds per unit increase in X; $e^B$ is the corresponding odds ratio
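
The logit and logistic transformations are inverses of each other, which a short sketch can confirm. The coefficient values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

def logistic(a, b, x):
    """Probability implied by the linear predictor a + b*x."""
    z = a + b * x
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical coefficients for illustration only.
a, b = -1.0, 0.5

p_at_2 = logistic(a, b, 2.0)   # linear predictor is 0, so p is 0.5
p_at_3 = logistic(a, b, 3.0)   # one unit more of x

# Ratio of the odds at x = 3 to the odds at x = 2.
odds_ratio = (p_at_3 / (1 - p_at_3)) / (p_at_2 / (1 - p_at_2))
print(p_at_2, p_at_3, odds_ratio, math.exp(b))
```

The last two printed values agree: raising x by one unit multiplies the odds by e^b.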
Converting Nominal to Binominal Data
Generalized Linear Model (GLM)

GLM is a statistical-model-based supervised learning algorithm for Classification and Regression
GLM

• GLM is a generalization of both linear regression and logistic regression
• The model can handle a label that is:
- Numerical with a Gaussian family
- Binominal with a Binomial family
- Polynominal with a Multinomial family
• RapidMiner can assign an appropriate family and solver
• It is robust because it has regularization and can handle missing values
• GLM brings together many of the advantages of regression in a flexible framework
• Expect to be able to interpret the model and results like you would with Linear Regression or Logistic Regression
• Don’t expect the results to be a perfect match with Linear Regression or Logistic Regression.
Naïve Bayes

Naïve Bayes is a statistical-model-based supervised learning algorithm for Classification
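Bayes’ Rule

$P(C_i \mid x) = \dfrac{P(C_i)\, P(x \mid C_i)}{P(x)}$

- $P(C_i \mid x)$: Posterior probability of class $C_i$ given instance $x$
- $P(C_i)$: Prior probability
- $P(x \mid C_i)$: likelihood
- $P(x)$: Evidence (Constant across classes!)

Prediction: compare $P(C_i)\, P(x \mid C_i)$ across classes.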
Golf data set – When to play golf?

Outlook Temperature Humidity Wind Play

sunny High High false no
sunny Mild High true no
overcast High High false yes
rain Cool High false yes
rain Cool High false yes
rain Cool Normal true no
overcast Cool Normal true yes
sunny Mild High false no
sunny Cool Normal false yes
rain Mild High false yes
sunny Mild Normal true yes
overcast Mild High true yes
overcast High Normal false yes
rain Mild High true no
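Number of Cases and Probabilities

Number of cases
Attribute    Value     Yes  No
Outlook      sunny     2    3
Outlook      overcast  4    0
Outlook      rain      3    2
Temperature  Cool      4    1
Temperature  Mild      3    3
Temperature  High      2    1
Humidity     Normal    4    1
Humidity     High      5    4
Wind         false     6    2
Wind         true      3    3
Play (all)             9    5

Probabilities
Attribute    Value     Yes  No
Outlook      sunny     0.2  0.6
Outlook      overcast  0.4  0.0
Outlook      rain      0.3  0.4
Temperature  Cool      0.4  0.2
Temperature  Mild      0.3  0.6
Temperature  High      0.2  0.2
Humidity     Normal    0.4  0.2
Humidity     High      0.6  0.8
Wind         false     0.7  0.4
Wind         true      0.3  0.6
Play (all)             0.6  0.4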
Likelihood Calculation

Instance: Outlook = sunny, Temperature = Cool, Humidity = High, Wind = true. Each score multiplies the four conditional probabilities by the class prior. The factors are shown rounded; the products use the exact fractions.

YES  0.2 x 0.4 x 0.6 x 0.3 x 0.6 = 0.011758  →  36.4%
NO   0.6 x 0.2 x 0.8 x 0.6 x 0.4 = 0.020571  →  63.6%

The percentages are the two scores normalized to sum to 100%.
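
The calculation can be reproduced from the golf data set with a short sketch. The function name is hypothetical, and this version uses raw frequency ratios with no Laplace correction, which is an assumption chosen to match the numbers on the slide.

```python
# Golf data set: (Outlook, Temperature, Humidity, Wind, Play)
golf = [
    ("sunny", "High", "High", "false", "no"),
    ("sunny", "Mild", "High", "true", "no"),
    ("overcast", "High", "High", "false", "yes"),
    ("rain", "Cool", "High", "false", "yes"),
    ("rain", "Cool", "High", "false", "yes"),
    ("rain", "Cool", "Normal", "true", "no"),
    ("overcast", "Cool", "Normal", "true", "yes"),
    ("sunny", "Mild", "High", "false", "no"),
    ("sunny", "Cool", "Normal", "false", "yes"),
    ("rain", "Mild", "High", "false", "yes"),
    ("sunny", "Mild", "Normal", "true", "yes"),
    ("overcast", "Mild", "High", "true", "yes"),
    ("overcast", "High", "Normal", "false", "yes"),
    ("rain", "Mild", "High", "true", "no"),
]

def naive_bayes_scores(rows, instance):
    """Prior times the product of per-attribute conditional probabilities, per class."""
    scores = {}
    for label in sorted({row[-1] for row in rows}):
        class_rows = [row for row in rows if row[-1] == label]
        score = len(class_rows) / len(rows)  # prior probability of the class
        for i, value in enumerate(instance):
            matches = sum(1 for row in class_rows if row[i] == value)
            score *= matches / len(class_rows)  # conditional probability
        scores[label] = score
    return scores

scores = naive_bayes_scores(golf, ("sunny", "Cool", "High", "true"))
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(scores)
print(posteriors)
```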
Decision Tree

Decision Tree is a statistical-model-based supervised learning algorithm for Classification and Regression
Understand the Decision Tree
Popularity of Decision Trees

Decision Trees are very popular models for several reasons:

• Very simple to understand
• Can deal easily with interactions and non-linear effects
• Aren’t bothered by non-normalized data, missing values,
untransformed dates, multi-collinearity, and “messy” data
• Easily implemented in other platforms (try the “Tree to Rules”
operator)
• HOWEVER, they are not good at extrapolation!
• And also very prone to overfitting!!
Decision Tree Overfitting & Pruning

Source: lecture slides for the textbook Machine Learning, © Tom M. Mitchell, McGraw Hill, 1997
Extrapolation

Number of Candy Bars (Num)   Cost
1                            $1
2                            $2
3                            $3

Num >= 2?
- No: $1
- Yes: Num >= 3?
  - No: $2
  - Yes: $3
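
Writing the tree from this slide as nested rules shows why it cannot extrapolate: every input beyond the training range lands in the last leaf. The function name is hypothetical.

```python
def predict_cost(num):
    """The candy bar tree as nested rules; returns the predicted cost in dollars."""
    if num >= 2:
        if num >= 3:
            return 3
        return 2
    return 1

for num in (1, 2, 3, 10):
    print(f"{num} candy bars -> ${predict_cost(num)}")
```

Ten candy bars are still predicted to cost $3, because the tree can only return values it saw during training.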
Decision Tree Operator
Neural Net

A Neural Net is a structural supervised learning algorithm for Classification and Regression
Selected Highlights From Neural Net History

• 1940’s, McCulloch-Pitts: Artificial Neuron based on a simple sum of binary inputs
• 1950’s, Rosenblatt: Learning rule to train perceptron with different weights and thresholds
• 1980’s, Hinton-Rumelhart-Williams: Major successes built on multi-layer networks with backpropagation on sigmoid function
Perceptron

f(example) = f(α*Age, β*Avg Transaction, . . . )

• All of the inputs are combined in a linear fashion and processed by some activation or link function, f
• It is not too unlike Logistic Regression or GLM
• Instead of predicting the label, it may only be predicting a predictor!

[Figure: input layer (Age, Avg Transaction, Gender) feeding a single neuron that outputs f(example)]
A Neural Net

[Figure: Input layer (Age, Avg Transaction, Gender, Last Transaction), Hidden layer(s) of neurons, and an Output layer with one neuron per class (Churn, Loyal)]

• Output neurons produce values between 0 and 1
• The one with the highest value is chosen as predicted class

How does it work?

• Each connection carries a weight, which can be positive or negative (α1, α2, α3 from Age; β1, β2, β3 from Avg Transaction; and so on)
• Each hidden neuron applies f to its weighted inputs:
  f1 = f(α1 . Age, β1 . Avg Transaction, . . . )
  f2 = f(α2 . Age, β2 . Avg Transaction, . . . )
  and likewise for f3
• The output neurons f4 (Churn) and f5 (Loyal) are computed the same way from f1, f2, and f3
• Example: Age = 25, Avg Transaction = 12.5, Gender = 1, Last Transaction = 10-08-2015 gives the prediction Loyal
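
A minimal sketch of the forward pass through such a network, with three hidden neurons (f1 to f3) and two output neurons (f4 for Churn, f5 for Loyal). All weights, biases, and the normalized input values are hypothetical, and the sigmoid activation is an assumption; the slides only say that f is some activation function.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each neuron applies the activation to its weighted sum of inputs."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
            for neuron_weights, bias in zip(weights, biases)]

# Hypothetical, already-normalized inputs: Age, Avg Transaction, Gender, Last Transaction
example = [0.25, 0.125, 1.0, 0.4]

# Hypothetical weights; in a real net these come from training.
hidden_weights = [[0.5, -0.2, 0.1, 0.3],
                  [-0.4, 0.6, 0.2, -0.1],
                  [0.3, 0.3, -0.5, 0.2]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.3, 0.2],
                  [-0.6, 0.4, 0.5]]
output_biases = [0.0, 0.0]

hidden = layer(example, hidden_weights, hidden_biases)    # f1, f2, f3
outputs = layer(hidden, output_weights, output_biases)    # f4 (Churn), f5 (Loyal)

classes = ["Churn", "Loyal"]
predicted = classes[outputs.index(max(outputs))]
print(hidden, outputs, predicted)
```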
Where do Weights Come From?

Similar to the following:

• Create structure and Initialize
weights with random values
• Loop (training cycles)
- Forward propagation over layers
- Compute loss
- Backward propagation over layers
- Update weights
• Save trained neural net
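
The loop above can be sketched for the smallest possible case, a single sigmoid neuron. The function names, the OR data set, and the hyperparameter values are hypothetical, and cross-entropy is an assumed choice of loss; with a sigmoid output its gradient with respect to the weighted sum is simply (output - target).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(examples, training_cycles=2000, learning_rate=0.5, seed=0):
    """Train one sigmoid neuron; returns weights, bias, and the mean loss per cycle."""
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    # Create structure and initialize weights with random values.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = rng.uniform(-0.5, 0.5)
    losses = []
    for _ in range(training_cycles):
        cycle_loss = 0.0
        for inputs, target in examples:
            # Forward propagation
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            # Compute loss (cross-entropy)
            cycle_loss += -(target * math.log(output)
                            + (1 - target) * math.log(1 - output))
            # Backward propagation: gradient of the loss w.r.t. the weighted sum
            delta = output - target
            # Update weights
            weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
        losses.append(cycle_loss / len(examples))
    return weights, bias, losses

# Hypothetical training data: logical OR of two binary inputs.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias, losses = train_neuron(or_data)
predictions = [round(sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias))
               for inputs, _ in or_data]
print(predictions, losses[0], losses[-1])
```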
How Does Optimization Work?

• Performance is measured by some loss or cost function that tells how badly our perceptron is doing; we want to minimize loss
• Loss and Activation functions are chosen to make calculating the slope trivial
• The slope indicates direction of reduced loss
How Does Optimization Work?

• Gradient descent
• We can’t analytically solve for a minimum cost, but for a given set of weights, we can measure the cost and the direction of lower cost

[Figure: Loss, cost, or error plotted against Weight value, with numbered steps moving downhill]
Learning Rate

• Learning Rate is a parameter that controls the size of the steps
• A high Learning Rate may skip right over an optimal value and never find it
• A low Learning Rate may take too many iterations to get close

[Figure: Loss, cost, or error plotted against Parameter value or weight, with numbered steps]

Momentum

• Momentum will continue additional tests in the same direction past the best known value
• A low momentum can increase the chance of getting stuck in a local optimum
• A high momentum can result in needless computation

[Figure: Loss, cost, or error plotted against Parameter value or weight, with numbered steps continuing past a local optimum]
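
A sketch of gradient descent on a one-dimensional loss, showing the effect of the learning rate and of a momentum term. The loss function, the function names, and all parameter values are hypothetical. The momentum update used here (a velocity that keeps a fraction of the previous step) is one common formulation and is an assumption, not necessarily what the Neural Net operator implements.

```python
def gradient_descent(gradient, start, learning_rate, momentum=0.0, steps=100):
    """Follow the negative gradient; momentum carries over part of the previous step."""
    weight, velocity = start, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * gradient(weight)
        weight += velocity
    return weight

def loss_gradient(w):
    """Slope of the hypothetical loss (w - 3)^2, which has its minimum at w = 3."""
    return 2 * (w - 3)

too_low = gradient_descent(loss_gradient, start=0.0, learning_rate=0.001)
reasonable = gradient_descent(loss_gradient, start=0.0, learning_rate=0.1)
too_high = gradient_descent(loss_gradient, start=0.0, learning_rate=1.1)
with_momentum = gradient_descent(loss_gradient, start=0.0, learning_rate=0.1, momentum=0.5)

print(too_low, reasonable, too_high, with_momentum)
```

After 100 steps the lowest learning rate is still far from 3, the highest one has overshot further on every step and diverged, and the other two runs have converged.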

Neural Net Operator
Supervised Learning

• K-NN
- Lazy Structure of distances
- Classification or Regression
• Linear Regression
- Interpretable Statistical Model
- Regression
• Logistic Regression
- Interpretable Statistical Model
- Classification
• GLM
- Interpretable Statistical Model
- Classification or Regression
• Naïve Bayes
- Interpretable Statistical Model
- Classification
- Fast, but limited
• Decision Tree
- Interpretable Structure of Rules
- Classification or Regression
- Prone to overfit
• Neural Net
- Uninterpretable Structure
- Classification or Regression
- Extremely powerful
Deployment & Scoring

1. Deployment
2. Scoring
Deployment
Deployment Overview

Terminology
• The model is trained,
validated, and ready for
production
• A deployment is a place for
models of one purpose
• A deployment location is a
place for deployments of
similar access methods
• Deploy is the action of putting
a model in its place
Deployment in RapidMiner

• Manually Deploy: Build your own custom processes to manage the deployment
• OR: Deploy with built-in tools!
Scoring

Take New data… Apply the model... Get a prediction!

Transaction → Score

• Score in a local process…
• Or use the tools!
Unsupervised Learning

1. Attribute Correlations
2. Clustering
3. Association Mining
Attribute Correlations
Correlations

• Correlations between regular numeric attributes can show us if we have redundant features
• This information is helpful for building understanding and improving the model
Correlations
Clustering

• One of the most common types of clustering is k-Means Clustering
• The k-Means algorithm uses a set number of clusters, and iteratively recalculates the center of each cluster
k-Means Clustering

1. Initialization
• Randomly pick k new points and assign them to k unique clusters
• Each of these k points becomes the centroid of its own cluster

2. Segmentation
• For each observed point, find which centroid it is closest to, and assign it to that cluster

3. Update centroids (means)
• For each cluster, calculate an updated centroid

Segmentation and Update centroids are then repeated; the slides step through three Segmentation passes and two updates.

[Figure: scatter of points with centroids A, B, and C moving toward the cluster centers over successive iterations]

k-Means Clustering

• k sets the number of clusters
• Measure types sets how distances are measured
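
The Initialization, Segmentation, and Update steps can be sketched as follows. The function name and toy points are hypothetical. Two simplifications are assumptions of this sketch: the initial centroids are sampled from the observed points, and the distance measure is fixed to squared Euclidean distance.

```python
import random

def k_means(points, k, iterations=10, seed=0):
    """Plain k-Means: returns the final centroids and the points assigned to each."""
    rng = random.Random(seed)
    # Initialization: pick k starting centroids (here: k of the observed points).
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Segmentation: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [sum((a - b) ** 2 for a, b in zip(point, centroid))
                         for centroid in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update centroids: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```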
X-Means Clustering

• One drawback of k-Means is that the number of clusters must be defined before clustering
• There is no universal method for picking the best number of clusters
• X-Means can often provide a useful option

X-Means
• Loop over range of values for k
• For each k run k-Means
- Each k-Means runs as normal
Association Mining

• Suppose there are 5,000 different items.
• Customers lump items in a shopping cart or basket, and purchase several items at a time.
• A selected combination of items, or association, can be analyzed for relevance.
• Many possible combinations:
- >20 billion combinations of 3 items
- >26 trillion combinations of 4 items
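
The two counts follow from the binomial coefficient "5,000 choose k", which can be checked directly. The variable names are only for illustration.

```python
import math

n_items = 5000

# Number of distinct unordered combinations of 3 and of 4 items.
combinations_of_3 = math.comb(n_items, 3)
combinations_of_4 = math.comb(n_items, 4)

print(f"{combinations_of_3:,}")
print(f"{combinations_of_4:,}")
```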
Association Mining

Association Analysis can be thought of as a three-step process:

1. Data Prep varies widely; different item set methods can take a variety of data structures.
2. Identifying frequent item-sets is commonly done with the FP-Growth operator.
3. The Create Association Rules operator is used for rule generation. It’s important to understand each of the different criteria including support, confidence, and lift.
FP-Growth

Generating the FP-Tree

• Get Frequency of each term
• Sort items within each basket by overall item frequency
• Add each basket to the FP-Tree
- Start with the first item: is it already in the tree or not?
  - Yes? Follow the existing branch
  - No? Split off and create a new branch
• Example FP-Tree after adding a basket {Item 1, Item 2}:

Root
  Item 1
    Item 2

FP-Growth

Now add baskets:

• {Item 1, Item 3}
• {Item 3, Item 4}
• {Item 1, Item 2, Item 4}
• {Item 1, Item 2} …

Root
  Item 1
    Item 2
      Item 4
    Item 3
  Item 3
    Item 4
Create Association Rules

Inspect associations with simple statistics:

• Frequency(X) = Number of times X appears
• Support(X) = Frequency(X) / (Number of baskets)
• Confidence(X -> Y) = Support(X and Y) / Support(X)
• Lift(X -> Y) = Support(X and Y) / ( Support (Y) * Support(X) )
• Conviction(X -> Y) = ( 1 – Support(Y) ) / ( 1 – Confidence(X -> Y) )

These statistics can be used to create a threshold and provide more insight into the strength of the rule.
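
The statistics can be computed directly from a list of baskets. The sketch below reuses the five baskets from the FP-Growth example; the function names are hypothetical, and returning infinity for conviction when confidence is 1 is an assumption made to avoid dividing by zero.

```python
baskets = [
    {"Item 1", "Item 2"},
    {"Item 1", "Item 3"},
    {"Item 3", "Item 4"},
    {"Item 1", "Item 2", "Item 4"},
    {"Item 1", "Item 2"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(x, y, baskets):
    return support(x | y, baskets) / support(x, baskets)

def lift(x, y, baskets):
    return support(x | y, baskets) / (support(x, baskets) * support(y, baskets))

def conviction(x, y, baskets):
    conf = confidence(x, y, baskets)
    if conf == 1:
        return float("inf")  # the rule never fails in this data
    return (1 - support(y, baskets)) / (1 - conf)

x, y = {"Item 1"}, {"Item 2"}
print("Support(X):", support(x, baskets))
print("Confidence(X -> Y):", confidence(x, y, baskets))
print("Lift(X -> Y):", lift(x, y, baskets))
print("Conviction(X -> Y):", conviction(x, y, baskets))
```

A lift above 1 indicates that Item 1 and Item 2 appear together more often than they would if they were independent.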
Feature Engineering

1. Introduction to Feature Engineering
2. Feature Weighting
Introduction to Feature Engineering
Feature Engineering

• Feature Generation – create new attributes (Create useful attributes)
• Feature Selection – reduce the number of attributes (Weight & Select Attributes)
- Correlation
- Information Gain
- Relief
Feature Generation

Feature Generation or Feature Engineering is the process of transforming raw data in order to make it more useful or more stable for predictive modeling purposes.

• Select approaches to feature engineering:
- Functional transformations
- Counts, sum, average, min/max/range, ratios
- Interaction effect variables
- Binning continuous variables
- Combining high cardinality nominal variables
- Date/time calculations

Automatic Feature Engineering tools can create many new features using these techniques.
Feature Selection

Whenever you have a large number of highly correlated possible input variables, it can cause problems for various modeling or machine learning algorithms, including high memory usage and matrix sparsity.

Feature Selection is the general process of reducing the number of attributes:
• Manual selection
• Attribute correlation
• Model Performance
• Data-driven variance methods (e.g. PCA)
Feature Weighting

Feature Weighting is the general process of scoring attributes according to some measure of their importance for modeling.

• Common weighting methods include:
- Weight by Correlation or squared correlation: based on correlation of each attribute with the label
- Weight by Information Gain: like Information Gain in a Decision Tree
- Weight by Relief: based on predictive power between close examples with different labels
• Feature Weights are often used for Feature Selection
• Feature Weights can also be used in other ways like scaling data
• Understand, be able to use, and interpret the results of Feature Importance or Weighting. Feature weights can be used by some, but not all, model types. There are different ways to weight the attributes, and care is needed to select a good technique for a given problem. Some of the most common choices are the Weight by Relief, Weight by Information Gain, and Weight by Correlation operators.

• http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
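
A sketch of weighting attributes by their correlation with the label and then using the weights for selection. The function names, attribute names, and data are hypothetical; using the absolute Pearson correlation as the weight is an assumption about how such an operator might score attributes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def weight_by_correlation(attributes, label):
    """Weight each attribute by the absolute correlation with the label."""
    return {name: abs(pearson(values, label)) for name, values in attributes.items()}

# Hypothetical data for illustration only.
label = [1.0, 2.0, 3.0, 4.0, 5.0]
attributes = {
    "age": [2.0, 4.0, 6.0, 8.0, 10.0],    # perfectly correlated with the label
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],   # weakly correlated with the label
}

weights = weight_by_correlation(attributes, label)
# Feature selection: keep the top-weighted attribute.
selected = sorted(weights, key=weights.get, reverse=True)[:1]
print(weights, selected)
```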
Auto Model

1. Clustering
2. Supervised Learning
3. Deployment
Auto Model for Clustering

Auto Model can be used for:
• Supervised Learning
• Clustering
• Correlations
• Outlier Detection
• Feature Engineering
• Help with correct validation
• Preparation for Deployment

• Supervised Learning (aka Predictive Analytics)
- Classification: Is this A or B? Will this be A or B?
- Regression: How much or how many? How many will happen?
• Unsupervised Learning
- Clustering: How is this organized? What belongs together?
- Outlier Detection: Is this weird?
- Associations & Correlations: What happens together? What belongs together?
• Feature Engineering
- Feature Generation: Create useful attributes
- Feature Selection: Weight & Select Attributes
You should understand, and be able to use, the following
machine learning tools in RapidMiner Studio’s Process Designer and
Auto Model:
• Classification and Regression
• Split Validation
• Scoring
• Correlations
• Feature Importance
• Clustering and Association Analysis
Next Steps

Get Certified!
https://academy.rapidminer.com/pages/certification

Area                                          Next Course
Data Understanding and Data Preparation       Data Engineering Master
Model Selection, Evaluation, and Validation   Machine Learning Master

RapidMiner Academy: https://academy.rapidminer.com
Online Documentation: https://docs.rapidminer.com/studio
Online Community: https://community.rapidminer.com/
