Introduction to Machine Learning and Data Mining
Arturo J. Patungan, Jr.
University of Sto. Tomas
StrandAsia
Outline
Day 1
• Introduction to the RapidMiner Interface
• Data Preparation, Basic Descriptive Statistics, Cleaning
• Data Visualization and Exploratory Analysis
• Building a Model
– Classification Model
– Regression Model
– Clustering Model
– Association and Correlation Models
– Anomaly Detection
• Applying the Model
– Classification Model
– Regression Model
– Clustering Model
– Association and Correlation Models
– Anomaly Detection Samples
Day 2
• Testing the Model
• Validating the Model
• Finding the Right Model
Day 3
• Optimization of Model Parameters
• Automated Model Selection and Optimization
• Case Study
RapidMiner Studio Interface and Basic Data Processing
Introduction to RapidMiner Studio
[Figure: RapidMiner Studio window with its labeled areas: Repository/Source tabs, Parameter tabs, Canvas, Operators/Analysis tabs, and Description tabs]
How to Import Data?
1. Create a data set within RapidMiner.
– Click “File”, then “Import Data”.
– Choose the source of your data set.
– Select the data and click “Next”.
– Configure the data, then click “Next”.
– Save the data to your repository and click “Finish”.
2. Use a “Read Data” operator.
– Locate the “Read Data” operators by typing “Read” in the Operators area.
– Drag the “Read Data” operator that you will use onto the canvas.
– In the Parameter tab, set the data you need for analysis by browsing to the file or by using the “Import Configuration Wizard”.
Data Viewing and Exploratory Analysis
• To view the data set and its descriptive and diagnostic information, connect the data set (or “Read Data”) node to the result knob (“res”).
• Click “Run” to view.
• Click the “Results” tab to view the data that were loaded.
• To find the basic statistics of each attribute, click the “Statistics” tab.
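As a rough illustration of what a per-attribute statistics view summarizes, here is a minimal Python sketch written outside RapidMiner. The attribute names (`age`, `income`), the toy rows, and the use of `None` for a missing value are all assumptions made for this example.

```python
from statistics import mean

# Toy data set: each row is one example; None marks a missing value (assumption).
rows = [
    {"age": 25, "income": 30000},
    {"age": 40, "income": None},
    {"age": 35, "income": 52000},
]

def describe(rows, attribute):
    """Return the missing count, min, max, and average of one numeric attribute."""
    values = [r[attribute] for r in rows if r[attribute] is not None]
    return {
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "average": mean(values),
    }

for name in ("age", "income"):
    print(name, describe(rows, name))
```

Missing values are left out of the min, max, and average, and are reported separately, which is why they matter for the data preparation steps that follow.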
Quick Visualization
• For a quick visualization of the data, go to the “Results” tab.
• There are two ways to open a visualization:
– In the “Statistics” tab, click the row of the attribute that you want to view, then click “Open Visualization”.
– Click the “Visualization” tab and specify the graph and variables that you want to see.
Data Preparation
1. Removing cases with missing data
– Get “Filter Examples” from the Operators tab, then drag and drop it on the line connecting the “Read Data” operator and the “res” knob in the canvas.
– In the Parameter tab, click “Add Filters”.
– Select the attribute, then select “no missing attributes” as the condition class.
– You can add more criteria by clicking “Add Entry”.
– When you have included all the attributes you want to filter, click “OK” to close.
– Click “Run”.
2. Imputing missing data
– Get “Replace Missing Values” from the Operators tab.
– Drag and drop it on the line connecting the “Read Data” operator and the “res” knob.
– In the Parameter tab, select how many attributes you want to impute.
– Select the attribute/s that you want to impute by clicking “Select Attribute”.
– Highlight each attribute and transfer it to the “Selected Attribute” bin. Click “Apply” to proceed.
– Select the replacement method you want in “Default”, then click “Run”.
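The two options above, dropping incomplete cases or imputing them, can be sketched in plain Python as follows. This is only an illustration outside RapidMiner: the rows are invented, `None` stands for a missing value, and replacement by the average is just one possible method.

```python
from statistics import mean

# Toy rows; None marks a missing value (an assumption of this sketch).
rows = [
    {"id": 1, "age": 30, "income": 40000},
    {"id": 2, "age": None, "income": 50000},
    {"id": 3, "age": 50, "income": None},
    {"id": 4, "age": 40, "income": 60000},
]

def drop_missing(rows, attributes):
    """Keep only the rows that have no missing value in the given attributes."""
    return [r for r in rows if all(r[a] is not None for a in attributes)]

def impute_average(rows, attribute):
    """Replace missing values of one attribute with the average of the observed ones."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    return [{**r, attribute: fill if r[attribute] is None else r[attribute]} for r in rows]

complete = drop_missing(rows, ["age", "income"])
imputed = impute_average(impute_average(rows, "age"), "income")
print(len(complete), imputed[1]["age"], imputed[2]["income"])
```

Dropping keeps fewer rows but only real observations; imputing keeps every row at the cost of inserting estimated values.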
Data Preparation
3. Addressing data with wrong encoding and duplicates
– To remove white space in the encoded values, use the “Trim” operator.
– To remove duplicates, use the “Remove Duplicates” operator.
– To recode mistyped values, use the “Replace” operator.
– Click “Run” to verify your process.
Other Pre-Processing Steps
• Using the same data set for different processes
– Use the “Multiply” operator.
• Joining two data sets (when two data sets need to be merged for an analysis)
– Find “Join” in the Operators tab.
– Connect the first data set to the left input of the “Join” operator and the other data set to the right input.
– Edit the “key attributes” in the Parameter tab.
– The key attribute is what links the two data sets, for example the “Customer ID” shared by the “Order Detail” and “Customer Detail” data.
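The cleaning and joining steps described so far (trim, remove duplicates, recode, then join on a key) can be pictured with the plain-Python sketch below. The two tables, their column names, and the mistyped city are hypothetical, and an inner join with a unique key in the right-hand table is assumed.

```python
# Hypothetical tables echoing the slide's example: order details and customer details.
orders = [
    {"customer_id": " C1 ", "item": "Laptop"},
    {"customer_id": "C2", "item": "Phone"},
    {"customer_id": "C2", "item": "Phone"},    # exact duplicate
    {"customer_id": "C3", "item": "Tablet"},   # no matching customer
]
customers = [
    {"customer_id": "C1", "city": "Manila"},
    {"customer_id": "C2", "city": "Mnila"},    # mistyped value to recode
]

def trim(rows):
    """Strip leading and trailing white space from every string value."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()} for r in rows]

def remove_duplicates(rows):
    """Drop rows that repeat an earlier row exactly, keeping the first occurrence."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def replace(rows, attribute, wrong, right):
    """Recode one mistyped value of an attribute."""
    return [{**r, attribute: right if r[attribute] == wrong else r[attribute]} for r in rows]

def inner_join(left, right, key):
    """Combine the rows of both tables that share the same key value."""
    index = {r[key]: r for r in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

clean_orders = remove_duplicates(trim(orders))
clean_customers = replace(trim(customers), "city", "Mnila", "Manila")
joined = inner_join(clean_orders, clean_customers, "customer_id")
print(joined)
```

Trimming before the join matters here: the key `" C1 "` would not match `"C1"` otherwise, and the order would silently drop out of the joined result.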
Activity 1.
• Using the “bankdata” and “bankdata status” data sets, perform the following:
1. Load the data and create a RapidMiner data file.
2. Load the data using the “Read Excel” operator.
3. Join the two data sets using “id” as the key attribute.
4. Use the “Multiply” operator to perform the following:
– Remove the cases with missing data
– Impute the data using the correct method of imputation
– Trim, remove the duplicate data, and correct the incorrectly encoded data
Building a Model in RapidMiner
Types of Models
• Classification Models (Is this A or B?)
– Logistic Regression
– Decision Tree
– Random Forest
– Naïve Bayes
– ANN
– SVM
• Regression Models (How much or how many?)
– Linear Regression
– Non-Linear Regression
– ANN
– SVM
• Clustering Models (How is this organized? What belongs together?)
– k-Means
– X-Means
– DBSCAN
– GMM (Gaussian Mixture)
– Hierarchical
– ANN
– SVM
• Association and Correlation Models (What happens together? What changes together?)
– Correlation
– Clustering models
• Anomaly Detection Models (Is this weird?)
– Outlier detection
– Classification models
– Regression models
Need a little help? https://mod.rapidminer.com/
How to Build a Classification Model
• Logistic Regression
– Load the data set.
– Check your data set for possible problems.
– Apply the necessary data preparation (missing data, duplicates).
– Select the attributes needed for the model.
– Set the role of the special attribute (Label).
– Find the “Logistic Regression” operator in the Operators tab and drag and drop it onto the canvas.
– Connect the “mod”, “exa”, and “wei” ports of the Logistic Regression operator to the “res” knob.
– Run the model.
• Decision Tree
– Load the data set.
– Check your data set for possible problems.
– Apply the necessary data preparation (missing data, duplicates).
– Select the attributes needed for the model.
– Set the role of the special attribute (Label).
– Find the “Decision Tree” operator in the Operators tab and drag and drop it onto the canvas.
– Connect the “mod”, “exa”, and “wei” ports of the Decision Tree operator to the “res” knob.
– Run the model.
• For the other classification models, follow the same procedure and change only the model operator.
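To give a feel for what a classification operator does once a Label is set, here is a small logistic regression fitted by gradient descent in plain Python. It is a teaching sketch under stated assumptions (one made-up numeric attribute, a 0/1 label, a fixed learning rate and epoch count), not a description of how RapidMiner fits its model.

```python
import math

# Toy training data: one numeric attribute and a binary label (made-up values).
X = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
y = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, epochs=500):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        errors = [sigmoid(w * xi + b) - yi for xi, yi in zip(X, y)]
        w -= learning_rate * sum(e * xi for e, xi in zip(errors, X)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

def predict(model, x):
    """Return label 1 if the estimated probability is at least 0.5, else 0."""
    w, b = model
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

model = fit_logistic(X, y)
print([predict(model, x) for x in X])
```

Calling `predict(model, 2.5)` on a value that was not in the training data is the same idea as applying a model to new data later in the course.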
Activity 2.
• Build a classification model for “bankdata” and “bankdata status” using:
– Logistic Regression
– Decision Tree
– Random Forest
– Naïve Bayes
How to Build a Regression Model
• Linear Regression
– Load the data set.
– Check your data set for possible problems.
– Apply the necessary data preparation (missing data, duplicates, miscoding, removing attributes not needed in the analysis).
– Set the role of the special attribute (Label).
– Check for possible multicollinearity and autocorrelation: attach the “Correlation Matrix” operator and click “Run” to look for high correlations between variables.
– Find the “Linear Regression” operator in the Operators tab and drag and drop it onto the canvas.
– Connect the “mod”, “exa”, and “wei” ports of the Linear Regression operator to the “res” knob.
– Run the model.
Activity 3: “Car sales data”
Case: You are a car dealer and you want to build a model for the resale value of a car based on its manufacturer, distance covered, type of car, brand-new price, engine, horsepower, fuel capacity, and fuel consumption.
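Two ideas sit behind the regression workflow: the correlation between predictors (the multicollinearity check) and the least-squares line itself. The sketch below shows both in plain Python with a single predictor. The numbers are invented for illustration and are not taken from the car sales data.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

# Made-up values: price when new and horsepower (predictors), resale value (label).
price = [10.0, 20.0, 30.0, 40.0]
horsepower = [100.0, 150.0, 210.0, 250.0]
resale = [6.0, 11.0, 16.0, 21.0]

# A correlation near +1 or -1 between two predictors signals multicollinearity.
print(round(pearson(price, horsepower), 3))
intercept, slope = fit_line(price, resale)
print(intercept, slope)
```

When two predictors are this strongly correlated, one common response is to keep only one of them in the model.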
How to Build a Clustering Model
• k-Means
– Load the data set.
– Check your data set for possible problems.
– Apply the necessary data preparation (missing data, duplicates, miscoding, removing attributes not needed in the analysis).
– Set the role of the special attribute, if any (clustering itself does not require a Label).
– Find the “k-Means” operator in the Operators tab and drag and drop it onto the canvas.
– Connect the output ports of the k-Means operator to the “res” knob.
– Run the model.
• Hierarchical
– Load the data set.
– Check your data set for possible problems.
– Apply the necessary data preparation (missing data, duplicates, miscoding, removing attributes not needed in the analysis).
– Set the role of the special attribute, if any (clustering itself does not require a Label).
– Find the “Hierarchical” operator in the Operators tab and drag and drop it onto the canvas.
– Connect the output ports of the Hierarchical operator to the “res” knob.
– Run the model.
Activity 4: “Benefit data”
• Using the “benefit data”, cluster the customers of a store using the 23 questions asked in a survey (ben1 to ben23).
• Using the four characteristics (convenience, service, comfort, and goods), create a clustering model for the customers.
• Compare the two results.
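For readers who want to see what k-Means does under the hood, here is a minimal plain-Python version on invented 2-D points. Using the first k points as starting centroids is an assumption that keeps the example deterministic; tools usually start from random centroids.

```python
def kmeans(points, k, iterations=10):
    """Plain k-Means on 2-D points; the first k points serve as initial centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (px, py) in enumerate(points):
            distances = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
            assignment[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment, centroids

# Made-up points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

No label is used anywhere: the groups come only from the distances between the points, which is what separates clustering from classification.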
Applying the Model
Apply???!!! Huh???!!!!
• Applying the model is the process of making predictions on new data.
• It is how we find out the accuracy and precision of the model.
• This is where Machine Learning and Data Mining start to differ from the usual statistical process.
Applying Classification Model
• Logistic Regression
– Starting from the model we made in “Building a Model”, look for the “Apply Model” operator.
– Drag and drop it onto the canvas.
– Connect the “mod” port of the “Logistic Regression” operator to the “mod” socket of the “Apply Model” operator.
– To apply the model to the same data set: connect “exa” from the “Logistic Regression” operator to the “exa” socket of the “Apply Model” operator.
– To apply the model to a new data set: connect “exa” from the source of the new data set to the “exa” socket of the “Apply Model” operator.
– Run the model.
• Decision Tree
– Starting from the model we made in “Building a Model”, look for the “Apply Model” operator.
– Drag and drop it onto the canvas.
– Connect the “mod” port of the “Decision Tree” operator to the “mod” socket of the “Apply Model” operator.
– To apply the model to the same data set: connect “exa” from the “Decision Tree” operator to the “exa” socket of the “Apply Model” operator.
– To apply the model to a new data set: connect “exa” from the source of the new data set to the “exa” socket of the “Apply Model” operator.
– Run the model.
The process is the same for the other classification models.
Activity 5: “delisting data”
• Using the “delisting” data set, build LR, DT, and ANN models and apply each model to:
– The data used in building the model
– The delisting_test data, to predict whether the company will delist from the PSE or not
Applying a Model
• Applying the model to the same data set
– Starting from the model we made in “Building a Model”, look for the “Apply Model” operator.
– Drag and drop it onto the canvas.
– Connect the “mod” port of the model operator to the “mod” socket of the “Apply Model” operator.
– Connect “exa” from the model operator to the “exa” socket of the “Apply Model” operator.
– Run the model.
• Applying the model to a different data set
– Starting from the model we made in “Building a Model”, look for the “Apply Model” operator.
– Drag and drop it onto the canvas.
– Connect the “mod” port of the model operator to the “mod” socket of the “Apply Model” operator.
– Connect “exa” from the source operator of the new data to the “exa” socket of the “Apply Model” operator.
– Run the model.
Activity 6:
• Apply the models that we built.
Model Testing and Performance Evaluation
Model Testing
• It is the process of finding out how the model performs on a given data set.
• It can be done using a “labeled” data set.
• It gives us an idea of how we could improve the model.
Ways of Doing Model Testing
• Using the result of the “Apply Model” operator, compare the predicted result with the actual result. The comparison can be done manually or using the “Performance” operator/s.
• Use the “Validation” operator/s. The two most used validation operators are:
(a) Split Validation, and
(b) Cross Validation
Split Validation vs. Cross Validation
• Split Validation
– The data analyst determines how the data will be split into a “training” data set and a “testing” data set.
– The training data is where the model learns; the testing data (hidden during training) is where we check the “knowledge” acquired from the training.
– Question: how much of the data should be used for training and how much for testing?
• Cross Validation
– The cases are split at random into k groups of approximately equal size.
– A model is built with one group omitted and then tested on that omitted group; this is repeated until each group has been omitted once.
– Issue to keep in mind: the arbitrary assignment of cases to groups can affect the error estimate.
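The difference between the two schemes is easiest to see in how the row indices are divided. The plain-Python sketch below is only an illustration: the 70/30 ratio, the five folds, and the fixed random seed are assumptions chosen for the example.

```python
import random

def split_indices(n, train_ratio=0.7, seed=0):
    """Shuffle the row indices and cut them into a training part and a testing part."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_ratio)
    return indices[:cut], indices[cut:]

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

train, test = split_indices(10)
print(len(train), len(test))
for train, test in k_fold_indices(10, k=5):
    print(sorted(test))
```

With split validation each row is used either for training or for testing; with cross validation every row is tested exactly once and used for training in the other rounds.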
How to Perform Model Testing?
• Using the first method
– Starting with the “Applying the Model” process, we can manually compare the predicted values with the actual values.
– Use the “Performance” operator to compute the performance of the model automatically. Which “Performance” operator to use depends on the model that we built and the goal of the analytics.
– Use the performance of the model to compare and improve models.
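Comparing predicted with actual labels boils down to a few counts. A minimal sketch in plain Python, using made-up labels and assuming “yes” is the positive class:

```python
def classification_performance(actual, predicted, positive="yes"):
    """Compare predicted with actual labels; return accuracy, precision, and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels: what the data says versus what a model predicted.
actual = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes", "no", "no", "yes"]
print(classification_performance(actual, predicted))
```

These measures apply to classification; a regression model would be judged with error measures on numeric predictions instead, which is why the choice of performance measure depends on the model.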
How to Perform Model Testing?
• Split Validation
– With the model we built in the “Applying the Model” process, introduce the “Split Validation” operator.
– Set the split ratio in the Parameter tab.
– Double-click the operator to go to its sub-process.
– In the training area, drag and drop the algorithm that you will use.
– Search for the “Apply Model” operator in the Operators tab. Drag and drop it in the “testing” area of the sub-process.
– Look for the “Performance” operator and also drag and drop it in the “testing” area. The “Performance” operator depends on the algorithm you used. For example, if you use a regression model, the correct operator is the “Regression Performance” one.
– Connect the ports and go out of the sub-process.
– Connect the ports and knobs and click “Run”.
• Cross Validation
– With the model we built in the “Applying the Model” process, introduce the “Cross Validation” operator.
– Set the number of “folds” and the “sampling type” in the Parameter tab.
– Double-click the operator to go to its sub-process.
– In the training area, drag and drop the algorithm that you will use.
– Search for the “Apply Model” operator in the Operators tab. Drag and drop it in the “testing” area of the sub-process.
– Look for the “Performance” operator and also drag and drop it in the “testing” area. As with split validation, the “Performance” operator depends on the algorithm you used.
– Connect the ports and go out of the sub-process.
– Connect the ports and knobs and click “Run”.
Activity 7:
• Using the models we made in the previous exercises, perform model testing.
Then what?
• Whether to use split validation and/or cross validation depends on the goal and objective of your study.
• Improve your model based on its performance by adjusting some parameter/s.
• Be careful of overfitting the model to the given data set.
How to verify if there is overfitting?
• What is “Validation”?
– It is one of the processes for finding out whether the model is overfitting.
– It uses a new data set that was not used in the model building.
– The process comes after the testing of the model.
How to perform “Validation”?
• Using the process that we built in “testing”, load a new data set.
• Perform the cleaning and pre-analysis steps (cleaning, selecting attributes, setting the role for the “Label”).
• Search for “Apply Model” in the Operators tab. Drag and drop it onto the canvas.
• Connect the “mod” port of the “Validation” operator to the “mod” port of the “Apply Model” operator.
• Connect the new data set to the “Apply Model” operator.
• Search for the “Performance” operator. Drag and drop it onto the canvas.
• Connect the necessary ports and knobs.
• Click “Run”.
• Evaluate the result.
Activity 8:
• Using the models we made in the previous exercises, perform model validation.
Finding the Right Model?
• The process allows the researcher to find which model performs well.
• The goal is to compare the models.
• Learn which models are best for a given data set.
Review
• Classification Models (Is this A or B?)
– Logistic Regression
– Decision Tree
– Random Forest
– Naïve Bayes
– ANN
– SVM
• Regression Models (How much or how many?)
– Linear Regression
– Non-Linear Regression
– ANN
– SVM
• Clustering Models (How is this organized? What belongs together?)
– k-Means
– X-Means
– DBSCAN
– GMM (Gaussian Mixture)
– Hierarchical
– ANN
– SVM
• Association and Correlation Models (What happens together? What changes together?)
– Correlation
– Clustering models
• Anomaly Detection Models (Is this weird?)
– Outlier detection
– Classification models
– Regression models
Need a little help? https://mod.rapidminer.com/
Finding the Right Model
• Methods that one could use:
– Individually create the models and compare the performance of each.
– Simultaneously run the models in one canvas.
– Use the “Compare ROC” operator.
We will perform the second and the third options.
• Perform the pre-analysis processes on the data.
• Select the target variable with the “Set Role” operator.
• Search for the “Multiply” operator and add it to the canvas.
• Add one “Split Validation” or “Cross Validation” operator for each model to be tested.
• In the sub-process of each validation operator, add your model.
• Add the necessary “Apply Model” and “Performance” operators.
• Rename each validation operator to distinguish one from the other.
• Connect the ports and knobs.
• Click “Run”.
• Evaluate the models.
Finding the Right Model Using ROC
• Perform the pre-analysis processes on the data.
• Select the target variable with the “Set Role” operator.
• Search for the “Compare ROC” operator.
• Drag and drop it onto the canvas and connect the ports and knobs.
• Go inside the sub-process of the “Compare ROC” operator.
• Drag and drop the models that you want to compare.
• Connect the ports and knobs and go out of the sub-process.
• Connect the “Compare ROC” operator to the result knob.
• Click “Run”.
ROC Result
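An ROC comparison ranks models by how well their confidence scores separate the two classes. The plain-Python sketch below builds the ROC points and the area under the curve (AUC) for two sets of made-up scores. It assumes 0/1 labels and no tied scores, and it is an illustration rather than RapidMiner's own computation.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold step by step, from the highest score to the lowest.
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]   # made-up confidence scores
model_b = [0.4, 0.9, 0.8, 0.3, 0.7, 0.2]
print(auc(roc_points(labels, model_a)), auc(roc_points(labels, model_b)))
```

A curve that stays closer to the top-left corner gives a larger area, so the model with the higher AUC (here `model_a`) separates the classes better on this data.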
Activity 9:
1. Find the best model for the customer churn data using the “Compare ROC” method.
2. Using the multiple validation method, find the best model for the flight data.
Optimization of Model Parameter
• The purpose of the process is to find the parameter values that optimize the performance of the model.
• It spares the researcher from guessing the parameters: it reports the performance of the model for each setting, allowing the researcher to find the right parameters to apply to the data set.
Optimizing the Model Parameter
• Load the data and apply the pre-analysis processes.
• Select the target attribute using the “Set Role” operator.
• Search for “Optimize Parameters (Grid)”.
• Drag and drop it onto the canvas.
• Go inside the “Optimize Parameters” sub-process and paste the “Cross Validation” process. The cross-validation process should contain the model that we want to optimize.
• The inside of “Cross Validation” is similar to the process we built in “Testing”: go to the sub-process of the validation operator, place the model we will be using in the training area, then place the “Apply Model” and “Performance” operators in the testing area.
• Click “Edit Parameter Settings” in the Parameter tab.
• Highlight the operator we want to optimize.
• Select the parameters we want to optimize.
• Supply the range of values that the machine will use in the optimization.
• Click “OK”.
• Click “Run”.
• Evaluate the result.
• Apply the parameters from the optimized results.
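Grid optimization wraps a validation loop around the model and tries each candidate parameter value in turn. The sketch below does this in plain Python for the number of neighbours k of a small k-nearest-neighbour classifier. The data, the choice of classifier, and the use of leave-one-out cross validation (the simplest form of cross validation) are assumptions made for illustration.

```python
from collections import Counter

# Made-up 1-D data: attribute value and class label (one noisy example at 4.3).
X = [1.0, 2.1, 3.3, 4.0, 4.3, 5.0, 9.0, 10.2, 11.1, 12.5]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def knn_predict(train_X, train_y, x, k):
    """Majority label among the k training examples nearest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_validated_accuracy(X, y, k):
    """Leave-one-out cross validation: predict each example with a model built without it."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k) == y[i]
    return hits / len(X)

def grid_search(grid):
    """Evaluate every candidate value and keep the first one with the best score."""
    scores = {k: cross_validated_accuracy(X, y, k) for k in grid}
    best = max(scores, key=scores.get)
    return best, scores

best_k, scores = grid_search([1, 3, 5])
print(best_k, scores)
```

Because every candidate is scored on examples left out of training, the chosen value is less likely to be one that merely memorizes the data.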
Activity 10
• Perform parameter optimization for the models that we have built.
• Using the “Multiple Validation” method of model selection, apply the parameters and perform the model selection activity.
How to Automate Model Selection and Optimization?
• Goal
– To find the best model, especially when the other models perform just as well.
– To lessen guessing and snowballing in the process.
– To make the process more elegant.
• The process (using the “Multiply” operator):
– Starting with the “parameter optimization” process, create copies of it and change the model inside the validation operator of each copy.
– Change the parameter settings that you want to optimize.
– Use the “Multiply” operator to feed the same source data to each of the “Optimize Parameters” operators.
– Connect the operators and click “Run”.
– Evaluate the results.
Fully Automated Model Selection and Optimization
• The process:
– Starting with the process we made in Automate Model Selection and Optimization using the “Multiply” operator, search for the “Select Subprocess” operator.
– Double-click “Select Subprocess” to go inside the operator.
– Cut the first “Optimize Parameters” operator and paste it in “selection 1”. Cut the second and paste it in “selection 2”, the third in “selection 3”, and so on.
– Connect the processes, then go back to the main process.
– Search for the “Optimize Parameters (Grid)” operator and bring the “Select Subprocess” operator inside it by cut and paste.
– In the “Optimize Parameters” operator, click “Edit Parameter Settings”.
– Highlight “Select Subprocess” in the operator list to populate the “Parameters” view.
– Transfer “Select_Which” to the Selected Parameters using the arrow.
– In the “Grid”, set the min to 1 and the max to the number of models we want to optimize.
– Click “OK” and connect the ports and knobs.
– Click “Run”.
Activity 11:
• Perform automated model selection and optimization on the churn data.
Capstone Activity
• Using the data in the capstone folder, create a Machine Learning project.
