Data Mining with WEKA
Census Income Dataset
(UCI Machine Learning Repository)
Hein and Maneshka
Data Mining
● non-trivial extraction of previously unknown and potentially useful information
from data by means of computers.
● part of machine learning field.
● two types of machine learning:
○ supervised learning: to learn a mapping from inputs to known outputs
■ regression: to find real value(s) as output
■ classification: to map an instance of data to one of several predefined classes
○ unsupervised learning: to discover internal structure in data
■ clustering: to group instances of data together based on shared characteristics
■ association rule mining: to find relationships between attributes in the data
Aim
● Perform data mining using WEKA
○ understanding the dataset
○ preprocessing
○ task: classification
Dataset - Census Income Dataset
● from the UCI machine learning repository
● 32,561 instances
● attributes: 14
○ continuous: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week
○ nominal: workclass, education, marital-status, occupation, relationship, race, sex, native-country
● class attribute: salary - 2 classes (<=50K and >50K)
● missing values:
○ workclass & occupation: 1836 (6%)
○ native-country: 583 (2%)
● imbalanced distribution of values
○ age, capital-gain, capital-loss, native-country
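For readers following along outside the Explorer GUI, here is a minimal sketch of loading the dataset through Weka's Java API. It assumes an ARFF copy of the UCI adult data saved as adult.arff (a hypothetical filename) with salary as the last attribute.

import java.util.Arrays;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadCensus {
    public static void main(String[] args) throws Exception {
        // Load an ARFF copy of the UCI adult data (hypothetical filename)
        Instances data = new DataSource("adult.arff").getDataSet();
        // salary is the last attribute, so use it as the class
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Instances:  " + data.numInstances());        // 32,561
        System.out.println("Attributes: " + (data.numAttributes() - 1)); // 14
        System.out.println("Class counts: " + Arrays.toString(
                data.attributeStats(data.classIndex()).nominalCounts));
    }
}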
Dataset - Census Income Dataset
● imbalanced distributions of attributes
● no strong separation of classes
Blue: <=50K Red: >50K
Preprocessing
● preprocess (filter) the data for effective data mining
○ consider how to deal with missing values, and outliers
○ consider which attributes are relevant
● removed fnlwgt attribute (final weight)
○ with fnlwgt (J48, full dataset): accuracy 86.232%
○ without fnlwgt: accuracy 86.2596%
● removed education-num attribute
○ numeric mirror of the education attribute
● handling missing values
○ ReplaceMissingValues filter (unsupervised - attribute)
● removed duplicates
○ RemoveDuplicates filter (unsupervised - instance)
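A minimal sketch of the same cleanup through the Weka Java API. The attribute indices for fnlwgt and education-num (3 and 5) assume the standard UCI attribute order; check them against your ARFF header.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveDuplicates;

public class BasicCleanup {
    static Instances cleanup(Instances data) throws Exception {
        // Drop fnlwgt and education-num (1-based positions 3 and 5,
        // assuming the standard UCI attribute order)
        Remove remove = new Remove();
        remove.setAttributeIndices("3,5");
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // Fill in missing values (modes for nominal, means for numeric)
        ReplaceMissingValues missing = new ReplaceMissingValues();
        missing.setInputFormat(data);
        data = Filter.useFilter(data, missing);

        // Drop duplicate instances
        RemoveDuplicates dedup = new RemoveDuplicates();
        dedup.setInputFormat(data);
        return Filter.useFilter(data, dedup);
    }
}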
Preprocessing
● grouped education attribute values
○ 16 values → 9 values
○ Pre-school, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th → HS-not-finished
○ unchanged: HS-graduate, Some-college, Bachelors, Prof-School, Masters, Doctorate, Assoc-acdm, Assoc-voc
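One way to script this grouping is Weka's MergeManyValues filter (Weka 3.7+). This is a sketch under assumptions: the education attribute's position after the earlier removals, and the 1-based value indices of the eight pre-HS labels, are taken from the standard UCI header and must be verified against your own ARFF file.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.MergeManyValues;

public class GroupEducation {
    static Instances group(Instances data) throws Exception {
        MergeManyValues merge = new MergeManyValues();
        // education's assumed 1-based position after dropping fnlwgt/education-num
        merge.setAttributeIndex("3");
        merge.setLabel("HS-not-finished");
        // 1-based indices of 11th, 9th, 7th-8th, 12th, 1st-4th, 10th,
        // 5th-6th, Preschool in the standard UCI value order -- verify first
        merge.setMergeValueRange("3,8-10,12,13,15,16");
        merge.setInputFormat(data);
        return Filter.useFilter(data, merge);
    }
}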
Preprocessing - Balancing Class Distribution
● without balancing the class distribution, classifiers perform badly on the minority class
Preprocessing - Balancing Class Distribution
Step 1: Apply the Resample filter
Filters→supervised→instance→Resample
Step 2: Set the biasToUniformClass parameter of
the Resample filter to 1.0 and click ‘Apply’
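The same two steps through the API, as a minimal sketch (the class index must already be set on the data, since this is a supervised filter):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class BalanceClasses {
    static Instances balance(Instances data) throws Exception {
        Resample resample = new Resample();
        // 1.0 = bias the sample towards a uniform class distribution
        resample.setBiasToUniformClass(1.0);
        resample.setInputFormat(data);
        return Filter.useFilter(data, resample);
    }
}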
Preprocessing - Outliers
● Outliers in data can skew results and mislead learning algorithms.
● Outliers can be removed in the following manner.
Preprocessing - Removing Outliers
Step 1: Select the InterquartileRange filter
Filters→unsupervised→attribute→InterquartileRange → Apply
Result: creates two new attributes, Outlier and
ExtremeValue, at attribute positions 14 and 15 respectively
Preprocessing - Removing Outliers
Step 2: a) Select another filter, RemoveWithValues
Filters→unsupervised→instance→RemoveWithValues
b) Click on the filter to open its parameters.
Set attributeIndex to 14 and nominalIndices to 2,
since only instances whose Outlier value is ‘yes’
need to be removed.
Preprocessing - Removing Outliers
Result: removes all outliers from the dataset
Step 3: Remove the Outlier and ExtremeValue attributes from the dataset
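All three steps as one API sketch; the indices 14 and 15 match the slides (the two attributes InterquartileRange appends after our 13 remaining columns):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.InterquartileRange;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class OutlierRemoval {
    static Instances removeOutliers(Instances data) throws Exception {
        // Step 1: append the nominal Outlier and ExtremeValue attributes
        InterquartileRange iqr = new InterquartileRange();
        iqr.setInputFormat(data);
        data = Filter.useFilter(data, iqr);

        // Step 2: drop instances whose Outlier attribute (position 14)
        // has the value 'yes' (nominal value index 2)
        RemoveWithValues rwv = new RemoveWithValues();
        rwv.setAttributeIndex("14");
        rwv.setNominalIndices("2");
        rwv.setInputFormat(data);
        data = Filter.useFilter(data, rwv);

        // Step 3: drop the two helper attributes again
        Remove remove = new Remove();
        remove.setAttributeIndices("14,15");
        remove.setInputFormat(data);
        return Filter.useFilter(data, remove);
    }
}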
Preprocessing - Impact of Removing Outliers
● With outliers in dataset - 85.3302% correctly classified instances
● Without Outliers in dataset - 84.3549% correctly classified instances
Since the percentage of correctly classified instances was higher for the
dataset with outliers, that version was selected!
The reduced accuracy is due to the nature of our dataset (very skewed
distributions in attributes such as capital-gain).
Preprocessing
● Our preprocessing recap
○ removed fnlwgt, edu-num attributes
○ removed duplicate instances
○ filled in missing values
○ grouped some attribute values for education
○ rebalanced class distribution
● size of the resulting dataset: 14,356 instances
Performance of Classifiers
● simplest measure: rate of correct predictions
● confusion matrix:
● Precision: how many positive predictions are correct (TP / (TP + FP))
● Recall: how many actual positives are caught (TP / (TP + FN))
● F-measure: combines precision and recall
(2 × precision × recall / (precision + recall))
Performance of Classifiers
● kappa statistic: chance-corrected accuracy measure (should be greater than 0)
● ROC Area: the larger the area, the better the classifier (should be greater than 0.5)
● Error rates: useful for regression
○ predicting real values
○ predictions are not just right or wrong
○ these reflect the magnitude of errors
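All of these measures come out of Weka's Evaluation class. A minimal sketch, assuming the class index is already set (which class index corresponds to >50K depends on the ARFF header):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class Measure {
    static void report(Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toMatrixString()); // confusion matrix
        System.out.printf("Accuracy: %.4f%%%n", eval.pctCorrect());
        // per-class measures, here for class index 1 (assumed to be >50K)
        System.out.printf("Precision: %.3f  Recall: %.3f  F: %.3f%n",
                eval.precision(1), eval.recall(1), eval.fMeasure(1));
        System.out.printf("Kappa: %.3f  ROC area: %.3f%n",
                eval.kappa(), eval.weightedAreaUnderROC());
    }
}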
Developing Classifiers
● ran algorithms with default parameters
● evaluation: 10-fold cross-validation
● preprocessed dataset
Algorithm    Accuracy
J48          83.6305 %
JRip         82.0075 %
NaiveBayes   76.5464 %
IBk          84.9401 %
Logistic     82.3837 %
● chose J48 and IBk classifiers to develop further
● IBk is the best performing
● J48 is very fast, second best, and very popular
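The comparison above can also be scripted; a sketch, using the same default parameters and 10-fold cross-validation:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CompareClassifiers {
    static void compare(Instances data) throws Exception {
        Classifier[] algorithms = { new J48(), new JRip(), new NaiveBayes(),
                                    new IBk(), new Logistic() };
        for (Classifier c : algorithms) {
            // Fresh evaluation per classifier, same folds via fixed seed
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-12s %.4f %%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}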
J48 Algorithm
● open-source Java implementation of the C4.5 algorithm in the Weka data
mining tool
● creates a decision tree from labelled input data
● the generated trees can be used for classification, which is why C4.5 is often
called a statistical classifier
Pros and Cons of J48
Pros
● Easier to interpret results
● Helps to visualise through a decision tree
Cons
● runtime complexity depends on the depth of the tree (i.e. up to the number of
attributes in the dataset)
● space complexity is large, as values need to be stored in arrays repeatedly
J48 - Using Default Parameters
Number of Leaves : 811
Size of the tree : 1046
J48 - Setting binarySplits parameter to True
J48 -Setting unpruned parameter to True
Number of Leaves : 3479
Size of the tree : 4214
J48 - Setting unpruned and binarySplits
parameters to True
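A sketch of setting the same two parameters through the API rather than the Explorer GUI:

import weka.classifiers.trees.J48;

public class ConfigureJ48 {
    static J48 build() {
        J48 tree = new J48();
        tree.setBinarySplits(true); // binary splits on nominal attributes
        tree.setUnpruned(true);     // grow the full, unpruned tree
        return tree;
    }
}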
J48 - Observations
● we initially thought Education would be the most important factor in
classifying income.
● the J48 tree (without binarization) has CapitalGain as the root node, instead
of Education.
● this means CapitalGain contributes more to income than we initially
thought.
IBk Classifier
● instance-based classifier
● k-nearest neighbors algorithm
● uses the nearest k neighbors to make decisions
● uses distance measures to find the nearest neighbors
○ e.g. chi-square distance, Euclidean distance (used by IBk)
● can use distance weighting
○ to give more influence to nearer neighbors
○ 1/distance or 1 − distance
● can be used for classification and regression
○ classification - the output is the class most common among the neighbors
○ regression - the output is the average of the neighbors' values
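A sketch of configuring IBk this way through the API; the parameter values shown are the ones that performed best for us later (k = 50 with inverse distance weighting):

import weka.classifiers.lazy.IBk;
import weka.core.SelectedTag;

public class ConfigureIBk {
    static IBk build() {
        IBk knn = new IBk();
        knn.setKNN(50); // number of neighbors
        // weight each neighbor's vote by 1/distance
        knn.setDistanceWeighting(
                new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));
        return knn;
    }
}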
Pros and Cons of IBk
Pros
● easy to understand / implement
● performs well given enough representative data
● choice between attributes and distance measures
Cons
● large search space
○ has to search the whole dataset to find the nearest neighbors
● curse of dimensionality
● must choose meaningful distance measure
Improving IBk
ran the kNN algorithm (IBk) with different combinations of parameters

Parameters                              Correct Prediction   ROC Area
kNN (k=1, no weight) [default]          84.9401 %            0.860
kNN (k=5, no weight)                    80.691 %             0.882
kNN (k=5, inverse-distance weight)      85.978 %             0.929
kNN (k=10, no weight)                   81.0323 %            0.887
kNN (k=10, inverse-distance weight)     86.5422 %            0.939
kNN (k=10, similarity weight)           81.6244 %            0.892
kNN (k=50, inverse-distance weight)     86.8905 %            0.948
kNN (k=100, inverse-distance weight)    86.6397 %            0.947
IBk - Observations
● larger k gives better classification
○ up to a certain value of k (50 here)
○ using inverse distance weighting improves accuracy greatly
● limitations
○ we used Euclidean distance (not the best choice for the nominal attributes in the dataset)
Vote Classifier
● we combined our classifiers with the Vote meta-classifier (see the sketch after the table)
○ used average of probabilities
Classifier                            Accuracy    ROC Area
J48                                   85.3998 %   0.879
kNN (k=50, inverse-distance weight)   86.8905 %   0.948
Logistic                              82.3837 %   0.905
Vote                                  87.3084 %   0.947
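A minimal sketch of the ensemble; it reuses the IBk settings from earlier and assumes the other two base classifiers are left at their defaults:

import weka.classifiers.Classifier;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.core.SelectedTag;

public class BuildVote {
    static Vote build() {
        IBk knn = new IBk();
        knn.setKNN(50);
        knn.setDistanceWeighting(
                new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));

        Vote vote = new Vote();
        vote.setClassifiers(new Classifier[] { new J48(), knn, new Logistic() });
        // combine by averaging the class probability estimates
        vote.setCombinationRule(
                new SelectedTag(Vote.AVERAGE_RULE, Vote.TAGS_RULES));
        return vote;
    }
}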
What We Have Done
● developed a classifier for the Census Income Dataset
○ a lot of preprocessing
○ learned in detail about the J48 and kNN classifiers
● final classifier achieves 87.3084 % accuracy and 0.947 ROC area
○ using Vote
Thank You.