Data Mining with WEKA
Census Income Dataset
(UCI Machine Learning Repository)
Hein and Maneshka
Data Mining
● non-trivial extraction of previously unknown and potentially useful information
from data by means of computers.
● part of machine learning field.
● two types of machine learning:
○ supervised learning: to learn a mapping from inputs to known outputs
■ regression: to find real value(s) as output
■ classification: to map an instance of data to one of several predefined classes
○ unsupervised learning: to discover internal structure in data
■ clustering: to group instances of data together based on shared characteristics
■ association rule mining: to find relationships between attributes in the data
Aim
● Perform data mining using WEKA
○ understanding the dataset
○ preprocessing
○ task: classification
Dataset - Census Income Dataset
● from the UCI machine learning repository
● 32,561 instances
● attributes: 14
○ continuous: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week
○ nominal: workclass, education, marital-status, occupation, relationship, race, sex, native-country
● class attribute: salary - 2 classes (<=50K and >50K)
● missing values:
○ workclass & occupation: 1836 (6%)
○ native-country: 583 (2%)
● imbalanced distribution of values
○ age, capital-gain, capital-loss, native-country
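For readers following along outside the Explorer GUI, here is a minimal sketch of loading the dataset through Weka's Java API. It assumes an ARFF copy of the UCI adult data saved as adult.arff (a hypothetical filename) with salary as the last attribute.

import java.util.Arrays;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadCensus {
    public static void main(String[] args) throws Exception {
        // Load an ARFF copy of the UCI adult data (hypothetical filename)
        Instances data = new DataSource("adult.arff").getDataSet();
        // salary is the last attribute, so use it as the class
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Instances:  " + data.numInstances());        // 32,561
        System.out.println("Attributes: " + (data.numAttributes() - 1)); // 14
        System.out.println("Class counts: " + Arrays.toString(
                data.attributeStats(data.classIndex()).nominalCounts));
    }
}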
Dataset - Census Income Dataset
● imbalanced distributions of attributes
● no strong separation of classes
Blue: <=50K Red: >50K
Preprocessing
● preprocess (filter) the data for effective data mining
○ consider how to deal with missing values, and outliers
○ consider which attributes are relevant
● removed fnlwgt attribute (final weight)
○ with fnlwgt (J48, full dataset): accuracy 86.232%
○ without fnlwgt: accuracy 86.2596%
● removed education-num attribute
○ numeric mirror of the education attribute
● handling missing values
○ ReplaceMissingValues filter (unsupervised - attribute)
● removed duplicates
○ RemoveDuplicates filter (unsupervised - instance)
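A minimal sketch of the same cleanup through the Weka Java API. The attribute indices for fnlwgt and education-num (3 and 5) assume the standard UCI attribute order; check them against your ARFF header.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveDuplicates;

public class BasicCleanup {
    static Instances cleanup(Instances data) throws Exception {
        // Drop fnlwgt and education-num (1-based positions 3 and 5,
        // assuming the standard UCI attribute order)
        Remove remove = new Remove();
        remove.setAttributeIndices("3,5");
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // Fill in missing values (modes for nominal, means for numeric)
        ReplaceMissingValues missing = new ReplaceMissingValues();
        missing.setInputFormat(data);
        data = Filter.useFilter(data, missing);

        // Drop duplicate instances
        RemoveDuplicates dedup = new RemoveDuplicates();
        dedup.setInputFormat(data);
        return Filter.useFilter(data, dedup);
    }
}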
Preprocessing
● grouped education attribute values
○ 16 values → 9 values
○ Pre-school, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th → HS-not-finished
○ unchanged: HS-graduate, Some-college, Bachelors, Prof-School, Masters, Doctorate, Assoc-acdm, Assoc-voc
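One way to script this grouping is Weka's MergeManyValues filter (Weka 3.7+). This is a sketch under assumptions: the education attribute's position after the earlier removals, and the 1-based value indices of the eight pre-HS labels, are taken from the standard UCI header and must be verified against your own ARFF file.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.MergeManyValues;

public class GroupEducation {
    static Instances group(Instances data) throws Exception {
        MergeManyValues merge = new MergeManyValues();
        // education's assumed 1-based position after dropping fnlwgt/education-num
        merge.setAttributeIndex("3");
        merge.setLabel("HS-not-finished");
        // 1-based indices of 11th, 9th, 7th-8th, 12th, 1st-4th, 10th,
        // 5th-6th, Preschool in the standard UCI value order -- verify first
        merge.setMergeValueRange("3,8-10,12,13,15,16");
        merge.setInputFormat(data);
        return Filter.useFilter(data, merge);
    }
}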
Preprocessing - Balancing Class Distribution
● without balancing the class distribution, classifiers perform badly on the minority class
Preprocessing - Balancing Class Distribution
Step 1: Apply the Resample filter
Filters→supervised→instance→Resample
Step 2: Set the biasToUniformClass parameter of
the Resample filter to 1.0 and click ‘Apply’
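The same two steps through the API, as a minimal sketch (the class index must already be set on the data, since this is a supervised filter):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class BalanceClasses {
    static Instances balance(Instances data) throws Exception {
        Resample resample = new Resample();
        // 1.0 = bias the sample towards a uniform class distribution
        resample.setBiasToUniformClass(1.0);
        resample.setInputFormat(data);
        return Filter.useFilter(data, resample);
    }
}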
Preprocessing - Outliers
● Outliers in data can skew results and mislead learning algorithms.
● Outliers can be removed in the following manner.
Preprocessing - Removing Outliers
Step 1: Select the InterquartileRange filter
Filters→unsupervised→attribute→InterquartileRange → Apply
Result: creates two new attributes, Outlier and
ExtremeValue, at attribute positions 14 and 15 respectively
Preprocessing - Removing Outliers
Step 2: a) Select another filter, RemoveWithValues
Filters→unsupervised→instance→RemoveWithValues
b) Click on the filter to open its parameters.
Set attributeIndex to 14 and nominalIndices to 2,
since only instances whose Outlier value is ‘yes’
need to be removed.
Preprocessing - Removing Outliers
Result: removes all outliers from the dataset
Step 3: Remove the Outlier and ExtremeValue attributes from the dataset
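All three steps as one API sketch; the indices 14 and 15 match the slides (the two attributes InterquartileRange appends after our 13 remaining columns):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.InterquartileRange;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class OutlierRemoval {
    static Instances removeOutliers(Instances data) throws Exception {
        // Step 1: append the nominal Outlier and ExtremeValue attributes
        InterquartileRange iqr = new InterquartileRange();
        iqr.setInputFormat(data);
        data = Filter.useFilter(data, iqr);

        // Step 2: drop instances whose Outlier attribute (position 14)
        // has the value 'yes' (nominal value index 2)
        RemoveWithValues rwv = new RemoveWithValues();
        rwv.setAttributeIndex("14");
        rwv.setNominalIndices("2");
        rwv.setInputFormat(data);
        data = Filter.useFilter(data, rwv);

        // Step 3: drop the two helper attributes again
        Remove remove = new Remove();
        remove.setAttributeIndices("14,15");
        remove.setInputFormat(data);
        return Filter.useFilter(data, remove);
    }
}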
Preprocessing - Impact of Removing Outliers
● With outliers in dataset - 85.3302% correctly classified instances
● Without Outliers in dataset - 84.3549% correctly classified instances
Since the percentage of correctly classified instances was higher for the
dataset with outliers, that version was selected!
The reduced accuracy is due to the nature of our dataset (very skewed
distributions in attributes such as capital-gain).
Preprocessing
● Our preprocessing recap
○ removed fnlwgt, edu-num attributes
○ removed duplicate instances
○ filled in missing values
○ grouped some attribute values for education
○ rebalanced class distribution
● size of the resulting dataset: 14,356 instances
Performance of Classifiers
● simplest measure: rate of correct predictions
● confusion matrix:
● Precision: how many positive predictions are correct (TP / (TP + FP))
● Recall: how many actual positives are caught (TP / (TP + FN))
● F-measure: combines precision and recall
(2 × precision × recall / (precision + recall))
Performance of Classifiers
● kappa statistic: chance-corrected accuracy measure (should be greater than 0)
● ROC Area: the larger the area, the better the classifier (should be greater than 0.5)
● Error rates: useful for regression
○ predicting real values
○ predictions are not just right or wrong
○ these reflect the magnitude of errors
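All of these measures come out of Weka's Evaluation class. A minimal sketch, assuming the class index is already set (which class index corresponds to >50K depends on the ARFF header):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class Measure {
    static void report(Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toMatrixString()); // confusion matrix
        System.out.printf("Accuracy: %.4f%%%n", eval.pctCorrect());
        // per-class measures, here for class index 1 (assumed to be >50K)
        System.out.printf("Precision: %.3f  Recall: %.3f  F: %.3f%n",
                eval.precision(1), eval.recall(1), eval.fMeasure(1));
        System.out.printf("Kappa: %.3f  ROC area: %.3f%n",
                eval.kappa(), eval.weightedAreaUnderROC());
    }
}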
Developing Classifiers
● ran algorithms with default parameters
● evaluation: 10-fold cross-validation
● preprocessed dataset
Algorithm    Accuracy
J48          83.6305 %
JRip         82.0075 %
NaiveBayes   76.5464 %
IBk          84.9401 %
Logistic     82.3837 %
● chose J48 and IBk classifiers to develop further
● IBk is the best performing
● J48 is very fast, second best, and very popular
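The comparison above can also be scripted; a sketch, using the same default parameters and 10-fold cross-validation:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CompareClassifiers {
    static void compare(Instances data) throws Exception {
        Classifier[] algorithms = { new J48(), new JRip(), new NaiveBayes(),
                                    new IBk(), new Logistic() };
        for (Classifier c : algorithms) {
            // Fresh evaluation per classifier, same folds via fixed seed
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-12s %.4f %%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}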
J48 Algorithm
● open-source Java implementation of the C4.5 algorithm in the Weka data
mining tool
● creates a decision tree from labelled input data
● the generated trees can be used for classification, which is why C4.5 is often
called a statistical classifier
Pros and Cons of J48
Pros
● Easier to interpret results
● Helps to visualise through a decision tree
Cons
● runtime complexity depends on the depth of the tree (i.e. up to the number of
attributes in the dataset)
● space complexity is large, as values need to be stored in arrays repeatedly
J48 - Using Default Parameters
Number of Leaves : 811
Size of the tree : 1046
J48 - Setting binarySplits parameter to True
J48 -Setting unpruned parameter to True
Number of Leaves : 3479
Size of the tree : 4214
J48 - Setting unpruned and binarySplits
parameters to True
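A sketch of setting the same two parameters through the API rather than the Explorer GUI:

import weka.classifiers.trees.J48;

public class ConfigureJ48 {
    static J48 build() {
        J48 tree = new J48();
        tree.setBinarySplits(true); // binary splits on nominal attributes
        tree.setUnpruned(true);     // grow the full, unpruned tree
        return tree;
    }
}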
J48 - Observations
● we initially thought Education would be the most important factor in
classifying income.
● the J48 tree (without binarization) has CapitalGain as the root node, instead
of Education.
● this means CapitalGain contributes more to income than we initially
thought.
IBk Classifier
● instance-based classifier
● k-nearest neighbors algorithm
● uses the nearest k neighbors to make decisions
● uses distance measures to find the nearest neighbors
○ e.g. chi-square distance, Euclidean distance (used by IBk)
● can use distance weighting
○ to give more influence to nearer neighbors
○ 1/distance or 1 − distance
● can be used for classification and regression
○ classification - the output is the class most common among the neighbors
○ regression - the output is the average of the neighbors' values
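A sketch of configuring IBk this way through the API; the parameter values shown are the ones that performed best for us later (k = 50 with inverse distance weighting):

import weka.classifiers.lazy.IBk;
import weka.core.SelectedTag;

public class ConfigureIBk {
    static IBk build() {
        IBk knn = new IBk();
        knn.setKNN(50); // number of neighbors
        // weight each neighbor's vote by 1/distance
        knn.setDistanceWeighting(
                new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));
        return knn;
    }
}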
Pros and Cons of IBk
Pros
● easy to understand / implement
● performs well given enough representative data
● choice between attributes and distance measures
Cons
● large search space
○ has to search the whole dataset to find the nearest neighbors
● curse of dimensionality
● must choose meaningful distance measure
Improving IBk
ran the kNN algorithm (IBk) with different combinations of parameters

Parameters                              Correct Prediction   ROC Area
kNN (k=1, no weight) [default]          84.9401 %            0.860
kNN (k=5, no weight)                    80.691 %             0.882
kNN (k=5, inverse-distance weight)      85.978 %             0.929
kNN (k=10, no weight)                   81.0323 %            0.887
kNN (k=10, inverse-distance weight)     86.5422 %            0.939
kNN (k=10, similarity weight)           81.6244 %            0.892
kNN (k=50, inverse-distance weight)     86.8905 %            0.948
kNN (k=100, inverse-distance weight)    86.6397 %            0.947
IBk - Observations
● larger k gives better classification
○ up to a certain value of k (50 here)
○ using inverse distance weighting improves accuracy greatly
● limitations
○ we used Euclidean distance (not the best choice for the nominal attributes in the dataset)
Vote Classifier
● we combined our classifiers with the Vote meta-classifier (see the sketch after the table)
○ used average of probabilities
Classifier                            Accuracy    ROC Area
J48                                   85.3998 %   0.879
kNN (k=50, inverse-distance weight)   86.8905 %   0.948
Logistic                              82.3837 %   0.905
Vote                                  87.3084 %   0.947
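A minimal sketch of the ensemble; it reuses the IBk settings from earlier and assumes the other two base classifiers are left at their defaults:

import weka.classifiers.Classifier;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.core.SelectedTag;

public class BuildVote {
    static Vote build() {
        IBk knn = new IBk();
        knn.setKNN(50);
        knn.setDistanceWeighting(
                new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));

        Vote vote = new Vote();
        vote.setClassifiers(new Classifier[] { new J48(), knn, new Logistic() });
        // combine by averaging the class probability estimates
        vote.setCombinationRule(
                new SelectedTag(Vote.AVERAGE_RULE, Vote.TAGS_RULES));
        return vote;
    }
}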
What We Have Done
● developed a classifier for the Census Income Dataset
○ a lot of preprocessing
○ learned in detail about the J48 and kNN classifiers
● final classifier achieves 87.3084 % accuracy and 0.947 ROC area
○ using Vote
Thank You.